
BIDA311: Data Mining

Ch. 2: Data, Measurements, and Preprocessing (Lecture 3)
Ch. 4+5: Pattern Mining

by Dr. Jamal Al Qundus


Data Cleaning (Alternative): Attribute Creation
(Feature Generation)
• Create new attributes (features) that can capture the important information
in a data set more effectively than the original ones
• Three general methodologies
• Attribute extraction
• Domain-specific
• Mapping data to new space (see: data reduction)
• E.g., Fourier transformation, wavelet transformation, manifold approaches (not
covered)
• Attribute construction
• Combining features (see: discriminative frequent patterns in Chapter 7)
• Data discretization

Attribute extraction: Clustering
• Partition data set into clusters based on similarity, and
store cluster representation (e.g., centroid) only
• Can be very effective if data is clustered but not if data is
“smeared”
• Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
• There are many choices of clustering definitions and
clustering algorithms
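The centroid-as-representation idea above can be sketched in a few lines of Python. This is an illustrative toy (a naive 1-D k-means with invented data), not an algorithm from the course:

```python
# Illustrative toy, not the course's algorithm: cluster the data, then store
# each point as its cluster centroid (a naive 1-D k-means).

def kmeans_1d(points, k=2, iters=20):
    # Start from the k smallest distinct values (a simple, deterministic init).
    centroids = sorted(set(points))[:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: abs(p - centroids[j]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

points = [1.0, 1.2, 0.9, 10.0, 10.3, 9.8]
centroids = kmeans_1d(points)
# Store only each point's cluster centroid instead of the raw value.
compressed = [min(centroids, key=lambda c: abs(p - c)) for p in points]
```

The six raw values collapse to two stored representatives, which works well here precisely because the data is clustered rather than "smeared".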

Attribute extraction: Sampling
• Sampling: obtaining a small sample s to represent the whole
data set N
• Key principle: Choose a representative subset of the data
• Simple random sampling may have very poor
performance
• Develop adaptive sampling methods, e.g., stratified
sampling

Types of Sampling
• Simple random sampling
• There is an equal probability of selecting any particular item
• Sampling without replacement
• Once an object is selected, it is removed from the population
• Sampling with replacement
• A selected object is not removed from the population
• Stratified sampling:
• Partition the data set, and draw samples from each partition
(proportionally, i.e., approximately the same percentage of the
data)
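The three schemes can be sketched with Python's standard library. The data set (80 objects in stratum A, 20 in B) and the sample size of 10 are invented for the example:

```python
import random
from collections import defaultdict

random.seed(42)  # illustrative population: 80 objects in stratum A, 20 in B
data = [("A", i) for i in range(80)] + [("B", i) for i in range(20)]

# Simple random sampling without replacement: each object picked at most once.
srswor = random.sample(data, 10)

# Simple random sampling with replacement: an object may be picked again.
srswr = [random.choice(data) for _ in range(10)]

# Stratified sampling: partition by label, draw proportionally per stratum.
strata = defaultdict(list)
for item in data:
    strata[item[0]].append(item)
stratified = []
for label, items in strata.items():
    share = round(len(items) / len(data) * 10)  # keep each stratum's share
    stratified += random.sample(items, share)
```

Stratification guarantees 8 A-objects and 2 B-objects in every sample, whereas a simple random sample of 10 may, by chance, contain no B-objects at all.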

Sampling: With or without Replacement

W O R
SRS le random
im p h o ut
( s e wit
p l
sam ment)
p la c e
re

SRSW
R

Raw Data
21
Outline
• Mining Frequent Patterns
• Association and Correlations
• Basic Concepts and Methods
• Frequent Itemset Mining Methods
• Which Patterns Are Interesting?—Pattern Evaluation Methods

• Goal: Understanding the concept of mining frequent patterns
What Is Frequent Pattern Analysis?
• Frequent pattern: a pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set
• First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of
frequent itemsets and association rule mining
• Motivation: Finding inherent regularities in data
• What products were often purchased together?— milk and chocolate?!
• What are the subsequent purchases after buying a PC?
• What kinds of DNA are sensitive to this new drug?
• Can we automatically classify web documents?
• Applications
• Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.
Why Is Freq. Pattern Mining Important?

• Foundation for many essential data mining tasks
• Association, correlation, and causality analysis
• Sequential, structural (e.g., sub-graph) patterns
• Pattern analysis in multimedia, time-series, and stream data
• Classification: discriminative frequent pattern analysis
• Cluster analysis: frequent pattern-based clustering
• Data warehousing: iceberg cube
• Semantic data compression
• Broad applications

An iceberg cube keeps only the cells whose aggregate values are above a given threshold.
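The iceberg idea can be illustrated with a toy aggregate query; the sales facts and the threshold of 100 are invented for the example:

```python
from collections import Counter

# Invented toy data: (region, sales amount) facts.
sales = [("east", 60), ("east", 70), ("west", 30), ("north", 120)]

totals = Counter()
for region, amount in sales:
    totals[region] += amount

# "Iceberg" condition: keep only cells whose aggregate clears the threshold.
iceberg = {region: t for region, t in totals.items() if t >= 100}
# → {'east': 130, 'north': 120}
```

Only the "tip of the iceberg" (east and north) is materialized; the west cell, with total 30, is dropped.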
Basic Concepts: Frequent Patterns

Tid  Items bought
10   Soda, Nuts, Chocolate
20   Soda, Coffee, Chocolate
30   Soda, Chocolate, Eggs
40   Nuts, Eggs, Milk
50   Nuts, Coffee, Chocolate, Eggs, Milk

• itemset: a set of one or more items
• k-itemset: X = {x1, …, xk}
• (absolute) support, or support count, of X: frequency or number of occurrences of the itemset X
• (relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
• An itemset X is frequent if X's support is no less than a minsup threshold

[Figure: Venn diagram of customers who buy soda, buy chocolate, or buy both]
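A minimal sketch of the support definitions, using the five-transaction table above:

```python
# The five transactions from the table above, as sets.
transactions = [
    {"soda", "nuts", "chocolate"},
    {"soda", "coffee", "chocolate"},
    {"soda", "chocolate", "eggs"},
    {"nuts", "eggs", "milk"},
    {"nuts", "coffee", "chocolate", "eggs", "milk"},
]

def support_count(itemset):
    # Absolute support: number of transactions that contain the itemset.
    return sum(1 for t in transactions if itemset <= t)

assert support_count({"chocolate"}) == 4
assert support_count({"soda", "chocolate"}) == 3
# Relative support of {soda, chocolate}: 3/5 = 0.6, frequent at minsup = 50%.
```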
Basic Concepts: Association Rules

Tid  Items bought
10   Soda, Nuts, Chocolate
20   Soda, Coffee, Chocolate
30   Soda, Chocolate, Eggs
40   Nuts, Eggs, Milk
50   Nuts, Coffee, Chocolate, Eggs, Milk

• Find all the rules X → Y with minimum support and confidence
• support, s: probability that a transaction contains X ∪ Y
• confidence, c: conditional probability that a transaction containing X also contains Y

Let minsup = 50%, minconf = 50%
Frequent patterns: Soda:3, Nuts:3, Chocolate:4, Eggs:3, {Soda, Chocolate}:3
Association rules (many more!):
• Soda → Chocolate (support 60%, confidence 100%)
• Chocolate → Soda (support 60%, confidence 75%)

Note that both rules share the support 3/5 = 60%, since the support of a rule X → Y is the support of X ∪ Y; they differ only in confidence (3/3 vs. 3/4).
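The quoted metrics can be checked directly from the definitions; a sketch using the same five transactions (the helper name is illustrative):

```python
# Same five transactions as in the table above.
transactions = [
    {"soda", "nuts", "chocolate"},
    {"soda", "coffee", "chocolate"},
    {"soda", "chocolate", "eggs"},
    {"nuts", "eggs", "milk"},
    {"nuts", "coffee", "chocolate", "eggs", "milk"},
]

def rule_metrics(x, y):
    # support(X → Y) = P(X ∪ Y); confidence = support(X ∪ Y) / support(X).
    both = sum(1 for t in transactions if (x | y) <= t)
    x_count = sum(1 for t in transactions if x <= t)
    return both / len(transactions), both / x_count

assert rule_metrics({"soda"}, {"chocolate"}) == (0.6, 1.0)   # 60%, 100%
assert rule_metrics({"chocolate"}, {"soda"}) == (0.6, 0.75)  # 60%, 75%
```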
Closed Patterns and Max-Patterns
• A long pattern contains a combinatorial number of sub-patterns; e.g., {a1, …, a100} contains 2^100 - 1 sub-patterns!
• Solution: mine closed patterns and max-patterns instead
• An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier et al. @ ICDT'99)
• An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD'98)
• A closed pattern is a lossless compression of the frequent patterns
• Reduces the number of patterns and rules
Max pattern? Closed pattern?

[Figure: two decision flowcharts. Max-pattern test: is the pattern frequent? Is it part of a super-pattern? Is that super-pattern frequent? Closed-pattern test: is the pattern frequent? Is it part of a super-pattern? Is that super-pattern less frequent?]

Left example: pattern xy is a frequent pattern and there is no super-pattern xyz.
Right example: pattern xy is a frequent pattern, and the only super-pattern xyz is less frequent than xy.
Exercise: given the support counts below with minsup = 50% (= 3 transactions), which itemsets are closed patterns, and which are max-patterns?

{a}=4  {b}=2  {c}=5  {d}=4  {e}=6
{a,b}=1  {a,c}=3  {a,d}=3  {a,e}=4  {b,c}=2  {b,d}=0  {b,e}=2  {c,d}=3  {c,e}=5  {d,e}=4
{a,b,c}=1  {a,b,d}=0  {a,b,e}=1  {a,c,d}=2  {a,c,e}=3  {a,d,e}=3  {b,c,d}=0  {b,c,e}=2  {c,d,e}=3
{a,b,c,d}=0  {a,b,c,e}=1  {b,c,d,e}=0
Solution (same support table, minsup = 3):
Closed patterns: {e}=6, {a,e}=4, {c,e}=5, {d,e}=4, {a,c,e}=3, {a,d,e}=3, {c,d,e}=3
Max-patterns: {a,c,e}=3, {a,d,e}=3, {c,d,e}=3
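A sketch that derives the closed and max patterns mechanically from the support table and the two definitions (helper names are illustrative):

```python
# Support table from the exercise, with minsup = 3.
support = {
    frozenset("a"): 4, frozenset("b"): 2, frozenset("c"): 5,
    frozenset("d"): 4, frozenset("e"): 6,
    frozenset("ab"): 1, frozenset("ac"): 3, frozenset("ad"): 3,
    frozenset("ae"): 4, frozenset("bc"): 2, frozenset("bd"): 0,
    frozenset("be"): 2, frozenset("cd"): 3, frozenset("ce"): 5,
    frozenset("de"): 4,
    frozenset("abc"): 1, frozenset("abd"): 0, frozenset("abe"): 1,
    frozenset("acd"): 2, frozenset("ace"): 3, frozenset("ade"): 3,
    frozenset("bcd"): 0, frozenset("bce"): 2, frozenset("cde"): 3,
    frozenset("abcd"): 0, frozenset("abce"): 1, frozenset("bcde"): 0,
}
minsup = 3
frequent = {x for x, s in support.items() if s >= minsup}

def supersets(x):
    # Proper super-patterns of x that appear in the table.
    return [y for y in support if x < y]

# Closed: frequent, and no super-pattern has the same support.
closed = {x for x in frequent
          if all(support[y] < support[x] for y in supersets(x))}
# Max: frequent, and no super-pattern is frequent.
maximal = {x for x in frequent
           if not any(y in frequent for y in supersets(x))}

assert closed == {frozenset(s) for s in ["e", "ae", "ce", "de", "ace", "ade", "cde"]}
assert maximal == {frozenset(s) for s in ["ace", "ade", "cde"]}
```

The check only consults itemsets listed in the table; combinations the slide omits (e.g., {a,c,d,e}) have support at most that of a listed subset such as {a,c,d} = 2, so they cannot change either answer.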
