ICS 2408 - Lecture 5 - Association

Mining Frequent Patterns, Association and

Correlations

 Basic concepts
 Efficient and scalable frequent itemset mining methods
 Constraint-based association mining



What Is Association Mining?
 Association rule mining:
 Finding frequent patterns, associations, correlations, or causal
structures among sets of items or objects in transaction databases,
relational databases, and other information repositories.
 Frequent pattern: pattern (set of items, sequence, etc.) that occurs
frequently in a database

 Motivation: finding regularities in data


 What products were often purchased together? — Beer and
diapers?!
 What are the subsequent purchases after buying a PC?
 What kinds of DNA are sensitive to this new drug?
 Can we automatically classify web documents?



Why Is Association Mining Important?
 Foundation for many essential data mining tasks
 Association, correlation, causality
 Sequential patterns, temporal or cyclic association, partial
periodicity, spatial and multimedia association
 Associative classification, cluster analysis, iceberg cube, fascicles
(semantic data compression)
 Discloses an intrinsic and important property of data sets
 Broad applications
 Basket data analysis, cross-marketing, catalog design, sale
campaign analysis
 Web log (click stream) analysis, DNA sequence analysis, etc.



Market Basket Analysis (MBA)
 Retail – each customer purchases a different set of products, in
different quantities, at different times
 MBA uses this information to:
 Identify who customers are (not by name)

 Understand why they make certain purchases

 Gain insight about its merchandise (products):

 Fast and slow movers

 Products which are purchased together

 Products which might benefit from promotion

 Take action:

 Store layouts

 Which products to put on specials, promote, coupons…

 Combining all of this with a customer loyalty card makes it even
more valuable
Transactional Data

Market basket example:


Basket1: {bread, cheese, milk}
Basket2: {apple, eggs, salt, yogurt}
…
Basketn: {biscuit, eggs, milk}
Definitions:
 An item: an article in a basket, or an attribute-value pair
 I: the set of all items sold in the store
 A transaction: items purchased in a basket; it may have TID
(transaction ID)
 A transactional dataset: A set of transactions
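
In code, such a transactional dataset is often just a collection of item sets keyed by TID; a minimal Python sketch (the items are illustrative):

# A toy transactional dataset: each TID maps to the set of items in that basket.
transactions = {
    "T1": {"bread", "cheese", "milk"},
    "T2": {"apple", "eggs", "salt", "yogurt"},
    "T3": {"biscuit", "eggs", "milk"},
}
# I: the set of all items appearing in the data.
I = set().union(*transactions.values())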
Itemsets and Association Rules
 An itemset is a set of items.
 E.g., {milk, bread, cereal} is an itemset.
 A k-itemset is an itemset with k items.
 Given a dataset D, an itemset X has a (frequency) count in
D
 An association rule is about a relationship between two
disjoint itemsets X and Y, written X → Y
 It presents the pattern: when X occurs, Y also occurs



Use of Association Rules
 Association rules do not represent any sort of causality or
correlation between the two itemsets.
 X → Y does not mean X causes Y, so no causality

 X → Y can be different from Y → X, unlike correlation

 Association rules assist in:


 Marketing

 Targeted advertising

 Floor planning

 Inventory control

 Churn management, etc.



Other Applications
 Market Basket Analysis: given a database of customer
transactions, where each transaction is a set of items the goal is to
find groups of items which are frequently purchased together.
 Telecommunication (each customer is a transaction containing the
set of phone calls)
 Credit Cards/ Banking Services (each card/account is a
transaction containing the set of customer’s payments)
 Medical Treatments (each patient is represented as a transaction
containing the ordered set of diseases)
 Fraud detection: Unusual combinations of insurance claims can be
a warning of fraud



Association Rule: Basic Concepts
 Given: (1) database of transactions,
(2) each transaction is a list of items (purchased by a
customer in a visit)
 Find: all rules that correlate the presence of one set of items with
that of another set of items
 E.g. 98% of people who purchase tires and auto accessories also
get automotive services done


 Applications
 Maintenance Agreement (What the store should do to boost
Maintenance Agreement sales)
 Home Electronics (What other products should the store stock
up?)
 Attached mailing in direct marketing



Rule Measures: Support and Confidence
 Itemset X = {x1, …, xk}
 Find all the rules X → Y with minimum support and confidence
 support, s: probability that a transaction contains X ∪ Y
 confidence, c: conditional probability that a transaction
having X also contains Y

Transaction-id   Items bought
10               A, B, D
20               A, C, D
30               A, D, E
40               B, E, F
50               B, C, D, E, F

(Figure: Venn diagram of customers who buy beer, customers who buy diapers,
and customers who buy both.)

Let supmin = 50%, confmin = 50%
Frequent Pattern: {A:3, B:3, D:4, E:3, AD:3}
Association rules:
A → D (60%, 100%)
D → A (60%, 75%)



Support and Confidence
 Support count: The support count of an itemset X, denoted
by X.count, in a data set T is the number of transactions in T
that contain X. Assume T has n transactions.
 Then,
support = (X ∪ Y).count / n

confidence = (X ∪ Y).count / X.count
 Interesting association rules are (for now) those whose S and
C are greater than minSup and minConf
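
A minimal Python sketch of these two formulas (the helper names and the toy transactions are illustrative, not from the lecture; the data matches the A → C example that appears a few slides later):

def count(itemset, transactions):
    # Number of transactions in T that contain every item of itemset (X.count).
    return sum(1 for t in transactions if itemset <= t)

def support(X, Y, transactions):
    # support(X -> Y) = (X ∪ Y).count / n
    return count(X | Y, transactions) / len(transactions)

def confidence(X, Y, transactions):
    # confidence(X -> Y) = (X ∪ Y).count / X.count
    return count(X | Y, transactions) / count(X, transactions)

# Illustrative data (four transactions).
T = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(support({"A"}, {"C"}, T))     # 0.5
print(confidence({"A"}, {"C"}, T))  # 0.666...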
Support (utility)
 Usefulness of a rule can be measured with a minimum
support threshold
 This parameter lets us measure how many events have such
itemsets that match both sides of the implication in the
association rule
 Rules for events whose itemsets do not match both sides
sufficiently often (defined by a threshold value) can be
excluded



Confidence (certainty)
 Certainty of a rule can be measured with a threshold for
confidence
 This parameter lets us measure how often an event’s itemset
that matches the left side of the implication in the association
rule also matches the right side
 Rules for events whose itemsets match the left side but do not
match the right side sufficiently often (defined by a threshold
value) can be excluded



Example

Data set D:
TID    Itemset
T100   1, 3, 4
T200   2, 3, 5
T300   1, 2, 3, 5
T400   2, 5

Count, Support, Confidence (|D| = 4):
Count(1 → 3) = 2
Support(1 → 3) = 2/4 = 0.5
Support(3 → 2) = 2/4 = 0.5
Confidence(3 → 2) = 2/3 ≈ 0.67



Mining Association Rules: Example

Min. support 50%, Min. confidence 50%

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F

Frequent pattern   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%

For rule A → C:
support = support({A} ∪ {C}) = (2/4) × 100% = 50%
confidence = support({A} ∪ {C}) / support({A}) = ((2/4) / (3/4)) × 100% = 66.6%



Mining Association Rules: What We Need to Know
 Goal: Rules with high support/confidence
 How to compute?
 Support: Find sets of items that occur frequently

 Confidence: Find frequency of subsets of supported

itemsets
 If we have all frequently occurring sets of items (frequent
itemsets), we can compute support and confidence!



Mining Frequent Itemsets: the Key Step

 Find the frequent itemsets: the sets of items that have minimum support
 A subset of a frequent itemset must also be a frequent itemset
 i.e., if {A, B} is a frequent itemset, both {A} and {B} should
be frequent itemsets
 Iteratively find frequent itemsets with cardinality from 1 to k (k-
itemset)
 Use the frequent itemsets to generate association rules.



Scalable Methods for Mining Frequent Patterns

 The downward closure property of frequent patterns
 The Apriori principle: any subset of a frequent itemset must be frequent
 If {beer, diaper, nuts} is frequent, so is {beer, diaper}, i.e., every
transaction having {beer, diaper, nuts} also contains {beer, diaper}
 Scalable mining methods: Three major approaches
 Apriori algorithm

 Frequent pattern growth

 Vertical data format approach



Apriori: A Candidate Generation-and-test Approach

 Apriori pruning principle: If there is any itemset which is infrequent,
its superset should not be generated/tested!
 Method:
 Initially, scan DB once to get frequent 1-itemset
 Generate length (k+1) candidate itemsets from length k frequent
itemsets
 Test the candidates against DB
 Terminate when no frequent or candidate set can be generated



Apriori: A Candidate Generation-and-test Approach
 Join Step: Ck is generated by joining Lk-1with itself
 Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of
a frequent k-itemset
 Pseudo-code:
Ck: Candidate itemset of size k
Lk : frequent itemset of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
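
A runnable Python sketch of this loop (a simplified rendering: the names apriori and min_count are mine, and candidates come from a pairwise join followed by the prune step):

from itertools import combinations

def apriori(transactions, min_count):
    # Returns {frozenset(itemset): support count} for every frequent itemset.
    items = {i for t in transactions for i in t}
    Lk = {frozenset([i]): c for i in items
          if (c := sum(1 for t in transactions if i in t)) >= min_count}
    frequent = dict(Lk)
    k = 1
    while Lk:
        # Join step: candidates of size k+1 from pairs of frequent k-itemsets.
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k))}
        # One scan of the database to count the surviving candidates.
        Lk = {c: n for c in candidates
              if (n := sum(1 for t in transactions if c <= t)) >= min_count}
        frequent.update(Lk)
        k += 1
    return frequent

TDB = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(TDB, 2))   # includes frozenset({'B', 'C', 'E'}): 2

With the TDB and minimum support count of 2 used on the next slide, this reproduces the L1, L2 and L3 shown there.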
The Apriori Algorithm—An Example
Supmin = 2

Database TDB:
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan → C1 (itemset : sup): {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}:2, {B}:3, {C}:3, {E}:3

C2 (generated from L1): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}
2nd scan → C2 counts: {A, B}:1, {A, C}:2, {A, E}:1, {B, C}:2, {B, E}:3, {C, E}:2
L2: {A, C}:2, {B, C}:2, {B, E}:3, {C, E}:2

C3 (generated from L2): {B, C, E}
3rd scan → L3: {B, C, E}:2
Important Details of Apriori
 How to generate candidates?
 Step 1: self-joining Lk
 Step 2: pruning
 How to count supports of candidates?
 Example of candidate generation:
 L3={abc, abd, acd, ace, bcd}
 Self-joining: L3*L3
 abcd from abc and abd
 acde from acd and ace
 Pruning:
 acde is removed because ade is not in L3
 C4={abcd}
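
A small sketch of these join and prune steps on exactly this L3, assuming each itemset is stored as a sorted tuple so that the join condition (first k-1 items agree) is easy to state:

from itertools import combinations

def self_join(Lk):
    # Join step: merge two k-itemsets whose first k-1 items agree.
    out = set()
    for a in Lk:
        for b in Lk:
            if a < b and a[:-1] == b[:-1]:         # e.g. abc + abd -> abcd
                out.add(a[:-1] + (a[-1], b[-1]))
    return out

def prune(candidates, Lk):
    # Prune step: drop a candidate if any of its k-subsets is not frequent.
    Lk = set(Lk)
    return {c for c in candidates
            if all(s in Lk for s in combinations(c, len(c) - 1))}

L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
C4 = self_join(L3)     # {("a","b","c","d"), ("a","c","d","e")}
print(prune(C4, L3))   # {("a","b","c","d")}  (acde pruned: ade is not in L3)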
Step 2: Generating rules from frequent itemsets

 Frequent itemsets → association rules


 One more step is needed to generate association rules
 For each frequent itemset X,
For each proper nonempty subset A of X,
 Let B = X - A

 A → B is an association rule if

 Confidence(A → B) ≥ minconf,
support(A → B) = support(A ∪ B) = support(X)
confidence(A → B) = support(A ∪ B) / support(A)
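
A minimal sketch of this rule-generation step, assuming the frequent itemsets and their support counts (for instance, the output of the Apriori sketch earlier) are held in a dict keyed by frozenset:

from itertools import combinations

def gen_rules(freq_counts, n, minconf):
    # freq_counts: {frozenset(itemset): support count}; n: number of transactions.
    rules = []
    for X, x_count in freq_counts.items():
        if len(X) < 2:
            continue
        for r in range(1, len(X)):                  # proper non-empty subsets A
            for A in map(frozenset, combinations(X, r)):
                B = X - A
                conf = x_count / freq_counts[A]     # support(X) / support(A)
                if conf >= minconf:
                    rules.append((set(A), set(B), x_count / n, conf))
    return rules

# Frequent itemsets (with counts) of the 4-transaction TDB used earlier.
freq = {frozenset("A"): 2, frozenset("B"): 3, frozenset("C"): 3, frozenset("E"): 3,
        frozenset("AC"): 2, frozenset("BC"): 2, frozenset("BE"): 3, frozenset("CE"): 2,
        frozenset("BCE"): 2}   # frozenset("BCE") == frozenset({"B", "C", "E"})
for A, B, sup, conf in gen_rules(freq, 4, 0.5):
    print(A, "->", B, f"support={sup:.0%} confidence={conf:.0%}")

For the itemset {B, C, E} this prints the six rules listed on the next slide, with the same confidences.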



Generating rules: an example
 Given {B,C,E} is frequent, with sup=50% and minconf = 50% then
 Proper nonempty subsets: {B,C}, {B,E}, {C,E}, {B}, {C}, {E}, with sup=50%,
75%, 50%, 75%, 75%, 75% respectively
 These generate these association rules:
 B,C  E, confidence=100%
 B,E  C, confidence=67%
 C,E  B, confidence=100%
 B  C,E, confidence=67%
 C  B,E, confidence=67%
 E  B,C, confidence=67%
 All rules have support = 50%



Apriori Advantages and Disadvantages

 Advantages:
 Uses large itemset property.

 Easily parallelized

 Easy to implement.

 Disadvantages:
 Assumes transaction database is memory resident.

 Requires up to m database scans.



Methods to Improve Apriori’s Efficiency
 Hash-based itemset counting: A k-itemset whose corresponding
hashing bucket count is below the threshold cannot be frequent.
 Transaction reduction: A transaction that does not contain any
frequent k-itemset is useless in subsequent scans.
 Partitioning: Any itemset that is potentially frequent in DB must be
frequent in at least one of the partitions of DB.
 Sampling: mining on a subset of given data, lower support
threshold + a method to determine the completeness.
 Dynamic itemset counting: add new candidate itemsets only
when all of their subsets are estimated to be frequent.
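
As one concrete illustration, transaction reduction is just a filter applied between level-wise passes; a sketch (the function name is mine, Lk is the set of frequent k-itemsets from the pass just completed):

def reduce_transactions(transactions, Lk):
    # A transaction with no frequent k-itemset cannot contain a frequent
    # (k+1)-itemset either, so later scans can safely skip it.
    return [t for t in transactions if any(set(X) <= t for X in Lk)]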



Domains where Apriori is used

Adverse drug reaction detection


It is used to perform association analysis on the characteristics of patients,
the drugs they are taking, their primary diagnosis, co-morbid conditions, and
the ADRs or adverse events (AE) they experience.
Association rules are produced that indicate what combinations of
medications and patient characteristics lead to ADRs.
Oracle Bone Inscription Explication
One of the oldest writing systems in the world, but of the roughly 6000 words
found so far only about 1500 can be explicated explicitly, so it remains an
open problem in this field.
The OBI data extracted from the OBI corpus are preprocessed and used as input
to the Apriori algorithm to get the frequent itemsets. Combined with an
interestingness measure, strong association rules between OBI words are
produced.
Challenges of Frequent Pattern Mining
 If we use candidate generation:
 Need to generate a huge number of candidate sets.
 Need to repeatedly scan the database and check a large set of
candidates by pattern matching.
 Tedious workload of support counting for candidates.
 Can we avoid that?
 FP-Trees (Frequent Pattern Trees)
 FP-Growth: allows frequent itemset discovery without candidate itemset
generation. Two-step approach:
 Step 1: Build a compact data structure called the FP-tree, using 2
passes over the data-set.
 Step 2: Extract frequent itemsets directly from the FP-tree.



Step 1: FP-Tree Construction
 FP-Tree is constructed using 2 passes over the data-set:

Pass 1:
– Scan data and find support for each item.
– Discard infrequent items.
– Sort frequent items in decreasing order based on their support.
Use this order when building the FP-Tree, so common prefixes can
be shared.
Step 1: FP-Tree Construction
Pass 2:
Nodes correspond to items and have a counter.
1. FP-Growth reads 1 transaction at a time and maps it to a path.
2. A fixed order is used, so paths can overlap when transactions share items
(when they have the same prefix).
– In this case, counters are incremented.
3. Pointers are maintained between nodes containing the same item, creating
singly linked lists (dotted lines).
– The more paths that overlap, the higher the compression. The FP-tree may
fit in memory.
4. Frequent itemsets are extracted from the FP-Tree.
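
A minimal Python sketch of these two passes (class and field names are mine; it builds the tree and the per-item node-links, but not the mining step):

from collections import defaultdict

class FPNode:
    # One FP-tree node: an item, a counter, its children, and a node-link.
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 1
        self.children = {}    # item -> FPNode
        self.link = None      # next node in the tree holding the same item

def build_fp_tree(transactions, min_count):
    # Pass 1: count supports, discard infrequent items, fix a global order.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    freq = {i: c for i, c in counts.items() if c >= min_count}

    root, header = FPNode(None, None), {}   # header table: item -> first node
    # Pass 2: insert each transaction as a path, sharing common prefixes.
    for t in transactions:
        path = sorted((i for i in t if i in freq),
                      key=lambda i: (-freq[i], i))   # decreasing support order
        node = root
        for item in path:
            if item in node.children:
                node.children[item].count += 1       # shared prefix: bump counter
            else:
                child = FPNode(item, node)
                node.children[item] = child
                # Thread the new node onto this item's node-link list.
                child.link, header[item] = header.get(item), child
            node = node.children[item]
    return root, header

TDB = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
root, header = build_fp_tree(TDB, 2)

Sorting each transaction by the globally fixed, decreasing-support order before insertion is what lets transactions with a common prefix share a path.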
Step 1: FP-Tree Construction (Example)
FP-Tree size
 The FP-Tree usually has a smaller size than the uncompressed data -
typically many transactions share items (and hence prefixes).
– Best case scenario: all transactions contain the same set of items.

• 1 path in the FP-tree


– Worst case scenario: every transaction has a unique set of items
(no items in common)
• Size of the FP-tree is at least as large as the original data.
• Storage requirements for the FP-tree are higher - need to store the pointers
between the nodes and the counters.

 The size of the FP-tree depends on how the items are ordered
 Ordering by decreasing support is typically used but it does not
always lead to the smallest tree (it's a heuristic).
Step 2: Frequent Itemset Generation
 FP-Growth extracts frequent itemsets from the FP-tree.
 Bottom-up algorithm - from the leaves towards the root
 Divide and conquer: first look for frequent itemsets ending in e, then
de, etc., then d, then cd, etc.
 First, extract prefix path sub-trees ending in an item(set). (hint: use
the linked lists)
Prefix path sub-trees (Example)
Step 2: Frequent Itemset Generation
 Each prefix path sub-tree is processed
recursively to extract the frequent itemsets.
Solutions are then merged.
 E.g. the prefix path sub-tree for e will be used to extract frequent
itemsets ending in e, then in de, ce, be and ae, then in cde, bde, ade, etc.
 Divide and conquer approach
Conditional FP-Tree
 The FP-Tree that would be built if we only consider transactions containing a
particular itemset (and then removing that itemset from all transactions).
 Example: FP-Tree conditional on e.
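
Reading that definition directly, the data behind a conditional FP-tree can be obtained by filtering and trimming the transactions and rebuilding the tree; a small sketch, using the illustrative TDB from earlier rather than the dataset in the lecture's figure:

def conditional_transactions(transactions, item):
    # Transactions that contain the item, with the item removed: the data the
    # FP-tree conditional on that item would be built from.
    return [t - {item} for t in transactions if item in t]

TDB = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
cond_E = conditional_transactions(TDB, "E")   # [{B, C}, {A, B, C}, {B}]
# With min count 2, only B (count 3) and C (count 2) survive in cond_E, so
# {B, E}, {C, E} and {B, C, E} come out frequent, matching the Apriori result
# for this dataset.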
Example
Let minSup = 2 and extract all frequent itemsets containing e.
1. Obtain the prefix path sub-tree for e:
Example
2. Check if e is a frequent item by adding the counts along the linked
list (dotted line). If so, extract it.
Yes, count = 3, so {e} is extracted as a frequent itemset.
3. As e is frequent, find frequent itemsets ending in e (i.e. de, ce, be
and ae).
4. Use the conditional FP-tree for e to find frequent itemsets ending in
de, ce and ae
Note that be is not considered as b is not in the conditional FP-tree
for e.
For each of them (e.g. de), find the prefix paths from the conditional
tree for e, extract frequent itemsets, generate the conditional FP-tree, etc.
(recursive)
Example
 Example: e -> de -> ade ({d,e}, {a,d,e} are found to be frequent)

• Example: e -> ce ({c,e} is found to be frequent)


Result
Frequent itemsets found (ordered by suffix and the order in which they are
found).
Advantages of FP-Tree
 Completeness:
 Never breaks a long pattern of any transaction

 Preserves complete information for frequent pattern mining

 Compactness
 Reduce irrelevant information—infrequent items are gone

 Frequency descending ordering: more frequent items are more
likely to be shared
 Never larger than the original database (not counting node-links
and counters)
Advantages/Disadvantages of FP growth

 Advantages of FP-Growth
 Only 2 passes over data-set
 “Compresses” data-set
 No candidate generation
 Much faster than Apriori

 Disadvantages of FP-Growth
 FP-Tree may not fit in memory!!
 FP-Tree is expensive to build
Constraint-based (Query-Directed) Mining

 Finding all the patterns in a database autonomously? —


unrealistic!
 The patterns could be too many but not focused!
 Data mining should be an interactive process
 User directs what to be mined using a data mining query
language (or a graphical user interface)
 Constraint-based mining
 User flexibility: provides constraints on what to be mined
 System optimization: explores such constraints for efficient
mining—constraint-based mining



Constraints in Data Mining

 Knowledge type constraint: Specify the type of knowledge to be mined


 classification, association, etc.

 Data constraint: the set of task-relevant data, specified using SQL-like queries


 find product pairs sold together in stores in Nyeri in Dec.’14

 Dimension/level constraint: Specify the desired dimensions (or attributes)


of the data, or levels of the concept hierarchies, to be used in mining.
 in relevance to region, price, brand, customer category

 Rule (or pattern) constraint : The form of rules to be mined


 small sales (price < Ksh.10) triggers big sales (sum > Ksh.200)

 Interestingness constraint: Specify thresholds on statistical measures of


rule interestingness
 strong rules: min_support ≥ 3%, min_confidence ≥ 60%
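
As a rough illustration, rule and interestingness constraints such as the two above can at the very least be applied as a post-filter over discovered rules (the rule representation, field names and prices here are illustrative):

def passes_constraints(rule, min_support=0.03, min_confidence=0.60,
                       max_lhs_price=10, min_rhs_sum=200):
    # rule: dict with support, confidence and the item prices on each side.
    interesting = (rule["support"] >= min_support
                   and rule["confidence"] >= min_confidence)
    pattern_ok = (all(p < max_lhs_price for p in rule["lhs_prices"])   # small sales...
                  and sum(rule["rhs_prices"]) > min_rhs_sum)           # ...trigger big sales
    return interesting and pattern_ok

rule = {"support": 0.04, "confidence": 0.72,
        "lhs_prices": [8, 5], "rhs_prices": [120, 150]}
print(passes_constraints(rule))   # True

A real constraint-based miner exploits such constraints during mining rather than only as a post-filter, which is the system-optimization point made on the previous slide.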

