Week_13-ARM

Data Mining

• DM functionalities
– Association (correlation and causality)
• Multi-dimensional vs. single-dimensional association
• age(X, “20..29”) ∧ income(X, “20..29K”) ⟶ buys(X, “PC”)
[support = 2%, confidence = 60%]
• contains(T, “computer”) ⟶ contains(T, “software”) [1%, 75%]
– Classification and Prediction
• Finding models (functions) that describe and distinguish classes
or concepts for future predictions
• Presentation: decision tree, classification rules, neural network
• Prediction: predict some unknown or missing numerical values

Acknowledgment: Thanks to Wolf-Tilo Balke and Silviu Homoceanu (TU Braunschweig) for the slides.
Data Mining
– Cluster analysis
• Class label is unknown: group data to form new classes, e.g.,
cluster houses to find distribution patterns
• Clustering based on the principle: maximizing the intra-class
similarity and minimizing the inter-class similarity
– Outlier analysis
• Outlier: a data object that does not comply with the general
behavior of the data
• Can be considered noise or an exception, but is quite useful
in fraud detection and rare-event analysis

Association Rule Mining
• Association rule mining has the objective of
finding all co-occurrence relationships (called
associations) among data items
– Classical application: market basket data analysis,
which aims to discover how items are purchased by
customers in a supermarket
• E.g., Cheese ⟶ Bread [support = 10%, confidence = 80%]
means that 10% of the customers buy cheese and bread together,
and 80% of the customers buying cheese also buy bread.

Association Rule Mining
• Basic concepts of association rules
– Let I = {i1, i2, …, im} be a set of items.
Let T = {t1, t2, …, tn} be a set of
transactions, where each transaction ti is
a set of items such that ti ⊆ I.

– An association rule is an implication of the form:

X ⟶ Y, where X ⊂ I, Y ⊂ I and X ⋂ Y = ∅
• E.g., Bread ⟶ Butter is allowed, but not Bread ⟶ Bread

Association Rule Mining

• Association rule mining, market
basket analysis example
– I – the set of all items sold in a store
• E.g., i1 = Beef, i2 = Chicken, i3 = Cheese, …
– T – the set of transactions
• The content of a customer's basket
• E.g., t1: Beef, Chicken, Milk; t2: Beef, Cheese; t3: Cheese,
Bread; t4: …
– An association rule might be
• Beef, Chicken ⟶ Milk, where {Beef, Chicken} is X and
{Milk} is Y

Association Rule Mining
• Rules can be weak or strong
– The strength of a rule is measured by its
support and confidence
– The support of a rule X ⟶ Y is the percentage of
transactions in T that contain X ∪ Y
• Can be seen as an estimate of the probability Pr(X ∪ Y ⊆ ti)
• With n as the number of transactions in T, the support of the
rule X ⟶ Y is:
support = |{i | X ∪ Y ⊆ ti}| / n
• Support measures how much of the data the rule covers, while
confidence measures the strength of the implication (the bond between X and Y)

Association Rule Mining
– The confidence of a rule X ⟶ Y is the percentage of
transactions in T containing X that also contain X ∪ Y
• Can be seen as an estimate of the probability Pr(Y ⊆ ti | X ⊆ ti)

confidence = |{i | X ∪ Y ⊆ ti}| / |{j | X ⊆ tj}|
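
These two measures translate directly into code. A minimal sketch in Python, assuming transactions are represented as a list of item sets (the function names and data layout are illustrative, not from the slides):

```python
def support(X, Y, transactions):
    """Fraction of transactions that contain every item of X union Y."""
    XY = set(X) | set(Y)
    return sum(XY <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    """Fraction of the transactions containing X that also contain X union Y."""
    X, XY = set(X), set(X) | set(Y)
    n_X = sum(X <= t for t in transactions)
    n_XY = sum(XY <= t for t in transactions)
    return n_XY / n_X if n_X else 0.0
```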

Association Rule Mining
– Lift
• The lift of the rule X ⟶ Y is the confidence of the rule
divided by the expected confidence, assuming that the
itemsets X and Y are independent of each other. The
expected confidence is simply the frequency (support) of Y.
– Lift(X ⟶ Y) = Conf(X ⟶ Y) / Supp(Y)
• A lift value near 1 indicates that X and Y appear together
about as often as expected; greater than 1 means they appear
together more often than expected, and less than 1 means
they appear together less often than expected. Greater lift
values indicate a stronger association
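
Continuing the sketch above, lift divides the confidence by the support of Y on its own (passing an empty X to the support helper leaves only Y):

```python
def lift(X, Y, transactions):
    """Lift(X -> Y) = Conf(X -> Y) / Supp(Y), reusing the helpers above."""
    return confidence(X, Y, transactions) / support((), Y, transactions)
```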
Association Rule Mining
• How do we interpret support and confidence?
– If support is too low, the rule may just occur due to
chance
• Acting on a rule with low support may not be profitable,
since it covers too few cases
– If confidence is too low, we cannot reliably predict Y
from X
• The objective of mining association rules is to
discover all association rules in T that have
support and confidence greater than a minimum
threshold (minsup, minconf)!

Association Rule Mining
• Finding rules based on support and confidence
thresholds
– Let minsup = 30% and minconf = 80%

Transactions:
T1: Beef, Chicken, Milk
T2: Beef, Cheese
T3: Cheese, Boots
T4: Beef, Chicken, Cheese
T5: Beef, Chicken, Clothes, Cheese, Milk
T6: Clothes, Chicken, Milk
T7: Chicken, Milk, Clothes

– Chicken, Clothes ⟶ Milk is valid [sup = 3/7 (42.86%), conf = 3/3 (100%)]
– Clothes ⟶ Milk, Chicken is also valid,
and there are more…
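
Transcribing the table and checking the first rule with the helpers sketched earlier reproduces these numbers:

```python
transactions = [
    {"Beef", "Chicken", "Milk"},                        # T1
    {"Beef", "Cheese"},                                 # T2
    {"Cheese", "Boots"},                                # T3
    {"Beef", "Chicken", "Cheese"},                      # T4
    {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},   # T5
    {"Clothes", "Chicken", "Milk"},                     # T6
    {"Chicken", "Milk", "Clothes"},                     # T7
]

X, Y = {"Chicken", "Clothes"}, {"Milk"}
print(support(X, Y, transactions))     # 3/7 = 0.4286, >= minsup of 30%
print(confidence(X, Y, transactions))  # 3/3 = 1.0,    >= minconf of 80%
```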

Association Rule Mining
• This is a rather simplistic view of shopping
baskets
– Some important information is not considered, e.g.,
the quantity of each item purchased, the price paid, …
• There is a large number of rule mining
algorithms
– They use different strategies and data structures
– Given the same thresholds, their resulting sets of rules
are all the same
Association Rule Mining
• Approaches in association rule mining
– Apriori algorithm
– Mining with multiple minimum supports
– Mining class association rules
• The best known mining algorithm is the Apriori
algorithm
– Step 1: find all frequent itemsets
(set of items with support ≥ minsup)
– Step 2: use frequent itemsets to generate rules

Apriori Algorithm: Step 1
• Step 1: frequent itemset generation
– The key is the apriori property (downward closure
property): any subset of a frequent itemset is also a
frequent itemset
• E.g., for minsup = 30%: {Chicken, Clothes, Milk} is frequent,
so all of its subsets must be frequent as well:
{Chicken, Clothes}, {Chicken, Milk}, {Clothes, Milk},
{Chicken}, {Clothes}, {Milk}

Transactions:
T1: Beef, Chicken, Milk
T2: Beef, Cheese
T3: Cheese, Boots
T4: Beef, Chicken, Cheese
T5: Beef, Chicken, Clothes, Cheese, Milk
T6: Clothes, Chicken, Milk
T7: Chicken, Milk, Clothes
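
The property can also be verified mechanically: every proper subset of a frequent itemset must itself clear minsup. A small sketch, reusing the support helper from earlier (all_subsets_frequent is an illustrative name):

```python
from itertools import combinations

def all_subsets_frequent(itemset, transactions, minsup):
    """Check the downward closure property for one itemset."""
    for k in range(1, len(itemset)):
        for sub in combinations(itemset, k):
            if support(sub, (), transactions) < minsup:
                return False
    return True

# True on the table above: {Chicken, Clothes, Milk} is frequent at 30%,
# and so are all six of its proper subsets
all_subsets_frequent({"Chicken", "Clothes", "Milk"}, transactions, 0.30)
```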

Apriori Algorithm: Step 1
• Finding frequent items
– Find all 1-item frequent itemsets; then all 2-item
frequent itemsets, etc.
– In each iteration k, only consider itemsets built
from frequent (k-1)-itemsets
– Optimization: the algorithm assumes that items are
sorted in lexicographic order
• The order is used throughout the algorithm in each itemset
• {w[1], w[2], …, w[k]} represents a k-itemset w consisting of
items w[1], w[2], …, w[k], where w[1] < w[2] < … < w[k]
according to the lexicographic order

Finding frequent items

– Initial step
• Find frequent itemsets of size 1: F1
– Generalization, k ≥ 2
• Ck = candidates of size k: those itemsets of size k that
could be frequent, given Fk-1
• Fk = those itemsets that are actually frequent, Fk ⊆ Ck
(need to scan the database once)

Apriori Algorithm: Step 1
– Candidate generation uses Fk-1 as input and
returns a superset (the candidates) of the set of all
frequent k-itemsets. It has two steps (a sketch of both follows below):
• Join step: generate all possible candidate itemsets Ck of
length k, e.g., Ik = join(Ak-1, Bk-1) ⟺ Ak-1 = {i1, i2, …, ik-2, ik-1}
and Bk-1 = {i1, i2, …, ik-2, i’k-1} and ik-1 < i’k-1; then
Ik = {i1, i2, …, ik-2, ik-1, i’k-1}
• Prune step: remove those candidates in Ck that do not
respect the downward closure property (i.e., that contain a
non-frequent (k-1)-subset)
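
A sketch of the join and prune steps in Python, representing itemsets as lexicographically sorted tuples (candidate_gen is an illustrative name, not from the slides):

```python
from itertools import combinations

def candidate_gen(F_prev, k):
    """Build C_k from the frequent (k-1)-itemsets in F_prev."""
    F_prev = set(F_prev)
    # Join step: merge two itemsets that agree on their first k-2 items
    joined = {a + (b[-1],)
              for a in F_prev for b in F_prev
              if a[:-1] == b[:-1] and a[-1] < b[-1]}
    # Prune step: keep a candidate only if every (k-1)-subset
    # respects the downward closure property, i.e. appears in F_prev
    return {c for c in joined
            if all(s in F_prev for s in combinations(c, k - 1))}
```

On the F3 of the example that follows, this yields C4 = {(1, 2, 3, 4)}, matching the result after pruning.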

Apriori Algorithm: Step 1
– Candidate generation example: F3 = {{1,2,3}, {1,2,4}, {1,3,4},
{1,3,5}, {2,3,4}}
• Try joining each pair of itemsets from F3:
– {1, 2, 3} and {1, 2, 4} agree on their first two items, so
join({1, 2, 3}, {1, 2, 4}) = {1, 2, 3, 4}
– {1, 3, 4} and {1, 3, 5} agree on their first two items, so
join({1, 3, 4}, {1, 3, 5}) = {1, 3, 4, 5}
– No other pair agrees on its first k-2 = 2 items, so no
further candidates are produced

Apriori Algorithm: Step 1
• After the join, C4 = {{1, 2, 3, 4}, {1, 3, 4, 5}}
• Pruning (recall F3 = {{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}}):
– The 3-subsets of {1, 2, 3, 4} are {1, 2, 3}, {1, 2, 4}, {1, 3, 4}
and {2, 3, 4}; all of them ∈ F3 ⟹ {1, 2, 3, 4} is a good candidate
– The 3-subsets of {1, 3, 4, 5} are {1, 3, 4}, {1, 3, 5}, {1, 4, 5}
and {3, 4, 5}; {1, 4, 5} ∉ F3 ⟹ {1, 3, 4, 5} is removed from C4
• After pruning, C4 = {{1, 2, 3, 4}}

Apriori Algorithm: Step 1
• Finding frequent items, example, minsup = 0.5

TID: Items
T100: 1, 3, 4
T200: 2, 3, 5
T300: 1, 2, 3, 5
T400: 2, 5

– First T scan ({item}:count)
• C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
• F1: {1}:2, {2}:3, {3}:3, {5}:3
– {4} has a support of 1/4 < 0.5, so it does not belong to
the frequent items
• C2 = prune(join(F1))
– join: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}
– prune: C2 = {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}
(all 1-subsets belong to F1)
Apriori Algorithm: Step 1
– Second T scan
• C2: {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2
• F2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2
• Join: we could join {1,3} only with {1,4} or {1,5}, but they are
not in F2. The only possible join in F2 is {2,3} with {2,5},
resulting in {2,3,5}
• prune({2,3,5}): {2,3}, {2,5}, {3,5} all belong to F2,
hence C3 = {{2, 3, 5}}
– Third T scan
• {2,3,5}:2, so sup({2,3,5}) = 50% and the minsup condition is
fulfilled. Then F3 = {{2, 3, 5}}
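
Putting the scans together, a compact sketch of the whole Step 1 (it builds on the candidate_gen sketch above; itemsets are sorted tuples, and the return value maps each frequent itemset to its support):

```python
from collections import defaultdict

def apriori(transactions, minsup):
    """Return every frequent itemset together with its support."""
    n = len(transactions)
    counts = defaultdict(int)
    for t in transactions:                  # first scan: count 1-itemsets
        for item in t:
            counts[(item,)] += 1
    frequent = {c: v / n for c, v in counts.items() if v / n >= minsup}
    result, k = dict(frequent), 2
    while frequent:
        counts = defaultdict(int)
        for c in candidate_gen(frequent, k):
            for t in transactions:          # one scan of T per level k
                counts[c] += set(c) <= t
        frequent = {c: v / n for c, v in counts.items() if v / n >= minsup}
        result.update(frequent)
        k += 1
    return result

# apriori([{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}], minsup=0.5)
# ends with F3 = {(2, 3, 5)} at support 0.5, as in the scans above
```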
Apriori Algorithm: Step 2
• Step 2: generating rules from frequent itemsets
– Frequent itemsets are not the same as association
rules
– One more step is needed to generate association
rules: for each frequent itemset I and each proper
nonempty subset X of I:
• Let Y = I \ X; X ⟶ Y is an association rule if
Confidence(X ⟶ Y) ≥ minconf, where:
– Support(X ⟶ Y) := |{i | X ∪ Y ⊆ ti}| / n = support(I)
– Confidence(X ⟶ Y) := |{i | X ∪ Y ⊆ ti}| / |{j | X ⊆ tj}|
= support(I) / support(X)
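
A sketch of this second step in Python, consuming the itemset-to-support map produced by the Step 1 sketch (generate_rules is an illustrative name; note that no scan of the transactions is needed):

```python
from itertools import combinations

def generate_rules(supports, minconf):
    """Emit (X, Y, support, confidence) for every rule reaching minconf."""
    rules = []
    for itemset, supp_I in supports.items():
        if len(itemset) < 2:
            continue                        # a rule needs both X and Y
        for k in range(1, len(itemset)):
            for X in combinations(itemset, k):
                Y = tuple(i for i in itemset if i not in X)
                # downward closure guarantees support(X) is in the map
                conf = supp_I / supports[X]  # support(I) / support(X)
                if conf >= minconf:
                    rules.append((X, Y, supp_I, conf))
    return rules
```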

Apriori Algorithm: Step 2
• Rule generation example, minconf = 50%
– Suppose {2,3,5} is a frequent itemset with sup = 50%, as
calculated in Step 1
– Proper nonempty subsets: {2,3}, {2,5}, {3,5}, {2}, {3}, {5},
with sup = 50%, 75%, 50%, 75%, 75%, 75% respectively
– These generate the following association rules:
• 2,3 ⟶ 5, confidence = 100% (sup(I) = 50%; sup({2,3}) = 50%; 50/50 = 1)
• 2,5 ⟶ 3, confidence = 67% (50/75)
• 3,5 ⟶ 2, confidence = 100% (50/50)
• 2 ⟶ 3,5, confidence = 67% (50/75)
• 3 ⟶ 2,5, confidence = 67% (50/75)
• 5 ⟶ 2,3, confidence = 67% (50/75)
– All rules have support = support(I) = 50%
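
Running the two sketches end to end on the four-transaction example reproduces exactly these six rules:

```python
supports = apriori([{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}], minsup=0.5)
for X, Y, sup, conf in generate_rules(supports, minconf=0.5):
    if set(X) | set(Y) == {2, 3, 5}:        # rules from the itemset {2,3,5}
        print(f"{X} -> {Y}: sup = {sup:.0%}, conf = {conf:.0%}")
# e.g. (2, 3) -> (5,): sup = 50%, conf = 100%, and the other five rules
```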
Apriori Algorithm: Step 2
• Rule generation, summary
– In order to obtain X ⟶ Y, we need to know
support(I) and support(X)
– All the required information for the confidence
computation has already been recorded during itemset
generation
• No need to read the transaction data anymore
• This step is not as time-consuming as frequent itemset
generation

Apriori Algorithm
• Apriori algorithm, summary
– If k is the size of the largest itemset, then it makes at
most k passes over the data (in practice, k is bounded,
e.g., by 10)
– The mining exploits the sparseness of the data and high
minsup and minconf thresholds
– A high minsup threshold makes it impossible to
find rules involving rare items in the data
• The solution is a mining with multiple
minimum supports approach

Summary
• Association Rule Mining
Next week

• Multiple Minimum Supports

