Assignment 2
Contents

1 Question 1
  1.1 Description
  1.2 Answer
2 Question 2
  2.1 Description
  2.2 Answer
3 Question 3
  3.1 Description
  3.2 Answer
KSE525 Assignment 1
1 Question 1
1.1 Description
The Apriori algorithm uses a generate-and-count strategy for deriving frequent itemsets. Candidate itemsets of size k + 1 are created by joining a pair of frequent itemsets of size k (this is known as the candidate generation step). A candidate is discarded if any one of its subsets is found to be infrequent during the candidate pruning step. Suppose the Apriori algorithm is applied to the dataset shown below with minsup = 30%, i.e., any itemset occurring in fewer than 3 transactions is considered to be infrequent.
1. Draw an itemset lattice representing the dataset given in the above table. Label each node in the lattice with the following letter(s): N (the itemset is not considered to be a candidate), F (a frequent candidate itemset), and I (a candidate itemset found to be infrequent).
2. What is the percentage of frequent itemsets (with respect to all itemsets in the
lattice)?
3. What is the pruning ratio of the Apriori algorithm on this data set? (Pruning
ratio is defined as the percentage of itemsets not considered to be a candidate
because (1) they are not generated during candidate generation or (2) they are
pruned during the candidate pruning step.)
4. What is the false alarm rate (i.e., the percentage of candidate itemsets that are found to be infrequent after performing support counting)?
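The candidate generation and pruning steps described above can be sketched in Python. This is a minimal illustration of the generate-and-prune idea; the function names are my own, not part of the assignment.

```python
from itertools import combinations

def generate_candidates(frequent_k):
    """F_k x F_k join: merge two frequent k-itemsets that share their
    first k-1 items into a candidate of size k+1."""
    frequent_k = sorted(frequent_k)
    candidates = set()
    for i in range(len(frequent_k)):
        for j in range(i + 1, len(frequent_k)):
            a, b = frequent_k[i], frequent_k[j]
            if a[:-1] == b[:-1]:  # identical (k-1)-prefix
                candidates.add(tuple(sorted(set(a) | set(b))))
    return candidates

def prune_candidates(candidates, frequent_k):
    """Candidate pruning: discard any candidate that has an
    infrequent subset of size k."""
    frequent_set = set(frequent_k)
    return {c for c in candidates
            if all(s in frequent_set for s in combinations(c, len(c) - 1))}

# The eight frequent 2-itemsets from the count table in the answer below:
F2 = [("a", "b"), ("a", "d"), ("a", "e"), ("b", "c"),
      ("b", "d"), ("b", "e"), ("c", "d"), ("d", "e")]
C3 = generate_candidates(F2)          # 6 candidate triples
C3_pruned = prune_candidates(C3, F2)  # 5 survive
```

Applied to the eight frequent 2-itemsets, the join step produces six candidate triples, and the pruning step drops (b, c, e) because its subset (c, e) is infrequent, leaving the five size-3 candidates in the count table.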
1.2 Answer
1. Before drawing the lattice, we should calculate the support of each itemset.
s1 = σ(a,b,d,e)/|T| = 2/10 = 0.2
s2 = σ(b,c,d)/|T| = 2/10 = 0.2
s3 = σ(a,b,d,e)/|T| = 2/10 = 0.2
s4 = σ(a,c,d,e)/|T| = 1/10 = 0.1
s5 = σ(b,c,d,e)/|T| = 1/10 = 0.1
s6 = σ(b,d,e)/|T| = 4/10 = 0.4
s7 = σ(c,d)/|T| = 4/10 = 0.4
s8 = σ(a,b,c)/|T| = 1/10 = 0.1
s9 = σ(a,d,e)/|T| = 4/10 = 0.4
s10 = σ(b,d)/|T| = 6/10 = 0.6
Item  Count     Item  Count     Item   Count
a     5         a,b   3         a,b,d  2
b     7         a,c   2         a,b,e  2
c     5         a,d   4         b,c,d  2
d     9         a,e   4         a,d,e  4
e     6         b,c   3         b,d,e  4
                b,d   6
                b,e   4
                c,d   4
                c,e   2
                d,e   6
2. The percentage of frequent itemsets with respect to all itemsets in the lattice is

Freq = Σ(F)/31 = 15/31 ≈ 48.4%
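As a sanity check, the 15/31 figure can be recomputed with a short Python snippet over the support counts from the table above:

```python
# Support counts copied from the table above (10 transactions, minsup count 3)
counts = {
    "a": 5, "b": 7, "c": 5, "d": 9, "e": 6,
    "a,b": 3, "a,c": 2, "a,d": 4, "a,e": 4, "b,c": 3,
    "b,d": 6, "b,e": 4, "c,d": 4, "c,e": 2, "d,e": 6,
    "a,b,d": 2, "a,b,e": 2, "b,c,d": 2, "a,d,e": 4, "b,d,e": 4,
}

frequent = [itemset for itemset, count in counts.items() if count >= 3]
total_itemsets = 2 ** 5 - 1  # all non-empty subsets of {a, b, c, d, e}

print(len(frequent))                                    # 15
print(total_itemsets)                                   # 31
print(round(100 * len(frequent) / total_itemsets, 1))   # 48.4
```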
3. The pruning ratio of the algorithm is calculated by summing the number of itemsets that are infrequent or never become Apriori candidates, and then dividing by the total number of itemsets. Thus,
Prun = Σ(I+N)/31 = 16/31 ≈ 51.6%
4. The false alarm rate is calculated by dividing the number of candidate itemsets found to be infrequent after performing support counting by the total number of itemsets. Thus,
Alarm = Σ(I)/31 = 5/31 ≈ 16%
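This, too, can be checked against the count table: of the 20 candidate itemsets that reach support counting, 5 turn out to be infrequent (the division by 31 follows the convention used above):

```python
# Support counts of the 20 candidate itemsets from the table above
candidate_counts = {
    "a": 5, "b": 7, "c": 5, "d": 9, "e": 6,
    "a,b": 3, "a,c": 2, "a,d": 4, "a,e": 4, "b,c": 3,
    "b,d": 6, "b,e": 4, "c,d": 4, "c,e": 2, "d,e": 6,
    "a,b,d": 2, "a,b,e": 2, "b,c,d": 2, "a,d,e": 4, "b,d,e": 4,
}

# Candidates whose support count falls below the minsup count of 3
false_alarms = [s for s, c in candidate_counts.items() if c < 3]
print(sorted(false_alarms))  # ['a,b,d', 'a,b,e', 'a,c', 'b,c,d', 'c,e']
print(round(100 * len(false_alarms) / 31, 1))  # 16.1
```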
2 Question 2
2.1 Description
The following contingency table summarizes supermarket transaction data, where hot dogs refers to the transactions containing hot dogs, ¬hot dogs refers to the transactions that do not contain hot dogs, hamburgers refers to the transactions containing hamburgers, and ¬hamburgers refers to the transactions that do not contain hamburgers.
1. Suppose that the association rule “hot dogs ⇒ hamburgers” is mined. Given a minimum support threshold of 25% and a minimum confidence threshold of 50%, is this association rule strong?
2. Based on the given data, is the purchase of hot dogs independent of the purchase
of hamburgers? If not, what kind of correlation relationship exists between the
two?
2.2 Answer
1. In order for the association rule to be strong, the support should be greater than the minimum support threshold and the confidence should be greater than the minimum confidence threshold.
In our case,

sup = σ(hot dogs, hamburgers)/|T| = 2000/5000 = 40%

conf = σ(hot dogs, hamburgers)/σ(hot dogs) = 2000/3000 ≈ 66.7%
Both are greater than their thresholds, so we can say that the association rule is strong.
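The arithmetic can be verified with a few lines of Python, using the contingency counts that appear in the lift calculation below (2000 transactions with both items, 3000 with hot dogs, 5000 in total):

```python
both = 2000        # transactions containing hot dogs and hamburgers
hot_dogs = 3000    # transactions containing hot dogs
total = 5000       # all transactions

support = both / total        # 0.4
confidence = both / hot_dogs  # 0.666...

print(round(100 * support, 1))     # 40.0
print(round(100 * confidence, 1))  # 66.7
print(support >= 0.25 and confidence >= 0.50)  # True -> the rule is strong
```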
2. The purchase of hot dogs is independent of the purchase of hamburgers only if

P(A ∪ B) = P(A)P(B)

and this means that the lift equals 1, calculated by the following equation:

lift = P(A ∪ B) / (P(A)P(B))

In case the lift is greater than 1, the itemsets are positively correlated, and if it is less than 1, the itemsets are negatively correlated. So,

lift = P(hot dogs ∪ hamburgers) / (P(hot dogs)P(hamburgers)) = (2000/5000) / ((3000/5000)(2500/5000)) = 1.33

Since the lift is greater than 1, the purchases are not independent and the itemsets are positively correlated.
AllConf = sup(AB) / max(sup(A), sup(B))

MaxConf = max(sup(AB)/sup(A), sup(AB)/sup(B))

Kulc = (sup(AB)/2)(1/sup(A) + 1/sup(B))

Cosine = sup(AB) / √(sup(A)·sup(B))
Lift will be calculated by using the equation from the previous sub-question.
          hd ∧ hb   ¬hd ∧ hb   hd ∧ ¬hb   ¬hd ∧ ¬hb   AllConf   MaxConf   Kulc    Cosine   Lift
Dataset   2000      500        1000       1500        0.67      0.8       0.733   0.730    1.33
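As a sanity check, all five measures can be recomputed in Python from the contingency counts:

```python
from math import sqrt

# P(hd ∧ hb), P(hd), P(hb) from the contingency table
p_ab = 2000 / 5000
p_a = 3000 / 5000   # hot dogs
p_b = 2500 / 5000   # hamburgers

all_conf = p_ab / max(p_a, p_b)
max_conf = max(p_ab / p_a, p_ab / p_b)
kulc = (p_ab / 2) * (1 / p_a + 1 / p_b)
cosine = p_ab / sqrt(p_a * p_b)
lift = p_ab / (p_a * p_b)

print(round(all_conf, 2))  # 0.67
print(round(max_conf, 2))  # 0.8
print(round(kulc, 3))      # 0.733
print(round(cosine, 3))    # 0.73
print(round(lift, 2))      # 1.33
```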
3 Question 3
3.1 Description
Install R and then two packages arules and arulesViz. Answer the following ques-
tions using R. For each question, hand in your R code as well as your answer (result).
1. Load the “Groceries” data set. Please obtain the following information: (i) the
most frequent item, (ii) the length of the longest transaction, and (iii) the first
five transactions.
2. Mine all association rules with the minimum support 0.001 and the minimum confidence 0.8.
3. Draw a scatter plot for all association rules. Here, the x-axis represents the support, the y-axis represents the confidence, and the shading of a point represents the lift. [Hint: use the “plot” function in the arulesViz package.]
4. Select the top-3 association rules according to the lift and print these rules.
5. Draw the top-3 rules as a graph such that a node becomes an item. [Hint: use
the “plot” function in the arulesViz package.]
3.2 Answer
1. Before answering the questions, we should first install the required packages in RStudio and load the "Groceries" dataset.
# Install dependencies
install.packages("arules")

# Load libraries
library("Matrix")
library("arules")

# Load the Groceries dataset
data("Groceries")
Now we are able to run some statistics on the dataset. First, we can check the most frequent item and the length of the longest transaction just by looking at the summary of the dataset with summary(Groceries). We can see that the most frequent item is "whole milk", and the longest transaction consists of 32 items. The first five transactions can then be printed on the console with inspect(Groceries[1:5]).
2. After running some simple statistics on the dataset, and knowing what it consists of, we are able to mine association rules from the itemsets.

# Mine rules with minimum support 0.001 and minimum confidence 0.8
rules <- apriori(Groceries, parameter = list(support = 0.001, confidence = 0.8))

# Inspect rules
inspect(rules[1:10])
3. In order to draw the scatter plot we should first install the arulesViz package
and then load the library.
# Install dependencies
install.packages("arulesViz")

# Load libraries
library("arulesViz")

# Plot rules
plot(rules)
4. The top-3 association rules according to the lift are the following.
5. In order to draw the graph of those rules, we first need to save them in a separate variable to feed into the plot function.

# Create a subrules variable to draw the graph
subrules <- subset(rules, lift >= 8.34)

# Draw graph
plot(subrules, method = "graph")