Session8 PDF
Today's Objective
Key Concepts:
• Frequent Itemsets: The sets of items that have minimum support (denoted
by Li for the frequent i-itemsets).
• Join Operation: To find Lk, a set of candidate k-itemsets is generated by
joining Lk-1 with itself.
• Apriori Property: Any subset of a frequent itemset must also be frequent.
Indian Institute of Management (IIM),Rohtak
Understanding Apriori through an Example
[Figure: tables of candidate sets C1 and C2 and frequent 1-itemsets L1]
Step 3: Generating 3-itemset Frequent Pattern
Generate candidate set C3 using L2 (join step).

L2:
Itemset     Sup. Count
{I1, I2}    4
{I1, I3}    4
{I1, I5}    2
{I2, I3}    4
{I2, I4}    2
{I2, I5}    2

The condition for joining Lk-1 with Lk-1 is that the two itemsets have (k-2)
elements in common. So here, for L2, the first element should match.
• The generation of the set of candidate 3-itemsets, C3, involves use of the
Apriori Property.
• C3 = L2 Join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4},
{I2, I3, I5}, {I2, I4, I5}}.
If we go for all:
• C3 = L2 Join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I2, I4},
{I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.
• Now the join step is complete, and the prune step will be used to
reduce the size of C3. The prune step helps to avoid heavy
computation due to a large Ck.
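The join and prune steps described above can be sketched in base R. This is a minimal illustration, not the arules implementation: the helper names apriori_join and apriori_prune are hypothetical, and each itemset is assumed to be stored as a sorted character vector.

```r
# Frequent 2-itemsets (L2) from the table above, each a sorted character vector.
L2 <- list(c("I1", "I2"), c("I1", "I3"), c("I1", "I5"),
           c("I2", "I3"), c("I2", "I4"), c("I2", "I5"))

# Join step (hypothetical helper): merge two (k-1)-itemsets whose
# first k-2 items are identical.
apriori_join <- function(Lprev) {
  k1 <- length(Lprev[[1]])            # size of the itemsets being joined
  out <- list()
  for (i in seq_along(Lprev)) {
    for (j in seq_along(Lprev)) {
      if (j <= i) next                # consider each unordered pair once
      a <- Lprev[[i]]
      b <- Lprev[[j]]
      if (identical(a[seq_len(k1 - 1)], b[seq_len(k1 - 1)])) {
        out[[length(out) + 1]] <- sort(union(a, b))
      }
    }
  }
  out
}

# Prune step (hypothetical helper): drop a candidate if any of its
# (k-1)-subsets is not frequent (Apriori property).
apriori_prune <- function(Ck, Lprev) {
  keys <- vapply(Lprev, paste, character(1), collapse = ",")
  keep <- vapply(Ck, function(cand) {
    subsets <- combn(cand, length(cand) - 1, simplify = FALSE)
    all(vapply(subsets, paste, character(1), collapse = ",") %in% keys)
  }, logical(1))
  Ck[keep]
}

C3 <- apriori_join(L2)                # six candidates, as listed above
C3_pruned <- apriori_prune(C3, L2)    # candidates that survive pruning
print(C3_pruned)
```

Only {I1, I2, I3} and {I1, I2, I5} survive, because every other candidate contains a 2-itemset such as {I3, I5} or {I4, I5} that is not in L2.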
Step 3: Generating 3-itemset Frequent Pattern
Scan D for the count of each candidate:

C3:
Itemset         Sup. Count
{I1, I2, I3}    2
{I1, I2, I5}    2

Compare each candidate's support count with the minimum support count:

L3:
Itemset         Sup. Count
{I1, I2, I3}    2
{I1, I2, I5}    2
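The scan of D can be sketched in base R as follows. The transaction database itself is not reproduced in these notes, so the nine transactions below are an assumption: a toy database chosen to be consistent with the support counts shown in the L2 and C3 tables. The minimum support count of 2 and the helper name support_count are likewise assumptions for illustration.

```r
# Assumed toy database D (not given in the notes), consistent with the
# support counts shown in the tables.
D <- list(
  c("I1", "I2", "I5"), c("I2", "I4"), c("I2", "I3"),
  c("I1", "I2", "I4"), c("I1", "I3"), c("I2", "I3"),
  c("I1", "I3"), c("I1", "I2", "I3", "I5"), c("I1", "I2", "I3")
)

# Hypothetical helper: number of transactions that contain every item
# of the itemset.
support_count <- function(itemset, transactions) {
  sum(vapply(transactions, function(t) all(itemset %in% t), logical(1)))
}

C3 <- list(c("I1", "I2", "I3"), c("I1", "I2", "I5"))
min_sup <- 2L                          # assumed minimum support count

counts <- vapply(C3, support_count, integer(1), transactions = D)
L3 <- C3[counts >= min_sup]            # candidates meeting the threshold
print(counts)
```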
• Back To Example:
We had L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4},
{I2,I5}, {I1,I2,I3}, {I1,I2,I5}}.
– Let's take l = {I1,I2,I5}.
– All of its nonempty proper subsets are {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}.
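Each nonempty proper subset s of l gives a candidate rule s => (l - s), whose confidence is the support count of l divided by the support count of s. The sketch below enumerates those rules in base R. The database D is an assumption (a toy database consistent with the support counts in the earlier tables), and support_count is a hypothetical helper.

```r
# Assumed toy database D (not given in the notes).
D <- list(
  c("I1", "I2", "I5"), c("I2", "I4"), c("I2", "I3"),
  c("I1", "I2", "I4"), c("I1", "I3"), c("I2", "I3"),
  c("I1", "I3"), c("I1", "I2", "I3", "I5"), c("I1", "I2", "I3")
)

# Hypothetical helper: number of transactions containing the itemset.
support_count <- function(itemset, transactions) {
  sum(vapply(transactions, function(t) all(itemset %in% t), logical(1)))
}

l <- c("I1", "I2", "I5")

# All nonempty proper subsets of l (sizes 1 and 2).
subsets <- unlist(
  lapply(seq_len(length(l) - 1), function(m) combn(l, m, simplify = FALSE)),
  recursive = FALSE
)

# One rule per subset s: s => (l minus s), with confidence sup(l) / sup(s).
rules <- data.frame(
  lhs = vapply(subsets, paste, character(1), collapse = ","),
  rhs = vapply(subsets, function(s) paste(setdiff(l, s), collapse = ","),
               character(1)),
  confidence = vapply(subsets,
                      function(s) support_count(l, D) / support_count(s, D),
                      numeric(1))
)
print(rules)
```

A minimum confidence threshold, such as the 0.7 used in the cases later in this session, would then keep only the rows whose confidence reaches it.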
Therefore, the set of all frequent itemsets is {A}, {B}, {D}, {A B}, {A D},
{B D}, {A B D}.
The Apriori algorithm generated 27 rules with the given constraints. Let's dive into the
Parameter Specification section of the output.
• minval is the minimum value that the additional rule evaluation measure selected
with arem must reach.
• smax is the maximum support value for an itemset.
• arem is an Additional Rule Evaluation Parameter. In the above code we have
constrained the number of rules using support and confidence. There are several
other ways to constrain the rules using the arem parameter in the function, and we
will discuss more about it later in the session.
• aval is a logical indicating whether to return the additional rule evaluation measure
selected with arem.
• originalSupport: The traditional support value considers both the LHS and RHS
items when calculating support. If you want to use only the LHS items for the
calculation, set this to FALSE.
• maxtime is the maximum amount of time (in seconds) allowed to check for subsets.
• minlen is the minimum number of items required in the rule.
• maxlen is the maximum number of items that can be present in the rule.
https://towardsdatascience.com/a-gentle-introduction-on-market-basket-analysis-association-rules-fa4b986a40ce
Case (items.csv)
Question: Find the association rules with support = 0.22 and
confidence = 0.7.
Sol: Save the "items.csv" file in the working directory and load the data:
item <- read.transactions("items.csv", format = "basket", sep = ",")
summary(item)
Note: read.transactions requires the "arules" package to be installed and
loaded.
http://www.learnbymarketing.com/1043/working-with-arules-transactions-and-read-transactions/
Case (supermarket.csv)
Question: Find the association rules with support = 0.4 and
confidence = 0.95.
Sol: Save the file in the Documents folder (working directory).
Load the data:
• supermarket <- read.transactions("supermarket.csv", format = "basket", sep = ",")
• summary(supermarket)
Note: format = "basket" is used when each line of the file lists the
items of one transaction.
Mining the rules
• rules.all <- apriori(supermarket, parameter =
list(minlen=2, supp=0.4, conf=0.95))
• inspect(rules.all)
Association Mining
summary(groceries1)
transactions as itemMatrix in sparse format with
9835 rows (elements/itemsets/transactions) and
169 columns (items) and a density of 0.02609146
inspect(groceries1)
Case (groceries.csv)
Display the first three transactions:
inspect(groceries1[1:3])
    items
[1] {citrus fruit,margarine,ready soups,semi-finished bread}
[2] {coffee,tropical fruit,yogurt}
[3] {whole milk}
itemFrequencyPlot(groceries1,topN=10)
itemFrequencyPlot(groceries1,support=.15)
inspect(sort(m1,by="lift")[1:4])
    lhs                                           rhs                   support     confidence lift
[1] {citrus fruit,other vegetables,whole milk} => {root vegetables}     0.005795628 0.4453125  4.085493
[2] {butter,other vegetables}                  => {whipped/sour cream}  0.005795628 0.2893401  4.036397
[3] {herbs}                                    => {root vegetables}     0.007015760 0.4312500  3.956477
[4] {citrus fruit,pip fruit}                   => {tropical fruit}      0.005592272 0.4044118  3.854060
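Lift is the rule's confidence divided by the support of its right-hand side, so the RHS support can be recovered from the output above as confidence / lift. The base R sketch below does this arithmetic with the printed values. The data frame is typed in by hand for illustration and is not produced by arules.

```r
# Values copied from the inspect() output above.
top_rules <- data.frame(
  rhs        = c("root vegetables", "whipped/sour cream",
                 "root vegetables", "tropical fruit"),
  confidence = c(0.4453125, 0.2893401, 0.4312500, 0.4044118),
  lift       = c(4.085493, 4.036397, 3.956477, 3.854060)
)

# lift = confidence / support(rhs), hence support(rhs) = confidence / lift.
top_rules$rhs_support <- top_rules$confidence / top_rules$lift
print(top_rules)
```

Rules [1] and [3] share the same right-hand side, so their implied RHS supports should agree up to rounding, which is a quick consistency check on the output.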
Rattle
• Rattle (the R Analytical Tool To Learn
Easily) is a graphical data mining application
built upon the statistical language R
• It provides a Graphical User Interface (GUI)
for easier use
• It is used for data mining and statistical model
building
https://rattle.togaware.com/
Installing rattle
• Before installing rattle, you need to install the RGtk2
package
• In the R console type
install.packages("https://cran.r-project.org/bin/windows/contrib/3.3/RGtk2_2.20.31.zip", repos=NULL)
• Install rattle
install.packages("rattle")
• Load rattle
library(rattle)
Rattle GUI
• rattle() # This will open the Rattle GUI
Frequency plot
• Click on the Freq Plot button in the Associate tab of the Rattle GUI
• You can view the frequency plot in the Output tab of RStudio
Coursetopics.csv
Leverage
leverage(X -> Y) = P(X and Y) - P(X)P(Y)
Leverage measures the difference between how often X and Y appear together
in the data set and how often they would be expected to appear together if
X and Y were statistically independent. The rationale in a sales setting is
to find out how many more units (items X and Y together) are sold than
expected from independent sales. Using a minimum leverage threshold also
incorporates an implicit frequency constraint. E.g., for a minimum leverage
threshold of 0.01% (which corresponds to 10 occurrences in a data set with
100,000 transactions), one can first use an algorithm to find all itemsets
with a minimum support of 0.01% and then filter the found itemsets using
the leverage constraint. Because of this property, leverage can also
suffer from the rare item problem.
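A small base R sketch of the leverage formula and the threshold arithmetic follows. The counts n_x, n_y and n_xy are hypothetical numbers chosen for illustration; only the data set size of 100,000 and the 0.01% threshold come from the text.

```r
# leverage(X -> Y) = P(X and Y) - P(X) * P(Y)
leverage <- function(p_xy, p_x, p_y) p_xy - p_x * p_y

n    <- 100000   # transactions in the data set (from the text)
n_x  <- 2000     # hypothetical: transactions containing X
n_y  <- 1000     # hypothetical: transactions containing Y
n_xy <- 60       # hypothetical: transactions containing both X and Y

lev <- leverage(n_xy / n, n_x / n, n_y / n)

# Leverage times n is the number of joint sales beyond what
# independence would predict (n_x * n_y / n = 20 expected here).
extra_units <- lev * n

# A minimum leverage of 0.01% corresponds to this many occurrences.
min_leverage <- 0.0001
min_count <- min_leverage * n

print(c(leverage = lev, extra_units = extra_units, min_count = min_count))
```

Because P(X)P(Y) is never negative, leverage cannot exceed P(X and Y), so any itemset passing a minimum leverage threshold also has at least that much support. This is the implicit frequency constraint mentioned above.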
ORANGE
Item.csv