apriori - mlxtend
Overview
Apriori is a popular algorithm [1] for extracting frequent itemsets, with applications in association rule
learning. The apriori algorithm was designed to operate on databases containing transactions, such as
purchases by customers of a store. An itemset is considered "frequent" if it meets a user-specified support
threshold. For instance, if the support threshold is set to 0.5 (50%), a frequent itemset is defined as a set of
items that occur together in at least 50% of all transactions in the database.
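The support threshold described above can be checked by hand. The following is a minimal sketch in plain Python, not part of mlxtend; the toy transactions and the helper name `support` are made up for illustration.

```python
# Toy transactions, made up for illustration (not taken from the mlxtend docs).
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "eggs"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


min_support = 0.5
for candidate in [{"milk"}, {"milk", "bread"}, {"milk", "bread", "eggs"}]:
    s = support(candidate, transactions)
    label = "frequent" if s >= min_support else "not frequent"
    print(sorted(candidate), s, label)
```

With a threshold of 0.5, the pair {milk, bread} still counts as frequent because it occurs in exactly half of the transactions, while the triple does not.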
References
[1] Agrawal, Rakesh, and Ramakrishnan Srikant. "Fast algorithms for mining association rules." Proceedings of
the 20th International Conference on Very Large Data Bases (VLDB), Vol. 1215, 1994.
Related
FP-Growth
FP-Max
We can transform the transaction data (a list of item lists, here named dataset) into the right format via the
TransactionEncoder as follows:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
df
   Apple   Corn   Dill   Eggs  Ice cream  Kidney Beans   Milk  Nutmeg  Onion  Unicorn  Yogurt
0  False  False  False   True      False          True   True    True   True    False    True
1  False  False   True   True      False          True  False    True   True    False    True
2   True  False  False   True      False          True   True   False  False    False   False
3  False   True  False  False      False          True   True   False  False     True    True
4  False   True  False   True       True          True  False   False   True    False   False
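To illustrate what the encoding step produces, here is a pure-Python sketch of one-hot encoding. It is not the TransactionEncoder implementation; the `toy_dataset` values and the helper `one_hot_encode` are hypothetical and only mirror the alphabetical column order seen in the table above.

```python
# Hypothetical transactions; the item names are illustrative only.
toy_dataset = [
    ["Milk", "Onion", "Eggs"],
    ["Dill", "Onion", "Eggs"],
    ["Milk", "Apple"],
]


def one_hot_encode(transactions):
    """Return (columns, rows): sorted unique items and one boolean row per transaction."""
    columns = sorted({item for t in transactions for item in t})
    rows = [[col in set(t) for col in columns] for t in transactions]
    return columns, rows


columns, rows = one_hot_encode(toy_dataset)
print(columns)
for row in rows:
    print(row)
```

Each row has one boolean per known item, so every transaction ends up with the same width regardless of how many items it contains.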
Now, let us return the items and itemsets with at least 60% support:
from mlxtend.frequent_patterns import apriori
apriori(df, min_support=0.6)
support itemsets
0 0.8 (3)
1 1.0 (5)
2 0.6 (6)
3 0.6 (8)
4 0.6 (10)
5 0.8 (3, 5)
6 0.6 (8, 3)
7 0.6 (5, 6)
8 0.6 (8, 5)
9 0.6 (10, 5)
10 0.6 (8, 3, 5)
By default, apriori returns the column indices of the items, which may be useful in downstream operations
such as association rule mining. For better readability, we can set use_colnames=True to convert these
integer values into the respective item names:
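apriori(df, min_support=0.6, use_colnames=True)
support itemsets
0 0.8 (Eggs)
1 1.0 (Kidney Beans)
2 0.6 (Milk)
3 0.6 (Onion)
4 0.6 (Yogurt)
5 0.8 (Eggs, Kidney Beans)
6 0.6 (Onion, Eggs)
7 0.6 (Kidney Beans, Milk)
8 0.6 (Onion, Kidney Beans)
9 0.6 (Kidney Beans, Yogurt)
10 0.6 (Onion, Eggs, Kidney Beans)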
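The translation from column indices to item names can be sketched in plain Python. This only illustrates the mapping and is not how mlxtend performs it internally; the helper name `to_names` is hypothetical.

```python
# Column names in the order shown in the encoded table.
columns = ["Apple", "Corn", "Dill", "Eggs", "Ice cream", "Kidney Beans",
           "Milk", "Nutmeg", "Onion", "Unicorn", "Yogurt"]

# A few index-based itemsets copied from the index-based apriori output.
index_itemsets = [(3,), (5,), (8, 3), (8, 3, 5)]


def to_names(itemset, columns):
    """Translate a tuple of column indices into a frozenset of item names."""
    return frozenset(columns[i] for i in itemset)


named = [to_names(s, columns) for s in index_itemsets]
for idx, names in zip(index_itemsets, named):
    print(idx, "->", sorted(names))
```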
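To filter the results, we can first store the frequent itemsets in a DataFrame and add a column that holds the
length of each itemset:
frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
frequent_itemsets
support itemsets length
0 0.8 (Eggs) 1
1 1.0 (Kidney Beans) 1
2 0.6 (Milk) 1
3 0.6 (Onion) 1
4 0.6 (Yogurt) 1
5 0.8 (Eggs, Kidney Beans) 2
6 0.6 (Onion, Eggs) 2
7 0.6 (Kidney Beans, Milk) 2
8 0.6 (Onion, Kidney Beans) 2
9 0.6 (Kidney Beans, Yogurt) 2
10 0.6 (Onion, Eggs, Kidney Beans) 3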
Then, we can select the results that satisfy our desired criteria as follows:
frequent_itemsets[ (frequent_itemsets['length'] == 2) &
(frequent_itemsets['support'] >= 0.8) ]
Similarly, using the Pandas API, we can select entries based on the "itemsets" column:
Frozensets
Note that the entries in the "itemsets" column are of type frozenset, which is a built-in Python type that is
similar to a Python set but immutable, which makes it more efficient for certain query or comparison
operations (https://docs.python.org/3.6/library/stdtypes.html#frozenset). Since frozensets are sets, the
item order does not matter; for example, a query for the itemset {'Onion', 'Eggs'} selects the same rows as a
query for {'Eggs', 'Onion'}.
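The order-independence of frozensets can be demonstrated without pandas. In the sketch below, the `records` list is a plain-Python stand-in for rows of the result table (support values taken from the apriori output shown earlier); it is an illustration, not the pandas query itself.

```python
# Stand-in for rows of the result table: (support, itemset) pairs.
records = [
    (0.8, frozenset({"Eggs"})),
    (0.6, frozenset({"Onion", "Eggs"})),
    (0.6, frozenset({"Kidney Beans", "Milk"})),
]

# Element order is irrelevant when comparing sets.
query_a = {"Onion", "Eggs"}
query_b = {"Eggs", "Onion"}
match_a = [r for r in records if r[1] == query_a]
match_b = [r for r in records if r[1] == query_b]
print(match_a == match_b)

# frozensets are hashable, so they can serve as dictionary keys ...
support_by_itemset = {itemset: s for s, itemset in records}
print(support_by_itemset[frozenset(["Eggs", "Onion"])])

# ... and immutable: they have no add() method.
print(hasattr(frozenset(), "add"))
```

Note that a frozenset compares equal to a regular set with the same elements, which is why a set literal can be used as the query value.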
Parameters
df : pandas DataFrame
pandas DataFrame in the one-hot encoded format. Also supports DataFrames with sparse data; for more info,
please see https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#sparse-data-structures.
Please note that the old pandas SparseDataFrame format is no longer supported in mlxtend >= 0.17.2.
min_support : float
A float between 0 and 1 for the minimum support of the itemsets returned. The support is computed as
the fraction transactions_where_item(s)_occur / total_transactions.
use_colnames : bool
If True, uses the DataFrame's column names in the returned DataFrame instead of column indices.
max_len : int (default: None)
Maximum length of the itemsets generated. If None (default), all possible itemset lengths (under the
apriori condition) are evaluated.
low_memory : bool
If True, uses an iterator to search for combinations above min_support. Note that low_memory=True
should only be used for large datasets when memory resources are limited, because this
implementation is approx. 3-6x slower than the default.
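To make the roles of min_support and max_len concrete, here is a minimal pure-Python sketch of a level-wise frequent itemset search. It is not mlxtend's implementation; the function name `apriori_sketch` is hypothetical, and the transactions are read off the encoded example table in the user guide.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support=0.5, max_len=None):
    """Level-wise frequent itemset search; returns {frozenset: support}."""
    n = len(transactions)
    tsets = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in tsets if itemset <= t) / n

    # Level 1: frequent single items.
    items = {item for t in tsets for item in t}
    current = {}
    for item in items:
        single = frozenset([item])
        s = support(single)
        if s >= min_support:
            current[single] = s
    frequent = dict(current)

    k = 1
    while current and (max_len is None or k < max_len):
        k += 1
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        prev = list(current)
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
    return frequent


# Transactions reconstructed from the True entries of the encoded table.
transactions = [
    ["Eggs", "Kidney Beans", "Milk", "Nutmeg", "Onion", "Yogurt"],
    ["Dill", "Eggs", "Kidney Beans", "Nutmeg", "Onion", "Yogurt"],
    ["Apple", "Eggs", "Kidney Beans", "Milk"],
    ["Corn", "Kidney Beans", "Milk", "Unicorn", "Yogurt"],
    ["Corn", "Eggs", "Ice cream", "Kidney Beans", "Onion"],
]

result = apriori_sketch(transactions, min_support=0.6)
capped = apriori_sketch(transactions, min_support=0.6, max_len=2)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(s, sorted(itemset))
```

The prune step is what the "apriori condition" refers to: a k-itemset can only be frequent if all of its (k-1)-subsets are frequent, so most candidates are discarded before their support is counted.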
Returns
pandas DataFrame with columns ['support', 'itemsets'] of all itemsets whose support is >= min_support and
whose length is <= max_len (if max_len is not None). Each itemset in the 'itemsets' column is of type
frozenset, which is a Python built-in type that behaves similarly to sets except that it is immutable (for
more info, see https://docs.python.org/3.6/library/stdtypes.html#frozenset).
Examples