Association Rules
Suppose you were given data from a music community site. For each user you may have a log
of every artist he or she has downloaded. You may even have demographic
information on the user (such as age, sex, location, occupation, and interests). Your objective is
to build a system that recommends new music to users in this community.
From the available information, it is usually quite easy to determine the support for (i.e., the
relative frequency of listening to) individual artists, as well as the joint support for pairs (or
larger groupings) of artists. All you have to do is count the incidences (0/1) across all members
of your network and divide those counts by the number of members. From the
support we can calculate the confidence and the lift.
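As a minimal sketch of these three quantities, the block below uses a made-up set of five listeners (not the data analysed in this section). It assumes the usual definitions: the confidence of a rule A -> B is support(A and B) / support(A), and the lift is that confidence divided by support(B). The function names are illustrative only.

```python
# Toy listening histories (hypothetical data, one set of artists per user)
baskets = [
    {"radiohead", "coldplay"},
    {"radiohead", "muse"},
    {"radiohead", "coldplay", "muse"},
    {"the beatles"},
    {"coldplay"},
]

def support(items, baskets):
    """Fraction of users whose history contains every artist in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimated P(consequent | antecedent): joint support / antecedent support."""
    joint = support(set(antecedent) | set(consequent), baskets)
    return joint / support(antecedent, baskets)

def lift(antecedent, consequent, baskets):
    """Confidence relative to the consequent's baseline support."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)

print(support({"radiohead"}, baskets))                   # 3 of 5 users
print(support({"radiohead", "coldplay"}, baskets))       # 2 of 5 users
print(confidence({"radiohead"}, {"coldplay"}, baskets))  # 2 of the 3 radiohead listeners
print(lift({"radiohead"}, {"coldplay"}, baskets))        # above 1: positive association
```

A lift above 1 means that listeners of the antecedent play the consequent more often than the average user does; a lift near 1 means the rule adds little beyond overall popularity.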
For illustration we use a large data set with close to 300,000 records of song (artist) selections
made by 15,000 users. Even larger data sets are available on the web (see, e.g., Celma (2010),
and the data sets on his web page http://ocelma.net/MusicRecommendationDataset).
Each row of our data set contains the
name of the artist the user has listened to. Our first user, a woman from Germany, has listened
to 16 artists, resulting in the first 16 rows of the data matrix. The two demographic variables
listed here (gender and country) are not used in our analysis. However, it would be
straightforward to stratify the following market basket analysis on gender and country of origin,
and investigate whether findings change (we recommend that you do this as an exercise).
The first thing we need to accomplish is to transform the data as given here into an incidence
matrix in which each listener is a row and each artist is a column, with 0s and 1s indicating
whether or not he or she has played a certain artist.
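The transformation can be sketched on a miniature, made-up log in the same long format (one row per user and artist). This sketch uses `pandas.crosstab` as one possible route; it is an assumption for illustration and not necessarily the route taken in the program for this section.

```python
import pandas as pd

# Hypothetical miniature of the long-format log: one row per (user, artist) pair
log = pd.DataFrame({
    "user":   [1, 1, 2, 3, 3, 3],
    "artist": ["radiohead", "muse", "radiohead", "muse", "coldplay", "radiohead"],
})

# Cross-tabulate users against artists, then clip to 0/1 in case a pair
# is logged more than once, and store the result as booleans
incidence = pd.crosstab(log["user"], log["artist"]).clip(upper=1).astype(bool)
print(incidence)

# The column means of a 0/1 matrix are the per-artist supports
artist_support = incidence.mean()
print(artist_support)
```

Each row of `incidence` is one listener and each column one artist, so joint supports can be read off by multiplying columns and averaging.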
The last step in the program involves the construction of the association rules. We look for
artists (or groups of artists) whose support is larger than 0.01 (1%) and whose rules pointing
to another artist have confidence larger than 0.50 (50%). The support requirement rules out rare artists.
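To see how the two thresholds interact, here is a brute-force sketch over all ordered pairs of artists in a made-up data set of six listeners. It is not the Apriori algorithm, only an illustration of the filter. The thresholds are raised to suit the tiny example, and applying the support threshold to the pair itself is an assumption about how the filter is meant.

```python
from itertools import permutations

# Toy listening histories (hypothetical data)
baskets = [
    {"radiohead", "coldplay"},
    {"radiohead", "coldplay", "muse"},
    {"radiohead", "muse"},
    {"coldplay", "radiohead"},
    {"the beatles", "radiohead"},
    {"muse"},
]

# Hypothetical thresholds for six users; the text uses 0.01 and 0.50
MIN_SUPPORT = 0.3
MIN_CONFIDENCE = 0.5

def support(items):
    """Fraction of users whose history contains every artist in `items`."""
    return sum(items <= basket for basket in baskets) / len(baskets)

artists = sorted(set().union(*baskets))
rules = []
for a, b in permutations(artists, 2):
    joint = support({a, b})
    if joint <= MIN_SUPPORT:
        continue  # the pair is too rare to be trusted
    conf = joint / support({a})
    if conf > MIN_CONFIDENCE:
        rules.append((a, b, joint, conf))

for a, b, joint, conf in rules:
    print(f"{a} -> {b}: support={joint:.2f}, confidence={conf:.2f}")
```

In this toy data the only listener of the beatles also plays radiohead, so that rule has perfect confidence, yet it is dropped because a single listener falls below the support threshold.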
1 of 8 2020/8/25, 12:30 PM
Statistical Machine Learning with Python Week #2 file:///Users/Vince/Google Drive/MCUT/Indonesia/rmd_python...
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
lastfm = pd.read_csv("lastfm.csv")
print(lastfm.head())
print(lastfm.dtypes)
## user int64
## artist object
## sex object
## country object
## dtype: object
print(lastfm.user.value_counts()[:5])
## 17681 76
## 15057 63
## 1208 55
## 19558 55
## 13424 54
## Name: user, dtype: int64
# 15,000 users
print(lastfm.user.unique().shape)
## (15000,)
print(lastfm.artist.value_counts()[:5])
## radiohead 2704
## the beatles 2668
## coldplay 2378
## red hot chili peppers 1786
## muse 1711
## Name: artist, dtype: int64
# 1,004 artists
print(lastfm.artist.unique().shape)
## (1004,)
# group by user
grouped = lastfm.groupby('user')
print(list(grouped)[:2])
## artist
## user
## 7 22
## 9 19
## 12 30
## 13 7
## 14 8
grouped = grouped['artist']
music = [list(artist) for (user, artist) in grouped]
print([x for x in music if len(x) < 3][:2])
## [['michael jackson', 'a tribe called quest'], ['bob marley & the wailers']]
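# Encode the per-user artist lists as a 0/1 incidence matrix.
# NOTE: this step is missing from the listing; it is reconstructed here as an
# assumption, based on the names `te` and `txn_binary` used below.
from mlxtend.preprocessing import TransactionEncoder
te = TransactionEncoder()
txn_binary = te.fit(music).transform(music)
print(txn_binary.shape)
## (15000, 1004)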
print(te.columns_[15:20])
df = pd.DataFrame(txn_binary, columns=te.columns_)
print(df.iloc[:5, 15:20])
# apriori
from mlxtend.frequent_patterns import apriori
import time
start = time.time()
freq_itemsets = apriori(df, min_support=0.01,
                        use_colnames=True)
end = time.time()
print(end - start)
## 25.102942943572998
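# Number of artists in each itemset.
# NOTE: this step is missing from the listing; it is assumed from the
# 'length' column that appears in the output below.
freq_itemsets['length'] = freq_itemsets['itemsets'].apply(len)
print(freq_itemsets.dtypes)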
## support float64
## itemsets object
## length int64
## dtype: object
print(freq_itemsets[(freq_itemsets['length'] == 2)
                    & (freq_itemsets['support'] >= 0.05)])
# association_rules
from mlxtend.frequent_patterns import association_rules
# confidence >= 0.5
musicrules = association_rules(freq_itemsets,
                               metric="confidence", min_threshold=0.5)
print(musicrules.head())
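Since the stated objective is to recommend new music, one possible way to use a rules table is sketched below. The table is made up and only mimics the `antecedents`, `consequents`, `confidence` and `lift` columns that `association_rules` returns; the artist names, the values, and the `recommend` helper are all hypothetical. Ranking by lift is one design choice among several.

```python
import pandas as pd

# Hypothetical rules table mimicking the column layout returned by
# mlxtend's association_rules (all values are made up for illustration)
rules = pd.DataFrame({
    "antecedents": [frozenset({"artist_a"}),
                    frozenset({"artist_b"}),
                    frozenset({"artist_a", "artist_b"})],
    "consequents": [frozenset({"artist_x"}),
                    frozenset({"artist_y"}),
                    frozenset({"artist_z"})],
    "confidence": [0.60, 0.55, 0.70],
    "lift": [3.0, 5.0, 4.0],
})

def recommend(history, rules, top_n=3):
    """Suggest artists not yet in `history`, ranked by the lift of matching rules."""
    history = set(history)
    # A rule applies when all of its antecedent artists are in the history
    applicable = rules[rules["antecedents"].apply(lambda a: a <= history)]
    ranked = applicable.sort_values("lift", ascending=False)
    suggestions = []
    for consequent in ranked["consequents"]:
        for artist in sorted(consequent):
            if artist not in history and artist not in suggestions:
                suggestions.append(artist)
    return suggestions[:top_n]

print(recommend({"artist_a"}, rules))
print(recommend({"artist_a", "artist_b"}, rules))
```

Sorting by confidence instead of lift would favour artists that are popular overall, whereas lift favours artists that are specifically associated with the user's history.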