Unit IV Recommender System
HIERARCHICAL CLUSTERING
Hierarchical clustering is a clustering algorithm that uses the following steps to develop clusters:
1. Start with each data point in its own cluster.
2. Find the data points (or clusters) with the shortest distance between them (using an appropriate
distance measure) and merge them to form a cluster.
3. Repeat step 2 until all data points are merged together to form a single cluster.
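The merging procedure can be illustrated with a minimal sketch. It assumes Euclidean distance and single linkage (the distance between two clusters is the shortest distance between any pair of their points); both are choices made here for illustration, and the function names and sample points are hypothetical.

```python
from itertools import combinations
from math import dist


def single_linkage(a, b):
    """Shortest Euclidean distance between any point in a and any point in b."""
    return min(dist(p, q) for p in a for q in b)


def agglomerate(points):
    """Repeatedly merge the two closest clusters; return the final cluster and merge history."""
    # Every data point starts as its own cluster.
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the shortest distance between them.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        # Replace the pair with the merged cluster, then repeat.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0], merges


# Hypothetical sample data: two obvious groups of two points each.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5)]
final_cluster, merges = agglomerate(points)
```

With four points there are three merges: the two nearest points are merged first, and the last merge joins the two remaining groups into a single cluster.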
Recommender Systems
Recommendation systems are a set of algorithms that recommend the most relevant items to
users based on their predicted preferences.
The following three algorithms are widely used for building recommendation systems:
1. Association Rules
2. Collaborative Filtering
3. Matrix Factorization
Datasets
The following two publicly available datasets are used to build recommendations.
1. groceries.csv: This dataset contains transactions of a grocery store and can be downloaded from
http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml13/groceries.csv.
2. MovieLens: This dataset contains 20000263 ratings and 465564 tag applications across 27278 movies.
The dataset can be downloaded from https://grouplens.org/datasets/movielens
Association rule mining considers all possible combinations of items in the previous baskets and computes
various measures such as support, confidence, and lift to identify rules with stronger associations.
Evaluating every combination becomes expensive when a retailer sells many items. One solution is to
eliminate items that cannot possibly be part of any frequent itemset. One algorithm that does this is
the apriori algorithm, proposed by Agrawal and Srikant (1994). The rules generated are represented as
{diaper} → {beer}
which means that customers who purchased diapers also purchased beer in the same basket.
{diaper, beer} together is called an itemset. {diaper} is called the antecedent and {beer} is the consequent.
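The pruning idea behind apriori can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the implementation of any particular library: the function name frequent_itemsets, the min_support threshold argument, and the toy baskets are all assumptions made for this example. Larger candidate itemsets are built only from smaller itemsets that already meet the support threshold.

```python
from itertools import combinations


def frequent_itemsets(baskets, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    baskets = [frozenset(b) for b in baskets]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item in the itemset.
        return sum(itemset <= basket for basket in baskets) / n

    def keep_frequent(candidates):
        scored = {c: support(c) for c in candidates}
        return {c: s for c, s in scored.items() if s >= min_support}

    # Size-1 itemsets: one candidate per distinct item.
    current = keep_frequent({frozenset([i]) for basket in baskets for i in basket})
    result = dict(current)
    k = 2
    while current:
        # Build size-k candidates only from frequent size-(k-1) itemsets.
        candidates = {x | y for x, y in combinations(current, 2) if len(x | y) == k}
        # Prune candidates that have any infrequent size-(k-1) subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = keep_frequent(candidates)
        result.update(current)
        k += 1
    return result


# Hypothetical toy baskets, for illustration only.
baskets = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
]
result = frequent_itemsets(baskets, min_support=0.5)
```

On these toy baskets, every single item meets the 0.5 threshold, but {diaper, beer} is the only pair that does, so no three-item candidate is ever generated.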
Metrics
Concepts such as support, confidence, and lift are used to generate association rules.
Support indicates how frequently the items appear together in baskets, relative to all
the baskets being considered.
A lift value of 1 indicates that the items are independent. A lift value less than 1 implies that the
products are substitutes (purchasing one decreases the probability of purchasing the other), and a value
greater than 1 implies that purchasing one product increases the probability of purchasing the other.
A lift value greater than 1 is a necessary condition for creating an association rule.
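The three metrics can be computed directly from a list of baskets. The sketch below uses the standard definitions: support is the fraction of baskets containing all items of an itemset, confidence of a rule is the support of the full itemset divided by the support of the antecedent, and lift is the confidence divided by the support of the consequent. The function names and the toy baskets are hypothetical.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(b) for b in baskets) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)


def lift(baskets, antecedent, consequent):
    """confidence(antecedent -> consequent) / support(consequent)."""
    return confidence(baskets, antecedent, consequent) / support(baskets, consequent)


# Hypothetical toy baskets, for illustration only.
baskets = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
]

s = support(baskets, {"diaper", "beer"})        # 2 of 4 baskets
c = confidence(baskets, {"diaper"}, {"beer"})   # 0.5 / 0.75
l = lift(baskets, {"diaper"}, {"beer"})         # (0.5 / 0.75) / 0.5
```

Here the lift of {diaper} → {beer} is greater than 1, so in this toy data buying diapers is associated with a higher chance of buying beer.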
The following code can be used for finding the size (shape or dimension) of the matrix.
one_hot_txns_df.shape
(9835, 171)
Generating Association Rules
The apriori algorithm is used to generate itemsets. The total number of itemsets depends on the
number of items that exist across all transactions.
len(one_hot_txns_df.columns)
171
The code gives an output of 171; that is, as mentioned in the previous section, there are 171 items. For
itemsets containing 2 items each, the total number of itemsets is 171C2, that is, 14535.
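The count can be checked with a short calculation. The sketch below also computes the number of non-empty itemsets of any size, which is 2^171 − 1 and shows why itemsets that cannot be frequent need to be eliminated early. The variable names are chosen here for illustration.

```python
from math import comb

n_items = 171

# Number of itemsets containing exactly 2 items: 171C2.
pairs = comb(n_items, 2)

# Number of non-empty itemsets of any size: 2^171 - 1.
all_itemsets = 2 ** n_items - 1

print(pairs)  # 14535
```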
The apriori algorithm takes the following parameters:
1. df: pandas DataFrame − A DataFrame in one-hot-encoded format.
2. min_support: float − A float between 0 and 1 for the minimum support of the itemsets returned.
Default is 0.5.
3. use_colnames: boolean − If true, uses the DataFrame's column names in the returned DataFrame instead of
column indices.
The following commands can be used for setting minimum support.