MGS_616__Unsupervised_Learning
Unsupervised Learning
Kyle Hunt
Unsupervised learning
Unsupervised learning: we do not give the model labels for the data, but instead
want the model to learn patterns and structure in the data
We have only a set of features x1, x2, ..., xp measured on n observations.
i.e., we do not have an associated response variable, Y
Clustering
Clustering algorithms group raw, unlabeled data based on structures or patterns in
the information
Elements in a group or cluster should be as similar as possible, and points in
different groups should be as dissimilar as possible
Clustering algorithms can be categorized into a few types
Exclusive: stipulates that a data point can exist only in one cluster (k-means)
Overlapping: allows data points to belong to multiple clusters with separate degrees
of membership (fuzzy k-means)
Hierarchical: data is clustered in a top-down or bottom-up fashion where records are
grouped based on similarity
Association rules
Collaborative filtering
If green is purchased, white and red are always purchased. If orange is purchased, white
is always purchased...
Association Rules
The idea behind association rules is to examine all possible rules between items in
an if–then format, and select only those that are most likely to be indicators of
true dependence
We use the term antecedent to describe the IF part, and consequent to
describe the THEN part
In association analysis, the antecedent and consequent are sets of items (called
itemsets) that are disjoint (do not have any items in common)
Note that itemsets are not records of what people buy; they are simply possible
combinations of items, including single items
Returning to the faceplate purchase example, one possible rule is “if red and white,
then green”
1 What is the antecedent itemset?
2 What is the consequent itemset?
Another possible rule is “if yellow and green, then orange and black”
1 What is the antecedent itemset?
2 What is the consequent itemset?
The first step in association rules is to generate all the rules that would be
candidates for indicating associations between items
Ideally, we might want to look at all possible combinations of items in a database
with p distinct items (in the phone faceplate example, p = 6)
This means finding all combinations of single items, pairs of items, triplets of
items, and so on, in the transactions database
Generating all these combinations requires a long computation time that grows
exponentially in p
A practical solution is to consider only combinations that occur with higher
frequency in the database. These are called frequent itemsets
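A minimal brute-force sketch of this idea on a made-up transactions database (the items and threshold are illustrative; real implementations such as Apriori prune the search rather than enumerating everything):

```python
from itertools import combinations

# Toy transactions database: each row is the set of items in one transaction.
transactions = [
    {"red", "white", "green"},
    {"white", "orange"},
    {"red", "white", "green"},
    {"red", "white"},
    {"orange", "white"},
]

min_support_count = 2  # keep only itemsets appearing in at least 2 transactions

# Enumerate candidate itemsets of size 1..3 and count the transactions containing each.
items = sorted(set().union(*transactions))
frequent = {}
for size in range(1, 4):
    for candidate in combinations(items, size):
        count = sum(1 for t in transactions if set(candidate) <= t)
        if count >= min_support_count:
            frequent[candidate] = count

for itemset, count in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(itemset, count)
```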
Support example
From the abundance of rules generated, the goal is to find only the rules that
indicate a strong dependence between the antecedent and consequent itemsets
To measure the strength of association implied by a rule, we use the measures of
confidence and lift ratio
Confidence: expresses the degree of uncertainty about the if–then rule
Compares the co-occurrence of the antecedent and consequent itemsets in the
database to the occurrence of the antecedent itemsets
Confidence is defined as the ratio of the number of transactions that include all
antecedent and consequent itemsets (i.e., the support) to the number of
transactions that include all the antecedent itemsets:
$$\text{Confidence} = \frac{\text{support(antecedent and consequent)}}{\text{support(antecedent)}}$$
A better way to judge the strength of an association rule: compare the confidence
of the rule with a benchmark value
For this benchmark value, we assume that the occurrences of the antecedent and
consequent itemsets are independent of each other
Under the assumption of independence, the support for a rule would be the product of
the supports of the antecedent and consequent itemsets (as proportions of all
transactions); the benchmark confidence is therefore simply the proportion of
transactions that contain the consequent itemset
$$\text{Lift ratio} = \frac{\text{confidence}}{\text{benchmark confidence}}$$
A lift ratio greater than 1.0 suggests that there is some usefulness to the rule
In other words, the level of association between the antecedent and consequent
itemsets is higher than would be expected if they were independent
The larger the lift ratio, the greater the strength of the association
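A minimal sketch, on a made-up transactions database, of computing support, confidence, and lift for a single rule:

```python
# Toy transactions database (illustrative item names).
transactions = [
    {"red", "white", "green"},
    {"white", "orange"},
    {"red", "white", "green"},
    {"red", "white"},
    {"orange", "white"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t) / n

antecedent = {"red", "white"}
consequent = {"green"}

conf = support(antecedent | consequent) / support(antecedent)
benchmark_conf = support(consequent)      # confidence expected under independence
lift = conf / benchmark_conf

print(f"confidence = {conf:.2f}, benchmark = {benchmark_conf:.2f}, lift = {lift:.2f}")
```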
Data Formatting
Transaction data are usually displayed in one of two formats
1 A transactions database (with each row representing a list of items purchased in a
single transaction)
2 A binary incidence matrix in which columns are items, rows again represent
transactions, and each cell has either a 1 or a 0, indicating the presence or absence
of an item in the transaction
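A small sketch converting between the two formats (item names are made up); each row of the incidence matrix is one transaction:

```python
import pandas as pd

# Format 1: transactions database -- each row lists the items in one transaction.
transactions = [
    ["red", "white", "green"],
    ["white", "orange"],
    ["red", "white"],
]

# Format 2: binary incidence matrix -- rows are transactions, columns are items,
# and each cell is 1 if the item appears in the transaction, else 0.
items = sorted({item for t in transactions for item in t})
incidence = pd.DataFrame(
    [[1 if item in t else 0 for item in items] for t in transactions],
    columns=items,
)
print(incidence)
```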
Now suppose that we want association rules between items for this database that
have a support count of at least 2
In other words, rules based on items that were purchased together in at least 20%
of the transactions
Rule selection
Process of selecting strong rules is based on generating all association rules that
meet stipulated support and confidence requirements
This is done in two stages:
1 The first stage, described earlier, consists of finding all “frequent” itemsets (e.g.,
using the Apriori algorithm)
2 In the second stage, we use the frequent itemsets to generate association rules that
meet a user-defined confidence score
The first step removes item combinations that are rare in the database, and
the second stage selects only those combinations (rules) with high
confidence
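One concrete way to run these two stages is with the third-party mlxtend package (not mentioned in the slides; this is a sketch assuming its long-standing apriori/association_rules interface and a boolean incidence matrix):

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Binary incidence matrix (boolean), as in the data-formatting section (toy data).
incidence = pd.DataFrame(
    {
        "red":    [1, 0, 1, 1, 0],
        "white":  [1, 1, 1, 1, 1],
        "green":  [1, 0, 1, 0, 0],
        "orange": [0, 1, 0, 0, 1],
    }
).astype(bool)

# Stage 1: frequent itemsets meeting the minimum support (here 20% of transactions).
frequent_itemsets = apriori(incidence, min_support=0.2, use_colnames=True)

# Stage 2: rules derived from those itemsets that meet the minimum confidence (here 70%).
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```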
Rule selection
For example, from the itemset {red, white, green} in the faceplate data, we can
derive the following association rules, confidence values, and lift values:
If the desired minimum confidence is 70%, we would report only the second and
third rules
When generating association rules, we (i.e., the analyst) specify the minimum
support and minimum confidence level
Interpreting results
The lift ratio indicates how efficient the rule is in finding consequents, compared to
random selection
In other words, how "strong" is the dependency between the antecedent and
consequent
A very efficient rule is preferred to an inefficient rule
But we must still consider support; a very efficient rule that has very low support
may not be as desirable as a less efficient rule with much greater support (why?)
Another example
Here we examine associations among transactions involving various types of books
The full database includes 2000 transactions, and there are 11 different types of
books
Another example
If we develop association rules for this dataset using a minimum support of 5% and
a minimum confidence of 50%:
Collaborative Filtering
Recommender systems
Data requirements
For n users (U1, U2, ..., Un) and p items (I1, I2, ..., Ip), we can think of the data as
an n × p matrix of n rows (users) by p columns (items)
Typically not every user purchases or rates every item, and therefore a purchase
matrix will have many zeros (sparse), and a ratings matrix will have many missing
values
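A small sketch of arranging raw (user, item, rating) records into this n × p matrix with pandas (names and ratings are made up); unrated items show up as NaN:

```python
import pandas as pd

# Long-format ratings: one row per (user, item, rating) triple.
ratings = pd.DataFrame(
    {
        "user":   ["U1", "U1", "U2", "U3", "U3"],
        "item":   ["I1", "I2", "I1", "I2", "I3"],
        "rating": [5, 3, 4, 2, 5],
    }
)

# n x p user-item matrix; missing entries become NaN (the matrix is typically sparse).
user_item = ratings.pivot(index="user", columns="item", values="rating")
print(user_item)
```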
Step 1 requires choosing a distance (or proximity) metric to measure the distance
between our user and the other users
Once the distances are computed, we can use a threshold on the distance or on the
number of required neighbors to determine the nearest neighbors to be used in
Step 2
A nearest-neighbors approach measures the distance of our user to each of the
other users in the database, similar to k-NN
A popular proximity measure between two users is the Pearson correlation between
their ratings
We denote the ratings of items $I_1, \ldots, I_p$ by user $U_1$ as $r_{1,1}, r_{1,2}, \ldots, r_{1,p}$ and their
average by $\bar{r}_1$
Similarly, denote the ratings by user $U_2$ as $r_{2,1}, r_{2,2}, \ldots, r_{2,p}$ with average $\bar{r}_2$
The correlation proximity between the two users is defined as follows:

$$\text{Corr}(U_1, U_2) = \frac{\sum (r_{1,i} - \bar{r}_1)(r_{2,i} - \bar{r}_2)}{\sqrt{\sum (r_{1,i} - \bar{r}_1)^2}\;\sqrt{\sum (r_{2,i} - \bar{r}_2)^2}}$$
where the summations are over the items co-rated by both users
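A minimal sketch of this correlation proximity, restricted to co-rated items (the ratings below are made up; the means here are also taken over the co-rated items, which is one common convention):

```python
import numpy as np

def user_correlation(r1, r2):
    """Pearson correlation between two users' ratings, using only co-rated items.
    Missing ratings are represented by np.nan."""
    r1, r2 = np.asarray(r1, dtype=float), np.asarray(r2, dtype=float)
    both = ~np.isnan(r1) & ~np.isnan(r2)          # items rated by both users
    d1 = r1[both] - r1[both].mean()
    d2 = r2[both] - r2[both].mean()
    return (d1 * d2).sum() / np.sqrt((d1 ** 2).sum() * (d2 ** 2).sum())

# Illustrative ratings for five items; np.nan marks items a user did not rate.
u1 = [5, 4, np.nan, 2, 1]
u2 = [4, np.nan, 3, 1, 2]
print(user_correlation(u1, u2))
```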
In Step 2, we look only at the k-nearest users, and among all the items that they
rated/purchased which our user did not, we choose the best one and recommend it
to our user
What is the best one?
For binary purchase data, it is the item most purchased
For rating data, it could be the highest rated, most rated, or a weighting of the two
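A compact sketch of Step 2 for binary purchase data: find the k most similar users (cosine similarity is used here purely for illustration) and recommend the item they purchased most that our user has not:

```python
import numpy as np

# Binary purchase matrix: rows are users, columns are items (illustrative data).
purchases = np.array([
    [1, 0, 1, 0, 0],   # target user (row 0)
    [1, 1, 1, 0, 0],
    [1, 0, 1, 1, 0],
    [0, 1, 0, 1, 1],
])

target, others = purchases[0], purchases[1:]

# Similarity of the target user to every other user (cosine similarity here).
sims = others @ target / (np.linalg.norm(others, axis=1) * np.linalg.norm(target))

k = 2
neighbors = others[np.argsort(sims)[::-1][:k]]          # k most similar users

# Among items the target has NOT purchased, count purchases by the neighbors
# and recommend the most purchased one.
candidate_counts = neighbors.sum(axis=0) * (target == 0)
print("recommend item index:", int(np.argmax(candidate_counts)))
```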
When the number of users is much larger than the number of items, it is
computationally cheaper (and faster) to find similar items rather than similar users
When a user expresses interest in a particular item, the item-based collaborative
filtering algorithm has two steps:
Find the items that were co-rated or co-purchased (by any user) together with the item
of interest
Recommend the most popular or correlated item(s) among the similar items
Similarity is now computed between items, instead of users
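A brief sketch of the item-based idea on a toy ratings matrix (the data and the use of correlation as the similarity are illustrative):

```python
import numpy as np

# Ratings matrix: rows are users, columns are items; np.nan = not rated (toy data).
R = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, np.nan, 3.0, 2.0],
    [1.0, 2.0, 5.0, np.nan],
])

def item_similarity(a, b):
    """Correlation between two item columns over the users who rated both."""
    both = ~np.isnan(a) & ~np.isnan(b)
    if both.sum() < 2:
        return np.nan                      # not enough co-ratings to compare
    return np.corrcoef(a[both], b[both])[0, 1]

item_of_interest = 0
sims = [item_similarity(R[:, item_of_interest], R[:, j]) if j != item_of_interest else np.nan
        for j in range(R.shape[1])]
print("most similar item:", int(np.nanargmax(sims)))
```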
The 0 correlation is due to the two opposite ratings of movie 5 by the users who
also rated movie 1 (5 stars vs. 1 star)
Association rules and collaborative filtering are both unsupervised methods used
for generating recommendations
But, they differ in several ways:
Frequent itemsets vs. personalized recommendations
Transactional data vs. user data
Binary data vs. ratings data
Two or more items
In association rules, the antecedent and consequent can each include one or more
items ("IF helmet and cleats THEN shoulder pads and mouth guard")
Thus, a recommendation might be a bundle of items (e.g., if you buy a helmet and
cleats, you will receive 50% off the shoulder pads and mouth guards)
In collaborative filtering, similarity is measured between pairs of items or pairs of
users
Thus, a recommendation will be either for a single item or for multiple items which
do not necessarily relate to each other (e.g., the top two most popular items
purchased by people like you, which you haven’t purchased)
Clustering
Overview
Goal is to segment data into sets of homogeneous clusters of records for the purpose
of generating insights
Used in a variety of business applications, from customized marketing to industry
analysis
Market segmentation: customers are segmented based on demographic and
transaction history information, and a marketing strategy is tailored for each
segment
Market structure analysis: identifying groups of similar products according to
competitive measures of similarity
We will cover two popular clustering approaches: hierarchical clustering and
k-means clustering
Examples
Non-hierarchical methods: using a prespecified number of clusters, each record is assigned to one of the clusters
These methods are generally less computationally intensive and are therefore
preferred with very large datasets
k-means is a very popular non-hierarchical model
Need to define two types of distances: distance between two records and
distance between two clusters
In both cases, there is a variety of metrics that can be used
Denote by dij a distance measure between records i and j. For record i we have
the vector of p measurements (xi1, xi2, ..., xip), while for record j we have the
vector of measurements (xj1, xj2, ..., xjp)
Just as we saw with k-NN, the most popular distance measure in clustering is
Euclidean distance
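In this notation, the Euclidean distance between records i and j (the standard definition; the corresponding slide is graphical) is:

$$d_{ij} = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ip} - x_{jp})^2}$$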
Hunt 67/101
| School of Management
Hunt 68/101
| School of Management
Hunt 69/101
| School of Management
We know that Euclidean distance is highly influenced by the scale of each variable
Variables with larger scales have a much greater influence in the final distance
calculation
It’s most typical to normalize (z-score normalization) continuous measurements
before computing the Euclidean distance
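A minimal sketch of this normalize-then-measure step (the records below are made up; scipy's pdist is one convenient way to get all pairwise distances):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Two measurements on very different scales (illustrative records).
X = np.array([
    [30_000.0, 2.1],
    [45_000.0, 1.8],
    [28_000.0, 3.5],
])

# z-score normalization: subtract each column's mean and divide by its standard
# deviation (population std here; the sample std, ddof=1, is also common).
Xz = (X - X.mean(axis=0)) / X.std(axis=0)

# Pairwise Euclidean distances between the normalized records, as a symmetric matrix.
D = squareform(pdist(Xz, metric="euclidean"))
print(np.round(D, 2))
```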
An example
Consider the first two companies {Arizona, Boston} as cluster A, and the next
three companies {Central, Commonwealth, Consolidated} as cluster B
An example
What is the centroid distance between clusters?
The centroid of cluster A is:

$$\left[\frac{0.0459 - 1.0778}{2}, \; \frac{-0.8537 + 0.8133}{2}\right] = [-0.516, -0.020]$$

The centroid of cluster B is:

$$\left[\frac{0.0839 - 0.7017 - 1.5814}{3}, \; \frac{-0.0804 - 0.7242 + 1.6926}{3}\right] = [-0.733, 0.296]$$

The distance between the centroids is then:

$$\sqrt{(-0.516 + 0.733)^2 + (-0.020 - 0.296)^2} = 0.38$$
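A quick numeric check of the arithmetic above (the standardized values are the ones given in the example):

```python
import numpy as np

# Standardized measurements from the example.
cluster_a = np.array([[0.0459, -0.8537], [-1.0778, 0.8133]])                      # cluster A: {Arizona, Boston}
cluster_b = np.array([[0.0839, -0.0804], [-0.7017, -0.7242], [-1.5814, 1.6926]])  # cluster B: {Central, Commonwealth, Consolidated}

centroid_a = cluster_a.mean(axis=0)   # -> [-0.516, -0.020]
centroid_b = cluster_b.mean(axis=0)   # -> [-0.733,  0.296]

# Euclidean distance between the two centroids (~0.38).
print(round(float(np.linalg.norm(centroid_a - centroid_b)), 2))
```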
1 Start with n clusters (i.e., each data point is its own cluster)
2 The two closest records are merged into one cluster
At every step, combine the two clusters with the smallest distance between them; this
means that either single records are added to existing clusters or two existing
clusters are combined
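A minimal sketch of this procedure with scipy, reusing the standardized values from the centroid example and pairing them with the five company names in the order they were listed there (that pairing is assumed here just for illustration); method='single' corresponds to the single linkage discussed below:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Illustrative records (already z-score normalized, two measurements each).
X = np.array([
    [0.0459, -0.8537],
    [-1.0778, 0.8133],
    [0.0839, -0.0804],
    [-0.7017, -0.7242],
    [-1.5814, 1.6926],
])
labels = ["Arizona", "Boston", "Central", "Commonwealth", "Consolidated"]

# Agglomerative clustering: start with n singleton clusters and repeatedly merge
# the two closest clusters (single linkage = minimum pairwise distance).
Z = linkage(X, method="single", metric="euclidean")

dendrogram(Z, labels=labels)
plt.show()
```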
What two observations would we join in the first iteration? Arizona and
Commonwealth
Next, we would recalculate a 4 × 4 distance matrix that would have the distances
between the remaining four clusters: {Arizona, Commonwealth}, {Boston},
{Central}, and {NY}
We then use these distances to identify the two clusters which are closest, and
merge those two clusters
This process continues until one cluster remains
In single linkage clustering, the distance measure that we use is the minimum
distance
Back to our example, we would create the 4 × 4 distance matrix as follows:
Setting the cutoff distance to 2.7 on this dendrogram results in six clusters:
{NY}, {Nevada}, {San Diego}, {Idaho, Puget}, {Central}, {Others}
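Cutting a dendrogram at a chosen distance is one call in scipy; a stand-in random matrix is used below just so the snippet runs on its own (with the actual utilities linkage matrix, the 2.7 cutoff is what produces the six clusters listed above):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Stand-in data; in practice Z would be the linkage matrix computed from the
# (normalized) utilities data.
X = np.random.default_rng(1).normal(size=(10, 2))
Z = linkage(X, method="single", metric="euclidean")

# Cut the dendrogram at distance 2.7; each record gets the label of the cluster
# it falls into below the cutoff.
membership = fcluster(Z, t=2.7, criterion="distance")
print(membership)
```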
"Validating" clusters
Hierarchical clustering is appealing in that it does not require specification of the
number of clusters, and in that sense is purely data-driven
Dendrograms make this technique easy to understand and interpret
Some limitations:
Hierarchical clustering requires the computation and storage of an n × n distance
matrix; can be expensive and slow
The model only uses one pass through the data; records that are allocated incorrectly
early in the process cannot be reallocated subsequently
Hierarchical clustering also tends to have low stability; dropping a few records can
lead to a different solution
k-means clustering
k-means
We see that Boston is closer to cluster B, and that Central and Commonwealth are
each closer to cluster A
We therefore move each of these records to the other cluster and obtain A =
{Arizona, Central, Commonwealth} and B = {Consolidated, Boston}
Recalculating the centroids gives: x̄A = [−0.191, −0.553] and x̄B = [−1.33, 1.253]
Hunt 99/101
| School of Management
At this point we stop, because each record is allocated to its closest cluster
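A compact from-scratch sketch of this assign/recompute loop (the data reuse the standardized values from the hierarchical example; in practice sklearn.cluster.KMeans with multiple random starts would normally be used):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: assign records to the nearest centroid, recompute centroids,
    stop when no record changes cluster."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # start from k random records
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Assignment step: each record goes to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                                              # assignments stable: stop
        labels = new_labels
        # Update step: each centroid becomes the mean of the records assigned to it.
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels, centroids

# Illustrative standardized data (the five utilities from the earlier example).
X = np.array([
    [0.0459, -0.8537],
    [-1.0778, 0.8133],
    [0.0839, -0.0804],
    [-0.7017, -0.7242],
    [-1.5814, 1.6926],
])
labels, centroids = kmeans(X, k=2)
print(labels)
print(np.round(centroids, 3))
```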
Selecting k