
| School of Management

Unsupervised Learning

Kyle Hunt

MGS 616: Predictive Analytics

Hunt 1/101
| School of Management

What is unsupervised learning?

Hunt 2/101
| School of Management

From supervised to unsupervised learning

Hunt 3/101
| School of Management

A quick step back...key concepts in predictive analytics

Classification: predicting which "class" (i.e., category) a specific data instance belongs to (e.g., fraudulent or not)
Regression: predicting a continuous outcome for a specific observation (e.g., home value, spending behavior)
Supervised learning: we give the model "labels" (i.e., the target variables) for the data during training. The model tries to learn the relationship between the predictors and the target variable
Unsupervised learning: we do not give the model labels for the data, but instead want the model to learn patterns and consistencies among the data
Reinforcement learning: models learn over time by exploring a (user-defined) environment. Models are rewarded/penalized based on what they do, and therefore make better decisions over time (e.g., AlphaGo)
Hunt 4/101
| School of Management

Unsupervised learning

Unsupervised learning: we do not give the model labels for the data, but instead
want the model to learn patterns and consistencies among the data
We have only a set of features x1, x2, ..., xp measured on n observations.
i.e., we do not have an associated response variable, Y

Hunt 5/101
| School of Management

Unsupervised learning use-cases

Recommender systems (Netflix, Spotify, Amazon, etc.)
Anomaly detection (what doesn’t belong?)
Customer segmentation (targeted marketing, product development, etc.)
Preparing data for supervised learning (dimensionality reduction)

Hunt 6/101
| School of Management

Product bundling example

Hunt 7/101
| School of Management

Different unsupervised learning tasks

Clustering: a technique which groups unlabeled data based on similarities in feature space (creating subgroups of similar observations)
Recommender systems
Association rules (population level): what products are frequently purchased together?
Collaborative filtering (individual level): what movie should we recommend to the user next?

Hunt 8/101
| School of Management

Clustering

Clustering algorithms are used to process raw, unlabeled data into groups based on
structures or patterns in the information
Elements in a group or cluster should be as similar as possible, and points in
different groups should be as dissimilar as possible
Clustering algorithms can be categorized into a few types
Exclusive: stipulates that a data point can exist only in one cluster (k-means)
Overlapping: allows data points to belong to multiple clusters with separate degrees of membership (fuzzy k-means)
Hierarchical: data is clustered in a top-down or bottom-up fashion, where records are grouped based on similarity

Hunt 9/101
| School of Management

Association rules

Constitute a study of "what goes with what"
Also called market basket analysis because it originated with the study of transaction databases to determine dependencies between purchases of different items
Heavily used in retail for learning about items that are purchased together

Hunt 10/101
| School of Management

Collaborative filtering

Recommender systems provide personalized recommendations to a user based on the user’s information as well as on similar users’ information
Collaborative filtering is based on the notion of identifying relevant items for a specific user from a very large set of items (“filtering”) by considering preferences of many users (“collaboration”)

Hunt 11/101
| School of Management

Dimensionality reduction: autoencoders

Dimensionality reduction is another category of unsupervised learning (outside the scope of MGS 616)

Hunt 12/101
| School of Management

A small exercise: thinking about product bundling


Given: the following sales data, which is related to the purchase of faceplates for
iPhones.
Our goal: based on the product(s) that a customer has recently added to their online
shopping cart, recommend add-on products that are "frequently purchased together"

Hunt 13/101
| School of Management

A small exercise: thinking about product bundling

If green is purchased, white and red are always purchased. If orange is purchased, white
is always purchased...
Hunt 14/101
| School of Management

Unsupervised learning challenges

Higher risk of inaccurate results
Human intervention is needed to analyze/understand the patterns that the machine identified

Hunt 15/101
| School of Management

Association Rules

Hunt 16/101
| School of Management

What goes with what?

The availability of detailed information on transactions has led to techniques that look for associations between items in a database
Thinking about customer transactions...consider a database of records which list
all items bought by a customer on a single-purchase transaction
Managers are interested to know if certain groups of items are consistently
purchased together
They use such information for making decisions on store layouts and item
placement, promotions, catalog design, and identifying customer segments based
on buying patterns
Association rules provide information of this type in the form of “if–then”
statements

Hunt 17/101
| School of Management

Back to our exercise


Given: the following sales data, which is related to the purchase of faceplates for
iPhones.
Our goal: based on the product(s) that a customer has recently added to their online
shopping cart, recommend add-on products that are "frequently purchased together"

Hunt 18/101
| School of Management

Generating candidate rules

The idea behind association rules is to examine all possible rules between items in
an if–then format, and select only those that are most likely to be indicators of
true dependence
We use the term antecedent to describe the IF part, and consequent to
describe the THEN part
In association analysis, the antecedent and consequent are sets of items (called
itemsets) that are disjoint (do not have any items in common)
Note that itemsets are not records of what people buy; they are simply possible
combinations of items, including single items

Hunt 19/101
| School of Management

Generating candidate rules

Returning to the faceplate purchase example, one possible rule is “if red and white,
then green”
1 What is the antecedent itemset?
2 What is the consequent itemset?
Another possible rule is “if yellow and green, then orange and black”
1 What is the antecedent itemset?
2 What is the consequent itemset?

Hunt 20/101
| School of Management

Generating candidate rules

The first step in association rules is to generate all the rules that would be
candidates for indicating associations between items
Ideally, we might want to look at all possible combinations of items in a database
with p distinct items (in the phone faceplate example, p = 6)
This means finding all combinations of single items, pairs of items, triplets of
items, and so on, in the transactions database
Generating all these combinations requires a long computation time that grows
exponentially in p
A practical solution is to consider only combinations that occur with higher
frequency in the database. These are called frequent itemsets

Hunt 21/101
| School of Management

Generating candidate rules

Determining what qualifies as a frequent itemset is related to the concept of support
The support of a rule is simply the number of transactions that include both the
antecedent and consequent itemsets
It is called support because it measures the degree to which the data “support” the
validity of the rule
What constitutes a frequent itemset is therefore defined as an itemset that has a
support that exceeds a selected minimum support, determined by the user

Hunt 22/101
| School of Management

Support example

Hunt 23/101
| School of Management

The Apriori algorithm

The key idea is to:


1 Begin by generating frequent itemsets with just one item (one-itemsets)
2 Recursively generate frequent itemsets with two items (i.e., building off of the
frequent one-itemsets)
3 Continue recursively generating itemsets with three items, and so on, until we have
generated frequent itemsets of all sizes
It is easy to generate frequent itemsets with just one item; we need to count, for
each item, how many transactions in the database include the item
We drop one-itemsets that have support below the desired minimum support to
create a list of the frequent one-itemsets

Hunt 24/101
| School of Management

The Apriori algorithm

To generate frequent two-itemsets, we use the frequent one-itemsets


The reasoning is that if a certain one-itemset did not exceed the minimum support,
any larger size itemset that includes it will not exceed the minimum support
In general, generating k-itemsets uses the frequent (k-1)-itemsets that were
generated in the preceding step
Each step requires a single run through the database, and therefore the Apriori
algorithm is very fast even for a large number of unique items in a database
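A minimal Python sketch of this level-wise idea, using a small made-up transactions list (not the actual faceplate data) and a user-chosen minimum support count:

```python
# A minimal sketch of level-wise frequent-itemset generation (Apriori-style).
# The transactions and the minimum support count are illustrative only.
from itertools import combinations  # noqa: F401  (handy if you extend to rule generation)

transactions = [
    {"red", "white", "green"},
    {"white", "orange"},
    {"white", "blue"},
    {"red", "white", "orange"},
    {"red", "blue"},
    {"white", "blue"},
    {"red", "blue"},
    {"red", "white", "blue", "green"},
    {"red", "white", "blue"},
    {"yellow"},
]
min_support_count = 2  # user-defined minimum support

def support_count(itemset, transactions):
    """Number of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions)

# Level 1: frequent one-itemsets
items = {item for t in transactions for item in t}
frequent = {frozenset([i]) for i in items
            if support_count({i}, transactions) >= min_support_count}

all_frequent = {}
k = 1
while frequent:
    all_frequent.update({fs: support_count(fs, transactions) for fs in frequent})
    # Level k+1: candidates are built only from the frequent k-itemsets
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
    frequent = {c for c in candidates
                if support_count(c, transactions) >= min_support_count}
    k += 1

for itemset, count in sorted(all_frequent.items(), key=lambda x: -x[1]):
    print(set(itemset), count)
```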

Hunt 25/101
| School of Management

Selecting strong rules: confidence

From the abundance of rules generated, the goal is to find only the rules that
indicate a strong dependence between the antecedent and consequent itemsets
To measure the strength of association implied by a rule, we use the measures of
confidence and lift ratio
Confidence: expresses the degree of uncertainty about the if–then rule
Compares the co-occurrence of the antecedent and consequent itemsets in the
database to the occurrence of the antecedent itemsets

Hunt 26/101
| School of Management

Selecting strong rules: confidence

Confidence is defined as the ratio of the number of transactions that include all
antecedent and consequent itemsets (i.e., the support) to the number of
transactions that include all the antecedent itemsets:
Confidence = support(antecedent and consequent) / support(antecedent)

Suppose that a supermarket database has 100,000 point-of-sale transactions. Of these transactions, 2,000 include both orange juice and flu medication, and 800 of these include soup purchases
The association rule “IF orange juice and flu medication are purchased THEN soup is purchased” has a support of 800 transactions and a confidence of 40% (= 800/2,000)
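The arithmetic of this example can be checked directly; the snippet below only restates the numbers given above:

```python
# Working through the supermarket numbers above (a minimal arithmetic check).
n_transactions = 100_000   # point-of-sale transactions in the database
n_antecedent = 2_000       # transactions with orange juice AND flu medication
n_rule = 800               # of those, transactions that also include soup

support_count = n_rule                     # support of the rule, in transactions
support_pct = n_rule / n_transactions      # support as a proportion: 0.8%
confidence = n_rule / n_antecedent         # 800 / 2,000 = 0.40

print(f"support = {support_count} transactions ({support_pct:.1%})")
print(f"confidence = {confidence:.0%}")    # 40%
```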

Hunt 27/101
| School of Management

Selecting strong rules: lift ratio

A better way to judge the strength of an association rule: compare the confidence
of the rule with a benchmark value
For this benchmark value, we assume that the occurrences of the antecedent and consequent itemsets are independent of each other
Under the assumption of independence, the support for a rule would be

P(antecedent AND consequent) = P(antecedent) × P(consequent)

Hunt 28/101
| School of Management

Selecting strong rules: lift ratio

Under the assumption of independence, the support for a rule would be

P(antecedent AND consequent) = P(antecedent) × P(consequent)

The benchmark confidence would be:

P(antecedent) × P(consequent) / P(antecedent) = P(consequent)

In simple terms (expressed as a proportion):

Benchmark confidence = (# transactions with consequent itemset) / (# transactions in database)

Hunt 29/101
| School of Management

Selecting strong rules: lift ratio

We compare the confidence to the benchmark confidence by looking at their ratio: this is called the lift ratio of a rule
The lift ratio is the confidence of the rule divided by the benchmark confidence:

Lift ratio = confidence / benchmark confidence
A lift ratio greater than 1.0 suggests that there is some usefulness to the rule
In other words, the level of association between the antecedent and consequent
itemsets is higher than would be expected if they were independent
The larger the lift ratio, the greater the strength of the association
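As an illustration, the snippet below continues the supermarket example; since the total number of soup transactions is not given on the slides, the 5,000 figure is an assumption made only to show the calculation:

```python
# Continuing the supermarket example: the lift ratio compares the rule's
# confidence to the benchmark confidence.
n_transactions = 100_000
confidence = 800 / 2_000      # 40%, from the previous example
n_consequent = 5_000          # ASSUMED: total transactions containing soup (not given)

benchmark_confidence = n_consequent / n_transactions   # P(consequent) = 5%
lift_ratio = confidence / benchmark_confidence          # 0.40 / 0.05 = 8.0

print(f"benchmark confidence = {benchmark_confidence:.0%}")
print(f"lift ratio = {lift_ratio:.1f}")   # > 1.0 suggests the rule is useful
```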

Hunt 30/101
| School of Management

Data Formatting
Transaction data are usually displayed in one of two formats
1 A transactions database (with each row representing a list of items purchased in a
single transaction)
2 A binary incidence matrix in which columns are items, rows again represent
transactions, and each cell has either a 1 or a 0, indicating the presence or absence
of an item in the transaction
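A small pandas sketch of converting format 1 into format 2 (the transactions here are made up):

```python
# Converting a transactions database (format 1) into a binary incidence
# matrix (format 2). The transactions are illustrative only.
import pandas as pd

transactions = [
    ["red", "white", "green"],
    ["white", "orange"],
    ["white", "blue"],
    ["red", "white", "orange"],
]

# One row per transaction, one column per item, 1 = present, 0 = absent
incidence = (
    pd.DataFrame([{item: 1 for item in t} for t in transactions])
    .fillna(0)
    .astype(int)
)
print(incidence)
```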

Hunt 31/101
| School of Management

Back to our exercise

Now suppose that we want association rules between items for this database that
have a support count of at least 2
In other words, rules based on items that were purchased together in at least 20%
of the transactions
Hunt 32/101
| School of Management

Back to our exercise


By enumeration, we can see that only the itemsets listed below have a count of at
least 2
The first itemset {red} has a support of 6, because six of the transactions included
a red faceplate
Similarly, the support of the last itemset {red, white, green} shows that two
transactions included red, white, and green faceplates.

Hunt 33/101
| School of Management

Rule selection

Process of selecting strong rules is based on generating all association rules that
meet stipulated support and confidence requirements
This is done in two stages:
1 The first stage, described earlier, consists of finding all “frequent” itemsets (e.g.,
using the Apriori algorithm)
2 In the second stage, we use the frequent itemsets to generate association rules that
meet a user-defined confidence score
The first stage removes item combinations that are rare in the database, and the second stage selects only those combinations (rules) with high confidence
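One way to run this two-stage process in practice is with the mlxtend package; the sketch below assumes a binary incidence matrix like the one described earlier, and the thresholds are illustrative (exact arguments may vary by package version):

```python
# A sketch of the two-stage process using mlxtend (one common implementation).
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# `incidence` is a binary incidence matrix: rows = transactions, columns = items
incidence = pd.DataFrame(
    [[1, 1, 1, 0], [0, 1, 0, 1], [0, 1, 1, 0], [1, 1, 0, 1]],
    columns=["red", "white", "green", "orange"],
).astype(bool)

# Stage 1: find all frequent itemsets that meet the minimum support
frequent_itemsets = apriori(incidence, min_support=0.2, use_colnames=True)

# Stage 2: keep only rules that meet the minimum confidence
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```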

Hunt 34/101
| School of Management

Rule selection

For example, from the itemset {red, white, green} in the faceplate data, we can
derive the following association rules, confidence values, and lift values:

If the desired minimum confidence is 70%, we would report only the second and
third rules
When generating association rules, we (i.e., the analyst) specify the minimum
support and minimum confidence level
Hunt 35/101
| School of Management

Rule selection: example output

These rules give us understandable insight


Example: if orange is purchased, then with confidence 100% white will also be purchased (does this logic go both ways?)
Hunt 36/101
| School of Management

Interpreting results

It is useful to look at the various measures


The support for the rule indicates its impact in terms of overall size
How many transactions are affected?
If only a small number of transactions are affected, the rule may be of little use
The confidence tells us at what rate consequents will be found with antecedents,
and is useful in determining the business or operational usefulness of a rule
A rule with low confidence may find consequents at too low a rate to be worth the
cost of (say) promoting the consequent in all the transactions that involve the
antecedent

Hunt 37/101
| School of Management

Interpreting results

The lift ratio indicates how efficient the rule is in finding consequents, compared to
random selection
In other words, how "strong" is the dependency between the antecedent and
consequent
A very efficient rule is preferred to an inefficient rule
But we must still consider support; a very efficient rule that has very low support
may not be as desirable as a less efficient rule with much greater support (why?)

Hunt 38/101
| School of Management

Another example
Here we examine associations among transactions involving various types of books
The full database includes 2000 transactions, and there are 11 different types of
books

Hunt 39/101
| School of Management

Another example
If we develop association rules for this dataset using a minimum support of 5% and
a minimum confidence of 50%:

Hunt 40/101
| School of Management

Collaborative Filtering

Hunt 41/101
| School of Management

Recommender systems

The recommender engine provides personalized recommendations to a user based on their information as well as similar users’ information
Information means behaviors indicative of preference, such as purchases, ratings,
and clicking
The value that recommendation systems provide to users helps online companies
convert browsers into buyers, increase revenue per order, and build loyalty
Collaborative filtering is a popular technique used by such recommendation
systems

Hunt 42/101
| School of Management

Data requirements

Collaborative filtering requires availability of all item–user information


For each item–user combination, we should have some measure of the user’s
preference for that item
Preference can be a numerical rating or a binary behavior such as a purchase, a
"like," or a click

Hunt 43/101
| School of Management

Data requirements

For n users (U1, U2, ..., Un) and p items (I1, I2, ..., Ip), we can think of the data as an n × p matrix of n rows (users) by p columns (items)
Typically, not every user purchases or rates every item, and therefore a purchase matrix will have many zeros (sparse) and a ratings matrix will have many missing values

Hunt 44/101
| School of Management

User-based collaborative filtering

One approach to generating personalized recommendations is based on finding users with similar preferences, and recommending items that they liked but the user hasn’t purchased
The algorithm has two steps:
1 Find users who are most similar to the user of interest (neighbors). This is done by comparing the preferences of our user to the preferences of other users
2 Considering only the items that the user has not yet purchased, recommend the ones that are most preferred by the neighbors
This is the approach used by Amazon’s "Customers Who Bought This Item Also Bought..."

Hunt 45/101
| School of Management

User-based collaborative filtering: step 1

Step 1 requires choosing a distance (or proximity) metric to measure the distance
between our user and the other users
Once the distances are computed, we can use a threshold on the distance or on the
number of required neighbors to determine the nearest neighbors to be used in
Step 2
A nearest-neighbors approach measures the distance of our user to each of the
other users in the database, similar to k-NN
A popular proximity measure between two users is the Pearson correlation between
their ratings

Hunt 46/101
| School of Management

User-based collaborative filtering: proximity measurement

We denote the ratings of items I1, ..., Ip by user U1 as r1,1, r1,2, ..., r1,p and their average by r̄1
Similarly, denote the ratings by user U2 as r2,1, r2,2, ..., r2,p with average r̄2
The correlation proximity between the two users is defined as follows:

Corr(U1, U2) = Σ(r1,i − r̄1)(r2,i − r̄2) / [√Σ(r1,i − r̄1)² × √Σ(r2,i − r̄2)²]

where the summations are over the items co-rated by both users
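A small NumPy sketch of this proximity measure, using hypothetical ratings (NaN marks items a user has not rated); note that each user's average is taken over all items that user rated, while the sums run only over co-rated items:

```python
# A sketch of correlation proximity between two users over co-rated items.
# The ratings below are hypothetical, not the data used on the next slides.
import numpy as np

# np.nan marks items a user has not rated
u1 = np.array([4, 1, np.nan, 3, 3, 4, 5])
u2 = np.array([3, np.nan, 1, 4, np.nan, 4, 5])

# Each user's average is taken over all items that user rated
r1_bar = np.nanmean(u1)
r2_bar = np.nanmean(u2)

# Sums run only over the co-rated items
co_rated = ~np.isnan(u1) & ~np.isnan(u2)
d1 = u1[co_rated] - r1_bar
d2 = u2[co_rated] - r2_bar
corr = np.sum(d1 * d2) / (np.sqrt(np.sum(d1**2)) * np.sqrt(np.sum(d2**2)))
print(round(corr, 3))
```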

Hunt 47/101
| School of Management

User-based collaborative filtering: proximity measurement


Let’s assume we have the data below, which reflects movie ratings
We will find the correlation proximity between customer 30878 and 823519

Hunt 48/101
| School of Management

User-based collaborative filtering: proximity measurement


Start by finding r̄30878 and r̄823519
r̄30878 = (4 + 1 + 3 + 3 + 4 + 5)/6 = 3.333
r̄823519 = (3 + 1 + 4 + 4 + 5)/5 = 3.4
Now find all movies that both customers rated (numbers 1, 28, and 30)

Hunt 49/101
| School of Management

User-based collaborative filtering: proximity measurement


Now, calculate the correlation using the given formula

Hunt 50/101
| School of Management

User-based collaborative filtering: step 2

In Step 2, we look only at the k-nearest users, and among all the items that they
rated/purchased which our user did not, we choose the best one and recommend it
to our user
What is the best one?
For binary purchase data, it is the item most purchased
For rating data, it could be the highest rated, most rated, or a weighting of the two

Hunt 51/101
| School of Management

User-based collaborative filtering: computational challenge

The nearest-neighbors approach can be computationally expensive when we have a large database of users (as we already know from k-NN)
One solution is to use clustering methods to group users into homogeneous
clusters in terms of their preferences, and then to measure the distance of our user
to each of the clusters
This approach places the computational load on the clustering step that can take
place earlier (and offline); it is then faster to compare our user to each of the
clusters in real time
The drawback of clustering is less accurate recommendations (because not all the
members of the closest cluster are the most similar to our user)

Hunt 52/101
| School of Management

Item-based collaborative filtering

When the number of users is much larger than the number of items, it is
computationally cheaper (and faster) to find similar items rather than similar users
When a user expresses interest in a particular item, the item-based collaborative
filtering algorithm has two steps:
Find the items that were co-rated or co-purchased (by any user) with the item of interest
Recommend the most popular or correlated item(s) among the similar items
Similarity is now computed between items, instead of users

Hunt 53/101
| School of Management

Item-based collaborative filtering: Similarity measurement


Let’s look at the correlation between movie 1 and movie 5
r̄1 = (4 + 4 + 5 + 3 + 4 + 3 + 3 + 4 + 4 + 3)/10 = 3.7
r̄5 = (1 + 5)/2 = 3

Hunt 54/101
| School of Management

Item-based collaborative filtering: Similarity measurement


Now, calculate the correlation using the given formula

The 0 correlation is due to the two opposite ratings of movie 5 by the users who
also rated movie 1 (5 stars vs. 1 star)
Hunt 55/101
| School of Management

Association rules vs. collaborative filtering

Association rules and collaborative filtering are both unsupervised methods used
for generating recommendations
But, they differ in several ways:
Frequent itemsets vs. personalized recommendations
Transactional data vs. user data
Binary data vs. ratings data
Two or more items

Hunt 56/101
| School of Management

Frequent itemsets vs. personalized recommendations

Association rules find frequent item combinations and provide related recommendations; collaborative filtering provides personalized recommendations, taking into account "global" consumer behavior
Association rules produce generic, impersonal rules (such as Amazon’s "Frequently
Bought Together" displaying the same recommendations to all users searching for
a specific item)
Collaborative filtering generates user-specific recommendations (e.g., Amazon’s
"Customers Who Bought This Item Also Bought...") and is therefore a tool
designed for personalization

Hunt 57/101
| School of Management

Transactional data vs. user data

Association rules provide recommendations of items based on their co-purchase with other items in many transactions
Collaborative filtering provides recommendations of items based on their
co-purchase or co-rating by similar users

Hunt 58/101
| School of Management

Binary data vs. ratings data

Association rules treat items as binary data (1 = purchase, 0 = nonpurchase)


Collaborative filtering can operate on either binary data or on numerical ratings

Hunt 59/101
| School of Management

Two or more items

In association rules, the antecedent and consequent can each include one or more
items ("IF helmet and cleats THEN shoulder pads and mouth guard")
Thus, a recommendation might be a bundle of items (e.g., if you buy a helmet and
cleats, you will receive 50% off the shoulder pads and mouth guards)
In collaborative filtering, similarity is measured between pairs of items or pairs of
users
Thus, a recommendation will be either for a single item or for multiple items which
do not necessarily relate to each other (e.g., the top two most popular items
purchased by people like you, which you haven’t purchased)

Hunt 60/101
| School of Management

Clustering

Hunt 61/101
| School of Management

Overview

Goal is to segment data into sets of homogeneous clusters of records for the purpose of generating insights
Used in a variety of business applications, from customized marketing to industry
analysis
Market segmentation: customers are segmented based on demographic and
transaction history information, and a marketing strategy is tailored for each
segment
Market structure analysis: identifying groups of similar products according to
competitive measures of similarity
We will cover two popular clustering approaches: hierarchical clustering and
k-means clustering

Hunt 62/101
| School of Management

Examples

Finance: cluster analysis can be used for creating balanced portfolios


Given data on a variety of investment opportunities (e.g., stocks), one may find
clusters based on financial performance variables such as return (daily, weekly, or
monthly), volatility, etc.
Selecting stocks from different clusters can help create a balanced portfolio and
reduce overall volatility (mitigate risk)
Military: design of a new set of sizes for uniforms for women in the US Army
A study came up with a new clothing size system with only 20 sizes, where different
sizes fit different body types
Sizes are combinations of 5 body measurements

Hunt 63/101
| School of Management

What do clusters look like?

Hunt 64/101
| School of Management

Hierarchical clustering overview

Can be either agglomerative or divisive


Agglomerative methods begin with n clusters (where n is the number of
observations) and sequentially merge similar clusters until a single cluster is
obtained
Divisive methods work in the opposite direction, starting with one cluster that
includes all records

Hunt 65/101
| School of Management

Non-hierarchical clustering overview

Using a prespecified number of clusters, the method assigns records to each cluster
These methods are generally less computationally intensive and are therefore
preferred with very large datasets
k-means is a very popular non-hierarchical model

Hunt 66/101
| School of Management

Measuring the distance between two data points

Need to define two types of distances: distance between two records and
distance between two clusters
In both cases, there is a variety of metrics that can be used
Denote by dij a distance measure between records i and j. For record i we have the vector of p measurements (xi1, xi2, ..., xip), while for record j we have the vector of measurements (xj1, xj2, ..., xjp)
Just as we saw with k-NN, the most popular distance measure in clustering is Euclidean distance:

dij = √[(xi1 − xj1)² + (xi2 − xj2)² + ... + (xip − xjp)²]

Hunt 67/101
| School of Management

Other distance measures

Correlation-based similarity (same formula we saw for collaborative filtering)


Mahalanobis distance
Manhattan distance
...

Hunt 68/101
| School of Management

Scaling the data

Do we need to scale the data?

Hunt 69/101
| School of Management

Scaling the data

We know that Euclidean distance is highly influenced by the scale of each variable
Variables with larger scales have a much greater influence in the final distance
calculation
It’s most typical to normalize (z-score normalization) continuous measurements
before computing the Euclidean distance
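A minimal sketch of z-score normalization with scikit-learn, assuming a placeholder numeric matrix X:

```python
# Z-score normalization before computing Euclidean distances.
# X is a placeholder feature matrix (e.g., sales in dollars, a ratio measure).
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[150_000, 3.2],
              [ 90_000, 1.1],
              [210_000, 2.5]])

X_scaled = StandardScaler().fit_transform(X)   # each column now has mean 0, std 1
print(X_scaled)
```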

Hunt 70/101
| School of Management

Measuring the distance between two clusters

Define a cluster as a set of one or more data points


Consider cluster A, which includes m records A1, A2, ..., Am, and cluster B, which includes n records B1, B2, ..., Bn
The most widely used measures of distance between clusters are:
Minimum distance: the distance between the pair of records Ai and Bj that are closest: min distance(Ai, Bj), i = 1, 2, ..., m; j = 1, 2, ..., n
Maximum distance: the distance between the pair of records Ai and Bj that are farthest: max distance(Ai, Bj), i = 1, 2, ..., m; j = 1, 2, ..., n
Centroid distance: the distance between the two cluster centroids
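A short sketch computing all three measures for two small clusters of made-up (already normalized) records:

```python
# Between-cluster distance measures for two small clusters. Coordinates are made up.
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[0.05, -0.85], [-1.08, 0.81]])                  # cluster A records
B = np.array([[0.08, -0.08], [-0.70, -0.72], [-1.58, 1.69]])  # cluster B records

pairwise = cdist(A, B)                    # Euclidean distance for every (Ai, Bj) pair
min_dist = pairwise.min()                 # minimum (single-linkage) distance
max_dist = pairwise.max()                 # maximum (complete-linkage) distance
centroid_dist = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))  # centroid distance

print(min_dist, max_dist, centroid_dist)
```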

Hunt 71/101
| School of Management

Measuring the distance between two clusters

Hunt 72/101
| School of Management

An example

Consider the first two companies {Arizona, Boston} as cluster A, and the next
three companies {Central, Commonwealth, Consolidated} as cluster B

Hunt 73/101
| School of Management

An example

Using Euclidean distance for each distance calculation, we get

What is the minimum distance between clusters?

Hunt 74/101
| School of Management

An example

What is the minimum distance between clusters? 0.76

Hunt 75/101
| School of Management

An example

What is the minimum distance between clusters? 0.76


What is the maximum distance between clusters?

Hunt 76/101
| School of Management

An example

What is the minimum distance between clusters? 0.76


What is the maximum distance between clusters? 3.02

Hunt 77/101
| School of Management

An example

What is the minimum distance between clusters? 0.76


What is the maximum distance between clusters? 3.02
What is the average distance?

Hunt 78/101
| School of Management

An example

What is the minimum distance between clusters? 0.76


What is the maximum distance between clusters? 3.02
What is the average distance? (0.77 + 0.76 + 3.02 + 1.47 + 1.58 + 1.01)/6 =
1.44

Hunt 79/101
| School of Management

An example

What is the minimum distance between clusters? 0.76


What is the maximum distance between clusters? 3.02
What is the average distance? 1.44
What is the centroid distance between clusters?

Hunt 80/101
| School of Management

An example
What is the centroid distance between clusters?

The centroid of cluster A is:

[(0.0459 − 1.0778)/2, (−0.8537 + 0.8133)/2] = [−0.516, −0.020]

The centroid of cluster B is:

[(0.0839 − 0.7017 − 1.5814)/3, (−0.0804 − 0.7242 + 1.6926)/3] = [−0.733, 0.296]
Hunt 81/101
| School of Management

An example
What is the centroid distance between clusters?
The centroid of cluster A is:

[(0.0459 − 1.0778)/2, (−0.8537 + 0.8133)/2] = [−0.516, −0.020]

The centroid of cluster B is:

[(0.0839 − 0.7017 − 1.5814)/3, (−0.0804 − 0.7242 + 1.6926)/3] = [−0.733, 0.296]

The distance between the centroids is then:

√[(−0.516 + 0.733)² + (−0.020 − 0.296)²] = 0.38

Hunt 82/101
| School of Management

An example

What is the minimum distance between clusters? 0.76


What is the maximum distance between clusters? 3.02
What is the average distance? 1.44
What is the centroid distance between clusters? 0.38

Hunt 83/101
| School of Management

Hierarchical agglomerative clustering

Hunt 84/101
| School of Management

Hierarchical agglomerative clustering algorithm

1 Start with n clusters (i.e., each data point is its own cluster)
2 The two closest records are merged into one cluster
3 At every step, combine the two clusters with the smallest distances; this means
that either single records are added to existing clusters or two existing clusters are
combined
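A minimal sketch of this algorithm using SciPy's hierarchical clustering routines; the records are placeholders and single linkage is chosen only as an example:

```python
# Agglomerative clustering with SciPy on placeholder (normalized) records.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.05, -0.85], [-1.08, 0.81], [0.08, -0.08],
              [-0.70, -0.72], [-1.58, 1.69]])

Z = linkage(X, method="single")   # merge history; try "complete", "average",
                                  # "centroid", or "ward" for other linkages
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
print(labels)
```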

Hunt 85/101
| School of Management

Hierarchical agglomerative clustering: example

What two observations would we join in the first iteration?

Hunt 86/101
| School of Management

Hierarchical agglomerative clustering: example

What two observations would we join in the first iteration? Arizona and
Commonwealth
Next, we would recalculate a 4 × 4 distance matrix that would have the distances
between the remaining four clusters: {Arizona, Commonwealth}, {Boston},
{Central}, and {NY}
We then use these distances to identify the two clusters which are closest, and
merge those two clusters
This process continues until one cluster remains
Hunt 87/101
| School of Management

Hierarchical agglomerative clustering: single linkage

In single linkage clustering, the distance measure that we use is the minimum
distance
Back to our example, we would create the 4 × 4 distance matrix as follows:

Hunt 88/101
| School of Management

Hierarchical agglomerative clustering: single linkage


In single linkage clustering, the distance measure that we use is the minimum
distance
Back to our example, we would create the 4 × 4 distance matrix as follows:

The next step would consolidate {Central} with {Arizona, Commonwealth} because these two clusters are closest
The distance matrix will again be recomputed (this time it will be 3 × 3), and so on
Hunt 89/101
| School of Management

Hierarchical agglomerative clustering: other types

Complete linkage: distance measure that we use is maximum distance


Average linkage: distance measure that we use is average distance
Centroid linkage: distance measure that we use is centroid distance
Ward’s method: considers the "loss of information" that occurs when records are clustered together
When each cluster has one record, there is no loss of information and all individual
values remain available
When records are joined in clusters, information about an individual record is
replaced by the information for the cluster
We can measure this loss using error sum of squares

Hunt 90/101
| School of Management

Hierarchical clustering: dendrograms


A treelike diagram that summarizes the process of clustering
On the x-axis are the records
Similar records are joined by lines whose vertical length reflects the distance
between the records

Hunt 91/101
| School of Management

Hierarchical clustering: dendrograms

Setting the cutoff distance to 2.7 on this dendrogram results in six clusters:
{NY}, {Nevada}, {San Diego}, {Idaho, Puget}, {Central}, {Others}

Hunt 92/101
| School of Management

"Validating" clusters

We seek meaningful clusters


To see whether the cluster analysis is useful, consider each of the following
aspects:
Cluster interpretability: Is the interpretation of the resulting clusters reasonable?
Cluster stability: Do cluster assignments change significantly if some of the inputs
are altered slightly?
Cluster separation: Examine the ratio of between-cluster variation to within-cluster
variation to see whether the separation is reasonable
Number of clusters: The number of resulting clusters must be useful, given the
purpose of the analysis (e.g., matching # clusters with # marketing campaigns)

Hunt 93/101
| School of Management

The good and bad of hierarchical clustering

Appealing in that it does not require specification of the number of clusters, and in
that sense is purely data-driven
Dendrograms make this technique easy to understand and interpret
Some limitations:
Hierarchical clustering requires the computation and storage of an n × n distance
matrix; can be expensive and slow
The model only makes one pass through the data; records that are allocated incorrectly early in the process cannot be reallocated subsequently
Hierarchical clustering also tends to have low stability; dropping a few records can
lead to a different solution

Hunt 94/101
| School of Management

k-means clustering

Hunt 95/101
| School of Management

k-means

Another approach to forming good clusters is to pre-specify a desired number of clusters, k, and assign each data point to one of the k clusters
Goal is to divide the sample into k non-overlapping clusters such that clusters are as homogeneous as possible
Common measure of within-cluster dispersion is the sum of distances (or sum of
squared Euclidean distances) of records from their cluster centroid

Hunt 96/101
| School of Management

k-means clustering algorithm

1 Start with k initial clusters (user chooses k)


2 At every step, each record is reassigned to the cluster with the "closest" centroid
3 Recompute the centroids of clusters that lost or gained a record, and repeat Step 2
4 Stop when moving any more records between clusters increases cluster dispersion
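A minimal sketch of this algorithm using scikit-learn's KMeans on placeholder normalized records (the library handles the iteration in steps 2–4):

```python
# k-means with scikit-learn on placeholder normalized data.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.05, -0.85], [-1.08, 0.81], [0.08, -0.08],
              [-0.70, -0.72], [-1.58, 1.69]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment for each record
print(km.cluster_centers_)  # final centroids
print(km.inertia_)          # within-cluster sum of squared distances (dispersion)
```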

Hunt 97/101
| School of Management

k-means clustering: example


Using our example from earlier, set k = 2 and initialize our clusters as follows:
{Arizona, Boston} is cluster A and {Central, Commonwealth, Consolidated} is
cluster B
We calculated the centroids of these clusters earlier: x̄A = [−0.516, −0.020] and
x̄B = [−0.733, 0.296]
The distance of each record from each of these two centroids is shown below:

Hunt 98/101
| School of Management

k-means clustering: example

We see that Boston is closer to cluster B, and that Central and Commonwealth are
each closer to cluster A
We therefore move each of these records to the other cluster and obtain A =
{Arizona, Central, Commonwealth} and B = {Consolidated, Boston}
Recalculating the centroids gives: x̄A = [−0.191, −0.553] and x̄B = [−1.33, 1.253]
Hunt 99/101
| School of Management

k-means clustering: example


Recalculating the centroids gives: x̄A = [−0.191, −0.553] and x̄B = [−1.33, 1.253]
The distance of each record from each of the newly calculated centroids is given
below:

At this point we stop, because each record is allocated to its closest cluster
Hunt 100/101
| School of Management

Selecting k

The choice of the number of clusters can either be driven by external considerations (e.g., previous knowledge, practical constraints, etc.), or we can try a few different values for k and compare the resulting clusters

Hunt 101/101
