
11

Unsupervised Data Mining

Business Analytics, 1e
By Sanjiv Jaggia, Alison Kelly, Kevin Lertwachara, and Leida Chen

9/25/21
Chapter 11 Learning Objectives (LOs)

LO 11.1 Conduct hierarchical cluster analysis.

LO 11.2 Conduct k-means cluster analysis.

LO 11.3 Conduct association rule analysis.
Introductory Case: Nutritional Facts of Candy Bars
• Aliyah is an honors student at a prestigious business school in
Southern California. She is also a fledgling entrepreneur and owns a
vending machine business. Aliyah is aware that California consumers
are becoming increasingly health conscious when it comes to food
purchases. Aliyah wants to come up with a better selection of candy bars
and strategically group and display them in her vending machines.

• Aliyah wants to use the information to accomplish the following tasks.

1. Analyze the nutritional facts data and group candy products according to their
nutritional content.
2. Select a variety of candy bars from each group to better meet the tastes of today’s
consumers.
3. Display the candy bars in her vending machines according to the grouping.
11.1: Hierarchical Cluster Analysis (1/14)
• Unsupervised data mining requires no knowledge of the
target variable.
• The algorithms allow the computer to identify complex
processes and patterns without any specific guidance from
the analyst.
• It is an important part of exploratory data analysis because
it makes no distinction between the target variable $y$ and
the predictor variables $x_1, x_2, \ldots, x_k$.
• Uses similarity measures: Euclidean, Manhattan, Jaccard’s
• We explore two core unsupervised data mining techniques:
cluster analysis and association rule analysis.
11.1: Hierarchical Cluster Analysis (2/14)
• Cluster analysis is an unsupervised data mining technique
that groups data into categories that share some similar
characteristic or trait.
– Similar within a cluster, dissimilar across clusters
– Uses similarity measures
• Allows useful exploratory analysis by summarizing a large
number of observations in a data set into a small number of
clusters.
• The cluster characteristics or profiles help us understand
and describe the different groups.
• A popular application of cluster analysis is called customer
or market segmentation.
• Two common clustering techniques: hierarchical clustering
and k-means clustering.
11.1: Hierarchical Cluster Analysis (3/14)
• Hierarchical clustering is a technique that uses an
iterative process to group data into a hierarchy of
clusters.
– Agglomerative clustering (AGNES): bottom-up; starts
with each observation being its own cluster and iteratively
merges the most similar clusters, moving up the hierarchy
– Divisive clustering (DIANA): top-down; starts with a
single cluster containing all observations and iteratively
splits off the most dissimilar observations, moving down the hierarchy
• We focus on agglomerative clustering, which is the
most commonly used approach.
• The methods can be adapted to implement divisive
clustering.
11.1: Hierarchical Cluster Analysis (4/14)
• With AGNES, each observation in the data initially forms its own cluster.
• The algorithm then successively merges these clusters into larger clusters
based on their similarity until all observations are merged into one final
cluster, referred to as a root.
• Uses (dis)similarity measures.
– Numeric: Euclidean distance or Manhattan distance
– Categorical: matching, Jaccard’s coefficient
• Uses z-score standardization to put variables on a common scale.
• Linkage methods to evaluate (dis)similarity between clusters.
– Single: nearest distance between a pair of observations not in the same cluster
– Complete: farthest distance between a pair of observations not in the same cluster
– Centroid: distance between the center/centroid or mean values of the clusters
– Average: average distance between all pairs of observations not in the same cluster
– Ward’s: uses the error sum of squares (ESS), the sum of squared differences between
individual observations and their cluster mean; measures the loss of information that
occurs when observations are clustered
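
• These distance and linkage choices map directly onto base R’s dist() and hclust() functions. A minimal sketch, assuming df is a hypothetical numeric data frame:

df_z <- scale(df)                          # z-score standardization
d <- dist(df_z, method = "euclidean")      # or method = "manhattan"

# Each linkage method is chosen via hclust()'s method argument:
hc_single   <- hclust(d, method = "single")      # nearest neighbor
hc_complete <- hclust(d, method = "complete")    # farthest neighbor
hc_average  <- hclust(d, method = "average")     # average linkage
hc_centroid <- hclust(d^2, method = "centroid")  # centroid (expects squared distances)
hc_ward     <- hclust(d, method = "ward.D2")     # Ward's ESS criterion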
11.1: Hierarchical Cluster Analysis (5/14)
11.1: Hierarchical Cluster Analysis (6/14)
• Once AGNES completes the clustering process, data are
usually represented in a treelike structure.
– Called a dendrogram
– Branches are clusters
– An observation is a “leaf”
– Visually inspect the clustering result and determine the appropriate number
of clusters
• The height of each branch (cluster) or sub-branch (sub-cluster)
indicates how dissimilar it is from the other
branches or sub-branches with which it is merged.
• The greater the height, the more distinctive the cluster is
from the other clusters.
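
• Continuing the hypothetical sketch above, the dendrogram can be drawn and then cut at a chosen number of clusters (k = 3 here is illustrative):

plot(hc_ward, hang = -1, main = "Dendrogram (Ward's linkage)")
rect.hclust(hc_ward, k = 3)         # outline a 3-cluster solution
clusters <- cutree(hc_ward, k = 3)  # cluster label for each observation
table(clusters)                     # cluster sizes, a first step in profiling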
11.1: Hierarchical Cluster Analysis (7/14)
11.1: Hierarchical Cluster Analysis (8/14)
• Relying solely on the height of a dendrogram tree branch may
lead to statistically distinctive clusters that have little or no
practical meaning.
• We often take into account both quantitative measures (such as
a dendrogram) and practical considerations to determine the
number of clusters.
• We should also review the profile of each cluster using
descriptive statistics.
• Another common approach to profiling clusters is to incorporate
variables that were not used in clustering but are of interest to the
decision maker.
• The ability of a clustering method to discover useful hidden
patterns in the data depends on how it is implemented: data
transformations, distance measures, algorithm, and linkage.
• Try several techniques, and use the one that makes the most sense.
11.1: Hierarchical Cluster Analysis (9/14)

• Example: Consider the crime rate, median income, and poverty rate for 41 cities.
11.1: Hierarchical Cluster Analysis (10/14)
• With Excel
11.1: Hierarchical Cluster Analysis (11/14)
• With Excel
11.1: Hierarchical Cluster Analysis (12/14)
• With R
11.1: Hierarchical Cluster Analysis (13/14)
• With R
11.1: Hierarchical Cluster Analysis (14/14)
• With R
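
• For reference, a hedged end-to-end sketch of the R workflow for this example; the file and variable names (Cities.csv, CrimeRate, MedianIncome, PovertyRate, City) are assumptions, not the actual data set’s names:

cities <- read.csv("Cities.csv")                   # hypothetical file name
z <- scale(cities[, c("CrimeRate", "MedianIncome", "PovertyRate")])
hc <- hclust(dist(z, method = "euclidean"), method = "ward.D2")
plot(hc, labels = cities$City, hang = -1)          # inspect the dendrogram
cities$Cluster <- cutree(hc, k = 4)                # illustrative 4-cluster cut

# Profile each cluster with descriptive statistics (cluster means)
aggregate(cities[, c("CrimeRate", "MedianIncome", "PovertyRate")],
          by = list(Cluster = cities$Cluster), FUN = mean)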
11.3: Association Rule Analysis (1/9)
• Association rule analysis is essentially a “what goes with what” study.
– Designed to identify events that tend to occur together
– Also known as affinity analysis or market basket analysis
• Classic application of market basket analysis: retail companies seek to
identify products that consumers tend to purchase together.
– Display products next to each other on a shelf
– Develop promotional campaigns to cross-sell or up-sell
• Other examples
– Improve sales and customer service
– Help diagnose illnesses based on different symptoms that occur together
• Association rules are If-Then logical statements that represent
relationships among different items or item sets.
– Designed to identify hidden patterns and co-occurring events in data
– The If part is the antecedent; the Then part is the consequent
– Antecedents and consequents can comprise a single product or a combination of
products
– A product or a combination of products is called an item or an item set
11.3: Association Rule Analysis (2/9)
• One inherent problem with searching for hidden relationships between
items or item sets is dealing with the extremely large number of
possible combinations.
• Let $n$ be the number of items. The number of possible combinations
increases exponentially: $3^n - 2^{n+1} + 1$.
– Example: 100 items gives about $5.15 \times 10^{47}$ possible combinations
– The search problem becomes extremely computationally intensive and time-
consuming.
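
• The growth is easy to verify numerically; a quick check in R (computed in floating point):

n <- 100
3^n - 2^(n + 1) + 1   # approximately 5.15e+47 possible combinations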
• There are several algorithms that can be used to perform association
rule analysis in a more efficient manner. They all focus on the
frequency of item sets.
• One of the most widely used algorithms is called the Apriori method.
– Designed to recursively generate item sets that exceed a predetermined frequency
threshold: the support of the item or item set.
– Set a minimum support value, below which an item or item set is excluded, thus
making the analysis more computationally feasible.
– Eliminates infrequent items that are below the support value, makes it easier to
analyze relevant information in a large data set.
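
• In R, the Apriori method is implemented by the arules package’s apriori() function. A minimal sketch, where trans is a hypothetical transactions object and the thresholds are illustrative:

library(arules)                        # provides apriori() and transactions

rules <- apriori(trans,
                 parameter = list(support = 0.05,    # minimum support
                                  confidence = 0.50, # minimum confidence
                                  target = "rules"))
inspect(head(sort(rules, by = "lift"), 5))  # top 5 rules by lift ratio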
11.3: Association Rule Analysis (3/9)
• With enough data, we can propose many of these If-Then association rules.
– We need a way to evaluate the effectiveness of these rules
– Only the strong associations that occur frequently have the potential to reappear consistently in
the future
• Support: the probability of the If-Then statement

$$\text{Support} = \frac{\text{Number of transactions including both antecedent and consequent}}{\text{Total number of transactions}}$$

• Confidence of the association rule: the probability that the antecedent and the
consequent occur, given that the antecedent occurs

$$\text{Confidence} = \frac{\text{Number of transactions including both antecedent and consequent}}{\text{Number of transactions including antecedent}}$$

• Both of these can be misleading if the antecedent and consequent are
common yet unrelated.
• The lift ratio evaluates the strength of the association:

$$\text{Lift ratio} = \frac{\text{Confidence}}{\text{Expected confidence}}, \qquad \text{Expected confidence} = \frac{\text{Number of transactions including consequent}}{\text{Total number of transactions}}$$

– Compares the confidence of the association rule with the overall unconditional probability of the consequent
– Lift = 1: the level of association is the same as no rule at all (random guessing)
– Lift > 1: strong (positive) association
– Lift between 0 and 1: negative association
11.3: Association Rule Analysis (4/9)
• Example: Consider the table of transactions below.
• For the association rule {mascara} => {eye liner}, compute the support, confidence, and lift ratio.
11.3: Association Rule Analysis (5/9)

• $\text{Support} = \frac{5}{10} = 0.50$
• $\text{Confidence} = \frac{5}{7} \approx 0.71$
• $\text{Expected confidence} = \frac{6}{10} = 0.60$
• $\text{Lift ratio} = \frac{0.71}{0.60} \approx 1.19$
• The lift ratio is greater than 1, indicating a strong
association between the purchase of mascara and eye liner.
• The association is 19% stronger than guessing at random.
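
• These figures can be reproduced directly from indicator vectors; the sketch below encodes a hypothetical set of 10 transactions consistent with the counts above (mascara in 7 transactions, eye liner in 6, both in 5):

# TRUE marks the transactions containing each item (hypothetical data)
mascara  <- c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
eyeliner <- c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE)
n <- length(mascara)

support    <- sum(mascara & eyeliner) / n             # 5/10 = 0.50
confidence <- sum(mascara & eyeliner) / sum(mascara)  # 5/7  ~ 0.71
expected   <- sum(eyeliner) / n                       # 6/10 = 0.60
lift       <- confidence / expected                   # ~1.19
round(c(support = support, confidence = confidence, lift = lift), 2)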
11.3: Association Rule Analysis (6/9)
• Example: The store manager at an electronics store
collects data on the last 100 transactions. Five possible
products were purchased: a keyboard, an SD card, a
mouse, a USB drive, and/or headphones.
11.3: Association Rule Analysis (7/9)
• With Excel
11.3: Association Rule Analysis (8/9)
• With R
11.3: Association Rule Analysis (9/9)
• With R
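
• A hedged sketch for this example, assuming the 100 transactions are stored as 0/1 indicator columns (Keyboard, SDCard, Mouse, USBDrive, Headphone) in a hypothetical file Transactions.csv; the thresholds are again illustrative:

library(arules)

# Read the 0/1 purchase matrix and coerce it to a transactions object
mat <- as.matrix(read.csv("Transactions.csv")) == 1
trans <- as(mat, "transactions")

# Mine rules and display the strongest by lift ratio
rules <- apriori(trans, parameter = list(support = 0.10, confidence = 0.50))
inspect(head(sort(rules, by = "lift"), 5))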
