DMW Module 5
MODULE 5
2 Cluster Analysis
Concepts
Types of Data in Cluster Analysis
Categorization of Clustering Methods
1. Partitioning methods: K-Means and K-Medoid Clustering
2. Hierarchical Clustering method: BIRCH (Module 6)
3. Density-Based Clustering methods: DBSCAN and OPTICS (Module 6)
● Rules that satisfy both a minimum support threshold (min sup) and a
minimum confidence threshold (min conf) are called strong.
● By convention, support and confidence values are written as percentages
between 0% and 100%.
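As a minimal illustration (the transaction list and the min sup / min conf values below are hypothetical, not taken from the slides), the support and confidence of a candidate rule can be computed directly from a list of transactions and checked against the thresholds:

    # Hypothetical example: support and confidence of the rule {I1} => {I2}
    transactions = [
        {"I1", "I2", "I5"},
        {"I2", "I4"},
        {"I2", "I3"},
        {"I1", "I2", "I4"},
        {"I1", "I3"},
    ]

    A, B = {"I1"}, {"I2"}
    n = len(transactions)

    count_AB = sum(1 for t in transactions if A | B <= t)   # transactions containing A and B
    count_A = sum(1 for t in transactions if A <= t)        # transactions containing A

    support = count_AB / n            # support(A => B) = P(A union B)
    confidence = count_AB / count_A   # confidence(A => B) = P(B | A)

    min_sup, min_conf = 0.3, 0.6      # hypothetical thresholds
    is_strong = support >= min_sup and confidence >= min_conf
    print(support, confidence, is_strong)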
Apriori Algorithm I
Apriori Algorithm II
Apriori Algorithm: Finding frequent itemsets - Example
Apriori Algorithm III
Apriori Algorithm: Finding frequent itemsets - Example (3)
Apriori Algorithm IV
Apriori Algorithm V
Apriori Algorithm: Finding frequent itemsets - Example
Apriori Algorithm VI
Figure: Pruning C3. Do any of the candidates have a subset that is not frequent?
Apriori Algorithm VII
Apriori Algorithm: Finding frequent itemsets - Example
Apriori Algorithm VIII
Note:
● If Lk = φ at any stage, then the algorithm terminates, and Lk−1 will be the
set of frequent itemsets.
Algorithm: Apriori. Find frequent itemsets using an iterative level-wise
approach based on candidate generation.
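Since the full pseudocode is not reproduced above, here is a minimal Python sketch of the level-wise idea (join, prune by the Apriori property, count, repeat until Lk is empty); the names and structure are illustrative rather than the textbook pseudocode:

    from itertools import combinations

    def apriori(transactions, min_sup_count):
        """Level-wise frequent-itemset mining. transactions: list of sets of items."""
        # L1: frequent 1-itemsets
        counts = {}
        for t in transactions:
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        Lk = {s for s, c in counts.items() if c >= min_sup_count}
        all_frequent = set(Lk)
        k = 2
        while Lk:
            # Join step: candidate k-itemsets from unions of frequent (k-1)-itemsets
            Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
            # Prune step: every (k-1)-subset of a candidate must itself be frequent
            Ck = {c for c in Ck
                  if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
            # Scan the database to count candidate supports
            counts = {c: sum(1 for t in transactions if c <= t) for c in Ck}
            Lk = {c for c, n in counts.items() if n >= min_sup_count}
            all_frequent |= Lk
            k += 1
        return all_frequent   # when Lk becomes empty, return the union collected so far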
Apriori Algorithm IX
Apriori Algorithm X
Apriori Algorithm XI
Apriori Algorithm XII
Generating Association Rules from Frequent Itemsets - Example
● Confidence(A ⇒ B) = support count(A ∪ B) / support count(A)
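A short illustrative sketch of turning one frequent itemset into strong rules with the confidence formula above; the support counts and min_conf below are assumed example values, not taken from the slides:

    from itertools import combinations

    # Hypothetical support counts taken from an already-mined frequent-itemset table
    support_count = {
        frozenset({"I1"}): 6,
        frozenset({"I2"}): 7,
        frozenset({"I5"}): 2,
        frozenset({"I1", "I2"}): 4,
        frozenset({"I1", "I5"}): 2,
        frozenset({"I2", "I5"}): 2,
        frozenset({"I1", "I2", "I5"}): 2,
    }

    def rules_from_itemset(itemset, min_conf):
        """Emit rules A => (itemset - A) whose confidence meets min_conf."""
        itemset = frozenset(itemset)
        for r in range(1, len(itemset)):
            for A in map(frozenset, combinations(itemset, r)):
                conf = support_count[itemset] / support_count[A]
                if conf >= min_conf:
                    yield (set(A), set(itemset - A), conf)

    for antecedent, consequent, conf in rules_from_itemset({"I1", "I2", "I5"}, min_conf=0.7):
        print(antecedent, "=>", consequent, f"confidence={conf:.2f}")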
Apriori Algorithm XIII
Apriori Algorithm XIV
How can we further improve the efficiency of Apriori-based mining?
1. Hash-based technique (hashing itemsets into corresponding buckets)
● A hash-based technique can be used to reduce the size of the candidate
k-itemsets, Ck, for k > 1.
● Example: When we scan each transaction for creating the frequent
1-itemsets, L1, from C1, we can generate all elements of C2 (and L2, the
frequent 2-itemsets) at the same time by using a hash function.
● Ex: Let T100 = {I1, I2, I5}. The 2-itemsets in T100 are
{I1, I2}, {I2, I5}, {I1, I5}.
● For {I1, I5} we compute its bucket in the hash table H2 using the function:
h(x, y) = ((order of(x) × 10) + order of(y)) mod 7
h(I1, I5) = ((order of(I1) × 10) + order of(I5)) mod 7
h(I1, I5) = ((1 × 10) + 5) mod 7 = 1
// so {I1, I5} is put into bucket #1
Apriori Algorithm XV
● Similarly, the other 2-itemsets in T100 can be mapped to buckets using the
hash function ("mod 7" indicates the number of buckets; we have 7).
● The process is repeated for each transaction in D while determining L1
from C1.
● The bucket counts then act as upper bounds on the support counts of the
2-itemsets hashed into them. Thus, if min support = 3, the itemsets in the
buckets whose counts are below 3 (here buckets 0, 1, 3, and 4) cannot be
frequent and should not be included in C2.
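A minimal sketch of the bucket counting described above, reusing the same hash function h(x, y) = ((order of(x) × 10) + order of(y)) mod 7; the transaction list is a placeholder, not the slides' dataset:

    from itertools import combinations

    def order_of(item):
        return int(item[1:])            # order of(Ik) = k, e.g. order of(I5) = 5

    def h(x, y):
        return (order_of(x) * 10 + order_of(y)) % 7

    bucket_count = [0] * 7              # 7 buckets because of "mod 7"
    transactions = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I1", "I2", "I4"}]  # placeholder data

    # While scanning each transaction (the same scan that builds L1 from C1),
    # hash every 2-itemset of the transaction into a bucket.
    for t in transactions:
        for x, y in combinations(sorted(t, key=order_of), 2):
            bucket_count[h(x, y)] += 1

    min_sup = 3
    # A 2-itemset hashed into a bucket whose count is below min_sup cannot be frequent,
    # so the itemsets of those buckets are dropped from C2.
    infrequent_buckets = [b for b, c in enumerate(bucket_count) if c < min_sup]
    print(bucket_count, infrequent_buckets)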
Apriori Algorithm XVI
2. Transaction reduction (reducing the number of transactions
scanned in future iterations)
● A transaction that does not contain any frequent k-itemsets cannot
contain any frequent (k + 1)-itemsets.
● Therefore, such a transaction can be marked or removed from further
consideration, because subsequent scans of the database for j-itemsets,
where j > k, will not require it (see the sketch below).
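A tiny illustrative helper for the idea above (assumed representation: each transaction is a set of items, Lk a set of frozensets):

    def reduce_transactions(transactions, Lk):
        """Keep only transactions containing at least one frequent k-itemset;
        the rest cannot contribute any (k+1)-itemset and are skipped in later scans."""
        return [t for t in transactions if any(itemset <= t for itemset in Lk)]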
3. Partitioning (partitioning the data to find candidate itemsets):
● A partitioning technique uses just two database scans to mine the
frequent itemsets
Apriori Algorithm XVII
● It consists of two phases. In Phase I, the transactions of D are divided into
nonoverlapping partitions, and for each partition all frequent itemsets within
that partition are found (local frequent itemsets).
● In Phase II, a second scan of D is conducted in which the actual support
of each candidate is assessed in order to determine the global frequent
itemsets.
● Any itemset that is potentially frequent with respect to D must occur as
a frequent itemset in at least one of the partitions (see the sketch below).
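An illustrative two-phase sketch; the function name and the mine_local parameter are made up for this example (any frequent-itemset miner, such as the apriori sketch earlier, can be passed in):

    def partition_mining(transactions, n_partitions, min_sup_fraction, mine_local):
        """mine_local(partition, min_count) -> set of locally frequent itemsets.
        Returns the globally frequent itemsets of D."""
        size = max(1, len(transactions) // n_partitions)
        candidates = set()
        # Phase I: each partition is mined independently for its local frequent itemsets
        for start in range(0, len(transactions), size):
            part = transactions[start:start + size]
            candidates |= mine_local(part, max(1, int(min_sup_fraction * len(part))))
        # Phase II: a second full scan of D checks the actual support of every candidate
        min_count = min_sup_fraction * len(transactions)
        return {c for c in candidates
                if sum(1 for t in transactions if c <= t) >= min_count}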
4. Sampling (mining on a subset of the given data)
● Pick a random sample S of the given data D, and then search for
frequent itemsets in S instead of D.
● Since we search for frequent itemsets in S rather than in D, it is
possible that we will miss some of the global frequent itemsets.
● To lessen this possibility, we use a support threshold lower than the
minimum support to find the frequent itemsets local to S (see the sketch
below).
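A minimal sketch of the sampling idea; the sample size and the amount by which the threshold is lowered are arbitrary choices made only for illustration:

    import random

    def sample_mining(transactions, min_sup_fraction, mine, sample_fraction=0.1):
        """Mine a random sample S with a lowered support threshold so that fewer
        globally frequent itemsets are missed. mine is any frequent-itemset miner."""
        S = random.sample(transactions, max(1, int(sample_fraction * len(transactions))))
        lowered = 0.8 * min_sup_fraction          # use a threshold lower than min sup
        return mine(S, max(1, int(lowered * len(S))))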
Apriori Algorithm XVIII
Drawbacks of Apriori
● It may need to generate a huge number of candidate sets. If there are 10^4
frequent 1-itemsets, the Apriori algorithm will need to generate more
than 10^7 candidate 2-itemsets.
● It may need to repeatedly scan the database and check a large set of
candidates by pattern matching
Can we design a method that mines the complete set of frequent itemsets
without candidate generation?
Figure: The FP-tree
● For I4, there are two prefix paths in the FP-tree: {I2, I1 ∶ 1} and {I2 ∶ 1}.
● Extracting these prefix paths gives I4's Conditional Pattern Base (CPB)
as: {{I2, I1 ∶ 1}, {I2 ∶ 1}}
● Compute support counts in the CPB:
support count(I1) = 1, support count(I2) = 1 + 1 = 2
● The Conditional FP-tree will be: (I2 ∶ 2)
(here ∶ 2 represents the support count for I2 in the CPB)
● Now take all possible combinations between I4 and Conditional
FP-tree and get the Frequent Patterns
● Frequent Patterns Generated will be: {I2, I4 ∶ 2}
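The derivation above can be reproduced with a small sketch: count item supports in I4's conditional pattern base, keep the items that meet min sup (assumed to be 2, as in the example) to form the conditional FP-tree, and append I4 to get the frequent patterns. This is only an illustration of this step, not the full FP-growth algorithm:

    from collections import Counter

    min_sup = 2  # assumed minimum support count, as in the example
    # Conditional pattern base of I4: prefix paths with their counts
    cpb_I4 = [(["I2", "I1"], 1), (["I2"], 1)]

    # Support counts of items within the conditional pattern base
    counts = Counter()
    for path, cnt in cpb_I4:
        for item in path:
            counts[item] += cnt
    # counts == {'I2': 2, 'I1': 1}

    # The conditional FP-tree keeps only items meeting min_sup: here just (I2 : 2)
    conditional = {item: c for item, c in counts.items() if c >= min_sup}

    # Frequent patterns: each surviving item combined with I4
    frequent_patterns = {tuple(sorted((item, "I4"))): c for item, c in conditional.items()}
    print(frequent_patterns)   # {('I2', 'I4'): 2}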
Cluster Analysis - Concepts I
Cluster Analysis - Concepts II
● Clustering: The process of grouping a set of physical or abstract
objects into classes of similar objects
● In other words: It is the task of grouping a set of objects in such a way
that objects in the same group (called a cluster) are more similar (in
some sense) to each other than to those in other groups (clusters).
● A cluster is a collection of data objects that are similar to one
another within the same cluster and are dissimilar to the objects in
other clusters.
● In cluster analysis, we first partition the set of data into groups based
on data similarity (e.g., using clustering) and then assign labels to the
relatively small number of groups.
● Unlike classification, clustering and unsupervised learning do not rely
on predefined classes and class-labeled training examples.
● Clustering is a form of learning by observation, rather than learning by
examples.
Cluster Analysis - Concepts III
● Applications: market research, pattern recognition, data analysis,
image processing, machine learning, information retrieval,
bioinformatics, data compression, and computer graphics
● In business: helps marketers discover distinct groups in their customer
bases and characterize customer groups based on purchasing patterns.
● In biology: used to derive plant and animal taxonomies, categorize
genes with similar functionality, and gain insight into structures
inherent in populations.
● Helps to classify documents on the Web for information discovery.
● Helps in the identification of areas of similar land use in an earth
observation database, etc.
● Outlier detection: can detect values that are “far away” from any
cluster
Cluster Analysis - Concepts VI
● High dimensionality
● A database or a data warehouse can contain several dimensions or
attributes
● Algorithms should support data objects in high dimensional space
● Constraint-based clustering:
● Real-world applications may need to perform clustering under various
kinds of constraints.
● The algorithms must be capable of satisfying user-specified constraints.
● Interpretability and usability:
● The clustering results should be interpretable, comprehensible, and usable.
Interval-Scaled Variables
Interval-Scaled Variables (contd...)
Computing dissimilarity (or similarity) between objects described by
interval-scaled variables:
● Dissimilarity is computed based on the distance between each pair of
objects.
● Various distance measures can be used: Euclidean distance, Manhattan
(or city block) distance, Minkowski distance.
● If i = (xi1, xi2, ..., xin) and j = (xj1, xj2, ..., xjn) are two
n-dimensional data objects:
● Euclidean Distance:
d(i, j) = sqrt((xi1 − xj1)^2 + (xi2 − xj2)^2 + ... + (xin − xjn)^2)
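A small plain-Python sketch of the three distance measures for two n-dimensional objects (the sample points are arbitrary):

    def euclidean(i, j):
        return sum((a - b) ** 2 for a, b in zip(i, j)) ** 0.5

    def manhattan(i, j):
        return sum(abs(a - b) for a, b in zip(i, j))

    def minkowski(i, j, p):
        # p = 2 gives the Euclidean distance, p = 1 gives the Manhattan distance
        return sum(abs(a - b) ** p for a, b in zip(i, j)) ** (1 / p)

    x, y = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)   # illustrative points
    print(euclidean(x, y), manhattan(x, y), minkowski(x, y, 3))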
Binary Variables (contd...)
Symmetric binary dissimilarity
● A binary variable is symmetric if both of its states are equally valuable
and carry the same weight
i.e., there is no preference on which outcome should be coded as 0 or 1.
E.g., gender: male, female.
● The dissimilarity can be computed as shown below.
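Following the usual 2×2 contingency-table notation (an assumption here, since the slide's table is not shown: q = number of variables equal to 1 for both objects, r = number equal to 1 for i but 0 for j, s = number equal to 0 for i but 1 for j, and t = number equal to 0 for both), the symmetric binary dissimilarity is commonly written as:
d(i, j) = (r + s) / (q + r + s + t)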
Binary Variables (contd...)
Asymmetric binary dissimilarity
● We can also compute the asymmetric binary similarity.
● The asymmetric binary similarity between the objects i and j, or
sim(i, j), can be computed as shown below.
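Under the same contingency-table notation as above (again an assumption about the slide's missing formula), the number of negative matches t is considered unimportant and is ignored, giving the commonly used forms:
d(i, j) = (r + s) / (q + r + s)
sim(i, j) = q / (q + r + s) = 1 − d(i, j)
sim(i, j) is also known as the Jaccard coefficient.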
● Mary and Jim: unlikely to have a similar disease, due to the higher
dissimilarity value.
● Jack and Mary: most likely to have a similar disease.
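As a small illustration of both formulas (the 0/1 attribute vectors below are hypothetical, chosen only to show the computation):

    def binary_dissimilarity(i, j, asymmetric=False):
        """i and j are equal-length 0/1 vectors over the same binary variables."""
        q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)
        r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)
        s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)
        t = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)
        if asymmetric:
            return (r + s) / (q + r + s)      # negative matches t are ignored
        return (r + s) / (q + r + s + t)

    jack, mary = (1, 0, 1, 0, 0, 0), (1, 0, 1, 0, 1, 0)   # hypothetical test-result vectors
    print(binary_dissimilarity(jack, mary, asymmetric=True))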
● δij(f) = 0 if either
(1) xif or xjf is missing (i.e., there is no measurement of variable f for
object i or for object j), or
(2) xif = xjf = 0 and variable f is asymmetric binary;
● δij(f) = 1 otherwise.
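For context, this indicator is the one used in the standard combining formula for objects described by variables of mixed types; the formula itself does not appear on the slide above, so it is added here as the usual textbook form: over p variables,
d(i, j) = Σf δij(f) dij(f) / Σf δij(f),
where dij(f) is the dissimilarity contribution of variable f.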
Categorization of Clustering Methods III
● The divisive approach, also called the top-down approach, starts with all of
the objects in the same cluster.
● In each successive iteration, a cluster is split up into smaller clusters, until
eventually each object is in its own cluster, or until a termination condition
holds.
● Hierarchical methods suffer from the fact that once a step (merge or
split) is done, it can never be undone.
● Hierarchical Clustering Methods: BIRCH
● Density-based methods:
● Clustering methods have been developed based on the notion of
density(number of objects or data points)
● The general idea is to continue growing the given cluster as long as the
density (number of objects or data points) in the “neighborhood” exceeds
some threshold
● i.e., for each data point within a given cluster, the neighborhood of a
given radius has to contain at least a minimum number of points.
Steps:
1 Choose k, the number of clusters.
2 Select at random k points, the centroids (not necessarily from your
dataset).
3 Assign each data point to the closest centroid, which forms k
clusters.
4 Compute and place the new centroids of each cluster. The new centroid
will be the mean value of the objects in that cluster.
5 Reassign each data point to the new closest centroid (mean value).
If any reassignment took place, repeat step 4.
● Otherwise STOP.
● The process of iteratively reassigning objects to clusters to improve the
partitioning is referred to as iterative relocation.
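A compact, illustrative sketch of steps 1-5 above for 2-D points (plain Python; picking the initial centroids at random from the data is one common choice, not the only one):

    import random

    def k_means(points, k, max_iter=100):
        """points: list of (x, y) tuples. Returns (centroids, assignment)."""
        centroids = random.sample(points, k)               # step 2: pick k initial centroids
        assignment = [None] * len(points)
        for _ in range(max_iter):
            # steps 3 and 5: assign every point to its closest centroid
            new_assignment = [
                min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                            + (p[1] - centroids[c][1]) ** 2)
                for p in points
            ]
            if new_assignment == assignment:               # no reassignment, so STOP
                break
            assignment = new_assignment
            # step 4: recompute each centroid as the mean of its cluster's points
            for c in range(k):
                members = [p for p, a in zip(points, assignment) if a == c]
                if members:
                    centroids[c] = (sum(p[0] for p in members) / len(members),
                                    sum(p[1] for p in members) / len(members))
        return centroids, assignment

    data = [(1, 1), (2, 1), (4, 3), (5, 4)]                # illustrative data
    print(k_means(data, k=2))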
k-means - Problem
Use the k-means clustering algorithm to divide the following data into two
clusters and also compute the representative data points for the
clusters.
Solution
(1) Number of clusters: we have k = 2.
(2) Choose two points arbitrarily as the initial cluster centroids.
(3) To assign each data point to the closest centroid, compute the distances
to each centroid.
(9) Conclusion
● The k-means clustering algorithm with k = 2 applied to the dataset in the
table yields the following clusters and the associated cluster centres.