Unit-4
Frequent itemsets:
A set of items together is called an itemset; an itemset that contains k items is called a k-itemset. An itemset that occurs frequently is called a frequent itemset. Thus frequent itemset mining is a data mining technique to identify the items that often occur together.
For example: bread and butter, a laptop and antivirus software, etc.
A set of items is called frequent if it satisfies minimum threshold values for support and confidence. Support is the fraction of transactions in which the items appear together in a single transaction. Confidence measures how often the items on one side of a rule are purchased in transactions that already contain the items on the other side.
For the frequent itemset mining method, we consider only those itemsets that meet the minimum support and confidence thresholds. Insights from these mining algorithms offer many benefits, such as cost cutting and improved competitive advantage.
There is a tradeoff between the time taken to mine the data and the volume of data to be mined. An efficient frequent mining algorithm mines the hidden patterns of itemsets in a short time and with low memory consumption.
Frequent itemset or pattern mining is broadly used because of its wide applications in mining association rules, correlations, constraint-based and graph patterns based on frequent patterns, sequential patterns, and many other data mining tasks.
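Support and confidence can be made concrete with a small worked sketch. The code below is illustrative only (Python is assumed, and it is not part of the original notes); the transactions mirror TABLE-1 of the Apriori example that follows.

# Minimal sketch: computing support and confidence over a transaction list.
transactions = [
    {"I1", "I2", "I3"},        # T1
    {"I2", "I3", "I4"},        # T2
    {"I4", "I5"},              # T3
    {"I1", "I2", "I4"},        # T4
    {"I1", "I2", "I3", "I5"},  # T5
    {"I1", "I2", "I3", "I4"},  # T6
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent: support(X and Y) / support(X)."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"I1", "I2"}, transactions))       # 4/6, about 0.67
print(confidence({"I1"}, {"I2"}, transactions))  # 4/4 = 1.0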
Example of Apriori:
Support threshold = 50%, Confidence = 60%
TABLE-1
Transaction  Items
T1  I1, I2, I3
T2  I2, I3, I4
T3  I4, I5
T4  I1, I2, I4
T5  I1, I2, I3, I5
T6  I1, I2, I3, I4
Solution:
Support threshold=50% => 0.5*6= 3 => min_sup=3
1. Count Of Each Item
TABLE-2
Item Count
I1 4
I2 5
I3 4
I4 4
I5 2
2. Prune Step: TABLE-2 shows that item I5 does not meet min_sup=3, thus it is deleted;
only I1, I2, I3 and I4 meet the min_sup count.
TABLE-3
Item Count
I1 4
I2 5
I3 4
I4 4
3. Join Step: Form 2-itemsets. From TABLE-1, find out the occurrences of each 2-itemset.
TABLE-4
Item Count
I1,I2 4
I1,I3 3
I1,I4 2
I2,I3 4
I2,I4 3
I3,I4 2
4. Prune Step: TABLE-4 shows that the itemsets {I1, I4} and {I3, I4} do not meet min_sup, thus
they are deleted.
TABLE-5
Item Count
I1,I2 4
I1,I3 3
I2,I3 4
I2,I4 3
5. Join and Prune Step: Form 3-itemsets. From TABLE-1, find out the occurrences of the 3-itemsets.
From TABLE-5, check which of their 2-itemset subsets satisfy min_sup.
For the itemset {I1, I2, I3}, the subsets {I1, I2}, {I1, I3} and {I2, I3} all occur in TABLE-5,
thus {I1, I2, I3} is frequent.
For the itemset {I1, I2, I4}, the subsets are {I1, I2}, {I1, I4} and {I2, I4}; {I1, I4} is not frequent,
as it does not occur in TABLE-5, thus {I1, I2, I4} is not frequent and is deleted. The same reasoning
eliminates {I1, I3, I4} and {I2, I3, I4}, so only {I1, I2, I3} remains frequent. (A code sketch of this
join-and-prune procedure follows TABLE-6.)
TABLE-6 (candidate 3-itemsets; only {I1, I2, I3} satisfies min_sup)
Itemset
I1, I2, I3
I1, I2, I4
I1, I3, I4
I2, I3, I4
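The join-and-prune procedure illustrated above can be written compactly in code. The following is a minimal sketch, not part of the original example; Python is assumed, the transactions mirror TABLE-1, and the helper names are illustrative.

from itertools import combinations

transactions = [
    {"I1", "I2", "I3"}, {"I2", "I3", "I4"}, {"I4", "I5"},
    {"I1", "I2", "I4"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3", "I4"},
]
min_sup = 3  # 50% of 6 transactions

def count(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def generate_candidates(frequent, k):
    """Candidate generation: form k-item combinations of the items appearing in
    the frequent (k-1)-itemsets, then prune every candidate that has an
    infrequent (k-1)-subset (the Apriori property)."""
    items = sorted({i for s in frequent for i in s})
    candidates = []
    for combo in combinations(items, k):
        if all(frozenset(sub) in frequent for sub in combinations(combo, k - 1)):
            candidates.append(frozenset(combo))
    return candidates

# Level 1: frequent 1-itemsets (TABLE-3).
frequent = {frozenset([i]) for t in transactions for i in t
            if count(frozenset([i])) >= min_sup}

k = 2
while frequent:
    print(f"frequent {k-1}-itemsets:", [sorted(s) for s in frequent])
    # Join + prune (TABLE-4/5 for k=2, TABLE-6 for k=3), then count supports.
    frequent = {c for c in generate_candidates(frequent, k) if count(c) >= min_sup}
    k += 1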
1. K-Means Clustering:
It determines the best positions for the K center points, or centroids, through an iterative process.
It assigns each data point to its closest centroid; the data points that are near a particular
centroid form a cluster.
Hence each cluster contains data points with some commonalities, and it is away from the other clusters.
Algorithm:
The working of the K-Means algorithm is explained in the steps below (a minimal code sketch follows the steps):
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as the initial centroids. (They need not be points from the input dataset.)
Step-3: Assign each data point to its closest centroid, which forms the predefined K clusters.
Step-4: Compute the mean of the points in each cluster and place a new centroid there.
Step-5: Repeat the third step, i.e. reassign each data point to the new closest centroid of its
cluster.
Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.
Step-7: The model is ready.
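A minimal sketch of these steps in Python is given below; it is illustrative only (NumPy and randomly generated 2-D data are assumed, and the helper name k_means is not from the notes).

import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    """Minimal K-Means sketch: pick random initial centroids, then alternate
    the assignment step (Step-3/Step-5) and the update step (Step-4)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # Step-2
    for _ in range(max_iters):
        # Step-3 / Step-5: assign each data point to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step-4: move each centroid to the mean of the points assigned to it.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # Step-6: no change, so finish
            break
        centroids = new_centroids
    return labels, centroids                       # Step-7: the model is ready

# Usage example on randomly generated 2-D data (illustrative only).
X = np.random.default_rng(1).normal(size=(100, 2))
labels, centroids = k_means(X, k=3)
print(centroids)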
2. Hierarchical Clustering:
A hierarchical clustering method works by grouping data into a tree of clusters. Hierarchical clustering
begins by treating every data point as a separate cluster. Then, it repeatedly executes the following
steps:
1. Identify the two clusters that are closest together, and
2. Merge these two most similar clusters. These steps are continued until all the
clusters are merged together.
In hierarchical clustering, the aim is to produce a hierarchical series of nested clusters. A diagram called
a dendrogram (a tree-like diagram that records the sequences of merges or splits)
graphically represents this hierarchy; it is an inverted tree that describes the order in which points are
merged (bottom-up view) or clusters are split (top-down view).
The basic methods to generate a hierarchical clustering are:
1. Agglomerative:
Initially, consider every data point as an individual cluster and, at every step, merge the nearest pair of
clusters. (It is a bottom-up method.) At first, every data point is considered as an individual entity or cluster.
At every iteration, clusters merge with other clusters until one cluster is formed.
The algorithm for Agglomerative Hierarchical Clustering is (a minimal code sketch follows the steps):
1. Consider every data point as an individual cluster.
2. Calculate the similarity of each cluster with all the other clusters (compute the proximity matrix).
3. Merge the clusters that are most similar or closest to each other.
4. Recalculate the proximity matrix for the merged cluster.
5. Repeat steps 3 and 4 until only a single cluster remains.
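A minimal from-scratch sketch of these steps is given below; it is illustrative only (Python is assumed, single linkage is chosen as the proximity measure, and the tiny 1-D data set is made up).

import numpy as np

def agglomerative(points, target_clusters=1):
    """Naive agglomerative clustering with single linkage:
    repeatedly merge the two closest clusters (steps 3 and 4)."""
    clusters = [[i] for i in range(len(points))]   # step 1: one cluster per point
    while len(clusters) > target_clusters:
        # steps 2/4: proximity of two clusters = distance of their closest members
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(abs(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]    # step 3: merge the closest pair
        del clusters[b]
    return clusters

points = np.array([1.0, 1.2, 5.0, 5.1, 9.0, 9.3])
print(agglomerative(points, target_clusters=3))    # [[0, 1], [2, 3], [4, 5]]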
Let’s see the graphical representation of this algorithm using a dendrogram.
Note:
This is just a demonstration of how the actual algorithm works; no calculation has been performed below,
and all the proximities among the clusters are assumed.
Let’s say we have six data points A, B, C, D, E, F.
Step-1:
Consider each alphabet as a single cluster and calculate the distance of one cluster from all the other
clusters.
Step-2:
In the second step, comparable clusters are merged together to form a single cluster. Let's say cluster (B)
and cluster (C) are very similar to each other, so we merge them in this step; similarly, clusters (D) and (E)
are merged. At last, we get the clusters
[(A), (BC), (DE), (F)]
Step-3:
We recalculate the proximities according to the algorithm and merge the two nearest clusters ((DE) and (F))
together to form the new clusters [(A), (BC), (DEF)].
Step-4:
Repeating the same process, the clusters DEF and BC are comparable and are merged together to form a
new cluster. We are now left with the clusters [(A), (BCDEF)].
Step-5:
At last, the two remaining clusters are merged together to form a single cluster [(ABCDEF)]. (The sketch below reproduces this merge sequence using SciPy.)
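The sketch below is illustrative only: the 1-D coordinates for A-F are assumed (the notes give no numeric data), and SciPy's hierarchical clustering routines are used to draw the dendrogram.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Assumed coordinates chosen so that B~C and D~E are the closest pairs,
# mirroring the merge order described in the steps above.
labels = ["A", "B", "C", "D", "E", "F"]
points = np.array([[0.0], [4.0], [4.2], [7.0], [7.1], [7.8]])

# Agglomerative (bottom-up) clustering with single linkage.
Z = linkage(points, method="single")

# Plot the dendrogram that records the sequence of merges.
dendrogram(Z, labels=labels)
plt.xlabel("data points")
plt.ylabel("merge distance")
plt.show()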
2.Divisive:
We can say that divisive hierarchical clustering is precisely the opposite of agglomerative
hierarchical clustering. In divisive hierarchical clustering, we consider all of the data points as a
single cluster, and in every iteration we separate out the data points that are not comparable with the
rest of their cluster. In the end, we are left with N clusters. (A sketch of one common way to implement
this top-down splitting follows.)
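One common way to realize this top-down scheme is bisecting K-Means: repeatedly split an existing cluster into two. The sketch below is an illustrative assumption only (it uses scikit-learn's KMeans for the splitting step; neither the approach nor the library is prescribed by the notes).

import numpy as np
from sklearn.cluster import KMeans

def divisive(X, target_clusters):
    """Top-down (divisive) clustering sketch: start with one cluster holding
    all points and repeatedly bisect the largest cluster with 2-means."""
    clusters = [np.arange(len(X))]                 # one cluster with all points
    while len(clusters) < target_clusters:
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        members = clusters.pop(idx)                # split the largest cluster
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[members])
        clusters.append(members[labels == 0])
        clusters.append(members[labels == 1])
    return clusters

X = np.random.default_rng(0).normal(size=(60, 2))
for c in divisive(X, target_clusters=4):
    print(len(c), "points")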
CLIQUE Algorithm:
The CLIQUE algorithm uses a density- and grid-based technique, i.e. it is a subspace clustering
algorithm, and it finds the clusters by taking a density threshold and a number of grids as input
parameters. It is specially designed to handle datasets with a large number of dimensions. The CLIQUE
algorithm is very scalable with respect to the number of records and the number of dimensions in the
dataset, because it is grid-based and uses the Apriori property effectively.
The Apriori approach states that if an X-dimensional unit is dense, then all of its projections in
(X-1)-dimensional space are also dense.
This means that dense regions in a given subspace must produce dense regions when projected into
a lower-dimensional subspace. Because CLIQUE uses this Apriori property, it restricts its search for
higher-dimensional dense cells to the intersections of dense cells in the lower-dimensional subspaces.
Working of CLIQUE Algorithm:
The CLIQUE algorithm first divides the data space into grids. It does this by dividing each
dimension into equal intervals called units. After that, it identifies dense units: a unit is dense if
the number of data points falling into it exceeds the density threshold.
Once the algorithm finds the dense cells along one dimension, it tries to find dense cells
along two dimensions, and it continues in this way until the dense cells across all dimensions are found.
After finding all dense cells in all dimensions, the algorithm proceeds to find the largest sets
("clusters") of connected dense cells. Finally, the CLIQUE algorithm generates a minimal
description of each cluster. Clusters are thus generated from all dense subspaces using the Apriori
approach. (A sketch of the first, one-dimensional step is shown after this paragraph.)
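A minimal sketch of the first step, finding dense one-dimensional units on a grid, is given below; the code, the number of grids, and the threshold are illustrative assumptions only, not a full CLIQUE implementation.

import numpy as np

def dense_1d_units(X, dim, num_grids=10, threshold=5):
    """Divide one dimension into equal intervals (units) and return the
    indices of the units containing more than `threshold` points."""
    values = X[:, dim]
    edges = np.linspace(values.min(), values.max(), num_grids + 1)
    # Count how many points fall into each of the num_grids units.
    counts, _ = np.histogram(values, bins=edges)
    return [u for u, c in enumerate(counts) if c > threshold]

# Illustrative data: 200 points in 3 dimensions.
X = np.random.default_rng(0).normal(size=(200, 3))
for d in range(X.shape[1]):
    print(f"dense units along dimension {d}:", dense_1d_units(X, d))
# Higher-dimensional dense cells are then searched only at intersections of
# these dense units, following the Apriori property.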
Advantages:
CLIQUE is a subspace clustering algorithm that outperforms K-means, DBSCAN, and
Farthest First in both execution time and accuracy.
CLIQUE can find clusters of any shape and is able to find any number of clusters in any
number of dimensions, where the number is not predetermined by a parameter.
It is one of the simplest methods, and its results are easy to interpret.
Disadvantage:
The main disadvantage of the CLIQUE algorithm is that if the size of the grid cells is unsuitable
for the data, too much estimation takes place and the correct clusters cannot be found.