
UNIT-4 (Frequent Itemsets and Clustering)

Frequent itemsets:
A set of items together is called an itemset. An itemset that contains k items is called a k-itemset; an itemset
consists of one or more items. An itemset that occurs frequently is called a frequent itemset. Thus
frequent itemset mining is a data mining technique to identify the items that often occur together.
For Example, Bread and butter, Laptop and Antivirus software, etc.
An itemset is called frequent if it satisfies a minimum threshold value for support (and, for the rules derived from it, confidence).
Support is the fraction (or count) of transactions in which all the items of the itemset are purchased together in a single
transaction. Confidence of a rule X => Y is the fraction of transactions containing X that also contain Y, i.e. support(X and Y) / support(X).
For the frequent itemset mining method, we keep only those itemsets and rules which meet the minimum
support and confidence requirements. Insights from these mining algorithms offer many benefits, such as cost
cutting and improved competitive advantage.
There is a tradeoff between the time taken to mine the data and the volume of data being mined. A good frequent-pattern
mining algorithm discovers the hidden itemset patterns within a short time and
with low memory consumption.
Frequent itemset (pattern) mining is broadly used because of its wide applications in mining association
rules, correlations, constraint-based graph patterns, sequential patterns, and
many other data mining tasks.

Apriori Algorithm – Frequent Pattern Algorithms


The Apriori algorithm was one of the first algorithms proposed for frequent itemset mining. It was
introduced by R. Agrawal and R. Srikant and came to be known as Apriori. The algorithm uses two steps,
"join" and "prune", to reduce the search space. It is an iterative approach for discovering the most frequent
itemsets.
The Apriori property says:

 If P(I) < minimum support threshold, then itemset I is not frequent.

 If P(I ∪ A) < minimum support threshold, then I ∪ A is not frequent, where A is any additional item.
 If an itemset has support below the minimum support, then all of its supersets will also fall below
min support and can thus be ignored. This property is called the antimonotone property.
The steps followed in the Apriori Algorithm of data mining are:
1. Join Step: This step generates (k+1)-itemset candidates by joining the set of frequent k-itemsets with itself.
2. Prune Step: This step scans the database to count the support of each candidate. If a candidate does not
meet the minimum support, it is regarded as infrequent and is removed. This step is
performed to reduce the size of the candidate itemsets.
Steps In Apriori
The Apriori algorithm is a sequence of steps followed to find the most frequent itemsets in a
given database. This data mining technique applies the join and prune steps iteratively until
the most frequent itemsets have been found. A minimum support threshold is either given in the problem or
assumed by the user.
#1) In the first iteration of the algorithm, each item is taken as a 1-itemset candidate. The
algorithm counts the occurrences of each item.
#2) Let there be some minimum support, min_sup (e.g. 2). The set of 1-itemsets whose
occurrence satisfies min_sup is determined. Only those candidates whose count is greater
than or equal to min_sup are taken ahead to the next iteration; the others are pruned.
#3) Next, frequent 2-itemsets with min_sup are discovered. For this, in the join step, the 2-itemset
candidates are generated by combining the frequent 1-itemsets with each other.
#4) The 2-itemset candidates are pruned using the min_sup threshold value. The table then contains
only the 2-itemsets that meet min_sup.
#5) The next iteration forms 3-itemsets using the join and prune steps. This iteration uses the
antimonotone property: the 2-itemset subsets of each candidate 3-itemset must themselves be frequent.
If all 2-itemset subsets are frequent, the candidate is kept and its support is counted;
otherwise it is pruned.
#6) The next step forms 4-itemsets by joining the frequent 3-itemsets with themselves and pruning any candidate whose
subsets do not meet the min_sup criteria. The algorithm stops when no new frequent
itemsets are found.
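
The join and prune loop described above can be written compactly in Python. The following is a minimal sketch, not an optimized implementation; the function name and data layout (a list of transactions, each a collection of item labels) are choices made for this illustration.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Minimal Apriori sketch: returns {frozenset(itemset): support count}."""
    transactions = [set(t) for t in transactions]
    # Iteration 1: count 1-itemset candidates.
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    frequent = {iset: c for iset, c in counts.items() if c >= min_sup}
    all_frequent = dict(frequent)
    k = 2
    while frequent:
        # Join step: build k-itemset candidates from the frequent (k-1)-itemsets.
        items = sorted({i for iset in frequent for i in iset})
        candidates = [frozenset(c) for c in combinations(items, k)
                      # Antimonotone pruning: every (k-1)-subset must be frequent.
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))]
        # Prune step: count supports and keep candidates meeting min_sup.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_sup}
        all_frequent.update(frequent)
        k += 1
    return all_frequent
```

Running this on the six-transaction example below with min_sup=3 reproduces the tables that follow.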

Example of Apriori:
Support threshold=50%, Confidence= 60%

TABLE-1
Transaction   List of items
T1            I1, I2, I3
T2            I2, I3, I4
T3            I4, I5
T4            I1, I2, I4
T5            I1, I2, I3, I5
T6            I1, I2, I3, I4
Solution:
Support threshold=50% => 0.5*6= 3 => min_sup=3
1. Count Of Each Item
TABLE-2
Item   Count
I1     4
I2     5
I3     4
I4     4
I5     2

2. Prune Step: TABLE-2 shows that item I5 does not meet min_sup=3, thus it is deleted;
only I1, I2, I3, I4 meet the min_sup count.
TABLE-3
Item   Count
I1     4
I2     5
I3     4
I4     4
3. Join Step: Form the 2-itemsets. From TABLE-1, find the occurrences of each 2-itemset.

TABLE-4
Itemset   Count
I1, I2    4
I1, I3    3
I1, I4    2
I2, I3    4
I2, I4    3
I3, I4    2

4. Prune Step: TABLE-4 shows that the itemsets {I1, I4} and {I3, I4} do not meet min_sup, thus they
are deleted.
TABLE-5
Itemset   Count
I1, I2    4
I1, I3    3
I2, I3    4
I2, I4    3

5. Join and Prune Step: Form the 3-itemsets. From TABLE-1, find the occurrences of each 3-itemset.
From TABLE-5, check which of the 2-itemset subsets meet min_sup.
For itemset {I1, I2, I3}, the subsets {I1, I2}, {I1, I3} and {I2, I3} all occur in TABLE-5,
thus {I1, I2, I3} is frequent.
For itemset {I1, I2, I4}, the subset {I1, I4} is not frequent, as it does not occur in TABLE-5;
thus {I1, I2, I4} is not frequent and is deleted. The same reasoning removes {I1, I3, I4} and {I2, I3, I4}.
TABLE-6
Itemset
I1, I2, I3
I1, I2, I4
I1, I3, I4
I2, I3, I4

Only {I1, I2, I3} is frequent.


6. Generate Association Rules: From the frequent itemset discovered above, the association rules could
be:
{I1, I2} => {I3}
Confidence = support{I1, I2, I3} / support{I1, I2} = (3/4) * 100 = 75%
{I1, I3} => {I2}
Confidence = support{I1, I2, I3} / support{I1, I3} = (3/3) * 100 = 100%
{I2, I3} => {I1}
Confidence = support{I1, I2, I3} / support{I2, I3} = (3/4) * 100 = 75%
{I1} => {I2, I3}
Confidence = support{I1, I2, I3} / support{I1} = (3/4) * 100 = 75%
{I2} => {I1, I3}
Confidence = support{I1, I2, I3} / support{I2} = (3/5) * 100 = 60%
{I3} => {I1, I2}
Confidence = support{I1, I2, I3} / support{I3} = (3/4) * 100 = 75%
This shows that all of the above association rules are strong if the minimum confidence threshold is
60%.
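
As a cross-check, the same support and confidence values can be computed programmatically. The short sketch below simply counts supports over the TABLE-1 transactions and prints the confidence of every rule derived from {I1, I2, I3}; the variable names are illustrative only.

```python
from itertools import combinations

transactions = [
    {"I1", "I2", "I3"}, {"I2", "I3", "I4"}, {"I4", "I5"},
    {"I1", "I2", "I4"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3", "I4"},
]

def support(itemset):
    """Support count of an itemset over the transaction list."""
    return sum(1 for t in transactions if set(itemset) <= t)

target = {"I1", "I2", "I3"}            # the frequent 3-itemset found above
for size in (1, 2):                    # antecedents of size 1 and 2
    for antecedent in combinations(sorted(target), size):
        consequent = target - set(antecedent)
        conf = support(target) / support(antecedent)
        print(f"{set(antecedent)} => {consequent}: confidence = {conf:.0%}")
```

The printed confidences (75%, 60%, 100%, ...) match the hand calculation above.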

Applications Of Apriori Algorithm


Some fields where Apriori is used:
1. In the Education Field: extracting association rules by mining data of admitted students
based on their characteristics and specialties.
2. In the Medical Field: for example, analysis of patients' databases.
3. In Forestry: Analysis of probability and intensity of forest fire with the forest fire data.
4. Apriori is used by many companies like Amazon in the Recommender System and by
Google for the auto-complete feature.

Handling large datasets in main memory:



Techniques of handling Large datasets:


1. Chunking your data: If you do not need all of the data at the same time, you can load the data in
pieces called chunks. A chunk is a part of the dataset. The chunk size depends on how much RAM
you have.
2. Dropping columns: Sometimes we only need a subset of the columns, not all of them, for our
analysis. A dataset often contains many columns that are not needed, so we load into
memory only the few columns that are useful.
3. Choosing the right datatypes: The default datatypes used for values are not the most memory efficient.
We can change the datatypes of some columns based on the values they store and thus
load larger datasets into memory. (A pandas sketch combining all three techniques follows this list.)
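
The three techniques above can be combined when reading a large CSV file with pandas. This is a minimal sketch under stated assumptions: the file name "transactions.csv" and the column names used here are hypothetical placeholders, and the chunk size is an arbitrary example value.

```python
import pandas as pd

# Hypothetical file and columns; adjust to your own dataset.
CSV_PATH = "transactions.csv"
USE_COLS = ["transaction_id", "item_id", "quantity"]       # drop all other columns
DTYPES = {"transaction_id": "int32",                        # smaller ints instead of int64
          "item_id": "category",                            # category dtype for repeated strings
          "quantity": "int16"}

totals = {}
# Chunking: read 100,000 rows at a time instead of the whole file.
for chunk in pd.read_csv(CSV_PATH, usecols=USE_COLS, dtype=DTYPES, chunksize=100_000):
    # Aggregate each chunk, then combine, so the full file is never in memory at once.
    counts = chunk.groupby("item_id", observed=True)["quantity"].sum()
    for item, qty in counts.items():
        totals[item] = totals.get(item, 0) + qty

print(sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:10])
```
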
Clustering :
Clustering is the task of dividing the population or data points into a number of groups such that data
points in the same group are more similar to each other than to data points in other
groups. In simple words, the aim is to segregate groups with similar traits and assign them into clusters.
Let's understand this with an example. Suppose you are the head of a rental store and wish to understand the
preferences of your customers to scale up your business. Is it possible for you to look at the details of each
customer and devise a unique business strategy for each one of them? Definitely not. But what you can
do is cluster all of your customers into, say, 10 groups based on their purchasing habits and use a
separate strategy for the customers in each of these 10 groups. This is what we call clustering.

1. K-Means Clustering Algorithm:


K-Means Clustering is an unsupervised learning algorithm which groups an unlabeled dataset into
different clusters. Here K defines the number of pre-defined clusters that need to be created in the
process: if K=2 there will be two clusters, for K=3 there will be three clusters, and so on.
It allows us to cluster the data into different groups and is a convenient way to discover the categories of
groups in an unlabeled dataset on its own, without the need for any training.
It is a centroid-based algorithm, in which each cluster is associated with a centroid. The main aim of this
algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.
The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and
repeats the process until the cluster assignments no longer change. The value of k should be predetermined in this
algorithm.
The k-means clustering algorithm mainly performs two tasks:

 Determines the best value for the K center points (centroids) by an iterative process.
 Assigns each data point to its closest center. The data points that are nearest to a particular
center together form a cluster.
Hence each cluster contains data points with some commonalities and is well separated from the other clusters.
Algorithm:
The working of the K-Means algorithm is explained in the below steps:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as the initial centroids. (They need not be points from the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Recompute the centroid of each cluster (the mean of the points assigned to it).
Step-5: Repeat the third step, which means reassign each data point to the new closest centroid of its
cluster.
Step-6: If any reassignment occurs, then go to Step-4, else go to FINISH.
Step-7: The model is ready.
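
A compact NumPy sketch of these steps is shown below. It is illustrative only: the data is assumed to be a 2-D array of shape (n_samples, n_features), the seed and stopping rule are arbitrary choices, and empty clusters are not handled.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-means sketch following Steps 1-7 above (empty clusters not handled)."""
    rng = np.random.default_rng(seed)
    # Step-2: pick K random points from the dataset as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step-3/5: assign each point to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step-4: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step-6: stop when the centroids stop moving (no reassignment changes them).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Usage on toy 2-D data with two well separated blobs:
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = kmeans(X, k=2)
```
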

2. Hierarchical Clustering:
A hierarchical clustering method works by grouping data into a tree of clusters. Hierarchical clustering
begins by treating every data point as a separate cluster. Then, it repeatedly executes the following
steps:
1. Identify the two clusters that are closest together, and
2. Merge the two most comparable clusters. We continue these steps until all the
clusters are merged together.
In hierarchical clustering, the aim is to produce a hierarchical series of nested clusters. A diagram called a
dendrogram (a tree-like diagram that records the sequences of merges or splits)
graphically represents this hierarchy; it is an inverted tree that describes the order in which points are
merged (bottom-up view) or clusters are split (top-down view).
The basic methods to generate hierarchical clusterings are:
1. Agglomerative:
Initially consider every data point as an individual cluster and, at every step, merge the nearest pair of
clusters. (It is a bottom-up method.) At first, every data point is considered an individual entity or cluster.
At every iteration, clusters are merged with other clusters until one cluster is formed.
The algorithm for agglomerative hierarchical clustering is:

 Consider every data point as an individual cluster
 Calculate the similarity of each cluster with all the other clusters (compute the proximity matrix)
 Merge the clusters which are most similar, i.e. closest, to each other
 Recalculate the proximity matrix for the new clusters
 Repeat steps 3 and 4 until only a single cluster remains.
Let’s see the graphical representation of this algorithm using a dendrogram.
Note:
This is just a demonstration of how the actual algorithm works; no calculations have been performed below,
and all the proximities among the clusters are assumed.
Let’s say we have six data points A, B, C, D, E, F.

Step-1:
Consider each point as a single cluster and calculate the distance of each cluster from all the other
clusters.
Step-2:
In the second step, comparable clusters are merged together to form a single cluster. Let's say cluster (B)
and cluster (C) are very similar to each other, so we merge them in this step; similarly, clusters (D)
and (E) are merged. We are left with the clusters
[(A), (BC), (DE), (F)]
Step-3:
We recalculate the proximities according to the algorithm and merge the two nearest clusters ((DE) and (F))
together to form the new clusters [(A), (BC), (DEF)]
Step-4:
Repeating the same process, the clusters (DEF) and (BC) are comparable and are merged together to form a
new cluster. We are now left with the clusters [(A), (BCDEF)].
Step-5:
At last, the two remaining clusters are merged together to form a single cluster [(ABCDEF)].
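
The same bottom-up merging can be reproduced with SciPy. This is a minimal sketch: the 2-D coordinates for the six points A-F are made-up values chosen only so that the merge order roughly matches the walkthrough above (B with C, D with E, then F, then A last).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Made-up 2-D coordinates for the six points A..F (illustrative only).
labels = ["A", "B", "C", "D", "E", "F"]
X = np.array([[0.0, 8.0],   # A, far from everything
              [2.0, 0.0],   # B \ close pair
              [2.2, 0.1],   # C /
              [6.0, 0.0],   # D \ close pair
              [6.2, 0.1],   # E /
              [7.5, 0.5]])  # F, nearest to the (D, E) cluster

# Agglomerative clustering: 'single' linkage merges the two closest clusters at each step.
Z = linkage(X, method="single")

dendrogram(Z, labels=labels)
plt.title("Agglomerative clustering of A-F")
plt.show()
```
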

2. Divisive:
We can say that divisive hierarchical clustering is precisely the opposite of agglomerative
hierarchical clustering. In divisive hierarchical clustering, we start with all of the data points in a
single cluster, and in every iteration we split off the data points that are not
comparable with the rest of their cluster. In the end, we are left with N clusters.

Clustering high dimensional data:


 Clustering high-dimensional data is the cluster analysis of data with anywhere from a few dozen
to many thousands of dimensions.
 Such high-dimensional spaces of data are often encountered in areas such as medicine, where
DNA microarray technology can produce many measurements at once, and the clustering of text
documents, where, if a word-frequency vector is used, the number of dimensions equals the size
of the vocabulary.
 Most clustering methods are designed for clustering low-dimensional data and encounter
challenges when the dimensionality of the data grows very high (say, over 10 dimensions, or even
over thousands of dimensions for some tasks).
Issues:
 Noise
 Distance measure meaningless
What happens when dimensionality increases?

 Only a small number of dimensions are relevant to certain clusters; the irrelevant dimensions produce
noise and mask the real clusters.
 Data becomes increasingly sparse because the data points are likely located in different
dimensional subspaces.
 Data points come to appear almost equally distant from one another.
 The distance measure, which is essential for cluster analysis, becomes meaningless.
Solution techniques:
1. Feature/ Attribute Transformation
2. Feature/ Attribute Selection
3. Subspace clustering
Examples (of feature/attribute transformation):

 Principal component analysis (PCA)
 Singular value decomposition (SVD)
 These transform the data onto a smaller space while approximately preserving the original relative distances
between objects.
 They summarize the data by creating linear combinations of the attributes.

1. Feature/ Attribute Transformation


 They do not remove any of the original attributes from the analysis.
 The irrelevant information may mask the real clusters, even after transformation.
 The transformed features (attributes) are often difficult to interpret, making the clustering
results less useful.
 Thus, feature transformation is only suited to data sets where most of the dimensions are
relevant to the clustering task.
 Unfortunately, real-world data sets tend to have many highly correlated, or redundant,
dimensions.
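
As a rough illustration of feature transformation, the sketch below projects high-dimensional data onto its top principal components using NumPy's SVD. The sizes used (100 points, 50 features, 2 retained components) are arbitrary choices for the example.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project X onto its top principal components via SVD (a minimal PCA sketch)."""
    X_centered = X - X.mean(axis=0)            # PCA works on mean-centered data
    # Economy-size SVD: the rows of Vt are the principal directions.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T    # coordinates in the reduced space

# Arbitrary high-dimensional example: 100 points in 50 dimensions.
X = np.random.randn(100, 50)
Z = pca_project(X, n_components=2)
print(Z.shape)   # (100, 2): relative distances are approximately preserved
```
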
2. Feature/Attributes Selection
 It is commonly used for data reduction by removing irrelevant or redundant dimensions
(or attributes).
 Given a set of attributes, attribute subset selection finds the subset of attributes that are
most relevant to the data mining task.
 Attribute subset selection involves searching through various attribute subsets and
evaluating these subsets using certain criteria.
 Supervised learning: the most relevant set of attributes is found with respect to the given
class labels.
 Unsupervised learning: methods such as entropy analysis, which is based on the property that
entropy tends to be low for data that contain tight clusters.
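
One simple way to realize the entropy idea above is to histogram each attribute and keep the attributes whose histograms have the lowest entropy. The sketch below is only an assumption-laden illustration of that heuristic, not a standard library routine; the number of bins and the number of attributes kept are arbitrary.

```python
import numpy as np

def attribute_entropy(values, bins=10):
    """Shannon entropy of one attribute, estimated from a histogram."""
    counts, _ = np.histogram(values, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]                         # ignore empty bins (0 * log 0 = 0)
    return -(p * np.log2(p)).sum()

def select_low_entropy_attributes(X, keep=5, bins=10):
    """Keep the `keep` attributes whose histograms have the lowest entropy."""
    entropies = np.array([attribute_entropy(X[:, j], bins) for j in range(X.shape[1])])
    return np.argsort(entropies)[:keep]  # indices of the selected attributes

# Example: 200 points, 20 attributes; attribute 0 forms two tight clusters (low entropy).
X = np.random.randn(200, 20)
X[:, 0] = np.random.choice([0.0, 5.0], size=200) + 0.01 * np.random.randn(200)
print(select_low_entropy_attributes(X, keep=5))
```
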
3. Subspace Clustering
 It is an extension of attribute subset selection that has shown its strength in high-
dimensional clustering.
 It is based on the observation that different subspaces may contain different, meaningful
clusters.
 Subspace clustering searches for groups of clusters within different subspaces of the
same data set.
 The problem becomes how to find such subspace clusters effectively and efficiently.
High-dimensional data clustering approaches

 Dimension-Growth Subspace clustering
 CLIQUE (CLustering In QUEst)
 Dimension-Reduction Projected clustering
 PROCLUS (PROjected CLUStering)
 Frequent pattern based clustering
 pCluster

CLIQUE: Grid-Based Subspace Clustering


 CLIQUE (CLustering In QUEst) was proposed by Agrawal, Gehrke, Gunopulos and
Raghavan (SIGMOD '98).
 CLIQUE is a density-based and grid-based subspace clustering algorithm. So let's first take a
look at what the grid-based and density-based clustering techniques are.
1. Grid-Based Clustering Technique: In grid-based methods, the space of instances is divided
into a grid structure. Clustering techniques are then applied using the cells of the grid, instead of
individual data points, as the base units.
2. Density-Based Clustering Technique: In Density-Based Methods, A cluster is a maximal set
of connected dense units in a subspace.

The CLIQUE algorithm uses the density-based and grid-based techniques together, i.e. it is a subspace
clustering algorithm that finds clusters by taking a density threshold and the number of grid intervals as
input parameters. It is specially designed to handle datasets with a large number of dimensions. The
CLIQUE algorithm is very scalable with respect to both the number of records and the number of
dimensions in the dataset, because it is grid-based and uses the Apriori property effectively.

The Apriori property, as used here, states that if a k-dimensional unit is dense, then all of its projections
onto (k-1)-dimensional subspaces are also dense.

This means that dense regions in a given subspace must produce dense regions when projected onto a
lower-dimensional subspace. CLIQUE therefore restricts its search for high-dimensional dense cells to the
intersections of the dense cells already found in lower-dimensional subspaces.
Working of CLIQUE Algorithm:
The CLIQUE algorithm first divides the data space into a grid. It does this by dividing each
dimension into equal intervals called units. After that, it identifies dense units: a unit is dense if
the number of data points it contains exceeds the threshold value.
Once the algorithm finds the dense cells along one dimension, it tries to find dense cells
along two dimensions, and it continues until all dense cells across all dimensions are found.
After finding all dense cells in all subspaces, the algorithm proceeds to find the largest sets
("clusters") of connected dense cells. Finally, the CLIQUE algorithm generates a minimal
description of each cluster. Clusters are thus generated from all dense subspaces using the Apriori
approach.
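
The first pass of this procedure, finding dense 1-D units on a grid, can be sketched as follows. The number of intervals and the density threshold are illustrative parameters, and the higher-dimensional candidate generation is omitted.

```python
import numpy as np

def dense_1d_units(X, n_intervals=10, density_threshold=5):
    """First CLIQUE pass (sketch): dense units along each single dimension.

    Returns {dimension: [interval indices whose point count exceeds the threshold]}.
    """
    dense = {}
    for dim in range(X.shape[1]):
        col = X[:, dim]
        # Divide this dimension into equal-width intervals (grid units).
        edges = np.linspace(col.min(), col.max(), n_intervals + 1)
        unit_ids = np.clip(np.digitize(col, edges) - 1, 0, n_intervals - 1)
        counts = np.bincount(unit_ids, minlength=n_intervals)
        dense[dim] = [u for u in range(n_intervals) if counts[u] > density_threshold]
    return dense

# Example: 300 points in 3 dimensions.
X = np.random.randn(300, 3)
print(dense_1d_units(X, n_intervals=10, density_threshold=20))
```
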
Advantage:
 CLIQUE is a subspace clustering algorithm that has been reported to outperform K-means, DBSCAN and
Farthest First in both execution time and accuracy on some datasets.
 CLIQUE can find clusters of any shape and is able to find any number of clusters in any
number of dimensions; the number of clusters is not predetermined by a parameter.
 It is one of the simplest methods, and its results are interpretable.
Disadvantage:
 The main disadvantage of the CLIQUE algorithm is that the result is sensitive to the grid size and
density threshold: if the cells are an unsuitable size for the data, too much estimation takes place
and the correct clusters may not be found.

Projected clustering (PROCLUS):


Projected clustering (PROCLUS) is the first top-down, partitioning projected clustering algorithm, based on the notion
of k-medoid clustering, and was presented by Aggarwal et al. (1999). It repeatedly determines medoids for each cluster
on a sample of the data using a greedy hill-climbing technique and then iteratively refines the results.
Cluster quality in projected clustering is a function of the average distance between data points
and the closest medoid. The subspace dimensionality is an input parameter, which leads to clusters
of alike sizes.

Features of Projected Clustering :

 Projected clustering is a typical dimension-reduction subspace clustering method. That

is, instead of starting from single-dimensional spaces, it proceeds by identifying an
initial approximation of the clusters in the high-dimensional attribute space.
 Each dimension is then assigned a weight for each cluster, and the updated weights are
used in the next iteration to regenerate the clusters. This leads to the examination of dense
regions in all subspaces of some desired dimensionality.
 It avoids the generation of a huge number of overlapped clusters in lower dimensionality.
 Projected clustering finds the best set of medoids by a hill-climbing technique,
generalized to deal with projected clustering.
 It uses a distance measure called the Manhattan segmental distance (a small sketch of it follows this list).
 The algorithm is composed of three phases: initialization, iteration, and cluster refinement.
 However, projected clustering is faster than CLIQUE due to the sampling of large
datasets, though the use of a small number of representative points can cause the algorithm to
miss some clusters completely.
 Experiments on projected clustering show that the method is efficient and scalable at
finding high-dimensional clusters. The algorithm finds non-overlapping partitions of
points.
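
The Manhattan segmental distance mentioned above is the Manhattan distance restricted to a cluster's relevant dimensions and averaged over them. A minimal sketch, assuming points are NumPy arrays and the relevant dimensions are given as a list of indices:

```python
import numpy as np

def manhattan_segmental_distance(x, y, dims):
    """Manhattan distance over the dimensions in `dims`, averaged over |dims|."""
    dims = np.asarray(dims)
    return np.abs(x[dims] - y[dims]).sum() / len(dims)

# Example: two 5-dimensional points compared only on dimensions 0, 2 and 3.
x = np.array([1.0, 9.0, 2.0, 4.0, 7.0])
y = np.array([2.0, 0.0, 5.0, 4.0, 1.0])
print(manhattan_segmental_distance(x, y, dims=[0, 2, 3]))   # (1 + 3 + 0) / 3 = 1.33...
```
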
Input and Output for Projected Clustering :
Input –

 The group of data points.


 Number of clusters, denoted by k.
 Average number of dimensions for each cluster, denoted by L.
Output –
The clusters identified, and the dimensions assigned to each of these clusters.
