BDA Assignment2 BE6 20
If we check the bit array, the bits at these indices are set to 1, but we know
that “cat” was never added to the filter. The bits at indices 1 and 7 were set
when we added “geeks”, and bit 3 was set when we added “nerd”.
So, because the bits at the calculated indices were already set by other
items, the Bloom filter erroneously claims that “cat” is present,
producing a false positive. Depending on the application, this could
be a serious drawback or relatively harmless.
We can control the probability of a false positive by controlling
the size of the Bloom filter. More space means fewer false positives. If
we want to decrease the probability of a false positive, we have to use
a larger bit array and, to make use of it, a greater number of hash
functions. Each extra hash function adds latency to both adding an item
and checking membership.
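The trade-off above can be sketched in a few lines of Python. This is only an illustrative sketch: the class name `BloomFilter`, the seeded SHA-256 hashing, and the helper `false_positive_rate` are assumptions made for this example, not the hash functions used in the answer above, so the indices it produces will differ from 1, 3 and 7. The helper uses the standard approximation p ≈ (1 - e^(-kn/m))^k for m bits, k hash functions and n inserted items.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter sketch (hash scheme is an assumption)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits

    def _indices(self, item):
        # Derive num_hashes indices by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for i in self._indices(item):
            self.bits[i] = 1

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[i] for i in self._indices(item))


def false_positive_rate(num_bits, num_hashes, num_items):
    # Standard approximation: (1 - e^(-k*n/m))^k
    return (1 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


bf = BloomFilter(num_bits=10, num_hashes=3)
bf.add("geeks")
bf.add("nerd")
print(bf.might_contain("geeks"))  # items that were added are always reported
print(false_positive_rate(10, 3, 2), false_positive_rate(100, 3, 2))
```

Comparing the two printed rates shows how the same number of items gives a much lower estimated false positive probability in the larger bit array.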
2.
3. Explain the SON algorithm with a suitable example.
Divide the dataset into chunks and distribute them across multiple machines
or processors.
Each machine identifies local frequent itemsets by scanning its portion of the
dataset and counting the occurrences of items.
The support threshold, scaled down to each chunk's share of the data, is
applied locally to filter out infrequent itemsets.
Phase 1 - Reduce Phase:
Let's say our support threshold is 2 (meaning an itemset must appear in at least
2 transactions to be considered frequent).
Phase 1:
Map Phase:
Each machine processes a portion of the dataset and identifies local frequent
itemsets.
Machine 1: {bread: 3, milk: 2, butter: 2}
Machine 2: {eggs: 1}
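The two passes of SON can be sketched as follows. This is a minimal single-process sketch, not a real MapReduce job. The transactions in `chunk_1` and `chunk_2` are hypothetical, chosen only so that the single-item counts match the figures listed for Machine 1 and Machine 2; the function names and the `max_size` limit are also assumptions. Each chunk uses the global threshold scaled by the chunk's share of the data, and the second pass counts every candidate over the whole dataset.

```python
from collections import Counter
from itertools import combinations


def local_frequent(chunk, local_threshold, max_size=2):
    """Pass 1 (map): itemsets that are frequent within one chunk."""
    counts = Counter()
    for transaction in chunk:
        items = sorted(set(transaction))
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return {s for s, c in counts.items() if c >= local_threshold}


def son(chunks, support, max_size=2):
    total = sum(len(chunk) for chunk in chunks)
    # Pass 1 (reduce): the candidates are the union of local frequent itemsets.
    candidates = set()
    for chunk in chunks:
        local_threshold = support * len(chunk) / total
        candidates |= local_frequent(chunk, local_threshold, max_size)
    # Pass 2: count each candidate over the full dataset and keep those
    # that meet the global support threshold.
    global_counts = Counter()
    for chunk in chunks:
        for transaction in chunk:
            items = set(transaction)
            for candidate in candidates:
                if items.issuperset(candidate):
                    global_counts[candidate] += 1
    return {c: n for c, n in global_counts.items() if n >= support}


# Hypothetical transactions (assumed for illustration only).
chunk_1 = [["bread", "milk"], ["bread", "butter"], ["bread", "milk", "butter"]]
chunk_2 = [["eggs"]]
result = son([chunk_1, chunk_2], support=2)
print(result)
```

With this assumed data, eggs is a candidate after the first pass because it passes the scaled-down local threshold on its chunk, but it is dropped in the second pass because its global count is 1, below the support threshold of 2.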
Reduce Phase:
Move each representative point a fraction closer to the centroid of its assigned
points.
This step aims to "shrink" the clusters towards their center.
Merge Clusters:
Repeat steps 3 and 4 until the desired number of clusters is obtained or until
the clusters are no longer merging.
Example:
Let's illustrate the CURE algorithm with a simple dataset:
Assign each point to its nearest representative point. The initial clusters are:
Cluster 1: {(1, 2), (2, 3), (2, 4), (3, 5)}
Cluster 2: {(6, 8), (7, 9), (8, 7), (9, 6)}
Cluster Shrinkage:
Move the representative points closer to the centroids of their assigned points.
For example, the centroid of Cluster 1 is (2, 3.5), so the representative point
(1, 2) moves towards (2, 3.5).
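The shrinkage step can be checked with a short calculation. In this sketch the shrink fraction `alpha = 0.5` is an assumed value (the text above does not specify one), and the helper names `centroid` and `shrink` are made up for illustration.

```python
def centroid(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def shrink(rep, center, alpha):
    """Move a representative point a fraction alpha of the way to the centroid."""
    return (rep[0] + alpha * (center[0] - rep[0]),
            rep[1] + alpha * (center[1] - rep[1]))


cluster_1 = [(1, 2), (2, 3), (2, 4), (3, 5)]
center = centroid(cluster_1)               # (2.0, 3.5)
moved = shrink((1, 2), center, alpha=0.5)  # (1.5, 2.75)
print(center, moved)
```

A smaller `alpha` keeps the representative points near the cluster boundary, while a larger one pulls them closer to the centroid.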
Merge Clusters:
Check whether any two clusters lie within a specified distance threshold of
each other (measured, e.g., by Euclidean distance).
If the distance between two clusters is less than the threshold, merge them.
For example, Cluster 1 and Cluster 2 are relatively close, so they might merge
into a single cluster.
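The merge check can be sketched as below, assuming the distance between two clusters is taken as the smallest Euclidean distance between any pair of their representative points. For simplicity every point of each cluster is treated as a representative here, and the threshold of 5.0 is a hypothetical value chosen for illustration.

```python
import math


def cluster_distance(reps_a, reps_b):
    """Smallest Euclidean distance between representatives of two clusters."""
    return min(math.dist(a, b) for a in reps_a for b in reps_b)


def should_merge(reps_a, reps_b, threshold):
    return cluster_distance(reps_a, reps_b) < threshold


cluster_1 = [(1, 2), (2, 3), (2, 4), (3, 5)]
cluster_2 = [(6, 8), (7, 9), (8, 7), (9, 6)]
d = cluster_distance(cluster_1, cluster_2)  # closest pair is (3, 5) and (6, 8)
print(round(d, 3), should_merge(cluster_1, cluster_2, threshold=5.0))
```

Whether the two clusters actually merge depends entirely on the chosen threshold: with the assumed value of 5.0 they do, while a threshold of 4.0 would keep them separate.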
Repeat:
Repeat steps 3 and 4 until the desired number of clusters is achieved or until
clusters are no longer merging.
Output:
The final output of the CURE algorithm will be the clustered dataset,
represented by the merged clusters.