
Assignment 2

Name: Harsh Mordharia Class and Roll No: BE 6/ 20


Subject: Big Data Analytics PRN: 20UF15936IT031

1. Elaborate the Bloom Filter algorithm.

A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. For example, checking the availability of a username is a set-membership problem, where the set is the list of all registered usernames. The price we pay for this efficiency is that the structure is probabilistic, which means there may be some false positive results. A false positive means the filter might report that a given username is already taken when in fact it is not.
Properties of Bloom Filters:
• Unlike a standard hash table, a Bloom filter of fixed size can represent a set with an arbitrarily large number of elements.
• Adding an element never fails. However, the false positive rate increases steadily as elements are added, until all bits in the filter are set to 1, at which point every query yields a positive result.
• Bloom filters never generate false negatives, i.e., the filter will never tell you that a username doesn’t exist when it actually does.
• Deleting elements from the filter is not possible: if we deleted a single element by clearing the bits at the indices generated by its k hash functions, we might also delete other elements. Example: if we deleted “geeks” (in the example below) by clearing the bits at indices 1, 4 and 7, we would also delete “nerd”, because the bit at index 4 would become 0 and the filter would then claim that “nerd” is not present.
Working of Bloom Filter
An empty Bloom filter is a bit array of m bits, all set to zero.
We need k hash functions to calculate the indices for a given input. When we want to add an item x to the filter, the bits at the k indices h1(x), h2(x), ..., hk(x) are set to 1, where the indices are calculated using the hash functions.
Example: Suppose we want to enter “geeks” into a filter that uses 3 hash functions and a bit array of length 10, all bits initially 0. First we calculate the hashes as follows:
h1(“geeks”) % 10 = 1
h2(“geeks”) % 10 = 4
h3(“geeks”) % 10 = 7
Note: these outputs are invented for explanation only.
Now we set the bits at indices 1, 4 and 7 to 1.

Next we want to enter “nerd”. Similarly, we calculate the hashes:
h1(“nerd”) % 10 = 3
h2(“nerd”) % 10 = 5
h3(“nerd”) % 10 = 4
and set the bits at indices 3, 5 and 4 to 1.
Now suppose we want to check whether “geeks” is present in the filter. We perform the same hashing process: calculate the respective hashes using h1, h2 and h3 and check whether all of those indices are set to 1 in the bit array. If all the bits are set, we can say that “geeks” is probably present. If any bit at these indices is 0, then “geeks” is definitely not present.
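The add and lookup steps above translate directly into code. Below is a minimal Python sketch of a Bloom filter, illustrative only: the k hash functions are simulated by salting a single md5 digest, so the computed indices will differ from the example values 1, 4 and 7 used above.

import hashlib

class BloomFilter:
    def __init__(self, m=10, k=3):
        self.m = m              # size of the bit array
        self.k = k              # number of hash functions
        self.bits = [0] * m     # empty filter: all bits zero

    def _indices(self, item):
        # Simulate k hash functions by salting one digest with i.
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx] = 1  # set the k bits; adding never fails

    def might_contain(self, item):
        # True  -> probably present (false positives are possible)
        # False -> definitely absent (no false negatives)
        return all(self.bits[idx] for idx in self._indices(item))

bf = BloomFilter()
bf.add("geeks")
bf.add("nerd")
print(bf.might_contain("geeks"))  # True: probably present
print(bf.might_contain("cat"))    # usually False, but may be a false positive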

False Positives in Bloom Filters

The question is why we said “probably present”; where does this uncertainty come from? Let us understand it with an example. Suppose we want to check whether “cat” is present or not. We calculate the hashes using h1, h2 and h3:
h1(“cat”) % 10 = 1
h2(“cat”) % 10 = 3
h3(“cat”) % 10 = 7
If we check the bit array, the bits at these indices are set to 1, but we know that “cat” was never added to the filter. The bits at indices 1 and 7 were set when we added “geeks”, and the bit at index 3 was set when we added “nerd”.

So, because the bits at the calculated indices were already set by other items, the Bloom filter erroneously claims that “cat” is present, generating a false positive result. Depending on the application, this can be a huge downside or relatively harmless.
We can control the probability of a false positive by controlling the size of the Bloom filter: more space means fewer false positives. To decrease the probability of a false positive we have to use a larger bit array and a suitably chosen number of hash functions, which adds some latency to both adding items and checking membership.
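This trade-off can be quantified. The standard estimate is that with a bit array of m bits, k hash functions and n inserted items, the probability that a lookup of an element not in the set returns “present” is approximately

p ≈ (1 − e^(−kn/m))^k

and it is minimized by choosing k ≈ (m/n) ln 2. For the toy filter above (m = 10, k = 3, n = 2 items inserted), p ≈ (1 − e^(−0.6))^3 ≈ (0.45)^3 ≈ 0.09, i.e., roughly a 9% chance of a false positive, which is consistent with the “cat” collision we just saw.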

2.
3. Explain the SON algorithm with a suitable example.

The SON (Savasere, Omiecinski, and Navathe) algorithm is a widely used algorithm for efficiently finding frequent itemsets in large datasets. It is particularly useful in market basket analysis, where the goal is to find associations between items that frequently occur together in transactions.

Here is an explanation of the SON algorithm along with an example:


SON Algorithm:
Phase 1 - Map Phase:

Divide the dataset into chunks and distribute them across multiple machines or processors.
Each machine identifies the locally frequent itemsets by scanning its portion of the dataset and counting occurrences.
The support threshold, scaled down in proportion to the chunk size, is applied locally to filter out infrequent itemsets.
Phase 1 - Reduce Phase:

Collect the locally frequent itemsets from all machines and take their union; these become the candidate itemsets.
Any itemset that is frequent in the whole dataset must be frequent in at least one chunk, so no truly frequent itemset is missed.
Phase 2 - Map Phase:

Distribute the candidate itemsets from Phase 1 to all machines.
Each machine counts the occurrences of every candidate itemset in its chunk.
Phase 2 - Reduce Phase:

Sum the per-chunk counts for each candidate itemset over the entire dataset.
Apply the global support threshold to filter out infrequent itemsets.
Output:

The itemsets remaining after Phase 2 are the frequent itemsets. A code sketch of the two passes follows.
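The sketch below is a simplification under stated assumptions: Phase 1 finds locally frequent single items only, and Phase 2 counts candidate pairs built from them; a full SON implementation would run a complete frequent-itemset algorithm (e.g., Apriori) on each chunk. The function name son_frequent_pairs and the chunk split are illustrative, not part of the original algorithm description.

from collections import Counter
from itertools import combinations

def son_frequent_pairs(chunks, support):
    # Phase 1 (map/reduce): locally frequent items from each chunk,
    # using the support threshold scaled down by the number of chunks.
    local_threshold = max(1, support // len(chunks))
    candidates = set()
    for chunk in chunks:
        counts = Counter(item for basket in chunk for item in basket)
        candidates.update(i for i, c in counts.items() if c >= local_threshold)

    # Phase 2 (map/reduce): count candidate pairs over the entire dataset
    # and keep only those meeting the global support threshold.
    pair_counts = Counter()
    for chunk in chunks:
        for basket in chunk:
            items = sorted(set(basket) & candidates)
            pair_counts.update(combinations(items, 2))
    return {pair: c for pair, c in pair_counts.items() if c >= support}

# The four transactions from the example below, split into two chunks:
chunks = [
    [{"bread", "milk", "eggs"}, {"bread", "butter"}],
    [{"milk", "butter"}, {"bread", "milk", "butter"}],
]
print(son_frequent_pairs(chunks, support=2))
# {('bread', 'milk'): 2, ('bread', 'butter'): 2, ('butter', 'milk'): 2}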


Example:
Let's consider a simplified market basket dataset:

Transaction 1: {bread, milk, eggs}

Transaction 2: {bread, butter}

Transaction 3: {milk, butter}

Transaction 4: {bread, milk, butter}

Let's say our support threshold is 2 (meaning an itemset must appear in at least
2 transactions to be considered frequent).
Phase 1:
Map Phase:

Suppose Machine 1 processes Transactions 1 and 2 and Machine 2 processes Transactions 3 and 4, each applying the scaled-down local threshold of 1 (the global threshold of 2 spread across 2 chunks).
Machine 1: {bread: 2, milk: 1, eggs: 1, butter: 1}
Machine 2: {milk: 2, butter: 2, bread: 1}
Reduce Phase:

Take the union of the locally frequent items from all machines.

Candidate items: {bread, milk, butter, eggs}
Phase 2:
Map Phase:

Generate candidate pairs from the candidate items and count them in each chunk.

Candidates: {bread, milk}, {bread, butter}, {milk, butter} (pairs involving eggs never reach the threshold)
Reduce Phase:

Sum the counts of the candidate pairs over the entire dataset:

{bread, milk}: 2 (Transactions 1 and 4)
{bread, butter}: 2 (Transactions 2 and 4)
{milk, butter}: 2 (Transactions 3 and 4)
Output:
All three pairs meet the support threshold of 2, so {bread, milk}, {bread, butter} and {milk, butter} are the frequent pairs.
In this example, the SON algorithm identifies the frequent itemsets with only two full passes over the data. It reduces the computational burden by filtering candidates in two phases, thereby improving the scalability of the algorithm for large datasets.

4. Explain the CURE algorithm with a suitable example.

The CURE (Clustering Using Representatives) algorithm is an iterative hierarchical clustering algorithm that aims to cluster large datasets efficiently. Unlike traditional hierarchical clustering algorithms such as agglomerative clustering, CURE does not require storing the entire dataset in memory at once, making it suitable for handling large datasets. It works by selecting representative points for each cluster to build a hierarchical clustering structure.

Steps of the CURE Algorithm:

Representative Points Selection:

Randomly sample a subset of points from the dataset as initial representative points.
These representative points serve as the centers of potential clusters.
Clustering Initialization:

Assign each point in the dataset to its nearest representative point.
Initially, each representative point is considered its own cluster.
Cluster Shrinkage:

Move each representative point a fixed fraction (the shrink factor) closer to the centroid of its assigned points.
This step "shrinks" the representatives toward the cluster center and reduces the influence of outliers.
Merge Clusters:

Merge clusters that are within a specified distance threshold of each other.
The distance between clusters can be measured using a metric such as Euclidean distance.
Repeat:

Repeat the shrinkage and merge steps until the desired number of clusters is obtained or no more clusters merge. A code sketch of one such iteration follows.
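Below is a minimal single-iteration Python sketch of these steps, a simplification under assumed parameters: one representative per cluster, a hypothetical shrink factor alpha = 0.2 and a merge threshold of 3.0. Real CURE keeps several well-scattered representatives per cluster and iterates until the target number of clusters is reached.

import math

def cure_iteration(points, reps, alpha=0.2, merge_threshold=3.0):
    # Assign every point to its nearest representative.
    clusters = {r: [] for r in reps}
    for p in points:
        nearest = min(reps, key=lambda r: math.dist(p, r))
        clusters[nearest].append(p)

    # Shrink each representative a fraction alpha toward its cluster centroid.
    shrunk = {}
    for r, members in clusters.items():
        if not members:
            shrunk[r] = members
            continue
        cx = sum(p[0] for p in members) / len(members)
        cy = sum(p[1] for p in members) / len(members)
        shrunk[(r[0] + alpha * (cx - r[0]), r[1] + alpha * (cy - r[1]))] = members

    # Merge any pair of clusters whose representatives are close enough.
    reps_list = list(shrunk)
    for i in range(len(reps_list)):
        for j in range(i + 1, len(reps_list)):
            a, b = reps_list[i], reps_list[j]
            if a in shrunk and b in shrunk and math.dist(a, b) < merge_threshold:
                shrunk[a] = shrunk[a] + shrunk.pop(b)
    return shrunk

points = [(1, 2), (2, 3), (2, 4), (3, 5), (6, 8), (7, 9), (8, 7), (9, 6)]
for rep, members in cure_iteration(points, reps=[(1, 2), (8, 7)]).items():
    print(rep, members)
# The shrunken representatives stay roughly 8 units apart, so the two
# clusters remain separate, matching the example below.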
Example:
Let's illustrate the CURE algorithm with a simple dataset:

Consider the following 2D dataset:


Data Points:
(1, 2), (2, 3), (2, 4), (3, 5), (6, 8), (7, 9), (8, 7), (9, 6)
Suppose we want to cluster these points into 2 clusters using CURE.
Representative Points Selection:

Randomly select two points from the dataset as initial representatives. Let's say we select (1, 2) and (8, 7).
Clustering Initialization:

Assign each point to its nearest representative point. The initial clusters are:
Cluster 1: {(1, 2), (2, 3), (2, 4), (3, 5)}
Cluster 2: {(6, 8), (7, 9), (8, 7), (9, 6)}
Cluster Shrinkage:

Move the representative points closer to the centroids of their assigned points.
For example, the centroid of Cluster 1 is (2, 3.5), so the representative point
(1, 2) moves towards (2, 3.5).
Merge Clusters:

Check whether any clusters are within the specified distance threshold (measured with, e.g., Euclidean distance).
Here the shrunken representatives of the two clusters remain far apart, well beyond any reasonable threshold for this data, so Cluster 1 and Cluster 2 are not merged.
Repeat:

Repeat steps 3 and 4 until the desired number of clusters is achieved or clusters no longer merge. Since we want 2 clusters and no merge occurs, the algorithm stops here.
Output:
The final output of the CURE algorithm is the clustered dataset, represented by the resulting clusters.

In this example, we end up with two clusters:

Cluster 1: {(1, 2), (2, 3), (2, 4), (3, 5)}
Cluster 2: {(6, 8), (7, 9), (8, 7), (9, 6)}
These clusters represent the final grouping of points achieved by the CURE algorithm.
