
Assignment 2

Name: Harsh Mordharia Class and Roll No: BE 6/ 20


Subject: Big Data Analytics PRN: 20UF15936IT031

1. Elaborate the Bloom Filter algorithm.

A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. For example, checking the availability of a username is a set-membership problem, where the set is the list of all registered usernames. The price we pay for this efficiency is that the structure is probabilistic, which means there may be some false positive results. A false positive means the filter might report that a given username is already taken when in fact it is not.
Properties of Bloom Filters:
• Unlike a standard hash table, a Bloom filter of fixed size can represent a set with an arbitrarily large number of elements.
• Adding an element never fails. However, the false positive rate increases steadily as elements are added, until all bits in the filter are set to 1, at which point every query yields a positive result.
• Bloom filters never generate false negatives, i.e., the filter will never tell you that a username doesn’t exist when it actually does.
• Deleting elements from the filter is not possible: if we deleted a single element by clearing the bits at the indices generated by its k hash functions, we might also delete other elements. Example: if we deleted “geeks” (in the example below) by clearing the bits at indices 1, 4 and 7, we would also delete “nerd”, because the bit at index 4 would become 0 and the filter would then claim that “nerd” is not present.
Working of Bloom Filter
An empty Bloom filter is a bit array of m bits, all set to zero.
We need k hash functions to calculate the indices for a given input. When we want to add an item x to the filter, the bits at the k indices h1(x), h2(x), ..., hk(x) are set to 1, where the indices are calculated using the hash functions.
Example: Suppose we want to enter “geeks” into a filter that uses 3 hash functions and a bit array of length 10, all bits initially 0. First we calculate the hashes as follows:
h1(“geeks”) % 10 = 1
h2(“geeks”) % 10 = 4
h3(“geeks”) % 10 = 7
Note: these outputs are invented for explanation only.
Now we set the bits at indices 1, 4 and 7 to 1.

Next we want to enter “nerd”. Similarly, we calculate the hashes:
h1(“nerd”) % 10 = 3
h2(“nerd”) % 10 = 5
h3(“nerd”) % 10 = 4
and set the bits at indices 3, 5 and 4 to 1.
Now suppose we want to check whether “geeks” is present in the filter. We perform the same hashing process: calculate the respective hashes using h1, h2 and h3 and check whether all of those indices are set to 1 in the bit array. If all the bits are set, we can say that “geeks” is probably present. If any bit at these indices is 0, then “geeks” is definitely not present.
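The add and lookup steps above translate directly into code. Below is a minimal Python sketch of a Bloom filter, illustrative only: the k hash functions are simulated by salting a single md5 digest, so the computed indices will differ from the example values 1, 4 and 7 used above.

import hashlib

class BloomFilter:
    def __init__(self, m=10, k=3):
        self.m = m              # size of the bit array
        self.k = k              # number of hash functions
        self.bits = [0] * m     # empty filter: all bits zero

    def _indices(self, item):
        # Simulate k hash functions by salting one digest with i.
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx] = 1  # set the k bits; adding never fails

    def might_contain(self, item):
        # True  -> probably present (false positives are possible)
        # False -> definitely absent (no false negatives)
        return all(self.bits[idx] for idx in self._indices(item))

bf = BloomFilter()
bf.add("geeks")
bf.add("nerd")
print(bf.might_contain("geeks"))  # True: probably present
print(bf.might_contain("cat"))    # usually False, but may be a false positive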

False Positives in Bloom Filters

The question is why we said “probably present”; where does this uncertainty come from? Let us understand it with an example. Suppose we want to check whether “cat” is present or not. We calculate the hashes using h1, h2 and h3:
h1(“cat”) % 10 = 1
h2(“cat”) % 10 = 3
h3(“cat”) % 10 = 7
If we check the bit array, the bits at these indices are set to 1, but we know that “cat” was never added to the filter. The bits at indices 1 and 7 were set when we added “geeks”, and the bit at index 3 was set when we added “nerd”.

So, because the bits at the calculated indices were already set by other items, the Bloom filter erroneously claims that “cat” is present, generating a false positive result. Depending on the application, this can be a huge downside or relatively harmless.
We can control the probability of a false positive by controlling the size of the Bloom filter: more space means fewer false positives. To decrease the probability of a false positive we have to use a larger bit array and a suitably chosen number of hash functions, which adds some latency to both adding items and checking membership.
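This trade-off can be quantified. The standard estimate is that with a bit array of m bits, k hash functions and n inserted items, the probability that a lookup of an element not in the set returns “present” is approximately

p ≈ (1 − e^(−kn/m))^k

and it is minimized by choosing k ≈ (m/n) ln 2. For the toy filter above (m = 10, k = 3, n = 2 items inserted), p ≈ (1 − e^(−0.6))^3 ≈ (0.45)^3 ≈ 0.09, i.e., roughly a 9% chance of a false positive, which is consistent with the “cat” collision we just saw.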

2.
3. Explain the SON algorithm with a suitable example.

The SON (Savasere, Omiecinski, and Navathe) algorithm is a widely used algorithm for efficiently finding frequent itemsets in large datasets. It is particularly useful in market basket analysis, where the goal is to find associations between items that frequently occur together in transactions.

Here is an explanation of the SON algorithm along with an example:


SON Algorithm:
Phase 1 - Map Phase:

Divide the dataset into chunks and distribute them across multiple machines or processors.
Each machine identifies the locally frequent itemsets by scanning its portion of the dataset and counting occurrences.
The support threshold, scaled down in proportion to the chunk size, is applied locally to filter out infrequent itemsets.
Phase 1 - Reduce Phase:

Collect the locally frequent itemsets from all machines and take their union; these become the candidate itemsets.
Any itemset that is frequent in the whole dataset must be frequent in at least one chunk, so no truly frequent itemset is missed.
Phase 2 - Map Phase:

Distribute the candidate itemsets from Phase 1 to all machines.
Each machine counts the occurrences of every candidate itemset in its chunk.
Phase 2 - Reduce Phase:

Sum the per-chunk counts for each candidate itemset over the entire dataset.
Apply the global support threshold to filter out infrequent itemsets.
Output:

The itemsets remaining after Phase 2 are the frequent itemsets. A code sketch of the two passes follows.
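The sketch below is a simplification under stated assumptions: Phase 1 finds locally frequent single items only, and Phase 2 counts candidate pairs built from them; a full SON implementation would run a complete frequent-itemset algorithm (e.g., Apriori) on each chunk. The function name son_frequent_pairs and the chunk split are illustrative, not part of the original algorithm description.

from collections import Counter
from itertools import combinations

def son_frequent_pairs(chunks, support):
    # Phase 1 (map/reduce): locally frequent items from each chunk,
    # using the support threshold scaled down by the number of chunks.
    local_threshold = max(1, support // len(chunks))
    candidates = set()
    for chunk in chunks:
        counts = Counter(item for basket in chunk for item in basket)
        candidates.update(i for i, c in counts.items() if c >= local_threshold)

    # Phase 2 (map/reduce): count candidate pairs over the entire dataset
    # and keep only those meeting the global support threshold.
    pair_counts = Counter()
    for chunk in chunks:
        for basket in chunk:
            items = sorted(set(basket) & candidates)
            pair_counts.update(combinations(items, 2))
    return {pair: c for pair, c in pair_counts.items() if c >= support}

# The four transactions from the example below, split into two chunks:
chunks = [
    [{"bread", "milk", "eggs"}, {"bread", "butter"}],
    [{"milk", "butter"}, {"bread", "milk", "butter"}],
]
print(son_frequent_pairs(chunks, support=2))
# {('bread', 'milk'): 2, ('bread', 'butter'): 2, ('butter', 'milk'): 2}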


Example:
Let's consider a simplified market basket dataset:

Transaction 1: {bread, milk, eggs}

Transaction 2: {bread, butter}

Transaction 3: {milk, butter}

Transaction 4: {bread, milk, butter}

Let's say our support threshold is 2 (meaning an itemset must appear in at least
2 transactions to be considered frequent).
Phase 1:
Map Phase:

Suppose Machine 1 processes Transactions 1 and 2 and Machine 2 processes Transactions 3 and 4, each applying the scaled-down local threshold of 1 (the global threshold of 2 spread across 2 chunks).
Machine 1: {bread: 2, milk: 1, eggs: 1, butter: 1}
Machine 2: {milk: 2, butter: 2, bread: 1}
Reduce Phase:

Take the union of the locally frequent items from all machines.

Candidate items: {bread, milk, butter, eggs}
Phase 2:
Map Phase:

Generate candidate pairs from the candidate items and count them in each chunk.

Candidates: {bread, milk}, {bread, butter}, {milk, butter} (pairs involving eggs never reach the threshold)
Reduce Phase:

Sum the counts of the candidate pairs over the entire dataset:

{bread, milk}: 2 (Transactions 1 and 4)
{bread, butter}: 2 (Transactions 2 and 4)
{milk, butter}: 2 (Transactions 3 and 4)
Output:
All three pairs meet the support threshold of 2, so {bread, milk}, {bread, butter} and {milk, butter} are the frequent pairs.
In this example, the SON algorithm identifies the frequent itemsets with only two full passes over the data. It reduces the computational burden by filtering candidates in two phases, thereby improving the scalability of the algorithm for large datasets.

4. Explain the CURE algorithm with a suitable example.

The CURE (Clustering Using Representatives) algorithm is an iterative hierarchical clustering algorithm that aims to cluster large datasets efficiently. Unlike traditional hierarchical clustering algorithms such as agglomerative clustering, CURE does not require storing the entire dataset in memory at once, making it suitable for handling large datasets. It works by selecting representative points for each cluster to build a hierarchical clustering structure.

Steps of the CURE Algorithm:

Representative Points Selection:

Randomly sample a subset of points from the dataset as initial representative points.
These representative points serve as the centers of potential clusters.
Clustering Initialization:

Assign each point in the dataset to its nearest representative point.
Initially, each representative point is considered its own cluster.
Cluster Shrinkage:

Move each representative point a fixed fraction (the shrink factor) closer to the centroid of its assigned points.
This step "shrinks" the representatives toward the cluster center and reduces the influence of outliers.
Merge Clusters:

Merge clusters that are within a specified distance threshold of each other.
The distance between clusters can be measured using a metric such as Euclidean distance.
Repeat:

Repeat the shrinkage and merge steps until the desired number of clusters is obtained or no more clusters merge. A code sketch of one such iteration follows.
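Below is a minimal single-iteration Python sketch of these steps, a simplification under assumed parameters: one representative per cluster, a hypothetical shrink factor alpha = 0.2 and a merge threshold of 3.0. Real CURE keeps several well-scattered representatives per cluster and iterates until the target number of clusters is reached.

import math

def cure_iteration(points, reps, alpha=0.2, merge_threshold=3.0):
    # Assign every point to its nearest representative.
    clusters = {r: [] for r in reps}
    for p in points:
        nearest = min(reps, key=lambda r: math.dist(p, r))
        clusters[nearest].append(p)

    # Shrink each representative a fraction alpha toward its cluster centroid.
    shrunk = {}
    for r, members in clusters.items():
        if not members:
            shrunk[r] = members
            continue
        cx = sum(p[0] for p in members) / len(members)
        cy = sum(p[1] for p in members) / len(members)
        shrunk[(r[0] + alpha * (cx - r[0]), r[1] + alpha * (cy - r[1]))] = members

    # Merge any pair of clusters whose representatives are close enough.
    reps_list = list(shrunk)
    for i in range(len(reps_list)):
        for j in range(i + 1, len(reps_list)):
            a, b = reps_list[i], reps_list[j]
            if a in shrunk and b in shrunk and math.dist(a, b) < merge_threshold:
                shrunk[a] = shrunk[a] + shrunk.pop(b)
    return shrunk

points = [(1, 2), (2, 3), (2, 4), (3, 5), (6, 8), (7, 9), (8, 7), (9, 6)]
for rep, members in cure_iteration(points, reps=[(1, 2), (8, 7)]).items():
    print(rep, members)
# The shrunken representatives stay roughly 8 units apart, so the two
# clusters remain separate, matching the example below.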
Example:
Let's illustrate the CURE algorithm with a simple dataset:

Consider the following 2D dataset:


Data Points:
(1, 2), (2, 3), (2, 4), (3, 5), (6, 8), (7, 9), (8, 7), (9, 6)
Suppose we want to cluster these points into 2 clusters using CURE.
Representative Points Selection:

Randomly select two points from the dataset as initial representatives. Let's say we select (1, 2) and (8, 7).
Clustering Initialization:

Assign each point to its nearest representative point. The initial clusters are:
Cluster 1: {(1, 2), (2, 3), (2, 4), (3, 5)}
Cluster 2: {(6, 8), (7, 9), (8, 7), (9, 6)}
Cluster Shrinkage:

Move the representative points closer to the centroids of their assigned points.
For example, the centroid of Cluster 1 is (2, 3.5), so the representative point
(1, 2) moves towards (2, 3.5).
Merge Clusters:

Check whether any clusters are within the specified distance threshold (measured with, e.g., Euclidean distance).
Here the shrunken representatives of the two clusters remain far apart, well beyond any reasonable threshold for this data, so Cluster 1 and Cluster 2 are not merged.
Repeat:

Repeat steps 3 and 4 until the desired number of clusters is achieved or clusters no longer merge. Since we want 2 clusters and no merge occurs, the algorithm stops here.
Output:
The final output of the CURE algorithm is the clustered dataset, represented by the resulting clusters.

In this example, we end up with two clusters:

Cluster 1: {(1, 2), (2, 3), (2, 4), (3, 5)}
Cluster 2: {(6, 8), (7, 9), (8, 7), (9, 6)}
These clusters represent the final grouping of points achieved by the CURE algorithm.
