Bloom Filters
Kira Radinsky
Slides based on material from:
Michael Mitzenmacher and Hanoch Levy
Motivation - Cache
• Lookup questions:
Does item “x” exist in a set?
• Data set may be very big or expensive to
access. Filter lookup questions with negative
results before accessing data.
• Allow false positive errors, as they only cost us an
extra data access.
• Don’t allow false negative errors, because they
result in wrong answers.
Application of Bloom Filters:
Distributed Web Caches
[Figure: Web Cache 1 through Web Cache 6]
• Send Bloom filters of URLs.
• False positives do not hurt much.
– Get errors from cache changes anyway
Web Caching
• Summary Cache: [Fan, Cao, Almeida, & Broder]
If local caches know each other’s content...
…try local cache before going out to Web
• Sending/updating lists of URLs too expensive.
• Solution: use Bloom filters.
• False positives
– Local requests go unfulfilled.
– Small cost, big potential gain
The Problem Solved by BF:
Approximate Set Membership
• Lookup Problem: Given a set S = {x1,x2,…,xn}, construct
data structure to answer queries of the form
“Is y in S?”
• Data structure should be:
– Fast (Faster than searching through S).
– Small (Smaller than explicit representation).
• To obtain speed and size improvements, allow some
probability of error.
– False positives: $y \notin S$ but we report $y \in S$
– False negatives: $y \in S$ but we report $y \notin S$
Bloom Filters
Start with an m bit array, filled with 0s.
Hash each item xj in S k times. If Hi(xj) = a, set B[a] = 1.
B = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
B = 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0
To check if y is in S, check B at Hi(y). All k values must be 1.
B = 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0
Possible to have a false positive; all k values are 1, but y is not in S.
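The insert and lookup steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the class name `BloomFilter`, the sizes m = 16 and k = 3, the example URLs, and the way the k hash functions are simulated (SHA-256 salted with the index i) are assumptions, not part of the original slides.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: an m-bit array B and k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = [0] * m  # B, initially filled with 0s

    def _positions(self, item: str) -> list[int]:
        # Assumption: H_i(x) is simulated by salting SHA-256 with the index i.
        positions = []
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            positions.append(int.from_bytes(digest[:8], "big") % self.m)
        return positions

    def add(self, item: str) -> None:
        for a in self._positions(item):
            self.bits[a] = 1  # if H_i(x) = a, set B[a] = 1

    def __contains__(self, item: str) -> bool:
        # All k positions must be 1; a single 0 proves the item was never added.
        return all(self.bits[a] == 1 for a in self._positions(item))


bf = BloomFilter(m=16, k=3)
for url in ["a.example/x", "b.example/y"]:
    bf.add(url)

print("a.example/x" in bf)  # an inserted item is always reported as present
```

A lookup for an item that was never added can still return True if other items happened to set all k of its bits, which is exactly the false positive case described above.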
Bloom Filter
[Figure: element x is hashed by h1(x), h2(x), h3(x), …, hk(x) to positions in the bit vector V0 … Vm-1 (shown as 01000 10100 00010).]
Advantages
• No Overflow
• Union and intersection of Bloom filters
– Simple bitwise OR and AND operations
• Applications:
– Google BigTable
– The Squid Web Proxy Cache uses Bloom filters for
cache digests.
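As a rough illustration of the union and intersection property, the sketch below stores each filter as a Python integer used as a bit vector. The sizes `M` and `K`, the helper names `positions` and `build`, the salted SHA-256 hash family and the sample items are all assumptions made for the example. Both filters must share the same m, k and hash functions for the bitwise operations to be meaningful.

```python
import hashlib

M, K = 64, 3  # illustrative sizes; both filters must use the same m and k


def positions(item: str) -> list[int]:
    # Hypothetical hash family: SHA-256 salted with the hash index.
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big") % M
        for i in range(K)
    ]


def build(items: set[str]) -> int:
    """Return the filter for `items` as a Python int used as an M-bit vector."""
    bits = 0
    for item in items:
        for a in positions(item):
            bits |= 1 << a
    return bits


s1 = {"url1", "url2", "url3"}
s2 = {"url3", "url4"}
f1, f2 = build(s1), build(s2)

union_filter = f1 | f2         # identical to the filter built from s1 | s2
intersection_filter = f1 & f2  # covers s1 & s2, but may carry extra set bits
```

The OR result is exactly the filter of the union. The AND result contains every bit of the filter built directly from the intersection, but it can have additional bits set, so its false positive rate may be somewhat higher.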
Bloom Errors
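[Figure: elements a, b, c, d have been inserted into the bit vector V0 … Vm-1 (01000 10100 00010); the positions h1(x), h2(x), h3(x), …, hk(x) are shown against it.]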
x didn’t appear, yet its bits are already set
Example
[Figure: false positive rate (0 to 0.1) versus number of hash functions (0 to 10), for m/n = 8.]
Optimal k = 8 ln 2 ≈ 5.545
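The curve for m/n = 8 can be reproduced approximately with the standard estimate f ≈ (1 − e^(−kn/m))^k. That formula is not stated on the slide and is assumed here, as is the function name `false_positive_rate`.

```python
import math


def false_positive_rate(bits_per_item: float, k: int) -> float:
    """Standard approximation f = (1 - e^(-k / (m/n)))^k (assumed, not from the slide)."""
    return (1.0 - math.exp(-k / bits_per_item)) ** k


bits_per_item = 8  # m/n = 8, as in the example
optimal_k = bits_per_item * math.log(2)  # real-valued optimum, about 5.545

rates = {k: false_positive_rate(bits_per_item, k) for k in range(1, 11)}
best_k = min(rates, key=rates.get)  # best integer number of hash functions

for k, f in rates.items():
    print(f"k={k:2d}  f={f:.4f}")
print(f"optimal k = {optimal_k:.3f}, best integer k = {best_k}")
```

Under this approximation the best integer choice is k = 6, with a false positive rate of roughly 2%; k = 5 is only marginally worse.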
Tradeoffs
• Three parameters.
– Size m/n : bits per item.
• |S| = n: Number of elements to encode.
• $h_i: U \to [1..m]$ : Maintain a Bit Vector V of size m
– Time k : number of hash functions.
• Use k hash functions (h1..hk)
– Error f : false positive probability.
Bloom Filter Tradeoffs
• Three factors: m,k and n.
• Normally, n and m are given, and we select k.
• Small k
– Fewer computations.
– Actual number of bits accessed (nk) is smaller, so the chance of a “step
over” is smaller too.
– However, fewer bits need to be stepped over to generate an error.
• For big k, the exact opposite holds.
• Not surprisingly, when k is optimal, the “hit ratio” (ratio of bits
flipped in the array) is exactly 0.5
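The claim about the fraction of set bits can be checked with a small simulation. The sketch assumes ideal hash functions (each insertion sets k uniformly random positions) and uses illustrative sizes m = 80,000 and n = 10,000; the function name `fill_ratio` is hypothetical. Because k must be an integer, the simulated ratio lands slightly above 0.5 rather than exactly on it.

```python
import math
import random


def fill_ratio(m: int, n: int, k: int, seed: int = 0) -> float:
    """Simulate n insertions, each setting k uniformly random bits of an m-bit array."""
    rng = random.Random(seed)
    bits = [0] * m
    for _ in range(n * k):
        bits[rng.randrange(m)] = 1
    return sum(bits) / m


m, n = 80_000, 10_000                       # m/n = 8 bits per item
k_opt = round((m / n) * math.log(2))        # 8 ln 2 rounds to 6
predicted = 1.0 - math.exp(-k_opt * n / m)  # expected fraction of 1 bits
observed = fill_ratio(m, n, k_opt)

print(f"k = {k_opt}, predicted fill = {predicted:.3f}, observed fill = {observed:.3f}")
```

With the real-valued optimum k = (m/n) ln 2 the predicted fill is 1 − e^(−ln 2) = 0.5, which is the value quoted above.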
Alternative Approach for
Bloom Filters: Perfect Hashing Approach
[Figure: a perfect hash function maps Element 1 … Element 5 to five cells, which hold Fingerprint(4), Fingerprint(5), Fingerprint(2), Fingerprint(1), Fingerprint(3).]
Perfect Hashing Approach
• Folklore Bloom filter construction.
– Recall: Given a set S = {x1,x2,x3,…xn} on a universe U, we want
to answer membership queries.
– Method: Find an n-cell perfect hash function for S.
• Maps set of n elements to n cells in a 1-1 manner.
– Then keep a $\lceil \log_2(1/\epsilon) \rceil$-bit fingerprint of the item in each cell.
Lookups have false positive probability < $\epsilon$.
– Advantage: each bit per item reduces false positives by a factor
of 1/2, vs. a factor of about 0.62 ($= 2^{-\ln 2}$) for a standard Bloom filter.
• Negatives:
– Perfect hash functions are non-trivial to find.
– Cannot handle on-line insertions.
Bloom Filters and Deletions
• Cache contents change
– Items both inserted and deleted.
• Insertions are easy – add bits to BF
• Can Bloom filters handle deletions?
– Use Counting Bloom Filters to track
insertions/deletions at hosts;
– Send Bloom filters.
Handling Deletions
• Bloom filters can handle insertions, but not
deletions.
• If deleting xi means resetting 1s to 0s, then
deleting xi will “delete” xj.
B = 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0
[Figure: xi and xj hashed into B]
Counting Bloom Filters
Start with an array of m counters, filled with 0s.
Hash each item xj in S k times. If Hi(xj) = a, add 1 to B[a].
B = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
B = 0 3 0 0 1 0 2 0 0 3 2 1 0 2 1 0
To delete xj, decrement the corresponding counters.
B = 0 2 0 0 0 0 2 0 0 3 2 1 0 1 1 0
Can obtain a corresponding Bloom filter by reducing to 0/1.
B = 0 1 0 0 0 0 1 0 0 1 1 1 0 1 1 0
Counting Bloom Filters: Overflow
• Must choose counters large enough to avoid
overflow.
• Poisson approximation suggests 4 bits/counter.
– With k = (ln 2)m/n hash functions, the average load per counter is nk/m = ln 2.
– Probability a counter has load at least 16: $\approx e^{-\ln 2} (\ln 2)^{16} / 16! \approx 6.78 \times 10^{-17}$
• Failsafes possible.
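The overflow estimate can be reproduced numerically. The sketch assumes counter loads follow a Poisson distribution with mean ln 2, as in the approximation above; the helper name `poisson_pmf` is hypothetical. The tail is summed directly instead of computing 1 minus the head, because the result is below double-precision resolution around 1.

```python
import math

load = math.log(2)  # average counter load when k = (ln 2) m / n


def poisson_pmf(j: int, lam: float) -> float:
    """Probability that a Poisson(lam) variable equals j."""
    return math.exp(-lam) * lam ** j / math.factorial(j)


# Leading term: probability that a counter holds exactly 16.
p_exactly_16 = poisson_pmf(16, load)

# Tail probability of load >= 16, i.e. overflow of a 4-bit counter (0..15).
# Terms beyond j = 60 are negligibly small and are dropped.
p_at_least_16 = sum(poisson_pmf(j, load) for j in range(16, 60))

print(f"P(load = 16)  ~ {p_exactly_16:.3e}")
print(f"P(load >= 16) ~ {p_at_least_16:.3e}")
```

The term for exactly 16 is the 6.78E-17 figure quoted above; the full tail is only a few percent larger, so 4 bits per counter is a comfortable choice under this model.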
Variations and Extensions
• Distance-Sensitive Bloom Filters
• Bloomier Filter
Extension: Distance-Sensitive Bloom Filters
• Instead of answering questions of the form “Is $y \in S$?”
we would like to answer questions of the form “Is $y \approx x \in S$?”
• That is, is the query close to some element of the set, under
some metric and some notion of close.
• Applications:
– DNA matching
– Virus/worm matching
– Databases
• Some initial results [Kirsch, Mitzenmacher]. Hard.
Extension: Bloomier Filter
• Bloom filters handle set membership.
• Counters to handle multi-set/count tracking.
• Bloomier filter [Chazelle, Kilian, Rubinfeld, Tal]:
– Extend to handle approximate functions.
– Each element of set has associated function value.
– Non-set elements should return null.
– Want to always return correct function value for set
elements.
– A false positive returns a function value for a non-set (null)
element.
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptxIncreasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Anoop Ashok
 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
 
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxDevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
 

Tutorial 9 (bloom filters)

  • 1. Bloom Filters Kira Radinsky Slides based on material from: Michael Mitzenmacher and Hanoch Levy
  • 2. Motivation - Cache • Lookup questions: Does item “x” exist in a set? • Data set may be very big or expensive to access. Filter lookup questions with negative results before accessing data. • Allow false positive errors, as they only cost us an extra data access. • Don’t allow false negative errors, because they result in wrong answers.
  • 3. Application of Bloom Filters: Distributed Web Caches [Figure: six web caches, Web Cache 1 through Web Cache 6] • Send Bloom filters of URLs. • False positives do not hurt much. – Get errors from cache changes anyway
  • 4. Web Caching • Summary Cache: [Fan, Cao, Almeida, & Broder] If local caches know each other’s content... …try local cache before going out to Web • Sending/updating lists of URLs too expensive. • Solution: use Bloom filters. • False positives – Local requests go unfulfilled. – Small cost, big potential gain
  • 5. The Problem Solved by BF: Approximate Set Membership • Lookup Problem: Given a set S = {x1,x2,…,xn}, construct a data structure to answer queries of the form “Is y in S?” • Data structure should be: – Fast (Faster than searching through S). – Small (Smaller than explicit representation). • To obtain speed and size improvements, allow some probability of error. – False positives: y ∉ S but we report y ∈ S – False negatives: y ∈ S but we report y ∉ S
  • 6. Bloom Filters Start with an m-bit array B, filled with 0s: B = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0. Hash each item xj in S k times. If Hi(xj) = a, set B[a] = 1: B = 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0. To check if y is in S, check B at Hi(y). All k values must be 1. A false positive is possible: all k values are 1, but y is not in S.
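The construction on this slide can be sketched in Python as follows. This is a minimal illustration, not the slides' implementation: the class name `BloomFilter` is hypothetical, and the k hash functions Hi are simulated by salting SHA-256 with the index i, which is an assumption made only for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: an m-bit array B and k hash functions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = [0] * m  # B, initially filled with 0s

    def _positions(self, item):
        # Assumption: H_i(x) is simulated by hashing the salt i together with x.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        # If H_i(x) = a, set B[a] = 1.
        for a in self._positions(item):
            self.bits[a] = 1

    def __contains__(self, item):
        # All k probed bits must be 1; a False answer is always correct,
        # a True answer may be a false positive.
        return all(self.bits[a] for a in self._positions(item))
```

With this sketch, `bf = BloomFilter(64, 3); bf.add("x")` makes `"x" in bf` true, while any query against a filter with no insertions returns false.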
  • 7. Bloom Filter [Figure: item x hashed by h1(x), h2(x), h3(x), ..., hk(x) into a bit vector V0 ... Vm-1]
  • 8. Advantages • No Overflow • Union and intersection of Bloom filters – Simple bitwise OR and AND operations • Applications: – Google BigTable – The Squid Web Proxy Cache uses Bloom filters for cache digests.
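A small Python sketch of the union and intersection bullet, assuming both filters have the same size m and were built with the same hash functions. The function names `bf_union` and `bf_intersection` are illustrative and do not come from the slides.

```python
def bf_union(a, b):
    """Bitwise OR: equals the filter built from the union of the two sets."""
    if len(a) != len(b):
        raise ValueError("filters must have the same size m")
    return [x | y for x, y in zip(a, b)]


def bf_intersection(a, b):
    """Bitwise AND: covers every item present in both sets."""
    if len(a) != len(b):
        raise ValueError("filters must have the same size m")
    return [x & y for x, y in zip(a, b)]
```

The AND result never produces a false negative for an item in both sets, but it may keep bits that a filter built directly from the intersection would not have set, so its false positive rate can be somewhat higher.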
  • 9. Bloom Errors [Figure: bit vector V0 ... Vm-1 with items a, b, c, d and the hash positions h1(x), h2(x), h3(x), ..., hk(x)] x didn’t appear, yet its bits are already set
  • 10. Example [Plot: false positive rate (vertical axis, 0 to 0.1) against number of hash functions (horizontal axis, 0 to 10), for m/n = 8] Opt k = 8 ln 2 = 5.545...
  • 11. Tradeoffs • Three parameters. – Size m/n : bits per item. • |U| = n: Number of elements to encode. • hi: U → [1..m] : Maintain a Bit Vector V of size m – Time k : number of hash functions. • Use k hash functions (h1..hk) – Error f : false positive probability.
  • 12. Bloom Filter Tradeoffs • Three factors: m, k and n. • Normally, n and m are given, and we select k. • Small k – Fewer computations. – Actual number of bits accessed (nk) is smaller, so the chance of a “step over” is smaller too. – However, fewer bits need to be stepped over to generate an error. • For big k, the exact opposite holds. • Not surprisingly, when k is optimal, the “hit ratio” (ratio of bits flipped in the array) is exactly 0.5.
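The tradeoff between m, n and k can be explored numerically. The sketch below uses the standard approximation f ≈ (1 - e^(-kn/m))^k, which the slides do not state explicitly, together with the optimum k = (m/n) ln 2 from the example slide. The function names are illustrative.

```python
import math


def false_positive_rate(bits_per_item, k):
    """Approximate false positive probability f for m/n = bits_per_item."""
    # After n insertions a given bit is still 0 with probability about e^(-kn/m).
    return (1.0 - math.exp(-k / bits_per_item)) ** k


def optimal_k(bits_per_item):
    """Real-valued k that minimises the approximation: (m/n) ln 2."""
    return bits_per_item * math.log(2)


def zero_fraction(bits_per_item, k):
    """Expected fraction of bits still 0 after all insertions."""
    return math.exp(-k / bits_per_item)


if __name__ == "__main__":
    for k in range(1, 11):
        print(k, false_positive_rate(8, k))
    print("optimal k:", optimal_k(8))
```

For m/n = 8 the real-valued optimum is 8 ln 2 ≈ 5.545, at which point half of the bits are expected to be set. Among integer choices, k = 6 gives the lowest rate under this approximation, a little under 2.2%.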
  • 13. Alternative Approach for Bloom Filters: Perfect Hashing Approach [Figure: Elements 1 to 5 mapped to cells holding Fingerprint(4), Fingerprint(5), Fingerprint(2), Fingerprint(1), Fingerprint(3)]
  • 14. Perfect Hashing Approach • Folklore Bloom filter construction. – Recall: Given a set S = {x1,x2,x3,…xn} on a universe U, we want to answer membership queries. – Method: Find an n-cell perfect hash function for S. • Maps set of n elements to n cells in a 1-1 manner. – Then keep a ⌈log2(1/ε)⌉-bit fingerprint of the item in each cell. Lookups have false positive < ε. – Advantage: each bit/item reduces false positives by a factor of 1/2, vs about 0.6185 (= 2^(-ln 2)) for a standard Bloom filter. • Negatives: – Perfect hash functions non-trivial to find. – Cannot handle on-line insertions.
  • 15. Bloom Filters and Deletions • Cache contents change – Items both inserted and deleted. • Insertions are easy – add bits to BF • Can Bloom filters handle deletions? – Use Counting Bloom Filters to track insertions/deletions at hosts; – Send Bloom filters.
  • 16. Handling Deletions • Bloom filters can handle insertions, but not deletions. • If deleting xi means resetting 1s to 0s, then deleting xi will “delete” xj whenever the two items share a bit. B = 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0 (xi and xj hash into the same array)
  • 17. Counting Bloom Filters Start with an array B of m counters, filled with 0s: B = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0. Hash each item xj in S k times. If Hi(xj) = a, add 1 to B[a]: B = 0 3 0 0 1 0 2 0 0 3 2 1 0 2 1 0. To delete xj, decrement the corresponding counters: B = 0 2 0 0 0 0 2 0 0 3 2 1 0 1 1 0. Can obtain a corresponding Bloom filter by reducing to 0/1: B = 0 1 0 0 0 0 1 0 0 1 1 1 0 1 1 0.
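A counting Bloom filter along the lines of this slide might look like the Python sketch below. The class name `CountingBloomFilter` and the salted SHA-256 hashing are assumptions for illustration, and the counters here are unbounded Python integers, so the 4-bit overflow question is not modelled.

```python
import hashlib
from collections import Counter


class CountingBloomFilter:
    """Sketch: m counters instead of m bits, so deletions become possible."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.counters = [0] * m

    def _positions(self, item):
        # Assumption: H_i(x) is simulated by hashing the salt i together with x.
        return [
            int.from_bytes(
                hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()[:8], "big"
            ) % self.m
            for i in range(self.k)
        ]

    def add(self, item):
        for a in self._positions(item):
            self.counters[a] += 1  # If H_i(x) = a, add 1 to B[a].

    def remove(self, item):
        # Refuse the deletion if the counters cannot cover it.
        needed = Counter(self._positions(item))
        if any(self.counters[a] < c for a, c in needed.items()):
            raise KeyError(item)
        for a, c in needed.items():
            self.counters[a] -= c

    def __contains__(self, item):
        return all(self.counters[a] > 0 for a in self._positions(item))

    def to_bits(self):
        # Reduce to 0/1 to obtain the corresponding plain Bloom filter.
        return [1 if c > 0 else 0 for c in self.counters]
```

The guard in `remove` only catches deletions that the counters cannot support; removing an item that was never inserted but happens to be a false positive would still silently corrupt the filter.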
  • 18. Counting Bloom Filters: Overflow • Must choose counters large enough to avoid overflow. • Poisson approximation suggests 4 bits/counter. – Average load using k = (ln 2)m/n counters is ln 2. – Probability a counter has load at least 16: ≈ e^(-ln 2) (ln 2)^16 / 16! ≈ 6.78E-17 • Failsafes possible.
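The overflow estimate can be checked numerically. The sketch below assumes, as the slide does, that the load of one counter is approximately Poisson with mean ln 2. It evaluates the leading term e^(-ln 2) (ln 2)^16 / 16! and also sums the tail for loads of 16 and above. The function names and the truncation point of the sum are choices made for this example.

```python
import math


def poisson_pmf(lam, j):
    """P(X = j) for a Poisson random variable with mean lam."""
    return math.exp(-lam) * lam ** j / math.factorial(j)


def poisson_tail(lam, j_min, terms=50):
    """P(X >= j_min), truncated after `terms` terms (they decay very quickly)."""
    return sum(poisson_pmf(lam, j) for j in range(j_min, j_min + terms))


if __name__ == "__main__":
    lam = math.log(2)  # average counter load when k = (ln 2) m / n
    print("P(load = 16): ", poisson_pmf(lam, 16))
    print("P(load >= 16):", poisson_tail(lam, 16))
```

The first value reproduces the slide's figure of about 6.78E-17. The full tail is only a few percent larger, because each further term shrinks by a factor of roughly (ln 2)/17.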
  • 19. Variations and Extensions • Distance-Sensitive Bloom Filters • Bloomier Filter
  • 20. Extension: Distance-Sensitive Bloom Filters • Instead of answering questions of the form “Is y ∈ S?” we would like to answer questions of the form “Is y ≈ x ∈ S?” • That is, is the query close to some element of the set, under some metric and some notion of close. • Applications: – DNA matching – Virus/worm matching – Databases • Some initial results [KirschMitzenmacher]. Hard.
  • 21. Extension: Bloomier Filter • Bloom filters handle set membership. • Counters to handle multi-set/count tracking. • Bloomier filter [Chazelle, Kilian, Rubinfeld, Tal]: – Extend to handle approximate functions. – Each element of set has associated function value. – Non-set elements should return null. – Want to always return correct function value for set elements. – A false positive returns a function value for a non-set element (one that should return null).