Probabilistic Data Structures
DR. HAMED ABDELHAQ
Outline
A group of data structures that are extremely useful for big data,
i.e., when the data we are dealing with becomes very large, e.g., when it arrives from streaming
applications
Membership:
Checking whether some items exist in the data set
E.g., Bloom filters
Frequency:
Counting most frequent items
E.g., Count-Min Sketch
Cardinality:
Estimating the number of distinct elements
E.g., HyperLogLog
Searching:
Searching for similar items
E.g., locality-sensitive hashing
Bloom filter
Approximate set-membership problem
Use the concept of hash tables
Fast in insertion
Fast in look-ups
So, why Bloom filters?
+ve:
Much more space efficient than hash sets
-ve:
Cannot store associated objects
No deletions
Allow for errors: a non-zero false positive probability
Applications of Bloom filters
Spell checking
Keep track of a list of forbidden passwords
Network routers
Limited memory, and lookups need to be very fast
E.g., keeping track of a large number of IP addresses
Ingredients of Bloom filters
[Figure: elements x1 and x2 of S, and a query element y, each hashed to locations in a bit array]
Initially, the bit array is all 0s.
Each element of S is hashed k times.
Each hash location is set to 1.
To check if y is in S, check the k hash locations of y.
If only 1s appear, conclude that y is in S. This may yield a false positive.
If a 0 appears, y is not in S.
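The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the class name `BloomFilter`, the method names, and the choice of deriving the k hash locations from salted SHA-256 digests are all assumptions made for this sketch.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: a bit array of size m and k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = [0] * m  # initially, the bit array is all 0s

    def _locations(self, item: str):
        # Assumption for this sketch: simulate k hash functions by salting
        # SHA-256 with the index of the hash function.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # Each element is hashed k times; each hash location is set to 1.
        for loc in self._locations(item):
            self.bits[loc] = 1

    def might_contain(self, item: str) -> bool:
        # Only 1s at all k locations: "probably in S" (may be a false positive).
        # A 0 at any location: definitely not in S.
        return all(self.bits[loc] == 1 for loc in self._locations(item))


bf = BloomFilter(m=1024, k=3)
for password in ["123456", "password", "qwerty"]:
    bf.add(password)

print(bf.might_contain("password"))  # True: inserted items are always found
print(bf.might_contain("correct horse battery staple"))  # False unless a false positive occurs
```

Note that the filter stores only bits, never the items themselves, which is where the space saving over a hash set comes from, and also why associated objects cannot be stored and items cannot be deleted.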
Performance of Bloom Filters
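The more words two questions have in common, the bigger their Jaccard index (the size of the
intersection of their word sets divided by the size of the union), and the more probable it is
that the two questions are duplicates.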
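As a small illustration, the Jaccard index of two questions can be computed from their word sets. The sketch below uses the three example questions from the MinHash slides of this deck; the function name `jaccard`, the whitespace tokenisation, and the convention for two empty sets are assumptions of this sketch.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index: |A intersect B| / |A union B|."""
    union = a | b
    if not union:
        return 0.0  # convention chosen for this sketch when both sets are empty
    return len(a & b) / len(union)


# Tokenise by whitespace (an assumption; real text would need normalisation).
q1 = set("Who was the first president of Palestine".split())
q2 = set("Who was the first ruler of Palestine".split())
q3 = set("Who was the first king of Jordan".split())

print(jaccard(q1, q2))            # 0.75  (6 shared words out of 8 distinct words)
print(round(jaccard(q1, q3), 3))  # 0.556 (5 shared words out of 9 distinct words)
```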
Minhash Signatures
To calculate MinHash:
Create the dictionary (the set of all words) from all our questions.
Create a random permutation of the dictionary.
Index  Word       Q1  Q2  Q3
1      ruler      -   1   -
2      of         2   2   2
3      the        3   3   3
4      first      4   4   4
5      president  5   -   -
6      Who        6   6   6
7      Jordan     -   -   7
8      was        8   8   8
9      king       -   -   9
10     Palestine  10  10  -
Creating Minhashes
1. “Who was the first president of Palestine”
2. “Who was the first ruler of Palestine”
3. “Who was the first king of Jordan”
2nd permutation:
Index  Word       Q1  Q2  Q3
1      president  1   -   -
2      king       -   -   2
3      Jordan     -   -   3
4      first      4   4   4
5      ruler      -   5   -
6      Palestine  6   6   -
7      the        7   7   7
8      was        8   8   8
9      of         9   9   9
10     Who        10  10  10
3rd permutation:
Index  Word       Q1  Q2  Q3
1      Jordan     -   -   1
2      Who        2   2   2
3      king       -   -   3
4      first      4   4   4
5      of         5   5   5
6      Palestine  6   6   -
7      president  -   -   7
8      was        8   8   8
9      the        9   9   9
10     ruler      -   10  -
Resulting Minhash Signatures
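The signatures can be reproduced from the three permutation tables: for each permutation, the MinHash value of a question is the smallest index of any word the question contains. The sketch below is one way to compute this; the function names and the whitespace tokenisation are assumptions of this sketch, while the questions and word orders are taken from the tables above.

```python
questions = {
    "Q1": "Who was the first president of Palestine",
    "Q2": "Who was the first ruler of Palestine",
    "Q3": "Who was the first king of Jordan",
}

# Word orders of the three permutations, as listed in the tables.
permutations = [
    ["ruler", "of", "the", "first", "president", "Who", "Jordan", "was", "king", "Palestine"],
    ["president", "king", "Jordan", "first", "ruler", "Palestine", "the", "was", "of", "Who"],
    ["Jordan", "Who", "king", "first", "of", "Palestine", "president", "was", "the", "ruler"],
]


def minhash_signature(text: str, perms: list) -> list:
    """For each permutation, take the smallest 1-based index of a word in the text."""
    words = set(text.split())
    return [
        min(i for i, word in enumerate(perm, start=1) if word in words)
        for perm in perms
    ]


def estimated_similarity(sig_a: list, sig_b: list) -> float:
    """Fraction of positions in which two signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


signatures = {name: minhash_signature(text, permutations) for name, text in questions.items()}
for name, sig in signatures.items():
    print(name, sig)
# Q1 [2, 1, 2]
# Q2 [1, 4, 2]
# Q3 [2, 2, 1]
```

With only three permutations the estimate is crude: `estimated_similarity(signatures["Q1"], signatures["Q2"])` is 1/3, whereas the exact Jaccard index of Q1 and Q2 is 6/8 = 0.75. The estimate tends towards the true value as more permutations are used.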
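Finding the probability that a pair of documents is mapped to the same bucket
(i.e., treated as similar documents)
Assume the signatures are divided into $b$ bands of $r$ rows each, and the Jaccard similarity between the two documents is $s$.
The probability that the signatures agree in all rows of one particular band is $s^r$.
The probability that the signatures disagree in at least one row of a particular band
is $1 - s^r$.
The probability that the signatures disagree in at least one row of each of the
bands is $(1 - s^r)^b$.
Hence, the probability that the signatures agree in all rows of at least one band, so that the pair is mapped to the same bucket, is $1 - (1 - s^r)^b$.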
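To get a feel for these probabilities, the chain of steps can be evaluated numerically. The sketch below assumes the usual banding setup, with the signature split into b bands of r rows each and s the Jaccard similarity of the pair; the function name `candidate_probability` and the values r = 5, b = 20 are illustrative choices, not taken from the slides.

```python
def candidate_probability(s: float, r: int, b: int) -> float:
    """Probability that two documents with Jaccard similarity s share at least one band."""
    agree_one_band = s ** r                 # all r rows of one band agree
    disagree_one_band = 1.0 - agree_one_band
    disagree_all_bands = disagree_one_band ** b
    return 1.0 - disagree_all_bands         # agree in at least one band


# Illustrative parameters: 20 bands of 5 rows (a signature of length 100).
for s in (0.2, 0.4, 0.6, 0.8):
    print(s, round(candidate_probability(s, r=5, b=20), 4))
```

Plotted against s, the result is an S-shaped curve: pairs with low similarity are rarely mapped to the same bucket, while pairs with high similarity almost always are.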
Analysis of the banding technique