
Probabilistic Data Structures
DR. HAMED ABDELHAQ
Outline

- What are Probabilistic Data Structures
- Examples of Probabilistic Data Structures
  - Bloom Filters
  - Locality Sensitive Hashing
Motivation

- When processing large data sets, we might need to perform simple tasks:
  - counting the number of unique items
  - checking whether certain items exist in the data set
- Using deterministic data structures with big data
  - can be very expensive, or even infeasible
  - the data may not fit in memory
Motivation: Probabilistic Data Structures

- a group of data structures that are extremely useful for big data
  - i.e., when the data we are dealing with becomes very large, e.g., arriving from streaming applications
- employ hash functions to randomize and compactly represent a set of items
- use much less memory and have constant query time

When can they be used?

- Membership: checking whether some items exist in the data set
  - e.g., Bloom filters
- Searching: searching for similar items
  - e.g., locality sensitive hashing
- Frequency: counting the most frequent items
  - e.g., Count-Min Sketch
- Cardinality: estimating the number of distinct elements
  - e.g., HyperLogLog
Bloom filter

- Solves the approximate set-membership problem
- Uses the concept of hash tables
- Fast insertion
- Fast look-ups
- So, why Bloom filters?
  - Pros:
    - much more space efficient than hash sets
  - Cons:
    - cannot store associated objects
    - no deletions
    - allows for errors: a non-zero false positive probability
Applications of Bloom filters

- Spell checking
- Keeping track of a list of forbidden passwords
- Network routers
  - limited memory, and you need to be super fast
  - e.g., keeping track of a lot of IP addresses
Ingredients of Bloom filters

- Bloom filters have two components:
  1. An array A of n entries, where each entry is a single bit.
     - Suppose we have a set S = {s1, s2, ..., sm} to be inserted into the array.
     - Thus, the number of bits per element is n/m.
  2. A set of k hash functions: h1, ..., hk.
- Now, we need to answer a question like "Is x an element of S?"
  - If x ∈ S, we must answer yes.
Operations

1. Initially, set all entries of the array to 0.
2. For every s ∈ S, set A[hi(s)] = 1 for 1 ≤ i ≤ k
   (an entry can be set to 1 multiple times; only the first time has an effect).
3. To check if x ∈ S:
   - check whether all locations A[hi(x)] for 1 ≤ i ≤ k are set to 1
   - if not, clearly x ∉ S
   - if all A[hi(x)] are set to 1, we assume x ∈ S
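A minimal Python sketch of these operations; the BloomFilter class and the salted-SHA-256 trick for simulating k independent hash functions are illustrative assumptions, not taken from the slides:

```python
import hashlib

class BloomFilter:
    """An n-bit array plus k hash functions, mirroring the operations above."""

    def __init__(self, n_bits: int, k_hashes: int):
        self.n = n_bits
        self.k = k_hashes
        self.bits = [0] * n_bits

    def _positions(self, item: str):
        # Simulate k independent hash functions by salting one SHA-256
        # digest with the function index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.n

    def add(self, item: str) -> None:
        # Operation 2: set A[h_i(item)] = 1 for 1 <= i <= k.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: str) -> bool:
        # Operation 3: if any location is 0, the item is definitely absent;
        # if all are 1, it is probably present (false positives possible).
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(n_bits=1000, k_hashes=5)
bf.add("10.1.1.7")
print("10.1.1.7" in bf)    # True
print("10.9.9.9" in bf)    # almost certainly False
```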
Possibility of errors

[Figure: a bit array after inserting elements x1 and x2, with a query element y hashing into some of the same positions]

- Initially, all entries are set to 0.
- Each element of S is hashed k times; each hash location is set to 1.
- To check if y is in S, check its k hash locations:
  - if a 0 appears, y is not in S
  - if only 1s appear, conclude that y is in S
  - this may yield a false positive
Performance of Bloom Filters

- The probability of a false positive depends on:
  - the density of 1s in the array
  - the number of hash functions
- The probability that a given bit is still 0 after all insertions is
  p = (1 - 1/n)^(km) ≈ e^(-km/n)
- The number of 1s is approximately the number of inserted elements times the number of hash functions.
  - Collisions lower this slightly.
Estimating error probability

- Probability of a false positive:
  f = (1 - p)^k ≈ (1 - e^(-km/n))^k
- To find the optimal k that minimizes f, minimize g = ln(f):
  dg/dk = ln(1 - e^(-km/n)) + (km/n) · e^(-km/n) / (1 - e^(-km/n))
  ⇒ k = ln(2) · (n/m)
  ⇒ f = (1/2)^k ≈ (0.6185...)^(n/m)
- The false positive probability falls exponentially in n/m, the number of bits used per item!
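A quick numeric check of this result, under assumed illustrative sizes n and m: scanning integer values of k shows the false positive rate bottoming out next to k = ln(2)·(n/m).

```python
import math

def false_positive_rate(n: int, m: int, k: int) -> float:
    # f ≈ (1 - e^(-km/n))^k from the slide.
    return (1 - math.exp(-k * m / n)) ** k

n, m = 1_000_000, 100_000            # 10 bits per item (illustrative numbers)
k_opt = math.log(2) * n / m          # ≈ 6.93

for k in range(1, 13):
    print(k, f"{false_positive_rate(n, m, k):.5f}")
# The minimum sits at k = 7, next to k_opt, and f ≈ (0.6185)^(n/m) ≈ 0.0082.
print("optimal k ≈", round(k_opt, 2))
```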
Example

- Suppose we use an array of n = 1 billion bits, k = 5 hash functions, and m = 100 million elements.
- Fraction of zeros ≈ e^(-km/n) = e^(-0.5) ≈ 0.607
- Fraction of 1s = 1 - 0.607 = 0.393
- Probability of false positive = (0.393)^5 ≈ 0.0094
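The same arithmetic in Python:

```python
import math

n = 1_000_000_000    # bits
k = 5                # hash functions
m = 100_000_000      # inserted elements

frac_zeros = math.exp(-k * m / n)    # e^(-0.5) ≈ 0.607
frac_ones = 1 - frac_zeros           # ≈ 0.393
print(frac_zeros, frac_ones, frac_ones ** k)   # false positive ≈ 0.0094
```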
Locality Sensitive Hashing

- Finding exact duplicate documents in a list may look like a simple task
  - just use a hash table
- But when documents differ slightly, e.g., through typos or different words,
  - the problem becomes much more complex
Jaccard Similarity

- Ex) Our use-case example:
  1. "Who was the first president of Palestine"
  2. "Who was the first ruler of Palestine"
  3. "Who was the first king of Jordan"
- Treating each question as a set of words:
  Jaccard similarity(q1, q2) = |q1 ∩ q2| / |q1 ∪ q2| = 6/8 = 0.75
- The more common words, the bigger the Jaccard index, and the more probable it is that the two questions are duplicates.
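A direct Python rendering of this computation:

```python
def jaccard(a: str, b: str) -> float:
    # Jaccard index: |intersection| / |union| of the two word sets.
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

q1 = "Who was the first president of Palestine"
q2 = "Who was the first ruler of Palestine"
print(jaccard(q1, q2))   # 6 shared words out of 8 total -> 0.75
```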
Minhash Signatures

- Jaccard can be a good string metric; however, we need to
  - split each question into words
  - compare the two sets
  - repeat for every pair
- The number of pairs grows rapidly
- Instead, create a simple fixed-size numeric fingerprint (signature) for each sentence
  - called a minhash signature
Creating Minhashes

- To calculate a MinHash:
  - create the dictionary (the set of all words) from all our questions
  - create a random permutation of it
- Back to our use-case example, the set of words we have is:
  (Who, was, the, first, president, of, Palestine, ruler, king, Jordan)
Creating Minhashes

1. "Who was the first president of Palestine"
2. "Who was the first ruler of Palestine"
3. "Who was the first king of Jordan"

1st permutation:

Index  Word        Q1   Q2   Q3
1      ruler            1
2      of          2    2    2
3      the         3    3    3
4      first       4    4    4
5      president   5
6      Who         6    6    6
7      Jordan                7
8      was         8    8    8
9      king                  9
10     Palestine   10   10
2nd permutation:

Index  Word        Q1   Q2   Q3
1      president   1
2      king                  2
3      Jordan                3
4      first       4    4    4
5      ruler            5
6      Palestine   6    6
7      the         7    7    7
8      was         8    8    8
9      of          9    9    9
10     Who         10   10   10
3rd permutation:

Index  Word        Q1   Q2   Q3
1      Jordan                1
2      Who         2    2    2
3      king                  3
4      first       4    4    4
5      of          5    5    5
6      Palestine   6    6
7      president   7
8      was         8    8    8
9      the         9    9    9
10     ruler            10
Resulting Minhash Signatures

- For each permutation, a question's minhash is the smallest index among the words it contains. Trying 3 more permutations, we might end up having the following minhashes:

  MinHash(Q1) = [2, 1, 2, 2, 1, 1]
  MinHash(Q2) = [1, 4, 4, 2, 1, 1]
  MinHash(Q3) = [2, 2, 1, 4, 4, 1]
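A sketch of this permutation-based construction; since the permutations below are drawn from an arbitrary fixed seed, the exact signature values will differ from the slide's, but the equal-band behavior that LSH relies on is the same. The function name and seed are illustrative.

```python
import random

def minhash_signatures(sentences, num_perms=6, seed=0):
    """For each random permutation of the vocabulary, a sentence's minhash
    is the smallest permuted index among the words it contains."""
    vocab = sorted({w for s in sentences for w in s.split()})
    rng = random.Random(seed)        # fixed seed only for reproducibility
    signatures = [[] for _ in sentences]
    for _ in range(num_perms):
        perm = vocab[:]
        rng.shuffle(perm)
        rank = {word: i + 1 for i, word in enumerate(perm)}   # 1-based index
        for sig, sentence in zip(signatures, sentences):
            sig.append(min(rank[w] for w in sentence.split()))
    return signatures

questions = [
    "Who was the first president of Palestine",
    "Who was the first ruler of Palestine",
    "Who was the first king of Jordan",
]
for q, sig in zip(questions, minhash_signatures(questions)):
    print(sig, "<-", q)
```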
Locality Sensitive Hashing (LSH) for Minhash Signatures

- Problem: finding questions similar to a given question is computationally expensive
  - even when using minhash signatures
- Solution: "hash" items several times, in such a way that
  - similar items are more likely to be hashed to the same bucket than dissimilar items are
  - any pair hashed to the same bucket by any of the hashings becomes a candidate pair
- false positives: dissimilar pairs in the same bucket
- false negatives: similar pairs in different buckets
LSH - Minhash Signature Partitioning

- Divide the signature into b bands, consisting of r rows each
  - this increases the chance of having bands with identical partitions
  - identical partitions will then be mapped to the same bucket
- With b = 2 and r = 3:
  MinHash(Q1) = [2, 1, 2, 2, 1, 1] => [2, 1, 2] [2, 1, 1]
  MinHash(Q2) = [1, 4, 4, 2, 1, 1] => [1, 4, 4] [2, 1, 1]
  MinHash(Q3) = [2, 2, 1, 4, 4, 1] => [2, 2, 1] [4, 4, 1]
LSH – Mapping elements to buckets

- For each band, use a hash function that
  - takes vectors of r integers
  - hashes them to some large number of buckets
- We can use a different hash function for each band
- In the example above, Q1 and Q2 share the identical second band [2, 1, 1], so they land in the same bucket and become a candidate pair (see the sketch below)
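A minimal sketch of both steps, banding plus bucketing, run on the example signatures; keying the bucket table by (band, chunk) stands in for using a different hash function per band:

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, b, r):
    """Split each signature into b bands of r rows; signatures with an
    identical band fall into the same bucket and become candidate pairs."""
    buckets = defaultdict(list)
    for doc_id, sig in enumerate(signatures):
        for band in range(b):
            chunk = tuple(sig[band * r:(band + 1) * r])
            buckets[(band, chunk)].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(ids, 2))
    return pairs

sigs = [[2, 1, 2, 2, 1, 1],   # Q1
        [1, 4, 4, 2, 1, 1],   # Q2
        [2, 2, 1, 4, 4, 1]]   # Q3
print(lsh_candidate_pairs(sigs, b=2, r=3))   # {(0, 1)}: Q1, Q2 share [2, 1, 1]
```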
Analysis of the banding technique

- Goal: find the probability that a pair of similar documents is mapped to the same bucket
- Assume the Jaccard similarity between them is s
  - s is also the probability that their signatures agree in any one row
- The probability that the signatures agree in all r rows of one particular band is s^r
- The probability that the signatures disagree in at least one row of a particular band is 1 - s^r
- The probability that the signatures disagree in at least one row of each of the b bands is (1 - s^r)^b
Analysis of the banding technique (cont.)

- The probability that the signatures agree in all the rows of at least one band, and therefore become a candidate pair, is
  1 - (1 - s^r)^b
- As a function of s, this is an S-curve: low-similarity pairs rarely become candidates, while high-similarity pairs almost always do
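Evaluating this formula shows the S-curve; the r and b below match the example's 6-element signatures split into 2 bands of 3 rows:

```python
def candidate_probability(s: float, r: int, b: int) -> float:
    # 1 - (1 - s^r)^b: chance that at least one band agrees in all rows.
    return 1 - (1 - s ** r) ** b

for s in (0.2, 0.4, 0.6, 0.75, 0.9):
    print(s, round(candidate_probability(s, r=3, b=2), 3))
# Prints 0.016, 0.124, 0.385, 0.666, 0.927: the probability of becoming
# a candidate pair rises steeply with the Jaccard similarity s.
```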
