
Probabilistic Data Structures
DR. HAMED ABDELHAQ
Outline

- What are Probabilistic Data Structures
- Examples of Probabilistic Data Structures
  - Bloom Filters
  - Locality Sensitive Hashing
Motivation

- When processing large data sets, we might need to perform simple tasks:
  - counting the number of unique items
  - checking whether certain items exist in the data set
- Using deterministic data structures with big data
  - can be very expensive, or even infeasible
  - the data may not fit in memory
Motivation: Probabilistic Data Structures

- a group of data structures that are extremely useful for big data
  - i.e., when the data we are dealing with becomes very large, e.g., arriving from streaming applications
- employ hash functions to randomize and compactly represent a set of items
- use much less memory and have constant query time

When can they be used?

- Membership: checking whether some items exist in the data set
  - e.g., Bloom filters
- Searching: searching for similar items
  - e.g., locality sensitive hashing
- Frequency: counting the most frequent items
  - e.g., Count-Min Sketch
- Cardinality: estimating the number of distinct elements
  - e.g., HyperLogLog
Bloom filter

- Solves the approximate set-membership problem
- Uses the concept of hash tables
- Fast insertion
- Fast look-ups
- So, why Bloom filters?
  - Pros:
    - much more space efficient than hash sets
  - Cons:
    - cannot store associated objects
    - no deletions
    - allows for errors: a non-zero false positive probability
Applications of Bloom filters

- Spell checking
- Keeping track of a list of forbidden passwords
- Network routers
  - limited memory, and you need to be super fast
  - e.g., keeping track of a lot of IP addresses
Ingredients of Bloom filters

- Bloom filters have two components:
  1. An array A of n entries, where each entry is a single bit.
     - Suppose we have a set S = {s1, s2, ..., sm} to be inserted into the array.
     - Thus, the number of bits per element is n/m.
  2. A set of k hash functions: h1, ..., hk.
- Now, we need to answer a question like "Is x an element of S?"
  - If x ∈ S, we must answer yes.
Operations

1. Initially, set all entries of the array to 0.
2. For every s ∈ S, set A[hi(s)] = 1 for 1 ≤ i ≤ k
   (an entry can be set to 1 multiple times; only the first time has an effect).
3. To check if x ∈ S:
   - check whether all locations A[hi(x)] for 1 ≤ i ≤ k are set to 1
   - if not, clearly x ∉ S
   - if all A[hi(x)] are set to 1, we assume x ∈ S
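A minimal Python sketch of these operations; the BloomFilter class and the salted-SHA-256 trick for simulating k independent hash functions are illustrative assumptions, not taken from the slides:

```python
import hashlib

class BloomFilter:
    """An n-bit array plus k hash functions, mirroring the operations above."""

    def __init__(self, n_bits: int, k_hashes: int):
        self.n = n_bits
        self.k = k_hashes
        self.bits = [0] * n_bits

    def _positions(self, item: str):
        # Simulate k independent hash functions by salting one SHA-256
        # digest with the function index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.n

    def add(self, item: str) -> None:
        # Operation 2: set A[h_i(item)] = 1 for 1 <= i <= k.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: str) -> bool:
        # Operation 3: if any location is 0, the item is definitely absent;
        # if all are 1, it is probably present (false positives possible).
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(n_bits=1000, k_hashes=5)
bf.add("10.1.1.7")
print("10.1.1.7" in bf)    # True
print("10.9.9.9" in bf)    # almost certainly False
```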
Possibility of errors

[Figure: a bit array after inserting elements x1 and x2, with a query element y hashing into some of the same positions]

- Initially, all entries are set to 0.
- Each element of S is hashed k times; each hash location is set to 1.
- To check if y is in S, check its k hash locations:
  - if a 0 appears, y is not in S
  - if only 1s appear, conclude that y is in S
  - this may yield a false positive
Performance of Bloom Filters

- The probability of a false positive depends on:
  - the density of 1s in the array
  - the number of hash functions
- The probability that a given bit is still 0 after all insertions is
  p = (1 - 1/n)^(km) ≈ e^(-km/n)
- The number of 1s is approximately the number of inserted elements times the number of hash functions.
  - Collisions lower this slightly.
Estimating error probability

- Probability of a false positive:
  f = (1 - p)^k ≈ (1 - e^(-km/n))^k
- To find the optimal k that minimizes f, minimize g = ln(f):
  dg/dk = ln(1 - e^(-km/n)) + (km/n) · e^(-km/n) / (1 - e^(-km/n))
  ⇒ k = ln(2) · (n/m)
  ⇒ f = (1/2)^k ≈ (0.6185...)^(n/m)
- The false positive probability falls exponentially in n/m, the number of bits used per item!
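A quick numeric check of this result, under assumed illustrative sizes n and m: scanning integer values of k shows the false positive rate bottoming out next to k = ln(2)·(n/m).

```python
import math

def false_positive_rate(n: int, m: int, k: int) -> float:
    # f ≈ (1 - e^(-km/n))^k from the slide.
    return (1 - math.exp(-k * m / n)) ** k

n, m = 1_000_000, 100_000            # 10 bits per item (illustrative numbers)
k_opt = math.log(2) * n / m          # ≈ 6.93

for k in range(1, 13):
    print(k, f"{false_positive_rate(n, m, k):.5f}")
# The minimum sits at k = 7, next to k_opt, and f ≈ (0.6185)^(n/m) ≈ 0.0082.
print("optimal k ≈", round(k_opt, 2))
```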
Example

- Suppose we use an array of n = 1 billion bits, k = 5 hash functions, and m = 100 million elements.
- Fraction of zeros ≈ e^(-km/n) = e^(-0.5) ≈ 0.607
- Fraction of 1s = 1 - 0.607 = 0.393
- Probability of false positive = (0.393)^5 ≈ 0.0094
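The same arithmetic in Python:

```python
import math

n = 1_000_000_000    # bits
k = 5                # hash functions
m = 100_000_000      # inserted elements

frac_zeros = math.exp(-k * m / n)    # e^(-0.5) ≈ 0.607
frac_ones = 1 - frac_zeros           # ≈ 0.393
print(frac_zeros, frac_ones, frac_ones ** k)   # false positive ≈ 0.0094
```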
Locality Sensitive Hashing

- Finding exact duplicate documents in a list may look like a simple task
  - just use a hash table
- But when documents differ slightly, e.g., through typos or different words,
  - the problem becomes much more complex
Jaccard Similarity

- Ex) Our use-case example:
  1. "Who was the first president of Palestine"
  2. "Who was the first ruler of Palestine"
  3. "Who was the first king of Jordan"
- Treating each question as a set of words:
  Jaccard similarity(q1, q2) = |q1 ∩ q2| / |q1 ∪ q2| = 6/8 = 0.75
- The more common words, the bigger the Jaccard index, and the more probable it is that the two questions are duplicates.
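A direct Python rendering of this computation:

```python
def jaccard(a: str, b: str) -> float:
    # Jaccard index: |intersection| / |union| of the two word sets.
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

q1 = "Who was the first president of Palestine"
q2 = "Who was the first ruler of Palestine"
print(jaccard(q1, q2))   # 6 shared words out of 8 total -> 0.75
```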
Minhash Signatures

- Jaccard can be a good string metric; however, we need to
  - split each question into words
  - compare the two sets
  - repeat for every pair
- The number of pairs grows rapidly
- Instead, create a simple fixed-size numeric fingerprint (signature) for each sentence
  - called a minhash signature
Creating Minhashes

- To calculate a MinHash:
  - create the dictionary (the set of all words) from all our questions
  - create a random permutation of it
- Back to our use-case example, the set of words we have is:
  (Who, was, the, first, president, of, Palestine, ruler, king, Jordan)
Creating Minhashes

1. "Who was the first president of Palestine"
2. "Who was the first ruler of Palestine"
3. "Who was the first king of Jordan"

1st permutation:

Index  Word        Q1   Q2   Q3
1      ruler            1
2      of          2    2    2
3      the         3    3    3
4      first       4    4    4
5      president   5
6      Who         6    6    6
7      Jordan                7
8      was         8    8    8
9      king                  9
10     Palestine   10   10
2nd permutation:

Index  Word        Q1   Q2   Q3
1      president   1
2      king                  2
3      Jordan                3
4      first       4    4    4
5      ruler            5
6      Palestine   6    6
7      the         7    7    7
8      was         8    8    8
9      of          9    9    9
10     Who         10   10   10
3rd permutation:

Index  Word        Q1   Q2   Q3
1      Jordan                1
2      Who         2    2    2
3      king                  3
4      first       4    4    4
5      of          5    5    5
6      Palestine   6    6
7      president   7
8      was         8    8    8
9      the         9    9    9
10     ruler            10
Resulting Minhash Signatures

- For each permutation, a question's minhash is the smallest index among the words it contains. Trying 3 more permutations, we might end up having the following minhashes:

  MinHash(Q1) = [2, 1, 2, 2, 1, 1]
  MinHash(Q2) = [1, 4, 4, 2, 1, 1]
  MinHash(Q3) = [2, 2, 1, 4, 4, 1]
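A sketch of this permutation-based construction; since the permutations below are drawn from an arbitrary fixed seed, the exact signature values will differ from the slide's, but the equal-band behavior that LSH relies on is the same. The function name and seed are illustrative.

```python
import random

def minhash_signatures(sentences, num_perms=6, seed=0):
    """For each random permutation of the vocabulary, a sentence's minhash
    is the smallest permuted index among the words it contains."""
    vocab = sorted({w for s in sentences for w in s.split()})
    rng = random.Random(seed)        # fixed seed only for reproducibility
    signatures = [[] for _ in sentences]
    for _ in range(num_perms):
        perm = vocab[:]
        rng.shuffle(perm)
        rank = {word: i + 1 for i, word in enumerate(perm)}   # 1-based index
        for sig, sentence in zip(signatures, sentences):
            sig.append(min(rank[w] for w in sentence.split()))
    return signatures

questions = [
    "Who was the first president of Palestine",
    "Who was the first ruler of Palestine",
    "Who was the first king of Jordan",
]
for q, sig in zip(questions, minhash_signatures(questions)):
    print(sig, "<-", q)
```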
Locality Sensitive Hashing (LSH) for Minhash Signatures

- Problem: finding questions similar to a given question is computationally expensive
  - even when using minhash signatures
- Solution: "hash" items several times, in such a way that
  - similar items are more likely to be hashed to the same bucket than dissimilar items are
  - any pair hashed to the same bucket by any of the hashings becomes a candidate pair
- false positives: dissimilar pairs in the same bucket
- false negatives: similar pairs in different buckets
LSH - Minhash Signature Partitioning

- Divide the signature into b bands, consisting of r rows each
  - this increases the chance of having bands with identical partitions
  - identical partitions will then be mapped to the same bucket
- With b = 2 and r = 3:
  MinHash(Q1) = [2, 1, 2, 2, 1, 1] => [2, 1, 2] [2, 1, 1]
  MinHash(Q2) = [1, 4, 4, 2, 1, 1] => [1, 4, 4] [2, 1, 1]
  MinHash(Q3) = [2, 2, 1, 4, 4, 1] => [2, 2, 1] [4, 4, 1]
LSH – Mapping elements to buckets

- For each band, use a hash function that
  - takes vectors of r integers
  - hashes them to some large number of buckets
- We can use a different hash function for each band
- In the example above, Q1 and Q2 share the identical second band [2, 1, 1], so they land in the same bucket and become a candidate pair (see the sketch below)
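A minimal sketch of both steps, banding plus bucketing, run on the example signatures; keying the bucket table by (band, chunk) stands in for using a different hash function per band:

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, b, r):
    """Split each signature into b bands of r rows; signatures with an
    identical band fall into the same bucket and become candidate pairs."""
    buckets = defaultdict(list)
    for doc_id, sig in enumerate(signatures):
        for band in range(b):
            chunk = tuple(sig[band * r:(band + 1) * r])
            buckets[(band, chunk)].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(ids, 2))
    return pairs

sigs = [[2, 1, 2, 2, 1, 1],   # Q1
        [1, 4, 4, 2, 1, 1],   # Q2
        [2, 2, 1, 4, 4, 1]]   # Q3
print(lsh_candidate_pairs(sigs, b=2, r=3))   # {(0, 1)}: Q1, Q2 share [2, 1, 1]
```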
Analysis of the banding technique

- Goal: find the probability that a pair of similar documents is mapped to the same bucket
- Assume the Jaccard similarity between them is s
  - s is also the probability that their signatures agree in any one row
- The probability that the signatures agree in all r rows of one particular band is s^r
- The probability that the signatures disagree in at least one row of a particular band is 1 - s^r
- The probability that the signatures disagree in at least one row of each of the b bands is (1 - s^r)^b
Analysis of the banding technique (cont.)

- The probability that the signatures agree in all the rows of at least one band, and therefore become a candidate pair, is
  1 - (1 - s^r)^b
- As a function of s, this is an S-curve: low-similarity pairs rarely become candidates, while high-similarity pairs almost always do
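Evaluating this formula shows the S-curve; the r and b below match the example's 6-element signatures split into 2 bands of 3 rows:

```python
def candidate_probability(s: float, r: int, b: int) -> float:
    # 1 - (1 - s^r)^b: chance that at least one band agrees in all rows.
    return 1 - (1 - s ** r) ** b

for s in (0.2, 0.4, 0.6, 0.75, 0.9):
    print(s, round(candidate_probability(s, r=3, b=2), 3))
# Prints 0.016, 0.124, 0.385, 0.666, 0.927: the probability of becoming
# a candidate pair rises steeply with the Jaccard similarity s.
```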
