PROBABILISTIC
DATA STRUCTURES
Thinh Dang-An
Definitions
• Data structure
• It is a ‘structure’ that holds ‘data’, allowing you to extract
information
• Probabilistic
• Query may return a wrong answer
• The answer is ‘good enough’
• Uses a fraction of the resources (e.g. memory or CPU cycles) that an exact answer would need
Four types:
• Membership: Bloom Filter, Cuckoo Filter
• Cardinality: Linear Counting, LogLog, SuperLogLog, HyperLogLog, HyperLogLog++
• Frequency: Count-Min Sketch, Majority Algorithm, Misra-Gries Algorithm
• Similarity: Locality-Sensitive Hashing (LSH), MinHash, SimHash
Bloom Filter
Membership
Properties
• It tells us that the element either definitely is not in
the set or may be in the set.
• Bloom filters are called filters because they are often
used as a cheap first pass to filter out segments of a
dataset that do not match a query.
How does it work
• A Bloom filter is a bit array of m bits, all set to 0 at the beginning
• To insert an element into the filter, compute all k hash functions for the element and set the bits at the corresponding indices
• To test whether an element is in the filter, compute all k hash functions for the element and check the bits at the corresponding indices:
• if all bits are set, the answer is “maybe”
• if at least one bit is not set, the answer is “definitely not”
• The time needed to insert or test an element is O(k), independent of the number of items already in the filter
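The insert and test steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the class name `BloomFilter`, the method names, and the way the k hash functions are derived (salting SHA-256 with the hash number) are assumptions made for the example. One byte per bit is used for readability.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter with m bits and k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # all bits start at 0 (one byte per bit here)

    def _indices(self, item: str):
        # Assumption: the k hash functions are simulated by salting SHA-256
        # with the hash number i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # Set the bit at each of the k indices.
        for idx in self._indices(item):
            self.bits[idx] = 1

    def might_contain(self, item: str) -> bool:
        # False means "definitely not"; True means "maybe".
        return all(self.bits[idx] for idx in self._indices(item))


bf = BloomFilter(m=1024, k=3)
bf.add("apple")
print(bf.might_contain("apple"))  # True: an inserted item is never reported missing
```

Both operations touch exactly k positions, which is where the O(k) cost comes from.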
Application
• Google BigTable, Apache HBase and Apache Cassandra use Bloom filters to
reduce the disk lookups for non-existent rows or columns
• Medium uses Bloom filters to avoid recommending articles a user has previously
read
• Google Chrome web browser used to use a Bloom filter to identify malicious URLs
(moved to PrefixSet, Issue 71832)
• The Squid Web Proxy Cache uses Bloom filters for cache digests
Cuckoo Filters
Membership
Properties
• Practically better than a Bloom filter
• Supports adding and removing items dynamically
• Provides higher lookup performance
• Cuckoo hashing: resolves collisions by rehashing an item to a new place
How does it work
• Parameters of the filter:
• Two hash functions: h1 and h2
• An array B with n buckets; the i-th bucket is called B[i]
• Input: L, a list of elements to be inserted into the cuckoo filter.
Insertion procedure:
While L is not empty:
    Let x be the first item in the list L. Remove x from the list.
    If B[h1(x)] is empty:
        place x in B[h1(x)]
    Else, if B[h2(x)] is empty:
        place x in B[h2(x)]
    Else:
        Let y be the element in B[h2(x)].
        Prepend y to L
        place x in B[h2(x)]
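The insertion loop can be written out as runnable Python. This is a sketch under stated assumptions: it stores whole items, one per bucket, as the slide does (a cuckoo filter as described in the referenced paper stores short fingerprints instead). The names `cuckoo_insert_all`, `lookup`, and `max_steps` are hypothetical; `max_steps` is an added safety limit, because the loop as stated would never terminate if the items keep evicting each other.

```python
def cuckoo_insert_all(items, n, h1, h2, max_steps=1000):
    """Insert every item into a table of n buckets using hash functions h1 and h2."""
    buckets = [None] * n
    pending = list(items)  # the list L from the slide
    steps = 0
    while pending:
        steps += 1
        if steps > max_steps:
            # Assumption: give up rather than loop forever on an eviction cycle.
            raise RuntimeError("too many evictions; the table is probably too full")
        x = pending.pop(0)
        i1, i2 = h1(x) % n, h2(x) % n
        if buckets[i1] is None:
            buckets[i1] = x
        elif buckets[i2] is None:
            buckets[i2] = x
        else:
            y = buckets[i2]       # evict the current occupant
            pending.insert(0, y)  # prepend y to L so it is re-inserted next
            buckets[i2] = x
    return buckets


def lookup(buckets, x, h1, h2):
    """An item can only live in one of its two candidate buckets."""
    n = len(buckets)
    return buckets[h1(x) % n] == x or buckets[h2(x) % n] == x


# Toy hash functions chosen so that the second insert forces an eviction.
def h1(x):
    return x % 8


def h2(x):
    return (x // 8) % 8


table = cuckoo_insert_all([1, 9, 17], 8, h1, h2)
print(table[:3])  # [1, 9, 17]: item 1 was evicted from bucket 1 and moved to bucket 0
```

A lookup inspects at most two buckets regardless of how many items are stored.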
What if a cuckoo filter uses more than two
hash functions?
• Nothing breaks, but it is not necessary, because:
• Using many hash functions takes more effort to implement and
brings no real benefit.
• When many inserted items land on the same bucket, the usual remedy
is to store more elements per bucket, which costs more space.
COMPARISON WITH BLOOM FILTER
• Space Efficiency
• Number of Memory Accesses
• Value Association
• Maximum Capacity
Count-Min Sketch
Frequency
Properties
• Can only over-estimate a frequency, never under-estimate it.
• The time needed to add an element or return its frequency is
O(k), where k is the number of hash functions, assuming that
every hash function can be evaluated in constant time.
How does it work
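• Use multiple arrays of counters, each with its own hash function
to compute the index.
• To add an element, increment its counter in every array.
• When queried, return the minimum of the element's counters across the arrays.
→ hence the name Count-Min Sketch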
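A compact sketch of this structure in Python follows. The class name `CountMinSketch`, the parameters `width` and `depth`, and the use of salted SHA-256 as the per-array hash function are assumptions for illustration only.

```python
import hashlib


class CountMinSketch:
    """Illustrative Count-Min Sketch: `depth` counter arrays of `width` counters each."""

    def __init__(self, width: int, depth: int) -> None:
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row: int, item: str) -> int:
        # Assumption: one hash function per array, simulated by salting with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        # Increment the item's counter in every array.
        for r in range(self.depth):
            self.rows[r][self._index(r, item)] += count

    def estimate(self, item: str) -> int:
        # Collisions only ever add to a counter, so the minimum is the
        # least-inflated value and never falls below the true count.
        return min(self.rows[r][self._index(r, item)] for r in range(self.depth))


cms = CountMinSketch(width=256, depth=4)
for _ in range(3):
    cms.add("a")
print(cms.estimate("a"))  # 3: with a single distinct item there are no collisions
```

Taking the minimum is what gives the one-sided error: other items can inflate a counter but never decrease it.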
Application
• AT&T has used Count-Min Sketch in network switches to perform analyses on
network traffic using limited memory
• At Google, a precursor of the count-min sketch (called the “count sketch”) has
been implemented on top of their MapReduce parallel processing infrastructure
• Implemented as part of the Algebird library from Twitter
HyperLogLog
Cardinality
Properties
• HyperLogLog is described by 2 parameters:
• p: the number of hash bits that select the bucket used for
averaging (m = 2^p is the number of buckets/substreams)
• h: a hash function that produces uniformly distributed hash values
• The HyperLogLog algorithm is able to estimate cardinalities of
> 10^9 with a typical error rate of 2%
• It observes the maximum number of leading zeros over all
hash values.
How does it work
• Stochastic averaging is used to reduce the large variability:
• The input stream of data elements S is divided into m substreams S(i) using the first p
bits of the hash values (m = 2^p)
• In each substream, the rank (the number of leading zeros plus one, counted after the
initial p bits that are used for substreaming) is measured independently.
• These numbers are kept in an array of registers M, where M[i] stores the maximum rank
seen so far for the substream with index i.
• Finally, a cardinality formula over the registers computes an approximation of the
number of distinct elements in the multiset.
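The register update and the final estimate can be sketched as follows. This is a simplified illustration: it uses the raw HyperLogLog estimator with the standard bias constant for m ≥ 128, and falls back to linear counting for small cardinalities. It leaves out the large-range correction and the refinements of HyperLogLog++. The class name, the 64-bit hash taken from SHA-256, and the default p = 12 are assumptions.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog with m = 2**p registers and a 64-bit hash."""

    def __init__(self, p: int = 12) -> None:
        if p < 7:
            raise ValueError("this sketch uses the alpha formula that holds for m >= 128")
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        # Assumption: the first 8 bytes of SHA-256 stand in for the uniform hash h.
        h = int.from_bytes(hashlib.sha256(item.encode("utf-8")).digest()[:8], "big")
        idx = h >> (64 - self.p)                 # first p bits select the substream
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self) -> float:
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)
        # Raw estimate: normalized harmonic mean of 2**M[i].
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction: linear counting on the empty registers.
            return m * math.log(m / zeros)
        return raw


hll = HyperLogLog(p=12)
for i in range(1000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")  # duplicates do not change the registers
print(round(hll.estimate()))  # close to 1000; the exact value depends on the hash
```

The memory footprint is fixed at m small registers, no matter how many elements are added.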
Application
• PFCOUNT in Redis
• Counting unique visitors to a website
MinHash
Similarity
Properties
• Compute a “signature” for each set, so that similar documents have similar
signatures (and dissimilar docs are unlikely to have similar signatures)
• Trade-off: length of signature vs. accuracy
How does it work
Initially, set every SIG(i, c) to ∞.
For each row r = 0, 1, …, N-1 of the characteristic matrix:
1. Compute h1(r), h2(r), …, hn(r)
2. For each column c:
    a. If column c has 0 in row r, do nothing
    b. Otherwise, for each i = 1, 2, …, n set SIG(i, c) to min(hi(r), SIG(i, c))
With:
r: row
c: column
i: index of the hash function
Note: in practice we only need to iterate through the non-zero
elements.
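The signature computation can be sketched in Python as below. Each column of the characteristic matrix is represented as the set of row indices where it holds a 1, so only non-zero entries are visited. The helper names and the hash family h(r) = (a·r + b) mod P with a fixed prime P are assumptions for this example, and row indices are assumed to be smaller than P.

```python
import random

PRIME = 2_147_483_647  # 2**31 - 1; row indices are assumed to be smaller than this


def make_hashes(n, seed=0):
    """Return n (a, b) pairs defining h(r) = (a * r + b) % PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n)]


def minhash_signatures(columns, hashes):
    """columns maps a column name to the set of rows where that column has a 1."""
    # PRIME plays the role of infinity: every hash value is strictly smaller.
    sig = {c: [PRIME] * len(hashes) for c in columns}
    for c, rows in columns.items():
        for r in rows:  # only the non-zero entries
            for i, (a, b) in enumerate(hashes):
                h = (a * r + b) % PRIME
                if h < sig[c][i]:
                    sig[c][i] = h
    return sig


def estimated_similarity(sig_a, sig_b):
    """Fraction of signature positions on which two columns agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


columns = {"A": {0, 1, 2, 3}, "B": {0, 1, 2, 3}, "C": {10, 11}}
sig = minhash_signatures(columns, make_hashes(100))
print(estimated_similarity(sig["A"], sig["B"]))  # 1.0: identical sets
print(estimated_similarity(sig["A"], sig["C"]))  # 0.0: disjoint sets
```

For sets that partially overlap, the agreement fraction approximates their Jaccard similarity, and a longer signature gives a tighter estimate.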
Problem with MinHash
• Assume that we construct a 1,000-byte MinHash signature
for each document.
• A million documents then fit into 1 gigabyte of RAM.
• But how much does it cost to find the nearest neighbors
among them?
• Brute force over all pairs of documents: 1/2 N(N-1) comparisons.
• We need a way to reduce the number of comparisons.
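To make the numbers concrete, the arithmetic for one million documents can be checked directly. The function name `all_pairs_comparisons` is just a label for this example.

```python
def all_pairs_comparisons(n: int) -> int:
    """Number of unordered document pairs: n * (n - 1) / 2."""
    return n * (n - 1) // 2


n_docs = 1_000_000
signature_bytes = 1_000

print(n_docs * signature_bytes)       # 1000000000 bytes, i.e. 1 GB
print(all_pairs_comparisons(n_docs))  # 499999500000, roughly 5 * 10**11
```

So the signatures fit in memory, but comparing every pair means about half a trillion signature comparisons.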
Locality-Sensitive Hashing
Similarity
Properties
• Idea:
• Starting from the MinHash signature matrix, divide its rows
into b bands of r rows each
• Hash each column's portion of a band into buckets with an
ordinary hash function (i.e. one hash table per band)
• If sets S and T have the same values in a band, they are
hashed into the same bucket in that band.
• For nearest-neighbor search, the candidates are the items that
share a bucket with the query item in at least one band
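The banding idea can be sketched as follows. The function name `lsh_candidate_pairs` and the use of a Python dict keyed by the band's tuple of values as the per-band hash table are assumptions for illustration.

```python
from collections import defaultdict


def lsh_candidate_pairs(signatures, bands, rows):
    """signatures maps an item name to its MinHash signature (a list of ints)."""
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)  # one hash table per band
        for name, sig in signatures.items():
            if len(sig) != bands * rows:
                raise ValueError("signature length must equal bands * rows")
            # The tuple of values in this band is the bucket key.
            key = tuple(sig[band * rows:(band + 1) * rows])
            buckets[key].append(name)
        # Every pair that shares a bucket in this band becomes a candidate.
        for names in buckets.values():
            for i in range(len(names)):
                for j in range(i + 1, len(names)):
                    candidates.add(tuple(sorted((names[i], names[j]))))
    return candidates


signatures = {
    "A": [1, 2, 3, 4],
    "B": [1, 2, 9, 9],  # agrees with A on the first band only
    "C": [7, 7, 7, 7],
}
print(lsh_candidate_pairs(signatures, bands=2, rows=2))  # {('A', 'B')}
```

Only the candidate pairs need a full signature comparison afterwards, which is how the number of comparisons is reduced.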
Application
• Finding duplicate pages on the web
• Retrieving images
• Retrieving music
References
1. Probabilistic data structures series - Andrii Gakhov
2. Cuckoo Filter: Practically Better Than Bloom - Bin Fan, David G. Andersen, Michael Kaminsky, Michael D. Mitzenmacher
3. HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm - Stefan Heule, Marc Nunkesser, Alexander Hall
4. MinHash & LSH slide
Thank you for watching

Editor's Notes

  • #4: Membership: to determine whether an element belongs to a large set of elements. Frequency: to estimate the number of times an element occurs in a set. Cardinality: to determine the number of distinct elements. Similarity: to find clusters of similar documents in a document set, and to find duplicates of a document in the document set.