Probabilistic Data Structures (Edmonton Data Science Meetup, March 2018)

Probabilistic Data Structures
KYLE J. DAVIS
TECHNICAL MARKETING MANAGER
REDIS LABS

Who We Are
Open source. The leading in-memory database platform,
supporting any high performance operational, analytics or
hybrid use case.
The open source home and commercial provider of Redis
Enterprise technology, platform, products & services.

Stack Overflow Survey: The Most Loved Databases
% of devs who expressed interest in continuing to develop with a language/tech
Redis: 64.8%
PostgreSQL: 60.8%
MongoDB: 55%
SQL Server: 54.2%
Cassandra: 49.9%
MySQL: 49.6%
SQLite: 47.2%
Oracle: 36.9%

Redis Top Differentiators
1. Performance: NoSQL Benchmark
2. Simplicity: Redis Data Structures
3. Extensibility: Redis Modules
Redis Data Structures:
Lists
Hashes
Bitmaps
Strings
Bit field
Streams
HyperLogLog
Sorted Sets
Sets
Geospatial Indexes

Simplicity: Data Structures - Redis’ Building Blocks
A key maps to one of these structures:
Lists: [ A → B → C → D → E ]
Hashes: { A: "foo", B: "bar", C: "baz" }
Bitmaps: 0011010101100111001010
Strings: "I'm a Plain Text String!"
Bit field: {23334}{112345569}{766538}
Streams: { id1=time1.seq1(A:"xyz", B:"cdf"), id2=time2.seq2(D:"abc", ) }
HyperLogLog: 00110101 11001110
Sorted Sets: { A: 0.1, B: 0.3, C: 100 }
Sets: { A, B, C, D, E }
Geospatial Indexes: { A: (51.5, 0.12), B: (32.1, 34.7) }
Example: "Retrieve the e-mail address of the user with the highest bid in an auction that started on July 24th at 11:00pm PST" = ZREVRANGE 07242015_2300 0 0

Extensibility: Modules Extend Redis Infinitely
• Add-ons that use a Redis API to seamlessly support additional use cases and data structures.
• Enjoy Redis’ simplicity, super high performance, infinite scalability and high availability.
• Any C/C++/Rust program can become a Module and run on Redis.
• Leverage existing data structures or introduce new ones.
• Can be used by anyone; Redis Enterprise Modules are tested and certified by Redis Labs.
• Turn Redis into a multi-model database.

Data Structures:
Deterministic
• You know how it will work.
• Data in = data out.
• Data is stored or it isn’t.
• Structure size >= data size.
• Examples:
–Hash map (1953)
–Linked lists (1955)
–Heaps (1964)
–…
Probabilistic
• Behaves differently in different contexts.
• Data in, maybe data out.
• Provides a fuzzy view of data.
• Structure size can be less than data size.
• Examples:
–Bloom Filters (1970/1998)
–Count Min Sketch (2005)
–HyperLogLog (2007)
–Cuckoo Filter (2014)
–…

…BUT WHY?!
Sometimes speed is more important than correctness
Sometimes compactness is more important than correctness
Sometimes you only need certain data guarantees
You can use both!

You will not leave tonight knowing everything about
Probabilistic data structures. But…

Step 0: The hashing function
• Input: Anything, of any length
• Output: A (very) large number
• Properties: Any change in the input results in a completely different output, but for a given input the output is always the same. One way: practically impossible to reverse computationally.
• Cryptographic (SHA family, RIPEMD, etc.)
–Comparatively expensive to compute
–Very low collision rate
• Non-Cryptographic (Murmur, spooky, xxhash, fnv, etc.)
–Cheap to compute
–Low collision rate
–Smaller result size
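A minimal sketch of the two hash families, assuming Python's standard hashlib for the cryptographic side and a hand-written 32-bit FNV-1a for the non-cryptographic side. The function names are illustrative only.

```python
import hashlib

FNV_OFFSET_32 = 0x811C9DC5  # FNV-1a 32-bit offset basis
FNV_PRIME_32 = 0x01000193   # FNV-1a 32-bit prime


def fnv1a_32(data: bytes) -> int:
    """Non-cryptographic: cheap to compute, small (32-bit) result."""
    h = FNV_OFFSET_32
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF  # keep the value within 32 bits
    return h


def sha256_int(data: bytes) -> int:
    """Cryptographic: more work per call, a (very) large 256-bit number out."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


if __name__ == "__main__":
    for word in (b"redis", b"redit"):
        # A one-letter change in the input gives an unrelated-looking output.
        print(word, hex(fnv1a_32(word)), hex(sha256_int(word))[:18] + "...")
```

The same input always yields the same number, which is what lets the structures in the following steps find an item's slots again at query time.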

Step 1: Bloom Filters
• "Filter" is a weird term for it: think storage, not filtering.
• Items are hashed, and the hashed items are stored in a bit field.
• Answers are "maybe" or "no".
• Demo
–http://llimllib.github.io/bloomfilter-tutorial/
–Not precisely how it’s done normally, but nice and visual
• Bit flipping.
• Put items in and query status
–Simplest form: Never fills; the false-positive rate just gets worse.
–More complex: Fills to a predetermined error rate, then "grows".
• Growing
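A minimal sketch of the simplest (non-growing) form, assuming a fixed-size bit field and bit positions derived from a single SHA-256 digest. The class name, sizes and double-hashing trick are illustrative choices, not how ReBloom does it.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # the bit field, all zeros

    def _positions(self, item: str):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step for double hashing
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)  # flip the bit on

    def might_contain(self, item: str) -> bool:
        # False means "no" for certain; True only means "maybe".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


if __name__ == "__main__":
    bf = BloomFilter()
    bf.add("kyle")
    print(bf.might_contain("kyle"))   # always True once added
    print(bf.might_contain("nobody")) # False, or True on a false positive
```

Bits are only ever turned on, so the filter never rejects an insert; it just answers "maybe" more often as it fills.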

Step 1: Bloom Filter Usage (General)
• Username search (speed, guarantees)
• Fraud Mitigation (speed, guarantees)
• Akamai – One hit wonder problem (speed, compactness, guarantees)
• Databases – Disk lookups for non-existent data (speed, guarantees)
• Chrome – Is a URL malicious? (speed, guarantees, combined)
• Bitcoin – Transaction privacy in Simplified Payment Verification (compactness, combined)
• Venti – Only storing unique data in archival storage (speed, guarantees)
• Exim – As part of a rate limiter (speed, compactness, guarantees)
• Medium – Content freshness (speed, guarantees)

Step 1: Bloom Filter Redis Usage
• Provided by the ReBloom module
• BF.ADD [filter name] [item]
• BF.EXISTS [filter name] [item]
• Other commands for edge cases and administration: BF.RESERVE, BF.MADD, BF.MEXISTS, BF.SCANDUMP, BF.LOADCHUNK
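A hedged sketch of calling the two core commands from Python. It assumes a redis-py style client that exposes `execute_command(*args)` and a server with the ReBloom module loaded; the helper name `seen_before` and the key name in the usage comment are made up for illustration.

```python
def seen_before(client, filter_name: str, item: str) -> bool:
    """Check an item against a Bloom filter, then record it.

    `client` is assumed to be a redis-py style connection whose
    execute_command() sends a raw command to a server running ReBloom.
    """
    # BF.EXISTS replies 1 for "maybe present" and 0 for "definitely not".
    maybe_present = client.execute_command("BF.EXISTS", filter_name, item)
    # BF.ADD creates the filter on first use and adds the item.
    client.execute_command("BF.ADD", filter_name, item)
    return bool(maybe_present)


# Usage sketch (needs a live server, so it is left as a comment):
#   import redis
#   r = redis.Redis()
#   if seen_before(r, "usernames", "kyle"):
#       print("username may already be taken")
```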

Step 2: HyperLogLog
• Funny name again. Estimates the cardinality (number of unique items) of a set.
• Part of the "sketch" family of data types
• Bit flipping and counting
• Add, Count or Merge
–Merge is really useful
• 12 KB for the Redis implementation
• Standard error: 0.81% in the Redis implementation

Step 2a: How does HyperLogLog work?
Items are hashed. Look at the binary form of the hash value and find the position of the first 1 (i.e. the length of the first run of 0s, plus one). Part of the hash selects a bucket, and each bucket keeps the largest position it has seen. Repeat for every item; the maxima across all buckets are then combined into the count estimate.
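A rough sketch of that procedure, assuming a 64-bit slice of SHA-1 as the hash and 4096 buckets. It follows the textbook HyperLogLog estimator with a small-range (linear counting) fallback, which is not necessarily what Redis does internally; the class and method names are illustrative.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, precision=12):
        self.p = precision             # bits of the hash used to pick a bucket
        self.m = 1 << precision        # number of buckets (4096 by default)
        self.registers = [0] * self.m  # per-bucket maximum position seen

    def add(self, item: str) -> None:
        x = int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")
        bucket = x >> (64 - self.p)             # first p bits choose the bucket
        rest = x & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # 1-based position of the first 1 bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[bucket] = max(self.registers[bucket], rank)

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)   # bias constant, valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # linear counting when sparse
        return raw

    def merge(self, other: "HyperLogLog") -> None:
        # The union of two sketches is the bucket-wise maximum.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
```

Adding the same item twice changes nothing, because a bucket only ever keeps its maximum; that is also why merging two sketches gives the same registers as sketching the combined stream.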

Step 2: HyperLogLog Uses (General)
• Facebook Likes (speed, compactness, guarantees)
• Reddit Unique Reads (speed, compactness, guarantees)
• Network Attack Mitigation (speed, compactness, guarantees, combined)
• Neustar (Advertising Platforms) Group Intersections (compactness, guarantees, combined)

Step 2: HyperLogLog Redis Usage
• Built into Redis
• PFADD [hll name] [element…]
• PFCOUNT [hll name(s)…]
• PFMERGE [dest] [source…]
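A hedged sketch of the three commands from Python, assuming a redis-py style client whose `pfadd`, `pfcount` and `pfmerge` methods map onto PFADD, PFCOUNT and PFMERGE. The `visitors:` key prefix and the function names are hypothetical.

```python
def record_visit(client, day: str, user_id: str) -> None:
    """PFADD: add one element to that day's HyperLogLog."""
    client.pfadd(f"visitors:{day}", user_id)


def unique_visitors(client, *days: str) -> int:
    """PFCOUNT: estimated unique count for one day, or the union of several."""
    return client.pfcount(*(f"visitors:{day}" for day in days))


def merge_days(client, dest: str, *days: str) -> None:
    """PFMERGE: store the union of several days under a new key."""
    client.pfmerge(dest, *(f"visitors:{day}" for day in days))


# Usage sketch (needs a live server, so it is left as a comment):
#   import redis
#   r = redis.Redis()
#   record_visit(r, "2018-03-01", "user-42")
#   print(unique_visitors(r, "2018-03-01", "2018-03-02"))
```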

Step 3: Count Min Sketch
• Frequency estimation (counting)
• "Sketch" family
• Increment, Query, Merge (with weights!)
• Hash items with multiple functions; each function selects one counter in its own row.
–Grid of counters: buckets (width) by hash functions (depth)
–Take the minimum
• Initialize with an error rate and a probability to dial in requirements
–0.01% error rate at a probability of 0.01% = 40 KB
• Overestimates are possible, especially for items with few observations (underestimates are not)
Initial            B1   B2   B3   B4
Hash 1              0    0    0    0
Hash 2              0    0    0    0
Hash 3              0    0    0    0
'foo' INCRBY 1     B1   B2   B3   B4
Hash 1 = 3          0    1    0    1
Hash 2 = 5          0    1    0    0
Hash 3 = 1          0    0    0    0
'bar' INCRBY 99    B1   B2   B3   B4
Hash 1 = 11         0    1    0    1
Hash 2 = 5          0  100    0    0
Hash 3 = 8         99    0    0   99
Query 'baz': MIN (5,1,0) = 0
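A minimal sketch of the increment, query and weighted-merge operations, assuming one salted BLAKE2b hash per row. The class name, default dimensions and hashing scheme are illustrative, not those of the Redis module.

```python
import hashlib


class CountMinSketch:
    def __init__(self, width=2000, depth=5):
        self.width = width   # buckets per row
        self.depth = depth   # number of hash functions (rows)
        self.rows = [[0] * width for _ in range(depth)]

    def _bucket(self, row: int, item: str) -> int:
        # A different salt per row acts as a different hash function.
        digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8,
                                 salt=row.to_bytes(2, "little")).digest()
        return int.from_bytes(digest, "big") % self.width

    def incrby(self, item: str, amount: int = 1) -> None:
        for row in range(self.depth):
            self.rows[row][self._bucket(row, item)] += amount

    def query(self, item: str) -> int:
        # Collisions can only inflate a counter, so the minimum is the best guess.
        return min(self.rows[row][self._bucket(row, item)]
                   for row in range(self.depth))

    def merge(self, other: "CountMinSketch", weight: int = 1) -> None:
        # Both sketches must share width and depth.
        for mine, theirs in zip(self.rows, other.rows):
            for i, value in enumerate(theirs):
                mine[i] += weight * value


if __name__ == "__main__":
    cms = CountMinSketch()
    cms.incrby("foo", 1)
    cms.incrby("bar", 99)
    print(cms.query("foo"), cms.query("bar"), cms.query("baz"))
```

With a deliberately tiny grid (say 4 buckets by 3 rows) the estimates drift upward as items collide, but they never fall below the true count.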

Step 3: Count Min Sketch Uses
• Network flows (speed, compactness, guarantees)
• Anomaly Detection (speed, guarantees, combined)
• Outliers (guarantees, combined)
• Power Saving Analytics in IoT Devices (speed, combined)

Step 3: Count Min Sketch Redis Usage
• Provided by the Count Min Sketch module
• CMS.INCRBY [sketch name] [item] [amount to increment] […]
• CMS.QUERY [sketch name] [item] [item…]
• CMS.MERGE [dest] [sketch name] [sketch name…] [WEIGHTS weight weight…]
• CMS.INITBYDIM, CMS.INITBYERR

Cuckoo Filters
CC BY-SA 2.0 / Ltshears

Step 4: Cuckoo Filter
• Same usage patterns as Bloom filters
• Can delete and count items
• Larger than Bloom filters
• Hash x2, fingerprint x1; place the fingerprint in one of the two candidate buckets if it has an empty slot
–If both are full, kick an existing fingerprint out to its other bucket.
• Lookup does the same hash/fingerprint routine, then looks for the fingerprint in either of the buckets.
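A small sketch of the insert, lookup and delete routine, assuming 16-bit fingerprints, four slots per bucket, and a power-of-two bucket count so that the alternate bucket can be derived by XOR. All names and sizes are illustrative, not those of the Redis module.

```python
import hashlib
import random


class CuckooFilter:
    def __init__(self, num_buckets=256, bucket_size=4, max_kicks=500, seed=0):
        # A power-of-two bucket count keeps the XOR step below reversible.
        assert num_buckets & (num_buckets - 1) == 0
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]
        self.rng = random.Random(seed)

    @staticmethod
    def _hash(data: bytes) -> int:
        return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")

    def _fingerprint(self, item: str) -> int:
        return self._hash(b"fp:" + item.encode("utf-8")) & 0xFFFF

    def _index(self, item: str) -> int:
        return self._hash(item.encode("utf-8")) % self.num_buckets

    def _alt_index(self, index: int, fp: int) -> int:
        # The other bucket depends only on the current bucket and the fingerprint.
        return (index ^ self._hash(fp.to_bytes(2, "big"))) % self.num_buckets

    def add(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        i = self.rng.choice((i1, i2))
        for _ in range(self.max_kicks):
            # Both buckets are full: evict a resident and move it to its other bucket.
            slot = self.rng.randrange(self.bucket_size)
            fp, self.buckets[i][slot] = self.buckets[i][slot], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # treated as full; this simple sketch drops the last evictee

    def might_contain(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        return fp in self.buckets[i1] or fp in self.buckets[self._alt_index(i1, fp)]

    def delete(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        for i in (i1, self._alt_index(i1, fp)):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False
```

Storing fingerprints rather than bare bits is what makes deletion possible: removing one copy of a fingerprint undoes exactly one insert.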

Step 4: Cuckoo vs Bloom
• Slower to insert
• Faster to look up
• Great for times when you don’t have a:
–Good cardinality estimate
–Tight storage budget
• Only viable option for deletes in probabilistic presence detection
• CF.ADD, CF.INSERT, CF.DEL, CF.EXISTS + a few options

Other probabilistic data structures?

Questions?
kyle@redislabs.com / mike@redislabs.com
