Distributed COUNT(DISTINCT) with HyperLogLog on PostgreSQL | PGConf EU 2017 | Burak Yucesoy

Burak Yucesoy | Citus Data | PGConf EU
Distributed COUNT(DISTINCT) with HyperLogLog on PostgreSQL

What is COUNT(DISTINCT)?
● Number of unique elements (cardinality) in a given data set
● Useful to find things like…
○ Number of unique users who visited your web page
○ Number of unique products in your inventory

What is distributed COUNT(DISTINCT)?
[Diagram: a Coordinator connected to three worker nodes. Worker Node 1 holds logins_001, Worker Node 2 holds logins_002, Worker Node 3 holds logins_003.]

Why do we need distributed COUNT(DISTINCT)?
● Your data is too big to fit in the memory of a single machine
● The naive approach to COUNT(DISTINCT) needs too much memory

Why is distributed COUNT(DISTINCT) difficult?
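[Diagram: the Coordinator receives SELECT COUNT(*) FROM logins; and sends SELECT COUNT(*) FROM ...; to each worker. Worker Node 1 (logins_001) returns 100, Worker Node 2 (logins_002) returns 200, Worker Node 3 (logins_003) returns 300. The Coordinator adds them up: 600.]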

[Diagram: the Coordinator receives SELECT COUNT(DISTINCT username) FROM logins; and sends SELECT COUNT(DISTINCT user_id) FROM ...; to Worker Node 1 (logins_001), Worker Node 2 (logins_002) and Worker Node 3 (logins_003). No combined result is shown.]
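
Why the COUNT(*) trick does not carry over can be seen in a minimal Python sketch. The shard contents below are made up for illustration; only the table names come from the slides.

```python
# Hypothetical shard contents: the same user can log in on any shard.
shards = {
    "logins_001": ["alice", "bob", "alice"],
    "logins_002": ["bob", "carol"],
    "logins_003": ["alice", "dave", "carol"],
}

# COUNT(*) is additive: per-shard row counts can simply be summed.
total_rows = sum(len(rows) for rows in shards.values())

# COUNT(DISTINCT) is not: summing per-shard distinct counts double-counts
# every user that appears in more than one shard.
naive_distinct = sum(len(set(rows)) for rows in shards.values())

# The exact answer needs the distinct values themselves in one place.
exact_distinct = len(set().union(*shards.values()))

print(total_rows, naive_distinct, exact_distinct)
```

Here the summed per-shard distinct counts overshoot the true number of distinct users, because alice, bob and carol each appear in more than one shard.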

Some Possible Approaches
● Pull all distinct data to one node and count there. (Doesn’t scale)
● Repartition data on the fly. (Scales but it’s very slow)
● Use HyperLogLog. (Scales and is fast)

HyperLogLog (HLL)
HLL is:
● Approximation algorithm
● Estimates cardinality of given data
● Mathematically proven error bounds

Is it OK to approximate?
It depends…

HLL
● Very fast
● Low memory footprint
● Can work with streaming data
● Can merge estimations of two separate datasets efficiently

How does HLL work?
Steps:
1. Hash all elements
   a. Ensures uniform data distribution
   b. Can treat all data types the same
2. Observe rare bit patterns
3. Apply stochastic averaging

How does HLL work? - Observing rare bit patterns
Alice: hash 645403841, binary 0010...001, number of leading zeros: 2
Maximum number of leading zeros: 2

Bob: hash 1492309842, binary 0101...010, number of leading zeros: 1

...
Cardinality Estimation: 2^7
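
The leading-zero observation can be sketched in Python. This assumes the hash values on the slide are 32-bit integers (the slide does not state the width), and `leading_zeros` is a helper name introduced here.

```python
HASH_BITS = 32  # assumption: 32-bit hash values

def leading_zeros(hash_value: int, width: int = HASH_BITS) -> int:
    """Count zero bits before the first 1 bit in a fixed-width integer."""
    return width - hash_value.bit_length()

hashes = {"Alice": 645403841, "Bob": 1492309842}

max_zeros = 0
for name, h in hashes.items():
    zeros = leading_zeros(h)
    max_zeros = max(max_zeros, zeros)
    print(f"{name}: {h:032b} -> {zeros} leading zeros")

# A hash with at least k leading zeros shows up about once in 2**k values,
# so the largest k seen so far gives a rough cardinality estimate of 2**k.
estimate = 2 ** max_zeros
print("estimate:", estimate)
```

With only Alice and Bob the maximum is 2; the slide's final estimate corresponds to a maximum of 7 after more elements have been hashed.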

How does HLL work? - Stochastic averaging
Measuring the same thing repeatedly and taking the average.

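[Diagram: Data is split into Partition 1, Partition 2 and Partition 3, with maximum leading-zero counts 7, 5 and 12. Per-partition estimates: 2^7, 2^5, 2^12. Estimation: 228.968...]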

[Diagram: hash bits 01000101...010. The first m bits decide the partition number; the remaining bits are used to count leading zeros.]
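
A minimal Python sketch of stochastic averaging follows. It reproduces the slide's arithmetic, where the estimate equals the number of partitions times the harmonic mean of the per-partition estimates. The full HyperLogLog algorithm additionally applies a bias-correction constant and range corrections, which are left out here. The function names and the 32-bit hash width are assumptions.

```python
def split_hash(hash_value: int, index_bits: int, width: int = 32):
    """Return (partition index, leading zeros of the remaining bits)."""
    rest_bits = width - index_bits
    partition = hash_value >> rest_bits            # first index_bits bits
    rest = hash_value & ((1 << rest_bits) - 1)     # remaining bits
    return partition, rest_bits - rest.bit_length()

def estimate(registers):
    """Number of partitions times the harmonic mean of 2**register."""
    m = len(registers)
    harmonic_mean = m / sum(2.0 ** -r for r in registers)
    return m * harmonic_mean

# Maximum leading-zero counts per partition, as on the slide.
print(estimate([7, 5, 12]))
```

The harmonic mean is what keeps the single large value (12) from dominating the result.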

Error rate of HLL is damn good
● Typical error rate: 1.04 / sqrt(number of partitions)
● Memory needed: number of partitions * log(log(max. value in hash space)) bits
● Can estimate cardinalities well beyond 10^9 with a 1% error rate while using only 6 kilobytes of memory
● Memory vs accuracy tradeoff
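
The two formulas can be turned into a small Python calculator. It assumes a 64-bit hash space, so each partition needs log2(64) = 6 bits, and a partition count of 8192; neither number is stated on the slide.

```python
import math

def typical_error(partitions: int) -> float:
    """Typical relative error: 1.04 / sqrt(number of partitions)."""
    return 1.04 / math.sqrt(partitions)

def memory_bytes(partitions: int, hash_bits: int = 64) -> float:
    """partitions * log2(log2(size of hash space)) bits, in bytes."""
    bits_per_partition = math.ceil(math.log2(hash_bits))
    return partitions * bits_per_partition / 8

partitions = 8192  # assumption for illustration
print(f"error  ~ {typical_error(partitions):.2%}")
print(f"memory = {memory_bytes(partitions):.0f} bytes")
```

Under these assumptions the result is 6144 bytes at a typical error a little above 1%, in the same range as the figures on the slide. Quadrupling the partition count halves the error and quadruples the memory.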

Why does HLL work?
It turns out that a combination of lots of bad estimates is a good estimate

Some interesting examples
[Diagram: the input is Alice repeated many times. Alice hashes to 645403841, binary 00100110...001. Partition 1: 0, Partition 2: 2, Partition 3: 0, ... Per-partition estimates: 2^0, 2^2, 2^0, ... Harmonic mean: 1.103...]
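
The Alice example can be replayed in Python. The sketch assumes 32-bit hashes and 8 partitions (3 index bits), which is consistent with the numbers on the slide but not stated there; `add` and `harmonic_mean` are names introduced here.

```python
INDEX_BITS = 3                       # assumption: 2**3 = 8 partitions
REST_BITS = 32 - INDEX_BITS          # assumption: 32-bit hash values

def add(registers, hash_value):
    """Record one hashed element, keeping the per-partition maximum."""
    partition = hash_value >> REST_BITS
    rest = hash_value & ((1 << REST_BITS) - 1)
    zeros = REST_BITS - rest.bit_length()
    registers[partition] = max(registers[partition], zeros)

def harmonic_mean(registers):
    """Harmonic mean of the per-partition estimates 2**register."""
    return len(registers) / sum(2.0 ** -r for r in registers)

registers = [0] * (1 << INDEX_BITS)
for _ in range(100_000):
    add(registers, 645403841)        # Alice's hash, over and over

print(registers)                     # only one partition is ever touched
print(harmonic_mean(registers))
```

Duplicates hash to the same value, so repeating Alice never changes the registers after the first insert and the result stays small no matter how many rows there are.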

Some interesting examples
[Diagram: the input is a single element, Charlie, whose hash is 0, binary 00000000...000. Partition 1: 29, Partition 2: 0, ..., Partition 8: 0. Per-partition estimates: 2^29, 2^0, ..., 2^0. Harmonic mean: 1.142...]
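
The Charlie example shows why the harmonic mean is used. A short Python comparison, assuming 8 partitions where one sees 29 leading zeros and the other seven see none (as the slide's numbers suggest):

```python
def harmonic_mean(values):
    return len(values) / sum(1.0 / v for v in values)

def arithmetic_mean(values):
    return sum(values) / len(values)

# Per-partition estimates when the only element hashes to 0:
# one partition sees 29 leading zeros, the other seven see none.
estimates = [2 ** 29] + [2 ** 0] * 7

print(arithmetic_mean(estimates))   # dominated by the single outlier
print(harmonic_mean(estimates))     # stays close to 1
```

An arithmetic mean would report tens of millions of elements because of one unlucky hash; the harmonic mean barely moves.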

postgresql-hll
● https://github.com/aggregateknowledge/postgresql-hll
● https://github.com/citusdata/postgresql-hll
● Companies using postgresql-hll for their dashboards
○ Neustar
○ Cloudflare

postgresql-hll on a single node
postgresql-hll uses a data structure, also called hll, to keep the maximum number of leading zeros of each partition.
● Use hll_hash_bigint to hash elements.
○ There are other functions for other common data types.
● Use hll_add_agg to aggregate hashed elements into an hll data structure.
● Use hll_cardinality to materialize the hll data structure into an actual distinct count.
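
One way the three functions might be combined into a single query is sketched below, wrapped in Python so the string can be handed to any PostgreSQL driver. The table `logins` and column `user_id` are taken from the earlier slides; that `user_id` is a bigint and that the extension is installed are assumptions.

```python
# Assumed table and column names; adjust to the real schema.
TABLE = "logins"
COLUMN = "user_id"

# hash each value -> aggregate the hashes into one hll -> read the estimate
APPROX_DISTINCT_SQL = (
    "SELECT hll_cardinality(hll_add_agg(hll_hash_bigint({column}))) "
    "FROM {table};"
).format(table=TABLE, column=COLUMN)

print(APPROX_DISTINCT_SQL)
```

The nesting mirrors the bullets: the innermost call hashes, the aggregate builds the hll, and the outermost call turns it into a number.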

What happens in a distributed scenario?

How to merge COUNT(DISTINCT) with HLL
[Diagram: Shard 1 is split into Partition 1, Partition 2 and Partition 3, with maximum leading-zero counts 7, 5 and 12. Intermediate result: HLL(7, 5, 12)]

[Diagram: Shard 2 is split into Partition 1, Partition 2 and Partition 3, with maximum leading-zero counts 11, 7 and 8. Intermediate result: HLL(11, 7, 8)]

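[Diagram: hll_union_agg merges HLL(7, 5, 12) and HLL(11, 7, 8) into HLL(11, 7, 12). Per-partition estimates: 2^11, 2^7, 2^12. Estimation: 1053.257...]

[Diagram: Shard 1 + Shard 2, merged partition by partition, keeping the larger value. Shard 1 Partition 1 (7) + Shard 2 Partition 1 (11) gives 11; Shard 1 Partition 2 (5) + Shard 2 Partition 2 (7) gives 7; Shard 1 Partition 3 (12) + Shard 2 Partition 3 (8) gives 12. Estimation: 1053.257...]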
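
The merge step can be sketched in a few lines of Python. This is a conceptual stand-in for what hll_union_agg does, not the extension's implementation, and the estimate uses the same simplified arithmetic as the slides (no bias correction).

```python
def merge(a, b):
    """Union of two HLLs: keep the larger value in every partition."""
    return [max(x, y) for x, y in zip(a, b)]

def estimate(registers):
    """Slide arithmetic: partitions times harmonic mean of 2**register."""
    m = len(registers)
    return m * m / sum(2.0 ** -r for r in registers)

shard_1 = [7, 5, 12]
shard_2 = [11, 7, 8]

merged = merge(shard_1, shard_2)
print(merged, estimate(merged))
```

Because the merge only needs the per-partition maxima, each shard ships a few small integers instead of its distinct values, and the merged estimate comes out at roughly 1053.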

postgresql-hll in a distributed environment
1. Separate data into shards.
[Diagram: the data is split into logins_001, logins_002 and logins_003.]

2. Put shards into separate nodes.
[Diagram: a Coordinator and three worker nodes. Worker Node 1 holds logins_001, Worker Node 2 holds logins_002, Worker Node 3 holds logins_003.]

3. For each shard, calculate hll (but do not materialize).
[Diagram: Shard 1 is split into Partition 1, Partition 2 and Partition 3, with maximum leading-zero counts 7, 5 and 12. Intermediate result: HLL(7, 5, 12)]

4. Pull intermediate results to a single node.
[Diagram: Worker Node 1 (logins_001), Worker Node 2 (logins_002) and Worker Node 3 (logins_003) send their intermediate results HLL(6, 4, 11), HLL(10, 6, 7) and HLL(7, 12, 5) to the Coordinator.]

5. Merge separate hll data structures and materialize them.
[Diagram: HLL(11, 7, 8), HLL(7, 5, 12) and HLL(8, 13, 6) are merged into HLL(11, 13, 12). Per-partition estimates: 2^11, 2^13, 2^12. Estimation: 10532.571...]
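
The five steps can be simulated end to end in Python. Everything here is an assumption made for illustration: the shard contents, the 16 partitions, the 32-bit hash width, and the SHA-256 based hash, which is a stand-in and not what postgresql-hll uses.

```python
import hashlib

INDEX_BITS = 4                      # assumption: 16 partitions
REST_BITS = 32 - INDEX_BITS         # assumption: 32-bit hashes

def hash32(value: str) -> int:
    """Stand-in hash: first 4 bytes of SHA-256."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:4], "big")

def build_hll(values):
    """Step 3: per-shard registers, not yet turned into a count."""
    registers = [0] * (1 << INDEX_BITS)
    for v in values:
        h = hash32(v)
        rest = h & ((1 << REST_BITS) - 1)
        zeros = REST_BITS - rest.bit_length()
        registers[h >> REST_BITS] = max(registers[h >> REST_BITS], zeros)
    return registers

def merge(hlls):
    """Step 5: element-wise maximum over all intermediate results."""
    return [max(column) for column in zip(*hlls)]

# Steps 1-2: hypothetical shards with overlapping users.
users = [f"user{i}" for i in range(3000)]
shards = [users[0:1500], users[1000:2500], users[2000:3000] + users[:200]]

# Steps 3-4: each worker builds its hll; the coordinator collects them.
intermediate = [build_hll(shard) for shard in shards]

# Step 5: merging gives exactly the registers a single node would compute.
merged = merge(intermediate)
print(merged == build_hll(users))
```

The merged registers are identical to those built from all the data on one node, so any estimate derived from them is identical too, even though users overlap between shards.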

Or use Citus :)

Burak Yucesoy
burak@citusdata.com
@byucesoy
Thank You
citusdata.com | @citusdata
