21st Athens Big Data Meetup - 3rd Talk - Dive into ClickHouse query execution

Data processing into ClickHouse
Nikolai Kochetov, ClickHouse developer

Agenda
› Data layout and compression
› In-memory layout and data processing
› Pipelining and parallelism
› Specialized data structures
3 / 44

Column-Oriented DBMS
General ideas
› Separate column is stored in separate file
(or several files)
› Only affected columns are read
› Columnar data representation in memory
Additional concepts
› Sparse index
› Per-column compression
5 / 44

Compression
Highly customizable in CREATE TABLE statement
CREATE TABLE codec_example
(
`dt` DateTime, -- default CODEC is LZ4
`dt_none` DateTime CODEC(NONE),
`dt_lz4_4` DateTime CODEC(LZ4HC(4)),
`dt_zstd` DateTime CODEC(ZSTD),
`dt_dd_lz4` DateTime CODEC(DoubleDelta, LZ4HC) -- combined
)
ENGINE = MergeTree
ORDER BY dt
6 / 44

Compression ratio vs decompression speed
Intel Xeon E3-1225V3, enwik8 https://quixdb.github.io/squash-benchmark
7 / 44

Compression ratio
SELECT
column,
formatReadableSize(column_data_compressed_bytes) AS compressed,
formatReadableSize(column_data_uncompressed_bytes) AS uncompressed,
column_data_uncompressed_bytes / column_data_compressed_bytes AS r
FROM system.parts_columns
WHERE (table = 'codec_example') AND active ORDER BY r ASC
┌─column────┬─compressed─┬─uncompressed─┬──────────────────r─┐
│ dt_none │ 67.73 MiB │ 67.70 MiB │ 0.999618408127124 │
│ dt │ 3.06 MiB │ 67.70 MiB │ 22.156958788868835 │
│ dt_lz4_4 │ 3.06 MiB │ 67.70 MiB │ 22.156958788868835 │
│ dt_zstd │ 1.08 MiB │ 67.70 MiB │ 62.91648262048673 │
│ dt_dd_lz4 │ 938.17 KiB │ 67.70 MiB │ 73.89642182401099 │
└───────────┴────────────┴──────────────┴────────────────────┘
8 / 44

Data transformation chain
Time series data
9 / 44

Time series data -> Delta
10 / 44

Time series data -> Delta -> Delta
11 / 44

Time series data -> Delta -> Delta -> variable length encoding
Time series data -> DoubleDelta
Time series data -> DoubleDelta -> LZ4HC
12 / 44

Read time
Test query
SELECT dt_dd_lz4 FROM codec_example FORMAT Null
Enable system.query_log
SET log_queries = 1
xml config:
clickhouse.tech/docs/en/operations/server_settings/settings/#server_settings-query-log
Drop FS cache
$ echo 3 | sudo tee /proc/sys/vm/drop_caches
13 / 44

Read time
Profile events are in system.query_log
SELECT
pe.Names,
pe.Values
FROM system.query_log
ARRAY JOIN ProfileEvents AS pe
WHERE event_date = today() AND type = 'QueryFinish'
AND query_id = '...'
┌─pe.Names──────────────────────────────┬─pe.Values─┐
│ DiskReadElapsedMicroseconds │ 123970 │
│ RealTimeMicroseconds │ 596084 │
...
14 / 44

Read time
Higher compression rate means
› less IO and more CPU time
› less real time for IO-bounded queries
15 / 44

Read time
select sum(halfMD5(halfMD5(dt))) from codec_example
For CPU-bounded queries decompression time is usually insignificant
16 / 44

In-memory layout
and data processing

Data processing
› Data is processed by blocks
› Block stores slices of columns
› Column is represented in one or several buffers
18 / 44

Integers
› Single buffer
› Stores zero at position -1
› Extra 15 bytes are allocated at array’s tail
19 / 44

Strings
› Buffers with data and offsets
› Offsets are prefix sums of sizes
› Store 0 at string’s end
20 / 44

Arrays
› As well as Strings
› Offsets are stored in a
separate file on FS
21 / 44

N-dimensional Arrays
› N-dimensional Array
is an Array of
(N-1)-dimensional Arrays
› N-dimensional Offsets
are Offsets for
(N-1)-dimensional offsets
› Natural generalization of
1-dimensional Arrays
22 / 44

Functions
Concepts
› Pure (with some exceptions)
› Strong typing
› Multiple overloads
Per-columns execution
› Less virtual calls
› SIMD optimizations
› Complication for UDF
23 / 44

SIMD operations
int memcmpSmallAllowOverflow15(const Char * a, size_t a_size,
const Char * b, size_t b_size)
{
size_t min_size = std::min(a_size, b_size);
for (size_t offset = 0; offset < min_size; offset += 16)
{
/// Compare 16 bytes at once
uint16_t mask = _mm_movemask_epi8(_mm_cmpeq_epi8(
_mm_loadu_si128(reinterpret_cast<const __m128i *>(a + offset)),
_mm_loadu_si128(reinterpret_cast<const __m128i *>(b + offset))));
if (~mask) /// if mask has zero bit (some bytes are different)
{
/// Find and compare first different bytes
...
}
}
return detail::cmp(a_size, b_size);
}
24 / 44

SIMD operations
Main loop for memcmpSmallAllowOverflow15
0xa187fc0 : add $0x10,%r8 ; offset += 16
0xa187fc4 : cmp %r9,%r8 ; if (offset >= min_size)
0xa187fc7 : jae 0xa188008 ; exit loop
0xa187fc9 : movdqu (%rdx,%r8,1),%xmm0 ; xmm0 = a[offset] (16 bytes)
0xa187fcf : movdqu (%rdi,%r8,1),%xmm1 ; xmm1 = b[offset] (16 bytes)
0xa187fd5 : pcmpeqb %xmm1,%xmm0 ; xmm0 = (xmm0 == xmm1)
; (16 bytes at once)
0xa187fd9 : pmovmskb %xmm0,%eax ; mask = `bit mask from xmm0`
0xa187fdd : xor $0xffff,%ax ; mask = ~mask
0xa187fe1 : je 0xa187fc0 ; if (mask == 0)
; continue loop
25 / 44

Query Pipeline
SELECT avg(length(URL)) FROM hits WHERE URL != ''
Independent execution steps
› Read column URL
› Calculate expression URL != ''
› Filter column URL
› Calculate function length(URL)
› Calculate aggregate function avg
27 / 44

Query Pipeline
Properties
› Arbitrary graph
› Support parallel execution
› Dynamically changeable
28 / 44

Parallel Execution
Parallelism by data
29 / 44

Parallel Execution
SELECT hex(SHA256(*)) FROM (
SELECT URL FROM hits ORDER BY URL ASC)))
Vertical parallelism
30 / 44

Dynamic pipeline modification
Sometimes we need to change pipeline during execution
Sort stores all query data in memory
Set max_bytes_before_external_sort = <some limit>
31 / 44

32 / 44

33 / 44

34 / 44

35 / 44

Query Pipeline
SELECT avg(length(URL)) + 1
FROM hits WHERE URL != ''
WITH TOTALS SETTINGS extremes = 1
┌─plus(avg(length(URL)), 1)─┐
│ 85.3475007793562 │
└───────────────────────────┘
Totals:
│ 85.3475007793562 │
└───────────────────────────┘
Extremes:
│ 85.3475007793562 │
│ 85.3475007793562 │
└───────────────────────────┘
36 / 44

Task analysis
Task example: string search.
Possible aspects of a task
› Approximate or exact search
› Substring or regexp
› Single or multiple needles
› Single or multiple haystacks
› Short or long strings
› Bytes, unicode code points, real words
For every option can be created specialized algorithm
38 / 44

Concepts
› Take the best implementations
Example: simdjson, pdqsort
› Improve existent algorithms
Volnitsky -> MultiVolnitsky
memcpy -> memcpySmallAllowReadWriteOverflow15
› Use more optimal specializations
40 hash table implementations for GROUP BY
› Test performance on real data
Per-commit tests on real (obfuscated) dataset with page hits
› Profiling
39 / 44

GROUP BY
› Hash table
› Parallel
› Merging in
single thread
40 / 44

GROUP BY
Two level
› Split data to
256 buckets
› Merging in
multiple threads
› More efficient for
remote queries
41 / 44

Hash table specializations
› 8-bit or 16 bit key
lookup table
› 32, 64, 128, 256 bit key
32-bit hash for aggregating, 64-bit hash for merging
› several fixed size keys
represented as single integer if possible
› string key
store pre-calculated hash in hash table
small string optimization
› LowCardinality key
pre-calculated hash for dictionaries
pre-calculated bucket for consecutively repeated dictionaries
42 / 44

Conclusion
› Specialized algorithms and data structures
are necessary for the best performance
› Use the same ideas in your projects
› Contribute: https://github.com/ClickHouse/ClickHouse
43 / 44

21st Athens Big Data Meetup - 3rd Talk - Dive into ClickHouse query execution

Recommended

More Related Content

What's hot (20)

Similar to 21st Athens Big Data Meetup - 3rd Talk - Dive into ClickHouse query execution (20)

More from Athens Big Data (20)

Recently uploaded (20)

21st Athens Big Data Meetup - 3rd Talk - Dive into ClickHouse query execution