Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur, Rockset, Bruno Cadonna, Confluent) Kafka Summit 2020

Performance Tuning RocksDB
for Kafka Streams’ State Store
Dhruba Borthakur (Rockset), Bruno Cadonna (Confluent)

About the Presenters
Dhruba Borthakur
CTO & Co-founder Rockset
rockset.com
2
Bruno Cadonna
Contributor to Apache Kafka &
Software Engineer at Confluent
confluent.io

Agenda
• Kafka Streams and State Stores
• Introduction to RocksDB
• Compaction Styles in RocksDB
• Possible Operational Issues
• Tuning RocksDB
• RocksDB Command Line Utilities
• Takeaways
3

Kafka Streams and State Stores

Kafka Streams
5
● Stateless and stateful processors
● Stateful processors use state stores

Kafka Streams
6

Kafka Streams
7

Kafka Streams
8

Kafka Streams
9

Kafka Streams
10

Kafka Streams
11
● Create one topology per input partition, i.e., task

State Stores in Kafka Streams
12
• Stateful processor may use one or more state
stores
• Each partition has its own state store
Metrics &
De-/Serialization
Caching
Changelogging
Restoration

13
stores
• State stores are layered:
Metrics &
De-/Serialization
Caching
Changelogging
Restoration

14
Metrics &
De-/Serialization
Caching
Changelogging
Restoration
stores
• collects metrics and de-/serializes records

15
01
10
Metrics &
De-/Serialization
Caching
Changelogging
Restoration
stores

16
01
10
Metrics &
De-/Serialization
Caching
Changelogging
Restoration
stores
• caches records

01
10
17
01
10
Metrics &
De-/Serialization
Caching
Changelogging
Restoration
stores
• caches records
• writes records to changelog

01
10
18
01
10
01
10
Metrics &
De-/Serialization
Caching
Changelogging
Restoration
stores
• caches records

01
10
19
01
10
01
10
Metrics &
De-/Serialization
Caching
Changelogging
Restoration
stores
• caches records
• writes records to local state store

01
10
20
01
10
01
10
Metrics &
De-/Serialization
Caching
Changelogging
Restoration
stores
• caches records
• writes records to local state store
• State stores are restored from changelog
topics
• Restoration is byte-based and by-passes
wrapping layers

RocksDB is the Default State Store
• Kafka Streams needed a write optimized state store
• Kafka Streams 2.6 uses RocksDB 5.18.4
• Kafka Streams provides metrics to monitor RocksDB state stores
• RocksDB can be configured by passing a class that implements interface
RocksDBConfigSetter to configuration rocksdb.config.setter
21

Example: Configuring RocksDB in Kafka Streams
22
public static class MyRocksDBConfig implements RocksDBConfigSetter {
@Override
public void setConfig(final String storeName,
final Options options,
final Map<String, Object> configs) {
// e.g. set compaction style
options.setCompactionStyle(CompactionStyle.LEVEL);
}
@Override
public void close(final String storeName, final Options options) {}
}

What is RocksDB?
• Key-value persistent store
• Embedded C++ & Java library
• Server workloads
24

What is it not?
• Not distributed
• No failover
• Not highly-available. If the machine
dies, you lose your data
• Focus on performance
Kafka Streams makes it fault-tolerant
25

RocksDB API
• Keys and values are byte arrays
• Data are stored sorted by key
• Update Operations: Put/Delete/Merge
• Queries: Get/Iterator
26

Log Structured Merge Architecture
27
Periodic
compaction
Read only data
in SSD or disk
Read write data
in RAM
Transaction log
Scan request from
application
Write request
from application

RocksDB Write Path
28
Write request
Read only
MemTable
Log
Log
sst sst sst
sst sst sst
LS
Compaction
Flush
SwitchSwitch
Active
MemTable Log

RocksDB Reads
• Data can be in memory or disk
• Consult multiple files to find the latest
instance of the key
• Use bloom filters to reduce IO
• Every sst file has a bloom filter
• bloom filters are cached in memory
• default config: eliminates 99% of reads
29

RocksDB Read Path
30
Read only
MemTable Log
Log
sst sst sst
LS
Compaction
Flush
Active
MemTable Log
sst sst sst
Memory
Persistent
Storage
Read
request
Get(k)
Blooms

RocksDB Architecture
31
Read only
MemTable
Log
Log
sst sst sst
LS
Compaction
Flush
Active
MemTable Log
sst sst
Memory
Persistent
Storage
sst
Switch Switch
Write
request
Read only
BlockCache
Read
request

RocksDB Open & Pluggable
32
Pluggable
compaction
Pluggable sst
data format on
storage
Pluggable
Memtable
format in RAM
Transaction log
Blooms
Customizable
WAL
Get or scan request
from application
Write request
from application

What is Compaction
• Multi-threaded
• Parallel compactions on different parts of the database
• Deletes overwritten keys
• Two types of compactions
• level compactions
• universal compaction
34

Level compaction
• RocksDB default compaction is Level Compaction (for read heavy workloads)
• Stores data in multiple levels
• More recent data stored in L0
• Older data stored in Lmax
• Files in L0
• overlapping keys, sorted by flush time
• Files in L1 to Lmax
• non overlapping keys, sorted by key
• Max space amplification = 10%
https://github.com/facebook/rocksdb/wiki/Leveled-Compaction
35

Universal Compaction
• For write heavy workloads
• needed if Level style compaction is bottlenecked by disk throughout
• Stores all files in L0
• All files are arranged in time order
• Decreases write amplification but increases space amplification
• Pick up files that are chronologically adjacent to one another
• merge them
• replace them with a new file in L0
36

Operational Issues
• High memory usage
• Application gets slower or even crashes
• Operating system shows high memory usage
• Kafka Streams metrics for monitoring memory
usage of RocksDB (KIP-607, planned for 2.7)
show high values
38

Operational Issues
• High memory usage
• Application gets slower or even crashes
• Operating system shows high memory usage
• Kafka Streams metrics for monitoring memory
usage of RocksDB (KIP-607, planned for 2.7)
show high values
• High disk usage
• Application crashes with I/O errors
• Operating system shows high disk usage
39

Operational Issues
• High disk I/O
• Operating system shows high disk I/O
• Kafka Streams metrics with high values
• memtable-bytes-flushed-[rate | total]
• bytes-[read | written]-compaction-rate
• Kafka Streams metrics with low values
• memtable-hit-ratio
• block-cache-[data | index | filter]-hit-ratio
40

Operational Issues
• High disk I/O
• Operating system shows high disk I/O
• Kafka Streams metrics with high values
• memtable-bytes-flushed-[rate | total]
• bytes-[read | written]-compaction-rate
• Kafka Streams metrics with low values
• memtable-hit-ratio
• block-cache-[data | index | filter]-hit-ratio
• Write stalls
• Processing latency of the application increases
• Kafka Streams client gets kicked out of the group
• Kafka Streams metric write-stall-duration-[avg | total] shows high values
41

Operational Issues
• Too many open files
• Application crashes with I/O errors
• Kafka Streams metric number-open-files shows high values
42

Operational Issues
• Kafka Streams client gets kicked out of the consumer group during restoration
• Before 2.6 Kafka Streams used RocksDB’s bulk loading (Options#prepareForBulkLoad())
feature to restore the state store faster.
• Bulk loading basically consists of:
• disable automatic compaction and
• write all data to level 0
• trigger manual compaction
43

Operational Issues
• Manual compaction is a blocking call that may take longer than max.poll.interval.ms
44

Operational Issues
• Manual compaction is a blocking call that may take longer than max.poll.interval.ms
• Bulk loading is removed in 2.6
• Currently evaluating alternatives to increase the performance of state store restoration by using other
features of RocksDB, e.g., ingesting SST files directly.
45

Debug Kafka Streams OOM
• Memory consumption
• memtable (for writes)
• memtable size, number of memtables
• block cache (reads)
• configure to share among all the partitions in the kafka store
• Kafka Streams keeps index blocks in the block cache
• rocksdb-java bugs (https://github.com/facebook/rocksdb/issues/6247)
• High disk usage
• Use level compaction instead of universal compaction
• provision more disk space
https://docs.confluent.io/current/streams/developer-guide/memory-mgmt.html
47

Debug writes stalls
• Debug write stalls in RocksDB
• Is disk IO utilization at 100%?
• add more storage spindles
• use universal compaction
• Check number of background compaction threads
• Kafka Streams uses Max(2, number of available processors) by default
• Check memtable configuration
• AdvancedColumnFamilyOptions.max_write_buffer_number
• ColumnFamilyOptions.write_buffer_size
48

Debugging file descriptor issues
• Too many open files
• DBOptions.max_open_files = -1 (default)
• opens all sst files at db open time
• good for performance but can run out of file descriptors
• Increase operating system number of open file descriptors
• Set DBOptions.max_open_files = 10000
• will open a max of 10000 files concurrently
• Decrease number of files by making each file larger
• AdvancedColumnFamilyOptions.target_file_size_base = 128 MB (default is 64 MB)
49

RocksDB Command Line Utilities

Build rocksdb command line utilities
git clone git@github.com:facebook/rocksdb.git
cd rocksdb
make ldb sst_dump
cp ldb /usr/local/bin
cp sst_dump /usr/local/bin
51
Useful RocksDB command line tools: https://github.com/facebook/rocksdb/wiki/Administration-and-Data-
Access-Tool
Build
# change these values accordingly
APP_ID=my-app
STATE_STORE=my-counts
STATE_STORE_DIR=/tmp/kafka-streams
TASKS=$(ls $STATE_STORE_DIR/$APP_ID)
Change These Values

Useful commands
# View all keys
for i in $TASKS; do
ldb --db=$STATE_STORE_DIR/$APP_ID/$i/rocksdb/$STATE_STORE
scan 2>/dev/null;
done
# Show table properties
for i in $TASKS; do
TABLE_PROPERTIES=$(sst_dump --
file=$STATE_STORE_DIR/$APP_ID/$i/rocksdb/$STATE_STORE --
show_properties)
echo -e "Table properties for task:
$in$TABLE_PROPERTIESnn"
done
52

Useful commands- Example output
53
# example output
Table properties for task: 1_9
from [] to []
Process /tmp/kafka-streams/my-app/1_9/rocksdb/my-counts/000006.sst
Sst file format: block-based
Table Properties:
------------------------------
# data blocks: 1
# entries: 2
raw key size: 18
raw average key size: 9.000000
raw value size: 88
raw average value size: 44.000000
data block size: 125
index block size: 35
filter block size: 0
(estimated) table size: 160

Takeaways
• RocksDB is the default state store in Kafka Streams
• Kafka Streams provides functionality to configure and monitor RocksDB
• RocksDB uses a log structured merge (LSM) architecture with different compaction
styles
• You might run into operational issues, but you can solve them by debugging and tuning
RocksDB
• RocksDB offers command line utilities for analysing state stores
55

56
Thank you!
dhruba@rockset.com
bruno@confluent.io
cnfl.io/meetups cnfl.io/slackcnfl.io/blog
Learn how Rockset uses RocksDB
https://rockset.com/blog/how-we-use-rocksdb-at-rockset/

Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur, Rockset, Bruno Cadonna, Confluent) Kafka Summit 2020

Recommended

More Related Content

What's hot (20)

Similar to Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur, Rockset, Bruno Cadonna, Confluent) Kafka Summit 2020 (20)

More from confluent (20)

Recently uploaded (20)

Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur, Rockset, Bruno Cadonna, Confluent) Kafka Summit 2020