Storage Capacity Management on Multi-tenant Kafka Cluster with Nurettin Omeroglu

Storage Capacity Management
@Booking.com
Nurettin OMEROGLU

What happens when broker disk is FULL?
A) Only some producers fail
B) All producers fail
C) Kafka service fails

Streaming Infra Team
Nurettin OMEROGLU
Senior Software Engineer
I am a member of the Streaming Infra Team (10 people) and have more than 4 years of experience with Apache Kafka client-side and server-side components. We manage an on-prem Kafka solution serving clients that run on a variety of platforms, such as bare metal, Kubernetes and the cloud.

Agenda
1. Introduction
2. Before the project
3. Step by step capacity project
4. Future plans

● 100M monthly active app users
● 155,000 destinations around the world
● Car hire available in 140+ countries, and pre-booked taxis in over 500 cities across 120+ countries
● 243M+ verified guest reviews
● 24/7 customer service in 45 languages and dialects
● Since 2010, Booking.com has welcomed 4.5B+ guest arrivals
● 28M total reported listings worldwide
● 6.6M options in homes, apartments and other unique places to stay
● 30 different types of places to stay, including homes, apartments, B&Bs, hostels, farm stays, bungalows, even boats, igloos and treehouses
● 140 offices in 70 countries; over 5,000 employees in Amsterdam

Data Streaming Platform
● Transports and transposes data via pub/sub
● Connects applications through data pipelines
● Resilient, scalable, fault tolerant, secure, with SLO guarantees
[Diagram: Data Streaming Platform, with the labels Payments, A/B Tests, Events, Logs, Online ML, Fraud detection, Personalization, Bookings FPA reporting and Real-time analytics; the list MySQL, Cassandra, Hadoop, Cloud, ... appears twice]

Scale of Streaming @Booking.com
How much data? ~2.2PB
produced and consumed per day
How many clusters? 62
How many topics? ~34K
How many partitions? ~138K
How many servers? 900 Kafka brokers + 75 ZooKeeper nodes

Setup
● On-premise multi-tenant Kafka clusters running on bare metal
● Local SSD storage (~3.5TB per broker)
● 32-thread CPU / 256MB memory / 10 Gb network

Existing Components
● Custom Configuration validations
● Custom Quota validations
○ Topics per principal
○ Partitions per principal
…
● Topics
● Custom quotas
(booking-specific)
…
● Specific Configurations
● Custom Quotas
…
● Custom PrincipalBuilder
● Custom Policies
(AbstractPolicy)
○ AlterConfigPolicy
○ CreateTopicPolicy
…
MySQL (Metadata Store)
Bkstreaming CLI (self-service, home-built)
Kontrole (Control Center, home-built)
Kafka Cluster

Example Scenario for Custom Quota Validations
(1) Add topic for a service
(2) Auth: OK
(3) Topics per principal quota: OK
(4) Partitions per principal quota: OK
(5) Create topic
[Diagram components: MySQL (Metadata Store), Kontrole (Control Center), Kafka Cluster]

Reactive Approach
● Clients use the retention.ms configuration, which deletes messages after a certain amount of time.
● Dangerous situations if traffic spikes
● We were the middleman handling the toil /
issues between multiple tenants
○ Increase number of brokers, or
○ Determine noisy neighbors and
■ Throttle, or
■ Communicate with clients (night?)
● Lack of visibility and forecasting to plan ahead
[Diagram: broker disk shared among topics (Topic 1 to Topic 6), with reserved space for safety]

IDEA?
retention.bytes - which deletes the oldest messages
when the total size of a partition exceeds a threshold.
● Reserve storage per principal (quota)
● Let the clients manage their reserved storage
● Make retention.bytes mandatory on topic
● Feedback to clients around their usage/growth
Discarded Options:
● Kubernetes elasticity
● Network attached or remote storage options
[Diagram: broker disk divided into reserved quotas per principal, with reserved space for safety]

Determine cluster capacity
1) Periodically fetch information about the cluster from Cruise Control: number of available brokers, disk information, …
2) Use the minimum disk capacity among brokers to calculate cluster capacity
3) Target 90% disk usage (headroom)
Total capacity = (min broker disk * number of brokers) * 0.9
[Diagram components: Kontrole, Cruise Control, Graphite]
(1) Periodic cron job
(2) Available brokers, disk information
(3) Calculate capacity, publish metrics
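The capacity formula above can be sketched in a few lines. This is only an illustration: the `BrokerDisk` type, the function name and the sample disk sizes are assumptions, and the call to Cruise Control is left out entirely (the broker list is passed in as plain data).

```python
from dataclasses import dataclass

TARGET_USAGE_PERCENT = 90  # "Target 90% disk usage" from the slide


@dataclass(frozen=True)
class BrokerDisk:
    # Hypothetical shape of the per-broker data a periodic poll might yield.
    broker_id: int
    disk_capacity_bytes: int


def cluster_capacity_bytes(brokers, target_percent=TARGET_USAGE_PERCENT):
    """Total capacity = (min broker disk * number of brokers) * 0.9."""
    if not brokers:
        return 0
    min_disk = min(b.disk_capacity_bytes for b in brokers)
    # Integer arithmetic avoids float rounding on large byte counts.
    return min_disk * len(brokers) * target_percent // 100


GB = 10**9
brokers = [
    BrokerDisk(1, 3_500 * GB),
    BrokerDisk(2, 3_500 * GB),
    BrokerDisk(3, 4_000 * GB),  # larger disk; the extra 500 GB is not counted
]
total = cluster_capacity_bytes(brokers)
print(total // GB, "GB")
```

Using the smallest disk for every broker is deliberately conservative: it assumes partitions may land on any broker, so the least-provisioned broker sets the limit.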

New Quota + Topic level configuration
● Reserve storage per principal (quota) (default 500MB)
● Add property `topic_capacity_bytes` per Kafka topic (not visible to Kafka brokers) to manage retention.bytes
● We do all the calculations based on this value (including retention.bytes)
topic_capacity_bytes = retention.bytes * partition_count * replica_count
● Whenever there is a partition count increase (i.e. done via Kontrole), retention.bytes (per partition) is re-calculated accordingly.
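Rearranging the formula above gives the per-partition retention.bytes for a fixed topic_capacity_bytes. A minimal sketch follows; the function name and the sample numbers are assumptions, and rounding down is a choice made here so the topic never exceeds its budget (the slides do not say how rounding is handled).

```python
GiB = 1024**3


def retention_bytes_per_partition(topic_capacity_bytes, partition_count, replica_count):
    """Invert topic_capacity_bytes = retention.bytes * partition_count * replica_count."""
    if partition_count <= 0 or replica_count <= 0:
        raise ValueError("partition_count and replica_count must be positive")
    # Floor division: the sum over all replicas stays within the topic capacity.
    return topic_capacity_bytes // (partition_count * replica_count)


topic_capacity = 90 * GiB

# Initial layout: 10 partitions, replication factor 3.
before = retention_bytes_per_partition(topic_capacity, partition_count=10, replica_count=3)

# After a partition increase to 15, each partition gets a smaller share.
after = retention_bytes_per_partition(topic_capacity, partition_count=15, replica_count=3)

print(before // GiB, "GiB ->", after // GiB, "GiB per partition")
```

The topic's total footprint is unchanged by the partition increase; only the per-partition limit shrinks.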

New Quota Creation
(1) Create principal quota
(2) Get available brokers, disk information
(3) Get existing quotas
(4) Validate if new quota fits into cluster
(5) Save quota
[Diagram components: Kontrole, Cruise Control, MySQL]
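The validation step in this flow might look like the sketch below. It is an illustration under assumptions: an in-memory dict stands in for the MySQL metadata store, the cluster capacity is passed in rather than fetched from Cruise Control, and `QuotaRejected`, `create_principal_quota` and the principal names are hypothetical.

```python
class QuotaRejected(Exception):
    """Raised when a requested quota does not fit into the cluster capacity."""


def create_principal_quota(principal, requested_bytes, quotas, cluster_capacity_bytes):
    """Validate that the quota fits, then save it; return the unallocated remainder."""
    # Existing quotas of all other principals (a re-request replaces the old value).
    allocated = sum(size for name, size in quotas.items() if name != principal)
    if allocated + requested_bytes > cluster_capacity_bytes:
        raise QuotaRejected(
            f"{principal}: {requested_bytes} bytes requested, "
            f"only {cluster_capacity_bytes - allocated} bytes unallocated"
        )
    quotas[principal] = requested_bytes  # stands in for "(5) Save quota"
    return cluster_capacity_bytes - allocated - requested_bytes


GB = 10**9
capacity = 9_450 * GB
quotas = {"svc-a": 4_000 * GB, "svc-b": 3_000 * GB}

remaining = create_principal_quota("svc-c", 2_000 * GB, quotas, capacity)

try:
    create_principal_quota("svc-d", 1_000 * GB, quotas, capacity)
    rejected = False
except QuotaRejected:
    rejected = True
```

Here the first request fits and is saved, while the second is rejected because the sum of reserved quotas would exceed the cluster capacity.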

New Topic Creation
(1) Create topic with topic_capacity_bytes
(2) Get principal's quota
(3) Enough space for the new topic?
(4) No: reject and ask for a quota increase
(4) Yes: topic fits, go on! Create topic with relevant retention.bytes
[Diagram components: Kontrole, MySQL, Kafka Cluster]
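The decision in steps (2) to (4) can be sketched as a pure function. The function name, argument names and sample sizes are assumptions; the 500MB quota is the default mentioned earlier in the deck, and the returned dict only illustrates the topic config that would be passed on when creating the topic.

```python
MB = 10**6


def plan_topic_creation(topic_capacity_bytes, partition_count, replica_count,
                        principal_quota_bytes, existing_topic_capacities):
    """Return the topic config if the topic fits into the quota, else None."""
    used = sum(existing_topic_capacities)
    if used + topic_capacity_bytes > principal_quota_bytes:
        return None  # reject: the principal has to ask for a quota increase
    per_partition = topic_capacity_bytes // (partition_count * replica_count)
    # Kafka topic configs are string-valued, hence str().
    return {"retention.bytes": str(per_partition)}


quota = 500 * MB          # default quota per principal
existing = [200 * MB]     # topic_capacity_bytes of topics the principal already owns

accepted = plan_topic_creation(240 * MB, 4, 3, quota, existing)
existing.append(240 * MB)

# 440 MB is now reserved, so another 100 MB topic no longer fits.
rejected = plan_topic_creation(100 * MB, 4, 3, quota, existing)
```

Because the check runs against reserved capacity rather than actual disk usage, a topic is accepted or rejected the same way regardless of current traffic.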

Add Alerting
● Warn/notify before the topic_capacity_bytes configuration kicks in and starts deleting data.
● Actions:
○ reduce the retention.ms configuration, or
○ increase the topic capacity.
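One way to express such a warning is to compare each partition's size with its retention.bytes limit. The sketch below is an assumption throughout: the slides do not state a threshold, so the 80% default, the function name and the sample sizes are hypothetical, and fetching real partition sizes is out of scope.

```python
def partitions_near_limit(partition_sizes, retention_bytes, warn_percent=80):
    """Return partition ids whose size has reached warn_percent of retention.bytes."""
    return sorted(
        partition
        for partition, size in partition_sizes.items()
        # Integer comparison, equivalent to size / retention_bytes >= warn_percent / 100.
        if size * 100 >= warn_percent * retention_bytes
    )


sizes = {0: 700, 1: 850, 2: 990}  # bytes per partition (toy numbers)
to_warn = partitions_near_limit(sizes, retention_bytes=1000)
```

With these toy numbers partitions 1 and 2 are flagged, which gives the owning team time to reduce retention.ms or increase the topic capacity before size-based deletion starts.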

Onboard Existing Clusters
● Simulating scenarios on test cluster
● Operational documentation
● Stakeholder management
● Documentation for clients
● Enable capacity project on a cluster
○ Calculate / Add topic_capacity_bytes to each topic (with extra)
○ Calculate / Add quotas per principal
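The two calculation steps could be sketched as follows. The slides only say "with extra", so the 20% margin is an assumption, as are the function name, the topic and principal names, and the idea of deriving each principal's quota as the sum of its topics' capacities.

```python
def onboarding_plan(usage_by_topic, owner_by_topic, extra_percent=20):
    """Derive topic_capacity_bytes per topic and a quota per principal from current usage."""
    topic_capacity = {}
    principal_quota = {}
    for topic, used_bytes in usage_by_topic.items():
        # Current usage plus an extra margin, in integer arithmetic.
        capacity = used_bytes * (100 + extra_percent) // 100
        topic_capacity[topic] = capacity
        owner = owner_by_topic[topic]
        principal_quota[owner] = principal_quota.get(owner, 0) + capacity
    return topic_capacity, principal_quota


GB = 10**9
usage = {"orders": 100 * GB, "clicks": 250 * GB, "audit": 50 * GB}
owners = {"orders": "svc-a", "clicks": "svc-a", "audit": "svc-b"}

topic_capacity, principal_quota = onboarding_plan(usage, owners)
```

Starting every topic above its current usage means that enabling the project on a cluster does not immediately trigger size-based deletion.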

Migration Challenges
● Revert strategy
○ Dynamic flag to disable the project on cluster
● Sanity check if cluster is suitable
○ Brokers may have non-uniform storage capacity
○ With extras, all quotas may not fit into the available capacity
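Both sanity checks can be run before enabling the project on a cluster. A minimal sketch, assuming the capacity formula from earlier in the deck; the function name, the issue strings and the sample numbers are hypothetical.

```python
def cluster_suitability_issues(broker_disk_bytes, principal_quotas, target_percent=90):
    """Return a list of reasons why a cluster is not ready for onboarding."""
    if not broker_disk_bytes:
        return ["no brokers reported"]
    issues = []
    if len(set(broker_disk_bytes)) > 1:
        issues.append("brokers have non-uniform storage capacity")
    # Same formula as the cluster capacity calculation: min disk * brokers * 0.9.
    capacity = min(broker_disk_bytes) * len(broker_disk_bytes) * target_percent // 100
    if sum(principal_quotas.values()) > capacity:
        issues.append("quotas do not fit into the available capacity")
    return issues


GB = 10**9
problems = cluster_suitability_issues(
    [3_500 * GB, 3_500 * GB, 4_000 * GB],
    {"svc-a": 5_000 * GB, "svc-b": 5_000 * GB},
)
clean = cluster_suitability_issues(
    [3_500 * GB] * 3,
    {"svc-a": 4_000 * GB, "svc-b": 3_000 * GB},
)
```

A non-empty result would be a signal to fix the cluster (or trim the extras) first, with the dynamic flag as the fallback if problems show up after enabling.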

What is next?
● Allow teams to extend their quota if there is enough capacity
(self service)
● Send usage report to the teams, with the capacity allocated to the
principal vs. their usage
(cost attribution)

Questions?
