SlideShare a Scribd company logo
MULTI-TENANCY KAFKA
CLUSTER FOR LINE SERVICES
WITH 250 BILLION DAILY
MESSAGES
SPEAKER
● Yuto Kawamura
● Senior Software Engineer
● Lead of the team providing company-
wide Kafka platform
● Active in Apache Kafka Community
● Code contribution
● Speaking at Kafka Summit
Agenda • Multitenancy company-wide
Kafka Platform
APACHE KAFKA
● Middleware for streaming data
● Highly scalable/available
● Data persistency
● Supports Pub-Sub model
KAFKA COMPONENTS
Consumer
ConsumerProducer
Producer
Broker cluster
KAFKA PLATFORM
● Large scale Kafka clusters provided for any systems/services
inside LINE
● Started internally from messaging server development team
● Expanded to company-wide platform
USAGE
● Two kinds of usage
● Distributed task queue for buffering and processing business
logic asynchronously
● e.g, Queueing heavy task from web app server to background
task processor
● "Data Hub" for distributing data to other services
PUB-SUB EXAMPLE
SINGLE CLUSTER SHARED BY MANY SYSTEMS
● Concept of “Data Hub”
● Easy to find and access data
● Operational and management efficiency
● Not making operational cost to be proportional to users
● Concentrate engineering resources for maximizing reliability/
performance
MULTITENANCY
Blockchain
Platform
Data Analysis
System
Timeline
5 million / second
210TB
Daily inflow
4GB / second
50+
systems
250 billion
Daily messages
SCALE
● Cluster can protect itself against abusive workloads
● Accidental workload doesn't propagate to other users
● We can track on which client is sending requests
● Find source of strange requests
● Certain level of isolation among client workloads
● Slow response for one client doesn't appears to another client
MULTITENANCY REQUIREMENTS
● It is more important to manage number of requests over
incoming/outgoing byte rate
● Kafka is amazingly durable for large data
● Good design leveraging system functions
● Page cache for caching data
● sendfile(2) for zero copy transfer data
● Native batching
● Typical danger exists in clients sending many requests
PROTECT CLUSTER AGAINST ABUSING
REQUEST QUOTA
● Use Request Quota
● Limit “Time of broker threads that
can be used by each client group”
● Set default quota
● Prevent single client from
accidentally consuming all broker
resources
ISOLATION AMONG CLIENT WORKLOADS
● When can performance isolation among different clients violated?
● Let’s look at example of actual troubleshooting.
DETECTION
● 50x ~ 100x slower response time
in 99th %ile Produce response
time
● Normal: ~20ms
● Observed: 50ms ~ 200ms
FINDING #1
● Coincidental disk read of a
certain amount
FINDING #2
● Network threads' utilization was
very high
REQUEST HANDLING IN KAFKA BROKER
● Network Threads: Reads/
Writes request/response from/to
client sockets
● Request Handler Threads:
Processes requests and
produces response object
REQUEST HANDLING - READ REQUEST
REQUEST HANDLING - PROCESS
REQUEST HANDLING - WRITE RESPONSE
NETWORK THREAD RUNS EVENT LOOP
● Multiplex and processes client sockets assigned sequentially
● It never blocks awaiting IO completion
WHEN NETWORK THREADS GETS BUSY...
● It means one of following:
● 1. Really busy doing lots of work Many requests/responses to
read/write
● 2. Blocked by some operations (which should not happen in
event loop in general)
RESPONSE HANDLING - NORMAL REQUESTS
● When response is in queue, all
data to be transferred are in
memory
RESPONSE HANDLING - FETCH RESPONSE
● When response is in queue, topic
data segments are not in
userspace memory
● => Copy to client socket directly
inside the kernel using
sendfile(2) system call
IF TARGET DATA NOT EXISTS IN PAGE CACHE
● Target data in page cache:
● => Just a memory copy. Very fast: ~ 100us
● Target data is NOT in page cache:
● => Needs to load data from disk into page cache first. Can be
slow: ~ 50ms (or even slower)
● => If this happens in event loop…?
SUSPECTING BLOCKING IN SENDFILE(2)
● Inspect duration of sendfile system calls issued by broker
process
● How?
SYSTEMTAP
● A kernel layer dynamic tracing tool
and scripting language
● Safe to run in production because
of low overhead
● Alternatively: DTrace, eBPF, etc…
INSPECT SENDFILE(2) DURATION
PROBLEM HYPOTHESIS
SOLUTION
● Make sure that data is ready on memory before the response is
passed to the network thread
● => Event loop never blocks
WARMUP PAGE CACHE
● Move blocking part to request
handler threads (= single queue
and pool of threads)
WARMUP PAGE CACHE
● When Network thread calls
sendfile(2) for transferring log data,
it's always in page cache
WARMUP IMPLEMENTATION
● Easiest way: Do synchronous read(2) on target data
● => Large overhead by copying memory from kernel to
userland
● Why is Kafka using sendfile(2) for transferring topic data?
● => To avoid expensive large memory copy
● How can we achieve it keeping this property?
TRICK - ZERO COPY PAGE LOAD
● Call sendfile(2) for target data with
dest /dev/null
● The /dev/null driver does not
actually copy data to anywhere
WHY IT HAS ALMOST NO OVERHEAD?
● Linux kernel internally uses `splice` to implement sendfile(2)
● `splice` implementation of /dev/null returns without iterating
target data
PATCHING KAFKA
IT WORKED
● No response time degradation
with coincidence of Fetch request
reading disk
KAFKA-7504 Broker performance degradation caused by call of sendfile
reading disk in network thread - x50 ~ x100 response time reduction
KAFKA-4614 Long GC pause harming broker performance which is caused by
mmap objects created for OffsetIndex - x100 ~ response time reduction
KAFKA-6501 ReplicaFetcherThread should close the ReplicaFetcherBlockingSend
earlier on shutdown - Eliminate significant latency during broker restart
WHY NOT CONTRIBUTE IT BACK?
CONCLUSION
● Introduced our engineering for operating the company-wide
Kafka platform
● Quota, SystemTap and patch understanding system deeply
● After fixing some issues, our hosting policy is working well and
efficiently, keeping:
● concept of single "Data Hub" and
● operational cost not to be proportional to the number of users/
usages
● We are contributing to the world through OSS
… AND NEXT
● Kafka platform evolution
● Clients standardization and management
● Higher availability
● SRE team
● Planning to rollout new team for Reliability Engineering
● Share knowledge/tools which are independent from
Middleware
● Come and ask me more!
THANK YOU
Ad

More Related Content

What's hot (20)

containerdの概要と最近の機能
containerdの概要と最近の機能containerdの概要と最近の機能
containerdの概要と最近の機能
Kohei Tokunaga
 
PG-REXで学ぶPacemaker運用の実例
PG-REXで学ぶPacemaker運用の実例PG-REXで学ぶPacemaker運用の実例
PG-REXで学ぶPacemaker運用の実例
kazuhcurry
 
オンプレML基盤on Kubernetes 〜Yahoo! JAPAN AIPF〜
オンプレML基盤on Kubernetes 〜Yahoo! JAPAN AIPF〜オンプレML基盤on Kubernetes 〜Yahoo! JAPAN AIPF〜
オンプレML基盤on Kubernetes 〜Yahoo! JAPAN AIPF〜
Yahoo!デベロッパーネットワーク
 
OSC2011 Tokyo/Spring 自宅SAN友の会(前半)
OSC2011 Tokyo/Spring 自宅SAN友の会(前半)OSC2011 Tokyo/Spring 自宅SAN友の会(前半)
OSC2011 Tokyo/Spring 自宅SAN友の会(前半)
Satoshi Shimazaki
 
Dockerからcontainerdへの移行
Dockerからcontainerdへの移行Dockerからcontainerdへの移行
Dockerからcontainerdへの移行
Kohei Tokunaga
 
わかる!metadata.managedFields / Kubernetes Meetup Tokyo 48
わかる!metadata.managedFields / Kubernetes Meetup Tokyo 48わかる!metadata.managedFields / Kubernetes Meetup Tokyo 48
わかる!metadata.managedFields / Kubernetes Meetup Tokyo 48
Preferred Networks
 
[AKIBA.AWS] EC2の基礎 - パフォーマンスを100%引き出すオプション設定 -
[AKIBA.AWS] EC2の基礎 - パフォーマンスを100%引き出すオプション設定 -[AKIBA.AWS] EC2の基礎 - パフォーマンスを100%引き出すオプション設定 -
[AKIBA.AWS] EC2の基礎 - パフォーマンスを100%引き出すオプション設定 -
Shuji Kikuchi
 
アーキテクチャから理解するPostgreSQLのレプリケーション
アーキテクチャから理解するPostgreSQLのレプリケーションアーキテクチャから理解するPostgreSQLのレプリケーション
アーキテクチャから理解するPostgreSQLのレプリケーション
Masahiko Sawada
 
Prometheus at Preferred Networks
Prometheus at Preferred NetworksPrometheus at Preferred Networks
Prometheus at Preferred Networks
Preferred Networks
 
Apache Kafkaって本当に大丈夫?~故障検証のオーバービューと興味深い挙動の紹介~
Apache Kafkaって本当に大丈夫?~故障検証のオーバービューと興味深い挙動の紹介~Apache Kafkaって本当に大丈夫?~故障検証のオーバービューと興味深い挙動の紹介~
Apache Kafkaって本当に大丈夫?~故障検証のオーバービューと興味深い挙動の紹介~
NTT DATA OSS Professional Services
 
root権限無しでKubernetesを動かす
root権限無しでKubernetesを動かす root権限無しでKubernetesを動かす
root権限無しでKubernetesを動かす
Akihiro Suda
 
PostgreSQLモニタリングの基本とNTTデータが追加したモニタリング新機能(Open Source Conference 2021 Online F...
PostgreSQLモニタリングの基本とNTTデータが追加したモニタリング新機能(Open Source Conference 2021 Online F...PostgreSQLモニタリングの基本とNTTデータが追加したモニタリング新機能(Open Source Conference 2021 Online F...
PostgreSQLモニタリングの基本とNTTデータが追加したモニタリング新機能(Open Source Conference 2021 Online F...
NTT DATA Technology & Innovation
 
Grafana LokiではじめるKubernetesロギングハンズオン(NTT Tech Conference #4 ハンズオン資料)
Grafana LokiではじめるKubernetesロギングハンズオン(NTT Tech Conference #4 ハンズオン資料)Grafana LokiではじめるKubernetesロギングハンズオン(NTT Tech Conference #4 ハンズオン資料)
Grafana LokiではじめるKubernetesロギングハンズオン(NTT Tech Conference #4 ハンズオン資料)
NTT DATA Technology & Innovation
 
コンテナにおけるパフォーマンス調査でハマった話
コンテナにおけるパフォーマンス調査でハマった話コンテナにおけるパフォーマンス調査でハマった話
コンテナにおけるパフォーマンス調査でハマった話
Yuta Shimada
 
乗っ取れコンテナ!!開発者から見たコンテナセキュリティの考え方(CloudNative Days Tokyo 2021 発表資料)
乗っ取れコンテナ!!開発者から見たコンテナセキュリティの考え方(CloudNative Days Tokyo 2021 発表資料)乗っ取れコンテナ!!開発者から見たコンテナセキュリティの考え方(CloudNative Days Tokyo 2021 発表資料)
乗っ取れコンテナ!!開発者から見たコンテナセキュリティの考え方(CloudNative Days Tokyo 2021 発表資料)
NTT DATA Technology & Innovation
 
Docker Compose 徹底解説
Docker Compose 徹底解説Docker Compose 徹底解説
Docker Compose 徹底解説
Masahito Zembutsu
 
コンテナネットワーキング(CNI)最前線
コンテナネットワーキング(CNI)最前線コンテナネットワーキング(CNI)最前線
コンテナネットワーキング(CNI)最前線
Motonori Shindo
 
PGOを用いたPostgreSQL on Kubernetes入門(PostgreSQL Conference Japan 2022 発表資料)
PGOを用いたPostgreSQL on Kubernetes入門(PostgreSQL Conference Japan 2022 発表資料)PGOを用いたPostgreSQL on Kubernetes入門(PostgreSQL Conference Japan 2022 発表資料)
PGOを用いたPostgreSQL on Kubernetes入門(PostgreSQL Conference Japan 2022 発表資料)
NTT DATA Technology & Innovation
 
Metaspace
MetaspaceMetaspace
Metaspace
Yasumasa Suenaga
 
Db2 v11.5.4 高可用性構成 & HADR 構成パターンご紹介
Db2 v11.5.4 高可用性構成 & HADR 構成パターンご紹介Db2 v11.5.4 高可用性構成 & HADR 構成パターンご紹介
Db2 v11.5.4 高可用性構成 & HADR 構成パターンご紹介
IBM Analytics Japan
 
containerdの概要と最近の機能
containerdの概要と最近の機能containerdの概要と最近の機能
containerdの概要と最近の機能
Kohei Tokunaga
 
PG-REXで学ぶPacemaker運用の実例
PG-REXで学ぶPacemaker運用の実例PG-REXで学ぶPacemaker運用の実例
PG-REXで学ぶPacemaker運用の実例
kazuhcurry
 
OSC2011 Tokyo/Spring 自宅SAN友の会(前半)
OSC2011 Tokyo/Spring 自宅SAN友の会(前半)OSC2011 Tokyo/Spring 自宅SAN友の会(前半)
OSC2011 Tokyo/Spring 自宅SAN友の会(前半)
Satoshi Shimazaki
 
Dockerからcontainerdへの移行
Dockerからcontainerdへの移行Dockerからcontainerdへの移行
Dockerからcontainerdへの移行
Kohei Tokunaga
 
わかる!metadata.managedFields / Kubernetes Meetup Tokyo 48
わかる!metadata.managedFields / Kubernetes Meetup Tokyo 48わかる!metadata.managedFields / Kubernetes Meetup Tokyo 48
わかる!metadata.managedFields / Kubernetes Meetup Tokyo 48
Preferred Networks
 
[AKIBA.AWS] EC2の基礎 - パフォーマンスを100%引き出すオプション設定 -
[AKIBA.AWS] EC2の基礎 - パフォーマンスを100%引き出すオプション設定 -[AKIBA.AWS] EC2の基礎 - パフォーマンスを100%引き出すオプション設定 -
[AKIBA.AWS] EC2の基礎 - パフォーマンスを100%引き出すオプション設定 -
Shuji Kikuchi
 
アーキテクチャから理解するPostgreSQLのレプリケーション
アーキテクチャから理解するPostgreSQLのレプリケーションアーキテクチャから理解するPostgreSQLのレプリケーション
アーキテクチャから理解するPostgreSQLのレプリケーション
Masahiko Sawada
 
Prometheus at Preferred Networks
Prometheus at Preferred NetworksPrometheus at Preferred Networks
Prometheus at Preferred Networks
Preferred Networks
 
Apache Kafkaって本当に大丈夫?~故障検証のオーバービューと興味深い挙動の紹介~
Apache Kafkaって本当に大丈夫?~故障検証のオーバービューと興味深い挙動の紹介~Apache Kafkaって本当に大丈夫?~故障検証のオーバービューと興味深い挙動の紹介~
Apache Kafkaって本当に大丈夫?~故障検証のオーバービューと興味深い挙動の紹介~
NTT DATA OSS Professional Services
 
root権限無しでKubernetesを動かす
root権限無しでKubernetesを動かす root権限無しでKubernetesを動かす
root権限無しでKubernetesを動かす
Akihiro Suda
 
PostgreSQLモニタリングの基本とNTTデータが追加したモニタリング新機能(Open Source Conference 2021 Online F...
PostgreSQLモニタリングの基本とNTTデータが追加したモニタリング新機能(Open Source Conference 2021 Online F...PostgreSQLモニタリングの基本とNTTデータが追加したモニタリング新機能(Open Source Conference 2021 Online F...
PostgreSQLモニタリングの基本とNTTデータが追加したモニタリング新機能(Open Source Conference 2021 Online F...
NTT DATA Technology & Innovation
 
Grafana LokiではじめるKubernetesロギングハンズオン(NTT Tech Conference #4 ハンズオン資料)
Grafana LokiではじめるKubernetesロギングハンズオン(NTT Tech Conference #4 ハンズオン資料)Grafana LokiではじめるKubernetesロギングハンズオン(NTT Tech Conference #4 ハンズオン資料)
Grafana LokiではじめるKubernetesロギングハンズオン(NTT Tech Conference #4 ハンズオン資料)
NTT DATA Technology & Innovation
 
コンテナにおけるパフォーマンス調査でハマった話
コンテナにおけるパフォーマンス調査でハマった話コンテナにおけるパフォーマンス調査でハマった話
コンテナにおけるパフォーマンス調査でハマった話
Yuta Shimada
 
乗っ取れコンテナ!!開発者から見たコンテナセキュリティの考え方(CloudNative Days Tokyo 2021 発表資料)
乗っ取れコンテナ!!開発者から見たコンテナセキュリティの考え方(CloudNative Days Tokyo 2021 発表資料)乗っ取れコンテナ!!開発者から見たコンテナセキュリティの考え方(CloudNative Days Tokyo 2021 発表資料)
乗っ取れコンテナ!!開発者から見たコンテナセキュリティの考え方(CloudNative Days Tokyo 2021 発表資料)
NTT DATA Technology & Innovation
 
コンテナネットワーキング(CNI)最前線
コンテナネットワーキング(CNI)最前線コンテナネットワーキング(CNI)最前線
コンテナネットワーキング(CNI)最前線
Motonori Shindo
 
PGOを用いたPostgreSQL on Kubernetes入門(PostgreSQL Conference Japan 2022 発表資料)
PGOを用いたPostgreSQL on Kubernetes入門(PostgreSQL Conference Japan 2022 発表資料)PGOを用いたPostgreSQL on Kubernetes入門(PostgreSQL Conference Japan 2022 発表資料)
PGOを用いたPostgreSQL on Kubernetes入門(PostgreSQL Conference Japan 2022 発表資料)
NTT DATA Technology & Innovation
 
Db2 v11.5.4 高可用性構成 & HADR 構成パターンご紹介
Db2 v11.5.4 高可用性構成 & HADR 構成パターンご紹介Db2 v11.5.4 高可用性構成 & HADR 構成パターンご紹介
Db2 v11.5.4 高可用性構成 & HADR 構成パターンご紹介
IBM Analytics Japan
 

Similar to Multi-Tenancy Kafka cluster for LINE services with 250 billion daily messages (20)

Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
aspyker
 
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
Athens Big Data
 
Lookout on Scaling Security to 100 Million Devices
Lookout on Scaling Security to 100 Million DevicesLookout on Scaling Security to 100 Million Devices
Lookout on Scaling Security to 100 Million Devices
ScyllaDB
 
An Introduction to Apache Kafka
An Introduction to Apache KafkaAn Introduction to Apache Kafka
An Introduction to Apache Kafka
Amir Sedighi
 
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Monal Daxini
 
Introduction to apache kafka
Introduction to apache kafkaIntroduction to apache kafka
Introduction to apache kafka
Samuel Kerrien
 
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
DataWorks Summit/Hadoop Summit
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
Allen (Xiaozhong) Wang
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
Steven Wu
 
Storage and performance- Batch processing, Whiptail
Storage and performance- Batch processing, WhiptailStorage and performance- Batch processing, Whiptail
Storage and performance- Batch processing, Whiptail
Internet World
 
Slow things down to make them go faster [FOSDEM 2022]
Slow things down to make them go faster [FOSDEM 2022]Slow things down to make them go faster [FOSDEM 2022]
Slow things down to make them go faster [FOSDEM 2022]
Jimmy Angelakos
 
6 open capi_meetup_in_japan_final
6 open capi_meetup_in_japan_final6 open capi_meetup_in_japan_final
6 open capi_meetup_in_japan_final
Yutaka Kawai
 
Dark launching with Consul at Hootsuite - Bill Monkman
Dark launching with Consul at Hootsuite - Bill MonkmanDark launching with Consul at Hootsuite - Bill Monkman
Dark launching with Consul at Hootsuite - Bill Monkman
Ambassador Labs
 
High performace network of Cloud Native Taiwan User Group
High performace network of Cloud Native Taiwan User GroupHigh performace network of Cloud Native Taiwan User Group
High performace network of Cloud Native Taiwan User Group
HungWei Chiu
 
Pulsar - Distributed pub/sub platform
Pulsar - Distributed pub/sub platformPulsar - Distributed pub/sub platform
Pulsar - Distributed pub/sub platform
Matteo Merli
 
Building zero data loss pipelines with apache kafka
Building zero data loss pipelines with apache kafkaBuilding zero data loss pipelines with apache kafka
Building zero data loss pipelines with apache kafka
Avinash Ramineni
 
Multitenancy: Kafka clusters for everyone at LINE
Multitenancy: Kafka clusters for everyone at LINEMultitenancy: Kafka clusters for everyone at LINE
Multitenancy: Kafka clusters for everyone at LINE
kawamuray
 
Backing up Wikipedia Databases
Backing up Wikipedia DatabasesBacking up Wikipedia Databases
Backing up Wikipedia Databases
Jaime Crespo
 
Apache Kafka - Martin Podval
Apache Kafka - Martin PodvalApache Kafka - Martin Podval
Apache Kafka - Martin Podval
Martin Podval
 
Galera webinar migration to galera cluster from my sql async replication
Galera webinar migration to galera cluster from my sql async replicationGalera webinar migration to galera cluster from my sql async replication
Galera webinar migration to galera cluster from my sql async replication
Codership Oy - Creators of Galera Cluster
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
aspyker
 
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
Athens Big Data
 
Lookout on Scaling Security to 100 Million Devices
Lookout on Scaling Security to 100 Million DevicesLookout on Scaling Security to 100 Million Devices
Lookout on Scaling Security to 100 Million Devices
ScyllaDB
 
An Introduction to Apache Kafka
An Introduction to Apache KafkaAn Introduction to Apache Kafka
An Introduction to Apache Kafka
Amir Sedighi
 
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Monal Daxini
 
Introduction to apache kafka
Introduction to apache kafkaIntroduction to apache kafka
Introduction to apache kafka
Samuel Kerrien
 
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
DataWorks Summit/Hadoop Summit
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
Steven Wu
 
Storage and performance- Batch processing, Whiptail
Storage and performance- Batch processing, WhiptailStorage and performance- Batch processing, Whiptail
Storage and performance- Batch processing, Whiptail
Internet World
 
Slow things down to make them go faster [FOSDEM 2022]
Slow things down to make them go faster [FOSDEM 2022]Slow things down to make them go faster [FOSDEM 2022]
Slow things down to make them go faster [FOSDEM 2022]
Jimmy Angelakos
 
6 open capi_meetup_in_japan_final
6 open capi_meetup_in_japan_final6 open capi_meetup_in_japan_final
6 open capi_meetup_in_japan_final
Yutaka Kawai
 
Dark launching with Consul at Hootsuite - Bill Monkman
Dark launching with Consul at Hootsuite - Bill MonkmanDark launching with Consul at Hootsuite - Bill Monkman
Dark launching with Consul at Hootsuite - Bill Monkman
Ambassador Labs
 
High performace network of Cloud Native Taiwan User Group
High performace network of Cloud Native Taiwan User GroupHigh performace network of Cloud Native Taiwan User Group
High performace network of Cloud Native Taiwan User Group
HungWei Chiu
 
Pulsar - Distributed pub/sub platform
Pulsar - Distributed pub/sub platformPulsar - Distributed pub/sub platform
Pulsar - Distributed pub/sub platform
Matteo Merli
 
Building zero data loss pipelines with apache kafka
Building zero data loss pipelines with apache kafkaBuilding zero data loss pipelines with apache kafka
Building zero data loss pipelines with apache kafka
Avinash Ramineni
 
Multitenancy: Kafka clusters for everyone at LINE
Multitenancy: Kafka clusters for everyone at LINEMultitenancy: Kafka clusters for everyone at LINE
Multitenancy: Kafka clusters for everyone at LINE
kawamuray
 
Backing up Wikipedia Databases
Backing up Wikipedia DatabasesBacking up Wikipedia Databases
Backing up Wikipedia Databases
Jaime Crespo
 
Apache Kafka - Martin Podval
Apache Kafka - Martin PodvalApache Kafka - Martin Podval
Apache Kafka - Martin Podval
Martin Podval
 
Ad

More from LINE Corporation (20)

JJUG CCC 2018 Fall 懇親会LT
JJUG CCC 2018 Fall 懇親会LTJJUG CCC 2018 Fall 懇親会LT
JJUG CCC 2018 Fall 懇親会LT
LINE Corporation
 
Reduce dependency on Rx with Kotlin Coroutines
Reduce dependency on Rx with Kotlin CoroutinesReduce dependency on Rx with Kotlin Coroutines
Reduce dependency on Rx with Kotlin Coroutines
LINE Corporation
 
Kotlin/NativeでAndroidのNativeメソッドを実装してみた
Kotlin/NativeでAndroidのNativeメソッドを実装してみたKotlin/NativeでAndroidのNativeメソッドを実装してみた
Kotlin/NativeでAndroidのNativeメソッドを実装してみた
LINE Corporation
 
Use Kotlin scripts and Clova SDK to build your Clova extension
Use Kotlin scripts and Clova SDK to build your Clova extensionUse Kotlin scripts and Clova SDK to build your Clova extension
Use Kotlin scripts and Clova SDK to build your Clova extension
LINE Corporation
 
The Magic of LINE 購物 Testing
The Magic of LINE 購物 TestingThe Magic of LINE 購物 Testing
The Magic of LINE 購物 Testing
LINE Corporation
 
GA Test Automation
GA Test AutomationGA Test Automation
GA Test Automation
LINE Corporation
 
UI Automation Test with JUnit5
UI Automation Test with JUnit5UI Automation Test with JUnit5
UI Automation Test with JUnit5
LINE Corporation
 
Feature Detection for UI Testing
Feature Detection for UI TestingFeature Detection for UI Testing
Feature Detection for UI Testing
LINE Corporation
 
LINE 新星計劃介紹與新創團隊分享
LINE 新星計劃介紹與新創團隊分享LINE 新星計劃介紹與新創團隊分享
LINE 新星計劃介紹與新創團隊分享
LINE Corporation
 
​LINE 技術合作夥伴與應用分享
​LINE 技術合作夥伴與應用分享​LINE 技術合作夥伴與應用分享
​LINE 技術合作夥伴與應用分享
LINE Corporation
 
LINE 開發者社群經營與技術推廣
LINE 開發者社群經營與技術推廣LINE 開發者社群經營與技術推廣
LINE 開發者社群經營與技術推廣
LINE Corporation
 
日本開發者大會短講分享
日本開發者大會短講分享日本開發者大會短講分享
日本開發者大會短講分享
LINE Corporation
 
LINE Chatbot - 活動報名報到設計分享
LINE Chatbot - 活動報名報到設計分享LINE Chatbot - 活動報名報到設計分享
LINE Chatbot - 活動報名報到設計分享
LINE Corporation
 
在 LINE 私有雲中使用 Managed Kubernetes
在 LINE 私有雲中使用 Managed Kubernetes在 LINE 私有雲中使用 Managed Kubernetes
在 LINE 私有雲中使用 Managed Kubernetes
LINE Corporation
 
LINE TODAY高效率的敏捷測試開發技巧
LINE TODAY高效率的敏捷測試開發技巧LINE TODAY高效率的敏捷測試開發技巧
LINE TODAY高效率的敏捷測試開發技巧
LINE Corporation
 
LINE 區塊鏈平台及代幣經濟 - LINK Chain及LINK介紹
LINE 區塊鏈平台及代幣經濟 - LINK Chain及LINK介紹LINE 區塊鏈平台及代幣經濟 - LINK Chain及LINK介紹
LINE 區塊鏈平台及代幣經濟 - LINK Chain及LINK介紹
LINE Corporation
 
LINE Things - LINE IoT平台新技術分享
LINE Things - LINE IoT平台新技術分享LINE Things - LINE IoT平台新技術分享
LINE Things - LINE IoT平台新技術分享
LINE Corporation
 
LINE Pay - 一卡通支付新體驗
LINE Pay - 一卡通支付新體驗LINE Pay - 一卡通支付新體驗
LINE Pay - 一卡通支付新體驗
LINE Corporation
 
LINE Platform API Update - 打造一個更好的Chatbot服務
LINE Platform API Update - 打造一個更好的Chatbot服務LINE Platform API Update - 打造一個更好的Chatbot服務
LINE Platform API Update - 打造一個更好的Chatbot服務
LINE Corporation
 
Keynote - ​LINE 的技術策略佈局與跨國產品開發
Keynote - ​LINE 的技術策略佈局與跨國產品開發Keynote - ​LINE 的技術策略佈局與跨國產品開發
Keynote - ​LINE 的技術策略佈局與跨國產品開發
LINE Corporation
 
JJUG CCC 2018 Fall 懇親会LT
JJUG CCC 2018 Fall 懇親会LTJJUG CCC 2018 Fall 懇親会LT
JJUG CCC 2018 Fall 懇親会LT
LINE Corporation
 
Reduce dependency on Rx with Kotlin Coroutines
Reduce dependency on Rx with Kotlin CoroutinesReduce dependency on Rx with Kotlin Coroutines
Reduce dependency on Rx with Kotlin Coroutines
LINE Corporation
 
Kotlin/NativeでAndroidのNativeメソッドを実装してみた
Kotlin/NativeでAndroidのNativeメソッドを実装してみたKotlin/NativeでAndroidのNativeメソッドを実装してみた
Kotlin/NativeでAndroidのNativeメソッドを実装してみた
LINE Corporation
 
Use Kotlin scripts and Clova SDK to build your Clova extension
Use Kotlin scripts and Clova SDK to build your Clova extensionUse Kotlin scripts and Clova SDK to build your Clova extension
Use Kotlin scripts and Clova SDK to build your Clova extension
LINE Corporation
 
The Magic of LINE 購物 Testing
The Magic of LINE 購物 TestingThe Magic of LINE 購物 Testing
The Magic of LINE 購物 Testing
LINE Corporation
 
UI Automation Test with JUnit5
UI Automation Test with JUnit5UI Automation Test with JUnit5
UI Automation Test with JUnit5
LINE Corporation
 
Feature Detection for UI Testing
Feature Detection for UI TestingFeature Detection for UI Testing
Feature Detection for UI Testing
LINE Corporation
 
LINE 新星計劃介紹與新創團隊分享
LINE 新星計劃介紹與新創團隊分享LINE 新星計劃介紹與新創團隊分享
LINE 新星計劃介紹與新創團隊分享
LINE Corporation
 
​LINE 技術合作夥伴與應用分享
​LINE 技術合作夥伴與應用分享​LINE 技術合作夥伴與應用分享
​LINE 技術合作夥伴與應用分享
LINE Corporation
 
LINE 開發者社群經營與技術推廣
LINE 開發者社群經營與技術推廣LINE 開發者社群經營與技術推廣
LINE 開發者社群經營與技術推廣
LINE Corporation
 
日本開發者大會短講分享
日本開發者大會短講分享日本開發者大會短講分享
日本開發者大會短講分享
LINE Corporation
 
LINE Chatbot - 活動報名報到設計分享
LINE Chatbot - 活動報名報到設計分享LINE Chatbot - 活動報名報到設計分享
LINE Chatbot - 活動報名報到設計分享
LINE Corporation
 
在 LINE 私有雲中使用 Managed Kubernetes
在 LINE 私有雲中使用 Managed Kubernetes在 LINE 私有雲中使用 Managed Kubernetes
在 LINE 私有雲中使用 Managed Kubernetes
LINE Corporation
 
LINE TODAY高效率的敏捷測試開發技巧
LINE TODAY高效率的敏捷測試開發技巧LINE TODAY高效率的敏捷測試開發技巧
LINE TODAY高效率的敏捷測試開發技巧
LINE Corporation
 
LINE 區塊鏈平台及代幣經濟 - LINK Chain及LINK介紹
LINE 區塊鏈平台及代幣經濟 - LINK Chain及LINK介紹LINE 區塊鏈平台及代幣經濟 - LINK Chain及LINK介紹
LINE 區塊鏈平台及代幣經濟 - LINK Chain及LINK介紹
LINE Corporation
 
LINE Things - LINE IoT平台新技術分享
LINE Things - LINE IoT平台新技術分享LINE Things - LINE IoT平台新技術分享
LINE Things - LINE IoT平台新技術分享
LINE Corporation
 
LINE Pay - 一卡通支付新體驗
LINE Pay - 一卡通支付新體驗LINE Pay - 一卡通支付新體驗
LINE Pay - 一卡通支付新體驗
LINE Corporation
 
LINE Platform API Update - 打造一個更好的Chatbot服務
LINE Platform API Update - 打造一個更好的Chatbot服務LINE Platform API Update - 打造一個更好的Chatbot服務
LINE Platform API Update - 打造一個更好的Chatbot服務
LINE Corporation
 
Keynote - ​LINE 的技術策略佈局與跨國產品開發
Keynote - ​LINE 的技術策略佈局與跨國產品開發Keynote - ​LINE 的技術策略佈局與跨國產品開發
Keynote - ​LINE 的技術策略佈局與跨國產品開發
LINE Corporation
 
Ad

Recently uploaded (20)

Role of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
 
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
 
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager APIUiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPathCommunity
 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
 
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
Drupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy ConsumptionDrupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy Consumption
Exove
 
Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025
Splunk
 
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfThe Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
Abi john
 
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
organizerofv
 
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxDevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
 
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-UmgebungenHCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
panagenda
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
AI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global TrendsAI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global Trends
InData Labs
 
Cybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure ADCybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure AD
VICTOR MAESTRE RAMIREZ
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptxSpecial Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
shyamraj55
 
Role of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
 
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
 
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager APIUiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPathCommunity
 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
 
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
Drupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy ConsumptionDrupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy Consumption
Exove
 
Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025
Splunk
 
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfThe Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
Abi john
 
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
organizerofv
 
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxDevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
 
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-UmgebungenHCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
panagenda
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
AI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global TrendsAI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global Trends
InData Labs
 
Cybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure ADCybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure AD
VICTOR MAESTRE RAMIREZ
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptxSpecial Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
shyamraj55
 

Multi-Tenancy Kafka cluster for LINE services with 250 billion daily messages

  • 1. MULTI-TENANCY KAFKA CLUSTER FOR LINE SERVICES WITH 250 BILLION DAILY MESSAGES
  • 2. SPEAKER ● Yuto Kawamura ● Senior Software Engineer ● Lead of the team providing company- wide Kafka platform ● Active in Apache Kafka Community ● Code contribution ● Speaking at Kafka Summit
  • 3. Agenda • Multitenancy company-wide Kafka Platform
  • 4. APACHE KAFKA ● Middleware for streaming data ● Highly scalable/available ● Data persistency ● Supports Pub-Sub model
  • 6. KAFKA PLATFORM ● Large scale Kafka clusters provided for any systems/services inside LINE ● Started internally from messaging server development team ● Expanded to company-wide platform
  • 7. USAGE ● Two kinds of usage ● Distributed task queue for buffering and processing business logic asynchronously ● e.g, Queueing heavy task from web app server to background task processor ● "Data Hub" for distributing data to other services
  • 9. SINGLE CLUSTER SHARED BY MANY SYSTEMS ● Concept of “Data Hub” ● Easy to find and access data ● Operational and management efficiency ● Not making operational cost to be proportional to users ● Concentrate engineering resources for maximizing reliability/ performance
  • 11. 5 million / second 210TB Daily inflow 4GB / second 50+ systems 250 billion Daily messages SCALE
  • 12. ● Cluster can protect itself against abusive workloads ● Accidental workload doesn't propagate to other users ● We can track on which client is sending requests ● Find source of strange requests ● Certain level of isolation among client workloads ● Slow response for one client doesn't appears to another client MULTITENANCY REQUIREMENTS
  • 13. ● It is more important to manage number of requests over incoming/outgoing byte rate ● Kafka is amazingly durable for large data ● Good design leveraging system functions ● Page cache for caching data ● sendfile(2) for zero copy transfer data ● Native batching ● Typical danger exists in clients sending many requests PROTECT CLUSTER AGAINST ABUSING
  • 14. REQUEST QUOTA ● Use Request Quota ● Limit “Time of broker threads that can be used by each client group” ● Set default quota ● Prevent single client from accidentally consuming all broker resources
  • 15. ISOLATION AMONG CLIENT WORKLOADS ● When can performance isolation among different clients violated? ● Let’s look at example of actual troubleshooting.
  • 16. DETECTION ● 50x ~ 100x slower response time in 99th %ile Produce response time ● Normal: ~20ms ● Observed: 50ms ~ 200ms
  • 17. FINDING #1 ● Coincidental disk read of a certain amount
  • 18. FINDING #2 ● Network threads' utilization was very high
  • 19. REQUEST HANDLING IN KAFKA BROKER ● Network Threads: Reads/ Writes request/response from/to client sockets ● Request Handler Threads: Processes requests and produces response object
  • 20. REQUEST HANDLING - READ REQUEST
  • 22. REQUEST HANDLING - WRITE RESPONSE
  • 23. NETWORK THREAD RUNS EVENT LOOP ● Multiplex and processes client sockets assigned sequentially ● It never blocks awaiting IO completion
  • 24. WHEN NETWORK THREADS GETS BUSY... ● It means one of following: ● 1. Really busy doing lots of work Many requests/responses to read/write ● 2. Blocked by some operations (which should not happen in event loop in general)
  • 25. RESPONSE HANDLING - NORMAL REQUESTS ● When response is in queue, all data to be transferred are in memory
  • 26. RESPONSE HANDLING - FETCH RESPONSE ● When response is in queue, topic data segments are not in userspace memory ● => Copy to client socket directly inside the kernel using sendfile(2) system call
  • 27. IF TARGET DATA NOT EXISTS IN PAGE CACHE ● Target data in page cache: ● => Just a memory copy. Very fast: ~ 100us ● Target data is NOT in page cache: ● => Needs to load data from disk into page cache first. Can be slow: ~ 50ms (or even slower) ● => If this happens in event loop…?
  • 28. SUSPECTING BLOCKING IN SENDFILE(2) ● Inspect duration of sendfile system calls issued by broker process ● How?
  • 29. SYSTEMTAP ● A kernel layer dynamic tracing tool and scripting language ● Safe to run in production because of low overhead ● Alternatively: DTrace, eBPF, etc…
  • 32. SOLUTION ● Make sure that data is ready on memory before the response is passed to the network thread ● => Event loop never blocks
  • 33. WARMUP PAGE CACHE ● Move blocking part to request handler threads (= single queue and pool of threads)
  • 34. WARMUP PAGE CACHE ● When Network thread calls sendfile(2) for transferring log data, it's always in page cache
  • 35. WARMUP IMPLEMENTATION ● Easiest way: Do synchronous read(2) on target data ● => Large overhead by copying memory from kernel to userland ● Why is Kafka using sendfile(2) for transferring topic data? ● => To avoid expensive large memory copy ● How can we achieve it keeping this property?
  • 36. TRICK - ZERO COPY PAGE LOAD ● Call sendfile(2) for target data with dest /dev/null ● The /dev/null driver does not actually copy data to anywhere
  • 37. WHY IT HAS ALMOST NO OVERHEAD? ● Linux kernel internally uses `splice` to implement sendfile(2) ● `splice` implementation of /dev/null returns without iterating target data
  • 39. IT WORKED ● No response time degradation with coincidence of Fetch request reading disk
  • 40. KAFKA-7504 Broker performance degradation caused by call of sendfile reading disk in network thread - x50 ~ x100 response time reduction KAFKA-4614 Long GC pause harming broker performance which is caused by mmap objects created for OffsetIndex - x100 ~ response time reduction KAFKA-6501 ReplicaFetcherThread should close the ReplicaFetcherBlockingSend earlier on shutdown - Eliminate significant latency during broker restart WHY NOT CONTRIBUTE IT BACK?
  • 41. CONCLUSION ● Introduced our engineering for operating the company-wide Kafka platform ● Quota, SystemTap and patch understanding system deeply ● After fixing some issues, our hosting policy is working well and efficiently, keeping: ● concept of single "Data Hub" and ● operational cost not to be proportional to the number of users/ usages ● We are contributing to the world through OSS
  • 42. … AND NEXT ● Kafka platform evolution ● Clients standardization and management ● Higher availability ● SRE team ● Planning to rollout new team for Reliability Engineering ● Share knowledge/tools which are independent from Middleware ● Come and ask me more!