Welcome!
To the 3rd Community Over Code Performance Engineering Track
Track chairs: Paul Brebner, Roger Abelenda
1st New Orleans 2022
Technologies: Lucene, Ozone, Kafka, Cassandra, JMeter, and Spark/ML
2nd Beijing 2023
Technologies: Kafka, JMeter, Arrow, Java profiling, Spark & Flink, Hadoop
3rd Halifax 2023
Technologies: Kafka, Ozone, Cassandra, Camel, JMeter/Selenium, Lucene
And maybe 4th EU 2024
Talks (3 x 2 = 6)
11:20 Paul Brebner
Developing Fast Applications With Open Source Software - Without The Fury (Kafka)
12:10 Duong Nguyen, Tanvi Penumudy, Ritesh Shukla
Design patterns and then the road to realize billions of objects, and exabytes of capacity, while preserving performance in Apache Ozone
--- LUNCH
2:20 German Eichberger, Pallavi Iyengar
Performance measurement and tuning of Cassandra 5.0 transactions on Cloud infrastructure
3:10 Otavio Piske
Hunting Performance Monsters on the Back of a Camel
--- COFFEE
4:10 Roger Abelenda
Quick load testing from Selenium scripts
5:00 Stefan Vodita
Lessons Learned from Benchmarking Amazon’s E-commerce Search Engine
© Instaclustr Pty Limited, 2023
Fast Open Source Software - Without The Fury
Paul Brebner - Open Source Technology Evangelist
www.instaclustr.com/paul-brebner/
Community Over Code Halifax October 2023
Fast Cars?!
“NetApp helps Aston Martin F1 use data to go faster”
Massive amounts of data are captured and used to gain milliseconds of performance improvements on race days
Using NetApp Cloud and Storage Technologies
Fast Cars?! “Fast & Furious”
“Fast & Furious” Cars for my Grandson
(unopened as they are “collectables”)
Fast Open Source Software
(Source: wikimedia.org)
Enabled by Scalable Big Data Technologies (e.g. Apache Kafka®)
(Source: wikimedia.org)
And Cloud Infrastructure
(Source: wikimedia.org)
Cars run on roads
S/W runs on H/W
Without the Fury
Fury 1 Too Many Kafka Topics
Fury 2 Slow Consumers
Fury 3 Too Many Kafka Partitions
Fury 4 Single Threaded Consumers: concurrency limited by partitions
Fury 5 Operational Problems
(Source: Shutterstock)
Cloud Platform for Big Data Open Source Technologies
Focus of this talk is on Apache Kafka®
Instaclustr Managed Platform
What is Kafka?
Kafka is a distributed stream processing system: it allows distributed producers to send messages to distributed consumers via a Kafka cluster.
Partitions Enable Concurrency: Cluster and Producers
Partitions Enable Concurrency: Cluster and Consumers
[Diagram labels: Producer; Topic with Partition 1, Partition 2, ... Partition n (and followers); Consumer Group with three Consumers]
Consumers share work within groups
Partitions enable Consumers to share work (c.f. Amish barn raising) within a consumer group
Multiple Groups Enable Message Broadcasting
Messages are duplicated (c.f. clones) across groups, as each consumer group receives a copy of each message.
[Diagram labels: Producer; Topic with Partition 1, Partition 2, ... Partition n; two Consumer Groups, each with two Consumers]
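The two ideas above (sharing work within a group, broadcasting across groups) can be illustrated with a toy simulation. This is only a sketch and does not use a Kafka client: the round-robin assignment, the `key % num_partitions` partitioner, and the group and consumer names are all assumptions made for illustration, not how a real Kafka cluster assigns partitions.

```python
from collections import defaultdict


def assign_partitions(partitions, consumers):
    """Spread the partitions of one topic round-robin over one group's consumers."""
    assignment = defaultdict(list)
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return dict(assignment)


def deliver(messages, groups, num_partitions):
    """Return {group: {consumer: [values]}}.

    Every group receives every message once (broadcast across groups);
    inside a group each message goes to the single consumer that owns
    the message's partition (work sharing within the group).
    """
    delivered = {}
    for group, consumers in groups.items():
        assignment = assign_partitions(list(range(num_partitions)), consumers)
        owner = {p: c for c, parts in assignment.items() for p in parts}
        per_consumer = {c: [] for c in consumers}
        for key, value in messages:
            partition = key % num_partitions  # toy partitioner for integer keys
            per_consumer[owner[partition]].append(value)
        delivered[group] = per_consumer
    return delivered


messages = [(k, f"event-{k}") for k in range(6)]
groups = {"group-a": ["a1", "a2", "a3"], "group-b": ["b1"]}  # hypothetical names
result = deliver(messages, groups, num_partitions=3)

for group, per_consumer in result.items():
    print(group, per_consumer)
```

In this sketch group-a splits the six events over three consumers, while group-b's single consumer receives all six: the same messages are consumed twice, once per group.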
“Kongo” Logistics IoT Application
Fury 1: Too Many Kafka Topics
Design Choices: Many vs. One Topic?
• 100s of locations (Warehouses, Trucks)
• Each location has a topic and multiple consumer groups (so all the Goods in a location receive relevant events)
Option 1:
• Many topics
• Many consumer groups per topic -> high fan-out
1. Many (100s of) Topics
2. One Topic, One Consumer Group
• One topic for all locations
• Using an external notification mechanism for event broadcasting (Guava Event Bus)
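A rough sketch of option 2: records from the single topic are consumed once and then fanned out in-process to the interested Goods objects. The talk used Guava Event Bus (a Java library); the `EventBus` class below is a minimal Python stand-in written for this example, not the Guava API, and the location names, `Goods` class and payloads are hypothetical.

```python
from collections import defaultdict


class EventBus:
    """Tiny in-process publish/subscribe bus keyed by location (illustrative only)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, location, handler):
        self._subscribers[location].append(handler)

    def post(self, location, event):
        """Call every handler registered for the location; return how many were called."""
        handlers = self._subscribers.get(location, [])
        for handler in handlers:
            handler(event)
        return len(handlers)


class Goods:
    """An item that wants to hear about events at its current location."""

    def __init__(self, name):
        self.name = name
        self.events = []

    def on_event(self, event):
        self.events.append(event)


bus = EventBus()
crate, pallet, parcel = Goods("crate"), Goods("pallet"), Goods("parcel")
bus.subscribe("warehouse-1", crate.on_event)
bus.subscribe("warehouse-1", pallet.on_event)
bus.subscribe("truck-7", parcel.on_event)

# Stand-in for records polled from the one Kafka topic: (location, payload).
polled = [
    ("warehouse-1", "temp=4C"),
    ("truck-7", "vibration=high"),
    ("warehouse-1", "temp=5C"),
]
for location, payload in polled:
    bus.post(location, payload)

print(crate.events, pallet.events, parcel.events)
```

The fan-out now happens in application memory after a single consumer group has read each record, rather than in Kafka through many topics and consumer groups.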
Single Topic/Single Consumer Group Wins
[Bar chart, TPS: many topics, many consumer groups: 7,200; single topic, single consumer group: 1,120,000]
155 times better!
But why?
Trains Are More Scalable Than Cars
Train in the Canadian Rockies (Source: Getty Images)
High Fan-Out = Lots of On/Off Ramps
(Source: Shutterstock)
Explanation: High fan-out = lots of output data and many consumer groups (resource intensive)
Many Topics = Traffic Jam
(Source: Shutterstock)
Explanation: More topics -> more partitions
But Kafka Is Scalable! Bigger Clusters
(Source: Wikimedia)
Add more lanes!
Vertical/Scale-Up: Increase Node Sizes
Horizontal/Scale-Out: Add More Nodes
Add more roads!
(Source: Shutterstock)
Fury 2: Slow Consumers
Kafka+Cassandra Anomaly Detector Application
Massively Scalable Anomaly Detection: Tuning Knobs (Orange h/w, Yellow s/w)
• Initially just increased h/w resources
• Scaling was “easy” with Kubernetes
• Easy to create lots of consumers (100s)
• Initially single threaded Kafka consumer (no thread pools)
But Scalability Not Great: Scaling Is Too Easy, Scalability Harder
[Chart: Total Cores vs. Billions of Checks/Day (pre-tuning); x-axis: Total Cores (0 to 700); y-axis: Billions of checks/day (0 to 8)]
Slow Kafka Consumers Problem
Default Kafka consumers:
• Are single-threaded
• If the processing is “slow” then queuing occurs, as the thread is blocked, reducing throughput
• Solutions include speeding up the processing, or increasing the number of consumers
• But more consumers -> more partitions, as each consumer needs 1 or more partitions
(Source: Getty Images)
Single Threaded Kafka Consumers And Slow Processing = Slow Consumers
SLOW: > 10 ms
• Slow consumers -> need more consumers and also more partitions for higher throughput
• But adding more consumers is slow; try speeding them up
We Need Some Car Mods (Hacks)
(Source: Getty Images)
Multi-Threaded/Two Pool Consumers
The famous Bondi Ocean Pool (in Sydney, Australia) has 2 pools
(Source: Shutterstock)
Tuning: Optimize Consumer Speed/Concurrency Using 2 Stage Pipeline
1. Speed up polling (thread pool 1)
2. Maximize anomaly detector concurrency (thread pool 2)
Fewer consumers (around 100) give higher throughput: a surprise!
Result: Reduces the number of consumers and therefore partitions needed and gives higher throughput. Why?
Don’t more partitions give higher throughput?! Answer in part 3.
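The two-pool idea can be sketched without a Kafka client. In this illustrative example one thread plays the role of the polling pool and hands records over a queue to a second pool that runs the slow step. The `detect_anomaly` function, its 10 ms sleep, the batch contents and the pool sizes are assumptions for the demo, not the anomaly detector's actual code.

```python
import queue
import time
from concurrent.futures import ThreadPoolExecutor

STOP = object()  # sentinel that tells the dispatcher polling has finished


def detect_anomaly(record):
    """Stand-in for the slow step (for example a database read plus a check)."""
    time.sleep(0.01)  # simulate roughly 10 ms of blocking work
    return record, record % 7 == 0


def run_pipeline(batches, workers):
    """Stage 1 polls batches into a queue; stage 2 processes records concurrently."""
    handoff = queue.Queue(maxsize=1000)

    def poller():
        # Stand-in for a Kafka poll loop: each batch is one poll() result.
        for batch in batches:
            for record in batch:
                handoff.put(record)
        handoff.put(STOP)

    with ThreadPoolExecutor(max_workers=1) as poll_pool, \
            ThreadPoolExecutor(max_workers=workers) as work_pool:
        poll_pool.submit(poller)
        futures = []
        while True:
            record = handoff.get()
            if record is STOP:
                break
            futures.append(work_pool.submit(detect_anomaly, record))
        # Collect in submission order so the output order is deterministic.
        return [f.result() for f in futures]


batches = [list(range(i, i + 10)) for i in range(0, 40, 10)]

start = time.perf_counter()
serial = run_pipeline(batches, workers=1)
serial_s = time.perf_counter() - start

start = time.perf_counter()
parallel = run_pipeline(batches, workers=8)
parallel_s = time.perf_counter() - start

print(f"1 worker: {serial_s:.2f}s, 8 workers: {parallel_s:.2f}s")
```

With one worker the 40 records are processed one after another; with eight workers the same single "consumer" overlaps the blocking work. A real consumer built this way would also have to decide when it is safe to commit offsets for records that are still being processed, which this sketch leaves out.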
Scalability Post-Tuning: 7.5 to 19 Billion Checks/Day, 2.5 Times Improvement
[Chart: Total Cores vs. Billions of Checks/Day; x-axis: Total Cores (0 to 700); y-axis: Billions of checks/day (0 to 20); series: pre-tuning, post-tuning]
Fury 3: Too Many Partitions
What’s really going on under the Kafka Bonnet?
(Source: Adobe Stock)
Partitions = Pistons (Cylinders)
(Source: Getty Images)
But How Many?
Isetta 1-cylinder car (Source: Wikimedia)
1 piston isn’t very powerful: Isetta (bubble car), single-cylinder, 10HP, top speed 55MPH
…16 Pistons Is a Lot!
Cadillac V-16: 175HP, 100 MPH!
By Ramgeis - photographed by Ramgeis in Pebble Beach, California, in August 2004, CC BY-SA 3.0
Can You Have “Too Many”?
YES! Experimental 42-cylinder 2,350 hp (plane) engine!
(Source: Wikimedia)
Benchmarking (2020): Partitions and Replication Factor Are the Culprits
[Chart: Kafka Partitions vs. Throughput; x-axis: Partitions (1, 10, 100, 1,000, 10,000); y-axis: TPS (0 to 900,000); series: Replication Factor 3 (TPS), Replication Factor 1 (TPS); cluster: 3 nodes x 4 cores = 12 cores total]
You need sufficient partitions to benefit from the cluster concurrency, but not so many that the replication overhead impacts overall throughput
2022 KRaft/ZooKeeper Modes vs. 2020: Results Better, 1000 Partitions Is OK
You need sufficient partitions to benefit from the cluster concurrency, but not so many that the replication overhead impacts overall throughput
[Chart: Partitions vs. Throughput (M TPS); x-axis: Partitions (1, 10, 100, 1,000, 10,000); y-axis: 0 to 2.5 M TPS; series: ZK TPS (M), KRAFT TPS (M), 2020 TPS (M); annotations: 2022 - Better, 2020 - Worse]
Fury 4: Single Threaded Consumers
What’s New? Kafka Parallel Consumer = A New Engine!
(Source: Adobe Stock)
"La Jamais Contente", first car to reach 100 km/h in 1899 – 68hp electric!
(Source: Wikimedia)
Rimac Nevera – Electric “Hypercar”: 4 Engines, 1,888 HP, 0-400KM/H in 29s
(Source: Wikimedia)
Theory: Little’s Law: Concurrency = Throughput x Time
Rearranged:
• Throughput = Concurrency/Time
• With the default consumer, Concurrency = Partitions = Consumers
• Using the default consumer, the throughput drops with increasing time
• The only solution is to increase partitions
For Given Target Throughput (1M TPS): Increasing Partitions With Increasing Time
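Little's Law as used on these slides can be worked through with a few lines of arithmetic. This sketch assumes the slide's simplification that concurrency equals the number of partitions (one single-threaded consumer per partition); the function names and the sample processing times of 1, 10 and 100 ms are chosen for illustration.

```python
import math


def throughput_tps(concurrency, time_ms):
    """Little's Law rearranged: Throughput = Concurrency / Time."""
    return concurrency * 1000 / time_ms


def partitions_needed(target_tps, time_ms):
    """Little's Law: Concurrency = Throughput x Time, rounded up to whole partitions."""
    return math.ceil(target_tps * time_ms / 1000)


TARGET_TPS = 1_000_000
table = {ms: partitions_needed(TARGET_TPS, ms) for ms in (1, 10, 100)}

for ms, partitions in table.items():
    print(f"{ms:>3} ms per message -> {partitions:,} partitions for 1M TPS")

# Going the other way: 10 partitions at 10 ms each sustain about 1,000 TPS.
print(throughput_tps(10, 10))
```

Under these assumptions, 1 ms of processing needs 1,000 partitions for 1M TPS, 10 ms needs 10,000, and 100 ms needs 100,000. The last two are well beyond the 1,000 partitions that the 2022 benchmark slide calls OK.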
Order in Kafka Is Partition Based, So How To Increase Consumer Concurrency?
(Source: Adobe Stock)
Kafka Parallel Consumers: Multi-Threaded Consumer
• Multiple ordering options (c.f. default Kafka only guarantees order within partitions!)
PARTITION -> KEY -> UNORDERED (increasing concurrency ->)
• Concurrency from 1 to lots; depends on client resources, and Partitions/Key space sizes
• KEY has higher concurrency than PARTITION and is ordered by KEY: a reasonable compromise
• UNORDERED is unordered
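One way to see why the three ordering options differ in concurrency is to count how many independent "lanes" of work each one allows for the same batch of records. This is a conceptual sketch only, not the parallel consumer library's code: the `ordering_lanes` helper, the record layout and the sample data are assumptions, and in practice the usable concurrency is also capped by client threads and resources.

```python
def ordering_lanes(records, mode):
    """Group records into lanes.

    Records inside one lane must be processed in order; different lanes
    may be processed concurrently.
    """
    lanes = {}
    for index, (partition, key, value) in enumerate(records):
        if mode == "PARTITION":
            lane = partition          # one lane per partition
        elif mode == "KEY":
            lane = (partition, key)   # one lane per key
        elif mode == "UNORDERED":
            lane = index              # every record is its own lane
        else:
            raise ValueError(f"unknown ordering mode: {mode}")
        lanes.setdefault(lane, []).append(value)
    return lanes


# Eight records, four keys, two partitions (key k lives in partition k % 2).
keys = [0, 1, 2, 3, 0, 1, 2, 3]
records = [(k % 2, k, f"m{i}") for i, k in enumerate(keys)]

max_concurrency = {
    mode: len(ordering_lanes(records, mode))
    for mode in ("PARTITION", "KEY", "UNORDERED")
}
print(max_concurrency)
```

For this batch the upper bound on concurrency grows from the number of partitions, to the number of distinct keys, to the number of records, which matches the PARTITION -> KEY -> UNORDERED progression on the slide.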
Kafka Parallel Consumers: Multi-Threaded Consumer = Buses
(Source: Getty Images)
Theoretical Improvement for Each Mode – max 3 orders of magnitude
1,000 partitions
100 consumers max
Experimental Results: 3, 50, and 200 times improvement, unordered best
1 consumer
10 partitions
100 keys
10ms latency
Watch Out for the Kafka Furies
Too many topics (= too many partitions)
Too many consumer groups
Slow consumers (= too many partitions)
Insufficient/too many partitions
Single threaded consumer (= too many partitions)
Speed can be achieved with train-buses
(Source: Wikimedia)
Minimize Topics and Partitions = Tracks
- Buses are fast & self-driving on tracks
Minimize Consumer Groups = Interchanges
- At interchanges, buses fan out onto roads, reducing passenger transfers
Maximize Consumer Concurrency = Buses
- Multiple passengers, integrated with road system
Adelaide’s O-Bahn Busway train-bus system
Track + Buses
Fury 5: Operational Problems
(Source: Adobe Stock)
Pit Stop Performance Penalties (Source: Getty Images)
Even well designed Kafka Applications occasionally have operational performance problems
Kafka Cluster CPU Utilization (0-100%)
Rapid increase in CPU Utilization from normal of 50% to lots
What’s going on?
Has the workload increased? No.
Has the cluster capacity decreased? (e.g. lost some brokers) No.
This was the only example of a Kafka performance problem in our Post-Incident Reviews
[Chart annotations: Normal, Abnormal]
Remediation
• Attempt 1: Replace 1 broker at a time
• Didn’t work – problem reappeared when new broker took over partition leadership
• (A broker restart is what triggered the problem)
• Attempt 2: Stop all customer clients (producers/consumers)
• Perform a rolling restart of Kafka cluster
• Restart clients, hold breath…
Apollo 3rd stage J-2 Engines (13,000,000 HP) were designed to restart – but failed to restart in uncrewed Apollo 6
(Source: NASA)
Back to normal
Diagnosis: Kafka is a distributed system
Kafka clients are also a critical part of the system
Kafka Network Processor Threads handle client network data
Network Processor Idle % has decreased (more is better, less is worse)
The client load had increased.
[Charts: Kafka Cluster CPU Utilization (%); Network Processor Idle (%); diagram label: Clients]
Why? Further Clues and Causes
• Kernel Error
• TCP: request_sock_TCP: Possible SYN Flooding …
• Decreased Network Processor Idle % was a symptom of repeated Kafka producer connection attempts
• TCP congestion control and window size dropped to very small and inconsistent values between producers and broker, making it impossible to reconnect producers to brokers after a broker restart
• Cause?
• Broker restart triggered the problem
• Permanent fix required Linux Kernel and Kafka version upgrades, and different settings for SYN cookie options
• Probably related to KAFKA issues 9648 and 764
And Back to the Start (Aston Martins)
Thank You and Goodbye (Eject)
“Ejector seat? You’re joking.”
My now “collectable” but played with Corgi DB5 (Source: Paul Brebner)
Aston Martin DB5
(Source: Wikimedia)
www.instaclustr.com
info@instaclustr.com
@instaclustr
Q & A
THANK YOU!
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Lionel Briand
 
Adobe After Effects Crack FREE FRESH version 2025
Adobe After Effects Crack FREE FRESH version 2025Adobe After Effects Crack FREE FRESH version 2025
Adobe After Effects Crack FREE FRESH version 2025
kashifyounis067
 
Top 10 Client Portal Software Solutions for 2025.docx
Top 10 Client Portal Software Solutions for 2025.docxTop 10 Client Portal Software Solutions for 2025.docx
Top 10 Client Portal Software Solutions for 2025.docx
Portli
 
Revolutionizing Residential Wi-Fi PPT.pptx
Revolutionizing Residential Wi-Fi PPT.pptxRevolutionizing Residential Wi-Fi PPT.pptx
Revolutionizing Residential Wi-Fi PPT.pptx
nidhisingh691197
 
Adobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage Dashboards
Adobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage DashboardsAdobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage Dashboards
Adobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage Dashboards
BradBedford3
 
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
steaveroggers
 
How can one start with crypto wallet development.pptx
How can one start with crypto wallet development.pptxHow can one start with crypto wallet development.pptx
How can one start with crypto wallet development.pptx
laravinson24
 
Ad

Fast Open Source Software - Without The Fury

  • 1. Welcome! To the 3rd Community Over Code Performance Engineering Track Track chairs: Paul Brebner, Roger Abelenda
  • 2. 1st New Orleans 2022 Technologies: Lucene, Ozone, Kafka, Cassandra, JMeter, and Spark/ML
  • 3. 2nd Beijing 2023 Technologies: Kafka, JMeter, Arrow, Java profiling, Spark & Flink, Hadoop
  • 4. 3rd Halifax 2023 Technologies: Kafka, Ozone, Cassandra, Camel, JMeter/Selenium, Lucene And maybe 4th EU 2024
  • 5. Talks (3 x 2 = 6): 11:20 Paul Brebner: Developing Fast Applications With Open Source Software - Without The Fury (Kafka); 12:10 Duong Nguyen, Tanvi Penumudy, Ritesh Shukla: Design patterns and then the road to realize billions of objects, and exabytes of capacity, while preserving performance in Apache Ozone; LUNCH; 2:20 German Eichberger, Pallavi Iyengar: Performance measurement and tuning of Cassandra 5.0 transactions on Cloud infrastructure; 3:10 Otavio Piske: Hunting Performance Monsters on the Back of a Camel; COFFEE; 4:10 Roger Abelenda: Quick load testing from Selenium scripts; 5:00 Stefan Vodita: Lessons Learned from Benchmarking Amazon’s E-commerce Search Engine
  • 6. © Instaclustr Pty Limited, 2023 Fast Open Source Software — Without The Fury Paul Brebner - Open Source Technology Evangelist www.instaclustr.com/paul-brebner/ Community Over Code Halifax October 2023
  • 7. Fast Cars?! “NetApp helps Aston Martin F1 use data to go faster” Massive amounts of data are captured and used to gain milliseconds of performance improvements on race days, using NetApp Cloud and Storage Technologies
  • 8. Fast Cars?! “Fast & Furious” “Fast & Furious” Cars for my Grandson (unopened as they are “collectables”)
  • 9. Fast Open Source Software (Source: wikimedia.org)
  • 10. Enabled by Scalable Big Data Technologies (e.g. Apache Kafka®) (Source: wikimedia.org)
  • 11. And Cloud Infrastructure (Source: wikimedia.org) Cars run on roads; S/W runs on H/W
  • 12. Without the Fury: Fury 1 Too Many Kafka Topics; Fury 2 Slow Consumers; Fury 3 Too Many Kafka Partitions; Fury 4 Single-Threaded Consumers (concurrency limited by partitions); Fury 5 Operational Problems (Source: Shutterstock)
  • 13. Instaclustr Managed Platform: Cloud Platform for Big Data Open Source Technologies. Focus of this talk is on Apache Kafka®
  • 14. What is Kafka? Kafka is a distributed stream-processing system: it allows distributed producers to send messages to distributed consumers via a Kafka cluster.
  • 15. Partitions Enable Concurrency: Cluster and Producers
  • 16. Partitions Enable Concurrency: Cluster and Consumers (and followers)
  • 17. Consumers share work within groups. Partitions enable consumers to share work (cf. Amish barn raising) within a consumer group. [Diagram: a producer writes to a topic with partitions 1 to n, read by the consumers of one consumer group]
  • 18. Multiple Groups Enable Message Broadcasting. Messages are duplicated (cf. clones) across groups, as each consumer group receives a copy of each message. [Diagram: a producer writes to a topic with partitions 1 to n, read by two consumer groups]
  • 19. Fury 1: Too Many Kafka Topics. “Kongo” Logistics IoT Application
  • 20. Design Choices: Many vs. One Topic? Option 1: Many (100s of) Topics • 100s of locations (warehouses, trucks) • Each location has a topic and multiple consumer groups (so all the goods in a location receive relevant events) • Many topics, and many consumer groups per topic -> high fan-out
  • 21. Option 2: One Topic, One Consumer Group • One topic for all locations • Using an external notification mechanism for event broadcasting (Guava Event Bus)
  • 22. Single Topic/Single Consumer Group Wins. [Bar chart, TPS: many topics with many consumer groups, 7,200; single topic with single consumer group, 1,120,000] 155 times better! But why?
  • 23. Trains Are More Scalable Than Cars. Train in the Canadian Rockies (Source: Getty Images)
  • 24. High Fan-Out = Lots of On/Off Ramps (Source: Shutterstock) Explanation: High fan-out = lots of output data and many consumer groups (resource intensive)
  • 25. Many Topics = Traffic Jam (Source: Shutterstock) Explanation: More topics -> more partitions
  • 26. But Kafka Is Scalable! Bigger Clusters. Vertical/Scale-Up: Increase Node Sizes. Add more lanes! (Source: Wikimedia)
  • 27. Horizontal/Scale-Out: Add More Nodes. Add more roads! (Source: Shutterstock)
  • 28. Fury 2: Slow Consumers. Kafka+Cassandra Anomaly Detector Application
  • 29. Massively Scalable Anomaly Detection: Tuning Knobs (Orange h/w, Yellow s/w) • Initially just increased h/w resources • Scaling was “easy” with Kubernetes • Easy to create lots of consumers (100s) • Initially single-threaded Kafka consumer (no thread pools)
  • 30. But Scalability Not Great: Scaling Is Too Easy, Scalability Is Harder. [Chart: Total Cores vs. Billions of Checks/Day (pre-tuning); x-axis Total Cores, 0 to 700; y-axis Billions checks/day, 0 to 8]
  • 31. Slow Kafka Consumers Problem. Default Kafka consumers: • Are single-threaded • If the processing is “slow” then queuing occurs, as the thread is blocked, reducing throughput • Solutions include speeding up the processing or increasing the number of consumers • But more consumers -> more partitions • As each consumer needs 1 or more partitions (Source: Getty Images)
  • 32. Single-Threaded Kafka Consumers and Slow Processing = Slow Consumers. SLOW: > 10 ms • Slow consumers -> need more consumers and also more partitions for higher throughput • But more consumers is slow, so try speeding them up
  • 33. We Need Some Car Mods (Hacks) (Source: Getty Images)
  • 34. Multi-Threaded/Two Pool Consumers. The famous Bondi Ocean Pool (in Sydney, Australia) has 2 pools (Source: Shutterstock)
  • 35. Tuning: Optimize Consumer Speed/Concurrency Using 2-Stage Pipeline. 1. Speed up polling (thread pool 1) 2. Maximize anomaly detector concurrency (thread pool 2). Fewer consumers (around 100) give higher throughput: a surprise! Result: reduces the number of consumers, and therefore the partitions needed, and gives higher throughput. Why? Don’t more partitions give higher throughput?! Answer in part 3.
  • 36. Scalability Post-Tuning: 7.5 to 19 Billion Checks/Day, 2.5 Times Improvement. [Chart: Total Cores vs. Billions of Checks/Day; x-axis Total Cores, 0 to 700; y-axis Billions checks/day, 0 to 20; series: pre-tuning, post-tuning]
  • 37. Fury 3: Too Many Partitions. What’s really going on under the Kafka Bonnet? (Source: Adobe Stock)
  • 38. Partitions = Pistons (Cylinders) (Source: Getty Images)
  • 39. But How Many? 1 piston isn’t very powerful: Isetta (bubble car), single-cylinder, 10 HP, top speed 55 MPH. Isetta 1-cylinder car (Source: Wikimedia)
  • 40. …16 Pistons Is a Lot! Cadillac V-16: 175 HP, 100 MPH! (By Ramgeis, photographed by Ramgeis in Pebble Beach, California, in August 2004, CC BY-SA 3.0)
  • 41. Can You Have “Too Many”? YES! Experimental 42-cylinder 2,350 hp (plane) engine! (Source: Wikimedia)
  • 42. Benchmarking (2020): Partitions and Replication Factor Are the Culprits. [Chart: Kafka Partitions vs. Throughput; x-axis Partitions, 1 to 10,000; y-axis TPS, 0 to 900,000; series: Replication Factor 3 (TPS), Replication Factor 1 (TPS); cluster: 3 nodes x 4 cores = 12 cores total] You need sufficient partitions to benefit from the cluster concurrency, but not so many that the replication overhead impacts overall throughput
  • 43. 2022 KRaft/ZooKeeper Modes vs. 2020: Results Better, 1000 Partitions Is OK. [Chart: Partitions vs. Throughput (M TPS); x-axis Partitions, 1 to 10,000; y-axis 0 to 2.5 M TPS; series: ZK TPS (M), KRaft TPS (M), 2020 TPS (M); annotations: 2022 - Better, 2020 - Worse]
  • 44. Fury 4: Single-Threaded Consumers. What’s New? Kafka Parallel Consumer = A New Engine! (Source: Adobe Stock) "La Jamais Contente", first car to reach 100 km/h in 1899, 68 hp electric! (Source: Wikimedia) Rimac Nevera, Electric “Hypercar”: 4 Engines, 1,888 HP, 0-400 km/h in 29 s (Source: Wikimedia)
  • 45. Theory: Little’s Law: Concurrency = Throughput x Time. Rearranged: • Throughput = Concurrency/Time • Concurrency is Partitions = Consumers • With the default consumer, throughput drops as processing time increases • Only solution is to increase partitions
  • 46. For Given Target Throughput (1M TPS): Increasing Partitions With Increasing Time
  • 47. Order in Kafka Is Partition Based. So How To Increase Consumer Concurrency? (Source: Adobe Stock)
  • 48. Kafka Parallel Consumers: Multi-Threaded Consumer • Multiple ordering options (cf. default Kafka, which only guarantees order within partitions!): PARTITION -> KEY -> UNORDERED, in order of increasing concurrency • Concurrency from 1 to lots, depending on client resources and partition/key space sizes • KEY has higher concurrency than PARTITION and is ordered by key: a reasonable compromise • UNORDERED is unordered
  • 49. Kafka Parallel Consumers: Multi-Threaded Consumer = Buses (Source: Getty Images)
  • 50. Theoretical Improvement for Each Mode: max 3 orders of magnitude. 1,000 partitions, 100 consumers max
  • 51. Experimental Results: 3, 50, and 200 times improvement, unordered best. 1 consumer, 10 partitions, 100 keys, 10 ms latency
  • 52. Watch Out for the Kafka Furies: Too many topics (= too many partitions); Too many consumer groups; Slow consumers (= too many partitions); Insufficient/too many partitions; Single-threaded consumer (= too many partitions)
  • 53. Speed can be achieved with train-buses. Adelaide’s O-Bahn Busway train-bus system: Track + Buses (Source: Wikimedia) • Minimize Topics and Partitions = Tracks: buses are fast and self-driving on tracks • Minimize Consumer Groups = Interchanges: at interchanges, buses fan out onto roads, reducing passenger transfers • Maximize Consumer Concurrency = Buses: multiple passengers, integrated with road system
  • 54. Fury 5: Operational Problems. Pit Stop Performance Penalties (Source: Adobe Stock) (Source: Getty Images)
  • 55. Even well-designed Kafka applications occasionally have operational performance problems. [Chart: Kafka Cluster CPU Utilization (0-100%), with Normal and Abnormal periods marked] Rapid increase in CPU utilization from the normal 50% to lots. What’s going on? Has the workload increased? No. Has the cluster capacity decreased (e.g. lost some brokers)? No. This was the only example of a Kafka performance problem in our Post-Incident Reviews
  • 56. Remediation • Attempt 1: Replace 1 broker at a time • Didn’t work: problem reappeared when new broker took over partition leadership • (A broker restart is what triggered the problem) • Attempt 2: Stop all customer clients (producers/consumers) • Perform a rolling restart of Kafka cluster • Restart clients, hold breath… Apollo 3rd stage J-2 Engines (13,000,000 HP) were designed to restart, but failed to restart in uncrewed Apollo 6 (Source: NASA)
  • 58. Diagnosis: Kafka is a distributed system. Kafka clients are also a critical part of the system. [Charts: Kafka Cluster CPU Utilization (%); Network Processor Idle (%)] Kafka Network Processor Threads handle client network data. Network Processor Idle % has decreased (more is better, less is worse). The client load had increased.
  • 59. Why? Further Clues and Causes • Kernel Error • TCP: request_sock_TCP: Possible SYN Flooding … • Decreased Network Processor Idle % was a symptom • Of repeated Kafka producer connection attempts • TCP congestion control and window size dropped to very small and inconsistent values between producers and broker • Making it impossible to reconnect producers to brokers after a broker restart • Cause? • Broker restart triggered the problem • Permanent fix required Linux Kernel and Kafka version upgrades • And different settings for SYN cookie options • Probably related to KAFKA issues 9648 and 764
  • 60. And Back to the Start (Aston Martins) Thank You and Goodbye (Eject) “Ejector seat? You’re joking.” My now “collectable” but played with Corgi DB5 (Source: Paul Brebner) Aston Martin DB5 (Source: Wikimedia)
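
Slides 45 and 46 apply Little's Law (Concurrency = Throughput x Time) to Kafka consumers. The arithmetic can be sketched as below. This is only an illustration: the function names are hypothetical, and the per-mode concurrency values (1 thread for the default consumer, 10 for PARTITION ordering, 100 for KEY ordering) are an assumption based on the slide 51 settings of 10 partitions, 100 keys, and 10 ms latency. The numbers are theoretical ceilings, not the measured results from the talk.

```python
import math


def concurrency_needed(target_tps: int, time_ms: int) -> int:
    """Little's Law: concurrency = throughput x time (time given in ms)."""
    return math.ceil(target_tps * time_ms / 1000)


def max_throughput_tps(concurrency: int, time_ms: int) -> float:
    """Little's Law rearranged: throughput = concurrency / time."""
    return concurrency * 1000 / time_ms


if __name__ == "__main__":
    # Slide 46: partitions (= consumers) needed for a 1M TPS target
    # as the per-message processing time grows.
    for time_ms in (1, 10, 100):
        print(f"{time_ms} ms -> {concurrency_needed(1_000_000, time_ms)} partitions")

    # Assumed concurrency per ordering mode (hypothetical, from slide 51 settings).
    for mode, concurrency in (("default", 1), ("PARTITION", 10), ("KEY", 100)):
        print(f"{mode}: up to {max_throughput_tps(concurrency, 10):.0f} TPS")
```

With a default single-threaded consumer, the only term that can grow is the number of partitions, which is why slow processing pushes partition counts up so quickly.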
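
The two-pool tuning on slide 35 (thread pool 1 for polling, thread pool 2 for the anomaly detector) can be sketched with the Python standard library alone. This is not the talk's implementation: `poll_batch` and `detect_anomaly` are hypothetical stand-ins for a Kafka poll and for the slow (about 10 ms) check, and the pool sizes are arbitrary. A real client would also need offset commits, and each polling thread would need its own consumer, since Kafka consumers are not thread-safe.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def poll_batch(batch_id: int, size: int = 5) -> list[str]:
    """Hypothetical stand-in for a Kafka consumer poll(); returns fake events."""
    return [f"event-{batch_id}-{i}" for i in range(size)]


def detect_anomaly(event: str) -> bool:
    """Hypothetical slow check (about 10 ms), standing in for the detector."""
    time.sleep(0.01)
    return event.endswith("-0")  # arbitrary rule, for illustration only


def run_pipeline(batches: int = 4, pollers: int = 2, workers: int = 8) -> list[bool]:
    with ThreadPoolExecutor(max_workers=pollers) as poll_pool, \
         ThreadPoolExecutor(max_workers=workers) as work_pool:
        # Stage 1: polling threads only fetch events; they never run the slow check.
        polled = poll_pool.map(poll_batch, range(batches))
        # Stage 2: every event is handed to the larger processing pool.
        futures = [work_pool.submit(detect_anomaly, event)
                   for batch in polled for event in batch]
        return [future.result() for future in futures]


if __name__ == "__main__":
    start = time.perf_counter()
    results = run_pipeline()
    elapsed = time.perf_counter() - start
    print(f"{len(results)} events checked, {sum(results)} flagged, {elapsed:.3f}s")
```

Because the slow step runs in its own pool, processing concurrency is set by the worker count and not by the number of consumers, which is the effect slide 35 describes: fewer consumers, and therefore fewer partitions, for the same or higher throughput.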