Scaling Open Source Big Data Cloud Applications is Easy/Hard
Paul Brebner
Instaclustr—Technology Evangelist
©Instaclustr Pty Limited, 2022
DeveloperWeek 10 May 2022
Who am I?
• Previously
• R&D in distributed systems and performance engineering.
• Last 5 years
• Technology Evangelist for Instaclustr (soon NetApp)
• 100+ Blogs, demo applications, talks
• Open Source technologies including
• Apache Cassandra, Spark, Kafka, ZooKeeper
• and Redis, OpenSearch, PostgreSQL, Kubernetes,
Prometheus, OpenTracing, etc
Cloud Platform for Big Data
Open Source Technologies
Latest addition is
Workflows with Uber’s
Instaclustr Managed Platform
This talk focuses on Cassandra and Kafka
Scaling is Easy! Cassandra and Kafka
Homogeneous distributed clusters → horizontally scalable
www.cassandra.apache.org/_/cassandra-basics.html
But actually lots of moving parts
(source: http://trumpetb.net/loco/rodsf.html)
Complications – DCs, Racks, Nodes, Partitions,
Replication Factor, Time (for auto-scaling)
Rows have a partition key and are stored in different partitions
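To make the partition-key idea concrete, here is a minimal sketch of rows being placed by hashing their partition key. It is a toy model under stated assumptions: the function names, the MD5 hash, and the modulo placement are illustrative choices, not Cassandra's actual partitioner or token ring.

```python
import hashlib
from collections import defaultdict


def partition_for(partition_key: str, num_partitions: int) -> int:
    """Map a partition key to a partition index using a stable hash.

    Assumption: MD5 plus modulo is a stand-in for illustration only.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def place_rows(rows, num_partitions):
    """Group (partition_key, value) rows by the partition they hash to."""
    placement = defaultdict(list)
    for key, value in rows:
        placement[partition_for(key, num_partitions)].append((key, value))
    return dict(placement)


# Hypothetical rows: (partition key, reading)
rows = [("sensor-1", 20.5), ("sensor-2", 19.0), ("sensor-1", 21.1), ("sensor-3", 18.4)]
placement = place_rows(rows, num_partitions=4)
for partition, stored in sorted(placement.items()):
    print(partition, stored)
```

Rows that share a partition key always land in the same partition, while different keys may be spread across partitions.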
Example 1 – Cassandra Auto-Scaling
Two Ways of Resizing Clusters
1 - Horizontal Scaling
• Add nodes, no interruption
• But scale up only (not down)
• Takes time, puts extra load on cluster as data streams to extra nodes
2 - Vertical Scaling
• Replace nodes with bigger (or smaller) node types (more/less cores)
• Scale up and down
• Takes time, temporary reduction in capacity
• Choice of how many nodes are replaced concurrently – by “node” (1 node at a time), by “rack” (all nodes in a rack), or in-between
Cluster resizing time – by node vs. by
rack – by rack is faster but …?
Cluster = 6 nodes, 3 racks, 2 nodes per rack
By node (concurrency 1)
By rack (concurrency 2)
Resizing by node – capacity reduced by 1/6 total
nodes each resize operation (simplified model)
Resizing by rack – capacity reduced by 2/6
nodes each resize operation
Comparison – resize by rack faster but has
bigger capacity hit during resize
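The trade-off can be worked through with a small sketch of the simplified model: each resize operation takes `concurrency` nodes out of service, so fewer operations means a bigger capacity hit. The function name is hypothetical, and the model ignores the extra capacity of nodes that have already been resized.

```python
def resize_plan(n_nodes: int, concurrency: int):
    """Return (number of resize operations, fraction of initial capacity
    available during each operation) under the simplified model."""
    if concurrency < 1 or n_nodes % concurrency != 0:
        raise ValueError("concurrency must evenly divide the node count")
    operations = n_nodes // concurrency
    capacity_during = (n_nodes - concurrency) / n_nodes
    return operations, capacity_during


# Cluster from the slides: 6 nodes, 3 racks, 2 nodes per rack
for label, concurrency in (("by node", 1), ("by rack", 2)):
    ops, cap = resize_plan(6, concurrency)
    print(f"{label}: {ops} operations, {cap:.0%} of initial capacity during each")
```

For the 6-node cluster this gives 6 operations at 83% capacity by node, and 3 operations at 67% capacity by rack.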
Observations
• If load exceeds the capacity available during a resize, latencies will increase
• Made worse by Cassandra load balancing which assumes equal sized
nodes
• By node, more nodes in the Cluster reduces the impact of reduced cluster
capacity during resizing (some clusters have 100s of nodes) – but will take
longer
• Many of our clusters have <= 6 nodes
Auto-scaling model - increasing load → linear regression over 1 hour extrapolated to future
We predict the cluster will reach 100% capacity around the 280 minute mark (220 minutes in the future)
[Chart legend: Measured, Extrapolated]
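A minimal sketch of this kind of extrapolation follows: fit a least-squares line to one hour of utilisation samples, then solve for the time at which the line reaches 100%. The sample data is synthetic, chosen so the result lines up with the 280 minute figure above; the function names are hypothetical and this is not the POC's actual code.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def time_at_capacity(slope, intercept, capacity=1.0):
    """Minute at which the fitted line reaches `capacity`, or None if load is not rising."""
    if slope <= 0:
        return None
    return (capacity - intercept) / slope


# Synthetic measurements (assumption): one sample per minute for 1 hour,
# utilisation rising from 30% by 0.25 percentage points per minute.
minutes = list(range(0, 61))
utilisation = [0.30 + 0.0025 * m for m in minutes]

slope, intercept = fit_line(minutes, utilisation)
t_full = time_at_capacity(slope, intercept)
print(f"100% capacity at about minute {t_full:.0f}, "
      f"{t_full - minutes[-1]:.0f} minutes in the future")
```

Real measurements are noisy, so in practice the prediction would be recomputed as new samples arrive.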
Resize by Rack vs. Node - initiated in time to
prevent overloading during resize operation
Resize by rack must be initiated sooner than resize by node, even though it is faster to resize, as it has less capacity during the resize (67% vs. 83% of initial capacity)
[Chart series: By Rack, By Node]
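The "initiate sooner" rule can be sketched by combining the extrapolated load line with the capacity available during each resize operation. Everything numeric here is an assumption for illustration: the load line (30% rising by 0.25 points per minute), 20 minutes per operation regardless of concurrency, and replacement nodes with twice the capacity. The function name is hypothetical.

```python
def latest_safe_start(slope, intercept, n_nodes, concurrency, op_minutes, size_factor):
    """Latest minute at which a resize can start so that the predicted load
    never exceeds the capacity available during any resize operation."""
    operations = n_nodes // concurrency
    latest = float("inf")
    for k in range(operations):
        resized = k * concurrency                     # nodes already replaced
        untouched = n_nodes - resized - concurrency   # old nodes still serving
        capacity = (resized * size_factor + untouched) / n_nodes
        t_limit = (capacity - intercept) / slope      # minute when load hits this capacity
        # Operation k finishes (k + 1) * op_minutes after the start.
        latest = min(latest, t_limit - (k + 1) * op_minutes)
    return latest


# Assumed inputs: 6-node cluster, load = 0.30 + 0.0025 * minute
by_node = latest_safe_start(0.0025, 0.30, n_nodes=6, concurrency=1,
                            op_minutes=20, size_factor=2.0)
by_rack = latest_safe_start(0.0025, 0.30, n_nodes=6, concurrency=2,
                            op_minutes=20, size_factor=2.0)
print(f"by node: start by minute {by_node:.0f}")
print(f"by rack: start by minute {by_rack:.0f}")
```

Under these assumptions the first operation is the binding constraint in both cases, and the by-rack resize has to start earlier even though it finishes in fewer operations.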
Auto-scaling POC – worked!
Monitoring API → Linear Regression + Rules → Provisioning API
Rules generalized to allow for
• scaling up and down
• resizing by any number of nodes concurrently, up to rack size
Example 2 – Anomaly Detection
JoAnn Morgan Apollo 11 Mission Control
Multiple technologies: Kafka,
Cassandra, Kubernetes
Massively Scalable Anomaly Detection
– Tuning knobs (Orange h/w, yellow s/w)
Scaling is (too) Easy!
Initially just increased h/w resources
But scalability not great
[Chart: Total Cores vs. Billions of checks/day (pre-tuning). X-axis: Total Cores (0 to 700); Y-axis: Billions of checks/day (0 to 8)]
Tuning required! Scalability Post-tuning
[Chart: Total Cores vs. Billions of checks/day. Series: pre-tuning and post-tuning. X-axis: Total Cores (0 to 700); Y-axis: Billions of checks/day (0 to 20)]
Tuning – Optimize s/w resources
(red arrows)
1. Minimize Kafka Consumers (thread pool 1)
2. Minimize Cassandra Connections
3. Maximize Cassandra client concurrency (thread
pool 2)
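The three tuning steps suggest a two-pool design: a small pool that only reads events, handing work to a larger pool that performs the database calls. The sketch below simulates that shape with the standard library only; the queue stands in for a Kafka topic and `check` stands in for a Cassandra read plus anomaly check. All names and pool sizes are assumptions, not the application's actual code.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def run_pipeline(events, n_consumers=2, n_workers=8):
    """Read events with a few consumer threads (pool 1) and process them
    concurrently on a larger worker pool (pool 2)."""
    source = queue.Queue()          # stand-in for a Kafka topic
    for event in events:
        source.put(event)

    results = []
    results_lock = threading.Lock()

    def check(event):
        # Stand-in for a Cassandra query plus anomaly check.
        with results_lock:
            results.append(event * 2)

    with ThreadPoolExecutor(max_workers=n_workers) as workers:
        def consume():
            # Consumers do as little as possible: read and hand off.
            while True:
                try:
                    event = source.get_nowait()
                except queue.Empty:
                    return
                workers.submit(check, event)

        consumers = [threading.Thread(target=consume) for _ in range(n_consumers)]
        for consumer in consumers:
            consumer.start()
        for consumer in consumers:
            consumer.join()
    # Leaving the with-block waits for every submitted check to finish.
    return sorted(results)


processed = run_pipeline(range(100))
print(len(processed), "events processed")
```

Keeping the consumer pool small limits the number of Kafka consumers, while the worker pool size controls client-side concurrency independently.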
Example 3 – What’s really going on -
behind the Kafka partitions?
Kafka topic partitions enable
consumer concurrency
partitions >= consumers
[Diagram: a Producer writes to Topic “Parties” (Partition 1, Partition 2, … Partition n); the Consumers in a Consumer Group share the work within the group]
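Why partitions need to be at least as numerous as consumers can be seen in a toy assignment. This sketch hands out partitions round-robin within one consumer group; it is a simplification for illustration, not Kafka's actual assignor logic, and the names are hypothetical.

```python
def assign_partitions(num_partitions, consumers):
    """Assign each partition to exactly one consumer in the group, round-robin."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# More partitions than consumers: every consumer gets work.
print(assign_partitions(3, ["c0", "c1"]))
# Fewer partitions than consumers: the extra consumer sits idle.
print(assign_partitions(2, ["c0", "c1", "c2"]))
```

Because a partition is read by only one consumer in a group, any consumers beyond the partition count receive nothing.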
High consumer/partition fan out
Can be caused by:
1 Design – many topics and/or many consumers
2 Slow consumers → need more consumers to increase throughput
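The link between slow consumers and fan out is simple arithmetic, sketched below. It assumes each consumer handles one event at a time, so the number of consumers needed is the target rate multiplied by the per-event processing time. The function name and example figures are hypothetical.

```python
def consumers_needed(target_events_per_sec: int, processing_ms_per_event: int) -> int:
    """Minimum consumers to sustain the target rate when each consumer
    processes one event at a time."""
    busy_ms_per_sec = target_events_per_sec * processing_ms_per_event
    # Integer ceiling division: each consumer supplies 1000 ms of work per second.
    return max(1, -(-busy_ms_per_sec // 1000))


# Hypothetical target of 10,000 events/s
for ms in (1, 10, 100):
    n = consumers_needed(10_000, ms)
    print(f"{ms} ms per event: {n} consumers, so at least {n} partitions")
```

A tenfold increase in processing time means ten times as many consumers, and therefore at least ten times as many partitions.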
Kafka write architecture – partition
replication
Benchmarking revealed that partitions
and replication factor are the culprit
[Chart: Kafka Partitions vs. Throughput. X-axis: Partitions (1 to 10000, log scale); Y-axis: TPS (0 to 900000). Series: Replication Factor 3 (TPS) and Replication Factor 1 (TPS). Cluster: 3 nodes x 4 cores = 12 cores total]
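One way to see why partitions and replication factor interact is to count replicas. Each partition has one leader plus (replication factor - 1) follower copies that must be kept in sync, so the replica count per broker grows with both knobs. The sketch below only does this counting for the 3-node benchmark cluster; it does not model or reproduce the measured throughput, and the function name is hypothetical.

```python
def replica_load(partitions: int, replication_factor: int, brokers: int):
    """Count the partition replicas a cluster has to host and keep in sync."""
    total_replicas = partitions * replication_factor
    follower_replicas = partitions * (replication_factor - 1)
    return {
        "total_replicas": total_replicas,
        "follower_replicas": follower_replicas,
        "replicas_per_broker": total_replicas / brokers,
    }


# Benchmark cluster from the slides: 3 brokers
for partitions in (1, 10, 100, 1000, 10000):
    for rf in (1, 3):
        load = replica_load(partitions, rf, brokers=3)
        print(f"partitions={partitions:>5} RF={rf}: "
              f"{load['follower_replicas']} follower replicas, "
              f"{load['replicas_per_broker']:.1f} replicas per broker")
```

With replication factor 1 there are no follower replicas at all, which is consistent with that configuration being the cheaper of the two series on the chart.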
Implications?
• Bigger Cluster (more nodes, bigger nodes)
• Design to minimize topics and consumers
• Optimize consumers for minimum time
• Always benchmark with many partitions
• Blame Apache ZooKeeper?
• Responsible for Kafka control
• From version 3.0 it’s being replaced by native KRaft protocol
• Not yet production ready
• May enable more partitions (but may not impact throughput)
Scaling is Mostly Easy!
• Using Scalable Open Source Big Data Technologies
• Hosted by suitable Cloud providers
• With suitable monitoring, understanding of autoscaling and how different software “knobs” interact, and by scaling incrementally
© Instaclustr Pty Limited, 2022
www.instaclustr.com
info@instaclustr.com
@instaclustr
THANK
YOU!
For further Information see blogs www.instaclustr.com/paul-brebner/
 
Quantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur MorganQuantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur Morgan
Arthur Morgan
 
Cyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of securityCyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of security
riccardosl1
 
Build Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For DevsBuild Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For Devs
Brian McKeiver
 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
 
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager APIUiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPathCommunity
 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdfSAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
Precisely
 
HCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser EnvironmentsHCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser Environments
panagenda
 
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
 
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul
 
Ad

OPEN Talk: Scaling Open Source Big Data Cloud Applications is Easy/Hard

  • 1. Scaling Open Source Big Data Cloud Applications is Easy/Hard Paul Brebner Instaclustr—Technology Evangelist ©Instaclustr Pty Limited, 2022 DeveloperWeek 10 May 2022
  • 2. Who am I? • Previously • R&D in distributed systems and performance engineering. • Last 5 years • Technology Evangelist for Instaclustr (soon NetApp) • 100+ Blogs, demo applications, talks • Open Source technologies including • Apache Cassandra, Spark, Kafka, Zookeeper • and Redis, OpenSearch, PostgreSQL, Kubernetes, Prometheus, OpenTracing, etc
  • 3. Cloud Platform for Big Data Open Source Technologies Latest addition is Workflows with Uber’s Instaclustr Managed Platform ©Instaclustr Pty Limited, 2021
  • 4. Cloud Platform for Big Data Open Source Technologies Latest addition is Workflows with Uber’s This talk focuses on Cassandra and Kafka
  • 5. Scaling is Easy! Cassandra and Kafka Homogeneous distributed clusters → horizontally scalable www.cassandra.apache.org/_/cassandra-basics.html
  • 6. But actually lots of moving parts (source: http://trumpetb.net/loco/rodsf.html)
  • 7. Complications – DCs, Racks, Nodes, Partitions, Replication Factor, Time (for auto-scaling) Rows have a partition key and are stored in different partitions
  • 8. Example 1 – Cassandra Auto-Scaling
  • 9. Two Ways of Resizing Clusters 1 - Horizontal Scaling • Add nodes, no interruption • But scale up only (not down) • Takes time, puts extra load on cluster as data streams to extra nodes 2 - Vertical Scaling • Replace nodes with bigger (or smaller) node types (more/less cores) • Scale up and down • Takes time, temporary reduction in capacity • Choice of how many nodes are replaced concurrently – by “node” (1 node at a time) or by “rack” (all nodes in a rack), or in-between
  • 10. Cluster resizing time – by node vs. by rack – by rack is faster but …? Cluster = 6 nodes, 3 racks, 2 nodes per rack By node (concurrency 1) By rack (concurrency 2)
  • 11. Resizing by node – capacity reduced by 1/6 total nodes each resize operation (simplified model)
  • 12. Resizing by rack – capacity reduced by 2/6 nodes each resize operation
  • 13. Comparison – resize by rack faster but has bigger capacity hit during resize
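The trade-off on slides 10 to 13 can be worked through with a few lines of Python. This is a minimal sketch of the simplified model only: it assumes every node contributes equal capacity and that the nodes being replaced contribute nothing while they are replaced. The function names are illustrative and are not part of any Instaclustr API.

```python
import math


def capacity_during_resize(total_nodes: int, concurrency: int) -> float:
    """Fraction of cluster capacity left while `concurrency` nodes are replaced."""
    if not 1 <= concurrency <= total_nodes:
        raise ValueError("concurrency must be between 1 and total_nodes")
    return (total_nodes - concurrency) / total_nodes


def resize_operations(total_nodes: int, concurrency: int) -> int:
    """Number of sequential replacement rounds needed to resize every node."""
    if not 1 <= concurrency <= total_nodes:
        raise ValueError("concurrency must be between 1 and total_nodes")
    return math.ceil(total_nodes / concurrency)


# The cluster from slide 10: 6 nodes, 3 racks, 2 nodes per rack.
for label, concurrency in (("by node", 1), ("by rack", 2)):
    remaining = capacity_during_resize(6, concurrency)
    rounds = resize_operations(6, concurrency)
    print(f"{label}: {remaining:.0%} capacity during resize, {rounds} rounds")
```

For the 6-node cluster this gives 6 rounds at 5/6 capacity when resizing by node, and 3 rounds at 4/6 capacity when resizing by rack.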
  • 14. Observations • If the capacity during resize is exceeded latencies will increase • Made worse by Cassandra load balancing which assumes equal sized nodes • By node, more nodes in the Cluster reduces the impact of reduced cluster capacity during resizing (some clusters have 100s of nodes) – but will take longer • Many of our clusters have <= 6 nodes
  • 15. Auto-scaling model - increasing load → linear regression over 1 hour extrapolated to future. We predict the cluster will reach 100% capacity around the 280 minute mark (220 minutes in the future) (chart series: Extrapolated, Measured)
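The extrapolation on slide 15 can be sketched as an ordinary least-squares fit over the last hour of measurements. The sample data below is synthetic, chosen only so that the fit reproduces the numbers on the slide; the function names are hypothetical and the talk does not say how the regression was actually implemented.

```python
from typing import Optional, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) samples of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def minutes_until(threshold_pct: float,
                  minutes: Sequence[float],
                  load_pct: Sequence[float]) -> Optional[float]:
    """Minutes after the last sample until the fitted line reaches threshold_pct.

    Returns None when the load is flat or falling (no crossing is predicted).
    """
    slope, intercept = fit_line(minutes, load_pct)
    if slope <= 0:
        return None
    crossing = (threshold_pct - intercept) / slope
    return max(0.0, crossing - minutes[-1])


# Synthetic measurements: one sample every 10 minutes for 1 hour.
minutes = [0, 10, 20, 30, 40, 50, 60]
load_pct = [30.0, 32.5, 35.0, 37.5, 40.0, 42.5, 45.0]
print(minutes_until(100.0, minutes, load_pct))
```

With this synthetic trend the fitted line reaches 100% at the 280 minute mark, which is 220 minutes after the last sample.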
  • 16. Resize by Rack vs. Node - initiated in time to prevent overloading during resize operation. Resize by rack must be initiated sooner than resize by node, even though it’s faster to resize, as it has less capacity during resize (67% compared with 83% of initial capacity) (chart panels: By Rack, By Node)
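One way to see why the rack resize has to start earlier is to compute the latest safe start time under the same linear load model. The sketch below assumes the load must stay below the reduced capacity for the whole resize. The trend (30% load rising 0.25% per minute) and the resize durations (60 minutes by node, 30 minutes by rack) are made-up values for illustration and do not come from the talk.

```python
def latest_start_minute(slope: float,
                        intercept: float,
                        capacity_during_resize_pct: float,
                        resize_minutes: float) -> float:
    """Latest minute at which a resize can start without the extrapolated load
    exceeding the reduced capacity before the resize finishes."""
    if slope <= 0:
        raise ValueError("load is not increasing, no deadline applies")
    # Minute at which the fitted load line reaches the reduced capacity.
    crossing = (capacity_during_resize_pct - intercept) / slope
    return crossing - resize_minutes


# Hypothetical trend: load_pct = 30 + 0.25 * minute.
SLOPE, INTERCEPT = 0.25, 30.0

# 6-node cluster: 5/6 capacity while resizing by node, 4/6 by rack.
by_node = latest_start_minute(SLOPE, INTERCEPT, 100 * 5 / 6, resize_minutes=60)
by_rack = latest_start_minute(SLOPE, INTERCEPT, 100 * 4 / 6, resize_minutes=30)
print(f"by node: start by minute {by_node:.0f}")
print(f"by rack: start by minute {by_rack:.0f}")
```

Under these assumptions the rack resize has the earlier deadline even though it takes half as long, because the load reaches 67% of capacity well before it reaches 83%. A different trend or different durations could change the size of the gap.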
  • 17. Auto-scaling POC – worked! Components: Monitoring API, Linear Regression + Rules, Provisioning API. Rules generalized to allow for • scaling up and down • resizing by any number of nodes concurrently, up to rack size
  • 18. Example 2 – Anomaly Detection (photo: JoAnn Morgan, Apollo 11 Mission Control)
  • 20. Massively Scalable Anomaly Detection – Tuning knobs (Orange h/w, yellow s/w) Scaling is (too) Easy! Initially just increased h/w resources
  • 21. But scalability not great (chart: Total Cores vs. Billions of checks/day, pre-tuning; axes: Total Cores, 0 to 700, and Billions checks/day, 0 to 8)
  • 22. Tuning required! Scalability Post-tuning (chart: Total Cores vs. Billions of checks/day; series: pre-tuning, post-tuning; axes: Total Cores, 0 to 700, and Billions checks/day, 0 to 20)
  • 23. Tuning – Optimize s/w resources (red arrows) 1. Minimize Kafka Consumers (thread pool 1) 2. Minimize Cassandra Connections 3. Maximize Cassandra client concurrency (thread pool 2)
  • 24. Example 3 – What’s really going on - behind the Kafka partitions?
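The two-thread-pool idea on slide 23 can be illustrated with a small, self-contained sketch: a deliberately small pool reads events, and a larger pool does the per-event database work. Everything here is a stand-in. The in-memory queue replaces Kafka, `check_event` replaces the Cassandra query and anomaly check, and the pool sizes and threshold are arbitrary. It is not the code used in the talk's application.

```python
import queue
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Iterable, List, Tuple


def check_event(event: int) -> Tuple[int, bool]:
    """Stand-in for the database read plus anomaly check for one event."""
    return event, event > 90  # hypothetical rule: values above 90 are anomalies


def run_pipeline(events: Iterable[int],
                 consumer_threads: int = 1,
                 db_threads: int = 8) -> List[Tuple[int, bool]]:
    """Few consumer threads (pool 1) hand work to many database threads (pool 2)."""
    inbox: "queue.Queue[int]" = queue.Queue()
    for event in events:
        inbox.put(event)

    def consume(db_pool: ThreadPoolExecutor) -> List[Future]:
        # Each consumer only dequeues and dispatches; it never blocks on the check.
        pending: List[Future] = []
        while True:
            try:
                event = inbox.get_nowait()
            except queue.Empty:
                return pending
            pending.append(db_pool.submit(check_event, event))

    results: List[Tuple[int, bool]] = []
    with ThreadPoolExecutor(max_workers=db_threads) as db_pool:
        with ThreadPoolExecutor(max_workers=consumer_threads) as consumers:
            dispatched = [consumers.submit(consume, db_pool)
                          for _ in range(consumer_threads)]
            for consumer_future in dispatched:
                for check_future in consumer_future.result():
                    results.append(check_future.result())
    return results


results = run_pipeline(range(100))
print(sum(1 for _, is_anomaly in results if is_anomaly), "anomalies flagged")
```

Keeping the consumer pool small while the database pool is larger mirrors the slide's advice to minimize Kafka consumers and maximize Cassandra client concurrency. Suitable pool sizes would have to be found by benchmarking.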
  • 25. Kafka topic partitions enable consumer concurrency: partitions >= consumers. Consumers share work within groups. (diagram: Producer; Topic “Parties” with Partition 1, Partition 2, Partition n; Consumer Group with three Consumers)
  • 26. High consumer/partition fan out. Can be caused by: 1 Design – many topics and/or many consumers 2 Slow consumers → need more consumers to increase throughput
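The "partitions >= consumers" rule on slide 25 follows from the fact that, within a consumer group, each partition is read by at most one consumer. The sketch below uses a simple round-robin assignment to show the effect. It is a simplified illustration, not Kafka's actual assignor, and the consumer names are made up.

```python
from typing import Dict, List, Sequence


def assign_partitions(num_partitions: int,
                      consumers: Sequence[str]) -> Dict[str, List[int]]:
    """Give each partition to exactly one consumer in the group, round-robin."""
    if not consumers:
        raise ValueError("a consumer group needs at least one consumer")
    assignment: Dict[str, List[int]] = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        assignment[consumers[partition % len(consumers)]].append(partition)
    return assignment


# 3 partitions shared by a group of 5 consumers: two consumers get no work.
group = ["c0", "c1", "c2", "c3", "c4"]
assignment = assign_partitions(3, group)
idle = [consumer for consumer, parts in assignment.items() if not parts]
print(assignment)
print("idle consumers:", idle)
```

Any consumers beyond the partition count sit idle, so raising consumer concurrency means raising the partition count as well.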
  • 27. Kafka write architecture – partition replication
  • 28. Benchmarking revealed that partitions and replication factor are the culprit (chart: Kafka Partitions vs. Throughput; cluster: 3 nodes x 4 cores = 12 cores total; series: Replication Factor 3 (TPS), Replication Factor 1 (TPS); axes: Partitions, 1 to 10000, and TPS, 0 to 900000)
  • 29. Implications? • Bigger Cluster (more nodes, bigger nodes) • Design to minimize topics and consumers • Optimize consumers for minimum time • Always benchmark with many partitions • Blame the Apache Zookeeper? • Responsible for Kafka control • From version 3.0 it’s being replaced by native KRaft protocol • Not yet production ready • May enable more partitions (but may not impact throughput)
  • 30. Scaling is Mostly Easy! • Using Scalable Open Source Big Data Technologies • Hosted by suitable Cloud providers • With suitable monitoring, understanding of autoscaling and how different software “knobs” interact, and by scaling incrementally