Apache Kafka
Availability of Kafka - Beyond the Brokers
Andrew Borley
Emma Humber
Kafka Summit Europe 2021
High availability
Kafka has guarantees around the number of server failures a cluster can tolerate
What if the environment becomes unavailable?
Many layers: Applications, Kafka, OS, Network, Storage
Design availability into the system
Constraints
What guarantees do you need?
• Consistency, availability, performance, manual intervention
What resources do you have?
• Datacenters, network, cost
Consistency OR availability
Stretch clusters
Availability zones
Set of isolated infrastructure
• Compute, storage and network connectivity + associated power and cooling
• Limits the blast radius of an infrastructure problem
A datacenter, or a failure domain within a datacenter
Geographic regions (e.g. Central Europe) often support multiple availability zones
[Diagram: a stretch cluster across three availability zones, one datacenter per zone. Each zone runs one Kafka broker and one ZooKeeper server. Topic A has three partitions, each with a replica on every broker.]
Availability zone 1: Kafka broker 1 (rack.id AZ1), ZooKeeper server 1. Topic A: partition 1 Leader, partition 2 Follower, partition 3 Follower
Availability zone 2: Kafka broker 2 (rack.id AZ2), ZooKeeper server 2. Topic A: partition 1 Follower, partition 2 Leader, partition 3 Follower
Availability zone 3: Kafka broker 3 (rack.id AZ3), ZooKeeper server 3. Topic A: partition 1 Follower, partition 2 Follower, partition 3 Leader
Configuration
Kubernetes nodes are assigned a zone, labelled with topology.kubernetes.io/zone
Set Kafka's broker.rack to the node's topology.kubernetes.io/zone value
How this is configured depends on the installation technology, e.g. rackAssignment or topologyKey
Low latency is a MUST
Look at timeouts and client configuration, for example:
• zookeeper.connection.timeout.ms
• replica.lag.time.max.ms
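As a rough illustration of the mapping above, the sketch below derives broker settings from a node's labels. The helper names (`broker_overrides`, `to_properties`) are hypothetical and the timeout values are placeholders rather than recommendations. Installers normally do this mapping for you.

```python
# Minimal sketch, assuming the node labels are available as a plain dict.
# broker_overrides and to_properties are hypothetical helpers, not part of Kafka.
ZONE_LABEL = "topology.kubernetes.io/zone"


def broker_overrides(node_labels, zk_timeout_ms=18000, replica_lag_ms=30000):
    """Build broker settings, using the node's zone label as broker.rack."""
    zone = node_labels.get(ZONE_LABEL)
    if zone is None:
        raise ValueError(f"node has no {ZONE_LABEL} label")
    return {
        "broker.rack": zone,
        # Placeholder values: tune them for the latency between your zones.
        "zookeeper.connection.timeout.ms": str(zk_timeout_ms),
        "replica.lag.time.max.ms": str(replica_lag_ms),
    }


def to_properties(settings):
    """Render the settings as key=value lines, sorted by key."""
    return "\n".join(f"{key}={value}" for key, value in sorted(settings.items()))


if __name__ == "__main__":
    print(to_properties(broker_overrides({ZONE_LABEL: "AZ1"})))
```

Failing loudly when the label is missing is a deliberate choice here: a broker with no rack would silently weaken rack-aware replica placement.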
[Diagram: three availability zones, one datacenter per zone, two Kafka brokers per zone. Every broker holds a replica of each of the three Topic A partitions.]
Availability zone 1: Kafka broker 1 (rack.id AZ1): partition 1 Leader, partitions 2 and 3 Follower. Kafka broker 2 (rack.id AZ1): partition 3 Leader, partitions 1 and 2 Follower
Availability zone 2: Kafka broker 3 (rack.id AZ2): partition 2 Leader, partitions 1 and 3 Follower. Kafka broker 4 (rack.id AZ2): partitions 1, 2 and 3 Follower
Availability zone 3: Kafka broker 5 (rack.id AZ3): partitions 1, 2 and 3 Follower. Kafka broker 6 (rack.id AZ3): partitions 1, 2 and 3 Follower
Replication factor > min.insync.replicas
Ensure there are sufficient replicas to cover all zones
client.rack identifies the zone (rack) in which the client application is running
• Allows consumers to fetch from the closest replica
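The replica rules on this slide can be checked with a little arithmetic. The sketch below is only an illustration: `zone_report` and `covers_all_zones` are hypothetical helpers, and "survives" is taken to mean that at least min.insync.replicas replicas remain after the most heavily populated zone is lost.

```python
from collections import Counter


def covers_all_zones(replica_zones, all_zones):
    """True if every zone hosts at least one replica of the partition."""
    return set(all_zones) <= set(replica_zones)


def zone_report(replica_zones, min_insync_replicas):
    """replica_zones lists the zone of each replica of a single partition."""
    per_zone = Counter(replica_zones)
    # Worst case: the zone holding the most replicas becomes unavailable.
    remaining = len(replica_zones) - max(per_zone.values())
    return {
        "replication_factor": len(replica_zones),
        "zones_covered": sorted(per_zone),
        "survives_zone_loss": remaining >= min_insync_replicas,
    }


if __name__ == "__main__":
    # Six replicas, two per zone, as in the six-broker diagram.
    layout = ["AZ1", "AZ1", "AZ2", "AZ2", "AZ3", "AZ3"]
    print(covers_all_zones(layout, ["AZ1", "AZ2", "AZ3"]))
    print(zone_report(layout, min_insync_replicas=4))
```

With two replicas per zone, a min.insync.replicas of 4 still leaves enough replicas after one zone fails, whereas 5 would not.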
ZooKeeper
[Diagram: ZooKeeper placement when only two availability zones are available, one datacenter per zone, two Kafka brokers per zone. Every broker holds a replica of each of the three Topic A partitions.]
Availability zone 1: ZooKeeper servers 1 and 2. Kafka broker 1 (rack.id AZ1): partition 1 Leader, partitions 2 and 3 Follower. Kafka broker 2 (rack.id AZ1): partition 3 Leader, partitions 1 and 2 Follower
Availability zone 2: ZooKeeper server 3. Kafka broker 3 (rack.id AZ2): partition 2 Leader, partitions 1 and 3 Follower. Kafka broker 4 (rack.id AZ2): partitions 1, 2 and 3 Follower
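A ZooKeeper ensemble needs a strict majority of its servers to keep serving, which is why the two-zone layout is fragile. The sketch below (with a hypothetical `has_quorum` helper) compares the two-zone placement from the diagram with a one-server-per-zone placement across three zones.

```python
def has_quorum(servers_per_zone, failed_zone):
    """True if a strict majority of ZooKeeper servers survives losing failed_zone."""
    total = sum(servers_per_zone.values())
    remaining = total - servers_per_zone.get(failed_zone, 0)
    return remaining > total // 2


two_zones = {"AZ1": 2, "AZ2": 1}
three_zones = {"AZ1": 1, "AZ2": 1, "AZ3": 1}

if __name__ == "__main__":
    for name, layout in (("two zones", two_zones), ("three zones", three_zones)):
        for zone in layout:
            print(name, "lose", zone, "-> quorum kept:", has_quorum(layout, zone))
```

With two zones, losing the zone that hosts two of the three servers leaves a single server and no quorum. With one server in each of three zones, any single zone can be lost.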
Stretch clusters
Kafka and ZooKeeper replication ensures data is highly available
Data consistency
Exactly-once processing is possible
Event order can be preserved
Guards against the loss of a data center
Simple client configuration
No offset lag or offset translation
Utilizes all brokers
Considerations
A stable, low-latency, high-bandwidth connection is a must: take care if crossing regions
No zero-downtime upgrades
Doesn't protect against whole-cluster failure
Cross-availability-zone data transfer fees will apply, especially ingress/egress
A third datacenter is required
Complexity of configuration
Multiple clusters
Multiple independent Kafka clusters in different regions. Topic data is mirrored across clusters
• Active/passive: Produce to primary cluster, consume from any
• Active/active: Produce to and consume from any cluster
• Federated: Central cluster with multiple regional clusters
A mirroring technology copies the data between clusters
• Typically runs at the target cluster
• Data is consumed remotely and produced locally
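[Diagram: active/passive clusters. In the PRIMARY data center, a PRODUCER writes to topic TOPIC1 and a CONSUMER reads from it. MIRROR-MAKER 2 mirrors the topic to the SECONDARY data center, where it appears as PRIMARY.TOPIC1 and is read by a second CONSUMER.]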
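The mirrored topic in the diagram is named by prefixing the source cluster's alias to the original topic name, which matches the default behaviour of MirrorMaker 2. The sketch below shows that naming scheme and its inverse. Both function names are hypothetical, and the code is an illustration rather than MirrorMaker 2's implementation.

```python
def remote_topic(source_alias, topic, separator="."):
    """Name a mirrored topic takes on the target cluster."""
    return f"{source_alias}{separator}{topic}"


def topic_source(topic, known_aliases, separator="."):
    """Split a topic name into (source alias, original topic).

    Returns (None, topic) for a local topic, i.e. one whose first segment
    is not a known cluster alias.
    """
    prefix, sep, rest = topic.partition(separator)
    if sep and prefix in known_aliases:
        return prefix, rest
    return None, topic


if __name__ == "__main__":
    print(remote_topic("PRIMARY", "TOPIC1"))
    print(topic_source("PRIMARY.TOPIC1", {"PRIMARY"}))
    print(topic_source("TOPIC1", {"PRIMARY"}))
```

Checking the prefix against a set of known aliases avoids misreading an ordinary topic name that happens to contain a dot.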
Active/passive clusters
Data back-up to multiple destinations
Disaster recovery fail-over after loss of infrastructure
Data migration:
• Moving to a new cluster
• Moving from a staging environment to a production environment
Clusters are independent
Message keys are used for partitioning, so order is preserved on a per-key basis
The source cluster must be the 'owner' of the data. The target cluster is essentially read-only
Considerations
The target cluster will lag behind the source cluster: mirroring is asynchronous, so monitor that it is not too far behind
• High-traffic topics could mean thousands of un-mirrored messages, even if there are only a few milliseconds of lag
Message offsets in the source and target topics may not match up
• On fail-over, consumers need to be updated to start at the correct offset
• Returning to the source cluster will require a similar process
Configuration changes on the source may need to be propagated to the target
The target cluster may be unused (until it's needed)
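To put numbers on the lag point, here is a minimal sketch. The function names and the example rates and offsets are assumptions made for illustration, not measurements from the talk.

```python
def unmirrored_messages(messages_per_second, lag_ms):
    """Approximate number of messages not yet mirrored for a given time lag."""
    return messages_per_second * lag_ms // 1000


def mirror_backlog(source_end_offsets, last_mirrored_offsets):
    """Per-partition count of messages still waiting to be mirrored.

    Both arguments map partition number to an offset on the SOURCE cluster.
    """
    return {
        partition: end - last_mirrored_offsets.get(partition, 0)
        for partition, end in source_end_offsets.items()
    }


if __name__ == "__main__":
    # A hypothetical topic receiving 500,000 messages/s with 5 ms of mirroring lag.
    print(unmirrored_messages(500_000, 5))
    print(mirror_backlog({0: 1200, 1: 900}, {0: 1150, 1: 900}))
```

At 500,000 messages per second, 5 ms of lag already corresponds to 2,500 messages that would be missing from the target after a sudden fail-over.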
[Diagram: active/active clusters. The LONDON and BOSTON data centers each have a PRODUCER, a CONSUMER and a local topic TOPIC1. MIRROR-MAKER 2 runs in both directions: London's TOPIC1 appears in Boston as LONDON.TOPIC1, and Boston's TOPIC1 appears in London as BOSTON.TOPIC1.]
Active/active clusters
Implicit disaster recovery: data centers can operate independently
Data back-up to multiple destinations
A 'virtual' topic shared across geographies
Serve users from a nearby data center to increase performance
Redundancy and resilience: on failure, a network redirect can route all traffic to the surviving cluster
Considerations
Possible to accidentally configure loops of data
Lag and ordering issues may arise when consuming from a data center other than the one where the data was produced
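One way to avoid mirroring loops, assuming mirrored topics carry the source cluster's alias as a prefix (as in the LONDON.TOPIC1 and BOSTON.TOPIC1 names), is to skip any topic that has already passed through the target cluster. The sketch below shows the idea with a hypothetical `should_mirror` helper. It is not MirrorMaker 2's own code.

```python
def should_mirror(topic, target_alias, separator="."):
    """Return False if the topic already carries the target cluster's alias.

    Every segment before the final one is treated as a cluster alias the
    data has passed through, so this assumes local topic names do not
    themselves start with a cluster alias.
    """
    hops = topic.split(separator)[:-1]
    return target_alias not in hops


if __name__ == "__main__":
    # London -> Boston: the local topic is mirrored and becomes LONDON.TOPIC1.
    print(should_mirror("TOPIC1", "BOSTON"))
    # Boston -> London: LONDON.TOPIC1 originated in London, so it is skipped.
    print(should_mirror("LONDON.TOPIC1", "LONDON"))
```

Without a check like this, LONDON.TOPIC1 would be copied back to London, then to Boston again, and so on.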
[Diagram: federated (hub and spoke) clusters. The BOSTON, LONDON and REYKJAVIK data centers each run a regional cluster with a PRODUCER, a CONSUMER and a local topic TOPIC1. MIRROR-MAKER 2 mirrors each regional TOPIC1 in one direction into the CENTRAL data center, where the topics appear as BOSTON.TOPIC1, LONDON.TOPIC1 and REYKJAVIK.TOPIC1 and are read by a central CONSUMER.]
Federated (hub and spoke) clusters
Each region has a Kafka cluster to handle data for the region
There is no requirement for one region to know about data for other regions
Serve users from a nearby data center to increase performance
Central cluster consolidates data from each regional center
• Can be used when central processing requires access to the full set of data
Mirroring is single direction
• Makes it simple to configure, deploy and monitor
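At the central cluster, an application that needs the full data set has to read every regional copy of a topic. Assuming the alias-prefixed topic names shown in the diagram, one option is a pattern that matches all of them. `regional_topic_pattern` is a hypothetical helper. Kafka consumers can subscribe by regular expression, but the exact call depends on the client library.

```python
import re


def regional_topic_pattern(base_topic):
    """Regex matching '<alias>.<base_topic>' for any single-segment alias."""
    return re.compile(r"^[^.]+\." + re.escape(base_topic) + r"$")


if __name__ == "__main__":
    pattern = regional_topic_pattern("TOPIC1")
    for name in ("LONDON.TOPIC1", "BOSTON.TOPIC1", "TOPIC1", "LONDON.TOPIC10"):
        print(name, bool(pattern.match(name)))
```

The pattern deliberately excludes an unprefixed TOPIC1, so a topic created locally on the central cluster is not mixed in with the mirrored regional data.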
Be aware
Not useful if all regions need access to the data
Central cluster is for secondary processing or back-up
Lag and ordering issues at the central cluster may arise if regional data is co-dependent
Planning multi-cluster architectures for fail-over
Consider how applications will fail over to different clusters
• Bootstrap address / DNS
• Certificates, user credentials, permissions
• Consumer offsets
• Topic subscriptions
• Message ordering
Consider the effect of duplicate or lost messages
• Idempotent writes
Monitor that data is arriving at remote clusters, as well as your primary system health
Practice your fail-over and fail-back
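After a fail-over, consumers that restart from an earlier offset may see some messages twice. One common way to make that harmless is to make processing idempotent. The sketch below assumes each message carries a unique ID and keeps the seen IDs in memory. The `IdempotentHandler` class is hypothetical, and a real system would need a durable store.

```python
class IdempotentHandler:
    """Applies each message at most once, keyed by a unique message ID."""

    def __init__(self, apply):
        self._apply = apply
        self._seen = set()  # in-memory only: an assumption made for this sketch

    def handle(self, message_id, payload):
        """Return True if the payload was applied, False if it was a duplicate."""
        if message_id in self._seen:
            return False
        self._apply(payload)
        self._seen.add(message_id)
        return True


if __name__ == "__main__":
    applied = []
    handler = IdempotentHandler(applied.append)
    # "order-2" is replayed after a simulated fail-over.
    for message_id, payload in [("order-1", 10), ("order-2", 20), ("order-2", 20)]:
        handler.handle(message_id, payload)
    print(applied)
```

Recording the ID only after the payload has been applied means a crash in between leads to a retry rather than a lost message, at the cost of a possible repeat that the downstream write must tolerate.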
Summary
Kafka provides a good degree of high availability
Some applications require additional guarantees
The right approach depends on your requirements and your infrastructure
Thank you
Andrew Borley
Emma Humber
—
borley@uk.ibm.com
ehumber@confluent.io
© Copyright IBM Corporation 2021. All rights reserved. The information contained in these materials is provided for informational purposes only, and is provided AS IS without warranty of
any kind, express or implied. Any statement of direction represents IBM’s current intent, is subject to change or withdrawal, and represent only goals and objectives. IBM, the IBM logo, and
ibm.com are trademarks of IBM Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM
trademarks is available at Copyright and trademark information.