Availability of Kafka - Beyond the Brokers | Andrew Borley and Emma Humber, IBM

Apache Kafka
Availability of Kafka - Beyond the Brokers
Andrew Borley
Emma Humber
Kafka Summit Europe 2021

High availability
Kafka has guarantees around the number of server failures a cluster can tolerate
What if the environment becomes unavailable?
Many layers: Applications, Kafka, OS, Network, Storage
Design availability into the system

Constraints
What guarantees do you need?
• Consistency, availability, performance, manual intervention
What resources do you have?
• Datacenters, network, cost
Consistency OR availability

Availability zones
Set of isolated infrastructure
• Compute, storage and network connectivity + associated power and cooling
• Limits the blast radius of an infrastructure problem
A datacenter, or a failure domain within a datacenter
Geographic regions (e.g. Central Europe) often support multiple availability zones

[Diagram: a Kafka cluster stretched across three availability zones, each a datacenter. Each zone runs one Kafka broker and one Zookeeper server: broker 1 (rack.id AZ1) and Zookeeper server 1 in zone 1, broker 2 (rack.id AZ2) and Zookeeper server 2 in zone 2, broker 3 (rack.id AZ3) and Zookeeper server 3 in zone 3. Topic A has three partitions, each replicated on all three brokers. Broker 1 is leader for partition 1, broker 2 for partition 2 and broker 3 for partition 3; the other replicas are followers.]

Configuration
Kubernetes nodes are assigned a zone, labelled with topology.kubernetes.io/zone
Set Kafka's broker.rack to the node's topology.kubernetes.io/zone value
How this is configured depends on the installation technology, e.g. rackAssignment or topologyKey
Low latency between zones is a MUST
Look at timeouts and client configuration, for example:
• zookeeper.connection.timeout.ms
• replica.lag.time.max.ms
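
As a rough illustration of the mapping on this slide, the sketch below derives a broker.rack setting from a Kubernetes node's labels. The function names and the timeout values are assumptions made for the example, not recommendations; in practice the installation technology usually performs this step for you.

```python
ZONE_LABEL = "topology.kubernetes.io/zone"


def broker_rack_properties(node_labels, zk_timeout_ms=18000, replica_lag_ms=30000):
    """Build broker properties for a broker scheduled on a node with these labels.

    The timeout defaults are placeholder values for illustration only;
    tune them for the latency actually observed between zones.
    """
    zone = node_labels.get(ZONE_LABEL)
    if zone is None:
        raise ValueError(f"node has no {ZONE_LABEL} label")
    return {
        "broker.rack": zone,
        "zookeeper.connection.timeout.ms": str(zk_timeout_ms),
        "replica.lag.time.max.ms": str(replica_lag_ms),
    }


def render(props):
    """Render a dict as lines in Java .properties style (key=value)."""
    return "\n".join(f"{key}={value}" for key, value in props.items())


if __name__ == "__main__":
    # "AZ1" is a hypothetical zone name matching the diagrams in this deck.
    print(render(broker_rack_properties({ZONE_LABEL: "AZ1"})))
```

Failing loudly when the label is missing is a deliberate choice here: a broker that starts without a rack silently weakens rack-aware replica placement.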

[Diagram: three availability zones, each a datacenter with two Kafka brokers: brokers 1 and 2 (rack.id AZ1) in zone 1, brokers 3 and 4 (rack.id AZ2) in zone 2, brokers 5 and 6 (rack.id AZ3) in zone 3. Every broker holds a replica of each of Topic A's three partitions. Broker 1 is leader for partition 1, broker 3 for partition 2 and broker 2 for partition 3; all other replicas are followers.]
Replication factor > min.insync.replicas
Ensure there are sufficient replicas to cover all zones
client.rack tags which rack (zone) the client application is running in
• Consumers to fetch from the closest replica
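
The replica guidance above can be checked with a small calculation. This is a sketch under stated assumptions: the function names are hypothetical, the broker-to-zone map mirrors the six-broker diagram, and every surviving replica is assumed to be in sync.

```python
def zones_covered(replicas, broker_rack):
    """Set of zones that hold at least one replica of a partition."""
    return {broker_rack[broker] for broker in replicas}


def survives_zone_loss(replicas, broker_rack, min_insync_replicas):
    """True if losing any single zone still leaves min_insync_replicas replicas.

    Assumes every replica outside the lost zone is in sync.
    """
    for lost_zone in zones_covered(replicas, broker_rack):
        remaining = [b for b in replicas if broker_rack[b] != lost_zone]
        if len(remaining) < min_insync_replicas:
            return False
    return True


# Broker id -> zone, as in the six-broker diagram.
BROKER_RACK = {1: "AZ1", 2: "AZ1", 3: "AZ2", 4: "AZ2", 5: "AZ3", 6: "AZ3"}

if __name__ == "__main__":
    spread = [1, 3, 5]    # one replica per zone
    clumped = [1, 2, 3]   # two replicas share AZ1
    print(survives_zone_loss(spread, BROKER_RACK, min_insync_replicas=2))
    print(survives_zone_loss(clumped, BROKER_RACK, min_insync_replicas=2))
```

With a replication factor of 3 and min.insync.replicas of 2, the spread assignment keeps two replicas whichever zone is lost, while the clumped one drops to a single replica if AZ1 fails.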

Zookeeper
[Diagram: a cluster stretched across only two availability zones, each a datacenter. Zone 1 holds Kafka brokers 1 and 2 (rack.id AZ1) and Zookeeper servers 1 and 2. Zone 2 holds Kafka brokers 3 and 4 (rack.id AZ2) and Zookeeper server 3. Every broker holds a replica of each of Topic A's three partitions. Broker 1 is leader for partition 1, broker 3 for partition 2 and broker 2 for partition 3.]

Stretch clusters
Kafka and Zookeeper replication ensures data is highly available
Data consistency
Exactly once processing possible
Event order can be preserved
Guards against the loss of a data center
Simple client configuration
No offset lag or offset translation
Utilizes all brokers

Considerations
Stable, low-latency, high-bandwidth connection is a must: take care if crossing regions
No zero-downtime upgrades
Doesn't protect against whole-cluster failure
Cross-availability-zone data transfer fees will apply, especially ingress/egress
Third datacenter required
Complexity of configuration
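
One likely reason for the third datacenter is that a Zookeeper ensemble only stays available while a strict majority of its servers is reachable. The sketch below works through that arithmetic for the two-zone and three-zone layouts from the diagrams; the function names are hypothetical.

```python
def has_quorum(servers_up, ensemble_size):
    """A majority quorum needs strictly more than half of the ensemble."""
    return servers_up > ensemble_size // 2


def survives_any_zone_loss(servers_per_zone):
    """True if the ensemble keeps quorum whichever single zone is lost."""
    total = sum(servers_per_zone.values())
    return all(
        has_quorum(total - lost, total) for lost in servers_per_zone.values()
    )


if __name__ == "__main__":
    two_zones = {"AZ1": 2, "AZ2": 1}            # Zookeeper slide layout
    three_zones = {"AZ1": 1, "AZ2": 1, "AZ3": 1}
    print(survives_any_zone_loss(two_zones))    # losing AZ1 leaves 1 of 3
    print(survives_any_zone_loss(three_zones))  # any loss leaves 2 of 3
```

With only two zones, one of them must hold the majority of servers, so losing that zone always costs the quorum no matter how many servers are added.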

Multiple clusters
Multiple independent Kafka clusters in different regions. Topic data is mirrored across clusters
• Active/passive: Produce to primary cluster, consume from any
• Active/active: Produce to and consume from any cluster
• Federated: Central cluster with multiple regional clusters
A mirroring technology, such as MirrorMaker 2, copies the data between clusters
• Typically runs at the target cluster
• Data is consumed remotely and produced locally
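
To make "consumed remotely and produced locally" concrete, here is a toy in-memory sketch. Plain Python lists stand in for topics, and no Kafka client is involved; the function names are hypothetical. The topic naming follows the source-alias prefix convention seen in the diagrams (PRIMARY.TOPIC1).

```python
def remote_topic_name(source_alias, topic, separator="."):
    """Name of the mirrored topic on the target cluster, e.g. PRIMARY.TOPIC1."""
    return f"{source_alias}{separator}{topic}"


def mirror(source_log, target_log, mirrored_up_to):
    """Copy records not yet mirrored from the source log to the target log.

    source_log / target_log: lists standing in for a topic partition.
    mirrored_up_to: source position already copied.
    Returns the new source position.
    """
    for record in source_log[mirrored_up_to:]:
        target_log.append(record)  # "produce locally" on the target cluster
    return len(source_log)


if __name__ == "__main__":
    source, target = ["a", "b", "c"], []
    position = mirror(source, target, 0)
    source.append("d")                     # new data arrives at the source
    position = mirror(source, target, position)
    print(remote_topic_name("PRIMARY", "TOPIC1"), target)
```

Because the copy happens in a separate step after the source write, the target is always at or behind the source, which is why mirroring is described as asynchronous.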

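Active/passive clusters
[Diagram: a PRIMARY and a SECONDARY data center. In the primary, a producer writes to topic TOPIC1 and a consumer reads from it. MirrorMaker 2 mirrors TOPIC1 into the secondary as topic PRIMARY.TOPIC1, where another consumer reads it.]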

Data back-up to multiple destinations
Disaster recovery fail-over after loss of infrastructure
Data migration:
• Moving to a new cluster
• Moving from a staging environment to a production environment
Clusters are independent
Message keys used for partitioning, so order is preserved on a per-key basis
Source cluster must be the 'owner' of the data. Target cluster is essentially read-only

Considerations
Target cluster will lag behind source cluster: mirroring is asynchronous, so monitor that it's not too far behind
• High-traffic topics could mean thousands of un-mirrored messages, even if there are only a few milliseconds of lag
Message offsets in the source and target topics may not match up
• On fail-over consumers need to be updated to start at the correct offset
• Returning to the source cluster will require a similar process
Configuration changes on the source may need to be propagated to the target
Target cluster may be unused (until it’s needed)
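
The point about small lag on busy topics is simple arithmetic. The sketch below uses a made-up message rate and threshold purely for illustration, and the function names are hypothetical.

```python
def unmirrored_messages(messages_per_second, lag_ms):
    """Approximate number of messages not yet mirrored at a given time lag."""
    return messages_per_second * lag_ms / 1000


def too_far_behind(source_end_offset, mirrored_offset, max_lag_messages):
    """Simple monitoring check on how many messages the target is missing."""
    return (source_end_offset - mirrored_offset) > max_lag_messages


if __name__ == "__main__":
    # Hypothetical topic taking 500,000 messages per second, 5 ms behind.
    print(unmirrored_messages(500_000, 5))
    print(too_far_behind(source_end_offset=10_000, mirrored_offset=7_000,
                         max_lag_messages=2_500))
```

Measuring lag in messages rather than milliseconds gives a direct estimate of how much data could be missing from the target if the source were lost at that moment.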

Active/active clusters
[Diagram: a LONDON and a BOSTON data center, each with its own producer, consumer and local topic TOPIC1. One MirrorMaker 2 flow mirrors London's TOPIC1 into Boston as LONDON.TOPIC1; another mirrors Boston's TOPIC1 into London as BOSTON.TOPIC1.]

Implicit disaster recovery – data centers can operate independently
Data back-up to multiple destinations
A ‘virtual’ topic shared across geographies
Serve users from a nearby data center to increase performance
Redundancy and resilience: can easily do a network redirect on failure to route all traffic to the surviving cluster

Considerations
Possible to accidentally configure loops of data
Lag and ordering issues may arise if trying to consume from a different data center from the one where the data was produced
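
One way to picture how a mirroring loop is avoided is a prefix check on topic names: a topic is not mirrored back to a cluster whose alias already appears in its name. This is only a sketch of that idea, with hypothetical function names, and it assumes topic names use "." solely as the alias separator.

```python
def originated_from(topic, alias, separator="."):
    """True if the topic name carries `alias` as one of its source prefixes.

    Simplification: a topic whose own name contains the separator
    (e.g. "orders.v2") would confuse this check.
    """
    prefixes = topic.split(separator)[:-1]
    return alias in prefixes


def should_mirror(topic, target_alias):
    """Skip topics that already came from the target cluster."""
    return not originated_from(topic, target_alias)


if __name__ == "__main__":
    # Topics present in London, considered for mirroring to Boston.
    for topic in ["TOPIC1", "BOSTON.TOPIC1"]:
        print(topic, should_mirror(topic, "BOSTON"))
```

In the active/active layout, London's TOPIC1 is mirrored to Boston, but BOSTON.TOPIC1 is not sent back, so the same records do not circulate indefinitely.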

Federated (hub and spoke) clusters
[Diagram: LONDON, BOSTON and REYKJAVIK data centers, each with its own producer, consumer and local topic TOPIC1. A MirrorMaker 2 flow from each region mirrors its TOPIC1 into a CENTRAL data center as LONDON.TOPIC1, BOSTON.TOPIC1 and REYKJAVIK.TOPIC1, where a consumer reads them.]

Federated (hub and spoke) clusters
Each region has a Kafka cluster to handle data for the region
There is no requirement for one region to know about data for other regions
Serve users from a nearby data center to increase performance
Central cluster consolidates data from each regional center
• Can be used when central processing requires access to the full set of data
Mirroring runs in a single direction
• Makes it simple to configure, deploy and monitor

Be aware
Not useful if all regions need access to the data
Central cluster is for secondary processing or back-up
Lag and ordering issues at the central cluster may arise if regional data is co-dependent

Planning multi-cluster architectures for fail-over
Consider how applications will fail over to different clusters
• Bootstrap address / DNS
• Certificates, user credentials, permissions
• Consumer offsets
• Topic subscriptions
• Message ordering
Consider the effect of duplicate or lost messages
• Idempotent writes
Monitor that data is arriving at remote clusters, as well as your primary system health
Practice your fail-over and fail-back
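
Two of the items above, consumer offsets and duplicates, can be sketched together. The code assumes some mechanism has recorded pairs of matching source and target offsets ("sync points"); that data, the function names and the message IDs are all hypothetical. The de-duplication shown is done by the consuming application and is a different technique from the Kafka producer's idempotence setting.

```python
import bisect


def translate_offset(sync_points, source_offset):
    """Map a committed source offset to a safe starting offset on the target.

    sync_points: list of (source_offset, target_offset) pairs, sorted by
    source offset. Picks the latest sync point at or before source_offset,
    so the consumer may re-read some messages but never skips any.
    Returns 0 (start of the log in this sketch) if no sync point qualifies.
    """
    source_offsets = [src for src, _ in sync_points]
    index = bisect.bisect_right(source_offsets, source_offset)
    if index == 0:
        return 0
    return sync_points[index - 1][1]


def process_once(records, seen_ids, handle):
    """Apply handle() to each record whose ID has not been seen before."""
    for record_id, payload in records:
        if record_id in seen_ids:
            continue  # duplicate delivered after fail-over; skip it
        handle(payload)
        seen_ids.add(record_id)


if __name__ == "__main__":
    points = [(0, 0), (100, 40), (200, 95)]
    print(translate_offset(points, 150))  # resume from target offset 40
    handled, seen = [], set()
    process_once([(1, "a"), (2, "b"), (1, "a")], seen, handled.append)
    print(handled)
```

Rounding down to an earlier sync point trades duplicates for completeness, which is why the duplicate handling matters on the consuming side.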

Summary
Kafka provides a good degree of high availability
Some applications require additional guarantees
Dependent on your requirements and your infrastructure

Thank you
Andrew Borley
Emma Humber
—
borley@uk.ibm.com
ehumber@confluent.io
© Copyright IBM Corporation 2021. All rights reserved. The information contained in these materials is provided for informational purposes only, and is provided AS IS without warranty of
any kind, express or implied. Any statement of direction represents IBM’s current intent, is subject to change or withdrawal, and represent only goals and objectives. IBM, the IBM logo, and
ibm.com are trademarks of IBM Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM
trademarks is available at Copyright and trademark information.