Hardening Kafka Replication
Jason Gustafson, Engineer@Confluent
● At the heart of Kafka is the log
● Log replication provides high availability
● Kafka has a solid replication protocol
● 99.999% of the time it does the right thing
● This talk is about the remaining 0.001%
Overview
Preliminaries
View of a single partition
Key
Value
Offset
View of a single partition
Message Appends
Key
Value
Offset
Key
Value
Offset
View of a single partition
Message Appends
k0
v0
0
View of a single partition
Message Appends
k0 k1 k2
v0 v1 v2
0 1 2
Key
Value
Offset
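A minimal Python sketch of the append-only log pictured above. The `PartitionLog` class and its method names are illustrative and are not Kafka's API; the only property taken from the slides is that each appended key/value pair receives the next sequential offset.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PartitionLog:
    """Hypothetical append-only log for a single partition."""
    records: List[Tuple[str, str]] = field(default_factory=list)

    def append(self, key: str, value: str) -> int:
        """Append a record and return the offset it was assigned."""
        offset = len(self.records)  # offsets are dense and start at 0
        self.records.append((key, value))
        return offset

    def read(self, offset: int) -> Tuple[str, str]:
        """Return the (key, value) pair stored at the given offset."""
        return self.records[offset]


log = PartitionLog()
offsets = [log.append(f"k{i}", f"v{i}") for i in range(3)]
```

Appending k0, k1, k2 in order assigns offsets 0, 1, 2, which is the picture on the slide.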
View of a single partition
k0 k1 k2
v0 v1 v2
0 1 2
Key
Value
Offset
View of a single partition
k0 k1 k2
v0 v1 v2
0 1 2
Key
Value
Offset
k0 k1 k2
v0 v1 v2
View of a single partition
Key
Value
k0 k1 k2
v0 v1 v2
Offset 0
View of a single partition
Key
Value
k0 k1 k2
v0 v1 v2
r0 r1 r2
View of a single partition
Record
r0 r1 r2
View of a single partition
Record
r0 r1 r2
View of a single
partition with 3
replicas
r0 r1 r2 A
B
C
View of a single
partition with 3
replicas
r0 r1 r2 A
B
C
The protocol’s goal is
to replicate the logs
exactly to all replicas
r0 r1 r2
r0 r1 r2
r0 r1 r2
A
B
C
The protocol’s goal is
to replicate the logs
exactly to all replicas
The Theory
A
B
C
Leader
A
B
C
For each partition, one replica
is elected as the leader
Leader
Follower
Follower
A
B
C
Replicas that are not leaders
are called followers
Leader
Follower
Follower
A
B
C
Leaders accept writes from
producers.
r0 r1 r2
Leader
Follower
Follower
A
B
C
Leaders accept writes from
producers.
r0 r1 r2
A
B
C
Leader
Follower
Follower
Followers fetch from the
leader.
r0 r1
r0 r1 r2
A
B
C
Leader
Follower
Follower
Followers fetch from the
leader.
r0 r1
r0 r1 r2
r0
A
B
C
Leader
Follower
Follower
Followers fetch from the
leader.
r0 r1
r0 r1 r2
r0
A
B
C
Leader
Follower
Follower
Leader election is handled by
a separate component known
as the controller
r0 r1
r0 r1 r2
r0
A
B
C
Leader
Follower
Follower
Leader Epoch ISR
B 0 A, B, C
In order to enable election by
the controller, we maintain
state in Zookeeper about the
in-sync replicas (ISR).
r0 r1
r0 r1 r2
r0
A
B
C
Leader
Follower
Follower
When there is a state change
(e.g. a new leader), the
controller sends the updated
state to all the replicas.
Leader Epoch ISR
B 0 A, B, C
r0 r1
r0 r1 r2
r0
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
When there is a state change
(e.g. a new leader), the
controller sends the updated
state to all the replicas.
r0 r1
r0 r1 r2
r0
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
r0 r1
r0 r1 r2
r0
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
High Watermark
The high watermark is
the largest offset known
to be replicated to all
members of the ISR.
r0 r1
r0 r1 r2
r0
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
The high watermark is
the largest offset known
to be replicated to all
members of the ISR.
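The definition above can be sketched in a few lines of Python. This assumes the leader knows each replica's log end offset and treats the high watermark as an exclusive bound (every offset strictly below it is on all ISR members); the function name and the dictionary layout are illustrative only.

```python
from typing import Dict, Set


def high_watermark(log_end_offsets: Dict[str, int], isr: Set[str]) -> int:
    """Smallest log end offset among the ISR members.

    Every offset strictly below the returned value exists on all of them.
    """
    return min(log_end_offsets[replica] for replica in isr)


# Log end offsets read off the slide: A holds r0 r1, B holds r0 r1 r2, C holds r0.
leo = {"A": 2, "B": 3, "C": 1}
hw = high_watermark(leo, {"A", "B", "C"})
hw_without_c = high_watermark(leo, {"A", "B"})
```

With all three replicas in the ISR only r0 is committed; if C were dropped from the ISR, r1 would be committed as well.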
r0 r1
r0 r1 r2
r0
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
Records below the high
watermark are considered
“committed” and are visible
to consumers.
Committed
r0 r1
r0 r1 r2
r0
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
Records above the high
watermark are considered
uncommitted.
Committed Uncommitted
r0 r1
r0 r1 r2
r0
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
As records are replicated,
the high watermark moves
forward.
r0 r1
r0 r1 r2
r0 r1
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
As records are replicated,
the high watermark moves
forward.
r0 r1
r0 r1 r2
r0 r1
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
As records are replicated,
the high watermark moves
forward.
r0 r1
r0 r1 r2
r0 r1
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
If a replica falls behind, it
can be removed from the
ISR by the leader.
r0 r1
r0 r1 r2 r3 r4 r5 r6
r0 r1
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
If a replica falls behind, it
can be removed from the
ISR by the leader.
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
If a replica falls behind, it
can be removed from the
ISR by the leader.
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B
Follower
(epoch=0)
If a replica falls behind, it
can be removed from the
ISR by the leader.
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B
Follower
(epoch=0)
If a replica falls behind, it
can be removed from the
ISR by the leader.
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B
Follower
(epoch=0)
If a replica falls behind, it
can be removed from the
ISR by the leader.
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B
Follower
(epoch=0)
An out-of-sync replica that
catches up to the high
watermark is added back
to the ISR.
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B
Follower
(epoch=0)
An out-of-sync replica that
catches up to the high
watermark is added back
to the ISR.
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
An out-of-sync replica that
catches up to the high
watermark is added back
to the ISR.
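A rough Python sketch of the shrink and expand rules described above. The offset-based `max_lag` threshold is a simplification chosen for illustration (the broker's actual criterion is time-based), and both function names are hypothetical.

```python
from typing import Dict, Set


def shrink_isr(isr: Set[str], leader: str, leo: Dict[str, int], max_lag: int) -> Set[str]:
    """Drop followers whose log end offset trails the leader by more than max_lag."""
    return {r for r in isr if r == leader or leo[leader] - leo[r] <= max_lag}


def expand_isr(isr: Set[str], candidate: str, leo: Dict[str, int], hw: int) -> Set[str]:
    """Re-admit a replica once its log end offset has reached the high watermark."""
    if leo[candidate] >= hw:
        return isr | {candidate}
    return set(isr)


# Offsets from the slides: A holds r0..r4, B (leader) r0..r6, C only r0 r1.
leo = {"A": 5, "B": 7, "C": 2}
shrunk = shrink_isr({"A", "B", "C"}, leader="B", leo=leo, max_lag=3)
hw = min(leo[r] for r in shrunk)          # high watermark of the reduced ISR
still_out = expand_isr(shrunk, "C", leo, hw)
leo["C"] = 6                              # C catches up to r0..r5
expanded = expand_isr(shrunk, "C", leo, hw)
```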
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
Only replicas in the ISR are
eligible to become leader
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
When a leader fails, the
controller will take it out of
the ISR and elect a new
leader from the remaining
ISR.
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
A 1 A, C
Follower
(epoch=0)
When a leader fails, the
controller will take it out of
the ISR and elect a new
leader from the remaining
ISR.
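The controller's reaction to a leader failure can be sketched as follows. `QuorumState` mirrors the Leader / Epoch / ISR table on the slides; picking the alphabetically first survivor is an arbitrary choice made for this sketch, not Kafka's actual selection rule.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class QuorumState:
    leader: Optional[str]
    epoch: int
    isr: FrozenSet[str]


def elect_after_failure(state: QuorumState, failed: str) -> QuorumState:
    """Remove the failed replica from the ISR, pick a new leader, bump the epoch."""
    remaining = state.isr - {failed}
    if not remaining:
        raise RuntimeError("no in-sync replica left to elect")
    new_leader = min(remaining)  # arbitrary deterministic choice for the sketch
    return QuorumState(new_leader, state.epoch + 1, remaining)


before = QuorumState("B", 0, frozenset({"A", "B", "C"}))
after = elect_after_failure(before, failed="B")
```

Starting from leader B at epoch 0, the result is leader A at epoch 1 with ISR {A, C}, matching the table above.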
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Leader
(epoch=1)
Leader Epoch ISR
A 1 A, C
Follower
(epoch=0)
The new leader/ISR state is
propagated to the
remaining replicas
r0 r1 r2 r3 r4 r7 r8
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Leader
(epoch=1)
Leader Epoch ISR
A 1 A, C
Follower
(epoch=0)
The leader can begin
accepting writes
immediately.
r0 r1 r2 r3 r4 r7 r8
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Leader
(epoch=1)
Leader Epoch ISR
A 1 A, C
Follower
(epoch=1)
Upon becoming a follower,
the replica may have
uncommitted data which
needs to be truncated.
r0 r1 r2 r3 r4 r7 r8
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4
A
B
C
Leader
(epoch=0)
Leader
(epoch=1)
Leader Epoch ISR
A 1 A, C
Follower
(epoch=1)
Upon becoming a follower,
the replica may have
uncommitted data which
needs to be truncated.
r0 r1 r2 r3 r4 r7 r8
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r7
A
B
C
Leader
(epoch=0)
Leader
(epoch=1)
Leader Epoch ISR
A 1 A, C
Follower
(epoch=1)
Upon becoming a follower,
the replica may have
uncommitted data which
needs to be truncated.
In Practice
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
High Watermark
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
High Watermark
Every replica tracks the
high watermark separately
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
Every replica tracks the
high watermark separately
r0 r1 r2
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
Every replica tracks the
high watermark separately
r0 r1
r0 r1 r2
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
Every replica tracks the
high watermark separately
r0 r1
r0 r1 r2
r0
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
Every replica tracks the
high watermark separately
r0 r1
r0 r1 r2
r0
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
The leader advances its
high watermark based on
the fetch offsets of replicas
r0 r1
r0 r1 r2
r0
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
The leader advances its
high watermark based on
the fetch offsets of replicas
r0 r1
r0 r1 r2
r0
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
The leader piggybacks its
high watermark onto fetch
responses
r0 r1
r0 r1 r2
r0 r1
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
The leader piggybacks its
high watermark onto fetch
responses
r0 r1
r0 r1 r2
r0 r1
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
At any point in time, the
follower high watermarks
may be a little behind the
leader’s.
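A small simulation of why follower high watermarks lag, under the assumptions that a fetch request reveals the follower's log end offset and that the response carries the leader's high watermark at that moment. The `Replica`, `Leader`, and `fetch` names are illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Replica:
    log: List[str] = field(default_factory=list)
    hw: int = 0


@dataclass
class Leader(Replica):
    # Last fetch offset seen from each follower in the ISR.
    fetch_offsets: Dict[str, int] = field(default_factory=dict)


def fetch(leader: Leader, name: str, follower: Replica) -> None:
    """One fetch round trip for the named follower."""
    fetch_offset = len(follower.log)
    leader.fetch_offsets[name] = fetch_offset
    # The leader advances its HW to the smallest offset every replica has reached.
    leader.hw = max(leader.hw, min([len(leader.log), *leader.fetch_offsets.values()]))
    follower.log.extend(leader.log[fetch_offset:])
    # The response piggybacks the leader's HW; cap it at what the follower holds.
    follower.hw = min(leader.hw, len(follower.log))


leader = Leader(log=["r0", "r1", "r2"], fetch_offsets={"A": 0, "C": 0})
a, c = Replica(), Replica()
for name, replica in [("A", a), ("C", c), ("A", a), ("C", c)]:
    fetch(leader, name, replica)
```

After these four fetches the leader and C both have a high watermark of 3, while A still reports 0 and only learns the new value on its next fetch.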
Edge Case 1:
Fast leader elections
r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
Replica B fails.
r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
C 1 A, C
Follower
(epoch=0)
Replica B is removed from
the ISR and C is elected as
the new leader.
r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
C 1 A, C
Leader
(epoch=1)
Replica B is removed from
the ISR and C is elected as
the new leader.
r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 A, C
Leader
(epoch=1)
Replica A finds the new
leader and truncates its log
to the local high watermark
r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 A, C
Leader
(epoch=1)
Replica A finds the new
leader and truncates its log
to the local high watermark
r0 r1
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 A, C
Leader
(epoch=1)
Replica A finds the new
leader and truncates its log
to the local high watermark
r0 r1
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 A, C
Leader
(epoch=1)
Before replica A begins
fetching, the new leader
fails.
r0 r1
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 A, C
Leader
(epoch=1)
Before replica A begins
fetching, the new leader
fails.
r0 r1
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
A 2 A
Leader
(epoch=1)
Before replica A begins
fetching, the new leader
fails.
r0 r1
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Leader
(epoch=2)
Leader Epoch ISR
A 2 A
Leader
(epoch=1)
Before replica A begins
fetching, the new leader
fails.
r0 r1 r7 r8 r9
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Leader
(epoch=2)
Leader Epoch ISR
A 2 A
Leader
(epoch=1)
Leader A then begins
accepting writes.
r0 r1 r7 r8 r9
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Leader
(epoch=2)
Leader Epoch ISR
A 2 A
Leader
(epoch=1)
But r2 and r3 had already
been committed to the ISR!
r0 r1 r7 r8 r9
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Leader
(epoch=2)
Leader Epoch ISR
A 2 A
Leader
(epoch=1)
Suppose that B eventually
gets restarted.
r0 r1 r7 r8 r9
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Leader
(epoch=2)
Leader Epoch ISR
A 2 A
Leader
(epoch=1)
Suppose that B eventually
gets restarted.
r0 r1 r7 r8 r9
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Follower
(epoch=2)
Leader
(epoch=2)
Leader Epoch ISR
A 2 A
Leader
(epoch=1)
Suppose that B eventually
gets restarted.
r0 r1 r7 r8 r9
r0 r1 r2 r3
r0 r1 r2 r3 r4 r5
A
B
C
Follower
(epoch=2)
Leader
(epoch=2)
Leader Epoch ISR
A 2 A
Leader
(epoch=1)
Suppose that B eventually
gets restarted.
r0 r1 r7 r8 r9
r0 r1 r2 r3 r9
r0 r1 r2 r3 r4 r5
A
B
C
Follower
(epoch=2)
Leader
(epoch=2)
Leader Epoch ISR
A 2 A
Leader
(epoch=1)
The logs have now
diverged.
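The sequence above can be replayed in a few lines of Python. The high watermark values (2 on A, 4 on B) are read off the diagrams rather than stated on the slides, and `truncate_to_hw` is a hypothetical helper standing in for the pre-KIP-101 truncation rule.

```python
from typing import List


def truncate_to_hw(log: List[str], hw: int) -> List[str]:
    """Pre-KIP-101 behaviour: a new follower drops everything above its local HW."""
    return log[:hw]


a_log, a_hw = ["r0", "r1", "r2", "r3"], 2
b_log, b_hw = ["r0", "r1", "r2", "r3", "r4", "r5", "r6"], 4

a_log = truncate_to_hw(a_log, a_hw)    # A becomes a follower of C and truncates
a_log += ["r7", "r8", "r9"]            # C fails, A is elected and accepts writes
b_log = truncate_to_hw(b_log, b_hw)    # B restarts as a follower of A
b_log += a_log[len(b_log):]            # B resumes fetching from its log end offset

first_mismatch = next(i for i, (x, y) in enumerate(zip(a_log, b_log)) if x != y)
```

Both logs end up with five records, but they disagree from offset 2 onward.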
KIP-101
r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 A, C
Leader
(epoch=1)
Replica B has failed and
replica A needs to truncate
its log.
r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 A, C
Leader
(epoch=1)
A -> C: What is the end offset
for epoch=0?
r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 A, C
Leader
(epoch=1)
A -> C: What is the end offset
for epoch=0?
C -> A: The end offset is 6
Offset 6
r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 A, C
Leader
(epoch=1)
A -> C: What is the end offset
for epoch=0?
C -> A: The end offset is 6
A: Cool, no truncation needed!
r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 A, C
Leader
(epoch=1)
r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 A, C
Leader
(epoch=1)
r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
A 2 A
Leader
(epoch=1)
r0 r1 r2 r3 r7 r8
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Leader
(epoch=2)
Leader Epoch ISR
A 2 A
Leader
(epoch=1)
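A sketch of the epoch-based exchange shown above, assuming the leader keeps a list of (epoch, start offset) pairs and answers with the start offset of the first later epoch, or its log end offset if there is none. Function and variable names are illustrative.

```python
from typing import List, Tuple


def end_offset_for_epoch(epoch_starts: List[Tuple[int, int]],
                         log_end_offset: int, epoch: int) -> int:
    """Start offset of the first epoch larger than `epoch`, else the log end offset."""
    for e, start in sorted(epoch_starts):
        if e > epoch:
            return start
    return log_end_offset


def truncation_offset(follower_leo: int, leader_end_offset: int) -> int:
    """The follower keeps everything below the smaller of the two offsets."""
    return min(follower_leo, leader_end_offset)


# C holds r0..r5, all written in epoch 0; A holds r0..r3.
c_epoch_starts = [(0, 0)]
reply = end_offset_for_epoch(c_epoch_starts, log_end_offset=6, epoch=0)
keep = truncation_offset(follower_leo=4, leader_end_offset=reply)
```

The reply is 6, so A keeps all four of its records, including the committed r2 and r3 that truncating to the local high watermark would have discarded.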
Edge Case 2:
Fast leader elections redux
r0 r1 r2
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4
A
B
C
Leader
(epoch=0)
Leader
(epoch=1)
Leader Epoch ISR
A 1 A, C
Follower
(epoch=0)
Replica B has failed and
replica A has been elected
as the new leader
r0 r1 r2 r7 r8
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4
A
B
C
Leader
(epoch=0)
Leader
(epoch=1)
Leader Epoch ISR
A 1 A, C
Follower
(epoch=0)
Replica B has failed and
replica A has been elected
as the new leader
r0 r1 r2 r7 r8
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4
A
B
C
Leader
(epoch=0)
Leader
(epoch=1)
Leader Epoch ISR
A 1 A, C
Follower
(epoch=0)
Replica B has failed and
replica A has been elected
as the new leader
epoch=0
offset=0
epoch=1
offset=3
r0 r1 r2 r7 r8
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4
A
B
C
Leader
(epoch=0)
Leader
(epoch=1)
Leader Epoch ISR
A 1 A, C
Follower
(epoch=0)
epoch=0
offset=0
epoch=1
offset=3
r0 r1 r2 r7 r8
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4
A
B
C
Leader
(epoch=0)
Leader
(epoch=1)
Leader Epoch ISR
A 1 A, C
Follower
(epoch=0)
Before replica C can
truncate its log, it becomes
the new leader.
epoch=0
offset=0
epoch=1
offset=3
r0 r1 r2 r7 r8
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4
A
B
C
Leader
(epoch=0)
Leader
(epoch=1)
Leader Epoch ISR
C 2 A, C
Follower
(epoch=0)
Before replica C can
truncate its log, it becomes
the new leader.
epoch=0
offset=0
epoch=1
offset=3
r0 r1 r2 r7 r8
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4
A
B
C
Leader
(epoch=0)
Leader
(epoch=1)
Leader Epoch ISR
C 2 A, C
Leader
(epoch=2)
Before replica C can
truncate its log, it becomes
the new leader.
epoch=0
offset=0
epoch=1
offset=3
r0 r1 r2 r7 r8
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r9
A
B
C
Leader
(epoch=0)
Leader
(epoch=1)
Leader Epoch ISR
C 2 A, C
Leader
(epoch=2)
Before replica C can
truncate its log, it becomes
the new leader.
epoch=0
offset=0
epoch=1
offset=3
epoch=2
offset=5
r0 r1 r2 r7 r8
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r9
A
B
C
Leader
(epoch=0)
Follower
(epoch=2)
Leader Epoch ISR
C 2 A, C
Leader
(epoch=2)
epoch=0
offset=0
epoch=1
offset=3
epoch=2
offset=5
A -> C: What is the end offset
for epoch=1?
r0 r1 r2 r7 r8
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r9
A
B
C
Leader
(epoch=0)
Follower
(epoch=2)
Leader Epoch ISR
C 2 A, C
Leader
(epoch=2)
epoch=0
offset=0
epoch=1
offset=3
epoch=2
offset=5
A -> C: What is the end offset
for epoch=1?
C -> A: The end offset is 5
r0 r1 r2 r7 r8
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r9
A
B
C
Leader
(epoch=0)
Follower
(epoch=2)
Leader Epoch ISR
C 2 A, C
Leader
(epoch=2)
epoch=0
offset=0
epoch=1
offset=3
epoch=2
offset=5
A -> C: What is the end offset
for epoch=1?
C -> A: The end offset is 5
A: Cool, no truncation needed!
r0 r1 r2 r7 r8
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r9
A
B
C
Leader
(epoch=0)
Follower
(epoch=2)
Leader Epoch ISR
C 2 A, C
Leader
(epoch=2)
r0 r1 r2 r7 r8 r9
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r9
A
B
C
Leader
(epoch=0)
Follower
(epoch=2)
Leader Epoch ISR
C 2 A, C
Leader
(epoch=2)
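The same lookup rule can be replayed on this second scenario to show why an offset-only answer is not enough. The sketch assumes the leader replies with the start offset of the first epoch after the requested one; the epoch start offsets are those shown on the slides and the variable names are illustrative.

```python
from typing import List, Tuple


def end_offset_for_epoch(epoch_starts: List[Tuple[int, int]],
                         log_end_offset: int, epoch: int) -> int:
    """Start offset of the first epoch larger than `epoch`, else the log end offset."""
    later = [start for e, start in epoch_starts if e > epoch]
    return min(later) if later else log_end_offset


a_log = ["r0", "r1", "r2", "r7", "r8"]          # A wrote r7, r8 in epoch 1
c_log = ["r0", "r1", "r2", "r3", "r4", "r9"]    # C never saw epoch 1
c_epoch_starts = [(0, 0), (2, 5)]

reply = end_offset_for_epoch(c_epoch_starts, len(c_log), epoch=1)
truncate_to = min(len(a_log), reply)
a_after = a_log[:truncate_to] + c_log[truncate_to:]   # A then fetches from C
mismatches = [i for i, (x, y) in enumerate(zip(a_after, c_log)) if x != y]
```

The reply of 5 equals A's log end offset, so A truncates nothing, fetches r9, and the two logs silently disagree at offsets 3 and 4.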
Edge Case 3:
Zombie follower
r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
Follower A fails and is
removed from the ISR.
r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 B, C
Follower
(epoch=0)
Follower A fails and is
removed from the ISR.
r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 B, C
Follower
(epoch=0)
r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 B, C
Follower
(epoch=0)
Replica A could not re-register
in order to get the latest
leader/ISR state and continued
fetching from the current
leader.
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 B, C
Follower
(epoch=0)
Replica A could not re-register
in order to get the latest
leader/ISR state and continued
fetching from the current
leader.
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2
A
B
C
Follower
(epoch=0)
Leader Epoch ISR
B 0 B, C
Follower
(epoch=0)
Leader
(epoch=0)
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2
A
B
C
Follower
(epoch=0)
Leader Epoch ISR
C 1 C
Follower
(epoch=0)
Leader
(epoch=0)
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2
A
B
C
Follower
(epoch=0)
Leader Epoch ISR
C 1 C
Leader
(epoch=1)
Leader
(epoch=0)
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2
A
B
C
Follower
(epoch=0)
Leader Epoch ISR
C 1 C
Leader
(epoch=1)
Leader
(epoch=0)
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r7 r8 r9
A
B
C
Follower
(epoch=0)
Leader Epoch ISR
C 1 C
Leader
(epoch=1)
Leader
(epoch=0)
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r7 r8 r9
A
B
C
Follower
(epoch=0)
Leader Epoch ISR
C 1 C
Leader
(epoch=1)
Leader
(epoch=0)
Meanwhile, replica A still
thought B was the leader and
was still trying to make
progress
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r7 r8 r9
A
B
C
Follower
(epoch=0)
Leader Epoch ISR
C 1 C
Leader
(epoch=1)
Leader
(epoch=0)
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r7 r8 r9
A
B
C
Follower
(epoch=0)
Leader Epoch ISR
C 1 C
Leader
(epoch=1)
Follower
(epoch=1)
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r7 r8 r9
A
B
C
Follower
(epoch=0)
Leader Epoch ISR
C 1 C
Leader
(epoch=1)
Follower
(epoch=1)
r0 r1 r2 r3 r4
r0 r1 r2
r0 r1 r2 r7 r8 r9
A
B
C
Follower
(epoch=0)
Leader Epoch ISR
C 1 C
Leader
(epoch=1)
Follower
(epoch=1)
r0 r1 r2 r3 r4
r0 r1 r2
r0 r1 r2 r7 r8 r9
A
B
C
Follower
(epoch=0)
Leader Epoch ISR
C 1 C
Leader
(epoch=1)
Follower
(epoch=1)
r0 r1 r2 r3 r4
r0 r1 r2 r7 r8 r9
r0 r1 r2 r7 r8 r9
A
B
C
Follower
(epoch=0)
Leader Epoch ISR
C 1 C
Leader
(epoch=1)
Follower
(epoch=1)
r0 r1 r2 r3 r4
r0 r1 r2 r7 r8 r9
r0 r1 r2 r7 r8 r9
A
B
C
Follower
(epoch=0)
Leader Epoch ISR
C 1 B, C
Leader
(epoch=1)
Follower
(epoch=1)
r0 r1 r2 r3 r4
r0 r1 r2 r7 r8 r9
r0 r1 r2 r7 r8 r9
A
B
C
Follower
(epoch=0)
Leader Epoch ISR
B 2 B, C
Leader
(epoch=1)
Follower
(epoch=1)
Once back in the ISR, the
controller elected it as leader
r0 r1 r2 r3 r4
r0 r1 r2 r7 r8 r9
r0 r1 r2 r7 r8 r9
A
B
C
Follower
(epoch=0)
Leader Epoch ISR
B 2 B, C
Leader
(epoch=1)
Leader
(epoch=2)
Once back in the ISR, the
controller elected it as leader
r0 r1 r2 r3 r4
r0 r1 r2 r7 r8 r9
r0 r1 r2 r7 r8 r9
A
B
C
Follower
(epoch=0)
Leader Epoch ISR
B 2 B, C
Leader
(epoch=1)
Leader
(epoch=2)
Suddenly, replica A was able to
make progress again!
r0 r1 r2 r3 r4 r9
r0 r1 r2 r7 r8 r9
r0 r1 r2 r7 r8 r9
A
B
C
Follower
(epoch=0)
Leader Epoch ISR
B 2 B, C
Leader
(epoch=1)
Leader
(epoch=2)
Suddenly, replica A was able to
make progress again!
Reflection
● Our mushy brains are not equipped to think
about edge cases in distributed systems
● How do we know that our fixes are not just
trading one edge case for another?
● How do we know there are not more edge
cases?
Model Checking
TLA+/TLC
● TLA+ is a specification language
created by Leslie Lamport
● TLC is a model checker
● Think “brute force proof by
mathematical induction”
Using LaTeX syntax makes model checking just as much
fun as writing research papers!
Kafka TLA+ Model
● Define the state and how to initialize it
● Define the valid state transitions
● Define expected state invariants
● Run model to check invariants
Model
Checklist
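The four checklist steps can be illustrated with a toy breadth-first checker in Python. This is not TLC and not the Kafka model: the state is a hypothetical triple (leader log end offset, follower log end offset, high watermark), and the `buggy` flag deliberately breaks the high watermark rule so that the checker has something to find.

```python
from collections import deque
from typing import Callable, Iterable, List, Optional, Tuple

State = Tuple[int, int, int]  # (leader_leo, follower_leo, hw)
MAX_LOG = 2                   # bound the log size so the state space is finite


def check(init: State, next_states: Callable[[State], Iterable[State]],
          invariant: Callable[[State], bool]) -> Optional[List[State]]:
    """Explore all reachable states; return a trace to the first violation, if any."""
    parent = {init: None}
    queue = deque([init])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            trace = []
            while state is not None:
                trace.append(state)
                state = parent[state]
            return trace[::-1]
        for nxt in next_states(state):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None


def next_states(state: State, buggy: bool = False) -> Iterable[State]:
    leader_leo, follower_leo, hw = state
    if leader_leo < MAX_LOG:
        yield (leader_leo + 1, follower_leo, hw)      # LeaderWrite
    if follower_leo < leader_leo:
        yield (leader_leo, follower_leo + 1, hw)      # FollowerFetch
    limit = leader_leo if buggy else min(leader_leo, follower_leo)
    if hw < limit:
        yield (leader_leo, follower_leo, hw + 1)      # LeaderIncHighWatermark


def hw_is_replicated(state: State) -> bool:
    _, follower_leo, hw = state
    return hw <= follower_leo


ok = check((0, 0, 0), next_states, hw_is_replicated)
bad = check((0, 0, 0), lambda s: next_states(s, buggy=True), hw_is_replicated)
```

The correct transitions yield no counterexample, while the buggy variant produces the shortest trace that advances the high watermark past the follower.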
r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
1. Records and the log
Log Representation
LogRecords == [
    id: Nat,
    epoch: Nat
]
Log == [
    endOffset: Nat,
    records: [Nat -> LogRecords]
]
r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
1. Records and the log
r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
1. Records and the log
2. Replica State
Replica State Representation
CONSTANT Replicas \* {r1, r2, r3}
ReplicaState == [
    log: Log,
    hw: Nat,
    leaderEpoch: Nat,
    leader: Replicas,
    isr: SUBSET Replicas
]
AllReplicaStates ==
    [Replicas -> ReplicaState]
r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
1. Records and the log
2. Replica State
r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
1. Records and the log
2. Replica State
3. Quorum State
Quorum State Representation
QuorumState == [
    leaderEpoch: Nat,
    leader: Replicas,
    isr: SUBSET Replicas
]
r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
1. Records and the log
2. Replica State
3. Quorum State
r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
1. Records and the log
2. Replica State
3. Quorum State
4. LeaderAndIsr Propagation
Leader/ISR Propagation
LeaderAndIsrRequests ==
    SUBSET QuorumState
Example: initialization
leaderAndIsrRequests: {}
Example: after first leader election
leaderAndIsrRequests: {
    [leader: A, epoch: 0, isr: {A, B, C}]
}
Example: after leader failure and reelection
leaderAndIsrRequests: {
    [leader: A, epoch: 0, isr: {A, B, C}],
    [leader: B, epoch: 1, isr: {B, C}]
}
● Define the state and how to initialize it
● Define the valid state transitions
● Define expected state invariants
● Run model to check invariants
Model
Checklist
State Transitions
Next ==
    \/ ControllerElectLeader
    \/ ControllerShrinkIsr
    \/ ReplicaBecomeLeader
    \/ LeaderExpandIsr
    \/ LeaderShrinkIsr
    \/ LeaderWrite
    \/ LeaderIncHighWatermark
    \/ ReplicaBecomeFollower
    \/ FollowerFetch
The actions fall into three groups: controller actions, leader actions, and follower actions.
State
Transitions
Start off with empty logs, a full ISR, and
no leader
Init
State
Transitions
Init
ControllerElectLeader
The first enabled action is leader election.
State
Transitions
Init
ControllerElectLeader
Electing the first leader enables several
new state transitions
State
Transitions
Init
ControllerElectLeader
ReplicaBecomeLeader
Electing the first leader enables several
new state transitions
State
Transitions
Init
ControllerElectLeader
ReplicaBecomeLeader
Electing the first leader enables several
new state transitions
ReplicaBecomeFollower
State
Transitions
Init
ControllerElectLeader
ReplicaBecomeLeader
Electing the first leader enables several
new state transitions
ReplicaBecomeFollower
ControllerElectLeader
State
Transitions
Init
ControllerElectLeader
ReplicaBecomeLeader
Every transition enables a different set
of next actions.
ReplicaBecomeFollower
ControllerElectLeader
State
Transitions
Init
ControllerElectLeader
ReplicaBecomeLeader
Every transition enables a different set
of next actions.
ReplicaBecomeFollower
ControllerElectLeader
LeaderWrite ReplicaBecomeFollower
ControllerShrinkIsr
State
Transitions
Init
ControllerElectLeader
ReplicaBecomeLeader
Every transition enables a different set
of next actions.
ReplicaBecomeFollower
ControllerElectLeader
LeaderWrite ReplicaBecomeFollower
ControllerShrinkIsr FollowerFetch
LeaderShrinkIsr
State
Transitions
Init
ControllerElectLeader
ReplicaBecomeLeader
ReplicaBecomeFollower
ReplicaBecomeLeader
LeaderWrite
FollowerFetch
State
Transitions
Init
ControllerElectLeader(epoch=0)
ControllerShrinkIsr
ControllerElectLeader(epoch=1)
ReplicaBecomeLeader(epoch=0)
LeaderWrite(epoch=0)
ReplicaBecomeFollower(epoch=1)
ControllerShrinkIsr
● Define the state and how to initialize it
● Define the valid state transitions
● Define expected state invariants
● Run model to check invariants
Model
Checklist
Replication Invariant
StrongIsr == \A r1 \in Replicas:
    \/ ~ ReplicaPresumesLeadership(r1)
    \/ LET hw == replicaState[r1].hw
       IN \A r2 \in quorumState.isr:
           HasMatchingLogsUpTo(r1, r2, hw)
“If any replica is eligible to return data, then that data
must be replicated to all members of the current ISR”
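For readers less familiar with TLA+, the invariant can be paraphrased in Python. This is one reading of the formula, not a translation of the actual specification: `has_matching_logs_up_to` and the dictionary-based replica state are assumptions, and the high watermark of 4 on C is inferred from the statement that r2 and r3 are readable.

```python
from typing import Dict, List, Set


def has_matching_logs_up_to(log1: List[str], log2: List[str], offset: int) -> bool:
    """Both logs reach `offset` and agree on every record below it."""
    return len(log1) >= offset and len(log2) >= offset and log1[:offset] == log2[:offset]


def strong_isr(replicas: Dict[str, dict], isr: Set[str]) -> bool:
    """Any replica that presumes leadership must match every ISR member up to its HW."""
    return all(
        (not state["presumes_leadership"])
        or all(has_matching_logs_up_to(state["log"], replicas[r2]["log"], state["hw"])
               for r2 in isr)
        for state in replicas.values()
    )


c_state = {"log": ["r0", "r1", "r2", "r3", "r4", "r5"], "hw": 4, "presumes_leadership": True}
before = {"A": {"log": ["r0", "r1", "r2", "r3"], "hw": 2, "presumes_leadership": False},
          "C": c_state}
after = {"A": {"log": ["r0", "r1"], "hw": 2, "presumes_leadership": False},
         "C": c_state}

holds_before = strong_isr(before, {"A", "C"})
holds_after = strong_isr(after, {"A", "C"})
```

The invariant holds while A still has r2 and r3, and fails once A truncates to its local high watermark.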
r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 A, C
Leader
(epoch=1)
Leader B had failed and
replica C was being elected
as the new leader.
r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 A, C
Leader
(epoch=1)
Upon becoming a follower
of C, replica A would
truncate its log to the local
high watermark.
r0 r1
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 A, C
Leader
(epoch=1)
r0 r1
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 A, C
Leader
(epoch=1)
This state violates the
StrongIsr property because
leader C is eligible to return
records r2 and r3, though
they are not present on A.
● Define the state and how to initialize it
● Define the valid state transitions
● Define expected state invariants
● Run model to check invariants
Model
Checklist
Edge Case 4
(Premature ISR expansion)
r0 r1
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 B, C
Follower
(epoch=0)
The leader is B and replica
A is trying to catch up to
rejoin the ISR.
r0 r1
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
C 1 B, C
Follower
(epoch=0)
The leader changes to C.
r0 r1
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
C 1 B, C
Leader
(epoch=1)
The leader changes to C.
r0 r1
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 B, C
Leader
(epoch=1)
Follower A catches up and
rejoins the ISR.
r0 r1 r2
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 B, C
Leader
(epoch=1)
Follower A catches up and
rejoins the ISR.
r0 r1 r2
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 A, B, C
Leader
(epoch=1)
Follower A catches up and
rejoins the ISR.
r0 r1 r2
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 A, B, C
Leader
(epoch=1)
This violates StrongIsr
because replica B may
have returned records r3,
r4, and r5 which A does not
yet have.
KAFKA-7128
r0 r1 r2
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 B, C
Leader
(epoch=1)
After becoming leader, C
only knows that the true
high watermark is between
its own high watermark and
the end of the log.
True high
watermark
r0 r1 r2
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 B, C
Leader
(epoch=1)
So we wait until the
follower has reached the
starting offset of this
leader’s own epoch before
allowing it into the ISR.
True high
watermark
r0 r1 r2
r0 r1 r2 r3 r4 r5
r0 r1 r2 r3 r4 r5 r7 r8
A
B
C
Follower
(epoch=1)
Follower
(epoch=1)
Leader Epoch ISR
C 1 B, C
Leader
(epoch=1)
So we wait until the
follower has reached the
starting offset of this
leader’s own epoch before
allowing it into the ISR.
True high
watermark
r0 r1 r2 r3 r4 r5 r7
r0 r1 r2 r3 r4 r5
r0 r1 r2 r3 r4 r5 r7 r8
A
B
C
Follower
(epoch=1)
Follower
(epoch=1)
Leader Epoch ISR
C 1 B, C
Leader
(epoch=1)
So we wait until the
follower has reached the
starting offset of this
leader’s own epoch before
allowing it into the ISR.
True high
watermark
r0 r1 r2 r3 r4 r5 r7
r0 r1 r2 r3 r4 r5
r0 r1 r2 r3 r4 r5 r7 r8
A
B
C
Follower
(epoch=1)
Follower
(epoch=1)
Leader Epoch ISR
C 1 A, B, C
Leader
(epoch=1)
So we wait until the
follower has reached the
starting offset of this
leader’s own epoch before
allowing it into the ISR.
True high
watermark
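The stricter admission rule can be sketched as a single predicate. The function names are hypothetical, and the leader high watermark of 3 used in the example is an assumed value for the moment right after C takes over; the epoch start offset of 6 follows from C holding r0..r5 when it became leader.

```python
def may_join_isr_old(follower_leo: int, leader_hw: int) -> bool:
    """Original rule: catching up to the leader's high watermark is enough."""
    return follower_leo >= leader_hw


def may_join_isr(follower_leo: int, leader_hw: int, leader_epoch_start_offset: int) -> bool:
    """Fixed rule: the follower must also reach the start of the leader's own epoch."""
    return follower_leo >= leader_hw and follower_leo >= leader_epoch_start_offset


# A holds r0 r1 r2 (log end offset 3); C became leader with r0..r5 in its log.
premature = may_join_isr_old(follower_leo=3, leader_hw=3)
blocked = may_join_isr(follower_leo=3, leader_hw=3, leader_epoch_start_offset=6)
# Later A holds r0..r5 and r7 (log end offset 7).
admitted = may_join_isr(follower_leo=7, leader_hw=3, leader_epoch_start_offset=6)
```

Under the old rule A would rejoin with only three records; under the fixed rule it waits until it has everything the new leader inherited.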
KIP-320
r0 r1 r2 r3
r0 r1 r2 r5 r6
r0 r1 r2 r5 r6
A
B
C
Follower
(epoch=0)
Leader Epoch ISR
B 2 B, C
Leader
(epoch=1)
Leader
(epoch=2)
Replica A was a zombie that was still fetching from B.
After a couple of leader elections, replica B became the leader again.
r0 r1 r2 r3
r0 r1 r2 r5 r6
r0 r1 r2 r5 r6
A
B
C
Follower
(epoch=0)
Leader Epoch ISR
B 2 B, C
Leader
(epoch=1)
Leader
(epoch=2)
A -> B:
Fetch(offset=4, epoch=0)
r0 r1 r2 r3
r0 r1 r2 r5 r6
r0 r1 r2 r5 r6
A
B
C
Follower
(epoch=0)
Leader Epoch ISR
B 2 B, C
Leader
(epoch=1)
Leader
(epoch=2)
A -> B:
Fetch(offset=4, epoch=0)
B -> A:
You are fenced!
KIP-320
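The exchange above amounts to an epoch comparison on the leader. The sketch below is a simplified illustration: the two error names follow the codes described in KIP-320, while the enum and the `validate_fetch_epoch` function are hypothetical and leave out everything else a real fetch handler does.

```python
from enum import Enum


class FetchError(Enum):
    NONE = "NONE"
    FENCED_LEADER_EPOCH = "FENCED_LEADER_EPOCH"
    UNKNOWN_LEADER_EPOCH = "UNKNOWN_LEADER_EPOCH"


def validate_fetch_epoch(request_epoch: int, leader_epoch: int) -> FetchError:
    """Compare the epoch carried by a fetch request with the leader's epoch."""
    if request_epoch < leader_epoch:
        # The fetcher is behind: it missed at least one leader election.
        return FetchError.FENCED_LEADER_EPOCH
    if request_epoch > leader_epoch:
        # The fetcher knows of a newer epoch than this replica does.
        return FetchError.UNKNOWN_LEADER_EPOCH
    return FetchError.NONE


# Replica A still believes the epoch is 0; B now leads in epoch 2.
result = validate_fetch_epoch(request_epoch=0, leader_epoch=2)
print(result.value)
```

Without the epoch in the request, B could not tell this stale fetch apart from a valid one at the same offset.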
Model Checking Results
#Replicas  Log Size  Distinct States  Depth  Duration
3          3         84,313,696       40     ~2 hours
3          4         133,768,793      20     ~3 hours
4          4         200,534,415      18     ~6 hours
Conclusion
Summary
● Distributed systems are subtle and we are
poorly equipped to reason about edge cases.
● Model checking is a systematic approach to
finding these edge cases and verifying our
fixes address them.
● All of the replication fixes we know of will be
available in Apache Kafka 2.1.0.
Note of Caution
● The model is not the implementation.
● The implementation will have complexity that the model cannot capture.
Resources
● Kafka TLA+ Specification:
https://github.com/hachikuji/kafka-specification
● TLA+ video tutorial:
https://lamport.azurewebsites.net/video/videos.html
● Kafka Improvement Proposals:
○ KIP-101:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-101+-+Alter+Replication+Protocol+to+use+Leader+Epoch+rather+than+High+Watermark+for+Truncation
○ KIP-279:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-279%3A+Fix+log+divergence+between+leader+and+follower+after+fast+leader+fail+over
○ KIP-320:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-320%3A+Allow+fetchers+to+detect+and+handle+log+truncation
Thank you!
Appendix 1:
Zombie Leaders
r0 r1 r2
r0 r1 r2 r3
r0 r1 r2 r3
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 B, C
Leader
(epoch=1)
B became a zombie while it
was the leader for epoch 0.
r0 r1 r2
r0 r1 r2 r3
r0 r1 r2 r3 r7 r8
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 B, C
Leader
(epoch=1)
The new leader will be
accepting writes.
r0 r1 r2
r0 r1 r2 r3 r9 r10
r0 r1 r2 r3 r7 r8
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 B, C
Leader
(epoch=1)
The old leader may accept
writes as well!
r0 r1 r2
r0 r1 r2 r3 r9 r10
r0 r1 r2 r3 r7 r8
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 B, C
Leader
(epoch=1)
As long as the leader
cannot advance its high
watermark, there is no
semantic violation.
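To see why the zombie leader's extra writes are harmless, it helps to compute its high watermark. The sketch below assumes the usual rule that the high watermark is the smallest log end offset the leader knows of among ISR members. The zombie's cached view (ISR of B and C, with C last seen at offset 4) is a hypothetical reading of the diagram.

```python
from typing import Dict, Set


def high_watermark(known_end_offsets: Dict[str, int], isr: Set[str]) -> int:
    """Smallest log end offset the leader knows of among the ISR members."""
    return min(known_end_offsets[replica] for replica in isr)


# Assumed stale view held by zombie leader B: the ISR is still {B, C},
# and C's last fetch from B was at offset 4, just after r3.
zombie_view = {"B": 4, "C": 4}
hw_before = high_watermark(zombie_view, {"B", "C"})

# B appends r9 and r10 locally. C now leads epoch 1 and never fetches
# from B again, so B's record of C's end offset stays at 4.
zombie_view["B"] = 6
hw_after = high_watermark(zombie_view, {"B", "C"})

print(hw_before, hw_after)
```

The high watermark stays where it was, so r9 and r10 are never acknowledged as committed or exposed to consumers. The only way for B to advance it would be to shrink the ISR, which requires a state update.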
r0 r1 r2
r0 r1 r2 r3 r9 r10
r0 r1 r2 r3 r7 r8
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR Ver
C 1 B, C 1
Leader
(epoch=1)
As long as the leader
cannot advance its high
watermark, there is no
semantic violation.
r0 r1 r2
r0 r1 r2 r3 r9 r10
r0 r1 r2 r3 r7 r8
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR Ver
C 1 B, C 1
Leader
(epoch=1)
The controller sends the
latest version of the leader
and ISR state to replicas in
the LeaderAndIsr request
r0 r1 r2
r0 r1 r2 r3 r9 r10
r0 r1 r2 r3 r7 r8
A
B
C
Leader
(epoch=0,
version=0)
Follower
(epoch=1)
Leader Epoch ISR Ver
C 1 B, C 1
Leader
(epoch=1,
version=1)
The controller sends the
latest version of the leader
and ISR state to replicas in
the LeaderAndIsr request
r0 r1 r2
r0 r1 r2 r3 r9 r10
r0 r1 r2 r3 r7 r8
A
B
C
Leader
(epoch=0,
version=0)
Follower
(epoch=1)
Leader Epoch ISR Ver
C 1 B, C 1
Leader
(epoch=1,
version=1)
This allows for compare-and-swap (CAS) updates, which effectively fence replicas that hold stale state.
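A versioned compare-and-swap can be sketched in a few lines. Everything here is hypothetical: `StateStore` stands in for whatever store holds the leader and ISR state, and the sketch only shows why a writer holding an old version cannot change that state.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class PartitionState:
    leader: str
    epoch: int
    isr: FrozenSet[str]
    version: int


class StateStore:
    """Hypothetical stand-in for the store holding leader and ISR state."""

    def __init__(self, initial: PartitionState) -> None:
        self._state = initial

    def read(self) -> PartitionState:
        return self._state

    def compare_and_set(
        self, expected_version: int, leader: str, epoch: int, isr: FrozenSet[str]
    ) -> bool:
        # Reject the write unless the caller saw the latest version.
        if expected_version != self._state.version:
            return False
        self._state = PartitionState(leader, epoch, isr, expected_version + 1)
        return True


# State from the slide: leader C, epoch 1, ISR {B, C}, version 1.
store = StateStore(PartitionState("C", 1, frozenset({"B", "C"}), 1))

# Zombie leader B still holds version 0 and tries to shrink the ISR to itself.
zombie_ok = store.compare_and_set(0, "B", 0, frozenset({"B"}))

# The current leader C holds version 1, so its update is accepted.
leader_ok = store.compare_and_set(1, "C", 1, frozenset({"C"}))

print(zombie_ok, leader_ok, store.read().version)
```

Because the zombie's shrink attempt is rejected, it cannot remove the other replicas from its ISR and therefore cannot advance its high watermark.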
Appendix 2:
What goes in a TLA+ Model?
TLA+ Overview
\* Define the model’s state
VARIABLES var1, var2, …
\* Specify how the state is initialized
Init ==
  /\ var1 = 1
  /\ …
\* Specify the valid state transitions
Action1 ==
  /\ var1 \leq 10
  /\ var1' = var1 + 1
…
\* Specify the set of valid state transitions
Next ==
  \/ Action1
  \/ Action2
  \/ …
\* The specification is the conjunction of the initial state and all the states reachable by repeatedly applying the `Next` state transition
Spec == Init /\ [][Next]_<<var1, var2, …>>
\* Define the model invariants that should hold after every state transition
Invariant ==
  /\ var1 \geq 1
  /\ …
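What a model checker does with such a specification can be approximated by a breadth-first search over reachable states. The sketch below is a toy illustration in Python, not how TLC is implemented. It models only `var1` and `Action1` from the example above, and every function name is hypothetical.

```python
from collections import deque
from typing import Iterable, Optional, Set, Tuple

State = int  # the toy model tracks a single variable, var1


def init_states() -> Iterable[State]:
    yield 1  # Init: var1 = 1


def next_states(var1: State) -> Iterable[State]:
    # Action1: enabled while var1 <= 10, then var1' = var1 + 1
    if var1 <= 10:
        yield var1 + 1


def invariant(var1: State) -> bool:
    return var1 >= 1


def check() -> Tuple[Set[State], Optional[State]]:
    """Explore every reachable state; return them and the first violation, if any."""
    seen: Set[State] = set()
    queue = deque()
    for state in init_states():
        seen.add(state)
        queue.append(state)
    while queue:
        state = queue.popleft()
        if not invariant(state):
            return seen, state
        for successor in next_states(state):
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return seen, None


reachable, violation = check()
print(len(reachable), violation)
```

The search visits the eleven states from 1 to 11 and finds no violation. The same exhaustive exploration applied to the replication model is what produces the state counts in the results table.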
Appendix 3:
Buggy Replication Optimizations
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
 
Drupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy ConsumptionDrupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy Consumption
Exove
 
Cybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure ADCybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure AD
VICTOR MAESTRE RAMIREZ
 
TrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business ConsultingTrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business Consulting
Trs Labs
 
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptxSpecial Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
shyamraj55
 
Cyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of securityCyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of security
riccardosl1
 
Ad

Hardening Kafka Replication

  • 1. Hardening Kafka Replication Jason Gustafson, Engineer@Confluent
  • 2. ● At the heart of Kafka is the log ● Log replication provides high availability ● Kafka has a solid replication protocol ● 99.999% of the time it does the right thing ● This talk is about the remaining 0.001% Overview
  • 4. View of a single partition Key Value Offset
  • 5. View of a single partition Message Appends Key Value Offset
  • 6. Key Value Offset View of a single partition Message Appends k0 v0 0
  • 7. View of a single partition Message Appends k0 k1 k2 v0 v1 v2 0 1 2 Key Value Offset
  • 8. View of a single partition k0 k1 k2 v0 v1 v2 0 1 2 Key Value Offset
  • 9. View of a single partition k0 k1 k2 v0 v1 v2 0 1 2 Key Value Offset k0 k1 k2 v0 v1 v2
  • 10. View of a single partition Key Value k0 k1 k2 v0 v1 v2 Offset 0
  • 11. View of a single partition Key Value k0 k1 k2 v0 v1 v2
  • 12. r0 r1 r2 View of a single partition Record
  • 13. r0 r1 r2 View of a single partition Record
  • 14. r0 r1 r2 View of a single partition with 3 replicas
  • 15. r0 r1 r2 A B C View of a single partition with 3 replicas
  • 16. r0 r1 r2 A B C The protocol’s goal is to replicate the logs exactly to all replicas
  • 17. r0 r1 r2 r0 r1 r2 r0 r1 r2 A B C The protocol’s goal is to replicate the logs exactly to all replicas
  • 19. A B C
  • 20. Leader A B C For each partition, one replica is elected as the leader
  • 21. Leader Follower Follower A B C Replicas that are not leaders are called followers
  • 23. r0 r1 r2 Leader Follower Follower A B C Leaders accept writes from producers.
  • 25. r0 r1 r0 r1 r2 A B C Leader Follower Follower Followers fetch from the leader.
  • 26. r0 r1 r0 r1 r2 r0 A B C Leader Follower Follower Followers fetch from the leader.
  • 27. r0 r1 r0 r1 r2 r0 A B C Leader Follower Follower Leader election is handled by a separate component known as the controller
  • 28. r0 r1 r0 r1 r2 r0 A B C Leader Follower Follower Leader Epoch ISR B 0 A, B, C In order to enable election by the controller, we maintain state in ZooKeeper about the in-sync replicas (ISR).
  • 29. r0 r1 r0 r1 r2 r0 A B C Leader Follower Follower When there is a state change (e.g. a new leader), the controller sends the updated state to all the replicas. Leader Epoch ISR B 0 A, B, C
  • 30. r0 r1 r0 r1 r2 r0 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 A, B, C Follower (epoch=0) When there is a state change (e.g. a new leader), the controller sends the updated state to all the replicas.
  • 31. r0 r1 r0 r1 r2 r0 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 A, B, C Follower (epoch=0)
  • 32. r0 r1 r0 r1 r2 r0 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 A, B, C Follower (epoch=0) High Watermark The high watermark is the largest offset known to be replicated to all members of the ISR.
  • 33. r0 r1 r0 r1 r2 r0 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 A, B, C Follower (epoch=0) The high watermark is the largest offset known to be replicated to all members of the ISR.
  • 34. r0 r1 r0 r1 r2 r0 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 A, B, C Follower (epoch=0) Records below the high watermark are considered “committed” and are visible to consumers. Committed
  • 35. r0 r1 r0 r1 r2 r0 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 A, B, C Follower (epoch=0) Records above the high watermark are considered uncommitted. Committed Uncommitted
  • 36. r0 r1 r0 r1 r2 r0 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 A, B, C Follower (epoch=0) As records are replicated, the high watermark moves forward.
  • 37. r0 r1 r0 r1 r2 r0 r1 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 A, B, C Follower (epoch=0) As records are replicated, the high watermark moves forward.
  • 38. r0 r1 r0 r1 r2 r0 r1 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 A, B, C Follower (epoch=0) As records are replicated, the high watermark moves forward.
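The high watermark rule on these slides can be sketched in a few lines. This is an illustrative Python model, not Kafka's implementation: `high_watermark` and the `log_end_offsets` dictionary are hypothetical names, and the high watermark is assumed to be an exclusive offset (the first offset not yet present on every ISR member).

```python
def high_watermark(log_end_offsets, isr):
    """Return the smallest log end offset among ISR members.

    Every offset below the returned value exists on all ISR members,
    so records at those offsets count as committed.
    """
    return min(log_end_offsets[replica] for replica in isr)


# State from the slides: A holds r0 r1, B (leader) holds r0 r1 r2, C holds r0.
log_end_offsets = {"A": 2, "B": 3, "C": 1}
isr = {"A", "B", "C"}
hw_before = high_watermark(log_end_offsets, isr)

# C fetches r1, so the high watermark moves forward.
log_end_offsets["C"] = 2
hw_after = high_watermark(log_end_offsets, isr)
```

Under these assumptions only r0 is committed before C's fetch, and r0 and r1 are committed after it.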
  • 39. r0 r1 r0 r1 r2 r0 r1 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 A, B, C Follower (epoch=0) If a replica falls behind, it can be removed from the ISR by the leader.
  • 40. r0 r1 r0 r1 r2 r3 r4 r5 r6 r0 r1 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 A, B, C Follower (epoch=0) If a replica falls behind, it can be removed from the ISR by the leader.
  • 41. r0 r1 r2 r3 r4 r0 r1 r2 r3 r4 r5 r6 r0 r1 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 A, B, C Follower (epoch=0) If a replica falls behind, it can be removed from the ISR by the leader.
  • 42. r0 r1 r2 r3 r4 r0 r1 r2 r3 r4 r5 r6 r0 r1 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 A, B Follower (epoch=0) If a replica falls behind, it can be removed from the ISR by the leader.
  • 43. r0 r1 r2 r3 r4 r0 r1 r2 r3 r4 r5 r6 r0 r1 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 A, B Follower (epoch=0) If a replica falls behind, it can be removed from the ISR by the leader.
  • 44. r0 r1 r2 r3 r4 r0 r1 r2 r3 r4 r5 r6 r0 r1 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 A, B Follower (epoch=0) If a replica falls behind, it can be removed from the ISR by the leader.
  • 45. r0 r1 r2 r3 r4 r0 r1 r2 r3 r4 r5 r6 r0 r1 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 A, B Follower (epoch=0) An out-of-sync replica that catches up to the high watermark is added back to the ISR.
  • 46. r0 r1 r2 r3 r4 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 A, B Follower (epoch=0) An out-of-sync replica that catches up to the high watermark is added back to the ISR.
  • 47. r0 r1 r2 r3 r4 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 A, B, C Follower (epoch=0) An out-of-sync replica that catches up to the high watermark is added back to the ISR.
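A rough sketch of the shrink and expand rules, assuming a time-based lag threshold. The function names, the `lag_ms` map, and the threshold value are all hypothetical; the slides only say that a replica which "falls behind" is removed and that one which catches up to the high watermark is added back.

```python
def shrink_isr(isr, leader, lag_ms, max_lag_ms):
    """Keep the leader plus every follower whose lag is within max_lag_ms."""
    return {r for r in isr if r == leader or lag_ms[r] <= max_lag_ms}


def expand_isr(isr, replica, log_end_offset, high_watermark):
    """Add the replica back once its log end offset reaches the high watermark."""
    if log_end_offset >= high_watermark:
        return set(isr) | {replica}
    return set(isr)


# Hypothetical lag figures: C has fallen far behind leader B.
isr = shrink_isr({"A", "B", "C"}, leader="B",
                 lag_ms={"A": 200, "B": 0, "C": 45_000}, max_lag_ms=30_000)

# With ISR {A, B}, the high watermark is A's end offset (r0..r4, so 5).
still_out = expand_isr(isr, "C", log_end_offset=2, high_watermark=5)
back_in = expand_isr(isr, "C", log_end_offset=6, high_watermark=5)
```

With C holding only r0 r1 it stays out of the ISR; once it holds r0 through r5 it qualifies again, as on slide 47.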
  • 48. r0 r1 r2 r3 r4 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 A, B, C Follower (epoch=0)
  • 49. r0 r1 r2 r3 r4 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 A, B, C Follower (epoch=0) Only replicas in the ISR are eligible to become leader
  • 50. r0 r1 r2 r3 r4 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 A, B, C Follower (epoch=0) When a leader fails, the controller will take it out of the ISR and elect a new leader from the remaining ISR.
  • 51. r0 r1 r2 r3 r4 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR A 1 A, C Follower (epoch=0) When a leader fails, the controller will take it out of the ISR and elect a new leader from the remaining ISR.
  • 52. r0 r1 r2 r3 r4 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Leader (epoch=1) Leader Epoch ISR A 1 A, C Follower (epoch=0) The new leader/ISR state is propagated to the remaining replicas
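The controller's reaction to a leader failure might look roughly like the following. This is a minimal sketch: the `quorum` dictionary layout and `elect_leader` are hypothetical, and picking the alphabetically first survivor is an assumption made only to keep the example deterministic.

```python
def elect_leader(quorum, failed):
    """Remove the failed leader from the ISR, pick a new leader, bump the epoch."""
    remaining = quorum["isr"] - {failed}
    if not remaining:
        raise RuntimeError("no in-sync replica is available to lead")
    return {
        "leader": min(remaining),  # assumption: any deterministic choice will do
        "epoch": quorum["epoch"] + 1,
        "isr": remaining,
    }


# State from the slides: B leads at epoch 0 with a full ISR, then fails.
quorum = {"leader": "B", "epoch": 0, "isr": {"A", "B", "C"}}
new_quorum = elect_leader(quorum, failed="B")
```

The result matches the table on slide 51: leader A, epoch 1, ISR {A, C}.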
  • 53. r0 r1 r2 r3 r4 r7 r8 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Leader (epoch=1) Leader Epoch ISR A 1 A, C Follower (epoch=0) The leader can begin accepting writes immediately.
  • 54. r0 r1 r2 r3 r4 r7 r8 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Leader (epoch=1) Leader Epoch ISR A 1 A, C Follower (epoch=1) Upon becoming a follower, the replica may have uncommitted data which needs to be truncated.
  • 55. r0 r1 r2 r3 r4 r7 r8 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 A B C Leader (epoch=0) Leader (epoch=1) Leader Epoch ISR A 1 A, C Follower (epoch=1) Upon becoming a follower, the replica may have uncommitted data which needs to be truncated.
  • 56. r0 r1 r2 r3 r4 r7 r8 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r7 A B C Leader (epoch=0) Leader (epoch=1) Leader Epoch ISR A 1 A, C Follower (epoch=1) Upon becoming a follower, the replica may have uncommitted data which needs to be truncated.
  • 59. A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 A, B, C Follower (epoch=0) High Watermark
  • 60. A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 A, B, C Follower (epoch=0) High Watermark Every replica tracks the high watermark separately
  • 61. A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 A, B, C Follower (epoch=0) Every replica tracks the high watermark separately
  • 62. r0 r1 r2 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 A, B, C Follower (epoch=0) Every replica tracks the high watermark separately
  • 63. r0 r1 r0 r1 r2 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 A, B, C Follower (epoch=0) Every replica tracks the high watermark separately
  • 64. r0 r1 r0 r1 r2 r0 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 A, B, C Follower (epoch=0) Every replica tracks the high watermark separately
  • 65. r0 r1 r0 r1 r2 r0 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 A, B, C Follower (epoch=0) The leader advances its high watermark based on the fetch offsets of replicas
  • 66. r0 r1 r0 r1 r2 r0 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 A, B, C Follower (epoch=0) The leader advances its high watermark based on the fetch offsets of replicas
  • 67. r0 r1 r0 r1 r2 r0 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 A, B, C Follower (epoch=0) The leader piggybacks its high watermark onto fetch responses
  • 68. r0 r1 r0 r1 r2 r0 r1 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 A, B, C Follower (epoch=0) The leader piggybacks its high watermark onto fetch responses
  • 69. r0 r1 r0 r1 r2 r0 r1 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 A, B, C Follower (epoch=0) At any point in time, the follower high watermarks may be a little behind the leader’s.
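To see why a follower's high watermark can trail the leader's, here is a small simulation. It is a sketch under simplifying assumptions: `Leader` and `Follower` are hypothetical classes, every follower is in the ISR, and a fetch at offset n is taken as proof that the follower already holds all offsets below n.

```python
class Leader:
    def __init__(self, log, followers):
        self.log = list(log)
        self.high_watermark = 0
        # Last offset each follower asked for.
        self.fetch_offsets = {f: 0 for f in followers}

    def handle_fetch(self, follower, fetch_offset):
        """Record the fetch offset, advance the HW, and piggyback it on the reply."""
        self.fetch_offsets[follower] = fetch_offset
        self.high_watermark = min([len(self.log), *self.fetch_offsets.values()])
        return self.log[fetch_offset:], self.high_watermark


class Follower:
    def __init__(self, name):
        self.name = name
        self.log = []
        self.high_watermark = 0

    def fetch_from(self, leader):
        records, leader_hw = leader.handle_fetch(self.name, len(self.log))
        self.log.extend(records)
        # A follower only learns the HW from the most recent fetch response.
        self.high_watermark = min(leader_hw, len(self.log))


leader = Leader(["r0", "r1", "r2"], followers=["A", "C"])
a, c = Follower("A"), Follower("C")
a.fetch_from(leader)  # A copies r0..r2; the leader has no proof yet
c.fetch_from(leader)  # C copies r0..r2
a.fetch_from(leader)  # A's fetch at offset 3 confirms A, but C is unconfirmed
c.fetch_from(leader)  # C's fetch at offset 3 lets the leader commit everything
```

At this point the leader and C both see a high watermark of 3, while A still believes it is 0 until its next fetch. Edge Case 1 exploits exactly this gap.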
  • 70. Edge Case 1: Fast leader elections
  • 71. r0 r1 r2 r3 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 A, B, C Follower (epoch=0)
  • 72. r0 r1 r2 r3 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 A, B, C Follower (epoch=0) Replica B fails.
  • 73. r0 r1 r2 r3 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR C 1 A, C Follower (epoch=0) Replica B is removed from the ISR and C is elected as the new leader.
  • 74. r0 r1 r2 r3 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR C 1 A, C Leader (epoch=1) Replica B is removed from the ISR and C is elected as the new leader.
  • 75. r0 r1 r2 r3 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=1) Leader Epoch ISR C 1 A, C Leader (epoch=1) Replica A finds the new leader and truncates its log to the local high watermark
  • 76. r0 r1 r2 r3 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=1) Leader Epoch ISR C 1 A, C Leader (epoch=1) Replica A finds the new leader and truncates its log to the local high watermark
  • 77. r0 r1 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=1) Leader Epoch ISR C 1 A, C Leader (epoch=1) Replica A finds the new leader and truncates its log to the local high watermark
  • 78. r0 r1 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=1) Leader Epoch ISR C 1 A, C Leader (epoch=1) Before replica A begins fetching, the new leader fails.
  • 79. r0 r1 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=1) Leader Epoch ISR C 1 A, C Leader (epoch=1) Before replica A begins fetching, the new leader fails.
  • 80. r0 r1 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=1) Leader Epoch ISR A 2 A Leader (epoch=1) Before replica A begins fetching, the new leader fails.
  • 81. r0 r1 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Leader (epoch=2) Leader Epoch ISR A 2 A Leader (epoch=1) Before replica A begins fetching, the new leader fails.
  • 82. r0 r1 r7 r8 r9 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Leader (epoch=2) Leader Epoch ISR A 2 A Leader (epoch=1) Leader A then begins accepting writes.
  • 83. r0 r1 r7 r8 r9 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Leader (epoch=2) Leader Epoch ISR A 2 A Leader (epoch=1) But r2 and r3 had already been committed to the ISR!
  • 84. r0 r1 r7 r8 r9 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Leader (epoch=2) Leader Epoch ISR A 2 A Leader (epoch=1) Suppose that B eventually gets restarted.
  • 85. r0 r1 r7 r8 r9 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Leader (epoch=2) Leader Epoch ISR A 2 A Leader (epoch=1) Suppose that B eventually gets restarted.
  • 86. r0 r1 r7 r8 r9 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Follower (epoch=2) Leader (epoch=2) Leader Epoch ISR A 2 A Leader (epoch=1) Suppose that B eventually gets restarted.
  • 87. r0 r1 r7 r8 r9 r0 r1 r2 r3 r0 r1 r2 r3 r4 r5 A B C Follower (epoch=2) Leader (epoch=2) Leader Epoch ISR A 2 A Leader (epoch=1) Suppose that B eventually gets restarted.
  • 88. r0 r1 r7 r8 r9 r0 r1 r2 r3 r9 r0 r1 r2 r3 r4 r5 A B C Follower (epoch=2) Leader (epoch=2) Leader Epoch ISR A 2 A Leader (epoch=1) The logs have now diverged.
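The divergence can be replayed with plain lists, where a record's list index is its offset. The high watermark values below (2 for A, 4 for B) are assumptions read off the diagrams, and `truncate_to` is a hypothetical helper.

```python
def truncate_to(log, offset):
    """Drop every record at or above the given offset."""
    return log[:offset]


a_log = ["r0", "r1", "r2", "r3"]
b_log = ["r0", "r1", "r2", "r3", "r4", "r5", "r6"]
a_hw, b_hw = 2, 4  # assumed local high watermarks; A's lags behind

# A becomes a follower of C and truncates to its local high watermark.
a_log = truncate_to(a_log, a_hw)

# C fails before A fetches anything; A is elected and accepts new writes.
a_log += ["r7", "r8", "r9"]

# B restarts as a follower of A, truncates to its own high watermark,
# then fetches from A starting at its log end offset.
b_log = truncate_to(b_log, b_hw)
b_log += a_log[len(b_log):]

diverged = [o for o in range(min(len(a_log), len(b_log))) if a_log[o] != b_log[o]]
```

Both logs end with r9 at offset 4, yet they disagree at offsets 2 and 3, which is the state shown on slide 88.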
  • 90. r0 r1 r2 r3 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=1) Leader Epoch ISR C 1 A, C Leader (epoch=1) Replica B has failed and replica A needs to truncate its log.
  • 91. r0 r1 r2 r3 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=1) Leader Epoch ISR C 1 A, C Leader (epoch=1) A -> C: What is the end offset for epoch=0?
  • 92. r0 r1 r2 r3 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=1) Leader Epoch ISR C 1 A, C Leader (epoch=1) A -> C: What is the end offset for epoch=0? C -> A: The end offset is 6 Offset 6
  • 93. r0 r1 r2 r3 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=1) Leader Epoch ISR C 1 A, C Leader (epoch=1) A -> C: What is the end offset for epoch=0? C -> A: The end offset is 6 A: Cool, no truncation needed!
  • 94. r0 r1 r2 r3 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=1) Leader Epoch ISR C 1 A, C Leader (epoch=1)
  • 95. r0 r1 r2 r3 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=1) Leader Epoch ISR C 1 A, C Leader (epoch=1)
  • 96. r0 r1 r2 r3 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=1) Leader Epoch ISR A 2 A Leader (epoch=1)
  • 97. r0 r1 r2 r3 r7 r8 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Leader (epoch=2) Leader Epoch ISR A 2 A Leader (epoch=1)
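The epoch-based exchange on slides 91 to 93 can be approximated as follows. This is a sketch, not the broker's code: `end_offset_for_epoch`, `truncation_offset`, and the `(epoch, start_offset)` cache layout are hypothetical, and the end offset of an epoch is assumed to be the start offset of the next larger epoch, or the log end offset when no later epoch exists.

```python
def end_offset_for_epoch(epoch_starts, log_end_offset, epoch):
    """Return the end offset of `epoch` from a list of (epoch, start_offset) pairs."""
    later = [start for e, start in epoch_starts if e > epoch]
    return min(later) if later else log_end_offset


def truncation_offset(follower_end_offset, leader_end_offset):
    """A follower keeps only what the leader also has for that epoch."""
    return min(follower_end_offset, leader_end_offset)


# C holds r0..r5, all written in epoch 0, so its cache has a single entry.
c_epoch_starts = [(0, 0)]
c_answer = end_offset_for_epoch(c_epoch_starts, log_end_offset=6, epoch=0)

# A holds r0..r3 (end offset 4) and compares that with C's answer.
a_truncate_to = truncation_offset(follower_end_offset=4, leader_end_offset=c_answer)
```

Because the truncation offset equals A's own end offset, A keeps r2 and r3 instead of discarding them as it did in Edge Case 1.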
  • 98. Edge Case 2: Fast leader elections redux
  • 99. r0 r1 r2 r0 r1 r2 r3 r4 r0 r1 r2 r3 r4 A B C Leader (epoch=0) Leader (epoch=1) Leader Epoch ISR A 1 A, C Follower (epoch=0) Replica B has failed and replica A has been elected as the new leader
  • 100. r0 r1 r2 r7 r8 r0 r1 r2 r3 r4 r0 r1 r2 r3 r4 A B C Leader (epoch=0) Leader (epoch=1) Leader Epoch ISR A 1 A, C Follower (epoch=0) Replica B has failed and replica A has been elected as the new leader
  • 101. r0 r1 r2 r7 r8 r0 r1 r2 r3 r4 r0 r1 r2 r3 r4 A B C Leader (epoch=0) Leader (epoch=1) Leader Epoch ISR A 1 A, C Follower (epoch=0) Replica B has failed and replica A has been elected as the new leader epoch=0 offset=0 epoch=1 offset=3
  • 102. r0 r1 r2 r7 r8 r0 r1 r2 r3 r4 r0 r1 r2 r3 r4 A B C Leader (epoch=0) Leader (epoch=1) Leader Epoch ISR A 1 A, C Follower (epoch=0) epoch=0 offset=0 epoch=1 offset=3
  • 103. r0 r1 r2 r7 r8 r0 r1 r2 r3 r4 r0 r1 r2 r3 r4 A B C Leader (epoch=0) Leader (epoch=1) Leader Epoch ISR A 1 A, C Follower (epoch=0) Before replica C can truncate its log, it becomes the new leader. epoch=0 offset=0 epoch=1 offset=3
  • 104. r0 r1 r2 r7 r8 r0 r1 r2 r3 r4 r0 r1 r2 r3 r4 A B C Leader (epoch=0) Leader (epoch=1) Leader Epoch ISR C 2 A, C Follower (epoch=0) Before replica C can truncate its log, it becomes the new leader. epoch=0 offset=0 epoch=1 offset=3
  • 105. r0 r1 r2 r7 r8 r0 r1 r2 r3 r4 r0 r1 r2 r3 r4 A B C Leader (epoch=0) Leader (epoch=1) Leader Epoch ISR C 2 A, C Leader (epoch=2) Before replica C can truncate its log, it becomes the new leader. epoch=0 offset=0 epoch=1 offset=3
  • 106. r0 r1 r2 r7 r8 r0 r1 r2 r3 r4 r0 r1 r2 r3 r4 r9 A B C Leader (epoch=0) Leader (epoch=1) Leader Epoch ISR C 2 A, C Leader (epoch=2) Before replica C can truncate its log, it becomes the new leader. epoch=0 offset=0 epoch=1 offset=3 epoch=2 offset=5
  • 107. r0 r1 r2 r7 r8 r0 r1 r2 r3 r4 r0 r1 r2 r3 r4 r9 A B C Leader (epoch=0) Follower (epoch=2) Leader Epoch ISR C 2 A, C Leader (epoch=2) epoch=0 offset=0 epoch=1 offset=3 epoch=2 offset=5 A -> C: What is the end offset for epoch=1?
  • 108. r0 r1 r2 r7 r8 r0 r1 r2 r3 r4 r0 r1 r2 r3 r4 r9 A B C Leader (epoch=0) Follower (epoch=2) Leader Epoch ISR C 2 A, C Leader (epoch=2) epoch=0 offset=0 epoch=1 offset=3 epoch=2 offset=5 A -> C: What is the end offset for epoch=1? C -> A: The end offset is 5
  • 109. r0 r1 r2 r7 r8 r0 r1 r2 r3 r4 r0 r1 r2 r3 r4 r9 A B C Leader (epoch=0) Follower (epoch=2) Leader Epoch ISR C 2 A, C Leader (epoch=2) epoch=0 offset=0 epoch=1 offset=3 epoch=2 offset=5 A -> C: What is the end offset for epoch=1? C -> A: The end offset is 5 A: Cool, no truncation needed!
  • 110. r0 r1 r2 r7 r8 r0 r1 r2 r3 r4 r0 r1 r2 r3 r4 r9 A B C Leader (epoch=0) Follower (epoch=2) Leader Epoch ISR C 2 A, C Leader (epoch=2)
  • 111. r0 r1 r2 r7 r8 r9 r0 r1 r2 r3 r4 r0 r1 r2 r3 r4 r9 A B C Leader (epoch=0) Follower (epoch=2) Leader Epoch ISR C 2 A, C Leader (epoch=2)
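The transcript does not spell out a fix at this point. One plausible remedy, sketched here purely as an assumption, is for the leader to answer with the epoch it actually matched as well as the end offset, so the follower can tell that its requested epoch is unknown to the leader and fall back to its own cache. The `lookup` helper and both epoch caches are hypothetical reconstructions from the diagram.

```python
def lookup(epoch_starts, log_end_offset, requested_epoch):
    """Return (epoch, end_offset) for the largest cached epoch <= requested_epoch."""
    known = [e for e, _ in epoch_starts if e <= requested_epoch]
    if not known:
        return None
    epoch = max(known)
    later = [start for e, start in epoch_starts if e > epoch]
    return epoch, (min(later) if later else log_end_offset)


a_log = ["r0", "r1", "r2", "r7", "r8"]
a_epochs = [(0, 0), (1, 3)]  # A wrote r7 and r8 as leader in epoch 1
c_log = ["r0", "r1", "r2", "r3", "r4", "r9"]
c_epochs = [(0, 0), (2, 5)]  # C never saw epoch 1

# A asks C about epoch 1; C can only answer for epoch 0.
leader_epoch, leader_end = lookup(c_epochs, len(c_log), 1)

# The answer is for an older epoch, so A checks where that epoch ends locally.
_, local_end = lookup(a_epochs, len(a_log), leader_epoch)

a_log = a_log[:min(local_end, leader_end)]
a_log += c_log[len(a_log):]  # resume fetching from the new end offset
```

In this sketch A truncates to offset 3, drops r7 and r8, and ends up with the same log as C.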
  • 112. Edge Case 3: Zombie follower
  • 113. r0 r1 r2 r3 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR A 0 A, B, C Follower (epoch=0)
  • 114. r0 r1 r2 r3 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR A 0 A, B, C Follower (epoch=0) Follower A fails and is removed from the ISR.
  • 115. r0 r1 r2 r3 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR A 0 B, C Follower (epoch=0) Follower A fails and is removed from the ISR.
  • 116. r0 r1 r2 r3 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR A 0 B, C Follower (epoch=0)
  • 117. r0 r1 r2 r3 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR A 0 B, C Follower (epoch=0) Replica A could not re-register in order to get the latest leader/ISR state and continued fetching from the current leader.
  • 118. r0 r1 r2 r3 r4 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR A 0 B, C Follower (epoch=0) Replica A could not re-register in order to get the latest leader/ISR state and continued fetching from the current leader.
  • 119. r0 r1 r2 r3 r4 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 A B C Follower (epoch=0) Leader Epoch ISR A 0 B, C Follower (epoch=0) Leader (epoch=0)
  • 120. r0 r1 r2 r3 r4 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 A B C Follower (epoch=0) Leader Epoch ISR C 1 C Follower (epoch=0) Leader (epoch=0)
  • 121. r0 r1 r2 r3 r4 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 A B C Follower (epoch=0) Leader Epoch ISR C 1 C Leader (epoch=1) Leader (epoch=0)
  • 122. r0 r1 r2 r3 r4 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 A B C Follower (epoch=0) Leader Epoch ISR C 1 C Leader (epoch=1) Leader (epoch=0)
  • 123. r0 r1 r2 r3 r4 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r7 r8 r9 A B C Follower (epoch=0) Leader Epoch ISR C 1 C Leader (epoch=1) Leader (epoch=0)
  • 124. r0 r1 r2 r3 r4 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r7 r8 r9 A B C Follower (epoch=0) Leader Epoch ISR C 1 C Leader (epoch=1) Leader (epoch=0) Meanwhile, replica A still thought B was the leader and was still trying to make progress
  • 125. r0 r1 r2 r3 r4 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r7 r8 r9 A B C Follower (epoch=0) Leader Epoch ISR C 1 C Leader (epoch=1) Leader (epoch=0)
  • 126. r0 r1 r2 r3 r4 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r7 r8 r9 A B C Follower (epoch=0) Leader Epoch ISR C 1 C Leader (epoch=1) Follower (epoch=1)
  • 127. r0 r1 r2 r3 r4 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r7 r8 r9 A B C Follower (epoch=0) Leader Epoch ISR C 1 C Leader (epoch=1) Follower (epoch=1)
  • 128. r0 r1 r2 r3 r4 r0 r1 r2 r0 r1 r2 r7 r8 r9 A B C Follower (epoch=0) Leader Epoch ISR C 1 C Leader (epoch=1) Follower (epoch=1)
  • 129. r0 r1 r2 r3 r4 r0 r1 r2 r0 r1 r2 r7 r8 r9 A B C Follower (epoch=0) Leader Epoch ISR C 1 C Leader (epoch=1) Follower (epoch=1)
  • 130. r0 r1 r2 r3 r4 r0 r1 r2 r7 r8 r9 r0 r1 r2 r7 r8 r9 A B C Follower (epoch=0) Leader Epoch ISR C 1 C Leader (epoch=1) Follower (epoch=1)
  • 131. r0 r1 r2 r3 r4 r0 r1 r2 r7 r8 r9 r0 r1 r2 r7 r8 r9 A B C Follower (epoch=0) Leader Epoch ISR C 1 B, C Leader (epoch=1) Follower (epoch=1)
  • 132. r0 r1 r2 r3 r4 r0 r1 r2 r7 r8 r9 r0 r1 r2 r7 r8 r9 A B C Follower (epoch=0) Leader Epoch ISR B 2 B, C Leader (epoch=1) Follower (epoch=1) Once back in the ISR, the controller elected it as leader
  • 133. r0 r1 r2 r3 r4 r0 r1 r2 r7 r8 r9 r0 r1 r2 r7 r8 r9 A B C Follower (epoch=0) Leader Epoch ISR B 2 B, C Leader (epoch=1) Leader (epoch=2) Once back in the ISR, the controller elected it as leader
  • 134. r0 r1 r2 r3 r4 r0 r1 r2 r7 r8 r9 r0 r1 r2 r7 r8 r9 A B C Follower (epoch=0) Leader Epoch ISR B 2 B, C Leader (epoch=1) Leader (epoch=2) Suddenly, replica A was able to make progress again!
  • 135. r0 r1 r2 r3 r4 r9 r0 r1 r2 r7 r8 r9 r0 r1 r2 r7 r8 r9 A B C Follower (epoch=0) Leader Epoch ISR B 2 B, C Leader (epoch=1) Leader (epoch=2) Suddenly, replica A was able to make progress again!
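One way to keep a zombie follower from making progress, offered as an assumption rather than as the fix presented in the talk, is to have the follower send its current epoch with each fetch and let the leader reject mismatches. `FencedLeaderEpoch` and `handle_fetch` are hypothetical names.

```python
class FencedLeaderEpoch(Exception):
    """Raised when a fetch carries an epoch other than the leader's current one."""


def handle_fetch(leader_epoch, log, request_epoch, fetch_offset):
    """Serve records only to followers that agree on the current epoch."""
    if request_epoch != leader_epoch:
        raise FencedLeaderEpoch(
            f"request epoch {request_epoch}, leader epoch {leader_epoch}")
    return log[fetch_offset:]


b_log = ["r0", "r1", "r2", "r7", "r8", "r9"]  # B leads again at epoch 2

# Without a check, zombie A (five records, still at epoch 0) is served r9.
unfenced = b_log[5:]

try:
    handle_fetch(leader_epoch=2, log=b_log, request_epoch=0, fetch_offset=5)
    fenced = False
except FencedLeaderEpoch:
    fenced = True
```

The rejection would force A to refresh its leader and ISR state, and to reconcile its log, before fetching again.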
  • 136. Reflection ● Our mushy brains are not equipped to think about edge cases in distributed systems ● How do we know that our fixes are not just trading one edge case for another? ● How do we know there are not more edge cases?
  • 138. TLA+/TLC ● TLA+ is a specification language created by Leslie Lamport ● TLC is a model checker ● Think “brute force proof by mathematical induction”
  • 139. TLA+/TLC Using LaTeX syntax makes model checking just as much fun as writing research papers! ● TLA+ is a specification language created by Leslie Lamport ● TLC is a model checker ● Think “brute force proof by mathematical induction”
  • 141. ● Define the state and how to initialize it ● Define the valid state transitions ● Define expected state invariants ● Run model to check invariants Model Checklist
  • 142. ● Define the state and how to initialize it ● Define the valid state transitions ● Define expected state invariants ● Run model to check invariants Model Checklist
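The four checklist items map onto a very small explicit-state checker. The sketch below is a toy in Python, not TLC: `check` performs a breadth-first search over reachable states, and the two-counter model of (leader end offset, follower end offset) is a hypothetical stand-in for the real specification.

```python
from collections import deque


def check(init_states, next_states, invariant):
    """Return a trace to the first state that breaks the invariant, else None."""
    frontier = deque((s, (s,)) for s in init_states)
    seen = set(init_states)
    while frontier:
        state, trace = frontier.popleft()
        if not invariant(state):
            return trace
        for succ in next_states(state):
            if succ not in seen:
                seen.add(succ)
                frontier.append((succ, trace + (succ,)))
    return None


MAX_OFFSET = 3  # bound the state space so the search terminates


def next_states(state):
    leader_end, follower_end = state
    if leader_end < MAX_OFFSET:
        yield (leader_end + 1, follower_end)  # leader write
    if follower_end < leader_end:
        yield (leader_end, follower_end + 1)  # follower fetch


def follower_never_ahead(state):
    return state[1] <= state[0]


def lag_at_most_one(state):
    return state[0] - state[1] <= 1


holds = check([(0, 0)], next_states, follower_never_ahead)
counterexample = check([(0, 0)], next_states, lag_at_most_one)
```

The first invariant holds in every reachable state, so `check` returns `None`. The second fails, and the returned trace is a shortest sequence of steps leading to the violation, which is the kind of output that makes a model checker useful for finding edge cases.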
  • 143. r0 r1 r2 r3 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 A, B, C Follower (epoch=0)
  • 144. r0 r1 r2 r3 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 A, B, C Follower (epoch=0) 1. Records and the log
  • 145. Log Representation LogRecords == [ id: Nat, epoch: Nat ]
  • 146. Log Representation LogRecords == [ id: Nat, epoch: Nat ] Log == [ endOffset: Nat, records: [Nat -> LogRecords] ]
  • 147. r0 r1 r2 r3 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 A, B, C Follower (epoch=0) 1. Records and the log
  • 148. r0 r1 r2 r3 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 A, B, C Follower (epoch=0) 1. Records and the log 2. Replica State
  • 150. Replica State Representation CONSTANT Replicas \* {r1, r2, r3}
  • 151. Replica State Representation CONSTANT Replicas \* {r1, r2, r3} ReplicaState == [ log: Log, hw: Nat, leaderEpoch: Nat, leader: Replicas, isr: SUBSET Replicas ]
  • 152. Replica State Representation CONSTANT Replicas \* {r1, r2, r3} ReplicaState == [ log: Log, hw: Nat, leaderEpoch: Nat, leader: Replicas, isr: SUBSET Replicas ] AllReplicaStates == [Replicas -> ReplicaState]
  • 153. r0 r1 r2 r3 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 A, B, C Follower (epoch=0) 1. Records and the log 2. Replica State
  • 154. r0 r1 r2 r3 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 A, B, C Follower (epoch=0) 1. Records and the log 2. Replica State 3. Quorum State
  • 155. Quorum State Representation QuorumState == [ leaderEpoch: Nat, leader: Replicas, isr: SUBSET Replicas ]
  • 156. r0 r1 r2 r3 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 A, B, C Follower (epoch=0) 1. Records and the log 2. Replica State 3. Quorum State
  • 157. r0 r1 r2 r3 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 A, B, C Follower (epoch=0) 1. Records and the log 2. Replica State 3. Quorum State 4. LeaderAndIsr Propagation
  • 159. Leader/ISR Propagation LeaderAndIsrRequests == SUBSET QuorumState leaderAndIsrRequests: {} Example: initialization
  • 160. Leader/ISR Propagation LeaderAndIsrRequests == SUBSET QuorumState leaderAndIsrRequests: { [leader: A, epoch: 0, isr: {A, B, C}] } Example: after first leader election
  • 161. Leader/ISR Propagation LeaderAndIsrRequests == SUBSET QuorumState leaderAndIsrRequests: { [leader: A, epoch: 0, isr: {A, B, C}], [leader: B, epoch: 1, isr: {B, C}] } Example: after leader failure and reelection
  • 162. ● Define the state and how to initialize it ● Define the valid state transitions ● Define expected state invariants ● Run model to check invariants Model Checklist
  • 163. Next == \/ ControllerElectLeader \/ ControllerShrinkIsr \/ ReplicaBecomeLeader \/ LeaderExpandIsr \/ LeaderShrinkIsr \/ LeaderWrite \/ LeaderIncHighWatermark \/ ReplicaBecomeFollower \/ FollowerFetch State Transitions
  • 164. Next == \/ ControllerElectLeader \/ ControllerShrinkIsr \/ ReplicaBecomeLeader \/ LeaderExpandIsr \/ LeaderShrinkIsr \/ LeaderWrite \/ LeaderIncHighWatermark \/ ReplicaBecomeFollower \/ FollowerFetch State Transitions
  • 165. Next == \/ ControllerElectLeader \/ ControllerShrinkIsr \/ ReplicaBecomeLeader \/ LeaderExpandIsr \/ LeaderShrinkIsr \/ LeaderWrite \/ LeaderIncHighWatermark \/ ReplicaBecomeFollower \/ FollowerFetch State Transitions Controller actions
  • 166. Next == \/ ControllerElectLeader \/ ControllerShrinkIsr \/ ReplicaBecomeLeader \/ LeaderExpandIsr \/ LeaderShrinkIsr \/ LeaderWrite \/ LeaderIncHighWatermark \/ ReplicaBecomeFollower \/ FollowerFetch State Transitions Leader actions
  • 167. Next == \/ ControllerElectLeader \/ ControllerShrinkIsr \/ ReplicaBecomeLeader \/ LeaderExpandIsr \/ LeaderShrinkIsr \/ LeaderWrite \/ LeaderIncHighWatermark \/ ReplicaBecomeFollower \/ FollowerFetch State Transitions Follower actions
  • 168. State Transitions Start off with empty logs, a full ISR, and no leader Init
  • 170. State Transitions Init ControllerElectLeader Electing the first leader enables several new state transitions
  • 172. State Transitions Init ControllerElectLeader ReplicaBecomeLeader Electing the first leader enables several new state transitions ReplicaBecomeFollower
  • 173. State Transitions Init ControllerElectLeader ReplicaBecomeLeader Electing the first leader enables several new state transitions ReplicaBecomeFollower ControllerElectLeader
  • 174. State Transitions Init ControllerElectLeader ReplicaBecomeLeader Every transition enables a different set of next actions. ReplicaBecomeFollower ControllerElectLeader
  • 175. State Transitions Init ControllerElectLeader ReplicaBecomeLeader Every transition enables a different set of next actions. ReplicaBecomeFollower ControllerElectLeader LeaderWrite ReplicaBecomeFollower ControllerShrinkIsr
  • 176. State Transitions Init ControllerElectLeader ReplicaBecomeLeader Every transition enables a different set of next actions. ReplicaBecomeFollower ControllerElectLeader LeaderWrite ReplicaBecomeFollower ControllerShrinkIsr FollowerFetch LeaderShrinkIsr
  • 179. ● Define the state and how to initialize it ● Define the valid state transitions ● Define expected state invariants ● Run model to check invariants Model Checklist
  • 180. Replication Invariant StrongIsr == \A r1 \in Replicas: \/ ~ ReplicaPresumesLeadership(r1) \/ LET hw == replicaState[r1].hw IN \A r2 \in quorumState.isr: HasMatchingLogsUpTo(r1, r2, hw)
  • 181. Replication Invariant StrongIsr == \A r1 \in Replicas: \/ ~ ReplicaPresumesLeadership(r1) \/ LET hw == replicaState[r1].hw IN \A r2 \in quorumState.isr: HasMatchingLogsUpTo(r1, r2, hw) “If any replica is eligible to return data, then that data must be replicated to all members of the current ISR”
  • 182. r0 r1 r2 r3 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=1) Leader Epoch ISR C 1 A, C Leader (epoch=1) Leader B had failed and replica C was being elected as the new leader.
  • 183. r0 r1 r2 r3 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=1) Leader Epoch ISR C 1 A, C Leader (epoch=1) Upon becoming a follower of C, replica A would truncate its log to the local high watermark.
  • 184. r0 r1 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=1) Leader Epoch ISR C 1 A, C Leader (epoch=1)
  • 185. r0 r1 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=1) Leader Epoch ISR C 1 A, C Leader (epoch=1) This state violates the StrongIsr property because leader C is eligible to return records r2 and r3, though they are not present on A.
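The StrongIsr check against this state can be sketched in Python. This is only an illustration, not the TLA+ model itself: the `Replica` class, the `presumes_leadership` flag, and the high watermark value of 4 for C are assumptions chosen to match the slide's statement that r2 and r3 are eligible to be returned.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    log: list                  # records in offset order
    hw: int                    # high watermark: offsets below it may be returned
    presumes_leadership: bool  # hypothetical stand-in for ReplicaPresumesLeadership


def has_matching_logs_up_to(r1, r2, offset):
    """True if both logs hold identical records at every offset below `offset`."""
    return (len(r1.log) >= offset and len(r2.log) >= offset
            and r1.log[:offset] == r2.log[:offset])


def strong_isr(replicas, isr):
    """Each replica that presumes leadership must match every ISR member up to its hw."""
    return all(
        (not r1.presumes_leadership)
        or all(has_matching_logs_up_to(r1, replicas[name], r1.hw) for name in isr)
        for r1 in replicas.values()
    )


records = [f"r{i}" for i in range(7)]
# State after A truncates to r0 r1. The hw values are assumptions.
replicas = {
    "A": Replica(log=records[:2], hw=2, presumes_leadership=False),
    "B": Replica(log=records[:7], hw=4, presumes_leadership=False),
    "C": Replica(log=records[:6], hw=4, presumes_leadership=True),
}

violated = not strong_isr(replicas, isr={"A", "C"})  # A lacks r2 and r3
holds_without_a = strong_isr(replicas, isr={"C"})    # fine if A is not in the ISR
```

With A in the ISR the predicate fails, because C may return offsets 2 and 3 that A no longer has. With A outside the ISR the same logs satisfy it.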
  • 186. Model Checklist ● Define the state and how to initialize it ● Define the valid state transitions ● Define expected state invariants ● Run model to check invariants
  • 187. Edge Case 4 (Premature ISR expansion)
  • 188. r0 r1 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR B 0 B, C Follower (epoch=0) The leader is B and replica A is trying to catch up to rejoin the ISR.
  • 189. r0 r1 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR C 1 B, C Follower (epoch=0) The leader changes to C.
  • 190. r0 r1 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=0) Leader Epoch ISR C 1 B, C Leader (epoch=1) The leader changes to C.
  • 191. r0 r1 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=1) Leader Epoch ISR C 1 B, C Leader (epoch=1) Follower A catches up and rejoins the ISR.
  • 192. r0 r1 r2 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=1) Leader Epoch ISR C 1 B, C Leader (epoch=1) Follower A catches up and rejoins the ISR.
  • 193. r0 r1 r2 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=1) Leader Epoch ISR C 1 A, B, C Leader (epoch=1) Follower A catches up and rejoins the ISR.
  • 194. r0 r1 r2 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=1) Leader Epoch ISR C 1 A, B, C Leader (epoch=1) This violates StrongIsr because replica B may have returned records r3, r4, and r5 which A does not yet have.
  • 196. r0 r1 r2 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=1) Leader Epoch ISR C 1 B, C Leader (epoch=1) After becoming leader, C only knows that the true high watermark is between its own high watermark and the end of the log. True high watermark
  • 197. r0 r1 r2 r0 r1 r2 r3 r4 r5 r6 r0 r1 r2 r3 r4 r5 A B C Leader (epoch=0) Follower (epoch=1) Leader Epoch ISR C 1 B, C Leader (epoch=1) So we wait until the follower has reached the starting offset of this leader’s own epoch before allowing it into the ISR. True high watermark
  • 198. r0 r1 r2 r0 r1 r2 r3 r4 r5 r0 r1 r2 r3 r4 r5 r7 r8 A B C Follower (epoch=1) Follower (epoch=1) Leader Epoch ISR C 1 B, C Leader (epoch=1) So we wait until the follower has reached the starting offset of this leader’s own epoch before allowing it into the ISR. True high watermark
  • 199. r0 r1 r2 r3 r4 r5 r7 r0 r1 r2 r3 r4 r5 r0 r1 r2 r3 r4 r5 r7 r8 A B C Follower (epoch=1) Follower (epoch=1) Leader Epoch ISR C 1 B, C Leader (epoch=1) So we wait until the follower has reached the starting offset of this leader’s own epoch before allowing it into the ISR. True high watermark
  • 200. r0 r1 r2 r3 r4 r5 r7 r0 r1 r2 r3 r4 r5 r0 r1 r2 r3 r4 r5 r7 r8 A B C Follower (epoch=1) Follower (epoch=1) Leader Epoch ISR C 1 A, B, C Leader (epoch=1) So we wait until the follower has reached the starting offset of this leader’s own epoch before allowing it into the ISR. True high watermark
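The ISR admission rule described on these slides can be sketched as a pair of predicates. The function names and the concrete offsets are assumptions for illustration (leader C's own high watermark at 3, its epoch starting at offset 6); the broker's actual check may involve more conditions.

```python
def can_join_isr_old(follower_end_offset, leader_hw):
    """Premature rule: admit the follower once it reaches the leader's high watermark."""
    return follower_end_offset >= leader_hw


def can_join_isr(follower_end_offset, leader_hw, leader_epoch_start_offset):
    """Hardened rule: the follower must also reach the first offset of the leader's epoch.

    A new leader only knows that the true high watermark lies between its own
    high watermark and its log end, so its own high watermark is not a safe bar.
    """
    return (follower_end_offset >= leader_hw
            and follower_end_offset >= leader_epoch_start_offset)


# Assumed numbers: A holds r0 r1 r2 (end offset 3), C's own hw is 3,
# and C's epoch 1 starts at offset 6.
admitted_too_early = can_join_isr_old(3, 3)
admitted_when_behind = can_join_isr(3, 3, 6)
admitted_after_catch_up = can_join_isr(7, 3, 6)
```

Under the old rule A rejoins with only three records. Under the hardened rule it is admitted only after fetching past the start of C's epoch.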
  • 202. r0 r1 r2 r3 r0 r1 r2 r5 r6 r0 r1 r2 r5 r6 A B C Follower (epoch=0) Leader Epoch ISR B 2 B, C Leader (epoch=1) Leader (epoch=2) Replica A was a zombie that was still fetching from B. After a couple of leader elections, replica B became the leader again.
  • 203. r0 r1 r2 r3 r0 r1 r2 r5 r6 r0 r1 r2 r5 r6 A B C Follower (epoch=0) Leader Epoch ISR B 2 B, C Leader (epoch=1) Leader (epoch=2) A -> B: Fetch(offset=4, epoch=0)
  • 204. r0 r1 r2 r3 r0 r1 r2 r5 r6 r0 r1 r2 r5 r6 A B C Follower (epoch=0) Leader Epoch ISR B 2 B, C Leader (epoch=1) Leader (epoch=2) A -> B: Fetch(offset=4, epoch=0) B -> A: You are fenced!
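The fencing exchange above can be sketched as an epoch comparison on the leader. The error names `FENCED_LEADER_EPOCH` and `UNKNOWN_LEADER_EPOCH` come from KIP-320; the function and enum below are illustrative assumptions, not the broker's code.

```python
from enum import Enum


class FetchError(Enum):
    NONE = "NONE"
    FENCED_LEADER_EPOCH = "FENCED_LEADER_EPOCH"    # requester's epoch is stale
    UNKNOWN_LEADER_EPOCH = "UNKNOWN_LEADER_EPOCH"  # requester is ahead of this leader


def validate_fetch_epoch(leader_epoch, request_epoch):
    """Compare the epoch carried by a fetch request with the leader's current epoch."""
    if request_epoch < leader_epoch:
        return FetchError.FENCED_LEADER_EPOCH
    if request_epoch > leader_epoch:
        return FetchError.UNKNOWN_LEADER_EPOCH
    return FetchError.NONE


# Zombie A still believes the epoch is 0, while B now leads in epoch 2.
zombie_result = validate_fetch_epoch(leader_epoch=2, request_epoch=0)
current_result = validate_fetch_epoch(leader_epoch=2, request_epoch=2)
```

Because the fetch carries the follower's epoch, the leader can reject the request before returning any data, which prompts the follower to refresh its state instead of appending records from a diverged log.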
  • 205. KIP-320 Model Checking Results
      #Replicas | Log Size | Distinct States | Depth | Duration
      3 | 3 | 84,313,696 | 40 | ~2 hours
      3 | 4 | 133,768,793 | 20 | ~3 hours
      4 | 4 | 200,534,415 | 18 | ~6 hours
  • 207. Summary ● Distributed systems are subtle and we are poorly equipped to reason about edge cases. ● Model checking is a systematic approach to finding these edge cases and verifying our fixes address them. ● All of the replication fixes we know of will be available in Apache Kafka 2.1.0.
  • 208. Note of Caution ● The model is not the implementation. ● The implementation will have complexity that the model cannot capture.
  • 209. Resources
      ● Kafka TLA+ Specification: https://github.com/hachikuji/kafka-specification
      ● TLA+ video tutorial: https://lamport.azurewebsites.net/video/videos.html
      ● Kafka Improvement Proposals:
          ○ KIP-101: https://cwiki.apache.org/confluence/display/KAFKA/KIP-101+-+Alter+Replication+Protocol+to+use+Leader+Epoch+rather+than+High+Watermark+for+Truncation
          ○ KIP-279: https://cwiki.apache.org/confluence/display/KAFKA/KIP-279%3A+Fix+log+divergence+between+leader+and+follower+after+fast+leader+fail+over
          ○ KIP-320: https://cwiki.apache.org/confluence/display/KAFKA/KIP-320%3A+Allow+fetchers+to+detect+and+handle+log+truncation
  • 212. r0 r1 r2 r0 r1 r2 r3 r0 r1 r2 r3 A B C Leader (epoch=0) Follower (epoch=1) Leader Epoch ISR C 1 B, C Leader (epoch=1) B became a zombie while it was the leader for epoch 0.
  • 213. r0 r1 r2 r0 r1 r2 r3 r0 r1 r2 r3 r7 r8 A B C Leader (epoch=0) Follower (epoch=1) Leader Epoch ISR C 1 B, C Leader (epoch=1) The new leader will be accepting writes.
  • 214. r0 r1 r2 r0 r1 r2 r3 r9 r10 r0 r1 r2 r3 r7 r8 A B C Leader (epoch=0) Follower (epoch=1) Leader Epoch ISR C 1 B, C Leader (epoch=1) The old leader may accept writes as well!
  • 215. r0 r1 r2 r0 r1 r2 r3 r9 r10 r0 r1 r2 r3 r7 r8 A B C Leader (epoch=0) Follower (epoch=1) Leader Epoch ISR C 1 B, C Leader (epoch=1) As long as the leader cannot advance its high watermark, there is no semantic violation.
  • 216. r0 r1 r2 r0 r1 r2 r3 r9 r10 r0 r1 r2 r3 r7 r8 A B C Leader (epoch=0) Follower (epoch=1) Leader Epoch ISR Ver C 1 B, C 1 Leader (epoch=1) As long as the leader cannot advance its high watermark, there is no semantic violation.
  • 217. r0 r1 r2 r0 r1 r2 r3 r9 r10 r0 r1 r2 r3 r7 r8 A B C Leader (epoch=0) Follower (epoch=1) Leader Epoch ISR Ver C 1 B, C 1 Leader (epoch=1) The controller sends the latest version of the leader and ISR state to replicas in the LeaderAndIsr request
  • 218. r0 r1 r2 r0 r1 r2 r3 r9 r10 r0 r1 r2 r3 r7 r8 A B C Leader (epoch=0, version=0) Follower (epoch=1) Leader Epoch ISR Ver C 1 B, C 1 Leader (epoch=1, version=1) The controller sends the latest version of the leader and ISR state to replicas in the LeaderAndIsr request
  • 219. r0 r1 r2 r0 r1 r2 r3 r9 r10 r0 r1 r2 r3 r7 r8 A B C Leader (epoch=0, version=0) Follower (epoch=1) Leader Epoch ISR Ver C 1 B, C 1 Leader (epoch=1, version=1) The version allows for compare-and-swap (CAS) updates, which effectively fence replicas that hold stale state.
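The versioned update can be sketched as a compare-and-swap on a small in-memory store. The class and method names are hypothetical, and the starting ISR of A, B, C is an assumption; the only idea taken from the slides is that a write succeeds only when the caller's version matches the stored one.

```python
class LeaderAndIsrStore:
    """Hypothetical in-memory stand-in for the stored leader, epoch, ISR, and version."""

    def __init__(self, leader, epoch, isr):
        self.leader = leader
        self.epoch = epoch
        self.isr = frozenset(isr)
        self.version = 0

    def compare_and_set(self, expected_version, leader, epoch, isr):
        """Apply the update only if the caller has seen the latest version."""
        if expected_version != self.version:
            return False  # caller holds stale state and is fenced
        self.leader = leader
        self.epoch = epoch
        self.isr = frozenset(isr)
        self.version += 1
        return True


store = LeaderAndIsrStore(leader="B", epoch=0, isr={"A", "B", "C"})

# The controller elects C, bumping the version from 0 to 1.
controller_ok = store.compare_and_set(0, leader="C", epoch=1, isr={"B", "C"})

# Zombie leader B still holds version 0 and tries to shrink the ISR to itself.
zombie_ok = store.compare_and_set(0, leader="B", epoch=0, isr={"B"})
```

The zombie's write is rejected, so it cannot shrink the ISR and therefore cannot advance its high watermark on its own.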
  • 220. Appendix 2: What goes in a TLA+ Model?
  • 221. TLA+ Overview
      VARIABLES var1, var2, …
      Init == /\ var1 = 1
              /\ …
      Action1 == /\ var1 \leq 10
                 /\ var1' = var1 + 1
      …
      Next == \/ Action1
              \/ Action2
              \/ …
      Spec == Init /\ [][Next]_<<var1, var2, …>>
      Invariant == /\ var1 \geq 1
                   /\ …
  • 222. TLA+ Overview: Define the model’s state (the VARIABLES declaration).
  • 223. TLA+ Overview: Specify how the state is initialized (Init).
  • 225. TLA+ Overview: Specify the valid state transitions (actions such as Action1).
  • 228. TLA+ Overview: Specify the set of valid state transitions (Next).
  • 230. TLA+ Overview: The specification conjoins the initial-state predicate with the requirement that every step is a `Next` transition, so it describes all states reachable from `Init` by repeatedly applying `Next` (Spec).
  • 231. TLA+ Overview: Define the model invariants that should hold after every state transition (Invariant).
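What a model checker does with such a specification can be sketched as a breadth-first search over reachable states. This is a toy explicit-state checker written for illustration, not TLC; it models only `var1` and the single action from the overview, and all function names are assumptions.

```python
from collections import deque


def init_states():
    """Init: var1 = 1."""
    yield 1


def action1(var1):
    """Action1: enabled while var1 <= 10; the next state has var1 + 1."""
    if var1 <= 10:
        yield var1 + 1


ACTIONS = [action1]  # Next is the disjunction of all actions


def invariant(var1):
    """Invariant: var1 >= 1."""
    return var1 >= 1


def check(init_states, actions, invariant):
    """Explore every reachable state; return (violating state or None, states seen)."""
    seen = set()
    queue = deque()
    for state in init_states():
        if state not in seen:
            seen.add(state)
            queue.append(state)
    while queue:
        state = queue.popleft()
        if not invariant(state):
            return state, len(seen)
        for action in actions:
            for nxt in action(state):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return None, len(seen)


violation, explored = check(init_states, ACTIONS, invariant)
```

Here the search visits the states 1 through 11 and finds no violation. The Kafka model follows the same loop, but its state holds logs, epochs, and ISR membership, which is why the state counts reported for KIP-320 run into the hundreds of millions.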