Understanding Kafka Topic Partitions - by Dunith Danushka - Tributary Data - Medium
P.S. I made some edits to reflect the feedback I received from the audience. Thanks for your
valuable contributions. I expect more! :)
https://medium.com/event-driven-utopia/understanding-kafka-topic-partitions-ae40f80552e8 1/17
14/09/2023, 12:25 Understanding Kafka Topic Partitions | by Dunith Danushka | Tributary Data | Medium
Understanding partitions helps you learn Kafka faster. This article walks through
the concepts, structure, and behavior of Kafka’s partitions.
Events
An event represents a fact that happened in the past. Events are immutable and
never stay in one place. They always travel from one system to another system,
carrying the state changes that happened.
Streams
An event stream represents related events in motion.
Topics
When an event stream enters Kafka, it is persisted as a topic. In Kafka’s universe, a
topic is a materialized event stream. In other words, a topic is a stream at rest.
A topic groups related events together and stores them durably. The closest analogy
for a Kafka topic is a table in a database or a folder in a file system.
Topics are the central concept in Kafka that decouples producers and consumers. A
consumer pulls messages off of a Kafka topic while producers push messages into a
Kafka topic. A topic can have many producers and many consumers.
Partitions
Kafka’s topics are divided into several partitions. While the topic is a logical concept
in Kafka, a partition is the smallest storage unit that holds a subset of the records owned
by a topic. Each partition is a single log, and records are written to it in an
append-only fashion.
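
To make the append-only idea concrete, here is a minimal sketch of a partition as an in-memory list. The PartitionLog class and its method names are hypothetical and exist only for illustration; a real Kafka partition lives on a broker's disk. The position returned by append is what Kafka calls the record's offset.

```python
class PartitionLog:
    """Toy in-memory stand-in for one partition (hypothetical, not Kafka's API)."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Add a record at the end of the log and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset):
        """Return the record stored at the given offset."""
        return self._records[offset]


log = PartitionLog()
first = log.append("order-created")   # lands at offset 0
second = log.append("order-paid")     # lands at offset 1
```

Records are only ever added at the end, so an offset, once handed out, keeps pointing at the same record.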
When talking about the content inside a partition, I will use the terms record and
message interchangeably.
The figure below shows a topic with three partitions. Records are being appended
to the end of each one.
Although the messages within a partition are ordered, messages across a topic are
not guaranteed to be ordered.
If we were to put all partitions of a topic on a single broker, the scalability of that
topic would be constrained by the broker’s IO throughput, and the topic could never grow
bigger than the biggest machine in the cluster. By spreading partitions across
multiple brokers, a single topic can be scaled horizontally to provide
performance far beyond a single broker’s ability.
Partition replication is complex, and it deserves its own post. Next time maybe?
By default, the partition key is passed through a hashing function, which determines the
partition assignment. That ensures that all records produced with the same key
arrive at the same partition, as long as the number of partitions does not change.
Specifying a partition key keeps related events together in the same partition and in
the exact order in which they were sent.
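
The idea of hashing a key to pick a partition can be sketched in a few lines. This is only an illustration: choose_partition is a hypothetical helper, and zlib.crc32 is a stand-in hash chosen for simplicity, not the hash Kafka's own partitioner uses.

```python
import zlib


def choose_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition index (crc32 is a stand-in, not Kafka's hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Hypothetical events as (partition key, payload) pairs.
events = [
    ("customer-1", "cart-created"),
    ("customer-2", "cart-created"),
    ("customer-1", "item-added"),
]

# Three empty partitions, each one an ordered list of records.
partitions = {p: [] for p in range(3)}
for key, value in events:
    partitions[choose_partition(key, 3)].append((key, value))
```

Because the hash is deterministic, both customer-1 events land in the same list, in the order they were produced.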
Messages with the same partition key will end up at the same partition
Key-based partition assignment can lead to broker skew if keys aren’t well
distributed.
For example, when customer ID is used as the partition key and one customer
generates 90% of the traffic, one partition will receive 90% of the traffic most of
the time. On small topics this is negligible; on larger ones, it can sometimes take a
broker down.
When choosing a partition key, ensure that its values are well distributed.
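
A quick simulation shows how a skewed key distribution turns into a hot partition. The traffic mix below is made up for illustration, and choose_partition again uses zlib.crc32 as a stand-in hash rather than Kafka's own.

```python
import zlib
from collections import Counter


def choose_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition index (crc32 is a stand-in, not Kafka's hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Hypothetical traffic: one customer produces 90 of every 100 records.
keys = ["customer-big"] * 90 + [f"customer-{i}" for i in range(10)]

# Count how many records each of the 4 partitions receives.
load = Counter(choose_partition(k, 4) for k in keys)
hot_partition, hot_count = load.most_common(1)[0]
```

Whatever the hash does with the ten small customers, all 90 records of the big one share a partition, so hot_count is at least 90 out of 100.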
However, if no partition key is used, records are spread across the topic’s partitions,
so related records may land in different partitions and their relative order cannot be
guaranteed.
The key takeaway is to use a partition key to put related events together in the same
partition in the exact order in which they were sent.
The offset of a message works as a consumer-side cursor. The
consumer keeps track of which messages it has already consumed by keeping track
of their offsets. After reading a message, the consumer advances its
cursor to the next offset in the partition and continues. Advancing and
remembering the last read offset within a partition is the responsibility of the
consumer. The broker does not advance this cursor for the consumer; at most, it
stores the offsets that the consumer chooses to commit.
By remembering the offset of the last consumed message for each partition, a
consumer can rejoin a partition at the point it chooses and resume from
there. That is particularly useful for a consumer to resume reading after recovering
from a crash.
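
Here is a rough sketch of that cursor behavior, using a plain list as the partition. The Consumer class, its poll method, and the saved_offset variable are hypothetical names for illustration; a real consumer would persist its offset somewhere durable rather than in a local variable.

```python
# A partition modeled as a list; the list index plays the role of the offset.
partition = ["e0", "e1", "e2", "e3", "e4"]


class Consumer:
    """Toy consumer that owns its own cursor (hypothetical, not Kafka's API)."""

    def __init__(self, start_offset=0):
        self.next_offset = start_offset

    def poll(self, log):
        """Return the next unread record, or None when caught up."""
        if self.next_offset >= len(log):
            return None
        record = log[self.next_offset]
        self.next_offset += 1  # advance the cursor after reading
        return record


consumer = Consumer()
seen = [consumer.poll(partition) for _ in range(3)]
saved_offset = consumer.next_offset  # remembered before the "crash"

# Simulate a restart: a fresh consumer resumes from the saved offset.
restarted = Consumer(start_offset=saved_offset)
resumed = [restarted.poll(partition) for _ in range(2)]
```

The restarted consumer picks up at e3 without re-reading e0 to e2, because the only state it needed was the saved offset.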
Kafka has the concept of consumer groups, where several consumers are grouped to
consume a given topic. Consumers in the same consumer group share the
same group ID (the group.id setting).
The consumer group concept ensures that a message is only ever read by a single
consumer in the group, because each partition is assigned to exactly one consumer in the group.
For example, if you have N + 1 consumers for a topic with N partitions, the first
N consumers will each be assigned a partition and the remaining consumer will sit idle.
If one of the N consumers fails, the waiting consumer will be assigned its
partition. This is a good strategy for implementing a hot failover.
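
The N + 1 scenario can be played out with a toy assignment function. The assign and handle_failure helpers below are hypothetical simplifications; Kafka's real assignment strategies are configurable and may distribute partitions differently.

```python
def assign(partitions, consumers):
    """Give each partition to exactly one consumer, round-robin."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment


def handle_failure(assignment, failed):
    """Hand the failed consumer's partitions to the least-loaded survivors."""
    remaining = {c: list(ps) for c, ps in assignment.items() if c != failed}
    for p in assignment[failed]:
        least_loaded = min(remaining, key=lambda c: len(remaining[c]))
        remaining[least_loaded].append(p)
    return remaining


partitions = [0, 1, 2]                  # N = 3 partitions
consumers = ["c1", "c2", "c3", "c4"]    # N + 1 = 4 consumers

before = assign(partitions, consumers)  # c4 ends up with no partition
after = handle_failure(before, "c2")    # the idle c4 takes over partition 1
```

Before the failure, c4 holds an empty list, which is the idle standby. After c2 fails, c4 is the least-loaded survivor and inherits partition 1.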
The key takeaway is that the number of consumers doesn’t govern the degree of
parallelism of a topic. The number of partitions does.
Editor of Tributary Data. Technologist, Writer, Senior Developer Advocate at Redpanda. Opinions are my
own.