kafka-in-depth

In Apache Kafka, topics are logical channels for message categorization, while partitions are physical storage units that enable efficient data distribution and processing. Partitions allow for scalability, parallelism, and ordered message handling, making them essential for high-volume data applications. Consumer groups facilitate parallel data consumption, ensuring that each partition is read by only one consumer, thus enhancing performance and fault tolerance.


Topics and Partitions:

In Apache Kafka, topics and partitions are fundamental concepts that play a crucial role in distributing and managing data efficiently within a Kafka cluster.

Topic:

- A topic is a logical channel or category to which messages are published by producers and from which messages are consumed by consumers.
- Topics act as a way to categorize and organize messages based on their content or purpose.
- Topics are identified by their names, which are strings.

Partitions:

- Partitions are the physical storage units for a topic. Each topic can be divided into multiple partitions.
- Partitions allow Kafka to horizontally distribute and parallelize the storage and processing of messages.
- Each partition is an ordered and immutable sequence of messages. Messages within a partition are assigned sequential offsets, starting from 0.
- Partitions are identified by a numeric index (e.g., partition 0, partition 1) within a topic.
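The append-only, offset-based structure of a partition can be sketched in a few lines of Python. This is a simplified in-memory model to illustrate the idea, not how the broker actually stores data:

```python
class Partition:
    """Toy model of a Kafka partition: an append-only log where each
    message receives the next sequential offset, starting from 0."""

    def __init__(self, index):
        self.index = index
        self.log = []  # messages in arrival order; never mutated in place

    def append(self, message):
        offset = len(self.log)  # next sequential offset
        self.log.append(message)
        return offset

    def read(self, offset):
        return self.log[offset]

p0 = Partition(0)
first = p0.append("user signed in")    # gets offset 0
second = p0.append("user uploaded")    # gets offset 1
print(first, second)                   # → 0 1
print(p0.read(0))                      # → user signed in
```

Consumers keep track of the last offset they have read, which is what lets them resume where they left off.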
Here's why partitions are used with a real-world example:

Example: Event Log Streaming

Imagine you're building a real-time event log streaming system, such as a system to track user activity on a website or a mobile app. Each event represents a user action, and you want to collect and process these events efficiently.

Why Use Partitions:

1. Scalability: As your system scales and the volume of incoming events increases, a single server may not be sufficient to handle all the data. Partitions allow you to distribute the event log across multiple servers or nodes in a Kafka cluster, enabling horizontal scalability.

2. Parallelism: With multiple partitions, Kafka consumers can process events in parallel. Each partition can be consumed by a separate consumer instance or thread, allowing you to take full advantage of multi-core processors and distribute the workload.

3. Ordering: Kafka guarantees that messages within a single partition are strictly ordered. This means that events routed to the same partition (for example, all events from a single user or source) will be processed in the order they occurred, ensuring data integrity.

4. Retention: Kafka lets you configure retention policies (by time or by size) per topic; each partition then enforces that policy on its own log segments, so old data is expired independently in each partition.

Real-World Usage:

In this scenario, you could have a Kafka topic called "user-events," which is divided into multiple partitions. Each partition can hold the events generated by a specific group of users, allowing you to scale, process events in parallel, and maintain order within each user group.

For example, partition 0 might handle events for users with user IDs 0-999, partition 1 for users with user IDs 1000-1999, and so on. This partitioning strategy ensures efficient data processing, scalability, and ordered event handling for each user group.
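The range-based scheme above could be implemented as a custom partitioner along these lines. This is a plain-Python sketch; real Kafka clients plug such logic in via a partitioner callback, and the bucket size of 1000 is just the example's assumption:

```python
def partition_for_user(user_id, num_partitions=4, bucket_size=1000):
    """Map user IDs 0-999 to partition 0, 1000-1999 to partition 1, etc.,
    wrapping around once the bucket number exceeds the partition count."""
    return (user_id // bucket_size) % num_partitions

print(partition_for_user(42))     # → 0 (bucket 0)
print(partition_for_user(1500))   # → 1 (bucket 1)
print(partition_for_user(4200))   # → 0 (bucket 4 wraps: 4 % 4 partitions)
```

Note the trade-off: a fixed mapping like this is easy to reason about, but if one user group is much more active than the others, its partition becomes a hot spot.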

Consumer Groups and Group ID:

A Kafka consumer group is a set of Kafka Consumers that read data in parallel from a Kafka topic. A Kafka Consumer Group has the following properties:

● All the Consumers in a group share the same group.id.
● Within a group, each partition in the topic is read by only one Consumer.
● The maximum number of useful Consumers equals the number of partitions in the topic. If there are more consumers than partitions, the extra consumers remain idle.
● A Consumer can read from more than one partition.


Importance of Kafka Consumer Group:

There will be a large number of Producers generating data at a huge rate for a retail organization. To read such a large volume of data, we need multiple Consumers running in parallel. It is comparatively easier on the Producer side, where each Producer generates data independently of the others. But on the Consumer side, if we have more than one consumer reading from the same topic, there is a high chance that each message will be read more than once. Kafka solves this problem with Consumer Groups: within a group, only one consumer at a time is allowed to read data from a given partition.

Partitions of Kafka Consumer Group:

Let's assume that we have a Kafka topic with 4 partitions in it. Then we can have the following scenarios:

1. Number of consumers = Number of partitions

In this case, each Consumer reads from exactly one partition, which is the ideal case.

2. Number of consumers > Number of partitions

In this case, the extra consumers remain idle, which leads to poor utilization of resources.

3. Number of consumers < Number of partitions

In this case, some consumers read data from more than one partition.

4. Number of Consumer Groups > 1

In this case, the topic is subscribed to by more than one consumer group, each serving a different application. The applications can run independently of one another.
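The first three sizing scenarios can be checked with a small simulation of round-robin partition assignment. This is a sketch of the principle only; Kafka's actual assignors (range, round-robin, sticky) are more involved:

```python
def assign(partitions, consumers):
    """Assign each partition to exactly one consumer, round-robin.
    Extra consumers end up with no partitions (idle)."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

parts = [0, 1, 2, 3]
print(assign(parts, ["c1", "c2", "c3", "c4"]))        # one partition each
print(assign(parts, ["c1", "c2", "c3", "c4", "c5"]))  # c5 gets [] — idle
print(assign(parts, ["c1", "c2"]))                    # two partitions each
```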
Consumer Groups add the following advantages:

● Scalability: Several Consumers reading data in parallel increases the data consumption rate and makes the system capable of reading a high volume of data.
● Fault Tolerance: Suppose we had only one Consumer; what would happen if it failed for some reason? The whole pipeline would break. With a group, the remaining Consumers take over the failed Consumer's partitions.
● Load Balancing: Kafka shares the partitions fairly among the Consumers in a group, thereby making the process of data consumption smooth and efficient.
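The fault-tolerance point can be made concrete with a toy rebalance: when a consumer leaves the group, its partitions are redistributed among the survivors. This is a simplified model; real Kafka uses a group coordinator and a rebalance protocol:

```python
def rebalance(partitions, consumers):
    """Recompute the partition assignment for the current group members."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

group = ["c1", "c2", "c3"]
parts = [0, 1, 2]
print(rebalance(parts, group))   # {'c1': [0], 'c2': [1], 'c3': [2]}

group.remove("c2")               # c2 crashes; the group rebalances
print(rebalance(parts, group))   # {'c1': [0, 2], 'c3': [1]} — no partition is orphaned
```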

Real Example:

Let's assume that we have a simple Cloud Platform where we allow users to perform the following operations:

● Store files to the Cloud.
● View their files in the Cloud.
● Download their files from the Cloud.

In the beginning, we had a tiny user base. We wanted to derive various stats (on an hourly basis) like active users, number of upload requests, number of download requests, etc. To meet the requirements, we set up a Kafka cluster, produced the logs generated by our application into a topic, and built an application that consumes the topic (using one Consumer), processes it to generate the required stats, and finally displays them on a webpage.

As people started liking our services, more people started using them, thus generating many logs per hour. We found that the application consuming the topic became extremely slow, as we were using only one Consumer. To solve the problem, we added some Consumers to the group and saw a significant performance improvement.

We then came across another requirement: we had to write the logs into a cluster, and this process should run independently of the previous application. (With a further increase in data, we were planning to decommission the first application and derive all the stats in the cluster.) To meet this requirement, we developed another application that subscribed to the topic using a different Consumer group and wrote the data into the cluster.
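The two applications above map to two consumer groups subscribed to the same topic: each group receives its own copy of every message and tracks its own position. A minimal in-memory model of that behavior:

```python
from collections import defaultdict

class Topic:
    """Toy topic: every message is delivered once per consumer group."""

    def __init__(self):
        self.messages = []
        self.group_offsets = defaultdict(int)  # each group's own read position

    def produce(self, message):
        self.messages.append(message)

    def poll(self, group_id):
        """Return all messages this group has not yet consumed."""
        start = self.group_offsets[group_id]
        batch = self.messages[start:]
        self.group_offsets[group_id] = len(self.messages)
        return batch

logs = Topic()
logs.produce("upload request")
logs.produce("download request")
print(logs.poll("stats-app"))    # ['upload request', 'download request']
print(logs.poll("archive-app"))  # same data again — independent group offset
print(logs.poll("stats-app"))    # [] — stats-app is already caught up
```

The group names `stats-app` and `archive-app` are illustrative; the point is that neither group's progress affects the other.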


Add a consumer group and add two consumers to it:

kafka-console-consumer --bootstrap-server localhost:9092 --topic fortest --group group-1

When we want to add a new consumer to this group, we run the same command again in another terminal.

To list the consumer groups in the Kafka cluster, we can use the kafka-consumer-groups.sh shell script. The --list option will list all the consumer groups:

kafka-consumer-groups --list --bootstrap-server localhost:9092
To see the members of the first group, we can use:

kafka-consumer-groups.sh --describe --group new-user --members --bootstrap-server localhost:9092
Keys in Kafka:

In Kafka, the use of a key when producing a message serves two main purposes:

1. Partitioning: Kafka uses keys to determine which partition within a topic a message should be assigned to. Each partition is responsible for storing a specific subset of the data. By specifying a key, you can control which partition a message goes to. This allows for parallel processing, ordering, and efficient data distribution across multiple partitions.

2. Message Identification: The key in a Kafka message can serve as an identifier for that message. This can be helpful when you need to look up or reference specific messages later. It enables you to associate related messages and perform operations like aggregation, filtering, and deduplication based on the message key.
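The partitioning purpose boils down to "hash the key, take the result modulo the partition count." A sketch of the principle, using CRC32 for determinism (note: real Kafka clients use a murmur2 hash over the serialized key bytes, so the actual partition numbers will differ, but the guarantee is the same):

```python
import zlib

def partition_for_key(key, num_partitions=3):
    """Deterministically map a key to a partition: same key, same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

p_a = partition_for_key("User_A")
print(p_a)                                   # some partition in [0, 3)
print(p_a == partition_for_key("User_A"))    # → True: the mapping is stable
```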

Real-World Example:

Let's consider a practical example of using keys in a ride-sharing application:

Suppose you're building a ride-sharing platform, and you want to track ride requests from different users. Each user generates ride requests, and you want to ensure that all requests from the same user are processed in the order they were made, while allowing for parallel processing of requests from different users.

In this case, you can use the user ID as the key when producing ride request messages to Kafka. Here's how it works:

- User A generates a ride request and sends it to Kafka with a key of "User_A."
- User B generates a ride request and sends it to Kafka with a key of "User_B."
- User A generates another ride request, and it is sent to Kafka with the same key, "User_A."

Kafka's partitioning mechanism ensures that all messages with the same key, in this case "User_A," go to the same partition. This means that all ride requests from User A will be processed in order within the same partition, while requests from User B and other users can be processed independently in their respective partitions.

So, in simple terms, using keys in Kafka allows you to organize and
process related messages together while still benefiting from Kafka's
parallelism and scalability. It helps you maintain order and efficiency
in your data processing pipeline.

kafka-console-producer --topic fortest --bootstrap-server localhost:9092 --property "parse.key=true" --property "key.separator=:"

- As you produce messages with different keys, Kafka will automatically assign each message to one of the topic's partitions based on the key. Messages with the same key will go to the same partition, ensuring that related data is grouped together.

- You can use the `--partition` flag to manually specify the partition
number when producing messages, but this is typically not necessary
unless you have a specific reason to override Kafka's default
partitioning behavior.
Simple Python code:
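A minimal producer sketch using the kafka-python client (an assumption — the original does not say which client library it used; the broker address and topic name are taken from the console examples above):

```python
def send_keyed_events(events, bootstrap="localhost:9092", topic="fortest"):
    """Produce keyed messages so that all events for one key (e.g. one user)
    land in the same partition and keep their order.

    `events` is an iterable of (key, value) string pairs.
    Requires `pip install kafka-python` and a broker at `bootstrap`.
    """
    # Imported inside the function so the sketch loads without the library.
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=bootstrap,
        key_serializer=str.encode,    # encode str keys to bytes
        value_serializer=str.encode,  # encode str values to bytes
    )
    for key, value in events:
        producer.send(topic, key=key, value=value)
    producer.flush()  # block until all buffered messages are sent
    producer.close()

# Example call (needs a running broker):
# send_keyed_events([("User_A", "ride requested"), ("User_B", "ride requested")])
```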
Kafka Use Case - Uber:

For more details, see:

https://www.uber.com/en-DE/blog/kafka-async-queuing-with-consumer-proxy/

https://blog.devgenius.io/unraveling-kafka-with-uber-a-real-life-application-of-event-streaming-43c07ab305cc
