kafka-in-depth

In Apache Kafka, topics are logical channels for message categorization, while partitions are physical storage units that enable efficient data distribution and processing. Partitions allow for scalability, parallelism, and ordered message handling, making them essential for high-volume data applications. Consumer groups facilitate parallel data consumption, ensuring that each partition is read by only one consumer, thus enhancing performance and fault tolerance.


Topics and Partitions:

In Apache Kafka, topics and partitions are fundamental concepts that play a crucial role in distributing and managing data efficiently within a Kafka cluster.

Topic:

- A topic is a logical channel or category to which messages are published by producers and from which messages are consumed by consumers.
- Topics act as a way to categorize and organize messages based on their content or purpose.
- Topics are identified by their names, which are strings.

Partitions:

- Partitions are the physical storage units for a topic. Each topic can be divided into multiple partitions.
- Partitions allow Kafka to horizontally distribute and parallelize the storage and processing of messages.
- Each partition is an ordered and immutable sequence of messages. Messages within a partition are assigned sequential offsets, starting from 0.
- Partitions are identified by a numeric index (e.g., partition 0, partition 1) within a topic.
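The append-only, offset-based structure of a partition can be sketched in a few lines of Python. This is a simplified in-memory model to illustrate the idea, not how the broker actually stores data:

```python
class Partition:
    """Toy model of a Kafka partition: an append-only log where each
    message receives the next sequential offset, starting from 0."""

    def __init__(self, index):
        self.index = index
        self.log = []  # messages in arrival order; never mutated in place

    def append(self, message):
        offset = len(self.log)  # next sequential offset
        self.log.append(message)
        return offset

    def read(self, offset):
        return self.log[offset]

p0 = Partition(0)
first = p0.append("user signed in")    # gets offset 0
second = p0.append("user uploaded")    # gets offset 1
print(first, second)                   # → 0 1
print(p0.read(0))                      # → user signed in
```

Consumers keep track of the last offset they have read, which is what lets them resume where they left off.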
Here's why partitions are used with a real-world example:

Example: Event Log Streaming

Imagine you're building a real-time event log streaming system, such as a system to track user activity on a website or a mobile app. Each event represents a user action, and you want to collect and process these events efficiently.

Why Use Partitions:

1. Scalability: As your system scales and the volume of incoming events increases, a single server may not be sufficient to handle all the data. Partitions allow you to distribute the event log across multiple servers or nodes in a Kafka cluster, enabling horizontal scalability.

2. Parallelism: With multiple partitions, Kafka consumers can process events in parallel. Each partition can be consumed by a separate consumer instance or thread, allowing you to take full advantage of multi-core processors and distribute the workload.

3. Ordering: Kafka guarantees that messages within a single partition are strictly ordered. This means that events routed to the same partition (for example, all events from a single user or source) will be processed in the order they occurred, ensuring data integrity.

4. Retention: Kafka lets you configure retention policies (by time or by size) per topic; each partition then enforces that policy on its own log segments, so old data is expired independently in each partition.

Real-World Usage:

In this scenario, you could have a Kafka topic called "user-events," which is divided into multiple partitions. Each partition can hold the events generated by a specific group of users, allowing you to scale, process events in parallel, and maintain order within each user group.

For example, partition 0 might handle events for users with user IDs 0-999, partition 1 for users with user IDs 1000-1999, and so on. This partitioning strategy ensures efficient data processing, scalability, and ordered event handling for each user group.
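The range-based scheme above could be implemented as a custom partitioner along these lines. This is a plain-Python sketch; real Kafka clients plug such logic in via a partitioner callback, and the bucket size of 1000 is just the example's assumption:

```python
def partition_for_user(user_id, num_partitions=4, bucket_size=1000):
    """Map user IDs 0-999 to partition 0, 1000-1999 to partition 1, etc.,
    wrapping around once the bucket number exceeds the partition count."""
    return (user_id // bucket_size) % num_partitions

print(partition_for_user(42))     # → 0 (bucket 0)
print(partition_for_user(1500))   # → 1 (bucket 1)
print(partition_for_user(4200))   # → 0 (bucket 4 wraps: 4 % 4 partitions)
```

Note the trade-off: a fixed mapping like this is easy to reason about, but if one user group is much more active than the others, its partition becomes a hot spot.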

Consumer Groups and Group ID:

A Kafka consumer group is a set of Kafka Consumers that read data in parallel from a Kafka topic. A Kafka Consumer Group has the following properties:

● All the Consumers in a group share the same group.id.
● Within a group, each partition in the topic is read by only one Consumer.
● The maximum number of useful Consumers equals the number of partitions in the topic. If there are more consumers than partitions, the extra consumers remain idle.
● A Consumer can read from more than one partition.


Importance of Kafka Consumer Group:

There will be a large number of Producers generating data at a huge rate for a retail organization. To read such a large volume of data, we need multiple Consumers running in parallel. It is comparatively easier on the Producer side, where each Producer generates data independently of the others. But on the Consumer side, if we have more than one consumer reading from the same topic, there is a high chance that each message will be read more than once. Kafka solves this problem with Consumer Groups: within a group, only one consumer at a time is allowed to read data from a given partition.

Partitions of Kafka Consumer Group:

Let's assume that we have a Kafka topic with 4 partitions in it. Then we can have the following scenarios:

1. Number of consumers = Number of partitions

In this case, each Consumer reads from exactly one partition, which is the ideal case.

2. Number of consumers > Number of partitions

In this case, the extra consumers remain idle, which leads to poor utilization of resources.

3. Number of consumers < Number of partitions

In this case, some consumers read data from more than one partition.

4. Number of Consumer Groups > 1

In this case, the topic is subscribed to by more than one consumer group, each serving a different application. The applications can run independently of one another.
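The first three sizing scenarios can be checked with a small simulation of round-robin partition assignment. This is a sketch of the principle only; Kafka's actual assignors (range, round-robin, sticky) are more involved:

```python
def assign(partitions, consumers):
    """Assign each partition to exactly one consumer, round-robin.
    Extra consumers end up with no partitions (idle)."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

parts = [0, 1, 2, 3]
print(assign(parts, ["c1", "c2", "c3", "c4"]))        # one partition each
print(assign(parts, ["c1", "c2", "c3", "c4", "c5"]))  # c5 gets [] — idle
print(assign(parts, ["c1", "c2"]))                    # two partitions each
```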
Consumer Groups add the following advantages:

● Scalability: Several Consumers reading data in parallel increases the data consumption rate and makes the system capable of reading a high volume of data.
● Fault Tolerance: Suppose we had only one Consumer; what would happen if it failed for some reason? The whole pipeline would break. With a group, the remaining Consumers take over the failed Consumer's partitions.
● Load Balancing: Kafka shares the partitions fairly among the Consumers in a group, thereby making the process of data consumption smooth and efficient.
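The fault-tolerance point can be made concrete with a toy rebalance: when a consumer leaves the group, its partitions are redistributed among the survivors. This is a simplified model; real Kafka uses a group coordinator and a rebalance protocol:

```python
def rebalance(partitions, consumers):
    """Recompute the partition assignment for the current group members."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

group = ["c1", "c2", "c3"]
parts = [0, 1, 2]
print(rebalance(parts, group))   # {'c1': [0], 'c2': [1], 'c3': [2]}

group.remove("c2")               # c2 crashes; the group rebalances
print(rebalance(parts, group))   # {'c1': [0, 2], 'c3': [1]} — no partition is orphaned
```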

Real Example:

Let's assume that we have a simple Cloud Platform where we allow users to perform the following operations:

● Store files to the Cloud.
● View their files in the Cloud.
● Download their files from the Cloud.

In the beginning, we had a tiny user base. We wanted to derive various stats (on an hourly basis) like active users, number of upload requests, number of download requests, etc. To meet the requirements, we set up a Kafka cluster, produced the logs generated by our application into a topic, and built an application that consumes the topic (using one Consumer), processes it to generate the required stats, and finally displays them on a webpage.

As people started liking our services, more people started using them, thus generating many logs per hour. We found that the application consuming the topic became extremely slow, as we were using only one Consumer. To solve the problem, we added some Consumers to the group and saw a significant performance improvement.

We then came across another requirement: we had to write the logs into a cluster, and this process should run independently of the previous application. (With a further increase in data, we were planning to decommission the first application and derive all the stats in the cluster.) To meet this requirement, we developed another application that subscribed to the topic using a different Consumer group and wrote the data into the cluster.
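The two applications above map to two consumer groups subscribed to the same topic: each group receives its own copy of every message and tracks its own position. A minimal in-memory model of that behavior:

```python
from collections import defaultdict

class Topic:
    """Toy topic: every message is delivered once per consumer group."""

    def __init__(self):
        self.messages = []
        self.group_offsets = defaultdict(int)  # each group's own read position

    def produce(self, message):
        self.messages.append(message)

    def poll(self, group_id):
        """Return all messages this group has not yet consumed."""
        start = self.group_offsets[group_id]
        batch = self.messages[start:]
        self.group_offsets[group_id] = len(self.messages)
        return batch

logs = Topic()
logs.produce("upload request")
logs.produce("download request")
print(logs.poll("stats-app"))    # ['upload request', 'download request']
print(logs.poll("archive-app"))  # same data again — independent group offset
print(logs.poll("stats-app"))    # [] — stats-app is already caught up
```

The group names `stats-app` and `archive-app` are illustrative; the point is that neither group's progress affects the other.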


Add a consumer group and add two consumers to it:

kafka-console-consumer --bootstrap-server localhost:9092 --topic fortest --group group-1

When we want to add a new consumer to this group, we run the same command again in another terminal.

To list the consumer groups in the Kafka cluster, we can use the kafka-consumer-groups.sh shell script. The --list option will list all the consumer groups:

kafka-consumer-groups --list --bootstrap-server localhost:9092
To see the members of the first group, we can use:

kafka-consumer-groups.sh --describe --group new-user --members --bootstrap-server localhost:9092
Keys in Kafka:

In Kafka, the use of a key when producing a message serves two main purposes:

1. Partitioning: Kafka uses keys to determine which partition within a topic a message should be assigned to. Each partition is responsible for storing a specific subset of the data. By specifying a key, you can control which partition a message goes to. This allows for parallel processing, ordering, and efficient data distribution across multiple partitions.

2. Message Identification: The key in a Kafka message can serve as an identifier for that message. This can be helpful when you need to look up or reference specific messages later. It enables you to associate related messages and perform operations like aggregation, filtering, and deduplication based on the message key.
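The partitioning purpose boils down to "hash the key, take the result modulo the partition count." A sketch of the principle, using CRC32 for determinism (note: real Kafka clients use a murmur2 hash over the serialized key bytes, so the actual partition numbers will differ, but the guarantee is the same):

```python
import zlib

def partition_for_key(key, num_partitions=3):
    """Deterministically map a key to a partition: same key, same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

p_a = partition_for_key("User_A")
print(p_a)                                   # some partition in [0, 3)
print(p_a == partition_for_key("User_A"))    # → True: the mapping is stable
```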

Real-World Example:

Let's consider a practical example of using keys in a ride-sharing application:

Suppose you're building a ride-sharing platform, and you want to track ride requests from different users. Each user generates ride requests, and you want to ensure that all requests from the same user are processed in the order they were made, while allowing for parallel processing of requests from different users.

In this case, you can use the user ID as the key when producing ride request messages to Kafka. Here's how it works:

- User A generates a ride request and sends it to Kafka with a key of "User_A."
- User B generates a ride request and sends it to Kafka with a key of "User_B."
- User A generates another ride request, and it is sent to Kafka with the same key, "User_A."

Kafka's partitioning mechanism ensures that all messages with the same key, in this case "User_A," go to the same partition. This means that all ride requests from User A will be processed in order within the same partition, while requests from User B and other users can be processed independently in their respective partitions.

So, in simple terms, using keys in Kafka allows you to organize and
process related messages together while still benefiting from Kafka's
parallelism and scalability. It helps you maintain order and efficiency
in your data processing pipeline.

kafka-console-producer --topic fortest --bootstrap-server localhost:9092 --property "parse.key=true" --property "key.separator=:"

- As you produce messages with different keys, Kafka will automatically assign each message to one of the topic's partitions based on the key. Messages with the same key will go to the same partition, ensuring that related data is grouped together.

- You can use the `--partition` flag to manually specify the partition
number when producing messages, but this is typically not necessary
unless you have a specific reason to override Kafka's default
partitioning behavior.
Simple Python code:
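A minimal producer sketch using the kafka-python client (an assumption — the original does not say which client library it used; the broker address and topic name are taken from the console examples above):

```python
def send_keyed_events(events, bootstrap="localhost:9092", topic="fortest"):
    """Produce keyed messages so that all events for one key (e.g. one user)
    land in the same partition and keep their order.

    `events` is an iterable of (key, value) string pairs.
    Requires `pip install kafka-python` and a broker at `bootstrap`.
    """
    # Imported inside the function so the sketch loads without the library.
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=bootstrap,
        key_serializer=str.encode,    # encode str keys to bytes
        value_serializer=str.encode,  # encode str values to bytes
    )
    for key, value in events:
        producer.send(topic, key=key, value=value)
    producer.flush()  # block until all buffered messages are sent
    producer.close()

# Example call (needs a running broker):
# send_keyed_events([("User_A", "ride requested"), ("User_B", "ride requested")])
```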
Kafka Use Case - Uber:

For more details, see:

https://www.uber.com/en-DE/blog/kafka-async-queuing-with-consumer-proxy/

https://blog.devgenius.io/unraveling-kafka-with-uber-a-real-life-application-of-event-streaming-43c07ab305cc
