
APACHE KAFKA ESSENTIALS

ORIGINAL BY JUN RAO | UPDATE BY BILL MCLANE

CONTENTS

∙ Introduction
∙ About Apache Kafka
∙ Quickstart for Apache Kafka
∙ Pub/Sub in Apache Kafka
∙ Kafka Connect
∙ Kafka Streams
∙ Extending Apache Kafka
∙ One Size Does Not Fit All
∙ Conclusion
∙ Additional Resources

Two trends have emerged in the information technology space. First, the diversity and velocity of the data that an enterprise wants to collect for decision-making continues to grow. Second, there is a growing need for an enterprise to make decisions in real time based on that collected data. For example, financial institutions want to not only detect fraud immediately, but also offer a better banking experience through features like real-time alerting, real-time product recommendations, and more effective customer service.

Apache Kafka is a streaming engine for collecting, caching, and processing high volumes of data in real time. As illustrated in Figure 1, Kafka typically serves as a part of a central data hub in which data within an enterprise is collected. The data can then be used for continuous processing or fed into other systems and applications in real time. Kafka is used by more than 40% of Fortune 500 companies across all industries.

The main benefits of Kafka are:

1. High throughput: Each server is capable of handling hundreds of MB per second of data.
2. High availability: Data can be stored redundantly in multiple servers and can survive individual server failure.
3. High scalability: New servers can be added over time to scale out the system.
4. Easy integration with external data sources or data sinks.
5. Built-in real-time processing layer.

ABOUT APACHE KAFKA

Kafka was originally developed at LinkedIn in 2010, and it became a top-level Apache project in 2012. It has three main components: Pub/Sub, Kafka Connect, and Kafka Streams. The role of each component is summarized in the table below.

Pub/Sub: Storing and delivering data efficiently and reliably at scale.
Kafka Connect: Integrating Kafka with external data sources and data sinks.
Kafka Streams: Processing data in Kafka in real time.

Figure 1: Apache Kafka as a central real-time hub

QUICKSTART FOR APACHE KAFKA

It's easy to get started on Kafka. The following are the steps to get Kafka running in your environment:

1. Download the latest Apache Kafka binary distribution from http://kafka.apache.org/downloads and untar it.
2. Start the ZooKeeper server.
3. Start the Kafka broker.
4. Create a topic.
5. Produce and consume data.

For detailed Quickstart instructions, please see the Apache Kafka documentation in the Additional Resources section below.
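If you prefer to do step 4 from code rather than from the command line, a topic can also be created with the Java AdminClient. The snippet below is a minimal sketch, assuming a broker running locally on port 9092; the topic name, partition count, and replication factor are example values.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

Properties props = new Properties();
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

try (AdminClient admin = AdminClient.create(props)) {
    // One partition and a replication factor of 1 are enough for a local test.
    NewTopic topic = new NewTopic("test", 1, (short) 1);
    // createTopics() is asynchronous; all().get() blocks until the topic exists.
    admin.createTopics(Collections.singleton(topic)).all().get();
}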

PUB/SUB IN APACHE KAFKA

The first component in Kafka deals with the production and consumption of the data. The following table describes a few key concepts in Kafka:

Topic: Defines a logical name for producing and consuming records.
Partition: Defines a non-overlapping subset of records within a topic.
Offset: A unique sequential number assigned to each record within a topic partition.
Record: A record contains a key, a value, a timestamp, and a list of headers.
Broker: Server where records are stored. Multiple brokers can be used to form a cluster.

Figure 2 depicts a topic with two partitions. Partition 0 has 5 records, with offsets from 0 to 4, and partition 1 has 4 records, with offsets from 0 to 3.

Figure 2: Partitions in a topic

The following code snippet shows how to produce records to a topic "test" using the Java API:


Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer",
    "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer",
    "org.apache.kafka.common.serialization.StringSerializer");

Producer<String, String> producer = new KafkaProducer<>(props);
producer.send(
    new ProducerRecord<String, String>("test", "key", "value"));

In the above example, both the key and value are strings, so we are using a StringSerializer. It's possible to customize the serializer when types become more complex.
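As an illustration of what a custom serializer might look like, the sketch below implements the Serializer interface for a made-up PageView value type (the class name and its fields are purely illustrative). It would be registered through the same value.serializer property shown above; a production serializer would more likely emit JSON, Avro, or Protobuf.

import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.serialization.Serializer;

public class PageViewSerializer implements Serializer<PageViewSerializer.PageView> {

    // A hypothetical value type used only for this example.
    public static class PageView {
        public String page;
        public long durationMs;
    }

    // configure() and close() have default implementations in the interface.
    @Override
    public byte[] serialize(String topic, PageView data) {
        if (data == null) {
            return null;
        }
        // Encode the fields as simple UTF-8 text.
        String encoded = data.page + "," + data.durationMs;
        return encoded.getBytes(StandardCharsets.UTF_8);
    }
}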
The following code snippet shows how to consume records with a string key and value in Java:

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
// A consumer group ID is required when using subscribe().
props.put("group.id", "test-group");
props.put("key.deserializer",
    "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer",
    "org.apache.kafka.common.serialization.StringDeserializer");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("test"));

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records)
        System.out.printf("offset=%d, key=%s, value=%s%n",
            record.offset(), record.key(), record.value());
    consumer.commitSync();
}

Records within a partition are always delivered to the consumer in offset order. By saving the offset of the last consumed record from each partition, the consumer can resume from where it left off after a restart. In the example above, we use the commitSync() API to save the offsets explicitly after consuming a batch of records. One can also save the offsets automatically by setting the property enable.auto.commit to true.

A record in Kafka is not removed from the broker immediately after it is consumed. Instead, it is retained according to a configured retention policy. The following table summarizes the two common policies:

log.retention.hours: The number of hours to keep a record on the broker.
log.retention.bytes: The maximum size of records retained in a partition.

KafkaConsumer<String, String> consumer =

new KafkaConsumer<>(props);

consumer.subscribe(Arrays.asList("test"));

while (true) {

ConsumerRecords<String, String> records =

consumer.poll(100);


Figure 3: Usage of Apache Kafka Connect

The benefits of using Kafka Connect are:

∙ Parallelism and fault tolerance
∙ Avoiding ad-hoc code by reusing existing connectors
∙ Built-in offset and configuration management

QUICKSTART FOR KAFKA CONNECT

The following steps show how to run the existing file connector in standalone mode to copy the content from a source file to a destination file via Kafka:

1. Prepare some data in a source file:

> echo -e "hello\nworld" > test.txt

2. Start a file source and a file sink connector:

> bin/connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties config/connect-file-sink.properties

3. Verify the data in the destination file:

> more test.sink.txt
hello
world

4. Verify the data in Kafka:

> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic connect-test --from-beginning
{"schema":{"type":"string","optional":false},"payload":"hello"}
{"schema":{"type":"string","optional":false},"payload":"world"}

In the example above, the data in the source file test.txt is first streamed into a Kafka topic connect-test through a file source connector. The records in connect-test are then streamed into the destination file test.sink.txt. If a new line is added to test.txt, it will show up immediately in test.sink.txt. Note that we achieve this by running two connectors without writing any custom code.

Connectors are powerful tools that allow for the integration of Apache Kafka with many other systems. There are many open-source and commercially supported options for integrating Apache Kafka, both at the connector layer and through an integration services layer that can provide much more flexibility in message transformation. In addition to open-source connectors, vendors like Confluent and TIBCO Software offer commercially supported connectors to hundreds of endpoints, simplifying integration with Apache Kafka.


TRANSFORMATIONS IN CONNECT

Connect is primarily designed to stream data between systems as is, whereas Kafka Streams is designed to perform complex transformations once the data is in Kafka. That said, Kafka Connect provides a mechanism used to perform simple transformations per record. The following example shows how to enable a couple of transformations in the file source connector.

1. Add the following lines to connect-file-source.properties:

transforms=MakeMap, InsertSource
transforms.MakeMap.type=org.apache.kafka.connect.transforms.HoistField$Value
transforms.MakeMap.field=line
transforms.InsertSource.type=org.apache.kafka.connect.transforms.InsertField$Value
transforms.InsertSource.static.field=data_source
transforms.InsertSource.static.value=test-file-source

2. Start a file source connector:

> bin/connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties

3. Verify the data in Kafka:

> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic connect-test
{"line":"hello","data_source":"test-file-source"}
{"line":"world","data_source":"test-file-source"}

In step 1 above, we add two transformations, MakeMap and InsertSource, which are implemented by the classes HoistField$Value and InsertField$Value, respectively. The first one wraps each input string in a map under the field name "line". The second one adds an additional field, "data_source", that indicates the name of the source file. After applying the transformation logic, the data in the input file is transformed into the output shown in step 3. Transformations in Connect are intentionally simple and applied per record; more complex processing, such as the word count shown in the Kafka Streams section below, is better implemented with the Streams API.


CONNECT REST API

In production, Kafka Connect typically runs in distributed mode and can be managed through REST APIs. The following table shows the common APIs. See the Kafka documentation for more information.

GET /connectors: Return a list of active connectors.
POST /connectors: Create a new connector.
GET /connectors/{name}: Get the information of a specific connector.
GET /connectors/{name}/config: Get configuration parameters for a specific connector.
PUT /connectors/{name}/config: Update configuration parameters for a specific connector.
GET /connectors/{name}/status: Get the current status of the connector.
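As a quick sketch of calling these endpoints, the snippet below lists the active connectors using the HTTP client built into Java 11+. The worker address and the default REST port of 8083 are assumptions about the local setup.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

HttpClient client = HttpClient.newHttpClient();
HttpRequest request = HttpRequest.newBuilder()
    .uri(URI.create("http://localhost:8083/connectors"))
    .build();

// Prints a JSON array of connector names, e.g. ["local-file-source"].
HttpResponse<String> response =
    client.send(request, HttpResponse.BodyHandlers.ofString());
System.out.println(response.body());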
KAFKA STREAMS

Kafka Streams is a client library for building real-time applications and microservices where the input and/or output data is stored in Kafka. The benefits of using Kafka Streams are:

∙ Less code in the application
∙ Built-in state management
∙ Lightweight
∙ Parallelism and fault tolerance

The most common way of using Kafka Streams is through the Streams DSL, which includes operations such as filtering, joining, grouping, and aggregation. The following code snippet shows the main logic of a Streams example called WordCountDemo:

final Serde<String> stringSerde = Serdes.String();
final Serde<Long> longSerde = Serdes.Long();

StreamsBuilder builder = new StreamsBuilder();

// build a stream from an input topic
KStream<String, String> source = builder.stream(
    "streams-plaintext-input",
    Consumed.with(stringSerde, stringSerde));

KTable<String, Long> counts = source
    .flatMapValues(value -> Arrays.asList(value.toLowerCase().split(" ")))
    .groupBy((key, value) -> value)
    .count();

// convert the output to another topic
counts.toStream().to("streams-wordcount-output",
    Produced.with(stringSerde, longSerde));

The code above first creates a stream from an input topic streams-plaintext-input. It then applies a transformation to split each input line into words. Next, it counts the number of occurrences of each unique word. Finally, the results are written to an output topic streams-wordcount-output.

The following are the steps to run the example code.

1. Create the input topic:

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic streams-plaintext-input

2. Run the stream application:

bin/kafka-run-class.sh org.apache.kafka.streams.examples.wordcount.WordCountDemo

3. Produce some data in the input topic:

bin/kafka-console-producer.sh --broker-list localhost:9092 --topic streams-plaintext-input
hello world

4. Verify the data in the output topic:

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic streams-wordcount-output --from-beginning \
  --formatter kafka.tools.DefaultMessageFormatter \
  --property print.key=true \
  --property print.value=true \
  --property key.deserializer=org.apache.kafka.common.serialization.StringDeserializer \
  --property value.deserializer=org.apache.kafka.common.serialization.LongDeserializer
hello 1
world 1


KSTREAM VS. KTABLE

There are two key concepts in Kafka Streams: KStream and KTable. A topic can be viewed as either of the two. Their differences are summarized in the table below.

Concept:
  KStream - Each record is treated as an append to the stream.
  KTable  - Each record is treated as an update to an existing key.
Usage:
  KStream - Model append-only data such as click streams.
  KTable  - Model updatable reference data such as user profiles.

The following example illustrates the difference of the two:

(Key, Value) Records    Sum of the Values as KStream    Sum of the Values as KTable
("k1", 2) ("k1", 5)     7                               5

When a topic is viewed as a KStream, there are two independent records and thus the sum of the values is 7. On the other hand, if the topic is viewed as a KTable, the second record is treated as an update to the first record since they have the same key "k1". Therefore, only the second record is retained in the stream and the sum is 5 instead.
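The following sketch shows how each view is obtained from a StreamsBuilder. The topic name user-events and the Long value type are illustrative; in practice, a given topology reads a topic as one view or the other, which is why two separate builders are used here.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

// As a KStream, every record is an independent event.
StreamsBuilder streamBuilder = new StreamsBuilder();
KStream<String, Long> asStream = streamBuilder.stream(
    "user-events", Consumed.with(Serdes.String(), Serdes.Long()));

// As a KTable, a record replaces the previous value for its key,
// so only the latest value per key is kept.
StreamsBuilder tableBuilder = new StreamsBuilder();
KTable<String, Long> asTable = tableBuilder.table(
    "user-events", Consumed.with(Serdes.String(), Serdes.Long()));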

KSTREAMS DSL

The following tables show a list of common operations available in Kafka Streams:

COMMONLY USED OPERATIONS IN KSTREAM

filter(Predicate)
Create a new KStream that consists of all records of this stream that satisfy the given predicate.
  ks_out = ks_in.filter( (key, value) -> value > 5 );
  ks_in:  ("k1", 2) ("k2", 7)
  ks_out: ("k2", 7)

map(KeyValueMapper)
Transform each record of the input stream into a new record in the output stream (both key and value type can be altered arbitrarily).
  ks_out = ks_in.map( (key, value) -> new KeyValue<>(key, key) );
  ks_in:  ("k1", 2) ("k2", 7)
  ks_out: ("k1", "k1") ("k2", "k2")

groupBy()
Group the records by their current key into a KGroupedStream while preserving the original values.
  ks_out = ks_in.groupBy();
  ks_in:  ("k1", 1) ("k2", 2) ("k1", 3)
  ks_out: ("k1", (("k1", 1), ("k1", 3))) ("k2", (("k2", 2)))

join(KTable, ValueJoiner)
Join records of the input stream with records from the KTable if the keys from the records match. Return a stream of the key and the combined value using ValueJoiner.
  ks_out = ks_in.join( kt, (value1, value2) -> value1 + value2 );
  ks_in:  ("k1", 1) ("k2", 2) ("k3", 3)
  kt:     ("k1", 11) ("k2", 12) ("k4", 13)
  ks_out: ("k1", 12) ("k2", 14)

join(KStream, ValueJoiner, JoinWindows)
Join records of the two streams if the keys match and the timestamps from the records satisfy the time constraint specified by JoinWindows. Return a stream of the key and the combined value using ValueJoiner.
  ks_out = ks1.join( ks2, (value1, value2) -> value1 + value2, JoinWindows.of(100) );
  ks1:    ("k1", 1, 100t) ("k2", 2, 200t) ("k3", 3, 300t)
  ks2:    ("k1", 11, 150t) ("k2", 12, 350t) ("k4", 13, 380t)  (* t indicates a timestamp)
  ks_out: ("k1", 12)


COMMONLY USED OPERATIONS IN KGROUPEDSTREAM

count()
Count the number of records in this stream by the grouped key and return it as a KTable.
  kt = kgs.count();
  kgs: ("k1", (("k1", 1), ("k1", 3))) ("k2", (("k2", 2)))
  kt:  ("k1", 2) ("k2", 1)

reduce(Reducer)
Combine the values of records in this stream by the grouped key and return it as a KTable.
  kt = kgs.reduce( (aggValue, newValue) -> aggValue + newValue );
  kgs: ("k1", (("k1", 1), ("k1", 3))) ("k2", (("k2", 2)))
  kt:  ("k1", 4) ("k2", 2)

windowedBy(Windows)
Further group the records by the timestamp and return it as a TimeWindowedKStream.
  twks = kgs.windowedBy( TimeWindows.of(100) );
  kgs:  ("k1", (("k1", 1, 100t), ("k1", 3, 150t))) ("k2", (("k2", 2, 100t), ("k2", 4, 250t)))  (* t indicates a timestamp)
  twks: ("k1", 100t -- 200t, (("k1", 1, 100t), ("k1", 3, 150t)))
        ("k2", 100t -- 200t, (("k2", 2, 100t)))
        ("k2", 200t -- 300t, (("k2", 4, 250t)))

A similar set of operations is available on KTable and KGroupedTable. You can check the Kafka documentation for more information.

QUERYING THE STATES IN KSTREAMS

While processing the data in real-time, a KStreams application locally maintains the states, such as the word counts in the previous example. Those states can be queried interactively through an API described in the Interactive Queries section of the Kafka documentation. This avoids the need of an external data store for exporting and serving those states.
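As a rough sketch of such an interactive query, assume the count() in the WordCount example was materialized under the store name counts-store (an illustrative name) and that the application uses a Kafka Streams 2.x client:

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

// When building the topology, give the state store a queryable name:
//   .groupBy((key, value) -> value)
//   .count(Materialized.as("counts-store"));

// props holds the usual application.id and bootstrap.servers settings,
// and builder is the StreamsBuilder from the WordCount example.
KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();

// Read the locally maintained count for a given word.
ReadOnlyKeyValueStore<String, Long> store =
    streams.store("counts-store", QueryableStoreTypes.keyValueStore());
Long helloCount = store.get("hello");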
EXACTLY-ONCE PROCESSING IN KSTREAMS

Failures in the brokers or the clients may introduce duplicates during the processing of records. KStreams provides the capability of processing records exactly once even under failures. This can be achieved by simply setting the property processing.guarantee to exactly_once in KStreams.
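In terms of configuration, this is a one-line addition when the Streams properties are assembled. A minimal sketch, with an example application ID and bootstrap server:

import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-app");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

// Enable exactly-once processing for the Streams application.
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG,
    StreamsConfig.EXACTLY_ONCE);

// builder is the StreamsBuilder that defines the application's topology.
KafkaStreams streams = new KafkaStreams(builder.build(), props);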
EXTENDING APACHE KAFKA

While processing data directly into an application via KStreams is functional, many client applications are looking for a lighter-weight interface to Apache Kafka Streams via low-code environments or continuous queries using SQL-like commands. In many cases, developers looking to leverage continuous query functionality are looking for a low-code environment where stream processing can be dynamically accessed, modified, and scaled in real time.

Accessing Apache Kafka data streams via SQL is just one approach to this kind of low-code stream processing, and many commercial vendors (TIBCO Software, Confluent) as well as open-source solutions (Apache Spark) offer SQL access that unlocks Apache Kafka data streams for stream processing.

Apache Kafka is a data distribution platform; it's what you do with the data that is important. Once data is available via Kafka, it can be distributed to many different processing engines, from integration services, event streaming, and AI/ML functions to data analytics.

Figure 4: Leveraging Apache Kafka beyond data distribution

For more information on low-code stream processing options, including SQL access to Apache Kafka, please see the Additional Resources section below.

ONE SIZE DOES NOT FIT ALL

With the increasing popularity of real-time stream processing and the rise of event-driven architectures, a number of alternatives have started to gain traction for real-time data distribution. Apache Kafka is the flavor of choice for distributed, high-volume data streaming; however, many implementations have begun to struggle with building solutions at scale when the application's requirements go beyond a single data center or single location.

So, while Apache Kafka is purpose-built for real-time data distribution and stream processing, it will not fit all the requirements of every enterprise application. Alternatives like Apache Pulsar, Eclipse Mosquitto, and many others may be worth investigating, especially if requirements prioritize large-scale global infrastructure where built-in replication is needed or if native IoT/MQTT support is needed.


For more information on comparisons between Apache Kafka and other data distribution solutions, please see the Additional Resources section below.

CONCLUSION

Apache Kafka has become the de facto standard for high-performance, distributed data streaming. It has a large and growing community of developers, corporations, and applications that are supporting, maintaining, and leveraging it. If you are building an event-driven architecture or looking for a way to stream data in real time, Apache Kafka is a clear leader in providing a proven, robust platform for enabling stream processing and enterprise communications.

ADDITIONAL RESOURCES

∙ Documentation of Apache Kafka
∙ Apache NiFi website
∙ Article on real-time stock processing with Apache NiFi and Apache Kafka
∙ Apache Kafka Summit website
∙ Apache Kafka Mirroring and Replication
∙ Apache Pulsar vs. Apache Kafka O'Reilly eBook
∙ Apache Pulsar website

Written by Bill McLane, TIBCO Messaging Evangelist

With over 20 years of experience building, architecting, and designing large-scale messaging infrastructure, William McLane is one of the thought leaders for global data distribution. William and TIBCO have a history of building mission-critical, real-world data distribution architectures, from powering some of the largest financial services institutions to tracking transportation and logistics operations at global scale. From pub/sub, to point-to-point, to real-time data streaming, William has experience designing, building, and leveraging the right tools for building a nervous system that can connect, augment, and unify your enterprise.
