Apache Kafka Essentials

CONTENTS

∙ Introduction
∙ About Apache Kafka
∙ Kafka Connect
∙ Kafka Streams
∙ Conclusion
∙ Additional Resources
INTRODUCTION

Two trends have emerged in the information technology space. First, the diversity and velocity of the data that an enterprise wants to collect for decision-making continue to grow. Second, there is a growing need for an enterprise to make decisions in real time based on that collected data. For example, financial institutions want to not only detect fraud immediately, but also offer a better banking experience through features like real-time alerting, real-time product recommendations, and more effective customer service.

Apache Kafka is a streaming engine for collecting, caching, and processing high volumes of data in real time. As illustrated in Figure 1, Kafka typically serves as a central real-time hub for data within an enterprise.

The main benefits of Kafka are:

1. High throughput: Each server is capable of handling hundreds of MB of data per second.
2. High availability: Data can be stored redundantly on multiple servers and can survive individual server failures.
3. High scalability: New servers can be added over time to scale out the system.
4. Easy integration with external data sources or data sinks.
APACHE KAFKA

Figure 1: Apache Kafka as a central real-time hub

The key concepts in Kafka are summarized below:

Topic       Defines a logical name for producing and consuming records.
Partition   Defines a non-overlapping subset of records within a topic.
Offset      A unique sequential number assigned to each record within a topic partition.
Record      A record contains a key, a value, a timestamp, and a list of headers.
Broker      Server where records are stored. Multiple brokers can be used to form a cluster.

Figure 2 depicts a topic with two partitions. Partition 0 has 5 records, with offsets from 0 to 4, and partition 1 has 4 records, with offsets from 0 to 3.

QUICKSTART FOR APACHE KAFKA

It's easy to get started with Kafka. The following are the steps to get Kafka running in your environment (a sketch of the corresponding shell commands follows the list):

1. Download the latest Apache Kafka binary distribution from http://kafka.apache.org/downloads and untar it.
2. Start the ZooKeeper server.
3. Start the Kafka broker.
4. Create a topic.
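For reference, below is a minimal sketch of the shell commands behind steps 2-4, assuming the scripts and default configuration files that ship with the binary distribution. The topic name test is reused by the code examples that follow; on older Kafka releases, topics are created with a --zookeeper flag instead of --bootstrap-server.

# 2. Start the ZooKeeper server
> bin/zookeeper-server-start.sh config/zookeeper.properties

# 3. Start the Kafka broker
> bin/kafka-server-start.sh config/server.properties

# 4. Create a topic named "test"
> bin/kafka-topics.sh --create --topic test \
    --bootstrap-server localhost:9092 \
    --partitions 1 --replication-factor 1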
props.put("value.serializer",
Records within a partition are always delivered to the consumer in
"org.apache.kafka.common.serialization. offset order. By saving the offset of the last consumed record from
StringSerializer"); each partition, the consumer can resume from where it left off after a
restart. In the example above, we use the commitSync() API to save
Producer<String, String> producer = new
the offsets explicitly after consuming a batch of records. One can also
save the offsets automatically by setting the property enable.auto.
KafkaProducer<>(props);
commit to true.
producer.send(
A record in Kafka is not removed from the broker immediately after
new ProducerRecord<String, String>("test", "key", it is consumed. Instead, it is retained according to a configured
"value")); retention policy. The following table summarizes the two common
policies:
In the above example, both the key and value are strings, so we are Retention Policy Meaning
using a StringSerializer . It’s possible to customize the serializer The number of hours to keep a record on
log.retention.hours
when types become more complex. the broker.
The maximum size of records retained in
The following code snippet shows how to consume records with log.retention.bytes
a partition.
string key and value in Java.
new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("test"));
while (true) {
consumer.poll(100);
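As noted above, the serializer can be customized when the record types become more complex. Below is a minimal sketch of one way to do this on recent Kafka client versions (2.x or later, where configure() and close() have default implementations). The Payment class, its fields, and the UTF-8 encoding scheme are illustrative assumptions, not part of the original example.

import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.serialization.Serializer;

// A hypothetical value type, used only for illustration.
class Payment {
    final String accountId;
    final long amountCents;

    Payment(String accountId, long amountCents) {
        this.accountId = accountId;
        this.amountCents = amountCents;
    }
}

// A custom serializer that encodes a Payment as a UTF-8 string.
public class PaymentSerializer implements Serializer<Payment> {
    @Override
    public byte[] serialize(String topic, Payment data) {
        if (data == null) {
            return null;
        }
        String encoded = data.accountId + ":" + data.amountCents;
        return encoded.getBytes(StandardCharsets.UTF_8);
    }
}

To use it, set value.serializer to the fully qualified class name of PaymentSerializer and implement a matching Deserializer on the consumer side.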
KAFKA CONNECT

The following steps show how to run the existing file connectors that ship with Kafka:

1. Prepare some data in a source file:

   > echo -e "hello\nworld" > test.txt

2. Start a file source and a file sink connector:

   > bin/connect-standalone.sh config/connect-standalone.properties \
       config/connect-file-source.properties \
       config/connect-file-sink.properties

3. Verify the data in the destination file:

   > more test.sink.txt
   hello
   world

Each line of the source file is stored in the intermediate Kafka topic connect-test as a JSON record such as:

   {"schema":{"type":"string","optional":false},"payload":"world"}

In the example above, the data in the source file test.txt is first streamed into a Kafka topic connect-test through a file source connector. The records in connect-test are then streamed into the destination file test.sink.txt. If a new line is added to test.txt, it will show up immediately in test.sink.txt. Note that we achieve this by running two connectors without writing any custom code.
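For reference, the file source connector started in step 2 is configured by config/connect-file-source.properties, which is also where the test.txt file name and the connect-test topic come from. The listing below is a sketch of what that file typically contains; the exact contents may differ slightly between Kafka versions.

name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=test.txt
topic=connect-test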
The file source connector can also transform each record before it is written to Kafka:

1. Add the following lines to connect-file-source.properties:

   transforms=MakeMap, InsertSource
   transforms.MakeMap.type=org.apache.kafka.connect.transforms.HoistField$Value
   transforms.MakeMap.field=line
   transforms.InsertSource.type=org.apache.kafka.connect.transforms.InsertField$Value
   transforms.InsertSource.static.field=data_source
   transforms.InsertSource.static.value=test-file-source

2. Restart the file source connector:

   > bin/connect-standalone.sh config/connect-standalone.properties \
       config/connect-file-source.properties

3. Verify the transformed records in the Kafka topic connect-test:

   {"line":"hello","data_source":"test-file-source"}
   {"line":"world","data_source":"test-file-source"}

The MakeMap transformation wraps each input line into a record with a single field named line, and the InsertSource transformation adds a static data_source field identifying where the record came from.
To perform more complex processing than these simple per-record transformations, the records in a Kafka topic can be processed with the Streams API (covered in more detail below).

KAFKA STREAMS

The quickest way to try Kafka Streams is to run the bundled word count example and read its output with the console consumer:

> bin/kafka-run-class.sh org.apache.kafka.streams.examples.wordcount.WordCountDemo

> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
    --topic streams-wordcount-output \
    --formatter kafka.tools.DefaultMessageFormatter \
    --property print.key=true \
    --property key.deserializer=org.apache.kafka.common.serialization.StringDeserializer \
    --property value.deserializer=org.apache.kafka.common.serialization.LongDeserializer

The most common way of using Kafka Streams is through the Streams DSL, which includes operations such as filtering, joining, grouping, and aggregation. The following code snippet shows the core logic of the word count example written in the Streams DSL:

final Serde<String> stringSerde = Serdes.String();
final Serde<Long> longSerde = Serdes.Long();

StreamsBuilder builder = new StreamsBuilder();

// build a stream from an input topic
KStream<String, String> source = builder.stream(
    "streams-plaintext-input",
    Consumed.with(stringSerde, stringSerde));

KTable<String, Long> counts = source
    .flatMapValues(value -> Arrays.asList(value.toLowerCase().split(" ")))
    .groupBy((key, value) -> value)
    .count();

// convert the output to another topic
counts.toStream().to("streams-wordcount-output",
    Produced.with(stringSerde, longSerde));
KSTREAMS DSL

COMMONLY USED OPERATIONS IN KGROUPEDSTREAM

count()
    Count the number of records in this stream by the grouped key and return it as a KTable.
    Example: kt = kgs.count();
        kgs: ("k1", (("k1", 1), ("k1", 3))) ("k2", (("k2", 2)))
        kt:  ("k1", 2) ("k2", 1)

reduce(Reducer)
    Combine the values of records in this stream by the grouped key and return it as a KTable.
    Example: kt = kgs.reduce((aggValue, newValue) -> aggValue + newValue);
        kgs: ("k1", (("k1", 1), ("k1", 3))) ("k2", (("k2", 2)))
        kt:  ("k1", 4) ("k2", 2)

windowedBy(Windows)
    Further group the records by the timestamp and return it as a TimeWindowedKStream.
    Example: twks = kgs.windowedBy(TimeWindows.of(100));
        kgs: ("k1", (("k1", 1, 100t), ("k1", 3, 150t)))
             ("k2", (("k2", 2, 100t), ("k2", 4, 250t)))   * t indicates a timestamp
        twks: ("k1", 100t -- 200t, (("k1", 1, 100t), ("k1", 3, 150t)))
              ("k2", 100t -- 200t, (("k2", 2, 100t)))
              ("k2", 200t -- 300t, (("k2", 4, 250t)))
QUERYING THE STATES IN KSTREAMS

While processing the data in real time, a KStreams application locally maintains state, such as the word counts in the previous example. That state can be queried interactively through an API described in the Interactive Queries section of the Kafka documentation, which avoids the need for an external data store for exporting and serving it.
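A brief sketch of such an interactive query is shown below. It assumes a recent Kafka Streams version, a running KafkaStreams instance named streams, and that the count() in the word count example was materialized into a named store, e.g. .count(Materialized.as("counts-store")); the store name is an illustrative assumption.

import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

// fetch a read-only view of the locally maintained "counts-store" state store
ReadOnlyKeyValueStore<String, Long> store = streams.store(
    StoreQueryParameters.fromNameAndType(
        "counts-store", QueryableStoreTypes.keyValueStore()));

// current count for the word "hello" (null if this instance holds no such key)
Long helloCount = store.get("hello");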
EXACTLY-ONCE PROCESSING IN KSTREAMS

Failures in the brokers or the clients may introduce duplicates during the processing of records. KStreams provides the capability of processing records exactly once, even under failures. This can be achieved by simply setting the property processing.guarantee to exactly_once in KStreams.
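Concretely, exactly-once processing is enabled with a single configuration property on the Streams application. A minimal sketch is shown below; the application ID and bootstrap server are illustrative, and newer Kafka releases also accept the exactly_once_v2 setting.

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-app");       // assumed name
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed address
// equivalent to processing.guarantee=exactly_once
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);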
EXTENDING APACHE KAFKA

While processing data directly in an application via KStreams is functional, many client applications are looking for a lighter-weight interface to Apache Kafka Streams via low-code environments or continuous queries using SQL-like commands. In many cases, developers looking to leverage continuous query functionality want a low-code environment where stream processing can be dynamically accessed, modified, and scaled in real time. Accessing Apache Kafka data streams via SQL is just one approach to this low-code stream processing, and many commercial vendors (TIBCO Software, Confluent) as well as open-source solutions (Apache Spark) offer SQL access that unlocks Apache Kafka data streams for stream processing.

Apache Kafka is a data distribution platform; it's what you do with the data that is important. Once data is available via Kafka, it can be distributed to many different processing engines, from integration services, event streaming, and AI/ML functions to data analytics.

Figure 4: Leveraging Apache Kafka beyond data distribution

For more information on low-code stream processing options, including SQL access to Apache Kafka, please see the Additional Resources section below.

ONE SIZE DOES NOT FIT ALL

With the increasing popularity of real-time stream processing and the rise of event-driven architectures, a number of alternatives have started to gain traction for real-time data distribution. Apache Kafka is the flavor of choice for distributed, high-volume data streaming; however, many implementations have begun to struggle with building solutions at scale when the application's requirements go beyond a single data center or single location.

So, while Apache Kafka is purpose-built for real-time data distribution and stream processing, it will not fit all the requirements of every enterprise application. Alternatives like Apache Pulsar, Eclipse Mosquitto, and many others may be worth exploring. For more information on comparisons between Apache Kafka and these alternatives, please see the Additional Resources section below.

CONCLUSION

Apache Kafka has become the de-facto standard for high-performance, distributed data streaming. It has a large and growing community of developers, corporations, and applications.

ADDITIONAL RESOURCES

∙ Apache NiFi website
∙ Apache Kafka Mirroring and Replication
∙ Apache Pulsar Vs. Apache Kafka O'Reilly eBook
With over 20 years of experience building, architecting, and designing large-scale messaging infrastructure, William McLane is one of the thought leaders for global data distribution. William and TIBCO have a history of building mission-critical, real-world data distribution architectures that power everything from some of the largest financial services institutions to global-scale transportation and logistics tracking operations. From Pub/Sub, to point-to-point, to real-time data streaming, William has experience designing, building, and leveraging the right tools for building a nervous system that can connect, augment, and unify your enterprise.