Apache Kafka

CONTENTS

• Kafka Connect
• Transformations in Connect
• Kafka Streams
• KStreams DSL
• Exactly-Once Processing in KStreams
• And More...

WRITTEN BY TIM SPANN, BIG DATA SOLUTION ENGINEER
Why Apache Kafka

Two trends have emerged in the information technology space. First, the diversity and velocity of the data that an enterprise wants to collect for decision-making continue to grow. Such data include not only transactional records, but also business metrics, IoT data, operational metrics, application logs, etc. Second, there is a growing need to process such collected data in real time.

Before Apache Kafka, there wasn't a system that perfectly met both of the above business needs. Traditional messaging systems are real-time, but weren't designed to handle data at scale. Newer systems such as Hadoop are much more scalable, but were designed for batch rather than real-time processing. Today, Kafka is in use by more than 40% of Fortune 500 companies across all industries.
2. High availability: Data can be stored redundantly in multiple servers and can survive individual server failure.

3. High scalability: New servers can be added over time to scale out the system.

4. Easy integration with external data sources or data sinks.

5. Built-in real-time processing layer.

TERM        MEANING
partition   Defines a non-overlapping subset of records within a topic.
offset      A unique sequential number assigned to each record within a topic partition.
record      A record contains a key, a value, a timestamp, and a list of headers.
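To make the partition and offset terms concrete, here is a minimal sketch that simulates a partition as an append-only log assigning sequential offsets. The PartitionLog class is hypothetical and greatly simplified, not Kafka's implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of a topic partition: an append-only
// log in which each record is assigned the next sequential offset.
class PartitionLog {
    private final List<String> records = new ArrayList<>();

    // Append a record and return the offset assigned to it.
    long append(String record) {
        records.add(record);
        return records.size() - 1; // offsets start at 0 and only grow
    }

    // Read the record stored at a given offset.
    String read(long offset) {
        return records.get((int) offset);
    }
}

public class PartitionDemo {
    public static void main(String[] args) {
        PartitionLog partition = new PartitionLog();
        System.out.println(partition.append("hello")); // offset 0
        System.out.println(partition.append("world")); // offset 1
        System.out.println(partition.read(0));         // hello
    }
}
```

Each partition maintains its own offset sequence, which is why an offset is only unique within a topic partition.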
Figure 2: Partitions in a topic

2. Start ZooKeeper:

bin/zookeeper-server-start.sh config/zookeeper.properties

3. Start the Kafka broker:

bin/kafka-server-start.sh config/server.properties

4. Create a topic:

bin/kafka-topics.sh --create --bootstrap-server localhost:9092
    --replication-factor 1 --partitions 1 --topic TopicName

5. Produce data:

bin/kafka-console-producer.sh
    --broker-list localhost:9092
    --topic TopicName
hello
World

The following code snippet shows how to produce records to a topic "test" using the Java API:

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer",
    "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer",
    "org.apache.kafka.common.serialization.StringSerializer");
Producer<String, String> producer = new KafkaProducer<>(props);
producer.send(
    new ProducerRecord<String, String>("test", "key", "value"));
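The record's key also determines which partition it lands in: Kafka's default partitioner hashes the key (using murmur2) modulo the number of partitions. The sketch below illustrates the idea using Java's hashCode as a stand-in for the real hash function:

```java
// Simplified illustration of key-based partitioning: records with the
// same key always map to the same partition. Kafka's default partitioner
// uses murmur2 hashing; hashCode() here is only a stand-in.
public class PartitionerDemo {
    static int partitionFor(String key, int numPartitions) {
        // Mask the sign bit so the result is non-negative.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        // The same key always maps to the same partition, which is what
        // preserves per-key ordering.
        System.out.println(partitionFor("key", 3) == partitionFor("key", 3)); // true
    }
}
```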
In the above example, both the key and value are strings, so we are using a StringSerializer. It's possible to customize the serializer when types become more complex. For example, the KafkaAvroSerializer from https://docs.confluent.io/current/schema-registry/docs/serializer-formatter.html allows the user to produce Avro records. A second option for serialization is the open source com.hortonworks.registries.schemaregistry.serdes.avro.kafka.KafkaAvroSerializer available from https://registry-project.readthedocs.io/en/latest/examples.html.

Unlike other messaging systems, a record in Kafka is not removed from the broker immediately after it is consumed. Instead, it is retained according to a configured retention policy. The following table summarizes the two common policies:

RETENTION POLICY   MEANING
retention.ms       How long to retain a record after it is published; older records are discarded.
retention.bytes    The maximum size a partition may grow to before the oldest records are discarded.

Kafka Connect

The second component in Kafka is Kafka Connect, a framework that makes it easy to stream data between Kafka and other systems. As shown in Figure 3, one can deploy a Connect cluster and run various connectors to import data from sources like MySQL, MQ, or Splunk into Kafka, and to export data in Kafka to sinks such as HDFS, S3, and Elasticsearch. A connector can be either of source or sink type.

4. Verify the data in Kafka:

> bin/kafka-console-consumer.sh
    --bootstrap-server localhost:9092
    --topic connect-test
    --from-beginning
{"schema":{"type":"string",

In the above example, the data in the source file test.txt is first streamed into a Kafka topic connect-test through a file source connector. The records in connect-test are then streamed into the destination file test.sink.txt. If a new line is added to test.txt, it will show up immediately in test.sink.txt. Note that we achieve the above by running two connectors without writing any custom code.

The following is a partial list of existing connectors. A more complete list can be found at https://www.confluent.io/product/connectors/. Please note that many of these are not open source and may require paid licenses for usage.

CONNECTOR        TYPE
Elastic Search   sink
HDFS             sink

Transformations in Connect

In step 1 above, we add two transformations, MakeMap and InsertSource, which are implemented by the classes HoistField$Value and InsertField$Value, respectively. The first one wraps each input string in a field named "line". The second one adds an additional field "data_source" that indicates the name of the source file. After applying the transformation logic, the data in the input file is transformed to the following output in step 3:

{"line":"hello","data_source":"test-file-source"}
{"line":"world","data_source":"test-file-source"}

Because the last transformation step is more complex, we implement it with the Streams API (covered in more detail below).
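The effect of these two transformations can be sketched in plain Java. This is a simplified simulation of what HoistField$Value and InsertField$Value do to each record's value, not Connect's actual implementation:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified simulation of two Connect single-message transforms (SMTs):
// HoistField$Value wraps a raw value in a named field, and
// InsertField$Value adds an extra static field.
public class TransformDemo {
    // HoistField: "hello" -> {line=hello}
    static Map<String, Object> hoistField(Object value, String field) {
        Map<String, Object> wrapped = new LinkedHashMap<>();
        wrapped.put(field, value);
        return wrapped;
    }

    // InsertField: {line=hello} -> {line=hello, data_source=...}
    static Map<String, Object> insertField(Map<String, Object> record,
                                           String field, Object value) {
        record.put(field, value);
        return record;
    }

    public static void main(String[] args) {
        Map<String, Object> record =
            insertField(hoistField("hello", "line"),
                        "data_source", "test-file-source");
        System.out.println(record); // {line=hello, data_source=test-file-source}
    }
}
```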
Start a file source connector:

bin/connect-standalone.sh config/connect-file-source.properties

Verify the data in Kafka:

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092

Kafka Connect also exposes a REST API for managing connectors. Some commonly used endpoints are listed below:

GET /connectors/{name}/config    Get configuration parameters for a specific connector
PUT /connectors/{name}/config    Update configuration parameters for a specific connector
GET /connectors/{name}/status    Get the current status of the connector
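These endpoints can be called with any HTTP client. The sketch below only builds a status request using Java's built-in HTTP client; 8083 is Connect's default REST port, and my-connector is a hypothetical connector name. Actually sending the request requires a running Connect worker:

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class ConnectRestDemo {
    public static void main(String[] args) {
        // GET /connectors/{name}/status returns the current status of a
        // connector. "my-connector" is a placeholder name.
        HttpRequest status = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8083/connectors/my-connector/status"))
            .GET()
            .build();
        System.out.println(status.method() + " " + status.uri());
        // To send it against a running worker:
        // HttpClient.newHttpClient().send(status, HttpResponse.BodyHandlers.ofString());
    }
}
```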
Kafka Streams

Kafka Streams is a client library for building real-time applications and microservices, where the input and/or output data is stored in Kafka. The benefits of using Kafka Streams are:

• Less code in the application
• Built-in state management
• Lightweight
• Parallelism and fault tolerance

The most common way of using Kafka Streams is through the Streams DSL, which includes operations such as filtering, joining, grouping, and aggregation. The following code snippet shows the main logic of a Streams example called WordCountDemo:

final Serde<String> stringSerde = Serdes.String();
final Serde<Long> longSerde = Serdes.Long();
StreamsBuilder builder = new StreamsBuilder();
// build a stream from an input topic
KStream<String, String> source = builder.stream(
    "streams-plaintext-input",
    Consumed.with(stringSerde, stringSerde));
KTable<String, Long> counts = source
    .flatMapValues(value -> Arrays.asList(value.toLowerCase().split(" ")))
    .groupBy((key, value) -> value)
    .count();
// convert the output to another topic
counts.toStream().to("streams-wordcount-output",
    Produced.with(stringSerde, longSerde));

The above code first creates a stream from an input topic streams-plaintext-input. It then applies a transformation to split each input line into words. After that, it groups by the words and counts the number of occurrences of each word. Finally, the results are written to an output topic streams-wordcount-output.

The following are the steps to run the example code.

Run the stream application:

bin/kafka-run-class.sh org.apache.kafka.streams.examples.wordcount.WordCountDemo

Produce some data in the input topic:

bin/kafka-console-producer.sh
    --broker-list localhost:9092
    --topic streams-plaintext-input
hello world

Verify the data in the output topic:

bin/kafka-console-consumer.sh
    --bootstrap-server localhost:9092
    --topic streams-wordcount-output
    --from-beginning
    --formatter kafka.tools.DefaultMessageFormatter
    --property print.key=true
    --property print.value=true
    --property key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
    --property value.deserializer=org.apache.kafka.common.serialization.LongDeserializer
hello 1
world 1

KStream vs. KTable

There are two key concepts in Kafka Streams: KStream and KTable. A topic can be viewed as either of the two. Their differences are summarized below.

KSTREAM
• Concept: Each record is treated as an append to the stream.
• Usage: Model append-only data such as click streams.

KTABLE
• Concept: Each record is treated as an update to an existing key.
• Usage: Model updatable reference data such as user profiles.

The following example illustrates the difference between the two: when two records share the same key "k1", the KTable view treats the second record as an update to the first record. Therefore, only the second record is retained, and an aggregate such as a sum over the values is 5 instead of the total over both records.
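The append-versus-update distinction can be simulated in plain Java. The sample records ("k1", 2) and ("k1", 5) below are made up for illustration; summing over the stream view counts both records, while the table view keeps only the latest value per key:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative simulation of KStream vs. KTable semantics; the sample
// values are made up and this is not the Streams runtime.
public class StreamVsTableDemo {
    // KStream view: every record is an append, so all values contribute.
    static int streamSum(List<Map.Entry<String, Integer>> records) {
        int sum = 0;
        for (Map.Entry<String, Integer> r : records) sum += r.getValue();
        return sum;
    }

    // KTable view: a later record with the same key replaces the earlier one.
    static int tableSum(List<Map.Entry<String, Integer>> records) {
        Map<String, Integer> table = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> r : records) table.put(r.getKey(), r.getValue());
        return table.values().stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> records =
            List.of(Map.entry("k1", 2), Map.entry("k1", 5));
        System.out.println(streamSum(records)); // 7: both records counted
        System.out.println(tableSum(records));  // 5: only the latest "k1" survives
    }
}
```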
KStreams DSL

The following tables show a list of common operations available in Kafka Streams.

COMMONLY USED OPERATIONS IN KSTREAM

filter(Predicate)
Create a new KStream that consists of all records of this stream that satisfy the given predicate.
    ks_out = ks_in.filter( (key, value) -> value > 5 );
    ks_in:  ("k1", 2) ("k2", 7)
    ks_out: ("k2", 7)

map(KeyValueMapper)
Transform each record of the input stream into a new record in the output stream (both key and value type can be altered arbitrarily).
    ks_out = ks_in.map( (key, value) -> new KeyValue<>(key, key) );
    ks_in:  ("k1", 2) ("k2", 7)
    ks_out: ("k1", "k1") ("k2", "k2")

groupBy()
Group the records by their current key into a KGroupedStream while preserving the original values.
    ks_out = ks_in.groupBy();
    ks_in:  ("k1", 1) ("k2", 2) ("k1", 3)
    ks_out: ("k1", (("k1", 1), ("k1", 3))) ("k2", (("k2", 2)))

join(KTable, ValueJoiner)
Join records of the input stream with records from the KTable if the keys from the records match. Return a stream of the key and the combined value using ValueJoiner.
    ks_out = ks_in.join( kt, (value1, value2) -> value1 + value2 );
    ks_in:  ("k1", 1) ("k2", 2) ("k3", 3)
    kt:     ("k1", 11) ("k2", 12) ("k4", 13)
    ks_out: ("k1", 12) ("k2", 14)

join(KStream, ValueJoiner, JoinWindows)
Join records of the two streams if the keys match and the timestamps from the records satisfy the time constraint specified by JoinWindows. Return a stream of the key and the combined value using ValueJoiner.
    ks_out = ks1.join( ks2, (value1, value2) -> value1 + value2, JoinWindows.of(100) );
    ks1:    ("k1", 1, 100t) ("k2", 2, 200t) ("k3", 3, 300t)
    ks2:    ("k1", 11, 150t) ("k2", 12, 350t) ("k4", 13, 380t)   * t indicates a timestamp
    ks_out: ("k1", 12)

COMMONLY USED OPERATIONS IN KGROUPEDSTREAM

count()
Count the number of records in this stream by the grouped key and return it as a KTable.
    kt = kgs.count();
    kgs: ("k1", (("k1", 1), ("k1", 3))) ("k2", (("k2", 2)))
    kt:  ("k1", 2) ("k2", 1)

reduce(Reducer)
Combine the values of records in this stream by the grouped key and return it as a KTable.
    kt = kgs.reduce( (aggValue, newValue) -> aggValue + newValue );
    kgs: ("k1", (("k1", 1), ("k1", 3))) ("k2", (("k2", 2)))
    kt:  ("k1", 4) ("k2", 2)

windowedBy(Windows)
Further group the records by their timestamp and return it as a TimeWindowedKStream.
    twks = kgs.windowedBy( TimeWindows.of(100) );
    kgs:  ("k1", (("k1", 1, 100t), ("k1", 3, 150t))) ("k2", (("k2", 2, 100t), ("k2", 4, 250t)))   * t indicates a timestamp
    twks: ("k1", 100t -- 200t, (("k1", 1, 100t), ("k1", 3, 150t)))
          ("k2", 100t -- 200t, (("k2", 2, 100t)))
          ("k2", 200t -- 300t, (("k2", 4, 250t)))

A similar set of operations is available on KTable and KGroupedTable. You can check the Kafka documentation for more information.

Querying the States in KStreams

While processing data in real time, a KStreams application locally maintains states such as the word counts in the previous example. Those states can be queried interactively through an API described in the Interactive Queries section of the Kafka documentation. This avoids the need for an external data store to export and serve those states.

Exactly-Once Processing in KStreams

Failures in the brokers or the clients may introduce duplicates during the processing of records. KStreams provides the capability of processing records exactly once, even under failures. This can be achieved by simply setting the property processing.guarantee to exactly_once in KStreams. More details on exactly-once processing can be found in the Kafka Confluence Space.

KSQL

KSQL is an open-source streaming SQL engine that implements continuous, interactive queries against Apache Kafka. It's built on the Kafka Streams API and further simplifies the job of a developer. Currently, KSQL is not part of the Apache Kafka project, but it is available under the Apache 2.0 license.

To see how KSQL works, let's first download it and prepare some data sets.

Clone the KSQL repository and compile the code:

> git clone [email protected]:confluentinc/ksql.git
Generate some sample data (the same datagen tool is run once for each topic):

> java -jar ksql-examples/target/ksql-examples-0.1-SNAPSHOT-standalone.jar
    quickstart=pageviews format=delimited topic=pageviews maxInterval=10000
> java -jar ksql-examples/target/ksql-examples-0.1-SNAPSHOT-standalone.jar
    quickstart=users format=json topic=users maxInterval=10000

Next, let's define the schema of the input topics. Similar to Kafka Streams, one can define a schema as either a stream or a table.

Start the KSQL CLI:

./bin/ksql-cli local

Create a KStream from a topic:

ksql> CREATE STREAM pageviews_stream
    (viewtime bigint,
     userid varchar,
     pageid varchar)
    WITH (kafka_topic='pageviews',
          value_format='JSON');

Note that in the above, each schema always contains two built-in fields, ROWTIME and ROWKEY. They correspond to the timestamp and the key of the record, respectively. Finally, let's run some KSQL queries using the data and the schema that we prepared.

Select a field from a stream:

ksql> SELECT pageid
    FROM pageviews_stream
    LIMIT 3;
Page_24
Page_73
Page_78

Join a stream and a table:

ksql> CREATE STREAM pageviews_female AS
    SELECT users_table.userid AS userid, pageid, regionid, gender
    FROM pageviews_stream
    LEFT JOIN users_table ON pageviews_stream.userid = users_table.userid
    WHERE gender = 'FEMALE';
ksql> SELECT userid, pageid,
Additional Resources

• Documentation of Apache Kafka: kafka.apache.org/documentation/
Devada, Inc.
600 Park Offices Drive, Suite 150
Research Triangle Park, NC
888.678.0399 / 919.678.0300

DZone communities deliver over 6 million pages each month to more than 3.3 million software developers, architects, and decision makers. DZone offers something for everyone, including news, tutorials, cheat sheets, research guides, feature articles, source code, and more. "DZone is a developer's dream," says PC Magazine.

Copyright © 2019 DZone, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.