WELCOME TO KAFKA
Avi Levi
123avi@gmail.com
https://www.linkedin.com/in/leviavi/
AGENDA
• Who am I
• Short Intro to Kafka
• Core concepts
• Please download
• https://www.apache.org/dyn/closer.cgi?path=/kafka/0.11.0.0/kafka_2.11-0.11.0.0.tgz
• tar -xzf kafka_2.11-0.11.0.0.tgz
• cd kafka_2.11-0.11.0.0
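Not on the slides, but to follow along you will also need Zookeeper and a broker running; with the defaults shipped in the download that is:

bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties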
APACHE KAFKA
A high throughput distributed messaging system
http://kafka.apache.org
• What do we want to solve?
Created at LinkedIn, open source since 2011
Implemented in Scala (some in Java)
Images from Jay Kreps' blog
The problem
• Hard to track
• Loss of data
• Hard to scale
Bottom line: your data pipeline looks like spaghetti.
Images from Jay Kreps' blog
The Solution
Diagram: many producers publish to a central message broker, and many consumers read from it.
• Enterprise messaging system
• Stream processing API
• Provides connectors to push/pull data from databases etc.
Core concepts
• Producer
• Message
• Consumer
• Kafka broker
• Cluster
Diagram: producers and consumers connect to a cluster of brokers coordinated by Zookeeper.
Problem
The producers can push any data, but how can a consumer consume only the data it is interested in?
Topics
Diagram: a Kafka cluster holding two topics, Accounts and Orders.
SCALE PROBLEM 1
• What if the data of a single topic is bigger than our local storage?
Partitions
Diagram: a cluster of Broker 1 - Broker 4; a topic split into partitions 0-3, each partition with its own id. Partition 2 has a leader on one broker and followers maintaining replicas on two others.
We can split the data of a topic into partitions; this way it can be distributed to other machines in the cluster (how many partitions? That is yours to decide).

Partitions are managed in a leader-follower style. The leader accepts all interactions from producers and consumers; the followers maintain the copies (replicas).

Fault tolerance - what if one of the partitions crashes? Can we tolerate data loss? We set the replication factor per topic; a replica's id is the same as the broker id.
OFFSET
• Sequence number assigned to a message in a partition
Offsets are per partition.

To locate a message directly you need the topic name, partition number and offset.
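A minimal sketch of locating a message directly with those three coordinates (the topic name, partition number and offset here are made up):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class SeekExample {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      TopicPartition tp = new TopicPartition("test", 0); // topic name + partition number
      consumer.assign(Collections.singletonList(tp));    // manual assignment, no consumer group involved
      consumer.seek(tp, 42L);                            // jump straight to offset 42
      for (ConsumerRecord<String, String> record : consumer.poll(1000))
        System.out.println(record.offset() + ": " + record.value());
    }
  }
}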
SCALE PROBLEM 2
• Now that we can split the data into partitions - no more storage limitation
• We can add more producers
REPLICAS IN ACTION
• bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 4 --topic test4 (in case we have only two brokers we will get an error)
Try to create a topic with multiple replicas.

Add more brokers - copy the server.properties file (x3) to server[n].properties

Change broker.id (the default is 0)

Change the broker port (only needed if the brokers share the same machine)

Change the broker log directory
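A sketch of the per-broker edits, assuming all brokers run on one machine (the file name, port and path are illustrative):

# config/server-1.properties (a copy of config/server.properties)
broker.id=1
listeners=PLAINTEXT://:9093
log.dirs=/tmp/kafka-logs-1

# then start the extra broker:
bin/kafka-server-start.sh config/server-1.properties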
Describe topic command
• bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic test
Shows topic name, partitions, replicas ...

Partition id = 0: the leader is broker 0 (which means it is responsible for all communication with consumers and producers); it maintains the first replica, and brokers 2 and 1 are the followers.

Partition id = 1: the leader is broker 1

Partition id = 2: the leader is broker 2

Partition id = 3: the leader is broker 0

Isr = in-sync replicas: all replicas that are in sync with the leader (in this case all replicas are in sync)
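Illustrative output matching the description above (the actual leader/replica assignment will differ on your machine):

Topic:test4  PartitionCount:4  ReplicationFactor:3  Configs:
  Topic: test4  Partition: 0  Leader: 0  Replicas: 0,2,1  Isr: 0,2,1
  Topic: test4  Partition: 1  Leader: 1  Replicas: 1,0,2  Isr: 1,0,2
  Topic: test4  Partition: 2  Leader: 2  Replicas: 2,1,0  Isr: 2,1,0
  Topic: test4  Partition: 3  Leader: 0  Replicas: 0,1,2  Isr: 0,1,2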
Configuration settings worth mentioning
• Read the configuration documentation: https://kafka.apache.org/documentation/#configuration
• zookeeper.connect - essential for creating a cluster; Zookeeper is what joins all brokers into one cluster
• delete.topic.enable = false
• auto.create.topics.enable = true; I do not recommend using that in production
• default.replication.factor = 1
• num.partitions = 1
• log.retention.hours = 168 (7 days, the default; log.retention.ms overrides it when set)
• log.retention.bytes
zookeeper.connect = connection string - the Zookeeper address. It is essential that each broker knows this address.

delete.topic.enable = false by default, to secure production environments.

default.replication.factor & num.partitions are relevant only if auto create is true.

log.retention: Kafka is not a database, and data should be cleared.

Retention by time is faster than by size (it doesn't need to calculate the size; size is per partition, not per topic).
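A sketch of how these look in config/server.properties (the values are examples, not recommendations):

zookeeper.connect=localhost:2181
delete.topic.enable=true         # default is false
auto.create.topics.enable=false  # default is true; risky in production
default.replication.factor=1
num.partitions=1
log.retention.hours=168          # 7 days; log.retention.ms takes precedence if set
#log.retention.bytes=1073741824  # per-partition size limit, unset by default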
Groups
• A group of consumers that share the work, consuming the messages of a topic as a single unit
Diagram: partitions P0-P3 in a Kafka cluster, divided between consumer group A and consumer group B.
* The number of partitions sets the maximum consumer-group parallelism; if we have more consumers than partitions in the same group, we will get unemployment ...

* To avoid duplicate reads, a partition is never shared among members of the same group at the same time.

* Group coordinator - one of the brokers becomes the group coordinator and manages a list of consumers (the first consumer is the leader); when a member joins/leaves, the coordinator modifies the list and initiates a rebalance (blocking all reads).

* The leader is responsible for executing the rebalance: it takes the list, reassigns partitions to the members and sends the list back to the coordinator.
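You can watch a group's partition assignment with the consumer-groups tool (the group name here is whatever you set as group.id):

bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group accounts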
PARTITIONER
• If a partitioner is defined, use it - you can always create a custom partitioner by implementing the Partitioner interface
• Else, if a key is defined, choose the partition based on the key's hash (caution!)
• Else, distribute the messages in a round-robin fashion
Partition by key - careful, this is not fully reliable: although hashing ensures that a key will always have the same hash, two keys can have the same hash; also, the partition is chosen by hash(key) % numOfPartitions, so if numOfPartitions changes we will get different partitions for the same key.
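A minimal sketch of such a custom partitioner (the routing rule is invented purely for illustration):

import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class VipPartitioner implements Partitioner {
  @Override
  public int partition(String topic, Object key, byte[] keyBytes,
                       Object value, byte[] valueBytes, Cluster cluster) {
    int numPartitions = cluster.partitionCountForTopic(topic);
    if ("vip".equals(key)) return 0; // hypothetical rule: pin the "vip" key to partition 0
    int hash = (key != null ? key : value).hashCode();
    return Math.floorMod(hash, numPartitions); // hash % numOfPartitions, kept non-negative
  }

  @Override public void close() {}

  @Override public void configure(Map<String, ?> configs) {}
}

Wired in via the producer properties: props.put("partitioner.class", VipPartitioner.class.getName());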
* Create a topic with two partitions

* Create three consumers (two with the same group id)

* Place a message: one grouped consumer + the non-grouped one will get the message.

* (Shows round robin) Place another message: the other grouped consumer + the non-grouped one will get the message.

* Stop one grouped consumer and show that the other one in the group will get all the messages.
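These demo steps roughly translate to the console tools (topic and group names are made up; a console consumer without an explicit group.id gets its own auto-generated group, which is what "non-grouped" means here):

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 2 --topic demo
# two consumers in the same group (run in two terminals):
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic demo --consumer-property group.id=groupA
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic demo --consumer-property group.id=groupA
# one "non-grouped" consumer:
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic demo
# and a producer to type messages into:
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic demo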
Producer
• Properties props = new Properties() // map that must include bootstrap.servers (list of brokers), key serializer, value serializer
• Producer<String, String> producer = new KafkaProducer<>(props)
• ProducerRecord<String, String> record = new ProducerRecord<>(topicName, [partition num], [timestamp], [key], value)
• producer.send(record) // .get() for a blocking send, or producer.send(record, callback)
• producer.close() // after sending everything, we need to free resources
• * max.in.flight.requests.per.connection = 5 (for async calls - how many requests may be sent without a response)

* Producer<key type, value type>
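Putting the bullets together, a minimal runnable sketch (the topic name, key and value are made up):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    Producer<String, String> producer = new KafkaProducer<>(props);
    try {
      // send with a callback; RecordMetadata carries the partition and offset on success
      producer.send(new ProducerRecord<>("test", "key-1", "hello kafka"), (metadata, exception) -> {
        if (exception != null) exception.printStackTrace();
        else System.out.println("wrote to partition " + metadata.partition() + " at offset " + metadata.offset());
      });
    } finally {
      producer.close(); // flushes buffered records and frees resources
    }
  }
}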
PRODUCER
Diagram of the send path: Properties → ProducerRecord → serialize → assign partition → partition buffer (Record 1, Record 2, ...) → broker, which returns RecordMetadata on success or an error.
Mostly you will use a generic serializer like Avro, but you can use a custom one.

The producer maintains a partition buffer and sends the messages in batches (the size and time window of the buffer are configurable).

If an error is recoverable it will retry (e.g. the leader is down until a new leader is elected; the number of retries and the time gap between them are configurable).
SENDING APPROACH
• Fire and forget - Kafka is a highly available system, but we might lose a small portion of data
• Synchronous Send - messages are critical and we cannot lose data
• Asynchronous Send - allows handling failures asynchronously; better throughput, but you have an in-flight requests limit
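The three approaches side by side, reusing the producer and record from the sketch above (RecordMetadata comes from org.apache.kafka.clients.producer):

// fire and forget: send and move on; failures surface only via the producer's internal retries
producer.send(record);

// synchronous send: block until the broker acknowledges, or an exception is thrown
RecordMetadata meta = producer.send(record).get();

// asynchronous send: register a callback and keep sending
producer.send(record, (metadata, exception) -> {
  if (exception != null) {
    // handle the failure here (log it, retry, route to a dead-letter store ...)
  }
});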
SOME PRODUCER CONFIGURATIONS
• Bootstrap servers - list of brokers in your cluster (recommended: more than one)
• Key serializer class
• Value serializer class
• acks (0, 1, all)
• retries = 0 (retry.backoff.ms = 100 sets the time interval between retries)
• max.in.flight.requests.per.connection - how many messages you can send asynchronously without getting an acknowledgment; a high number will consume more memory but gain higher throughput
• Check the documentation
acks = 0 - don't wait for acknowledgment (possible loss of data, no retries, highest throughput)

acks = 1 (default) - only the leader responds (a safe choice, but still a small chance of losing data in case the leader crashes before the data is replicated)

acks = all - high reliability but higher latency (waiting for all in-sync replicas)

max.in.flight.requests.per.connection - due to retries you might lose the order of messages; if order is critical set it to 1 or use sync send
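As producer Properties, a "don't lose data, keep ordering" profile might look like this (values are illustrative):

props.put("acks", "all");                               // wait for all in-sync replicas
props.put("retries", 3);                                // default is 0
props.put("retry.backoff.ms", 100);                     // pause between retries
props.put("max.in.flight.requests.per.connection", 1);  // preserve ordering across retries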
Consumer
• Properties props = new Properties(); // best practice is to use a config file
• props.put("bootstrap.servers", "localhost:9092");
• props.put("group.id", "accounts"); // optional, but if omitted the work cannot be shared
• props.put("key.deserializer", StringDeserializer.class.getName());
• props.put("value.deserializer", StringDeserializer.class.getName());
• KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
• consumer.subscribe(Arrays.asList("foo", "bar"));
try {
  while (running) {
    ConsumerRecords<String, String> records = consumer.poll(1000);
    for (ConsumerRecord<String, String> record : records) {
      System.out.println(record.offset() + ": " + record.value());
    }
    consumer.commitAsync();
  }
} finally {
  consumer.close();
}
* The poll method accepts a timeout parameter to establish how long we want to wait for data.

* It actually does a lot of important things, like connecting to the coordinator, getting the partition assignment, fetching messages, sending heartbeats, and much more.

* Each iteration of the loop must complete quickly enough to keep calling poll, otherwise the coordinator will consider the consumer dead and will initiate a rebalance (the limits are configurable).
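The relevant limits are consumer configs; a sketch with the 0.11 defaults (check the documentation for your version):

props.put("session.timeout.ms", 10000);    // dead if the broker sees no heartbeat for this long
props.put("heartbeat.interval.ms", 3000);  // how often the consumer's background thread heartbeats
props.put("max.poll.interval.ms", 300000); // maximum allowed gap between poll() calls before a rebalance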