WELCOME TO KAFKA
Avi Levi
123avi@gmail.com
https://www.linkedin.com/in/leviavi/
AGENDA
• Who am I
• Short Intro to Kafka
• Core concepts
• Please download
• https://www.apache.org/dyn/closer.cgi?path=/kafka/0.11.0.0/kafka_2.11-0.11.0.0.tgz
• tar -xzf kafka_2.11-0.11.0.0.tgz
• cd kafka_2.11-0.11.0.0
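Not on the slides, but to follow along you will also need Zookeeper and a broker running; with the defaults shipped in the download that is:

bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties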
APACHE KAFKA
A high throughput distributed messaging system
http://kafka.apache.org
• What do we want to solve?
Created at LinkedIn, open source since 2011
Implemented in Scala (some in Java)
Images from Jay Kreps' blog
The problem
• Hard to track
• Loss of data
• Hard to scale
Bottom line: your data pipeline looks like spaghetti.
Images from Jay Kreps' blog
The Solution
Diagram: many producers publish to a central message broker, and many consumers read from it.
• Enterprise messaging system
• Stream processing API
• Provides connectors to push/pull data from databases etc.
Core concepts
• Producer
• Message
• Consumer
• Kafka broker
• Cluster
Diagram: producers and consumers connect to a cluster of brokers coordinated by Zookeeper.
Problem
The producers can push any data, but how can a consumer consume only the data it is interested in?
Topics
Diagram: a Kafka cluster holding two topics, Accounts and Orders.
SCALE PROBLEM 1
• What if the data of a single topic is bigger than our local storage?
Partitions
Diagram: a cluster of Broker 1 - Broker 4; a topic split into partitions 0-3, each partition with its own id. Partition 2 has a leader on one broker and followers maintaining replicas on two others.
We can split the data of a topic into partitions; this way it can be distributed to other machines in the cluster (how many partitions? That is yours to decide).

Partitions are managed in a leader-follower style. The leader accepts all interactions from producers and consumers; the followers maintain the copies (replicas).

Fault tolerance - what if one of the partitions crashes? Can we tolerate data loss? We set the replication factor per topic; a replica's id is the same as the broker id.
OFFSET
• Sequence number assigned to a message in a partition
Offsets are per partition.

To locate a message directly you need the topic name, partition number and offset.
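A minimal sketch of locating a message directly with those three coordinates (the topic name, partition number and offset here are made up):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class SeekExample {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      TopicPartition tp = new TopicPartition("test", 0); // topic name + partition number
      consumer.assign(Collections.singletonList(tp));    // manual assignment, no consumer group involved
      consumer.seek(tp, 42L);                            // jump straight to offset 42
      for (ConsumerRecord<String, String> record : consumer.poll(1000))
        System.out.println(record.offset() + ": " + record.value());
    }
  }
}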
SCALE PROBLEM 2
• Now that we can split the data into partitions - no more storage limitation
• We can add more producers
REPLICAS IN ACTION
• bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 4 --topic test4 (in case we have only two brokers we will get an error)
Try to create a topic with multiple replicas.

Add more brokers - copy the server.properties file (x3) to server[n].properties

Change broker.id (the default is 0)

Change the broker port (only needed if the brokers share the same machine)

Change the broker log directory
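A sketch of the per-broker edits, assuming all brokers run on one machine (the file name, port and path are illustrative):

# config/server-1.properties (a copy of config/server.properties)
broker.id=1
listeners=PLAINTEXT://:9093
log.dirs=/tmp/kafka-logs-1

# then start the extra broker:
bin/kafka-server-start.sh config/server-1.properties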
Describe topic command
• bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic test
Shows topic name, partitions, replicas ...

Partition id = 0: the leader is broker 0 (which means it is responsible for all communication with consumers and producers); it maintains the first replica, and brokers 2 and 1 are the followers.

Partition id = 1: the leader is broker 1

Partition id = 2: the leader is broker 2

Partition id = 3: the leader is broker 0

Isr = in-sync replicas: all replicas that are in sync with the leader (in this case all replicas are in sync)
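Illustrative output matching the description above (the actual leader/replica assignment will differ on your machine):

Topic:test4  PartitionCount:4  ReplicationFactor:3  Configs:
  Topic: test4  Partition: 0  Leader: 0  Replicas: 0,2,1  Isr: 0,2,1
  Topic: test4  Partition: 1  Leader: 1  Replicas: 1,0,2  Isr: 1,0,2
  Topic: test4  Partition: 2  Leader: 2  Replicas: 2,1,0  Isr: 2,1,0
  Topic: test4  Partition: 3  Leader: 0  Replicas: 0,1,2  Isr: 0,1,2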
Configuration settings worth mentioning
• Read the configuration documentation: https://kafka.apache.org/documentation/#configuration
• zookeeper.connect - essential for creating a cluster; Zookeeper is what joins all brokers into one cluster
• delete.topic.enable = false
• auto.create.topics.enable = true; I do not recommend using that in production
• default.replication.factor = 1
• num.partitions = 1
• log.retention.hours = 168 (7 days, the default; log.retention.ms overrides it when set)
• log.retention.bytes
zookeeper.connect = connection string - the Zookeeper address. It is essential that each broker knows this address.

delete.topic.enable = false by default, to secure production environments.

default.replication.factor & num.partitions are relevant only if auto create is true.

log.retention: Kafka is not a database, and data should be cleared.

Retention by time is faster than by size (it doesn't need to calculate the size; size is per partition, not per topic).
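A sketch of how these look in config/server.properties (the values are examples, not recommendations):

zookeeper.connect=localhost:2181
delete.topic.enable=true         # default is false
auto.create.topics.enable=false  # default is true; risky in production
default.replication.factor=1
num.partitions=1
log.retention.hours=168          # 7 days; log.retention.ms takes precedence if set
#log.retention.bytes=1073741824  # per-partition size limit, unset by default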
Groups
• A group of consumers that share the work, consuming the messages of a topic as a single unit
Diagram: partitions P0-P3 in a Kafka cluster, divided between consumer group A and consumer group B.
* The number of partitions sets the maximum consumer-group parallelism; if we have more consumers than partitions in the same group, we will get unemployment ...

* To avoid duplicate reads, a partition is never shared among members of the same group at the same time.

* Group coordinator - one of the brokers becomes the group coordinator and manages a list of consumers (the first consumer is the leader); when a member joins/leaves, the coordinator modifies the list and initiates a rebalance (blocking all reads).

* The leader is responsible for executing the rebalance: it takes the list, reassigns partitions to the members and sends the list back to the coordinator.
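You can watch a group's partition assignment with the consumer-groups tool (the group name here is whatever you set as group.id):

bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group accounts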
PARTITIONER
• If a partitioner is defined, use it - you can always create a custom partitioner by implementing the Partitioner interface
• Else, if a key is defined, choose the partition based on the key's hash (caution!)
• Else, distribute the messages in a round-robin fashion
Partition by key - careful, this is not fully reliable: although hashing ensures that a key will always have the same hash, two keys can have the same hash; also, the partition is chosen by hash(key) % numOfPartitions, so if numOfPartitions changes we will get different partitions for the same key.
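A minimal sketch of such a custom partitioner (the routing rule is invented purely for illustration):

import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class VipPartitioner implements Partitioner {
  @Override
  public int partition(String topic, Object key, byte[] keyBytes,
                       Object value, byte[] valueBytes, Cluster cluster) {
    int numPartitions = cluster.partitionCountForTopic(topic);
    if ("vip".equals(key)) return 0; // hypothetical rule: pin the "vip" key to partition 0
    int hash = (key != null ? key : value).hashCode();
    return Math.floorMod(hash, numPartitions); // hash % numOfPartitions, kept non-negative
  }

  @Override public void close() {}

  @Override public void configure(Map<String, ?> configs) {}
}

Wired in via the producer properties: props.put("partitioner.class", VipPartitioner.class.getName());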
* Create a topic with two partitions

* Create three consumers (two with the same group id)

* Place a message: one grouped consumer + the non-grouped one will get the message.

* (Shows round robin) Place another message: the other grouped consumer + the non-grouped one will get the message.

* Stop one grouped consumer and show that the other one in the group will get all the messages.
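These demo steps roughly translate to the console tools (topic and group names are made up; a console consumer without an explicit group.id gets its own auto-generated group, which is what "non-grouped" means here):

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 2 --topic demo
# two consumers in the same group (run in two terminals):
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic demo --consumer-property group.id=groupA
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic demo --consumer-property group.id=groupA
# one "non-grouped" consumer:
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic demo
# and a producer to type messages into:
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic demo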
Producer
• Properties props = new Properties() // map that must include bootstrap.servers (list of brokers), key serializer, value serializer
• Producer<String, String> producer = new KafkaProducer<>(props)
• ProducerRecord<String, String> record = new ProducerRecord<>(topicName, [partition num], [timestamp], [key], value)
• producer.send(record) // .get() for a blocking send, or producer.send(record, callback)
• producer.close() // after sending everything, we need to free resources
• * max.in.flight.requests.per.connection = 5 (for async calls - how many requests may be sent without a response)

* Producer<key type, value type>
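Putting the bullets together, a minimal runnable sketch (the topic name, key and value are made up):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    Producer<String, String> producer = new KafkaProducer<>(props);
    try {
      // send with a callback; RecordMetadata carries the partition and offset on success
      producer.send(new ProducerRecord<>("test", "key-1", "hello kafka"), (metadata, exception) -> {
        if (exception != null) exception.printStackTrace();
        else System.out.println("wrote to partition " + metadata.partition() + " at offset " + metadata.offset());
      });
    } finally {
      producer.close(); // flushes buffered records and frees resources
    }
  }
}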
PRODUCER
Diagram of the send path: Properties → ProducerRecord → serialize → assign partition → partition buffer (Record 1, Record 2, ...) → broker, which returns RecordMetadata on success or an error.
Mostly you will use a generic serializer like Avro, but you can use a custom one.

The producer maintains a partition buffer and sends the messages in batches (the size and time window of the buffer are configurable).

If an error is recoverable it will retry (e.g. the leader is down until a new leader is elected; the number of retries and the time gap between them are configurable).
SENDING APPROACH
• Fire and forget - Kafka is a highly available system, but we might lose a small portion of data
• Synchronous Send - messages are critical and we cannot lose data
• Asynchronous Send - allows handling failures asynchronously; better throughput, but you have an in-flight requests limit
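The three approaches side by side, reusing the producer and record from the sketch above (RecordMetadata comes from org.apache.kafka.clients.producer):

// fire and forget: send and move on; failures surface only via the producer's internal retries
producer.send(record);

// synchronous send: block until the broker acknowledges, or an exception is thrown
RecordMetadata meta = producer.send(record).get();

// asynchronous send: register a callback and keep sending
producer.send(record, (metadata, exception) -> {
  if (exception != null) {
    // handle the failure here (log it, retry, route to a dead-letter store ...)
  }
});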
SOME PRODUCER CONFIGURATIONS
• Bootstrap servers - list of brokers in your cluster (recommended: more than one)
• Key serializer class
• Value serializer class
• acks (0, 1, all)
• retries = 0 (retry.backoff.ms = 100 sets the time interval between retries)
• max.in.flight.requests.per.connection - how many messages you can send asynchronously without getting an acknowledgment; a high number will consume more memory but gain higher throughput
• Check the documentation
acks = 0 - don't wait for acknowledgment (possible loss of data, no retries, highest throughput)

acks = 1 (default) - only the leader responds (a safe choice, but still a small chance of losing data in case the leader crashes before the data is replicated)

acks = all - high reliability but higher latency (waiting for all in-sync replicas)

max.in.flight.requests.per.connection - due to retries you might lose the order of messages; if order is critical set it to 1 or use sync send
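As producer Properties, a "don't lose data, keep ordering" profile might look like this (values are illustrative):

props.put("acks", "all");                               // wait for all in-sync replicas
props.put("retries", 3);                                // default is 0
props.put("retry.backoff.ms", 100);                     // pause between retries
props.put("max.in.flight.requests.per.connection", 1);  // preserve ordering across retries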
Consumer
• Properties props = new Properties(); // best practice is to use a config file
• props.put("bootstrap.servers", "localhost:9092");
• props.put("group.id", "accounts"); // optional, but if omitted the work cannot be shared
• props.put("key.deserializer", StringDeserializer.class.getName());
• props.put("value.deserializer", StringDeserializer.class.getName());
• KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
• consumer.subscribe(Arrays.asList("foo", "bar"));
try {
  while (running) {
    ConsumerRecords<String, String> records = consumer.poll(1000);
    for (ConsumerRecord<String, String> record : records) {
      System.out.println(record.offset() + ": " + record.value());
    }
    consumer.commitAsync();
  }
} finally {
  consumer.close();
}
* The poll method accepts a timeout parameter to establish how long we want to wait for data.

* It actually does a lot of important things, like connecting to the coordinator, getting the partition assignment, fetching messages, sending heartbeats, and much more.

* Each iteration of the loop must complete quickly enough to keep calling poll, otherwise the coordinator will consider the consumer dead and will initiate a rebalance (the limits are configurable).
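The relevant limits are consumer configs; a sketch with the 0.11 defaults (check the documentation for your version):

props.put("session.timeout.ms", 10000);    // dead if the broker sees no heartbeat for this long
props.put("heartbeat.interval.ms", 3000);  // how often the consumer's background thread heartbeats
props.put("max.poll.interval.ms", 300000); // maximum allowed gap between poll() calls before a rebalance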