Apache Kafka
//TODO Insert Logo
About Me
David Arthur
http://mumrah.github.io/
● Software Engineer at LucidWorks
● Open source contributor
● Gardener
● Dad
TOC
● Project overview
● Architecture
● Implementation
● Miscellany
Project info
● http://kafka.apache.org
● Written in Scala
● Open sourced by LinkedIn SNA in 2011
● Soon after entered Apache Incubator
● Not just an idle "open source donation"
● Active development, 0.8 out now
● Apache TLP late 2012, 13 committers
● Still no logo
12 word overview
Apache Kafka is publish-subscribe messaging
rethought as a distributed commit log
Ok...
Kafka is a persistent, distributed, replicated
pub/sub messaging system. Publishers send
messages to a cluster of brokers. The brokers
persist the messages to disk.
Consumers then request a range of messages
using an (offset, length) style API. Use of NIO
FileChannel allows for very fast transfer of
data in and out of the system.
Highlights
● ZooKeeper for Broker coordination
● Configurable Producer acks
● Consumer Groups
● TTL persistence
● Sync/Async producer API
● Durable
● Scalable
● Fast
● <Additional buzzwords>
Motivation
● Activity stream processing (user interactions)
● Batch latency was too high
● Existing queues handle large volumes of
(unconsumed) data poorly - durability is
expensive
● Need something fast and durable
Key design choices
● Pub/sub messaging pattern
● Messages are persistent
● Everything is distributed - producers,
brokers, consumers, the queue itself
● Consumers maintain their own state (i.e.,
"dumb" brokers)
● Throughput is key
TOC
● Project overview
● Architecture
● Implementation
● Miscellany
Brokers
● Receive messages from Producers (push),
deliver messages to Consumers (pull)
● Responsible for persisting the messages for
some time
● Relatively lightweight - mostly just handling
TCP connections and keeping open file
handles to the queue files
Log-based queue
Messages are persisted to append-only log
files by the broker. Producers are appending to
these log files (sequential write), and
consumers are reading a range of these files
(sequential reads).
Log-based queue
Message 0
Message 1
Message 2
Message 3
Producer
Message 4
Message 5
Message 6
Consumer A read 0-4 Consumer B read 1-3
Broker
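The log-based queue can be sketched in a few lines of Java. This is only an in-memory illustration, not Kafka's on-disk implementation: the class name AppendOnlyLog and its methods are hypothetical, and it assumes logical (per-message) offsets as drawn in the diagram.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical in-memory stand-in for one partition's log file.
final class AppendOnlyLog {
    private final List<String> messages = new ArrayList<>();

    // Producers only ever append; the return value is the new message's offset.
    synchronized long append(String message) {
        messages.add(message);
        return messages.size() - 1;
    }

    // Consumers read a contiguous range starting at an offset they track themselves.
    synchronized List<String> read(long offset, int maxMessages) {
        if (offset < 0 || offset >= messages.size() || maxMessages <= 0) {
            return Collections.emptyList();
        }
        int from = (int) offset;
        int to = (int) Math.min(messages.size(), offset + maxMessages);
        return new ArrayList<>(messages.subList(from, to));
    }

    public static void main(String[] args) {
        AppendOnlyLog log = new AppendOnlyLog();
        for (int i = 0; i <= 6; i++) {
            log.append("Message " + i);
        }
        System.out.println(log.read(0, 5)); // Consumer A reads offsets 0-4
        System.out.println(log.read(1, 3)); // Consumer B reads offsets 1-3
    }
}
```

Note that the log keeps no record of who has read what. Each consumer simply asks again with a larger offset.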
Topics
Topics are queues. They are logical collections
of partitions (the physical files). A broker
contains some of the partitions for a topic
Partition 0 Partition 1
Topic A
Partition 0 Partition 1
Topic B
Broker 1
Partition 0 Partition 1
Topic A
Partition 0 Partition 1
Topic B
Broker 2
Replication
Partitions of a topic are replicated. One broker
is the "leader" of a partition. All writes and
reads must go to the leader. Replicas exist for
fault-tolerance, not scalability.
When writing, messages can be synchronously
written to N replicas (depending on the
producer's ACKiness)
Producers
Producers are responsible for load balancing
messages to the various brokers. They can
discover all brokers from any one broker.
In 0.7, producers are fire-and-forget. In 0.8,
there are 3 ack levels:
● No ack (0)
● Ack from N replicas (1..N)
● Ack from all replicas (-1)
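A producer picks one of these ack levels through its configuration. The sketch below only builds a java.util.Properties object. The property names are the ones I recall from the 0.8 producer configuration, so check them against the documentation for your version. The broker host names and the helper class ProducerAckConfig are made up for illustration.

```java
import java.util.Properties;

// Hypothetical helper; it builds configuration only and never talks to a broker.
final class ProducerAckConfig {
    static Properties forAcks(int requiredAcks, boolean async) {
        Properties props = new Properties();
        // Placeholder hosts: any one live broker is enough to discover the rest.
        props.put("metadata.broker.list", "broker1:9092,broker2:9092");
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        // 0 = no ack, 1..N = ack from that many replicas, -1 = ack from all replicas.
        props.put("request.required.acks", Integer.toString(requiredAcks));
        props.put("producer.type", async ? "async" : "sync");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(forAcks(-1, false));
    }
}
```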
Consumers
Consumers request a range of messages from
a Broker. They are responsible for their own
state
Default implementation uses ZooKeeper to
manage state. In 0.8.1, Brokers will expose an
API for offset management to remove direct
communication between consumers and
ZooKeeper (a good thing).
Result?
● Brokers keep very little state, mostly just
open file pointers and connections
● Can scale to thousands of producers and
consumers*
● Stable performance, good scalability
● In 0.7, 50MB/s producer throughput,
100MB/s consumer throughput
● No numbers yet from 0.8, but they should be
comparable (with acks=0)
TOC
● Project overview
● Architecture
● Implementation
● Miscellany
High-level producer
API
Producer#send(KeyedMessage<K,V> datum)
KeyedMessage<K,V>(String topic, K key,
V value)
The Producer class determines where a
message is sent based on the routing key. If a
null key is given, the message is sent to a
random partition
Message routing
Partition 0 Partition 1
Topic A
Broker 1
Partition 0 Partition 1
Topic A
Broker 2
hash("foo") % 4
Partitioner
Producer.send("A", "foo",
messages)
Producer
Message routing
● Producers can be configured with a custom
routing function (implementing the Partitioner
interface)
● Default is hash-mod
● One side effect of routing is that there is no
total ordering for a topic, but there is within a
partition (this is actually really useful)
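Hash-mod routing is short enough to write out. This is a minimal sketch of the idea rather than Kafka's actual Partitioner class: the name HashModPartitioner is hypothetical, and it assumes the key's hashCode() is the hash being used.

```java
// Hypothetical stand-in for a hash-mod routing function.
final class HashModPartitioner {
    // Masks off the sign bit so a negative hashCode cannot yield a negative partition.
    static int partition(Object key, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        // Topic A has 4 partitions across 2 brokers, as in the routing diagram.
        System.out.println(partition("foo", 4)); // prints 2
    }
}
```

Because the result depends only on the key and the partition count, every message with the same key lands in the same partition. That is what gives per-key ordering, and it is also why changing the partition count later reshuffles keys.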
(Partially) Ordered
Messages
Consider a system that is processing updates.
If you partition the messages based on the
primary key you guarantee all messages for a
given key end up in the same partition. Since
Kafka guarantees ordering of the messages at
the partition level, your updates will be
processed in the correct sequence.
Persistent Messages
● MessageSets received by the broker and
flushed to append-only log files
● Simple log format (next slide)
● Zero-copy (i.e., sendfile) for file to socket
transfer
● Log files are not kept forever
Log Format
append
Log file
MessageSet
MessageSet
MessageSet
MessageSet
/tmp/kafka-logs/00000000000000000000.kafka
append
append
append
Broker
ByteBufferMessageSet#writeTo(FileChannel)
Index
0: 0
1: 1024
2: 2048
Message offset
index
Message sets
● To maximize throughput, the same binary
format for messages is used throughout the
system (producer API, consumer API, and
log file format)
● Broker only decodes far enough to read the
checksum and validate it, then writes it to the
disk
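The checksum step can be illustrated with the JDK's CRC32 class. This is a sketch under assumptions: the slides only say "checksum", so CRC32 is my choice here, and the class ChecksumCheck is hypothetical. It ignores the rest of the message header.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical helper showing a validate-then-write check on an opaque payload.
final class ChecksumCheck {
    static long checksum(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload);
        return crc.getValue();
    }

    // The payload bytes are never decoded, only hashed and compared.
    static boolean isValid(long storedChecksum, byte[] payload) {
        return storedChecksum == checksum(payload);
    }

    public static void main(String[] args) {
        byte[] payload = "hello kafka".getBytes(StandardCharsets.UTF_8);
        long stored = checksum(payload);
        System.out.println(isValid(stored, payload));
    }
}
```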
Compressed message sets
or, message batching
● The value of a Message can be a
compressed MessageSet
Message(value=gzip(MessageSet(...)))
● Useful for increasing throughput (especially
if you are buffering messages in the
producer)
● Can also be used as a way to atomically
write multiple messages
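To make the batching idea concrete, here is a sketch that packs several messages into one gzip-compressed value and unpacks them again. The length-prefixed layout and the class name GzipBatch are assumptions for illustration. This is not Kafka's wire format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical batch codec: a count, then length-prefixed payloads, all gzipped.
final class GzipBatch {
    static byte[] pack(List<String> messages) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(new GZIPOutputStream(bytes))) {
            out.writeInt(messages.size());
            for (String message : messages) {
                byte[] payload = message.getBytes(StandardCharsets.UTF_8);
                out.writeInt(payload.length);
                out.write(payload);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static List<String> unpack(byte[] value) {
        List<String> messages = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(
                new GZIPInputStream(new ByteArrayInputStream(value)))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                byte[] payload = new byte[in.readInt()];
                in.readFully(payload);
                messages.add(new String(payload, StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return messages;
    }
}
```

The packed bytes travel as the value of a single outer message, so the whole batch is either written or not written.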
Zero-copy
Reading data out of Kafka is super fast thanks
to java.nio.channels.FileChannel#transferTo.
This method uses the "sendfile" system call
which allows for very efficient transfer of data
from a file to another file (including sockets).
This optimization is what allows us to pull
~100MB/s out of a single broker
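A minimal sketch of the transferTo call follows. The class ZeroCopySketch and its demo helper are hypothetical. The demo writes into an in-memory channel so it can run anywhere, which means the JVM falls back to an ordinary copy. The sendfile path is only possible when the target is a channel the operating system can write to directly, such as a socket.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class ZeroCopySketch {
    // Sends up to `count` bytes of `file`, starting at `position`, to `target`.
    static long send(Path file, long position, long count, WritableByteChannel target) {
        try (FileChannel source = FileChannel.open(file, StandardOpenOption.READ)) {
            long sent = 0;
            while (sent < count) {
                // transferTo may move fewer bytes than requested, so loop.
                long n = source.transferTo(position + sent, count - sent, target);
                if (n <= 0) {
                    break; // end of file reached, or the target accepted nothing
                }
                sent += n;
            }
            return sent;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo helper: writes `data` to a temp file and reads a slice back via send().
    static byte[] demo(byte[] data, long position, long count) {
        try {
            Path file = Files.createTempFile("log-segment-sketch", ".bin");
            try {
                Files.write(file, data);
                ByteArrayOutputStream sink = new ByteArrayOutputStream();
                try (WritableByteChannel target = Channels.newChannel(sink)) {
                    send(file, position, count, target);
                }
                return sink.toByteArray();
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The (position, count) pair is the same shape as the consumer's (offset, length) request, which is why a fetch can map almost directly onto one transferTo call.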
High-level Consumer
API
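// props: java.util.Properties holding zookeeper.connect and group.id
Map<String, Integer> topicMap = Collections.singletonMap("topic", 2);
ConsumerConnector connector =
Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
Map<String, List<KafkaStream<byte[], byte[]>>> streams =
connector.createMessageStreams(topicMap);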
● KafkaStream is an iterable
● Blocking or non-blocking behavior
● Auto-commit offset, or manual commit
● Participate in a "consumer group"
Consumer Groups
Multiple high-level consumers can participate in
a single "consumer group". A consumer group
is coordinated using ZooKeeper, so it can span
multiple machines. In a group, each partition
will be consumed by exactly one consumer
(i.e., KafkaStream).
This allows for broadcast or pub/sub type of
messaging pattern
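The "exactly one consumer per partition" rule can be sketched as a small assignment function. This is an illustration, not the ZooKeeper-based rebalancing code: the class GroupAssignment is hypothetical, and it assumes a simple range-style split over sorted consumer ids.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical range-style assignment of partitions to the consumers in one group.
final class GroupAssignment {
    static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        // Sort and de-duplicate so every member computes the same answer.
        List<String> sorted = new ArrayList<>(new TreeSet<>(consumers));
        Map<String, List<Integer>> owned = new LinkedHashMap<>();
        if (sorted.isEmpty()) {
            return owned;
        }
        int perConsumer = numPartitions / sorted.size();
        int extra = numPartitions % sorted.size();
        int next = 0;
        for (int i = 0; i < sorted.size(); i++) {
            // The first `extra` consumers take one additional partition each.
            int take = perConsumer + (i < extra ? 1 : 0);
            List<Integer> partitions = new ArrayList<>();
            for (int j = 0; j < take; j++) {
                partitions.add(next++);
            }
            owned.put(sorted.get(i), partitions);
        }
        return owned;
    }

    public static void main(String[] args) {
        List<String> group = new ArrayList<>();
        group.add("consumer-2");
        group.add("consumer-1");
        System.out.println(assign(group, 4));
    }
}
```

With more consumers than partitions, the surplus consumers get an empty list and sit idle. A second group runs the same assignment independently, so it sees every message as well.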
Consumer Groups
P0 P1
Topic "foo"
Broker 1
Consumer Group A Consumer Group B
P2 P3
Topic "foo"
Broker 2
TOC
● Project overview
● Architecture
● Implementation
● Miscellany
But, I already have a MQ
Many people with distributed systems already
have something like JMS, RabbitMQ, Kestrel,
or ØMQ in place.
These are all different systems with different
implementation details and semantics. Decide
what features you need, and then choose an MQ -
not the other way around :)
RabbitMQ discussion: http://bit.ly/140ZMOx
Caveats
● Not designed for large payloads. Decoding is
done for a whole message (no streaming
decoding).
● Rebalancing can screw things up if you're
doing any aggregation in the consumer by
the partition key
● Number of partitions cannot be easily
changed (choose wisely)
● Lots of topics can hurt I/O performance
Clients
● Many exist, some more complete than
others
● No official clients (other than Java/Scala)
● Community maintained
● Most lack support for high-level consumer
due to ZooKeeper dependency (it's tricky)
● Excellent Python client:
https://github.com/mumrah/kafka-python (yes, I wrote it)
● Only Python, C, and Clojure support the 0.8
protocol
Hadoop InputFormat
● InputFormat/OutputFormat implementations
● Uses low-level consumer, no offsets in ZK
(they are stored on HDFS instead)
● Useful for long term storage of messages
● We use this!
● LinkedIn released Camus, a Kafka to HDFS
pipeline
Integration
A few good integration points. I expect more to
show up as Kafka gains adoption
● log4j appender (in Kafka itself)
● storm-kafka (in storm-contrib)
● camel-kafka
● LogStash (loges)
More listed on Kafka wiki http://bit.ly/1btZz8p
Applications
Log Aggregation
Many applications running on many machines.
You want centralized logging, but don't want to
install an agent (Flume, etc).
Kafka includes a log4j appender
<appender class="kafka.producer.KafkaLog4jAppender" name="kafka-solr">
<param name="zkConnect" value="localhost:2181"/>
<param name="topic" value="solr-logs"/>
<layout class="org.apache.log4j.PatternLayout">
<param value="%d{ISO8601} %p %c %m" name="ConversionPattern"/>
</layout>
</appender>
Applications
Notifications
Store data into data store A, need to sync it to
data store B. Suppose you can send a
message to Kafka when an update happens in
A
A
Kafka
B
record X changed
get updated record X
Applications
Stream Processing
Using a stream processing tool like Storm or
Apache Camel, Kafka provides an excellent
backbone.
Data in
Kafka
topic A
Process A
Kafka
topic B
Process B
Bonus Slides
Stats
LinkedIn stats:
● Peak writes per second: 460k
● Average writes per day: 28 billion
● Average reads per second: 2.3 million
● ~700 topics
● Thousands of producers
● ~1000 consumers
Bonus Slides
Logos!
Being heatedly debated in JIRA as we speak!
Links!
● Slides: http://bit.ly/kafka-trihug
● Kafka Project: http://kafka.apache.org/
● Kafka Code: https://github.com/apache/kafka
● LinkedIn Camus: https://github.com/linkedin/camus
● Zero-Copy IBM article: http://ibm.co/gsETNm
● LinkedIn Kafka paper: http://bit.ly/mC8TLS
● LinkedIn blog post on replication: http://linkd.in/YAWslH
Cyber Awareness overview for 2025 month of security
riccardosl1
 
Ad

Introduction and Overview of Apache Kafka, TriHUG July 23, 2013

  • 2. About Me David Arthur http://mumrah.github.io/ ● Software Engineer at LucidWorks ● Open source contributor ● Gardener ● Dad
  • 3. TOC ● Project overview ● Architecture ● Implementation ● Miscellany
  • 4. Project info ● http://kafka.apache.org ● Written in Scala ● Open sourced by LinkedIn SNA in 2011 ● Soon after entered Apache Incubator ● Not just an idle "open source donation" ● Active development, 0.8 out now ● Apache TLP late 2012, 13 committers ● Still no logo
  • 5. 12 word overview Apache Kafka is publish-subscribe messaging rethought as a distributed commit log
  • 6. Ok... Kafka is a persistent, distributed, replicated pub/sub messaging system. Publishers send messages to a cluster of brokers. The brokers persist the messages to disk. Consumers then request a range of messages using an (offset, length) style API. Use of NIO FileChannel allows for very fast transfer of data in and out of the system.
  • 7. Highlights ● ZooKeeper for Broker coordination ● Configurable Producer acks ● Consumer Groups ● TTL persistence ● Sync/Async producer API ● Durable ● Scalable ● Fast ● <Additional buzzwords>
  • 8. Motivation ● Activity stream processing (user interactions) ● Batch latency was too high ● Existing queues handle large volumes of (unconsumed) data poorly - durability is expensive ● Need something fast and durable
  • 9. Key design choices ● Pub/sub messaging pattern ● Messages are persistent ● Everything is distributed - producers, brokers, consumers, the queue itself ● Consumers maintain their own state (i.e., "dumb" brokers) ● Throughput is key
  • 10. TOC ● Project overview ● Architecture ● Implementation ● Miscellany
  • 11. Brokers ● Receive messages from Producers (push), deliver messages to Consumers (pull) ● Responsible for persisting the messages for some time ● Relatively lightweight - mostly just handling TCP connections and keeping open file handles to the queue files
  • 12. Log-based queue Messages are persisted to append-only log files by the broker. Producers are appending to these log files (sequential write), and consumers are reading a range of these files (sequential reads).
  • 13. Log-based queue (diagram) A Producer appends Message 0 through Message 6 to the Broker's log. Consumer A reads 0-4; Consumer B reads 1-3.
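
The append-and-read-a-range behaviour described on these two slides can be sketched in a few lines of Java. This is only an in-memory illustration: the class name `AppendOnlyLogSketch` and its methods are hypothetical, and a real broker persists to log files on disk rather than to a list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// In-memory stand-in for one append-only log (illustrative only;
// a real broker writes to files on disk).
class AppendOnlyLogSketch {
    private final List<String> messages = new ArrayList<>();

    // Producer side: append at the end and return the offset given to the message.
    synchronized long append(String message) {
        messages.add(message);
        return messages.size() - 1;
    }

    // Consumer side: read up to maxCount messages starting at offset.
    // Reading never modifies the log, so several consumers can read overlapping ranges.
    synchronized List<String> read(long offset, int maxCount) {
        if (offset < 0 || offset >= messages.size() || maxCount <= 0) {
            return Collections.emptyList();
        }
        int from = (int) offset;
        int to = (int) Math.min(messages.size(), offset + maxCount);
        return new ArrayList<>(messages.subList(from, to));
    }

    public static void main(String[] args) {
        AppendOnlyLogSketch log = new AppendOnlyLogSketch();
        for (int i = 0; i <= 6; i++) {
            log.append("Message " + i);
        }
        System.out.println(log.read(0, 5)); // Consumer A: offsets 0-4
        System.out.println(log.read(1, 3)); // Consumer B: offsets 1-3
    }
}
```

Because the consumer supplies the offset, the log itself keeps no per-consumer state, which is the "dumb broker" idea from the design-choices slide.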
  • 14. Topics Topics are queues. They are logical collections of partitions (the physical files). A broker contains some of the partitions for a topic. (Diagram: Broker 1 and Broker 2 are each drawn holding a Partition 0 and a Partition 1 for Topic A and for Topic B.)
  • 15. Replication Partitions of a topic are replicated. One broker is the "leader" of a partition. All writes and reads must go to the leader. Replicas exist for fault-tolerance, not scalability. When writing, messages can be synchronously written to N replicas (depending on the producer's ACKiness)
  • 16. Producers Producers are responsible for load balancing messages to the various brokers. They can discover all brokers from any one broker. In 0.7, producers are fire-and-forget. In 0.8, there are 3 ack levels: ● No ack (0) ● Ack from N replicas (1..N) ● Ack from all replicas (-1)
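
As a rough illustration of the three ack levels, the helper below maps an ack setting to the number of replica acknowledgements a producer would wait for. `AckLevelSketch` and `acksToWaitFor` are hypothetical names for this sketch, not part of Kafka's API, and "all replicas" is modelled simply as the replica count passed in.

```java
// Hypothetical helper: how many replica acknowledgements a producer waits for
// under the three ack levels listed on the slide.
class AckLevelSketch {
    static int acksToWaitFor(int ackLevel, int replicaCount) {
        if (replicaCount < 1) {
            throw new IllegalArgumentException("replicaCount must be at least 1");
        }
        if (ackLevel == 0) {
            return 0;             // no ack: fire-and-forget, as in 0.7
        }
        if (ackLevel == -1) {
            return replicaCount;  // ack from all replicas
        }
        if (ackLevel < -1 || ackLevel > replicaCount) {
            throw new IllegalArgumentException("ackLevel must be -1, 0, or 1.." + replicaCount);
        }
        return ackLevel;          // ack from N replicas
    }

    public static void main(String[] args) {
        for (int level : new int[] {0, 1, 2, -1}) {
            System.out.println("acks=" + level + " waits for " + acksToWaitFor(level, 3) + " replica(s)");
        }
    }
}
```

The trade-off is the usual one: waiting for fewer acknowledgements gives higher throughput, waiting for more gives stronger durability.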
  • 17. Consumers Consumers request a range of messages from a Broker. They are responsible for their own state Default implementation uses ZooKeeper to manage state. In 0.8.1, Brokers will expose an API for offset management to remove direct communication between consumers and ZooKeeper (a good thing).
  • 18. Result? ● Brokers keep very little state, mostly just open file pointers and connections ● Can scale to thousands of producers and consumers* ● Stable performance, good scalability ● In 0.7, 50MB/s producer throughput, 100MB/s consumer throughput ● No numbers yet from 0.8, but they should be comparable (with acks=0)
  • 19. TOC ● Project overview ● Architecture ● Implementation ● Miscellany
  • 20. High-level producer API Producer#send(KeyedMessage<K,V> datum) KeyedMessage<K,V>(String topic, K key, V value) The Producer class determines where a message is sent based on the routing key. If a null key is given, the message is sent to a random partition
  • 21. Message routing (diagram) The Producer calls Producer.send("A", "foo", messages); the Partitioner computes hash("foo") % 4 to choose among the partitions of Topic A, which are spread across Broker 1 and Broker 2 (each drawn with a Partition 0 and a Partition 1).
  • 22. Message routing ● Producers can be configured with a custom routing function (implementing the Partitioner interface) ● Default is hash-mod ● One side effect of routing is that there is no total ordering for a topic, but there is within a partition (this is actually really useful)
  • 23. (Partially) Ordered Messages Consider a system that is processing updates. If you partition the messages based on the primary key you guarantee all messages for a given key end up in the same partition. Since Kafka guarantees ordering of the messages at the partition level, your updates will be processed in the correct sequence.
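
A minimal sketch of hash-mod routing, showing why every message with the same key lands in the same partition. The class `HashModPartitionerSketch` is hypothetical and does not implement Kafka's `Partitioner` interface; the use of `String.hashCode()` is an assumption made for illustration.

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative hash-mod routing: the same key always maps to the same partition.
class HashModPartitionerSketch {
    static int partitionFor(String key, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        if (key == null) {
            // No routing key: pick a random partition.
            return ThreadLocalRandom.current().nextInt(numPartitions);
        }
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        // Two updates to the same primary key go to the same partition,
        // so they are stored (and later consumed) in the order they were sent.
        System.out.println(partitionFor("user-42", 4));
        System.out.println(partitionFor("user-42", 4));
        System.out.println(partitionFor(null, 4));
    }
}
```

Note that the mapping depends on the partition count, which is one reason changing the number of partitions later is awkward.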
  • 24. Persistent Messages ● MessageSets are received by the broker and flushed to append-only log files ● Simple log format (next slide) ● Zero-copy (i.e., sendfile) for file to socket transfer ● Log files are not kept forever
  • 26. Message sets ● To maximize throughput, the same binary format for messages is used throughout the system (producer API, consumer API, and log file format) ● Broker only decodes far enough to read the checksum and validate it, then writes it to the disk
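
The "decode only far enough to validate the checksum" step might look roughly like this. The sketch assumes a CRC32 checksum over the payload bytes; the class and method names are hypothetical, and the real message layout is not reproduced here.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Illustrative checksum check: the broker compares the checksum carried with a
// message against one computed over the payload, without interpreting the payload.
class ChecksumSketch {
    static long checksum(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload);
        return crc.getValue();
    }

    static boolean isValid(long storedChecksum, byte[] payload) {
        return storedChecksum == checksum(payload);
    }

    public static void main(String[] args) {
        byte[] payload = "record X changed".getBytes(StandardCharsets.UTF_8);
        long stored = checksum(payload);               // computed by the producer
        System.out.println(isValid(stored, payload));  // payload arrived intact
        payload[0] ^= 1;                                // simulate a corrupted bit
        System.out.println(isValid(stored, payload));  // corruption is detected
    }
}
```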
  • 27. Compressed message sets or, message batching ● The value of a Message can be a compressed MessageSet Message(value=gzip(MessageSet(...))) ● Useful for increasing throughput (especially if you are buffering messages in the producer) ● Can also be used as a way to atomically write multiple messages
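
The idea of carrying a whole batch as the gzip-compressed value of one message can be sketched with the JDK's own gzip streams. The length-prefixed encoding below is an assumption made for this example and is not Kafka's wire format; `CompressedBatchSketch` is a hypothetical name.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Illustrative batching: many messages become the compressed value of one wrapper message.
class CompressedBatchSketch {
    // Pack the batch into a single gzip-compressed byte array.
    static byte[] compress(List<String> messages) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(new GZIPOutputStream(bytes))) {
            out.writeInt(messages.size());
            for (String message : messages) {
                out.writeUTF(message);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray(); // read after close so the gzip trailer is written
    }

    // Unpack the batch on the consumer side.
    static List<String> decompress(byte[] value) {
        try (DataInputStream in = new DataInputStream(
                new GZIPInputStream(new ByteArrayInputStream(value)))) {
            int count = in.readInt();
            List<String> messages = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                messages.add(in.readUTF());
            }
            return messages;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<String> batch = Arrays.asList("click:home", "click:search", "click:home");
        byte[] value = compress(batch);
        System.out.println(decompress(value));
    }
}
```

Since the batch travels as one value, it is written to the log as a unit, which is what makes the "atomic write of multiple messages" use possible.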
  • 28. Zero-copy Reading data out of Kafka is super fast thanks to java.nio.channels.FileChannel#transferTo. This method uses the "sendfile" system call which allows for very efficient transfer of data from a file to another file (including sockets). This optimization is what allows us to pull ~100MB/s out of a single broker
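
A small, hedged example of calling `FileChannel#transferTo` on a byte range of a file. The helper names and the temp-file contents are made up for the sketch. The target here is an in-memory channel so the example runs anywhere, which means no sendfile call is involved; the zero-copy path applies when the target is a socket channel on a platform that supports it.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class TransferToSketch {
    // Send `count` bytes of `file`, starting at `position`, to `target`.
    // transferTo may move fewer bytes than requested, so it is called in a loop.
    static long sendRange(Path file, long position, long count, WritableByteChannel target) {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long sent = 0;
            while (sent < count) {
                long n = channel.transferTo(position + sent, count - sent, target);
                if (n <= 0) {
                    break; // end of file reached or target cannot accept more
                }
                sent += n;
            }
            return sent;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes a tiny stand-in "log file", then transfers bytes 2..5 of it.
    static String demo() {
        try {
            Path log = Files.createTempFile("kafka-sketch", ".log");
            try {
                Files.write(log, "m0m1m2m3".getBytes(StandardCharsets.UTF_8));
                ByteArrayOutputStream sink = new ByteArrayOutputStream();
                sendRange(log, 2, 4, Channels.newChannel(sink));
                return new String(sink.toByteArray(), StandardCharsets.UTF_8);
            } finally {
                Files.delete(log);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The (position, count) arguments line up with the (offset, length) style of consumer request described earlier: a fetch is essentially "send this slice of the log file to that socket".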
  • 29. High-level Consumer API Map<String, Integer> topicCountMap = Collections.singletonMap("topic", 2); ConsumerConnector connector = Consumer.createJavaConsumerConnector(new ConsumerConfig(props)); Map<String, List<KafkaStream<byte[], byte[]>>> streams = connector.createMessageStreams(topicCountMap); ● KafkaStream is an iterable ● Blocking or non-blocking behavior ● Auto-commit offset, or manual commit ● Participate in a "consumer group"
  • 30. Consumer Groups Multiple high-level consumers can participate in a single "consumer group". A consumer group is coordinated using ZooKeeper, so it can span multiple machines. In a group, each partition will be consumed by exactly one consumer (i.e., KafkaStream). This allows for a broadcast or pub/sub type of messaging pattern.
  • 31. Consumer Groups (diagram) Topic "foo" has partitions P0 and P1 on Broker 1 and P2 and P3 on Broker 2; Consumer Group A and Consumer Group B both consume the topic.
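
The rule that each partition is consumed by exactly one consumer within a group can be illustrated with a simple round-robin assignment. This is a sketch only: the class name is hypothetical, consumer names are assumed to be unique, and the actual ZooKeeper-coordinated rebalancing in Kafka works differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative assignment of partitions to the consumers of ONE group.
class GroupAssignmentSketch {
    static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        if (consumers.isEmpty()) {
            throw new IllegalArgumentException("group has no consumers");
        }
        Map<String, List<Integer>> owners = new LinkedHashMap<>();
        for (String consumer : consumers) {
            owners.put(consumer, new ArrayList<>());
        }
        // Every partition gets exactly one owner within the group.
        for (int p = 0; p < numPartitions; p++) {
            owners.get(consumers.get(p % consumers.size())).add(p);
        }
        return owners;
    }

    public static void main(String[] args) {
        // Two groups reading the same 4-partition topic: each group sees every
        // partition, but inside a group the partitions are split up.
        System.out.println("Group A: " + assign(Arrays.asList("a1", "a2"), 4));
        System.out.println("Group B: " + assign(Arrays.asList("b1"), 4));
    }
}
```

Running one group with several consumers gives queue-like load sharing, while running several groups gives each group its own full copy of the stream.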
  • 32. TOC ● Project overview ● Architecture ● Implementation ● Miscellany
  • 33. But, I already have a MQ Many people with distributed systems already have something like JMS, RabbitMQ, Kestrel, or ØMQ in place. These are all different systems with different implementation details and semantics. Decide what features you need, and then choose an MQ - not the other way around :) RabbitMQ discussion: http://bit.ly/140ZMOx
  • 34. Caveats ● Not designed for large payloads. Decoding is done for a whole message (no streaming decoding). ● Rebalancing can screw things up if you're doing any aggregation in the consumer by the partition key ● Number of partitions cannot be easily changed (choose wisely) ● Lots of topics can hurt I/O performance
  • 35. Clients ● Many exist, some more complete than others ● No official clients (other than Java/Scala) ● Community maintained ● Most lack support for high-level consumer due to ZooKeeper dependency (it's tricky) ● Excellent Python client: https://github.com/mumrah/kafka-python (yes, I wrote it) ● Only Python, C, and Clojure support the 0.8 protocol
  • 36. Hadoop InputFormat ● InputFormat/OutputFormat implementations ● Uses low-level consumer, no offsets in ZK (they are stored on HDFS instead) ● Useful for long term storage of messages ● We use this! ● LinkedIn released Camus, a Kafka to HDFS pipeline
  • 37. Integration A few good integration points. I expect more to show up as Kafka gains adoption ● log4j appender (in Kafka itself) ● storm-kafka (in storm-contrib) ● camel-kafka ● LogStash (loges) More listed on Kafka wiki http://bit.ly/1btZz8p
  • 38. Applications Log Aggregation Many applications running on many machines. You want centralized logging, but don't want to install an agent (Flume, etc). Kafka includes a log4j appender <appender class="kafka.producer.KafkaLog4jAppender" name="kafka-solr"> <param name="zkConnect" value="localhost:2181"/> <param name="topic" value="solr-logs"/> <layout class="org.apache.log4j.PatternLayout"> <param value="%d{ISO8601} %p %c %m" name="ConversionPattern"/> </layout> </appender>
  • 39. Applications Notifications Store data into data store A, need to sync it to data store B. Suppose you can send a message to Kafka when an update happens in A. (Diagram: A → Kafka → B, "record X changed"; B then gets the updated record X.)
  • 40. Applications Stream Processing For a stream processing tool like Storm or Apache Camel, Kafka provides an excellent backbone. (Diagram: Data in → Kafka topic A → Process A → Kafka topic B → Process B)
  • 41. Bonus Slides Stats LinkedIn stats: ● Peak writes per second: 460k ● Average writes per day: 28 billion ● Average reads per second: 2.3 million ● ~700 topics ● Thousands of producers ● ~1000 consumers
  • 42. Bonus Slides Logos! Being heatedly debated in JIRA as we speak!
  • 43. Links! ● Slides: http://bit.ly/kafka-trihug ● Kafka Project: http://kafka.apache.org/ ● Kafka Code: https://github.com/apache/kafka ● LinkedIn Camus: https://github.com/linkedin/camus ● Zero-Copy IBM article: http://ibm.co/gsETNm ● LinkedIn Kafka paper: http://bit.ly/mC8TLS ● LinkedIn blog post on replication: http://linkd.in/YAWslH