Kongo: Building a Scalable
Streaming IoT Application
using Apache Kafka
Paul Brebner
instaclustr.com Technology Evangelist
Overview
What we’ll see on
the journey up river
1. Kafka introduction
2. The Kongo problem
3. Application, Architecture and Design(s)
4. Streams extension
5. Scaling
1 Kafka
Introduction
Instaclustr
Managed
Platform
Multi-cloud
For scale,
performance,
availability,
integration, security
Instaclustr
Managed
Platform
Open Source -
Store, Analyze,
Search, Explore,
and in 2018 we
added Stream –
Kafka!
What is Kafka?
Message flow
Distributed
streams
processing
1 Distributed Producers…
2 Send Messages
3 To Distributed Consumers
4 Via Kafka Cluster
Kafka
Key Benefits
■ Fast – high throughput and low latency
■ Scalable – horizontally scalable, just add nodes and
partitions
■ Reliable – distributed and fault tolerant
■ Zero data loss
■ Open Source
■ Heterogeneous data sources and sinks
■ Available as an Instaclustr Managed service
How does
Kafka work?
Excerpt from…
[Diagram: a Producer and four Consumers, with a “?” between them]
Kafka is “pub-sub”. It’s loosely coupled:
producers and consumers don’t know about
each other.
Filtering, or which consumers get which messages, is topic based.
- Producers send messages to topics.
- Consumers subscribe to topics of interest, e.g. Parties.
- When they poll they only receive messages sent to those topics.
None of these consumers will receive messages sent to the “Work” topic.
[Diagram: a Producer sends to Topic “Parties” and Topic “Work”. Four Consumers are subscribed to Topic “Parties” and poll to receive messages from “Parties”. They are not subscribed to “Work” messages.]
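The topic-based filtering described on this slide can be sketched with a toy in-memory model. This is an illustration only, not the Kafka client API: `ToyBroker`, `ToyConsumer` and the message strings are hypothetical names made up for the example.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a broker: one append-only list per topic."""

    def __init__(self):
        self.topics = defaultdict(list)

    def send(self, topic, message):
        # Producers only name a topic; they never address a consumer.
        self.topics[topic].append(message)


class ToyConsumer:
    """Polls only the topics it has subscribed to."""

    def __init__(self, broker, subscriptions):
        self.broker = broker
        self.positions = {topic: 0 for topic in subscriptions}

    def poll(self):
        received = []
        for topic, position in list(self.positions.items()):
            log = self.broker.topics[topic]
            received.extend(log[position:])
            self.positions[topic] = len(log)  # remember how far we have read
        return received


broker = ToyBroker()
party_goer = ToyConsumer(broker, ["Parties"])
broker.send("Parties", "BBQ on Saturday")
broker.send("Work", "Status report due")
first_poll = party_goer.poll()   # only the "Parties" message
second_poll = party_goer.poll()  # nothing new since the last poll
```

The consumer never sees the “Work” message because it never subscribed to that topic, and the producer never needs to know who is listening.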
Kafka works like an Amish barn raising.
Partitions and a consumer group share
work across multiple consumers. The more
partitions a topic has, the more consumers
it supports.
Image: Paul Cyr ©2018 NorthernMainePhotos.com
Kafka also works like the Clone Army.
It supports delivery of the same message to
multiple consumers with consumer groups.
Kafka doesn’t throw messages away
as soon as they are delivered, so the
same message can be delivered to
multiple consumer groups.
Image: AKKHARAT JARUSILAWONG / Shutterstock.com
Consumers subscribed to the “Parties” topic are allocated partitions.
When they poll, they only get messages from their allocated
partitions.
[Diagram: a Producer sends to Topic “Parties” (Partition 1, Partition 2, Partition 3). Four Consumers in two Consumer Groups are allocated partitions.]
This enables consumers in the same group to share the work
around. Each consumer gets only a subset of the available
messages.
[Diagram: a Producer sends to Topic “Parties” (Partition 1, Partition 2, Partition 3). Three Consumers in one Consumer Group. Consumers share work within groups.]
Multiple groups enable message broadcasting. Messages
are duplicated across groups, as each consumer group
receives a copy of each message.
[Diagram: a Producer sends to Topic “Parties” (Partition 1, Partition 2, Partition 3). Four Consumers in two Consumer Groups. Messages are duplicated across Consumer Groups.]
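A minimal sketch of the two behaviours described in the last three slides: consumers in one group split the partitions between them, and every group receives its own copy of each message. The round-robin assignment and the names `assign_partitions`, `deliver`, `group-A` and `group-B` are assumptions for illustration; real Kafka assignors are more involved.

```python
from collections import defaultdict

NUM_PARTITIONS = 3


def assign_partitions(consumers, num_partitions=NUM_PARTITIONS):
    """Round-robin the partitions over the consumers of one group."""
    assignment = defaultdict(list)
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return dict(assignment)


def deliver(messages, groups):
    """Return {group: {consumer: [payloads]}}.

    `messages` is a list of (partition, payload) pairs.
    """
    delivered = {}
    for group, consumers in groups.items():
        inbox = {consumer: [] for consumer in consumers}
        for consumer, partitions in assign_partitions(consumers).items():
            inbox[consumer] = [p for part, p in messages if part in partitions]
        delivered[group] = inbox
    return delivered


messages = [(0, "m0"), (1, "m1"), (2, "m2"), (0, "m3")]
groups = {"group-A": ["a1", "a2"], "group-B": ["b1"]}
delivered = deliver(messages, groups)
# Within group-A the work is shared; group-B's single consumer sees everything.
```

With three partitions, a fourth consumer in the same group would be assigned nothing, which is why the number of partitions caps the useful number of consumers per group.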
2 The Kongo
Problem
Amazon was taken
Congo is 2nd biggest
river, 5,000km
And deepest
Congo -> Kongo
(Kingdom of)
Kongo River
Important for trade
A logistics problem
Our
Logistics
Problem
Goods stored in
Warehouses
Goods moved
between
Warehouses in
Trucks
Checking rules in
real-time
Goods
Chickens
(Perishable, Fragile,
Edible)
Toxic Waste
(Hazardous, Bulky)
Vegetables
(Perishable, Edible)
Art (Fragile)
Warehouses
Trucks
Interfaces
between
virtual and
real
RFID tags
RFID readers produce Truck
load/unload events
Sensors (e.g. shock
and vibration,
environmental gas
sensor).
About 20 metrics
The Story
Warehouses
The Story
Goods in a
Warehouse
The Story
Warehouse Sensor
events
Check environment
rules for all Goods
in Warehouse
[Diagram: Warehouse sensor events; one Warehouse has No Goods]
The Story
Along comes a truck
The Story
Load truck with art
The Story
RFID Load event
Art now in Truck
RFID Load Event
The Story
Load truck with
Drums
The Story
RFID Load Event
Drums now in Truck
Check Drums and
Art co-location rules
[Diagram: RFID Load Event]
The Story
Truck drives to
another warehouse
The Story
Truck sensor events
Check environment
rules for all Goods
in Truck
[Diagram: Truck sensor events]
The Story
Unload Drums and
Art
The Story
RFID Unload events
Goods now in
Warehouse
Repeat from Start
With lots more
warehouses, goods
and trucks!
RFID Unload Events
Rules
Goods
categories
■ Each Goods has 0 or more general Categories:
● Perishable
● Hazardous
● Fragile
● Edible
● Medicinal
● Bulky
● Dry
■ Real world more complex
● 97 categories in Australia
Rules
Goods
categories
■ And 0 or 1 temperature category
● Frozen Temp
● Heat Sensitive Temp
● Cool Temp
● Room Temp
● Ambient Temp
■ Some warehouses/trucks are temperature controlled
Rule
checking
Co-location
Goods have rules to
check if they are
safe in the same
Truck
Sensor rules
Goods also have rules
to check if they are
safe in the
environment of a
location
(Warehouse or
Truck). 20 metrics,
some in common
E.g. Keep your
chickens cool
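The two kinds of rule check can be sketched as follows. The Goods and their categories come from the slides, but the rule tables are invented for illustration: the forbidden category pair and the 8 °C limit are assumptions, not the rules Kongo actually uses.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Goods:
    name: str
    categories: frozenset = field(default_factory=frozenset)


# Hypothetical rule tables, invented for this example only.
FORBIDDEN_PAIRS = {frozenset({"Hazardous", "Edible"})}
MAX_TEMPERATURE_C = {"Perishable": 8.0}


def colocation_violation(a, b):
    """True if some category of `a` may not share a Truck with one of `b`."""
    return any(
        frozenset({ca, cb}) in FORBIDDEN_PAIRS
        for ca in a.categories
        for cb in b.categories
    )


def sensor_violations(goods, metrics):
    """Return the categories of `goods` whose temperature limit is exceeded."""
    temperature = metrics.get("temperature_c")
    if temperature is None:
        return []
    return sorted(
        category
        for category in goods.categories
        if category in MAX_TEMPERATURE_C
        and temperature > MAX_TEMPERATURE_C[category]
    )


chickens = Goods("Chickens", frozenset({"Perishable", "Fragile", "Edible"}))
toxic_waste = Goods("Toxic Waste", frozenset({"Hazardous", "Bulky"}))
art = Goods("Art", frozenset({"Fragile"}))
```

A co-location check needs two Goods in the same Truck, while a sensor check needs one Goods plus the metrics of its current location, so the two checks consume different events.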
3 Application
Simulation
Logical steps
Create Goods,
Warehouses,
Trucks
Simulate next hour
Unload Goods into
Warehouses
Simulate Sensor
values (Trucks and
Warehouses)
Check Goods +
Sensor violations
(Goods in Trucks
and Warehouses)
Check Goods + co-location violations
(Goods on Trucks)
Load Trucks with
Goods, move
Trucks to random
Warehouses
repeat
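A rough sketch of the simulation loop, under stated assumptions: the step order follows the order in which the slide lists the steps (which may not match the real Kongo code), the sizes, names and sensor range are made up, and the rule checks are left as a comment.

```python
import random


def simulate(hours, num_warehouses=3, num_trucks=2, num_goods=6, seed=1):
    """Run the loop for `hours` steps and return the list of emitted events."""
    rng = random.Random(seed)
    warehouses = [f"warehouse-{i}" for i in range(num_warehouses)]
    # Each Truck is parked at some Warehouse; each Goods starts in a Warehouse.
    trucks = {f"truck-{i}": rng.choice(warehouses) for i in range(num_trucks)}
    location = {f"goods-{i}": rng.choice(warehouses) for i in range(num_goods)}
    events = []

    for hour in range(hours):
        # Unload: Goods on a Truck move into the Warehouse it is parked at.
        for goods, where in location.items():
            if where in trucks:
                location[goods] = trucks[where]
                events.append((hour, "unload", goods, where, trucks[where]))
        # Sensor values for every Truck and Warehouse.
        for place in list(trucks) + warehouses:
            events.append((hour, "sensor", place, round(rng.uniform(0, 30), 1)))
        # Rule checks would run here (sensor rules, then co-location rules).
        for truck, parked_at in trucks.items():
            # Load: the Truck takes the Goods in its current Warehouse ...
            for goods, where in location.items():
                if where == parked_at:
                    location[goods] = truck
                    events.append((hour, "load", goods, parked_at, truck))
            # ... and then drives to a random Warehouse.
            trucks[truck] = rng.choice(warehouses)
    return events
```

Load and unload events are tuples of (hour, kind, goods, from, to); sensor events are (hour, "sensor", place, value). Calling `simulate(3)` returns the events of three simulated hours, and a fixed seed makes the run repeatable.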
Architecture
Monolithic
[Diagram: the simulation loop from the logical steps above; both Check steps output Rule violations]
Architecture
De-coupled with
event streams
[Diagram: the same simulation loop, with event streams (Sensor events, Unload events, Load events) between the simulation steps and the Check steps, which output Rule violations]
Distributed
Architecture
Separate Kafka
producers and
consumers
Simulation has
perfect knowledge,
but violation rules
checking relies on
event stream data
[Diagram: the same loop split in two. Simulation = Kafka Producers; Violation rules checking = Kafka Consumers, which output Rule violations]
Design Goal
Deliver events
produced from each
location to Goods in
same location
[Diagram: four locations, each labelled “Events delivered to Goods in same location”]
Design
Variables
Topics and
Consumers=Goods
Topics: 1 (All locations in 1 topic) or Many (1 topic per location)
Consumers: Goods de-coupled from Consumers (Consumers < Goods), or Every Goods is a Consumer (Group) (Goods = Consumers)
Design
Variables
Problems?
Topics: 1 (All locations in 1 topic) or Many (1 topic per location). 100s of locations = topics ok, more not ok
Consumers: Goods de-coupled from Consumers (Consumers << Goods), or Every Goods is a Consumer (Group) (Goods == Consumers). Too many consumer groups not ok
Possible
Design 1
Multiple topics
Goods =
Consumers = many
[Diagram: each location has 0..n Goods]
Possible
Design 2
Single Topic
1 Consumer Group,
decoupled Goods
Another component
responsible for
mapping of location
to Goods
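One way the extra component in Design 2 might look, as a sketch only: it rebuilds the location of every Goods from RFID load/unload events (it has no access to the simulation's perfect knowledge) and fans each sensor event out to the Goods currently at that location. `LocationRegistry`, its method names and the `too_warm` check are hypothetical.

```python
from collections import defaultdict


class LocationRegistry:
    """Tracks which Goods are at which location, using only event data."""

    def __init__(self):
        self.goods_at = defaultdict(set)

    def on_rfid(self, kind, goods, warehouse, truck):
        # A load moves Goods from Warehouse to Truck; an unload does the reverse.
        if kind == "load":
            source, target = warehouse, truck
        else:
            source, target = truck, warehouse
        self.goods_at[source].discard(goods)
        self.goods_at[target].add(goods)

    def on_sensor(self, location, metrics, check):
        """Run `check(goods, metrics)` for every Goods at `location`."""
        violations = []
        for goods in sorted(self.goods_at[location]):
            problem = check(goods, metrics)
            if problem is not None:
                violations.append((goods, problem))
        return violations


def too_warm(goods, metrics):
    # Hypothetical rule: chickens must stay at or below 8 degrees Celsius.
    if goods == "chickens" and metrics["temperature_c"] > 8:
        return "too warm"
    return None


registry = LocationRegistry()
registry.on_rfid("unload", "chickens", "warehouse-1", "truck-7")
registry.on_rfid("load", "art", "warehouse-1", "truck-7")
warehouse_violations = registry.on_sensor("warehouse-1", {"temperature_c": 12}, too_warm)
truck_violations = registry.on_sensor("truck-7", {"temperature_c": 12}, too_warm)
```

With this shape, Kafka only has to deliver each event once to one consumer group; the fan-out to many Goods happens inside the application.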
Design
check
High Fan-out
How well does
Kafka work for
broadcast delivery
of the same event to
large numbers of
consumers?
Initial
benchmarking
100 Locations
100,000 Goods
Fan-out = 1:1000
Option 2 superior
Single topic, single
consumer group
[Chart: Relative Throughput (%), Option 1 vs Option 2]
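The fan-out figure is simple arithmetic, shown here with the benchmark's numbers. The sketch assumes Goods are spread evenly over locations and that Option 1 and Option 2 correspond to Possible Design 1 and Possible Design 2; the dictionary layout is just for illustration.

```python
locations = 100
goods = 100_000

# Goods that must see each location's events, assuming an even spread.
fan_out = goods // locations

# Design 1: one topic per location, every Goods is its own consumer group.
design_1 = {"topics": locations, "consumer_groups": goods}
# Design 2: one topic, one consumer group; fan-out happens in the application.
design_2 = {"topics": 1, "consumer_groups": 1}

print(f"fan-out = 1:{fan_out}")
```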
4 Kafka
Streams
1 of 4 Kafka APIs
Fishing on the
Congo
Kafka Streams = a
complex way of
fishing?!
But scalable
Streams
concurrency (tasks)
<= input topic
partitions
Streams
1 of 4 Kafka APIs
■ Kafka has 4 APIs: Producer, Consumer, Connector
and Streams!
■ The Streams API allows an application to act as a
stream processor,
● consuming an input stream from one or more topics,
● transforming the input streams to output streams, and
● producing an output stream to one or more topics
■ A stream is an unbounded, ordered, replayable,
continuously updating data set, consisting of
strongly typed key-value records.
Processor
Topology
DAGs of stream
processors (nodes)
that are connected by
streams (edges)
Processors transform
data by receiving one
input record, applying
an operation to it, and
producing output
records.
Streams
DSL
Streams and Tables
■ The Streams DSL has built-in abstractions for
streams and tables
● KStream, KTable, GlobalKTable, KGroupedStream, and
KGroupedTable.
■ The DSL supports a declarative, functional
programming style, with
● stateless transformations (e.g. map and filter) as well as
● stateful transformations such as aggregations (e.g. count and
reduce), joins, and windowing.
How to
compose
operations
Truck
overload!
Trucks have a
maximum load
weight
Built a streams
application to check
for overloading.
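The overload check combines a stateful step (a running weight per truck) with a stateless one (a filter against the limit). The sketch below mimics that shape in plain Python; it is not the Kafka Streams DSL, and `MAX_LOAD_KG`, `overload_events` and the weights are made up for the example.

```python
from collections import defaultdict

MAX_LOAD_KG = {"truck-1": 1000}  # hypothetical limit


def overload_events(weight_changes, max_load=MAX_LOAD_KG):
    """Yield (truck, total) whenever a truck's running total exceeds its limit.

    `weight_changes` is an iterable of (truck, delta_kg) records: positive for
    a load event, negative for an unload event.
    """
    totals = defaultdict(float)  # plays the role of a per-key state store
    for truck, delta in weight_changes:
        totals[truck] += delta  # stateful step: aggregate per key
        if totals[truck] > max_load[truck]:  # stateless step: filter
            yield truck, totals[truck]


stream = [("truck-1", 600), ("truck-1", 500), ("truck-1", -500), ("truck-1", 300)]
alerts = list(overload_events(stream))
```

Only the second record pushes the running total (1100 kg) over the limit, so it is the only one that produces an alert.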
Streaming
Problems?
Topology
Exceptions and
Floating trucks!
Understanding
and debugging
Streams
Topologies
Use Kafka Streams
Topology Visualizer!
https://zz85.github.io/kafka-streams-viz/
Streaming
Problems?
■ Invalid Topology errors
● Some tricky (non-obvious)
Kafka streams topology rules
● And cycles aren’t allowed
Invalid Topologies
Streaming
Problems?
■ Anti-gravity?
● Sometimes truck weights went negative!
● Solution: Turned on “exactly-once” transactional setting
● The transactional producer allows an application to send messages
to multiple partitions atomically.
● Weights no longer go negative
Negative truck
weights!
5 Scaling
Congo Inga rapids
Scaling
Congo Inga rapids
Scaling is easy?
100 warehouses
200 trucks
10,000 Goods
Scalability
alternatives
Scale out, up and
multiple clusters
Multiple clusters
enables flexible
scaling (cluster for
violations)
Different instance
sizes have different
network speeds
Larger
instances
reduce end-
to-end
latency
2 core instances c.f.
4 core instances
Higher concurrency
and faster network
We also offer 8 core
Kafka instances
(AWS R5’s +SSDs)
Total
resources
Kafka clusters and
application cores
Application used x2
server cores
Kubernetes works
well for application
deployment, scaling,
monitoring
Scaling is
hard (1)
Actually hard to
achieve linear
scalability
Why? Kafka is
scalable, but:
■ Hash Collisions
● Too many open files exceptions
● Due to increasing and eventually too many consumers
● Some consumers were timing out
● Why? Some consumers were not receiving any events
● 300 locations and 300 partitions, but the location keys hash to only
200 distinct partitions, so only 200 consumers receive events and the rest time out
● This is due to hash collisions: some partitions get more than 1 location,
others get 0
Key parking
problem
Well known problem
Knuth 1962
Ensure number of
keys >>> number of
partitions >=
number of
consumers (in a
group)
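The effect can be checked with a small experiment. The sketch below uses MD5 as a stand-in hash (the Kafka Java client's default partitioner hashes keys with murmur2), and the key names are hypothetical. It counts how many partitions receive at least one key and compares that with the expectation under uniform hashing, n(1 − (1 − 1/n)^k) for k keys and n partitions.

```python
import hashlib


def partition_for(key, num_partitions):
    """Map a key to a partition with a stand-in hash (not Kafka's murmur2)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def occupied_partitions(num_keys, num_partitions):
    """Count partitions that receive at least one of `num_keys` distinct keys."""
    return len(
        {partition_for(f"location-{i}", num_partitions) for i in range(num_keys)}
    )


def expected_occupied(num_keys, num_partitions):
    """Expected number of non-empty partitions under uniform hashing."""
    return num_partitions * (1 - (1 - 1 / num_partitions) ** num_keys)


if __name__ == "__main__":
    print(occupied_partitions(300, 300), round(expected_occupied(300, 300)))
    print(occupied_partitions(30_000, 300))
```

With 300 keys and 300 partitions the expectation is roughly 190 occupied partitions, the same order as the 200 reported on the slide. With 100 times more keys than partitions, an empty partition becomes vanishingly unlikely, which is the point of the keys >>> partitions rule.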
Scaling is
hard (2)
Cloudy with a
chance of
Rebalancing Storms
Rebalancing
storms
■ Rebalancing storms result in some consumers not
receiving events (drop in throughput) and a very
slow start up time for new consumers (> 20s)
■ Need to ensure consumers are started and are
polling before trying to add lots more consumers
■ So try to keep total number of consumers as low as
practical (next…)
Scaling is
hard (3)
Too much
(consumer)
scalability is bad
Consumers
Less is more
■ Even though we used the design with least
consumers…
■ If Kafka consumers take too long to read events and
process them, then more consumer threads
(and more partitions) are needed, impacting Kafka cluster
scalability
■ Solution? Minimize consumer response time
● Only use consumers for reading events
● Do event processing asynchronously or in separate thread pool
■ My #1 Kafka rule is
● “Kafka is easy to scale with the smallest number of consumers”
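A minimal sketch of the "read only, process elsewhere" pattern, using a queue as a stand-in for the consumer's poll loop. `check_rules` and `run_consumer` are hypothetical names, and a real consumer would also have to decide when to commit offsets, which this sketch ignores.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


def check_rules(event):
    """Stand-in for slow rule checking; returns a result for the event."""
    return f"checked {event}"


def run_consumer(events, workers=4):
    """Drain `events` quickly and hand each one to a worker pool."""
    futures = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while True:
            try:
                event = events.get_nowait()  # the "poll" step: read only
            except queue.Empty:
                break
            futures.append(pool.submit(check_rules, event))  # process elsewhere
    # Leaving the `with` block waits for all submitted work to finish.
    return [future.result() for future in futures]


events = queue.Queue()
for i in range(5):
    events.put(f"event-{i}")
results = run_consumer(events)
```

Because the reading thread never blocks on rule checking, one consumer can keep up with more events, so fewer consumers (and partitions) are needed.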
More
information
The End
■ Kongo code:
● https://github.com/instaclustr/kongo2
● https://github.com/instaclustr/kongokafkastreams
■ All blog series, including Kongo, and latest,
● Anomalia Machina
ᐨ Kafka+Cassandra+Kubernetes, and
● Geospatial Anomalia Machina
ᐨ Kafka+Cassandra+Kubernetes+Geospatial queries & indexing
● https://www.instaclustr.com/paul-brebner/
■ Visual Introduction To Kafka
● https://www.instaclustr.com/resource/apache-kafka-a-visual-introduction/
■ The Instaclustr Managed Platform
● https://www.instaclustr.com/platform/
● Free Trial
ᐨ https://console.instaclustr.com/user/signup

ApacheCon Berlin 2019: Kongo: Building a Scalable Streaming IoT Application using Apache Kafka

  • 1. Kongo: Building a Scalable Streaming IoT Application using Apache Kafka Paul Brebner instaclustr.com Technology Evangelist
  • 2. Overview What we’ll see on the journey up river 1. Kafka introduction 2. The Kongo problem 3. Application, Architecture and Design(s) 4. Streams extension 5. Scaling
  • 4. Instaclustr Managed Platform Open Source - Store, Analyze, Search, Explore, and in 2018 we added Stream – Kafka!
  • 5. What is Kafka? Message flow Distributed streams processing 1 Distributed Producers… 2 Send Messages 3 To Distributed Consumers 4 Via Kafka Cluster
  • 6. Kafka Key Benefits ■ Fast – high throughput and low latency ■ Scalable – horizontally scalable, just add nodes and partitions ■ Reliable – distributed and fault tolerant ■ Zero data loss ■ Open Source ■ Heterogeneous data sources and sinks ■ Available as an Instaclustr Managed service
  • 8. Producer Consumer Consumer Consumer Consumer ? Kafka is “pub-sub”, It’s loosely coupled, producers and consumers don’t know about each other.
  • 9. Filtering, or which consumers get which messages, is topic based. - Producers send messages to topics. - Consumers subscribe to topics of interest, e.g. Parties. - When they poll they only receive messages sent to those topics. None of these consumers will receive messages sent to the “Work” topic. Producer Consumer Consumer Consumer Consumer Topic “Parties” Topic “Work” Consumers subscribed to Topic “Parties” Consumers poll to receive messages from “Parties” Consumers not subscribed to “Work” messages
  • 10. Kafka works like an Amish Barn raising. Partitions and a consumer group share work across multiple consumers, the more partitions a topic has the more consumers it supports. Image: Paul Cyr ©2018 NorthernMainePhotos.com
  • 11. Kafka also works like the Clone Army It supports delivery of the same message to multiple consumers with consumer groups. Kafka doesn’t throw messages away immediately they are delivered, so the same message can be delivered to multiple consumer groups. Image: AKKHARAT JARUSILAWONG / Shutterstock.com
  • 12. Consumers subscribed to ”Parties” topic are allocated partitions. When they poll they will only get messages from their allocated partitions. Consumer Consumer Topic “Parties” Partition 1 Partition 2 Partition 3 Producer Consumer Group Consumer Consumer Group Consumer
  • 13. This enables consumers in the same group to share the work around. Each consumer gets only a subset of the available messages. Consumer Consumer Topic “Parties” Partition 1 Partition 2 Partition 3 Producer Consumer Group Consumer Consumers share work within groups
  • 14. Multiple groups enable message broadcasting. Messages are duplicated across groups, as each consumer group receives a copy of each message. Consumer Consumer Consumer Topic “Parties” Partition 1 Partition 2 Partition 3 Producer Consumer Group Consumer Consumer Group Messages are duplicated across Consumer groups
  • 15. 2 The Kongo Problem Amazon was taken Congo is 2nd biggest river, 5,000km And deepest Congo -> Kongo (Kingdom of)
  • 16. Kongo River Important for trade A logistics problem
  • 17. Our Logistics Problem Goods stored in Warehouses Goods moved between Warehouses in Trucks Checking rules in real-time
  • 18. Goods Chickens (Perishable, Fragile, Edible) Toxic Waste (Hazardous, Bullk) Vegetables (Perishable, edible) Art (Fragile)
  • 22. RFID readers Produce Truck load/unload events Interfaces between virtual and real
  • 23. Interfaces between virtual and real Sensors (e.g. shock and vibration, environmental gas sensor). About 20 metrics
  • 25. The Story Goods in a Warehouse
  • 26. The Story Warehouse Sensor events Check environment rules for all Goods in Warehouse Warehouse sensor events ? ? ? ? Warehouse sensor events No Goods
  • 29. The Story RFID Load event Art now in Truck RFID Load Event
  • 30. The Story Load truck with Drums
  • 31. The Story RFID Load Event Drums now in Truck Check Drums and Art co-location rules RFID Load Event ?
  • 32. The Story Truck drives to another warehouse
  • 33. The Story Truck sensor events Check environment rules for all Goods in Truck Truck sensor events ? ?
  • 35. The Story RFID Unload events Goods now in Warehouse Repeat from Start With lots more warehouses, goods and trucks! RFID Unload Events
  • 36. Rules Goods categories ■ Each Goods has 0 or more general Categories: ● Perishable ● Hazardous ● Fragile ● Edible ● Medicinal ● Bulky ● Dry ■ Real world more complex ● 97 categories in Australia
  • 37. Rules Goods categories ■ And 0 or 1 temperature category ● Frozen Temp ● Heat Sensitive Temp ● Cool Temp ● Room Temp ● Ambient Temp ■ Some warehouses/trucks are temperature controlled
  • 38. Rule checking Co-location Goods have rules to check if they are safe in the same Truck
  • 39. Sensor rules Goods to have rules to check if they are safe in the environment of a location - Warehouse or Trucks, 20 metrics, some in common E.g. Keep your chickens cool
  • 40. 3 Application Simulation Logical steps: Create Goods, Warehouses, Trucks; Simulate next hour; Unload Goods into Warehouses; Simulate Sensor values (Trucks and Warehouses); Check Goods + Sensor violations (Goods in Trucks and Warehouses); Check Goods + co-location violations (Goods on trucks); Load Trucks with Goods, move Trucks to random Warehouses; repeat
  • 41. Architecture Monolithic: Create Goods, Warehouses, Trucks; Simulate next hour; Unload Goods into Warehouses; Simulate Sensor values (Trucks and Warehouses); Check Goods + Sensor violations (Goods in Trucks and Warehouses); Check Goods + co-location violations (Goods on trucks); Load Trucks with Goods, move Trucks to random Warehouses; repeat. Rule violations
  • 42. Architecture De-coupled with event streams: Create Goods, Warehouses, Trucks; Simulate next hour; Unload Goods into Warehouses; Simulate Sensor values (Trucks and Warehouses); Check Goods + Sensor violations (Goods in Trucks and Warehouses); Check Goods + co-location violations (Goods on trucks); Load Trucks with Goods, move Trucks to random Warehouses; repeat. Event streams: Sensor events, Unload events, Load events, Rule violations
  • 43. Distributed Architecture Separate Kafka producers and consumers: Simulation (Kafka Producers) and Violation rules checking (Kafka Consumers). Create Goods, Warehouses, Trucks; Simulate next hour; Unload Goods into Warehouses; Simulate Sensor values (Trucks and Warehouses); Check Goods + Sensor violations (Goods in Trucks and Warehouses); Check Goods + co-location violations (Goods on trucks); Load Trucks with Goods, move Trucks to random Warehouses; repeat. Rule violations. Simulation has perfect knowledge, but violation rules checking relies on event stream data
  • 44. Design Goal Deliver events produced from each location to Goods in the same location
  • 45. Design Variables Topics and Consumers=Goods. Topics, 1 or Many: All locations in 1 topic, or 1 topic per location. Consumers, 1 or Many: Goods de-coupled from Consumers (Consumers < Goods), or Every Goods is a Consumer (Group) (Goods = Consumers)
  • 46. Design Variables Problems? Topics, 1 or Many: All locations in 1 topic, or 1 topic per location (100s of locations = topics ok, more not ok). Consumers, 1 or Many: Goods de-coupled from Consumers (Consumers << Goods), or Every Goods is a Consumer (Group) (Goods == Consumers; too many consumer groups not ok)
  • 47. Possible Design 1 Multiple topics Goods = Consumers = many 0..n 0..n 0..n 0..n
  • 48. Possible Design 2 Single Topic 1 Consumer Group, decoupled Goods Another component responsible for mapping of location to Goods
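The "another component" in Design 2 could look something like the sketch below. It is an assumption-laden illustration in plain Python with no Kafka client: the class name `LocationRouter`, its methods, and the event shape (a dict with a `"location"` key) are hypothetical and are not taken from the Kongo code.

```python
from collections import defaultdict

class LocationRouter:
    """Hypothetical component that maps each location to the Goods currently there."""

    def __init__(self):
        self._goods_at = defaultdict(set)  # location -> set of goods ids
        self._location_of = {}             # goods id -> current location

    def move(self, goods_id, location):
        """Record that goods_id is now at location (e.g. after a load/unload event)."""
        old = self._location_of.get(goods_id)
        if old is not None:
            self._goods_at[old].discard(goods_id)
        self._goods_at[location].add(goods_id)
        self._location_of[goods_id] = location

    def deliver(self, event):
        """Return (goods_id, event) pairs for every Goods at the event's location."""
        here = self._goods_at.get(event["location"], ())
        return [(goods_id, event) for goods_id in sorted(here)]
```

With this shape a small, fixed number of consumers read the single topic, and the fan-out from one location event to many Goods happens inside the application rather than inside Kafka.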
  • 49. Design check High Fan-out How well does Kafka work for broadcast delivery of the same event to large numbers of consumers?
  • 50. Initial benchmarking 100 Locations, 100,000 Goods, Fan-out = 1:1000. Option 2 superior (single topic, single consumer group). [Bar chart: Relative Throughput (%) for Option 1 and Option 2]
  • 51. 4 Kafka Streams 1 of 4 Kafka APIs Fishing on the Congo Kafka Streams = a complex way of fishing?!
  • 53. Streams 1 of 4 Kafka APIs ■ Kafka has 4 APIs, Producer, Consumer, Connector and Streams! ■ The Streams API allows an application to act as a stream processor ● consuming an input stream from one or more topics and ● producing an output stream to one or more topics ● transforming the input streams to output streams ■ A stream is an unbounded, ordered, replayable, continuously updating data set, consisting of strongly typed key-value records.
  • 54. Processor Topology DAGs of stream processors (nodes) that are connected by streams (edges) Processors transform data by receiving one input record, applying an operation to it, and producing output records.
  • 55. Streams DSL Streams and Tables ■ The Streams DSL has built-in abstractions for streams and tables ● KStream, KTable, GlobalKTable, KGroupedStream, and KGroupedTable. ■ The DSL supports a Declarative functional programming style, with ● stateless transformations (e.g. map and filter) as well as ● stateful transformations such as aggregations (e.g. count and reduce), joins, and windowing.
  • 57. Truck overload! Trucks have a maximum load weight Built a streams application to check for overloading.
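The overload check is a stateful aggregation: a running total per truck, compared against that truck's limit. The sketch below shows the idea in plain Python rather than the Kafka Streams DSL. It assumes (this is not stated in the slides) that load events carry a positive weight delta and unload events a negative one; the function name and event shape are hypothetical.

```python
def overloaded_trucks(events, max_weight):
    """Aggregate weight deltas per truck and report overload violations.

    events: iterable of (truck_id, weight_delta) pairs, in arrival order.
    max_weight: dict mapping truck_id to its maximum load weight.
    Returns (totals, violations), where violations lists (truck_id, total)
    each time a truck's running total exceeds its limit.
    """
    totals = {}
    violations = []
    for truck, delta in events:
        totals[truck] = totals.get(truck, 0) + delta  # stateful per-key sum
        if totals[truck] > max_weight[truck]:
            violations.append((truck, totals[truck]))
    return totals, violations
```

In a streams application the `totals` dictionary would correspond to a state store keyed by truck, which is why duplicated or partially applied updates can corrupt the running total.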
  • 59. Understanding and debugging Streams Topologies Use Kafka Streams Topology Visualizer! https://zz85.github.io/kafka-streams-viz/
  • 60. Streaming Problems? ■ Invalid Topology errors ● Some tricky (non-obvious) Kafka streams topology rules ● And cycles aren’t allowed Invalid Topologies
  • 61. Streaming Problems? ■ Anti-gravity? ● Sometimes truck weights went negative! ● Solution: Turned on “exactly-once” transactional setting ● The transactional producer allows an application to send messages to multiple partitions atomically. ● Weights no longer go negative Negative truck weights!
  • 63. Scaling Congo Inga rapids Scaling is easy? 100 warehouses 200 trucks 10,000 Goods
  • 64. Scalability alternatives Scale out, up and multiple clusters Multiple clusters enables flexible scaling (cluster for violations) Different instance sizes have different network speeds
  • 65. Larger instances reduce end-to-end latency 2 core instances cf. 4 core instances: higher concurrency and faster network. We also offer 8 core Kafka instances (AWS R5s + SSDs)
  • 66. Total resources Kafka clusters and application cores Application used x2 server cores Kubernetes works well for application deployment, scaling, monitoring
  • 67. Scaling is hard (1) Actually hard to achieve linear scalability. Why? Kafka is scalable, but: ■ Hash Collisions ● "Too many open files" exceptions ● Due to an increasing, and eventually too large, number of consumers ● Some consumers were timing out ● Why? Some consumers were not receiving any events ● 300 locations and 300 partitions, but only 200 unique hash values (partitions), so only 200 consumers receive events and the rest time out ● This is due to hash collisions: some partitions get more than 1 location, others get 0
  • 68. Key parking problem Well known problem Knuth 1962 Ensure number of keys >>> number of partitions >= number of consumers (in a group)
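The "300 keys, 300 partitions, about 200 used" observation is what uniform hashing predicts. With k keys hashed uniformly into n partitions, the expected number of non-empty partitions is n(1 - (1 - 1/n)^k), which for k = n = 300 is roughly 190. The sketch below checks this numerically. It uses MD5 as a stand-in hash purely for illustration (an assumption; it is not Kafka's partitioner), and the key names are made up.

```python
import hashlib

def partition_for(key, num_partitions):
    """Map a key to a partition using a stand-in hash (not Kafka's partitioner)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

def occupied_partitions(keys, num_partitions):
    """Count how many partitions receive at least one key."""
    return len({partition_for(k, num_partitions) for k in keys})

def expected_occupied(num_keys, num_partitions):
    """Expected number of non-empty partitions under uniform hashing."""
    return num_partitions * (1 - (1 - 1 / num_partitions) ** num_keys)

keys = [f"location-{i}" for i in range(300)]  # hypothetical location keys
print(occupied_partitions(keys, 300), round(expected_occupied(300, 300), 1))
```

Raising the number of keys well above the number of partitions pushes the expected occupancy towards 100%, which is the point of the "keys >>> partitions" rule on the slide.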
  • 69. Scaling is hard (2) Cloudy with a chance of Rebalancing Storms
  • 70. Rebalancing storms ■ Rebalancing storms result in some consumers not receiving events (drop in throughput) and a very slow start up time for new consumers (> 20s) ■ Need to ensure consumers are started and are polling before trying to add lots more consumers ■ So try to keep total number of consumers as low as practical (next…)
  • 71. Scaling is hard (3) Too much (consumer) scalability is bad
  • 72. Consumers Less is more ■ Even though we used the design with the fewest consumers… ■ If Kafka consumers take too long to read and process events, you need more consumer threads (and more partitions), which hurts Kafka cluster scalability ■ Solution? Minimize consumer response time ● Only use consumers for reading events ● Do event processing asynchronously or in a separate thread pool ■ My #1 Kafka rule is ● “Kafka is easy to scale with the smallest number of consumers”
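The "only read in the consumer, process elsewhere" pattern can be sketched as below. This is a plain-Python illustration with no Kafka client: `poll` and `process` are hypothetical callables supplied by the caller, and returning `None` from `poll` to signal the end of the stream is a convention invented for this example. A real consumer would also need to coordinate offset commits with the completion of the asynchronous work, which this sketch does not attempt.

```python
from concurrent.futures import ThreadPoolExecutor

def run_consumer(poll, process, max_workers=4):
    """Read batches in the calling thread and process events in a thread pool.

    poll: callable returning a list of events, or None when the stream ends
          (the None sentinel is an assumption made for this example).
    process: callable applied to each event on a worker thread.
    Returns the results in the order the events were read.
    """
    futures = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while True:
            batch = poll()  # the consumer thread only reads
            if batch is None:
                break
            # Hand each event off immediately so polling is never blocked
            # by slow processing.
            futures.extend(pool.submit(process, event) for event in batch)
    return [f.result() for f in futures]
```

Because the polling loop returns quickly, fewer consumer threads (and so fewer partitions) are needed for the same event rate.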
  • 73. More information The End ■ Kongo code: ● https://github.com/instaclustr/kongo2 ● https://github.com/instaclustr/kongokafkastreams ■ All blog series, including Kongo, and latest, ● Anomalia Machina: Kafka+Cassandra+Kubernetes, and ● Geospatial Anomalia Machina: Kafka+Cassandra+Kubernetes+Geospatial queries & indexing ● https://www.instaclustr.com/paul-brebner/ ■ Visual Introduction To Kafka ● https://www.instaclustr.com/resource/apache-kafka-a-visual-introduction/ ■ The Instaclustr Managed Platform ● https://www.instaclustr.com/platform/ ● Free Trial: https://console.instaclustr.com/user/signup