ApacheCon Berlin 2019: Kongo: Building a Scalable Streaming IoT Application using Apache Kafka

Kongo: Building a Scalable
Streaming IoT Application
using Apache Kafka
Paul Brebner
instaclustr.com Technology Evangelist

Overview
What we’ll see on
the journey up river
1. Kafka introduction
2. The Kongo problem
3. Application, Architecture and Design(s)
4. Streams extension
5. Scaling

1 Kafka
Introduction
Instaclustr
Managed
Platform
Multi-cloud
For scale,
performance,
availability,
integration, security

Instaclustr
Managed
Platform
Open Source -
Store, Analyze,
Search, Explore,
and in 2018 we
added Stream –
Kafka!

What is Kafka?
Message flow
Distributed
streams
processing
1 Distributed Producers…
2 Send Messages
3 To Distributed Consumers
4 Via Kafka Cluster

Kafka
Key Benefits
■ Fast – high throughput and low latency
■ Scalable – horizontally scalable, just add nodes and
partitions
■ Reliable – distributed and fault tolerant
■ Zero data loss
■ Open Source
■ Heterogeneous data sources and sinks
■ Available as an Instaclustr Managed service

How does
Kafka work?
Excerpt from…

Producer
Consumer
Consumer
Consumer
Consumer
?
Kafka is “pub-sub”. It’s loosely coupled:
producers and consumers don’t know about
each other.

Filtering, or which consumers get which messages, is topic based.
- Producers send messages to topics.
- Consumers subscribe to topics of interest, e.g. Parties.
- When they poll they only receive messages sent to those topics.
None of these consumers will receive messages sent to the “Work” topic.
Producer
Consumer
Consumer
Consumer
Consumer
Topic “Parties”
Topic “Work”
Consumers subscribed
to Topic “Parties”
Consumers poll to
receive messages
from “Parties”
Consumers not subscribed to
“Work” messages
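Topic-based filtering can be sketched with a toy in-memory broker. This is an illustration only, not the Kafka client API: `ToyBroker` and `ToyConsumer` are hypothetical names, and partitions, consumer groups and networking are left out.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a Kafka cluster: one append-only log per topic."""

    def __init__(self):
        self.logs = defaultdict(list)

    def send(self, topic, message):
        self.logs[topic].append(message)


class ToyConsumer:
    """Receives messages only from the topics it has subscribed to."""

    def __init__(self, broker):
        self.broker = broker
        self.positions = {}  # topic -> index of the next unread message

    def subscribe(self, topic):
        self.positions.setdefault(topic, 0)

    def poll(self):
        received = []
        for topic, position in list(self.positions.items()):
            log = self.broker.logs[topic]
            received.extend(log[position:])
            self.positions[topic] = len(log)
        return received


broker = ToyBroker()
consumer = ToyConsumer(broker)
consumer.subscribe("Parties")

broker.send("Parties", "Party at 8pm")
broker.send("Work", "Meeting at 9am")  # nobody here subscribed to "Work"

first_poll = consumer.poll()   # only the "Parties" message
second_poll = consumer.poll()  # nothing new since the last poll
```

The producer never names a consumer; it only names a topic, which is what keeps the two sides loosely coupled.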

Kafka works like an Amish barn raising.
Partitions and a consumer group share
work across multiple consumers. The more
partitions a topic has, the more consumers
it supports.
Image: Paul Cyr ©2018 NorthernMainePhotos.com

Kafka also works like the Clone Army.
It supports delivery of the same message to
multiple consumers with consumer groups.
Kafka doesn’t throw messages away
as soon as they are delivered, so the
same message can be delivered to
multiple consumer groups.
Image: AKKHARAT JARUSILAWONG / Shutterstock.com

Consumers subscribed to the “Parties” topic are allocated partitions.
When they poll, they only get messages from their allocated
partitions.
Consumer
Consumer
Topic “Parties”
Partition 1
Partition 2
Partition 3
Producer
Consumer Group
Consumer
Consumer Group
Consumer

This enables consumers in the same group to share the work
around. Each consumer gets only a subset of the available
messages.
Consumer
Consumer
Topic “Parties”
Partition 1
Partition 2
Partition 3
Producer
Consumer Group
Consumer
Consumers share
work within groups
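How partitions might be shared out inside one group can be sketched as a simple round-robin. This is a simplification under stated assumptions: `assign_partitions` is a hypothetical helper, and Kafka's real assignors (range, round-robin, sticky) handle multiple topics and rebalancing.

```python
def assign_partitions(partitions, consumers):
    """Spread the partitions of one topic over the consumers of one group."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = ["Partition 1", "Partition 2", "Partition 3"]

# Two consumers share three partitions between them.
two_consumers = assign_partitions(partitions, ["consumer-a", "consumer-b"])

# With four consumers, one of them is left with no partition at all.
four_consumers = assign_partitions(
    partitions, ["consumer-a", "consumer-b", "consumer-c", "consumer-d"]
)
```

The second call shows why the number of partitions caps the useful number of consumers in a group: the fourth consumer has nothing to read.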

Multiple groups enable message broadcasting. Messages
are duplicated across groups, as each consumer group
receives a copy of each message.
Consumer
Consumer
Consumer
Topic “Parties”
Partition 1
Partition 2
Partition 3
Producer
Consumer Group
Consumer
Consumer Group
Messages are
duplicated across
Consumer groups
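Broadcasting across groups works because the log is kept and each group tracks its own read position. A minimal sketch, assuming a single partition and a hypothetical `ToyTopic` class (not the Kafka API):

```python
class ToyTopic:
    """One retained log plus an independent read offset per consumer group."""

    def __init__(self):
        self.log = []
        self.group_offsets = {}  # group name -> index of next unread message

    def send(self, message):
        self.log.append(message)

    def poll(self, group):
        offset = self.group_offsets.get(group, 0)
        messages = self.log[offset:]
        self.group_offsets[group] = len(self.log)
        return messages


parties = ToyTopic()
parties.send("Party at 8pm")
parties.send("Bring snacks")

group_1 = parties.poll("group-1")        # both messages
group_2 = parties.poll("group-2")        # the same two messages again
group_1_again = parties.poll("group-1")  # nothing new for group-1
```

Reading a message does not remove it from the log, so adding a group costs a second read rather than a second copy of the data.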

2 The Kongo
Problem
Amazon was taken
Congo is 2nd biggest
river, 5,000km
And deepest
Congo -> Kongo
(Kingdom of)

Kongo River
Important for trade
A logistics problem

Our
Logistics
Problem
Goods stored in
Warehouses
Goods moved
between
Warehouses in
Trucks
Checking rules in
real-time

Goods
Chickens
(Perishable, Fragile,
Edible)
Toxic Waste
(Hazardous, Bulky)
Vegetables
(Perishable, Edible)
Art (Fragile)

Interfaces
between
virtual and
real
RFID tags

Interfaces
between
virtual and
real
RFID readers
Produce Truck
load/unload events

Interfaces
between
virtual and
real
Sensors (e.g. shock
and vibration,
environmental gas
sensor).
About 20 metrics

The Story
Goods in a
Warehouse

The Story
Warehouse Sensor
events
Check environment
rules for all Goods
in Warehouse
Warehouse sensor events
Warehouse sensor events No Goods

The Story
RFID Load event
Art now in Truck
RFID Load Event

The Story
Load truck with
Drums

The Story
RFID Load Event
Drums now in Truck
Check Drums and
Art co-location rules
RFID Load Event

The Story
Truck drives to
another warehouse

The Story
Truck sensor events
Check environment
rules for all Goods
in Truck
Truck sensor events

The Story
Unload Drums and
Art

The Story
RFID Unload events
Goods now in
Warehouse
Repeat from Start
With lots more
warehouses, goods
and trucks!
RFID Unload Events

Rules
Goods
categories
■ Each Goods has 0 or more general Categories:
● Perishable
● Hazardous
● Fragile
● Edible
● Medicinal
● Bulky
● Dry
■ Real world more complex
● 97 categories in Australia

Rules
Goods
categories
■ And 0 or 1 temperature category
● Frozen Temp
● Heat Sensitive Temp
● Cool Temp
● Room Temp
● Ambient Temp
■ Some warehouses/trucks are temperature controlled

Rule
checking
Co-location
Goods have rules to
check if they are
safe in the same
Truck
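A co-location check could look roughly like the sketch below. The forbidden category pairs are invented for illustration (the slides do not list the actual rules), and `Goods` and `colocation_violations` are hypothetical names.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Goods:
    name: str
    categories: frozenset


# Assumption: example pairs of categories that must not share a Truck.
FORBIDDEN_PAIRS = {
    frozenset({"Hazardous", "Edible"}),
    frozenset({"Bulky", "Fragile"}),
}


def colocation_violations(truck_load):
    """Return (goods_a, goods_b, category_a, category_b) for each clash."""
    violations = []
    for a, b in combinations(truck_load, 2):
        for cat_a in sorted(a.categories):
            for cat_b in sorted(b.categories):
                if frozenset({cat_a, cat_b}) in FORBIDDEN_PAIRS:
                    violations.append((a.name, b.name, cat_a, cat_b))
    return violations


chickens = Goods("Chickens", frozenset({"Perishable", "Fragile", "Edible"}))
toxic_waste = Goods("Toxic Waste", frozenset({"Hazardous", "Bulky"}))
art = Goods("Art", frozenset({"Fragile"}))
vegetables = Goods("Vegetables", frozenset({"Perishable", "Edible"}))

unsafe = colocation_violations([chickens, toxic_waste])
safe = colocation_violations([art, vegetables])
```

Checking every pair is quadratic in the number of Goods on a Truck, which is acceptable for truck-sized loads.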

Sensor rules
Goods have rules
to check if they are
safe in the
environment of a
location
(Warehouse or
Truck). 20 metrics,
some in common
E.g. Keep your
chickens cool
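A sensor rule check can be sketched for a single metric. The temperature limits below are made-up values, and `sensor_violations` is a hypothetical helper; the real application checks about 20 metrics.

```python
# Assumption: illustrative limits in degrees Celsius per temperature category.
TEMPERATURE_LIMITS = {
    "Frozen Temp": (-30.0, -10.0),
    "Cool Temp": (0.0, 8.0),
    "Room Temp": (15.0, 25.0),
}


def sensor_violations(goods_in_location, sensor_event):
    """Check one sensor event against every Goods in the same location.

    goods_in_location: list of (name, temperature_category or None)
    sensor_event: dict mapping metric name to measured value
    """
    violations = []
    temperature = sensor_event.get("temperature")
    for name, category in goods_in_location:
        limits = TEMPERATURE_LIMITS.get(category)
        if limits is None or temperature is None:
            continue  # no temperature rule applies to this Goods
        low, high = limits
        if not low <= temperature <= high:
            violations.append((name, "temperature", temperature))
    return violations


truck_contents = [("Chickens", "Cool Temp"), ("Art", None)]
warm_truck = sensor_violations(truck_contents, {"temperature": 21.5})
cool_truck = sensor_violations(truck_contents, {"temperature": 4.0})
```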

3 Application
Simulation
Logical steps
Create Goods,
Warehouses,
Trucks
Simulate next hour
Unload Goods into
Warehouses
Simulate Sensor
values (Trucks and
Warehouses)
Check Goods +
Sensor violations
(Goods in Trucks
and Warehouses)
Check Goods + co-
location violations
(Goods on trucks)
Load Trucks with
Goods, move
Trucks to random
Warehouses
repeat

Architecture
Create Goods,
Warehouses,
Trucks
Simulate next hour
Unload Goods into
Warehouses
Simulate Sensor
values (Trucks and
Warehouses)
Check Goods +
Sensor violations
(Goods in Trucks
and Warehouses)
Check Goods + co-
location violations
(Goods on trucks)
Load Trucks with
Goods, move
Trucks to random
Warehouses
repeat
Rule violations
Monolithic
Rule violations

Architecture
Create Goods,
Warehouses,
Trucks
Simulate next hour
Unload Goods into
Warehouses
Simulate Sensor
values (Trucks and
Warehouses)
Check Goods +
Sensor violations
(Goods in Trucks
and Warehouses)
Check Goods + co-
location violations
(Goods on trucks)
Load Trucks with
Goods, move
Trucks to random
Warehouses
repeat
Event streams
Rule violations
Sensor events
Unload events
Load events
De-coupled with
event streams

Distributed
Architecture
Create Goods,
Warehouses,
Trucks
Simulate next hour
Unload Goods into
Warehouses
Simulate Sensor
values (Trucks and
Warehouses)
Check Goods +
Sensor violations
(Goods in Trucks
and Warehouses)
Check Goods + co-
location violations
(Goods on trucks)
Load Trucks with
Goods, move
Trucks to random
Warehouses
repeat
Rule violations
Separate Kafka
producers and
consumers
Simulation has
perfect knowledge,
but violation rules
checking relies on
event stream data
Simulation
Kafka Producers
Violation rules checking
Kafka Consumers

Design Goal
Deliver events
produced from each
location to Goods in
same location
Events delivered to Goods
in same location

Design
Variables
Topics and
Consumers=Goods
Topics (1 to Many):
All locations in 1 topic, or 1 topic per location
Consumers (1 to Many):
Goods de-coupled from Consumers (Consumers < Goods), or
every Goods is a Consumer (Group) (Goods = Consumers)

Design
Variables
Problems?
100s of locations (=
topics) is ok, more is
not ok
Too many consumer
groups is not ok
Topics (1 to Many):
All locations in 1 topic, or 1 topic per location
Consumers (1 to Many):
Goods de-coupled from Consumers (Consumers << Goods), or
every Goods is a Consumer (Group) (Goods == Consumers)

Possible
Design 1
Multiple topics
Goods =
Consumers = many
0..n

Possible
Design 2
Single Topic
1 Consumer Group,
decoupled Goods
Another component
responsible for
mapping of location
to Goods
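The extra component in Design 2 can be pictured as an index from location to the Goods currently there. This is a sketch under assumptions, not the Kongo code: `LocationIndex`, `handle_sensor_event` and the event fields are hypothetical.

```python
from collections import defaultdict


class LocationIndex:
    """Tracks which Goods are at which location (Warehouse or Truck)."""

    def __init__(self):
        self._goods_at = defaultdict(set)
        self._location_of = {}

    def move(self, goods_id, location):
        # Called for RFID load and unload events.
        old = self._location_of.get(goods_id)
        if old is not None:
            self._goods_at[old].discard(goods_id)
        self._location_of[goods_id] = location
        self._goods_at[location].add(goods_id)

    def goods_at(self, location):
        return set(self._goods_at.get(location, ()))


def handle_sensor_event(index, event, check_rule):
    """A consumer looks up the affected Goods instead of being one per Goods."""
    return [check_rule(goods_id, event)
            for goods_id in sorted(index.goods_at(event["location"]))]


index = LocationIndex()
index.move("art-1", "warehouse-7")
index.move("drums-1", "warehouse-7")
index.move("art-1", "truck-3")  # RFID load event: Art now in Truck

checked = handle_sensor_event(
    index,
    {"location": "warehouse-7", "temperature": 30.0},
    lambda goods_id, event: (goods_id, event["temperature"]),
)
```

With this mapping the number of consumers no longer has to grow with the number of Goods.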

Design
check
High Fan-out
How well does
Kafka work for
broadcast delivery
of the same event to
large numbers of
consumers?

Initial
benchmarking
100 Locations
100,000 Goods
Fan-out = 1:1000
Option 2 superior
Single topic, single
consumer group
[Chart: Relative Throughput (%), Option 1 vs Option 2]

4 Kafka
Streams
1 of 4 Kafka APIs
Fishing on the
Congo
Kafka Streams = a
complex way of
fishing?!

But scalable
Streams
concurrency (tasks)
<= input topic
partitions

Streams
1 of 4 Kafka APIs
■ Kafka has 4 APIs: Producer, Consumer, Connector
and Streams
■ The Streams API allows an application to act as a
stream processor
● consuming an input stream from one or more topics and
● producing an output stream to one or more topics
● transforming the input streams to output streams
■ A stream is an unbounded, ordered, replayable,
continuously updating data set, consisting of
strongly typed key-value records.

Processor
Topology
DAGs of stream
processors (nodes)
that are connected by
streams (edges)
Processors transform
data by receiving one
input record, applying
an operation to it, and
producing output
records.

Streams
DSL
Streams and Tables
■ The Streams DSL has built-in abstractions for
streams and tables
● KStream, KTable, GlobalKTable, KGroupedStream, and
KGroupedTable.
■ The DSL supports a declarative functional
programming style, with
● stateless transformations (e.g. map and filter) as well as
● stateful transformations such as aggregations (e.g. count and
reduce), joins, and windowing.

Truck
overload!
Trucks have a
maximum load
weight
Built a streams
application to check
for overloading.
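The overload check is a stateful aggregation: keep a running weight per truck and flag totals above the limit. The sketch below shows the idea in plain Python rather than the Streams DSL, where a comparable shape would be a group-by-key, an aggregate and a filter. The event format and limits are assumptions, not the author's topology.

```python
from collections import defaultdict


def overloaded_trucks(events, max_load_kg):
    """Yield (truck_id, total_kg) whenever a truck exceeds its maximum load.

    events: iterable of (truck_id, delta_kg); loads are positive,
    unloads are negative.
    """
    totals = defaultdict(int)  # running state, keyed by truck
    for truck_id, delta_kg in events:
        totals[truck_id] += delta_kg
        if totals[truck_id] > max_load_kg[truck_id]:
            yield truck_id, totals[truck_id]


limits = {"truck-1": 1000, "truck-2": 1000}
events = [
    ("truck-1", 600),
    ("truck-2", 300),
    ("truck-1", 500),   # truck-1 is now over its limit
    ("truck-1", -500),
    ("truck-2", 100),
]
alerts = list(overloaded_trucks(events, limits))
```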

Streaming
Problems?
Topology
Exceptions and
Floating trucks!

Understanding
and debugging
Streams
Topologies
Use Kafka Streams
Topology Visualizer!
https://zz85.github.io/kafka-streams-viz/

Streaming
Problems?
■ Invalid Topology errors
● Some tricky (non-obvious)
Kafka streams topology rules
● And cycles aren’t allowed
Invalid Topologies

Streaming
Problems?
■ Anti-gravity?
● Sometimes truck weights went negative!
● Solution: Turned on “exactly-once” transactional setting
● The transactional producer allows an application to send messages
to multiple partitions atomically.
● Weights no longer go negative
Negative truck
weights!

5 Scaling
Congo Inga rapids
Scaling is easy?
100 warehouses
200 trucks
10,000 Goods

Scalability
alternatives
Scale out, up and
multiple clusters
Multiple clusters
enables flexible
scaling (cluster for
violations)
Different instance
sizes have different
network speeds

Larger
instances
reduce end-
to-end
latency
2 core instances c.f.
4 core instances
Higher concurrency
and faster network
We also offer 8 core
Kafka instances
(AWS R5’s +SSDs)

Total
resources
Kafka clusters and
application cores
Application used x2
server cores
Kubernetes works
well for application
deployment, scaling,
monitoring

Scaling is
hard (1)
Actually hard to
achieve linear
scalability
Why? Kafka is
scalable, but:
■ Hash Collisions
● Too many open files exceptions
● Due to increasing and eventually too many consumers
● Some consumers were timing out
● Why? Some consumers were not receiving any events
● 300 locations and 300 partitions, but only 200 unique hash values, so
only 200 consumers receive events; the rest time out
● This is due to hash collisions: some partitions get > 1 location,
others 0
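A quick estimate shows why so many partitions stay empty. Assuming keys hash uniformly and independently (an idealisation of any real partitioner), a given partition is missed by all k keys with probability (1 - 1/n)^k, so the expected number of partitions that receive at least one key is n * (1 - (1 - 1/n)^k).

```python
def expected_busy_partitions(keys, partitions):
    """Expected number of partitions that receive at least one key."""
    empty_probability = (1 - 1 / partitions) ** keys
    return partitions * (1 - empty_probability)


# 300 location keys hashed over 300 partitions.
busy = expected_busy_partitions(300, 300)
idle = 300 - busy
```

For 300 keys and 300 partitions this comes out at roughly 190 busy partitions, the same order as the 200 observed, leaving over a third of the consumers with nothing to read.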

Key parking
problem
Well known problem
Knuth 1962
Ensure number of
keys >>> number of
partitions >=
number of
consumers (in a
group)
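The effect of the rule can be seen by hashing keys onto partitions and counting how many partitions are hit. SHA-256 is used here only as a deterministic stand-in; Kafka's default partitioner uses a different hash, and the `location-N` key names are made up.

```python
import hashlib


def partition_for(key, partitions):
    """Map a key to a partition with a stand-in hash (not Kafka's own)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions


def busy_partitions(key_count, partitions):
    """Count the partitions that receive at least one of the keys."""
    return len({partition_for(f"location-{i}", partitions)
                for i in range(key_count)})


few_keys = busy_partitions(300, 300)      # keys == partitions: many stay empty
many_keys = busy_partitions(30_000, 300)  # keys >> partitions: all are used
```

With 100 times more keys than partitions, every partition (and so every consumer in the group) gets work.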

Scaling is
hard (2)
Cloudy with a
chance of
Rebalancing Storms

Rebalancing
storms
■ Rebalancing storms result in some consumers not
receiving events (drop in throughput) and a very
slow start up time for new consumers (> 20s)
■ Need to ensure consumers are started and are
polling before trying to add lots more consumers
■ So try to keep total number of consumers as low as
practical (next…)

Scaling is
hard (3)
Too much
(consumer)
scalability is bad

Consumers
Less is more
■ Even though we used the design with least
consumers…
■ If Kafka consumers take too long to read events and
process them, then you need more consumer threads
(and more partitions), which hurts Kafka cluster
scalability
■ Solution? Minimize consumer response time
● Only use consumers for reading events
● Do event processing asynchronously or in separate thread pool
■ My #1 Kafka rule is
● “Kafka is easy to scale with the smallest number of consumers”

More
information
The End
■ Kongo code:
● https://github.com/instaclustr/kongo2
● https://github.com/instaclustr/kongokafkastreams
■ All blog series, including Kongo, and latest,
● Anomalia Machina
- Kafka+Cassandra+Kubernetes, and
● Geospatial Anomalia Machina
- Kafka+Cassandra+Kubernetes+Geospatial queries & indexing
● https://www.instaclustr.com/paul-brebner/
■ Visual Introduction To Kafka
● https://www.instaclustr.com/resource/apache-kafka-a-visual-introduction/
■ The Instaclustr Managed Platform
● https://www.instaclustr.com/platform/
● Free Trial
- https://console.instaclustr.com/user/signup
