A unified analytics platform with Kafka and Flink | Stephan Ewen, Ververica

© 2019 Ververica
Stephan Ewen
CTO @ Ververica, Apache Flink PMC
A Unified Analytics Platform with Apache Flink and Apache Kafka
a.k.a. Unified Streaming & Batch Processing

Apache Kafka and Apache Flink

Analytics and Applications on Streaming Data & Data-at-Rest
Log Storage | File Storage | DB/Tables
JDBC | Debezium

Flink Runtime
Stateful Stream- and Batch Processing
DataStream API StateFun
SQL & Table API
more declarative
more explicit control

How big can you go? – Alibaba Singles Day
Real-time Data Applications: Search, Rec., Security, BI, Ads
incl. sub-second updates to the GMV dashboard
Infrastructure: >5K nodes, >500K CPU cores
Data Size: 985PB
Throughput (Peak): 2.5B events/sec
Latency: Sub-sec
State Size (Biggest): 100TB

How small can you go? - U-Hopper FogGuru
Cluster of 5 Raspberry Pi 3b+
Data volume: 800 events/sec
Docker Swarm + Flink + Mosquitto
“The Fridge”

Streaming & Batch SQL
= SQL!

Our Sample Setup
Data flow: real-time events, ingest, historical data
Queries: Real-time Queries, Historical Queries, Combined Queries

SQL – Static Data (Batch Case)
Input table "clicks":
user  cTime     url
Mary  12:00:00  https://…
Bob   12:00:00  https://…
Mary  12:00:02  https://…
Liz   12:00:03  https://…
Query:
SELECT
  user,
  COUNT(url) as cnt
FROM clicks
GROUP BY user
Result:
user  cnt
Mary  2
Bob   1
Liz   1
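A minimal sketch of the batch case, using Python's built-in sqlite3 module as a stand-in SQL engine (an assumption for illustration only; the talk runs this query on Flink). The rows are the ones on the slide, with the elided URLs kept as placeholders, and the helper name `batch_counts` is hypothetical.

```python
import sqlite3

# Rows from the slide; the URLs are elided there, so the placeholder is kept as-is.
CLICKS = [
    ("Mary", "12:00:00", "https://…"),
    ("Bob", "12:00:00", "https://…"),
    ("Mary", "12:00:02", "https://…"),
    ("Liz", "12:00:03", "https://…"),
]


def batch_counts(rows):
    """Load all rows first, then run the slide's GROUP BY query once."""
    conn = sqlite3.connect(":memory:")
    # "user" is quoted so it is always treated as a column name.
    conn.execute('CREATE TABLE clicks ("user" TEXT, cTime TEXT, url TEXT)')
    conn.executemany("INSERT INTO clicks VALUES (?, ?, ?)", rows)
    result = conn.execute(
        'SELECT "user", COUNT(url) AS cnt FROM clicks GROUP BY "user"'
    ).fetchall()
    conn.close()
    # Row order of a GROUP BY is not guaranteed, so return a dict.
    return dict(result)


if __name__ == "__main__":
    print(batch_counts(CLICKS))
```

In the batch case the query sees the complete input, so each user appears exactly once in the result.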

SQL – Static Data (Streaming Case)
Input table "clicks" (rows arrive one at a time):
user  cTime     url
Mary  12:00:00  https://…
Bob   12:00:00  https://…
Mary  12:00:02  https://…
Liz   12:00:03  https://…
Query:
SELECT
  user,
  COUNT(url) as cnt
FROM clicks
GROUP BY user
Result (continuously updated):
user  cnt
Mary  1, then updated to 2
Bob   1
Liz   1
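The streaming case can be sketched as an incremental aggregation: the same query is evaluated row by row, and every arriving row emits an updated count for its user. This is plain Python rather than Flink, and the name `streaming_counts` is hypothetical; it only illustrates why Mary first shows up with 1 and is later updated to 2.

```python
from collections import defaultdict

# Rows from the slide, in arrival order; the elided URLs are kept as placeholders.
CLICKS = [
    ("Mary", "12:00:00", "https://…"),
    ("Bob", "12:00:00", "https://…"),
    ("Mary", "12:00:02", "https://…"),
    ("Liz", "12:00:03", "https://…"),
]


def streaming_counts(rows):
    """Yield (user, cnt) every time an arriving row changes a user's count."""
    counts = defaultdict(int)
    for user, _ctime, url in rows:
        if url is not None:  # COUNT(url) skips NULL values
            counts[user] += 1
            yield user, counts[user]


if __name__ == "__main__":
    updates = list(streaming_counts(CLICKS))
    print(updates)        # one update per arriving row
    print(dict(updates))  # the latest update per user is the current result
```

Keeping only the latest update per user yields the same table as the batch query, which is the sense in which the two cases share one SQL semantics.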

Demo Time

Stream to Batches and Back Again

ye olde stuff | new events
Timeline: 2020-4-1 12:00 am | 2016-4-1 1:00 am | 2016-4-1 2:00 am | 2016-4-7 11:00pm | 2016-4-7 10:00pm | …
Ingest

Timeline: 2020-4-1 12:00 am | 2016-4-1 1:00 am | 2016-4-1 2:00 am | 2016-4-7 11:00pm | 2016-4-7 10:00pm | …
Real-time Queries
Historical Queries
Ingest

Why combine data in Kafka and in Object Stores?
• Why not just longer Kafka retention?
• Sample numbers from a petabyte-scale Flink + Kafka user
– Compressed columnar data is on average 5x smaller
– Object Store is 3-4x cheaper than persistent volumes (EBS)
– Object Store is already replicated (cross AZ), which replaces broker replication (3x)
-> ~50x cheaper data storage
• Faster access to compressed columnar data
– Higher read parallelism
– Read optimizations: predicate pushdowns, projection pushdowns, partition pruning, etc.
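The "~50x" figure appears to follow from multiplying the three sample factors on this slide. A small sketch of that arithmetic, assuming the factors are independent and simply multiply (the slide does not spell this out, and the function name is hypothetical):

```python
def storage_cost_factor(compression, price_ratio, replication):
    """Combined saving, assuming the three factors multiply independently."""
    return compression * price_ratio * replication


# Factors from the slide: 5x smaller, 3-4x cheaper, 3x broker replication avoided.
low = storage_cost_factor(compression=5, price_ratio=3, replication=3)
high = storage_cost_factor(compression=5, price_ratio=4, replication=3)

if __name__ == "__main__":
    print(f"between {low}x and {high}x cheaper")
```

Under that assumption the range is 45x to 60x, which brackets the "~50x" quoted on the slide.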

Timeline: 2020-4-1 12:00 am | 2016-4-1 1:00 am | 2016-4-1 2:00 am | 2016-4-7 11:00pm | 2016-4-7 10:00pm | …
Ingest
while we are at it… normalize our data on the way and convert currencies.

Temporal Join
Rates:
Symbol  Exchange Rate  Timestamp
EUR     1.0            12:00
USD     0.88           12:01
USD     0.90           12:03
GBP     1.17           12:04
USD     0.89           12:12
GBP     1.20           12:13
USD     0.87           12:17
Orders:
Order         Timestamp  Currency
A @ 17.00 $   12:05      USD
B @ 12.32 £   12:11      GBP
C @ 111.51 $  12:01      USD
D @ 2.39 $    12:02      USD
E @ 17.11 $   12:20      USD
F @ 243.50 £  12:15      GBP
G @ 3.49 $    12:10      USD
H @ 0.99 £    12:16      GBP

Temporal Join
Orders:
Order         Timestamp  Currency
A @ 17.00 $   12:05      USD
B @ 12.32 £   12:11      GBP
C @ 111.51 $  12:01      USD
D @ 2.39 $    12:02      USD
E @ 17.11 $   12:20      USD
F @ 243.50 £  12:15      GBP
G @ 3.49 $    12:07      USD
H @ 0.99 £    12:16      GBP
Rates @ 12:07h:
Symbol  Exchange Rate  Timestamp
EUR     1.0            12:00
GBP     1.17           12:04
USD     0.90           12:03
Rates @ 12:15h:
Symbol  Exchange Rate  Timestamp
EUR     1.0            12:00
GBP     1.20           12:13
USD     0.89           12:12
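The lookup behind a temporal join can be sketched as: for each order, pick the latest exchange rate of its currency whose timestamp is not after the order's timestamp. The snippet below is a plain-Python illustration with hypothetical helper names (`build_history`, `rate_as_of`, `temporal_join`), not Flink's implementation. It uses the rates from the slides and zero-padded "HH:MM" strings, which compare correctly as text.

```python
from bisect import bisect_right

# (timestamp, symbol, rate) as listed on the slide.
RATES = [
    ("12:00", "EUR", 1.0),
    ("12:01", "USD", 0.88),
    ("12:03", "USD", 0.90),
    ("12:04", "GBP", 1.17),
    ("12:12", "USD", 0.89),
    ("12:13", "GBP", 1.20),
    ("12:17", "USD", 0.87),
]

# (order id, amount, timestamp, currency); a subset of the slide's orders.
ORDERS = [
    ("A", 17.00, "12:05", "USD"),
    ("B", 12.32, "12:11", "GBP"),
    ("F", 243.50, "12:15", "GBP"),
]


def build_history(rates):
    """Per symbol, a time-ordered list of (timestamp, rate) versions."""
    history = {}
    for ts, symbol, rate in sorted(rates):
        history.setdefault(symbol, []).append((ts, rate))
    return history


def rate_as_of(history, symbol, ts):
    """Latest rate for `symbol` with timestamp <= ts, or None if there is none."""
    versions = history.get(symbol, [])
    idx = bisect_right(versions, (ts, float("inf")))
    return versions[idx - 1][1] if idx else None


def temporal_join(orders, rates):
    """Attach to each order the rate that was valid at the order's timestamp."""
    history = build_history(rates)
    return [
        (oid, amount, ts, cur, rate_as_of(history, cur, ts))
        for oid, amount, ts, cur in orders
    ]


if __name__ == "__main__":
    for row in temporal_join(ORDERS, RATES):
        print(row)
```

Calling `rate_as_of` for every symbol at 12:07 and at 12:15 reproduces the two versions of the rate table shown on the slide.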

Timeline: 2020-4-1 12:00 am | 2016-4-1 1:00 am | 2016-4-1 2:00 am | 2016-4-7 11:00pm | 2016-4-7 10:00pm | …
Bootstrap Streaming Queries
Ingest

Event Time in Kafka Topic
Event time
Data (orders / currency rates)
Temporal Join
Time bucketing on file system
Time partition pruning for queries
Stitching together Stream again from files and stream tail
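The three file-related ideas on this slide (time bucketing, time partition pruning, and stitching files together with the stream tail) can be sketched in a few lines of plain Python. Everything here is a hypothetical illustration: the function names, the hourly bucket granularity, and the zero-padded "YYYY-MM-DD HH:MM" timestamp strings are assumptions, not the setup used in the talk.

```python
from collections import defaultdict
from itertools import chain


def bucket_by_hour(events):
    """Group (timestamp, payload) events into buckets keyed by 'YYYY-MM-DD HH'."""
    buckets = defaultdict(list)
    for ts, payload in events:
        buckets[ts[:13]].append((ts, payload))
    return dict(buckets)


def prune(buckets, start_hour, end_hour):
    """Partition pruning: keep only buckets whose key is in [start_hour, end_hour]."""
    return {k: v for k, v in buckets.items() if start_hour <= k <= end_hour}


def stitched_stream(buckets, tail, switch_ts):
    """Replay bucketed files in time order up to switch_ts, then follow the tail.

    Events before switch_ts come only from the files and events at or after it
    come only from the tail, so overlapping events are not emitted twice.
    """
    historical = (
        e for k in sorted(buckets) for e in sorted(buckets[k]) if e[0] < switch_ts
    )
    live = (e for e in tail if e[0] >= switch_ts)
    return chain(historical, live)


if __name__ == "__main__":
    files = bucket_by_hour([
        ("2016-04-01 01:15", "a"),
        ("2016-04-01 02:05", "c"),
        ("2016-04-07 22:30", "d"),
    ])
    tail = [("2016-04-07 22:30", "d"), ("2016-04-07 23:10", "e")]
    print(sorted(prune(files, "2016-04-01 01", "2016-04-01 02")))
    print(list(stitched_stream(files, tail, "2016-04-07 23:00")))
```

The switch timestamp plays the role of the hand-over point between data-at-rest and the live log. Choosing it in event time is what lets both sides be cut at the same position.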

Everything is a Stream
Bounded & Unbounded
Batch Processing complements Stream Processing
(Compressed Columnar) File Storage complements Log Storage
Make the most of your (event) time
SQL generalizes well across batch & streaming
A unified engine makes things easier
