Improvements for large state in Apache Flink
Stefan Richter (@stefanrrichter)
April 11, 2017

State in Streaming Programs

case class Event(producer: String, evtType: Int, msg: String)
case class Alert(msg: String, count: Long)

env.addSource(…)
  .map(bytes => Event.parse(bytes))
  .keyBy("producer")
  .mapWithState { (event: Event, state: Option[Int]) =>
    // pattern rules
  }
  .filter(alert => alert.msg.contains("CRITICAL"))
  .keyBy("msg")
  .timeWindow(Time.seconds(10))
  .sum("count")

Dataflow: Source → map() → keyBy → mapWithState() → filter() → keyBy → window() / sum()

State in Streaming Programs: stateless vs. stateful operators

In the pipeline above:
- Stateless: map(), filter()
- Stateful: mapWithState(), window() / sum()
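
To make the stateful step concrete, here is a minimal sketch of what a keyed mapWithState does: the function sees the event plus the previous state for that event's key and returns an output plus the next state. This is a plain-Scala stand-in, not Flink's implementation; the helper mapWithState, the rule countRule, and the "flag the third event" threshold are assumptions made for illustration.

```scala
// Minimal stand-in for keyed state; not Flink's implementation.
object KeyedStateSketch {
  final case class Event(producer: String, evtType: Int, msg: String)
  final case class Alert(msg: String, count: Long)

  // Runs a stateful map over events, keeping one optional state value per key.
  def mapWithState[K, S, R](events: Seq[Event], keyOf: Event => K)(
      f: (Event, Option[S]) => (R, Option[S])): Seq[R] = {
    val state = scala.collection.mutable.Map.empty[K, S]
    events.map { e =>
      val key = keyOf(e)
      val (out, next) = f(e, state.get(key))
      next match {
        case Some(s) => state.update(key, s) // keep the new state for this key
        case None    => state.remove(key)    // returning None clears the state
      }
      out
    }
  }

  // Hypothetical pattern rule: count events per producer, flag from the third one on.
  def countRule(event: Event, seen: Option[Int]): (Alert, Option[Int]) = {
    val n = seen.getOrElse(0) + 1
    val level = if (n >= 3) "CRITICAL" else "INFO"
    (Alert(s"$level ${event.producer}", n.toLong), Some(n))
  }
}
```

The per-key map in this sketch is the "state" that a keyed state backend has to store and checkpoint.
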

Internal vs External State

External State
• State in a separate data store
• Can scale "state capacity" independently
• Usually much slower than internal state
• Hard to get "exactly-once" guarantees

Internal State
• State in the stream processor
• Faster than external state
• Working area local to computation
• Checkpoints to stable store (DFS)
• Always exactly-once consistent
• Stream processor has to handle scalability

Keyed State Backends

HeapKeyedStateBackend
- State lives in memory, on Java heap
- Operates on objects
- Think of a hash map {key obj -> state obj}
- Async snapshots supported

RocksDBKeyedStateBackend
- State lives in off-heap memory and on disk
- Operates on bytes, uses serialization
- Think of K/V store {key bytes -> state bytes}
- Log-structured-merge (LSM) tree
- Async snapshots
- Incremental snapshots
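
The two mental models can be sketched side by side. HeapStore keeps object references in a hash map, while BytesStore serializes keys and values on every access and keeps them sorted by key bytes. Both classes are hypothetical illustrations in plain Scala, not the Flink backends; the hex-string key encoding is only a convenient way to get bytewise ordering from a sorted map.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets.UTF_8
import scala.collection.mutable

object BackendSketch {
  // Heap-style: a hash map from key objects to state objects, no serialization.
  final class HeapStore[K, V] {
    private val table = mutable.HashMap.empty[K, V]
    def put(key: K, value: V): Unit = table.update(key, value)
    def get(key: K): Option[V] = table.get(key)
  }

  // RocksDB-style: every access goes through bytes.
  final class BytesStore {
    // Keys are stored as hex strings so the sorted map orders them bytewise.
    private val table = mutable.TreeMap.empty[String, Array[Byte]]

    private def hex(bytes: Array[Byte]): String =
      bytes.map(b => f"${b & 0xff}%02x").mkString

    def put(key: String, value: Long): Unit =
      table.update(hex(key.getBytes(UTF_8)), ByteBuffer.allocate(8).putLong(value).array())

    def get(key: String): Option[Long] =
      table.get(hex(key.getBytes(UTF_8))).map(bytes => ByteBuffer.wrap(bytes).getLong)
  }
}
```

The heap store hands back the very object that was stored; the bytes store hands back a fresh copy decoded from bytes, which is the serialization cost the slide refers to.
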

Asynchronous Checkpoints

Synchronous Checkpointing

[Diagram: three timelines. Job Manager: the Checkpoint Coordinator thread triggers the checkpoint and later receives the acknowledgement. Task Manager: the event processing thread runs its processElement loop, and the checkpointing thread writes state to DFS.]

Why is async checkpointing so essential for large state?

Problem: All event processing is on hold while the state is written to DFS, to avoid concurrent modifications to the state that is being written.

Asynchronous Checkpointing

[Diagram: the same three timelines. The event processing thread snapshots the state and continues its processElement loop; the checkpointing thread writes the state to DFS in parallel, and the checkpoint is then acknowledged.]

Problem: How to deal with concurrent modifications?
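
One generic answer to that question is to make the snapshot an immutable version of the state, so the writer thread can never observe later updates. The sketch below shows that idea with Scala's immutable Map; it is an illustration of the principle under that assumption, not a description of how Flink's backends do it. The Operator class and its write callback are hypothetical.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object AsyncSnapshotSketch {
  final class Operator {
    // Only the processing thread writes this field; each update creates a new immutable map.
    @volatile private var state: Map[String, Long] = Map.empty

    // Event processing: count elements per key.
    def processElement(key: String): Unit =
      state = state.updated(key, state.getOrElse(key, 0L) + 1)

    // Synchronous part: grab a reference to the current immutable version (cheap).
    // Asynchronous part: hand that version to `write` on another thread.
    def checkpoint(write: Map[String, Long] => Unit)(implicit ec: ExecutionContext): Future[Unit] = {
      val snapshot = state
      Future(write(snapshot))
    }

    def current: Map[String, Long] = state
  }
}
```

Elements processed after checkpoint() returns change the operator's current state but not the version being written.
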

Incremental Checkpoints

What we will discuss
▪ What are incremental checkpoints?
▪ Why is RocksDB so well suited for this?
▪ How do we integrate this with Flink’s checkpointing?

Full Checkpointing

State over time (K = key, S = state):

  at Checkpoint 1    at Checkpoint 2    at Checkpoint 3
  K  S               K  S               K  S
  2  B               2  B               2  Q
  4  W               3  K               3  K
  6  N               4  L               6  N
                     6  N               9  S

Each full checkpoint (Checkpoint 1, 2, 3) contains a complete copy of the state at that time.

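Incremental Checkpointing

The state at the three points in time is the same as for full checkpointing. Each incremental checkpoint contains only the changes since the previous one ("-" marks a deleted key):

  iCheckpoint 1      iCheckpoint 2      iCheckpoint 3
  Δ(-,c1)            Δ(c1,c2)           Δ(c2,c3)
  K  S               K  S               K  S
  2  B               3  K               2  Q
  4  W               4  L               4  -
  6  N                                  9  S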
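
The increments on this slide can be reproduced with a small diff function. This is a sketch on plain Scala maps using the slide's example values; the Delta type and the use of None as the deletion marker are assumptions for illustration, not Flink's format.

```scala
object DeltaSketch {
  // None marks a deleted key (the "-" entry on the slide).
  type Delta = Map[Int, Option[String]]

  // Computes what changed between two versions of the state.
  def delta(prev: Map[Int, String], curr: Map[Int, String]): Delta = {
    // New or changed keys carry their current value.
    val upserts = curr.collect { case (k, v) if !prev.get(k).contains(v) => k -> Option(v) }
    // Keys that disappeared are recorded as deletions.
    val deletes = (prev.keySet -- curr.keySet).map(k => k -> Option.empty[String])
    upserts ++ deletes
  }

  // State at the three checkpoints, taken from the slide.
  val c1: Map[Int, String] = Map(2 -> "B", 4 -> "W", 6 -> "N")
  val c2: Map[Int, String] = Map(2 -> "B", 3 -> "K", 4 -> "L", 6 -> "N")
  val c3: Map[Int, String] = Map(2 -> "Q", 3 -> "K", 6 -> "N", 9 -> "S")
}
```
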

Incremental Recovery

Recovery? Combine the increments in order: Δ(-,c1) + Δ(c1,c2) + Δ(c2,c3)
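
Read literally, the "+" on this slide means replaying the increments from oldest to newest. A minimal sketch on plain maps, with None again standing for the "-" deletion marker (an assumed encoding):

```scala
object RecoverySketch {
  // None marks a deleted key (the "-" entry on the slide).
  type Delta = Map[Int, Option[String]]

  // Applies one increment on top of an existing state.
  def applyDelta(state: Map[Int, String], d: Delta): Map[Int, String] =
    d.foldLeft(state) {
      case (acc, (k, Some(v))) => acc.updated(k, v) // insert or overwrite
      case (acc, (k, None))    => acc - k           // deletion
    }

  // Recovery: start from empty state and replay all increments in order.
  def recover(deltas: Seq[Delta]): Map[Int, String] =
    deltas.foldLeft(Map.empty[Int, String])(applyDelta)
}
```

The cost of this naive replay grows with the number of increments, which is why consolidating increments matters for recovery.
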

RocksDB Architecture (simplified)

Memtable (in memory)
- All writes go against Memtable
- Mutable buffer (couple MB)
- Unique keys
- Periodic flush to storage creates a new SSTable

SSTables (in storage: …, SSTable-6, SSTable-7)
- Layout: index + key_1 val_1 key_2 val_2 …, sorted by key
- Immutable
- Periodic merge
- We can consider newly created SSTables as Δs!

Reads consider Memtable first, then SSTables.
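
The write and read paths described here fit in a few lines. This is a toy model in plain Scala, not RocksDB: the Store class is hypothetical, and it flushes when the memtable reaches a fixed number of entries, whereas the slide only says that flushes happen periodically.

```scala
import scala.collection.immutable.SortedMap

object LsmSketch {
  final class Store(memtableLimit: Int) {
    // Mutable buffer with unique keys: a newer write replaces the older value.
    private var memtable = SortedMap.empty[String, String]
    // Newest SSTable first; each one is an immutable map sorted by key.
    private var sstables = List.empty[SortedMap[String, String]]

    // All writes go against the memtable.
    def put(key: String, value: String): Unit = {
      memtable = memtable.updated(key, value)
      if (memtable.size >= memtableLimit) flush() // assumption: size-based trigger
    }

    // A flush turns the memtable into a new immutable SSTable.
    def flush(): Unit = if (memtable.nonEmpty) {
      sstables = memtable :: sstables
      memtable = SortedMap.empty[String, String]
    }

    // Reads consider the memtable first, then SSTables from newest to oldest.
    def get(key: String): Option[String] =
      memtable.get(key).orElse(sstables.collectFirst { case t if t.contains(key) => t(key) })

    def sstableCount: Int = sstables.size
  }
}
```

Every flush adds exactly one new immutable table, which is why a newly created SSTable can serve as an increment.
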

RocksDB Compaction
▪ Background thread merges SSTable files
▪ Removes copies of the same key (latest version survives)
▪ Actual deletion of keys

Example (merge):
  SSTable-1: 2 C, 7 N, 9 Q
  SSTable-2: 1 V, 7 -, 9 S
  → SSTable-3: 1 V, 2 C, 9 S

Compaction consolidates our Δs!
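
The merge in this example can be sketched as follows, with None standing for the "-" tombstone. This is a simplified model, not RocksDB's compaction: it merges all given tables at once, which is the only reason it can drop tombstones unconditionally.

```scala
import scala.collection.immutable.SortedMap

object CompactionSketch {
  // None is a tombstone (the "-" entry on the slide).
  type SSTable = SortedMap[Int, Option[String]]

  // Merges tables given oldest first, so entries from later tables win.
  def compact(oldestFirst: Seq[SSTable]): SSTable = {
    val merged = oldestFirst.foldLeft(SortedMap.empty[Int, Option[String]])(_ ++ _)
    // Tombstones have done their job once all older versions are merged away.
    merged.filter { case (_, v) => v.isDefined }
  }
}
```

With the slide's SSTable-1 and SSTable-2 as input, the result holds the keys 1, 2 and 9 with the values V, C and S.
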

Flink’s Incremental Checkpointing

[Diagram components: Checkpoint Coordinator, SharedStateRegistry, StatefulMap (1/3), StatefulMap (2/3), StatefulMap (3/3), Network, DFS]

Step 1:
Checkpoint Coordinator sends checkpoint barrier that triggers a snapshot on each instance.

Step 2:
Each instance writes its incremental snapshot (Δ1, Δ2, Δ3) to distributed storage.

Incremental Snapshot of Operator

[Diagram, one operator instance. Left: Local FS. Right: Distributed FS://sfmap/1/ with the SharedStateRegistry.]

- Local FS, directory data: manifest, 00225.sst, 00226.sst
- Local snapshot directory chk-1 is created from data by copy / hardlink: manifest, 00225.sst, 00226.sst
- sst.list: list of SSTables referenced by snapshot
- Async upload to DFS: 00225.sst and 00226.sst go to share; manifest and sst.list go to chk-1

Flink’s Incremental Checkpointing

Step 3:
Each instance acknowledges and sends a handle (e.g. file path in DFS) to the Checkpoint Coordinator. The handles H1, H2, H3 refer to Δ1, Δ2, Δ3 in the DFS.

SharedStateRegistry entries for the SSTables uploaded to Distributed FS://sfmap/1/:
{00225.sst = 1}
{00226.sst = 1}

Step 4:
Checkpoint Coordinator signals CP1 success to all instances. CP 1 consists of the handles H1, H2, H3.
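
Steps 3 and 4 amount to collecting one handle per instance and declaring the checkpoint complete once none is missing. The sketch below models that bookkeeping; PendingCheckpoint is a hypothetical class for illustration, and handles are modeled as plain strings because the slides only describe them as "e.g. file path in DFS".

```scala
object CoordinatorSketch {
  // Tracks acknowledgements for one checkpoint; handles are modeled as DFS path strings.
  final class PendingCheckpoint(val id: Long, expected: Set[String]) {
    private var handles = Map.empty[String, String]

    // Step 3: an instance acknowledges and reports its handle.
    // Returns true once every expected instance has acknowledged.
    def acknowledge(instance: String, handle: String): Boolean = {
      require(expected.contains(instance), s"unknown instance $instance")
      handles = handles.updated(instance, handle)
      isComplete
    }

    def isComplete: Boolean = handles.keySet == expected

    // Step 4 may only happen when this is defined: the completed checkpoint's handles.
    def completed: Option[Map[String, String]] = if (isComplete) Some(handles) else None
  }
}
```
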

What happens when another CP is triggered?

Incremental Snapshot of Operator

- Local FS, directory data now contains: manifest, 00226.sst, 00228.sst, 00229.sst
- Local snapshot directory chk-2: manifest, 00226.sst, 00228.sst, 00229.sst
- Distributed FS://sfmap/1/ before the upload: share holds 00225.sst and 00226.sst; chk-1 holds manifest and sst.list
- Upload missing SSTable files: 00228.sst and 00229.sst go to share; manifest and sst.list go to chk-2

SharedStateRegistry after chk-2:
{00225.sst = 1}
{00226.sst = 2}
{00228.sst = 1}
{00229.sst = 1}
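
The reference counts and the "upload missing SSTable files" step can be modeled with a small registry. This is a sketch of the counting scheme shown on the slides; the class below only borrows the SharedStateRegistry name, and its register method is a hypothetical API, not Flink's.

```scala
object RegistrySketch {
  final class SharedStateRegistry {
    private var refs = Map.empty[String, Int]

    // Registers the sst.list of a new snapshot.
    // Returns the files the registry did not know yet, i.e. the ones that must be uploaded.
    def register(sstList: Set[String]): Set[String] = {
      val missing = sstList -- refs.keySet
      // Every referenced file gets one more reference, whether new or already shared.
      refs = sstList.foldLeft(refs)((m, f) => m.updated(f, m.getOrElse(f, 0) + 1))
      missing
    }

    def counts: Map[String, Int] = refs
  }
}
```

Registering chk-1 and then chk-2 with the file lists from the slides yields the counts shown above, and only 00228.sst and 00229.sst are reported as missing for chk-2.
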

Deleting Incremental Checkpoints

Deleting an outdated checkpoint: the Checkpoint Coordinator holds CP 1 and CP 2, each with its handles H1, H2, H3. CP 1 is outdated and gets deleted.

Deleting Incremental Snapshot

SharedStateRegistry before deleting chk-1:
{00225.sst = 1}
{00226.sst = 2}
{00228.sst = 1}
{00229.sst = 1}

After chk-1 (manifest, sst.list) is removed from Distributed FS://sfmap/1/, the counts of the SSTables it referenced are decremented:
{00225.sst = 0}
{00226.sst = 1}
{00228.sst = 1}
{00229.sst = 1}

00225.sst is no longer referenced and is deleted from share:
{00226.sst = 1}
{00228.sst = 1}
{00229.sst = 1}
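
Deletion is the mirror image of registration: decrement the count of every file the checkpoint referenced and delete files whose count reaches zero. A minimal sketch on an immutable map of counts; the release function is hypothetical and only mirrors the counting shown on the slides.

```scala
object ReleaseSketch {
  // Decrements the count of every file the deleted checkpoint referenced.
  // Returns the remaining counts and the files that are now safe to delete.
  def release(refs: Map[String, Int], sstList: Set[String]): (Map[String, Int], Set[String]) = {
    val decremented = refs.map { case (f, n) => f -> (if (sstList(f)) n - 1 else n) }
    // Files with no remaining references leave the registry.
    val (dead, live) = decremented.partition { case (_, n) => n <= 0 }
    (live, dead.keySet)
  }
}
```

Files that a newer checkpoint still references, such as 00226.sst here, survive the deletion of the older checkpoint.
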

Wrapping up

Incremental checkpointing benefits
▪ Incremental checkpoints can dramatically reduce CP overhead for large state.
▪ Incremental checkpoints are async.
▪ RocksDB’s compaction consolidates the increments, which keeps overhead low for recovery.

Incremental checkpointing limitations
▪ Breaks the unification of checkpoints and savepoints (CP: low overhead, SP: features).
▪ RocksDB-specific format.
▪ Currently no support for rescaling from an incremental checkpoint.

Further improvements in Flink 1.3/1.4
▪ AsyncHeapKeyedStateBackend (merged)
▪ AsyncHeapOperatorStateBackend (PR)
▪ MapState (merged)
▪ RocksDBInternalTimerService (PR)
▪ AsyncHeapInternalTimerService

Questions?
Flink Forward SF 2017: Stefan Richter - Improvements for large state and recovery in Flink

  • 1. 1 Stefan Richter
 @stefanrrichter
 
 April 11, 2017 Improvements for large state in Apache Flink
  • 2. State in Streaming Programs 2 case class Event(producer: String, evtType: Int, msg: String) case class Alert(msg: String, count: Long) env.addSource(…)
 .map(bytes => Event.parse(bytes) ) .keyBy("producer") .mapWithState { (event: Event, state: Option[Int]) => { // pattern rules } .filter(alert => alert.msg.contains("CRITICAL")) .keyBy("msg") .timeWindow(Time.seconds(10)) .sum("count") Source map() mapWith
 State() filter() window()
 sum()keyBy keyBy
  • 3. State in Streaming Programs 3 case class Event(producer: String, evtType: Int, msg: String) case class Alert(msg: String, count: Long) env.addSource(…)
 .map(bytes => Event.parse(bytes) ) .keyBy("producer") .mapWithState { (event: Event, state: Option[Int]) => { // pattern rules } .filter(alert => alert.msg.contains("CRITICAL")) .keyBy("msg") .timeWindow(Time.seconds(10)) .sum("count") Source map() mapWith
 State() filter() window()
 sum()keyBy keyBy Stateless Stateful
  • 4. Internal vs External State 4 External State Internal State • State in a separate data store • Can store "state capacity" independent • Usually much slower than internal state • Hard to get "exactly-once" guarantees • State in the stream processor • Faster than external state • Working area local to computation • Checkpoints to stable store (DFS) • Always exactly-once consistent • Stream processor has to handle scalability
  • 5. Keyed State Backends 5 HeapKeyedStateBackend RocksDBKeyedStateBackend -State lives in memory, on Java heap -Operates on objects -Think of a hash map {key obj -> state obj} -Async snapshots supported -State lives in off-heap memory and on disk -Operates on bytes, uses serialization -Think of K/V store {key bytes -> state bytes} -Log-structured-merge (LSM) tree -Async snapshots -Incremental snapshots
  • 7. Synchronous Checkpointing 7 Checkpoint Coordinator thread Event processing thread Checkpointing thread (loop: processElement) (trigger checkpoint) (acknowledge checkpoint) (write state to DFS) Task Manager Job Manager Why is async checkpointing so essential for large state?
  • 8. Synchronous Checkpointing 8 Checkpoint Coordinator thread Event processing thread Checkpointing thread (loop: processElement) (trigger checkpoint) (acknowledge checkpoint) (write state to DFS) Task Manager Job Manager Problem: All event processing is on hold here to avoid concurrent modifications to the state that is written
  • 9. Asynchronous Checkpointing 9 Checkpoint Coordinator thread Event processing thread Checkpointing thread (loop: processElement) (snapshot state) (trigger checkpoint) (acknowledge checkpoint)(write state to DFS) Task Manager Job Manager
  • 10. Asynchronous Checkpointing 10 Checkpoint Coordinator thread Event processing thread Checkpointing thread (loop: processElement) (snapshot state) (trigger checkpoint) (acknowledge checkpoint)(write state to DFS) Task Manager Job Manager Problem: How to deal with concurrent modifications?
  • 12. What we will discuss ▪ What are incremental checkpoints? ▪ Why is RocksDB so well suited for this? ▪ How do we integrate this with Flink’s checkpointing? 12 Driven by and
  • 13. Full Checkpointing 13 K S 2 B 4 W 6 N K S 2 B 3 K 4 L 6 N K S 2 Q 3 K 6 N 9 S K S 2 B 4 W 6 N K S 2 B 3 K 4 L 6 N K S 2 Q 3 K 6 N 9 S Checkpoint 1 Checkpoint 2 Checkpoint 3 time
  • 14. Incremental Checkpointing 14 K S 2 B 4 W 6 N K S 2 B 3 K 4 L 6 N K S 2 Q 3 K 6 N 9 S K S 2 B 4 W 6 N K S 3 K 4 L K S 2 Q 4 - 9 S iCheckpoint 1 iCheckpoint 2 iCheckpoint 3 Δ(-,c1) Δ(c1,c2) Δ(c2,c3) time
  • 15. Incremental Recovery 15 K S 2 B 4 W 6 N K S 2 B 3 K 4 L 6 N K S 2 Q 3 K 6 N 9 S K S 2 B 4 W 6 N K S 3 K 4 L K S 2 Q 4 - 9 S iCheckpoint 1 iCheckpoint 2 iCheckpoint 3 Δ(-,c1) Δ(c1,c2) Δ(c2,c3) time Recovery? + +
  • 16. RocksDB Architecture (simplified) 16 Memtable SSTable-7 SSTable-6 Memory … Storage key_1 val_1 key_2 val_2 …Index + sorted by key - All writes go against Memtable - Mutable Buffer (couple MB) - Unique keys - Reads consider Memtable first, then SSTables - Immutable - We can consider newly created SSTables as Δs! periodic flush periodic merge
  • 17. RocksDB Compaction ▪ Background Thread merges SSTable files ▪ Removes copies of the same key (latest version survives) ▪ Actually deletion of keys 17 2 C 7 N 9 Q 1 V 7 - 9 S 1 V 2 C 9 S SSTable-1 SSTable-2 SSTable-3 merge Compaction consolidates our Δs!
  • 19. Flink’s Incremental Checkpointing 19 Checkpoint Coordinator StatefulMap (1/3) StatefulMap (2/3) StatefulMap (3/3) DFS Network Step 1: Checkpoint Coordinator sends checkpoint barrier that triggers a snapshot on each instance SharedStateRegistry
  • 22. Incremental Snapshot of Operator 22 data manifest 01010101 00110011 10101010 11001100 01010101 00225.sst share Local FS 01010101 00110011 10101010 11001100 01010101 00226.sst SharedState Registry Distributed FS://sfmap/1/
  • 23. Incremental Snapshot of Operator 23 data chk-1 manifest 01010101 00110011 10101010 11001100 01010101 00225.sst manifest 01010101 00110011 10101010 11001100 01010101 00225.sst 00226.sst 01010101 00110011 10101010 11001100 01010101 ++ share Local FS 01010101 00110011 10101010 11001100 01010101 00226.sst SharedState Registry copy / hardlink Distributed FS://sfmap/1/
  • 24. Incremental Snapshot of Operator 24 data chk-1 manifest 01010101 00110011 10101010 11001100 01010101 00225.sst manifest 01010101 00110011 10101010 11001100 01010101 00225.sst 00226.sst 01010101 00110011 10101010 11001100 01010101 ++ share 01010101 00110011 10101010 11001100 01010101 00226.sst 01010101 00110011 10101010 11001100 01010101 00225.sst manifest sst.list Local FS SharedState Registry 01010101 00110011 10101010 11001100 01010101 00226.sst chk-1 Distributed FS://sfmap/1/ List of SSTables referenced by snapshot async upload to DFS
  • 25. Flink’s Incremental Checkpointing 25 StatefulMap (1/3) Checkpoint Coordinator StatefulMap (2/3) StatefulMap (3/3) DFS H1 H2 H3Network Step 3: Each instance acknowledges and sends a handle (e.g. file path in DFS) to the Checkpoint Coordinator. SharedStateRegistry Δ3 Δ2 Δ1
  • 26. Incremental Snapshot of Operator 26 data chk-1 manifest 01010101 00110011 10101010 11001100 01010101 00225.sst manifest 01010101 00110011 10101010 11001100 01010101 00225.sst 00226.sst 01010101 00110011 10101010 11001100 01010101 ++ share 01010101 00110011 10101010 11001100 01010101 00226.sst 01010101 00110011 10101010 11001100 01010101 00225.sst manifest sst.list Local FS SharedState Registry 01010101 00110011 10101010 11001100 01010101 00226.sst chk-1 {00225.sst = 1} {00226.sst = 1} Distributed FS://sfmap/1/
  • 27. Flink’s Incremental Checkpointing 27 StatefulMap (1/3) Checkpoint Coordinator StatefulMap (2/3) StatefulMap (3/3) DFS H1 H3 CP 1 H2 Network SharedStateRegistry Δ3 Δ2 Δ1Step 4: Checkpoint Coordinator signals CP1 success to all instances.
  • 29. Incremental Snapshot of Operator 29 data chk-1 manifest 01010101 00110011 10101010 11001100 01010101 00226.sst 00228.sst 00229.sst 01010101 00110011 10101010 11001100 01010101 01010101 00110011 10101010 11001100 01010101 share 01010101 00110011 10101010 11001100 01010101 00226.sst 01010101 00110011 10101010 11001100 01010101 00225.sst manifest sst.list Local FS {00225.sst = 1} {00226.sst = 1} SharedState Registry Distributed FS://sfmap/1/
  • 30. Incremental Snapshot of Operator. [Figure: Local FS holds data (manifest, 00226.sst, 00228.sst, 00229.sst) and the new snapshot chk-2 (manifest, 00226.sst, 00228.sst, 00229.sst). Distributed FS://sfmap/1/ holds share (00225.sst, 00226.sst) and chk-1 (manifest, sst.list). SharedState Registry: {00225.sst = 1}, {00226.sst = 1}.]
  • 31. Incremental Snapshot of Operator. Upload missing SSTable files. [Figure: Local FS holds data and chk-2 (each: manifest, 00226.sst, 00228.sst, 00229.sst). Distributed FS://sfmap/1/ holds share (00225.sst, 00226.sst, 00228.sst, 00229.sst), chk-1 (manifest, sst.list) and chk-2 (manifest, sst.list). SharedState Registry: {00225.sst = 1}, {00226.sst = 2}, {00228.sst = 1}, {00229.sst = 1}.]
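The reference counts on these slides can be reproduced with a small sketch. It assumes the registry is just a map from SSTable file name to the number of checkpoints that reference it; `SstRegistry` and `register` are hypothetical names for illustration and not Flink's actual SharedStateRegistry API.

```scala
import scala.collection.mutable

object IncrementalSnapshotSketch {
  /** Hypothetical stand-in for the registry on the slides:
    * SSTable file name -> number of checkpoints referencing it. */
  final class SstRegistry {
    private val refCounts = mutable.Map.empty[String, Int]

    /** Registers one checkpoint's sst.list and returns the files that were
      * not yet shared, i.e. the only ones that still need an upload. */
    def register(sstList: Set[String]): Set[String] = {
      val missing = sstList.filterNot(refCounts.contains)
      sstList.foreach(f => refCounts(f) = refCounts.getOrElse(f, 0) + 1)
      missing
    }

    def counts: Map[String, Int] = refCounts.toMap
  }

  def main(args: Array[String]): Unit = {
    val registry = new SstRegistry
    // chk-1 references 00225.sst and 00226.sst: both are uploaded.
    println(registry.register(Set("00225.sst", "00226.sst")))
    // chk-2 reuses 00226.sst: only 00228.sst and 00229.sst are uploaded.
    println(registry.register(Set("00226.sst", "00228.sst", "00229.sst")))
    println(registry.counts)
  }
}
```

After both registrations the counts match the figure: 00226.sst is referenced twice, every other file once.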
  • 32. Deleting Incremental Checkpoints. Deleting an outdated checkpoint. [Figure: StatefulMap (1/3), (2/3), (3/3); DFS holds increments Δ1, Δ2, Δ3 for each of two checkpoints; the Checkpoint Coordinator holds CP 1 and CP 2, each with handles H1, H2, H3, and the SharedStateRegistry.]
  • 33. Deleting Incremental Snapshot. [Figure: Local FS data holds manifest, 00226.sst, 00228.sst, 00229.sst. Distributed FS://sfmap/1/ holds share (00225.sst, 00226.sst, 00228.sst, 00229.sst), chk-1 (manifest, sst.list) and chk-2 (manifest, sst.list). SharedState Registry: {00225.sst = 1}, {00226.sst = 2}, {00228.sst = 1}, {00229.sst = 1}.]
  • 34. Deleting Incremental Snapshot. [Figure: chk-1 has been removed. Local FS data holds manifest, 00226.sst, 00228.sst, 00229.sst. Distributed FS://sfmap/1/ holds share (00225.sst, 00226.sst, 00228.sst, 00229.sst) and chk-2 (manifest, sst.list). SharedState Registry: {00225.sst = 0}, {00226.sst = 1}, {00228.sst = 1}, {00229.sst = 1}.]
  • 35. Deleting Incremental Snapshot. [Figure: 00225.sst has been removed from share and from the registry. Local FS data holds manifest, 00226.sst, 00228.sst, 00229.sst. Distributed FS://sfmap/1/ holds share (00226.sst, 00228.sst, 00229.sst) and chk-2 (manifest, sst.list). SharedState Registry: {00226.sst = 1}, {00228.sst = 1}, {00229.sst = 1}.]
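The deletion sequence on slides 33 to 35 can be sketched as a pure function over the reference counts: releasing a checkpoint decrements the count of every file in its sst.list, and files that reach zero can be removed from the shared directory. The names `release` and `DeleteSnapshotSketch` are hypothetical and chosen for this illustration; the real registry logic in Flink may differ.

```scala
object DeleteSnapshotSketch {
  /** Releases one checkpoint's references. Returns the remaining counts and
    * the shared files whose count dropped to zero (safe to delete from DFS). */
  def release(counts: Map[String, Int],
              sstList: Set[String]): (Map[String, Int], Set[String]) = {
    val decremented = counts.map { case (file, n) =>
      file -> (if (sstList.contains(file)) n - 1 else n)
    }
    val (dead, live) = decremented.partition { case (_, n) => n <= 0 }
    (live, dead.keySet)
  }

  def main(args: Array[String]): Unit = {
    // Registry state after chk-1 and chk-2, as on slide 33.
    val counts = Map("00225.sst" -> 1, "00226.sst" -> 2, "00228.sst" -> 1, "00229.sst" -> 1)
    // Deleting chk-1, whose sst.list names 00225.sst and 00226.sst.
    val (remaining, toDelete) = release(counts, Set("00225.sst", "00226.sst"))
    println(s"remaining = $remaining, delete from share = $toDelete")
  }
}
```

Only 00225.sst becomes unreferenced here; 00226.sst stays in the shared directory because chk-2 still points to it.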
  • 37. Incremental checkpointing benefits ▪ Incremental checkpoints can dramatically reduce CP overhead for large state. ▪ Incremental checkpoints are async. ▪ RocksDB’s compaction consolidates the increments, which keeps overhead low for recovery.
  • 38. Incremental checkpointing limitations ▪ Breaks the unification of checkpoints and savepoints (CP: low overhead, SP: features). ▪ RocksDB-specific format. ▪ Currently no support for rescaling from an incremental checkpoint.
  • 39. Further improvements in Flink 1.3/1.4 ▪ AsyncHeapKeyedStateBackend (merged) ▪ AsyncHeapOperatorStateBackend (PR) ▪ MapState (merged) ▪ RocksDBInternalTimerService (PR) ▪ AsyncHeapInternalTimerService