1
Stefan Richter
@StefanRRichter
September 13, 2017
A Look at Flink’s Internal Data Structures for Efficient Checkpointing
Flink State and Distributed Snapshots
2
[Diagram: a Source sends an Event to a Stateful Operation, which keeps its state in a local State Backend]
Flink State and Distributed Snapshots
„Asynchronous Barrier Snapshotting“
3
[Diagram: a checkpoint is triggered and a checkpoint barrier is injected at the Source, flowing towards the Stateful Operation]
Flink State and Distributed Snapshots
„Asynchronous Barrier Snapshotting“
4
[Diagram: the Stateful Operation takes a state snapshot; the state snapshot is triggered synchronously (e.g. copy-on-write)]
Flink State and Distributed Snapshots
5
[Diagram: the processing pipeline continues while full snapshots are durably persisted to DFS asynchronously]
For each state backend: one writer, n readers
Challenges for Data Structures
▪ Asynchronous Checkpoints:
▪ Minimize pipeline stall time while taking the snapshot.
▪ Keep overhead (memory, CPU, …) as low as possible while writing the snapshot.
▪ Support multiple parallel checkpoints.
6
Flink State and Distributed Snapshots
7
[Diagram: the processing pipeline continues while incremental snapshots, the Δ (differences from previous snapshots), are durably persisted to DFS asynchronously]
Full Snapshots
8
State Backend (Task-local), between Updates:
Checkpoint 1: {2 -> B, 4 -> W, 6 -> N}
Checkpoint 2: {2 -> B, 3 -> K, 4 -> L, 6 -> N}
Checkpoint 3: {2 -> Q, 3 -> K, 6 -> N, 9 -> S}
Checkpoint Directory (DFS): every checkpoint writes the complete key/state table again:
Checkpoint 1: {2 -> B, 4 -> W, 6 -> N}
Checkpoint 2: {2 -> B, 3 -> K, 4 -> L, 6 -> N}
Checkpoint 3: {2 -> Q, 3 -> K, 6 -> N, 9 -> S}
Incremental Snapshots
9
State Backend (Task-local), between Updates:
iCheckpoint 1: {2 -> B, 4 -> W, 6 -> N}
iCheckpoint 2: {2 -> B, 3 -> K, 4 -> L, 6 -> N}
iCheckpoint 3: {2 -> Q, 3 -> K, 6 -> N, 9 -> S}
Checkpoint Directory (DFS): each checkpoint only writes the delta to its predecessor:
Δ(-, icp1): {2 -> B, 4 -> W, 6 -> N}
Δ(icp1, icp2): {3 -> K, 4 -> L}
Δ(icp2, icp3): {2 -> Q, 4 -> - (delete), 9 -> S}
Incremental vs Full Snapshots
10
[Chart: Checkpoint Duration vs State Size (5 TB, 10 TB, 15 TB), comparing Incremental and Full snapshots]
Incremental Recovery
11
State Backend (Task-local):
iCheckpoint 1: {2 -> B, 4 -> W, 6 -> N}
iCheckpoint 2: {2 -> B, 3 -> K, 4 -> L, 6 -> N}
iCheckpoint 3: {2 -> Q, 3 -> K, 6 -> N, 9 -> S}
Checkpoint Directory (DFS):
Δ(-, icp1) + Δ(icp1, icp2) + Δ(icp2, icp3)
Recovery? Restoring the latest state means combining the deltas of the retained incremental checkpoints.
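To make the "+" above concrete, here is a minimal, purely illustrative sketch (a hypothetical helper, not Flink code) of replaying the retained deltas in order; a null value stands for a deletion marker such as "4 -":

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  static Map<Integer, String> restoreFromDeltas(List<Map<Integer, String>> deltasInOrder) {
      Map<Integer, String> state = new HashMap<>();
      for (Map<Integer, String> delta : deltasInOrder) {          // Δ(-, icp1), Δ(icp1, icp2), Δ(icp2, icp3)
          for (Map.Entry<Integer, String> change : delta.entrySet()) {
              if (change.getValue() == null) {
                  state.remove(change.getKey());                   // deletion marker, e.g. "4 -"
              } else {
                  state.put(change.getKey(), change.getValue());
              }
          }
      }
      return state;   // for the example above: {2 -> Q, 3 -> K, 6 -> N, 9 -> S}
  }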
Challenges for Data Structures
▪ Incremental checkpoints:
▪ Efficiently detect the (minimal) set of state changes between two checkpoints.
▪ Avoid unbounded checkpoint history.
12
Two Keyed State Backends
13
HeapKeyedStateBackend
• State lives in memory, on Java heap.
• Operates on objects.
• Think of a hash map {key obj -> state obj}.
• Goes through ser/de during snapshot/restore.
• Async snapshots supported.
RocksDBKeyedStateBackend
• State lives in off-heap memory and on disk.
• Operates on serialized bytes.
• Think of K/V store {key bytes -> state bytes}.
• Based on log-structured-merge (LSM) tree.
• Goes through ser/de for each state access.
• Async and incremental snapshots.
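For orientation, this is roughly how a job selects between the two backends. This is an API sketch from the Flink 1.3/1.4 era; the checkpoint URI is a placeholder and the exact constructors and exceptions should be checked against the documentation of your Flink version.

  import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
  import org.apache.flink.runtime.state.filesystem.FsStateBackend;
  import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

  StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

  // Heap keyed-state backend with asynchronous snapshots:
  env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints", true));

  // RocksDB keyed-state backend with incremental checkpoints enabled:
  env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));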
Asynchronous and Incremental Checkpoints
with the RocksDB Keyed-State Backend
14
What is RocksDB
▪ Ordered Key/Value store (bytes).
▪ Based on LSM trees.
▪ Embedded, written in C++, used via JNI.
▪ Write optimized (all bulk sequential), with acceptable read performance.
15
RocksDB Architecture (simplified)
16
[Diagram: Memory holds the Memtable with entries K:2 -> T, K:8 -> W; Local Disk is still empty]
Memtable
- Mutable Memory Buffer for K/V Pairs (bytes)
- All reads / writes go here first
- Aggregating (overrides same unique keys)
- Asynchronously flushed to disk when full
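As a toy illustration of the "aggregating" property (not RocksDB code; the class name, the string keys, and the flush threshold are made up for brevity):

  import java.util.TreeMap;

  class ToyMemtable {
      private static final int FLUSH_THRESHOLD = 4;                   // entries; purely illustrative
      private final TreeMap<String, String> buffer = new TreeMap<>(); // kept sorted by key

      void put(String key, String value) {
          buffer.put(key, value);              // same key: the newer value overrides the older one
          if (buffer.size() >= FLUSH_THRESHOLD) {
              flush();                         // the real memtable is flushed asynchronously when full
          }
      }

      String get(String key) {
          return buffer.get(key);              // reads check the memtable first
      }

      private void flush() {
          // Would write the sorted contents as an immutable SSTable file; here we just clear.
          buffer.clear();
      }
  }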
RocksDB Architecture (simplified)
17
[Diagram: Memory holds the Memtable with K:4 -> Q; Local Disk holds SSTable-(1) with K:2 -> T, K:8 -> W, an index Idx, and bloom filter BF-1]
- Flushed Memtable becomes immutable
- Sorted By Key (unique)
- Read: first check Memtable, then SSTable
- Bloomfilter & Index as optimisations
RocksDB Architecture (simplified)
18
[Diagram: Memory holds the Memtable with K:2 -> A, K:5 -> N; Local Disk holds SSTable-(1) with K:2 -> T, K:8 -> W, SSTable-(2) with K:4 -> Q, K:7 -> S, and SSTable-(3) with K:2 -> C, K:7 -> - (delete), each with an index and bloom filters BF-1, BF-2, BF-3]
- Deletes are explicit entries
- Natural support for snapshots
- Iteration: online merge
- SSTables can accumulate
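A sketch of the resulting read path (illustrative only; SSTable, the bloom filter, and the lookup helpers are assumed abstractions, and tombstone handling is simplified):

  // Check the memtable first, then SSTables from newest to oldest.
  // The bloom filter lets most files be skipped without touching disk.
  byte[] read(byte[] key) {
      byte[] value = memtable.get(key);
      if (value != null) {
          return isTombstone(value) ? null : value;
      }
      for (SSTable sst : sstablesNewestFirst) {
          if (!sst.bloomFilter().mightContain(key)) {
              continue;                            // definitely not in this file
          }
          byte[] candidate = sst.lookup(key);      // index block, then binary search in the data block
          if (candidate != null) {
              return isTombstone(candidate) ? null : candidate;
          }
      }
      return null;                                 // key not present
  }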
RocksDB Compaction (simplified)
„Log Compaction“
19
Multiway sorted merge of SSTable-(3) {K:2 -> C, K:7 -> - (delete)}, SSTable-(2) {K:4 -> Q, K:7 -> S} and SSTable-(1) {K:2 -> T, K:8 -> W}
into SSTable-(1, 2, 3) {K:2 -> C, K:4 -> Q, K:8 -> W}: the newest value per key wins, and deleted keys are removed (rm).
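The merge semantics can be sketched like this (illustrative Java, not RocksDB's streaming k-way merge; tombstones are modeled as null values and integer keys stand in for key bytes):

  import java.util.List;
  import java.util.NavigableMap;
  import java.util.SortedMap;
  import java.util.TreeMap;

  // Tables are passed newest-first; iterating oldest-to-newest lets newer values
  // override older ones, and delete markers (null) remove the key from the output.
  static NavigableMap<Integer, String> compact(List<SortedMap<Integer, String>> newestFirst) {
      NavigableMap<Integer, String> merged = new TreeMap<>();
      for (int i = newestFirst.size() - 1; i >= 0; i--) {
          merged.putAll(newestFirst.get(i));
      }
      merged.values().removeIf(value -> value == null);   // drop tombstones like "K:7 -"
      return merged;
  }

  // For the example above: SSTable-(3) = {2=C, 7=null}, SSTable-(2) = {4=Q, 7=S},
  // SSTable-(1) = {2=T, 8=W}  ->  compact(...) == {2=C, 4=Q, 8=W}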
RocksDB Asynchronous Checkpoint (simplified)
▪ Flush Memtable.
▪ Then create iterator over all current SSTables (they are immutable).
▪ New changes go to Memtable and future SSTables and are not considered by the iterator.
20
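In code form the idea looks roughly like this (hypothetical names and types, not the actual Flink/RocksDB API): the synchronous part is cheap, and the expensive iteration and writing runs in the background while processing continues.

  // Synchronous part: stall the pipeline only to flush and capture the file list.
  List<Path> synchronousSnapshotPart(RocksDbHandle db) {
      db.flushMemtable();                        // all current state is now in SSTables
      return db.currentSstableFiles();           // SSTables are immutable, so safe to read later
  }

  // Asynchronous part: new writes go to the memtable and future SSTables,
  // which are simply not part of the captured list.
  void asynchronousSnapshotPart(List<Path> capturedSstables, CheckpointOutputStream out) {
      for (Path sstable : capturedSstables) {
          out.writeFile(sstable);
      }
  }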
RocksDB Incremental Checkpoint
▪ Flush Memtable.
▪ For all SSTable files in local working directory: upload new files since last checkpoint to DFS, re-reference other files.
▪ Do reference counting on JobManager for uploaded files.
21
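A sketch of the per-checkpoint decision (hypothetical helper types, not Flink's actual classes): only files that the previous checkpoint did not already upload go to DFS, everything else is re-referenced, and the JobManager keeps a reference count per uploaded file so it can delete a file once no retained checkpoint uses it.

  Map<String, StateHandle> incrementalCheckpoint(Set<String> localSstableFiles,
                                                 Map<String, StateHandle> previousCheckpoint) {
      Map<String, StateHandle> current = new HashMap<>();
      for (String fileName : localSstableFiles) {
          StateHandle handle = previousCheckpoint.get(fileName);
          if (handle == null) {
              handle = uploadToDfs(fileName);    // new SSTable since the last checkpoint
          }
          current.put(fileName, handle);         // existing remote files are only re-referenced
      }
      return current;                            // reported to the JobManager, which bumps ref counts
  }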
Incremental Checkpoints - Example
▪ Illustrate interactions (simplified) between:
▪ Local RocksDB instance
▪ Remote Checkpoint Directory (DFS)
▪ Job Manager (Checkpoint Coordinator)
▪ In this example: 2 latest checkpoints retained
22
Incremental Checkpoints - Example
[Diagram, one row per checkpoint: TaskManager Local RocksDB Working Directory | DFS Upload | Job Manager]
CP 1: local sstable-(1), sstable-(2); uploads sstable-(1), sstable-(2)
CP 2: local sstable-(1), sstable-(2), sstable-(3), sstable-(4); uploads sstable-(3), sstable-(4) and re-references sstable-(1), sstable-(2)
CP 3: merge produces sstable-(1,2,3); local sstable-(1,2,3), sstable-(4), sstable-(5); uploads sstable-(1,2,3), sstable-(5) and re-references sstable-(4)
CP 4: +sstable-(6), merge produces sstable-(4,5,6); local sstable-(1,2,3), sstable-(4,5,6); uploads sstable-(4,5,6) and re-references sstable-(1,2,3)
The Job Manager registers the files referenced by CP 1 … CP 4; with only the 2 latest checkpoints retained, files no longer referenced by any retained checkpoint are released.
Summary RocksDB
▪ Asynchronous snapshots „for free“: an immutable copy is already there.
▪ Trivially supports multiple concurrent snapshots.
▪ Detection of incremental changes is easy: observe creation and deletion of SSTables.
▪ Bounded checkpoint history through compaction.
31
Asynchronous Checkpoints
with the Heap Keyed-State Backend
32
Heap Backend: Chained Hash Map
33
[Diagram: the table array of Map<K, S> points to an Entry (K1, S1) whose next pointer leads to another Entry (K2, S2)]
Map<K, S> {
  Entry<K, S>[] table;
}

Entry<K, S> {
  final K key;
  S state;
  Entry next;
}
Heap Backend: Chained Hash Map
37
[Same diagram and code as above, annotated:]
„Structural changes to map“ (easy to detect)
„User modifies state objects“ (hard to detect)
Copy-on-Write Hash Map
38
[Same diagram as above, annotated: „Structural changes to map“ (easy to detect), „User modifies state objects“ (hard to detect)]
Map<K, S> {
  Entry<K, S>[] table;
  int mapVersion;
  int requiredVersion;
  OrderedSet<Integer> snapshots;
}

Entry<K, S> {
  final K key;
  S state;
  Entry next;
  int stateVersion;
  int entryVersion;
}
Copy-on-Write Hash Map - Snapshots
39
Map<K, S> {
  Entry<K, S>[] table;
  int mapVersion;
  int requiredVersion;
  OrderedSet<Integer> snapshots;
}

Entry<K, S> {
  final K key;
  S state;
  Entry next;
  int stateVersion;
  int entryVersion;
}

Create Snapshot:
1. Flat array-copy of table array
2. snapshots.add(mapVersion);
3. ++mapVersion;
4. requiredVersion = mapVersion;

Release Snapshot:
1. snapshots.remove(snapVersion);
2. requiredVersion = snapshots.getMax();
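Spelled out as Java-flavored pseudocode, the steps above look roughly like this (a sketch, not the actual CopyOnWriteStateTable implementation; OrderedSet is the abstraction from the struct definition, and an empty snapshot set is assumed to yield requiredVersion 0):

  Entry<K, S>[] createSnapshot() {
      Entry<K, S>[] snapshotTable = Arrays.copyOf(table, table.length);  // flat array copy only
      snapshots.add(mapVersion);
      ++mapVersion;
      requiredVersion = mapVersion;
      return snapshotTable;          // entries and state objects remain shared with the snapshot
  }

  void releaseSnapshot(int snapVersion) {
      snapshots.remove(snapVersion);
      requiredVersion = snapshots.isEmpty() ? 0 : snapshots.getMax();    // 0 when no snapshot is left
  }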
Copy-on-Write Hash Map - 2 Golden Rules
1. Whenever a map entry e is modified and e.entryVersion < map.requiredVersion, first copy the entry and redirect pointers that are reachable through snapshot to the copy. Set e.entryVersion = map.mapVersion. Pointer redirection can trigger recursive application of rule 1 to other entries.
2. Before returning the state object s of entry e to a caller, if e.stateVersion < map.requiredVersion, create a deep copy of s and redirect the pointer in e to the copy. Set e.stateVersion = map.mapVersion. Then return the copy to the caller. Applying rule 2 can trigger rule 1.
40
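As an illustration of how the two rules interact on a read (a simplified sketch under the definitions above; findEntry, copyEntryAndRewire and deepCopy are assumed helpers, not names from the real code):

  S get(K key) {
      Entry<K, S> e = findEntry(key);                 // normal hash + chain lookup
      if (e == null) {
          return null;
      }
      if (e.stateVersion < requiredVersion) {         // a snapshot may still read this state object
          if (e.entryVersion < requiredVersion) {     // rule 1: the entry itself is still shared
              e = copyEntryAndRewire(e);              // may recursively copy predecessor entries
              e.entryVersion = mapVersion;
          }
          e.state = deepCopy(e.state);                // rule 2: hand the caller a private copy
          e.stateVersion = mapVersion;
      }
      return e.state;
  }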
Copy-on-Write Hash Map - Example (slides 41-58)
41: Initial map with two chained entries, K:23 -> S:3 and K:42 -> S:7, both with stateVersion sv:0 and entryVersion ev:0. mapVersion: 0, requiredVersion: 0.
42: snapshot() -> 0. snapshotVersion: 0, mapVersion: 1, requiredVersion: 1; entries and state objects are still shared with the snapshot.
43: put(13, 2). A new entry K:13 -> S:2 is inserted with sv:1, ev:1; the existing entries are untouched.
44-45: get(23). The caller could modify the returned state object, which is still shared with snapshot 0.
46-47: get(23) therefore triggers a copy of the state S:3 (rule 2) and of the entry for K:23 (rule 1). "Trigger copy. How to connect?"
48-49: Connecting the copied entry into the current chain redirects a pointer in a predecessor entry, which triggers rule 1 recursively: that entry is copied as well (ev:1) while its state remains shared (sv:0). Snapshot 0 still sees the original entries and state.
50-51: get(42). sv:0 < requiredVersion: 1, so a copy is required.
52: After get(42), the entry for K:42 carries sv:1, ev:1 and the caller receives a private copy of S:7.
53: release(0). mapVersion: 1, requiredVersion: 0.
54: snapshot() -> 1. snapshotVersion: 1, mapVersion: 2, requiredVersion: 2.
55-56: remove(23). Unlinking the entry means changing its predecessor's next pointer ("Change next-pointer?"); since that entry has ev:1 < requiredVersion: 2, it is copied first (ev:2).
57: The entry for K:23 is then removed from the current chain; snapshot 1 still sees it.
58: release(1). mapVersion: 2, requiredVersion: 0. The map now contains K:13 -> S:2 and K:42 -> S:7.
Copy-on-Write Hash Map Benefits
▪ Fine-grained, lazy copy-on-write.
▪ At most one copy per object, per snapshot.
▪ No thread synchronisation between readers and writer after the array-copy step.
▪ Supports multiple concurrent snapshots.
▪ Also implements efficient incremental rehashing.
▪ Future: could serve as basis for incremental snapshots.
59
Questions?
60



Editor's Notes

  • #2: I would like to start with a very brief recap of how state and checkpoints work in Flink, to motivate the challenges they impose on the data structures of our backends.
  • #5: The state backend is local to the operator so that we can operate at the speed of local memory variables. But how can we get a consistent snapshot of the distributed state?
  • #11: Users want to go to 10 TB or even up to 50 TB of state: significant overhead! Beauty: snapshots are self-contained and can easily be deleted or recovered.
  • #75: RocksDB is an embeddable persistent k/v store for fast storage. If the user modifies the state under key 42 three times, only the latest change is reflected in the memtable.
  • #77: The checkpoint coordinator lives in the JobMaster. The state backend working area is local memory / local disk; the checkpointing area is DFS (stable, remote, replicated).
  • #81: Flush the memtable as a first step, so that all deltas are on disk. SST files are immutable; their creation and deletion is the delta. We have a shared folder for the deltas (SSTables) that belong to the operator instance.
  • #86: Important new step: we can only reference successful incremental checkpoints in future checkpoints!
  • #88: New set of SST tables; 225 is missing, so is 227… we can assume they got merged and consolidated in 228 or 229.
  • #98: Think of events from user interactions: the key is a user id and we store a click count for each user. We talk about keyed state.