Scaling Flink in Cloud
Steven Wu @stevenzwu
Agenda
● Introduction
● Scaling stateless jobs
● Scaling stateful jobs
Agenda
● Introduction
● Scaling stateless jobs
● Scaling stateful jobs
Running Flink on Titus
(Netflix’s in-house container runtime)
Job isolation: single job
(Diagram: each Flink standalone cluster runs as two Titus jobs - the Job Manager as Titus Job #1 and the Task Managers as Titus Job #2)
State backend and checkpoint store
State backend
● Memory
● File system
● RocksDB
Source: http://flink.apache.org/
Checkpoint store
● HDFS
● S3
Why S3 as the snapshot store
● The only checkpoint store supported out of the box for the Amazon cloud
● Cost-effective, scalable, durable
S3 concepts
● Massive storage system
● Bucket: container for objects
● Object: identified by a key (and a version)
● Filesystem-like operations
○ GET, PUT, DELETE, LIST, HEAD
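A minimal sketch of these operations with the AWS SDK for Java (v1); the bucket, keys, and payload are illustrative (reused from the key examples below) and not part of the original deck:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.S3ObjectSummary;

public class S3BasicsSketch {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // PUT: write an object under a key
        s3.putObject("examplebucket", "2018-04-01/00/data1.avro", "payload");

        // HEAD: fetch object metadata only (size, etag, ...)
        ObjectMetadata meta = s3.getObjectMetadata("examplebucket", "2018-04-01/00/data1.avro");
        System.out.println("size = " + meta.getContentLength());

        // LIST: prefix query, e.g. all objects for one date/hour
        ObjectListing listing = s3.listObjects("examplebucket", "2018-04-01/00/");
        for (S3ObjectSummary obj : listing.getObjectSummaries()) {
            System.out.println(obj.getKey());
        }

        // DELETE: remove an object by key
        s3.deleteObject("examplebucket", "2018-04-01/00/data1.avro");
    }
}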
examplebucket/2018-04-01/00/data1.avro
examplebucket/2018-04-01/00/data2.avro
examplebucket/2018-04-02/01/data1.avro
examplebucket/2018-04-02/01/data2.avro
examplebucket/2018-04-02/03/data1.avro
examplebucket/2018-04-02/08/data1.avro
examplebucket/2018-04-03/23/data1.avro
S3 sharding: range partition
Partition 1
Partition 2
Partition 3
date / hour / file
examplebucket/2018-04-01/00/data1.avro
examplebucket/2018-04-01/00/data2.avro
examplebucket/2018-04-02/01/data1.avro
examplebucket/2018-04-02/01/data2.avro
examplebucket/2018-04-02/03/data1.avro
examplebucket/2018-04-02/08/data1.avro
examplebucket/2018-04-03/23/data1.avro
LIST (prefix query)
Partition 1
Partition 2
Partition 3
date / hour / file
S3 scaling
● If request rate grows steadily, S3 automatically
partitions buckets as needed to support higher
request rates
examplebucket/2018-04-01/00/data1.avro
examplebucket/2018-04-01/00/data2.avro
examplebucket/2018-04-01/00/data3.avro
examplebucket/2018-04-01/00/data4.avro
examplebucket/2018-04-01/00/data5.avro
examplebucket/2018-04-01/00/data6.avro
examplebucket/2018-04-01/00/data7.avro
Avoid sequential key names
if over 100 reqs/second
examplebucket/232a/2018-04-01/00/data1.avro
examplebucket/7b54/2018-04-01/00/data2.avro
examplebucket/921c/2018-04-01/00/data3.avro
examplebucket/ba65/2018-04-01/00/data4.avro
examplebucket/8761/2018-04-01/00/data5.avro
examplebucket/a390/2018-04-01/00/data6.avro
examplebucket/5d6c/2018-04-01/00/data7.avro
Introduce random prefix in key name
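A minimal sketch of the trick (a hypothetical helper, not from the deck): prepend a 4-char random hex prefix to otherwise sequential key names so writes spread across S3 partitions. As the speaker notes point out, the trade-off is that you lose cheap prefix queries.

import java.util.concurrent.ThreadLocalRandom;

public class RandomPrefixSketch {
    // Prepend a 4-char random hex prefix so writes spread across S3 partitions.
    static String prefixedKey(String key) {
        String prefix = String.format("%04x", ThreadLocalRandom.current().nextInt(0x10000));
        return prefix + "/" + key;
    }

    public static void main(String[] args) {
        // e.g. "7b54/2018-04-01/00/data2.avro"
        System.out.println(prefixedKey("2018-04-01/00/data2.avro"));
    }
}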
S3 Performance
● Optimized for high I/O throughput
● Not optimized for high request rate without
tweaking key names
● Not optimized for small files
● Not optimized for consistent low latency
Agenda
● Introduction
● Scaling stateless jobs
● Scaling stateful jobs
Highly available ingest pipelines - the backbone of a real-time data infrastructure
(Diagram: event producers → ingest pipelines → sinks)
Events are published to fronting Kafka
directly or via proxy
(Diagram: Event Producer → HTTP/gRPC → KSGateway → Fronting Kafka → Flink Router → Consumer Kafka → Stream Consumers, with Keystone Management alongside)
Events land in the fronting Kafka cluster
(Same pipeline diagram)
Events are polled by the router; filters and projections are applied
(Same pipeline diagram)
Router sends events to destination
(Same pipeline diagram)
(Same pipeline diagram)
Keystone routing jobs
● Stateless
● Embarrassingly parallel
Keystone router scale
● ~3 trillion events/day
● ~2,000 routing jobs
● ~10,000 containers
● ~200,000 parallel operator instances
Math 101: S3 writes
● ~2,000 routing jobs
● checkpoint interval is 30 seconds
● ~67 (= 2,000 / 30) S3 writes per second?
Adapted from http://flink.apache.org/
Each operator writes to S3
(Diagram: checkpoint barriers flow through the job; each operator uploads its state snapshot to S3)
Math 201: S3 writes
● ~200,000 operators. Each operator writes
checkpoint to S3
● checkpoint interval is 30 seconds
● ~6,600 writes (= 200,000 / 30) per second
○ Actual writes 2-3x smaller because only Kafka
source operators have state
S3 throttling!
S3 not optimized for high request rate
without tweaking key names
Checkpoint path
state.checkpoints.dir: s3://bucket/checkpoints/<deploy timestamp>/<job id>
Introduce entropy in checkpoint path
state.checkpoints.dir: s3://bucket/checkpoints/<4-char random hex>/<deploy timestamp>/<job id>
S3 not optimized for small files
Checkpoint ack with metadata
(Diagram: each operator writes its state snapshot to S3 and sends an ACK with metadata to the Job Manager)
Uber checkpoint file after all ACKs
(Diagram: after all ACKs, the Job Manager writes one uber checkpoint file with the metadata to S3)
state.backend.fs.memory-threshold: 1024
Checkpoint ack with state
(Diagram: each operator sends its state in the ACK to the Job Manager instead of writing it to S3)
Uber checkpoint file after all ACKs
(Diagram: the Job Manager writes one uber checkpoint file with metadata + state to S3)
state.backend.fs.memory-threshold: 1024000
Avoid S3 writes from task managers
● Only job manager writes one uber checkpoint file
● Reduced checkpoint duration by 10x
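For reference, a minimal sketch (bucket path, threshold, and interval are illustrative; the talk used the flink-conf.yaml setting above) of the equivalent programmatic setup: with the filesystem state backend, state below the file-size threshold travels back to the job manager in the checkpoint ACK instead of being written to S3 by the task managers.

import java.net.URI;
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MemoryThresholdSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // State smaller than the threshold (here ~1 MB) is shipped to the
        // job manager in the checkpoint ACK instead of being written to S3.
        FsStateBackend backend = new FsStateBackend(new URI("s3://bucket/checkpoints"), 1_024_000);
        env.setStateBackend(backend);

        env.enableCheckpointing(30_000); // 30-second checkpoint interval
        // ... define sources/operators/sinks, then env.execute(...) ...
    }
}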
S3 HEADs are 100x the PUTs
Issue #1: Hadoop S3 file system
● Half of the HEADs failed for non-existent objects
● Always two HEADs for the same object (with and without trailing slash)
○ checkpoints/<flink job>/fe68ab5591614163c19b55ff4aa66ac
○ checkpoints/<flink job>/fe68ab5591614163c19b55ff4aa66ac/
HEAD requests coming from task
managers
BTrace: dynamic tracing tool for Java
● Dynamically trace a running Java process
● Dynamically instruments the classes of the target
application to inject tracing code ("bytecode
tracing")
import com.sun.btrace.annotations.*;
import static com.sun.btrace.BTraceUtils.*;
import static com.sun.btrace.BTraceUtils.Sys.*;
import java.util.concurrent.atomic.*;

@BTrace
public class S3ReqTracing {
    private static AtomicLong putCounter = newAtomicLong(0);
    private static AtomicLong headCounter = newAtomicLong(0);

    // Count S3 PUT requests (writes)
    @OnMethod(
        clazz = "com.amazonaws.services.s3.AmazonS3Client",
        method = "putObject"
    )
    public static void trackPut() {
        addAndGet(putCounter, 1L);
    }

    // Count S3 HEAD requests (getObjectMetadata) and dump the caller's stack
    @OnMethod(
        clazz = "com.amazonaws.services.s3.AmazonS3Client",
        method = "getObjectMetadata",
        location = @Location(value = Kind.LINE, line = 966)
    )
    public static void trackHead() {
        addAndGet(headCounter, 1L);
        jstack();
    }

    // Print and reset both counters every 30 seconds
    @OnTimer(30000)
    public static void dumpCounters() {
        printNumber("put", getAndSet(putCounter, 0));
        printNumber("head", getAndSet(headCounter, 0));
    }
}
Run btrace on task manager
● bin/btrace <PID> S3ReqTracing.java
Setup
● 1-CPU container
● 1 subtask with two operators
Findings from task manager
● No S3 writes
● 4 HEAD requests per checkpoint interval
○ 1 (subtask) * 2 (operators) * 2 (with and without
trailing slash)
com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:966)
com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:956)
org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:892)
org.apache.hadoop.fs.s3a.S3AFileSystem.mkdirs(S3AFileSystem.java:848)
org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1877)
org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.mkdirs(HadoopFileSystem.java:433)
org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory.<init>(FsCheckpointStreamFactory.java:105)
org.apache.flink.runtime.state.filesystem.FsStateBackend.createStreamFactory(FsStateBackend.java:174)
org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.createStreamFactory(StreamTask.java:987)
org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.executeCheckpointing(StreamTask.java:956)
org.apache.flink.streaming.runtime.tasks.StreamTask.checkpointState(StreamTask.java:583)
org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:551)
org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:511)
Math 301: metadata reqs
● ~200,000 operators
● Each operator creates 2 HEAD requests (with and
without trailing slash)
● checkpoint interval is 30 seconds
● ~13,000 (200,000 * 2 / 30) HEAD reqs/s from
task managers even though they write zero S3
files
Create CheckpointStreamFactory
only once during operator
initialization (FLINK-5800)
Fixed in 1.2.1 (https://github.com/apache/flink/pull/3312)
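A minimal sketch of the idea behind the fix, using hypothetical classes rather than the real Flink code: create the stream factory once when the operator initializes instead of once per checkpoint, so the mkdirs()-driven HEAD requests happen only at startup.

// Hypothetical illustration of FLINK-5800, not the actual Flink classes.
class CheckpointStreamFactory {
    CheckpointStreamFactory(String basePath) {
        // In the real code this path triggers FileSystem.mkdirs(), which on
        // S3A issues HEAD requests for the checkpoint directory.
        System.out.println("mkdirs(" + basePath + ")");
    }
    void createStream(long checkpointId) {
        System.out.println("open stream for checkpoint " + checkpointId);
    }
}

class CheckpointingOperator {
    private CheckpointStreamFactory factory;

    // Before the fix: a new factory (and mkdirs/HEADs) on every checkpoint.
    // After the fix: the factory is created once, at operator initialization.
    void initializeState(String checkpointBasePath) {
        factory = new CheckpointStreamFactory(checkpointBasePath);
    }

    void snapshotState(long checkpointId) {
        factory.createStream(checkpointId);
    }
}

public class Flink5800Sketch {
    public static void main(String[] args) {
        CheckpointingOperator op = new CheckpointingOperator();
        op.initializeState("s3://bucket/checkpoints/abcd/job-1");
        for (long id = 1; id <= 3; id++) {
            op.snapshotState(id); // no additional mkdirs() per checkpoint
        }
    }
}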
Fine grained recovery (FLIP-1)
What is fine grained recovery
(Diagram: an embarrassingly parallel job with three independent chains A1-B1-C1, A2-B2-C2, A3-B3-C3; when one task fails, only its own chain is restarted while the other chains keep running)
Life without fine grained recovery
Each kill (every 10 minutes) caused ~2x spikes
Sometimes revert to full restart
(Chart: fine grained recovery bumps vs. full restart spikes)
Current implementation issue (FLINK-8042)
● Reverts to full restart immediately if the replacement container doesn't come back in time
● Fix expected in FLIP-6
Workaround: +1 standby container
(Diagram: Job Manager with the regular Task Managers plus one standby Task Manager)
Fine grained recovery in action
Recap of scaling stateless jobs
● Introduce random prefix in checkpoint path to
spread S3 writes from many different jobs
● Avoid S3 writes from task managers
● Enable fine grained recovery (+1 standby)
Agenda
● Introduction
● Scaling stateless jobs
● Scaling stateful jobs
Stateful jobs often come with data shuffling
(Diagram: source → keyBy → window → sink; the keyBy shuffle connects all parallel instances, making the job graph connected)
Challenges of large-state job
● Introduce random hex chars in checkpoint path to
spread S3 writes from different jobs
○ Single job writes large state to S3
● Avoid S3 writes from task managers
○ Each task manager has large state
● Enable fine grained recovery (+1 standby)
○ Connected job graph
Challenges of large-state job
● Single job writes large state to S3
● Each task manager has large state
● Connected job graph
Inject dynamic entropy in S3 path
state.backend.fs.checkpointdir.injectEntropy.enabled: true
state.backend.fs.checkpointdir.injectEntropy.key: __ENTROPY_KEY__
state.checkpoints.dir: s3://bucket/__ENTROPY_KEY__/path
2.5x throughput improvement
Contributing back: FLINK-9061
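A minimal sketch of the substitution (hypothetical helper, not the actual patch discussed in FLINK-9061): each time a checkpoint file path is resolved, the __ENTROPY_KEY__ marker is replaced with a fresh 4-char random hex string, so writes from different operators of the same job spread across S3 partitions.

import java.util.concurrent.ThreadLocalRandom;

public class EntropyInjectionSketch {
    static final String ENTROPY_KEY = "__ENTROPY_KEY__";

    // Replace the marker with a fresh random hex string for every write.
    static String resolveCheckpointPath(String configuredDir, String fileName) {
        String entropy = String.format("%04x", ThreadLocalRandom.current().nextInt(0x10000));
        return configuredDir.replace(ENTROPY_KEY, entropy) + "/" + fileName;
    }

    public static void main(String[] args) {
        String configured = "s3://bucket/" + ENTROPY_KEY + "/path";
        // e.g. s3://bucket/3fa1/path/chk-42/operator-state-17
        System.out.println(resolveCheckpointPath(configured, "chk-42/operator-state-17"));
    }
}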
Challenges of large-state job
● Single job writes large state to S3
● Each task manager has large state
● Connected job graph
Tuning Flink to stabilize
● Enable incremental checkpoint with RocksDB
● RocksDB tuning: FLASH_SSD_OPTIMIZED
● Network buffer: taskmanager.network.memory.max=4 GB
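A minimal sketch of this tuning in code (bucket path and interval are illustrative; the talk applied equivalent settings through configuration):

import org.apache.flink.contrib.streaming.state.PredefinedOptions;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class LargeStateTuningSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // RocksDB state backend with incremental checkpoints enabled.
        RocksDBStateBackend backend = new RocksDBStateBackend("s3://bucket/checkpoints", true);
        // Predefined RocksDB tuning for SSD-backed local disks.
        backend.setPredefinedOptions(PredefinedOptions.FLASH_SSD_OPTIMIZED);
        env.setStateBackend(backend);

        env.enableCheckpointing(15 * 60 * 1000); // 15-minute checkpoint interval
        // taskmanager.network.memory.max is a flink-conf.yaml setting (e.g. 4 GB),
        // not a per-job API, so it is not shown here.
        // ... define sources/operators/sinks, then env.execute(...) ...
    }
}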
Setup for performance test
● Cluster size: 200 nodes
○ 16 CPUs
○ 54 GB memory
○ 108 GB SSD-backed EBS volume
● Parallelism: 3,200
Numbers for savepoint
● Size: 21 TB
● Time to take: 27 minutes
● Recovery time: 6 minutes
Numbers for incremental checkpoint
● Checkpoint interval: 15 mins
● Size (avg): 950 GB
● Duration (avg): 2.5 minutes
Challenges of large-state job
● Single job writes large state to S3
● Each task manager has large state
● Connected job graph
Full restart with connected graph
(Diagram: with a connected job graph, a single task failure restarts all of A1-A3, B1-B3, and C1-C3)
Recover data from S3
(Diagram: after the restart, every task manager (TM #1, #2, #3) downloads its operators' state from S3 to local disk (HDD))
Recover data from S3
(Diagram: TM #1 and TM #2 are still running and still have the state on local disk, yet they also re-download it from S3)
Task local recovery (FLINK-8360)
(Diagram: with task local recovery, operators are scheduled back onto the same task managers; TM #1 and TM #2 recover from local disk, and only the replacement TM #3 downloads state from S3)
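For reference, once task local recovery shipped it can be enabled with configuration along these lines (these option names are from later Flink releases, not from the original talk, and the local directory is illustrative):

state.backend.local-recovery: true
taskmanager.state.local.root-dirs: /mnt/flink/local-recovery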
Task local recovery with EBS
(Diagram: local state kept on EBS volumes; the replacement TM #3 re-attaches the EBS volume of the terminated container, so no task manager needs to download state from S3)
Recap of scaling stateful jobs
● Inject dynamic random prefix in checkpoint path
to spread S3 writes from operators in the same
job
● Enable incremental checkpoint with RocksDB
● Challenge: connected graph makes recovery
more expensive
Thank you!
Steven Wu @stevenzwu
Editor's Notes
  • #2: Today, I am going to share our experiences running Flink at scale in a cloud environment: what the challenges are and what the solutions are.
  • #5: We run Flink on our Titus container platform. Titus is similar to Kubernetes. It is developed in house and not open sourced yet.
  • #7: The Flink state backend defines the data structure that holds the state. It also implements the logic to take a snapshot of the job state and store that snapshot in some distributed file system like S3. Checkpointing is how Flink achieves fault tolerance.
  • #8: Flink supports S3 as the distributed storage system for checkpoint state out of the box. Hadoop and Presto provide S3 adapters that implement the HDFS interface on top of Amazon S3. S3 is very cost effective. It is scalable, although sometimes you may need to jump through some hoops. It is highly durable, with 11 9's of durability.
  • #9: S3 is designed as a massive (effectively infinitely large) storage system with very high durability. Netflix uses S3 for our data warehouse, which stores over a hundred petabytes of compressed data.
  • #10: S3 shards data by range partitioning. Object keys are stored in order across multiple partitions.
  • #11: With range partitioning, S3 can support prefix queries efficiently. In this example, when you query objects with this date prefix, S3 knows it only needs to look into partitions 1 and 2.
  • #12: If you have a big rollout and sudden traffic jump, you would want to work with AWS to pre-partition your bucket for higher throughput.
  • #13: Using a sequential prefix, such as a timestamp, increases the likelihood that Amazon S3 will target one specific partition for a large number of your keys, overwhelming the I/O capacity of the partition.
  • #14: If your workload consistently exceeds 100 requests per second for a bucket, Amazon recommends avoiding sequential key names and introducing some random prefix into key names, so that the key names and the I/O load are distributed across more than one partition. Note that with a random prefix you can't really do prefix queries anymore, because there is no longer a common prefix.
  • #15: S3 is optimized for high I/O throughput, but not for small files. That's why our Hive data warehouse compacts small files into larger files (a few hundred MB each) to improve read performance. If you want to checkpoint at high frequency (e.g. every second), S3 is probably not the best choice; you probably want to consider a store that can deliver consistently low latency (e.g. DynamoDB).
  • #17: At the 10,000-foot level, the Keystone data pipeline is responsible for moving data from producers to sinks for consumption. We will get into more details of the Keystone pipeline when we talk about the Keystone router later.
  • #18: Pretty much every application publishes some data to our data pipeline.
  • #23: No data shuffling in the job graph
  • #24: 2,000 jobs. They come in different sizes: some small jobs only need one 1-CPU container, while some large jobs have over 100 containers, each with 8 CPUs.
  • #26: Let's zoom in a little bit on how Flink performs a checkpoint. As the checkpoint barrier passes, each operator snapshots its state and uploads the snapshot to S3. In other words, each operator writes to S3 during each checkpoint cycle.
  • #27: The actual write rate is probably 2-3 times smaller, because only the Kafka source operator has state and needs to write to S3; even ~2,000 writes per second is still a lot. While it is straightforward to do a back-of-the-envelope calculation for the write volume, it is difficult to estimate the request rates for other S3 operations (like GET or LIST).
  • #30: At the beginning, we set the checkpoint path like this. Using a timestamp causes sequential key names, and as we said earlier, sequential keys don't scale well.
  • #31: We said earlier that we need to avoid sequential key names if we want to scale beyond 100 reqs/second without throttling. We introduced this 4-char random hex prefix in the S3 checkpoint path. Such random hex chars distribute S3 writes from many different routing jobs across different S3 partitions. This is just a trick in our deployment tooling; no change is needed in Flink.
  • #33: Each operator writes a checkpoint file to S3 for its own state. For a stateless job, this creates many small files. After writing the snapshot to S3, operators send acknowledgements back to the job manager.
  • #34: After the job manager gets the acknowledgements from all operators, it writes an uber checkpoint file with all the metadata received from the acknowledgements.
  • #35: Flink has this awesome memory-threshold feature. We set this threshold to 1 MB for the Keystone router.
  • #36: If the operator state size is smaller than this threshold (the default is 1024 bytes), the task manager ships the state to the job manager without writing anything to S3.
  • #37: After the job manager gets the acknowledgements from all operators, it writes the uber checkpoint file with the state embedded along with the other metadata.
  • #38: Flink has this awesome memory-threshold feature. We set this threshold to 1 MB for the Keystone router.
  • #40: If you are not familiar with S3, HEAD requests query object metadata and PUT requests are writes. What really caught us by surprise is the fact that HEAD requests were ~150 times the PUT requests. We enabled S3 access logs to investigate.
  • #41: The first request is for the dir without the trailing slash char, which always results in a 404 NoSuchKey failure. The second request, with the trailing slash char, always succeeds. This is an unfortunate behavior of the Hadoop S3 file system implementation, but it is actually a minor issue in the whole picture, as it only explains 2x. What about the other ~75x difference? That is the bigger fish we should target. I believe this minor issue still exists as of today.
  • #42: I manually spot-checked client IP addresses in the access log. Those HEAD requests all come from task managers. Task managers do not write any checkpoint files to S3 anymore, so why are they making so many HEAD requests?
  • #43: To find out why we were making so many HEAD requests, I started to run BTrace on a task manager process.
  • #49: I don't expect you to read the stack trace here. Here is the takeaway: even though the task manager doesn't actually write to S3, it still goes through the checkpoint code path, where a FsCheckpointStreamFactory object is created for each operator in each checkpoint cycle. The FsCheckpointStreamFactory constructor calls the mkdirs() method, which results in S3 metadata requests.
  • #50: Even though HEAD requests are pretty cheap metadata queries, they are still counted when S3 enforces throttling on request rate. And again, S3 is not optimized for high request rates.
  • #51: The key problem is that a CheckpointStreamFactory is created in each checkpoint cycle. After we shared this finding on 1.2.0, Stephan Ewen quickly fixed it in the 1.2.1 release.
  • #52: For stateless jobs, I strongly encourage you to consider fine grained recovery, which Flink has implemented since 1.3.
  • #53: Here is a simple embarrassingly parallel job DAG: no data shuffling, three operators running with a parallelism of 3. A is the source operator and C is the sink operator.
  • #54: Here is a simple embarrassingly parallel job DAG: no data shuffling, three operators running with a parallelism of 3. A is the source operator and C is the sink operator.
  • #55: Flink only needs to restart the portion of the DAG marked in gray. The other parallel chains are unaffected and untouched.
  • #57: This graph shows the impact of a full job restart. The X axis is time; the Y axis is the message rate per second. The red line is the incoming message rate to the Kafka topic; the blue line is the record consume rate of the Flink job. In this graph, the message rate peaked at 800K messages per second and is coming off peak hours. We enabled Chaos Monkey to kill one container every 10 minutes. You can see that each kill caused a full job restart and a subsequent recovery spike of over 2x the incoming message rate. That means significant duplicates, which can be problematic for some jobs. You may wonder why we would run Chaos Monkey killing so frequently. This is to simulate a real-world scenario: as I mentioned earlier, our Flink jobs run on the Titus container platform, and when the Titus team updates code on the agent hosts, they kill one container per ASG every 10 minutes to evacuate containers off the old agents.
  • #58: The small bumps are fine grained recovery working; the big spikes are full restarts. This Flink job is actually not very bad: only a small number of recoveries reverted to a full restart. In another job, we have seen it revert to a full restart over 80% of the time.
  • #60: That is how we reduce or avoid the reversion to full restart.
  • #61: Same Flink job with fine grained recovery enabled. This is a 20-node cluster. If we kill one task manager, that is about 5% of the job graph. Recovery bump is proportional to that at ~5%.
  • #63: Now let's shift gears from stateless computation to stateful computation and look at the challenges and some of the solutions for scaling large-state jobs. By large state, I mean as large as TBs.
  • #64: Stateful jobs often have data shuffling to bring events for the same key to the same operator. The graph is connected now, not embarrassingly parallel anymore.
  • #66: Here the challenge is that hundreds or thousands of parallel operators from the same job are writing large state to S3.
  • #67: We introduced a new config to dynamically substitute the "__ENTROPY_KEY__" substring in the checkpoint path with a 4-char random hex string for each S3 write. In other words, each operator gets a checkpoint path with its own random prefix. This way, we can spread the S3 writes from different operators of the same Flink job across different S3 partitions.
  • #68: We'd like to contribute this improvement back and are discussing it with the community in FLINK-9061.
  • #70: For a large-state job, we have to do the following tunings so that the Flink job can keep up with the state churn and checkpointing. For very large state (like TBs), you probably only want to use the RocksDB state backend in Flink; the memory and filesystem state backends just can't scale to very large state. Since our containers come with SSD ephemeral disks, Flink's predefined RocksDB tuning for SSD drives works well out of the box. Since this job has a large cluster size and high parallelism, we found it helpful to increase the network buffer size from the default 1 GB to 4 GB.
  • #71: I want to share some performance test numbers. By no means are we claiming this is the best you can do with Flink; I just want to give you some idea of what is possible with Flink today. There is plenty of room for improvement, both in the Flink platform and in our application.
  • #72: For those who are not familiar with savepoints: a savepoint is like a checkpoint but allows you to rescale the job with a different parallelism. We use savepoints to get an idea of the total state size.
  • #73: We are pretty happy with these numbers. At least, they show that we can build a large-state application on Flink dealing with TBs of state.
  • #78: Assume A1, B1, and C1 run on TM #1, and similarly for TM #2 and #3. When TM #3 gets terminated, the full job gets restarted.
  • #79: Currently, all operators on all task managers download data from S3 and recover from the downloaded data. Is that really necessary? Obviously task manager #3 has no choice, since its ephemeral disk is lost when the container is terminated; the data is gone. But what about task managers #1 and #2? They are still running and their local disks still have the data. If we could reschedule the same operators onto the same task managers, they potentially wouldn't need to download data from S3.
  • #80: That is exactly what the upcoming feature called task local recovery will do. Flink implements scheduling affinity that schedules the same operators back onto the same task managers. This way, task managers #1 and #2 can recover the job from local data. This may not be a big deal for a cluster with 3 task manager nodes, but think about the large-state job I showed earlier for the performance numbers: instead of all 200 task managers going to S3 to download 21 TB of state, with task local recovery only one task manager needs to download ~100 GB of state from S3. That makes a huge difference.
  • #81: Once task local recovery is available, we also want to explore EBS with it. For those who are not familiar with EBS (Elastic Block Store): you can think of an EBS volume as a network-attached hard drive that can be mounted to an instance, and only one instance. Even for task manager #3, after the replacement container comes up, it can attach the EBS volume from the previously terminated container; the data is still there in the persistent EBS volume, so task manager #3 can also recover from local data. Nobody needs to download anything from S3, which will make recovery much faster.
  • #83: Before opening up for questions, I want to mention that I will be at the O'Reilly booth between 3 and 4 pm this afternoon. If you have more questions or would just like to chat, please drop by.