Apache Flink® 1.7 and Beyond

Speaker: Till Rohrmann (@stsffap)
Company: data Artisans
Position: Engineering Lead

Original creators of Apache Flink®
dA Platform: Stream Processing for the Enterprise

What is Apache Flink?
Stateful Computations Over Data Streams
• Batch Processing: process static and historic data
• Data Stream Processing: realtime results from data streams
• Event-driven Applications: data-driven actions and services

Flink 1.7: What happened so far?

Flink 1.7.0 in Numbers
• Contributors: 112
• Resolved issues: 430
• Commits: 970
• Changed lines of code: +103824/-63124

Flink Applications Need to Evolve
• E.g. changing requirements, new algorithms, better serializers, bug fixes, etc.
• Restarting an application from scratch is expensive; its state needs to be maintained

State Schema Evolution
• Support for changing state schema
  • Adding/removing fields
  • Changing type of fields
• Currently fully supported when using Avro types
See also: “Upgrading Stateful Flink Streaming Applications: State of the Union” by Tzu-Li Tai, today @ 5:20 pm, Room 2
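
Adding and removing fields can be illustrated with a small sketch in plain Python. This is not Flink's API: the `migrate` helper, the schemas and the field names are hypothetical. It roughly mirrors Avro-style schema resolution, where fields missing from the new schema are dropped and newly added fields are filled from a default.

```python
from typing import Any, Dict

# Hypothetical schemas: field name -> default used when the field is newly added.
OLD_SCHEMA = {"user_id": 0, "clicks": 0, "legacy_flag": False}
NEW_SCHEMA = {"user_id": 0, "clicks": 0, "last_seen": "unknown"}


def migrate(record: Dict[str, Any], new_schema: Dict[str, Any]) -> Dict[str, Any]:
    """Read a record written under an older schema using the new schema.

    Fields that no longer exist in the new schema are dropped;
    fields that were added take their default value.
    """
    return {field: record.get(field, default) for field, default in new_schema.items()}


# State written by the old version of the application.
old_state = {"user_id": 42, "clicks": 7, "legacy_flag": True}
new_state = migrate(old_state, NEW_SCHEMA)
print(new_state)
```

The point of the sketch is that existing state survives the upgrade: values for unchanged fields are carried over instead of being recomputed from scratch.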

Converting Currencies
[Figure: exchange rates € 1, $ 1.13, CN¥ 7.8; clock labels 7:12pm, 9:37am, 8:45am]

Temporal Tables and Joins
Currency  Rate  Time
CN¥       7.8   3
CN¥       7.89  5
CN¥       7.75  9
[Figure: event streams labelled 13, 11, 7; 15, 14, 12; 7, 4]
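
The lookup behind a temporal join can be sketched in plain Python: each order is matched with the rate that was valid at the order's time, not with the latest rate. This is only an illustration, not Flink's implementation. The rate history uses the table values (assuming the last row's time is 9); the `rate_as_of` helper and the orders are hypothetical.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# CN¥ rate history as (time, rate), sorted by time.
RATES: List[Tuple[int, float]] = [(3, 7.8), (5, 7.89), (9, 7.75)]


def rate_as_of(history: List[Tuple[int, float]], t: int) -> Optional[float]:
    """Return the rate valid at time t, i.e. the latest entry with time <= t."""
    times = [time for time, _ in history]
    idx = bisect_right(times, t)
    if idx == 0:
        return None  # no rate was known yet at time t
    return history[idx - 1][1]


# Hypothetical orders as (amount in €, order time).
orders = [(15, 4), (14, 7), (12, 9)]
for amount, t in orders:
    rate = rate_as_of(RATES, t)
    print(f"time={t} amount={amount} rate={rate}")
```

An order placed at time 7 is converted with the rate from time 5, even though a newer rate arrives at time 9.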

SQL for Pattern Analysis
SELECT * FROM ?

MATCH_RECOGNIZE
SELECT *
FROM TaxiRides
MATCH_RECOGNIZE (
  PARTITION BY driverId
  ORDER BY rideTime
  MEASURES
    S.rideId AS sRideId
  AFTER MATCH SKIP PAST LAST ROW
  PATTERN (S M{2,} E)
  DEFINE
    S AS S.isStart = true,
    M AS M.rideId <> S.rideId,
    E AS E.isStart = false
      AND E.rideId = S.rideId
)
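
To make the pattern semantics concrete, here is a rough plain-Python sketch of what the query computes: per driver, a start event S, at least two events M belonging to other rides, then the end event E of the same ride. It is a simplified illustration, not how Flink evaluates MATCH_RECOGNIZE; the `Ride` type, `match_rides` and the sample events are hypothetical.

```python
from typing import Dict, List, NamedTuple


class Ride(NamedTuple):
    driver_id: int
    ride_id: int
    ride_time: int
    is_start: bool


def match_rides(events: List[Ride]) -> List[Dict[str, int]]:
    # PARTITION BY driverId, ORDER BY rideTime
    partitions: Dict[int, List[Ride]] = {}
    for event in sorted(events, key=lambda r: r.ride_time):
        partitions.setdefault(event.driver_id, []).append(event)

    results = []
    for driver_id, rows in partitions.items():
        i = 0
        while i < len(rows):
            s = rows[i]
            if not s.is_start:  # S AS S.isStart = true
                i += 1
                continue
            j = i + 1
            # M{2,}: consume rows whose rideId differs from S.rideId
            while j < len(rows) and rows[j].ride_id != s.ride_id:
                j += 1
            m_count = j - i - 1
            # E: next row has the same rideId as S and is not a start event
            if m_count >= 2 and j < len(rows) and not rows[j].is_start:
                results.append({"driverId": driver_id, "sRideId": s.ride_id})
                i = j + 1  # AFTER MATCH SKIP PAST LAST ROW
            else:
                i += 1
    return results


events = [
    Ride(1, 10, 1, True),   # S
    Ride(1, 11, 2, True),   # M
    Ride(1, 11, 3, False),  # M
    Ride(1, 10, 4, False),  # E
    Ride(2, 20, 1, True),
    Ride(2, 20, 2, False),  # no M rows in between, so no match
]
matches = match_rides(events)
print(matches)
```

Because the M and E conditions are mutually exclusive on `rideId`, a simple greedy scan is enough here; a general pattern would need backtracking.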

More SQL Improvements
• Elasticsearch 6 table sink
• Support for views in SQL Client
• More built-in functions: TO_BASE64, LOG2, REPLACE, COSH, …
See also: “Flink Streaming SQL 2018” by Piotr Nowojski, today @ 4:00 pm, Room 2

Other Notable Features
• Scala 2.12 support
• Exactly-once S3 StreamingFileSink
• Kafka 2.0 connector
• Versioned REST API
• Removal of legacy mode

Flink 1.8+: What is happening next?

Capability Spectrum
[Figure: spectrum from offline to real time with the labels Batch, Event-driven applications, Streaming analytics, Strict SLA applications, and Flink]

Flink as a Library
• Deploying Flink applications should be as easy as starting a process
• Bundle application code and Flink into a single image
• Process connects to other application processes and figures out its role
• Takes the cluster out of the equation
[Figure: processes P1, P2, P3, P4 and a new process]

Reactive vs. Active
• Active mode
  • Flink is aware of the underlying cluster framework
  • Flink allocates resources
  • E.g. existing YARN and Mesos integration
• Reactive mode
  • Flink is oblivious to its runtime environment
  • External system allocates and releases resources
  • Flink scales with respect to available resources
  • Relevant for environments: Kubernetes, Docker, as a library
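
The reactive idea can be sketched in a few lines of plain Python: the job never requests resources itself, it only adapts its parallelism to what the external system currently provides. `ReactiveJob`, `on_resources_changed` and the capping policy are hypothetical and only illustrate the concept; they are not Flink classes.

```python
from dataclasses import dataclass


@dataclass
class ReactiveJob:
    max_parallelism: int
    parallelism: int = 0

    def on_resources_changed(self, available_slots: int) -> int:
        """Rescale to whatever the external system currently provides."""
        # Use every available slot, but never exceed the job's upper bound.
        self.parallelism = max(0, min(self.max_parallelism, available_slots))
        return self.parallelism


job = ReactiveJob(max_parallelism=8)
# An external system (e.g. Kubernetes) adds and removes resources over time.
history = [job.on_resources_changed(slots) for slots in (2, 6, 12, 4)]
print(history)  # [2, 6, 8, 4]
```

In active mode the same decision would run in the other direction: the job would decide on a parallelism and ask the cluster framework for matching resources.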

Dynamic Scaling
• Latency
• Throughput
• Resource utilization
• Connector signals

Batch-Streaming Unification
• No fundamental difference between batch and stream processing
• Batch allows optimizations because data is bounded and “complete”
• Batch and streaming are still treated separately from the task level upwards
• Working toward a single runtime for batch and streaming workloads

Flink Scheduler
• Lazy scheduling (batch case)
  • Deploy tasks starting from the sources
  • Whenever data is produced, start consumers
• Scheduling of idling tasks → resource under-utilization
[Figure: three sources (src) feeding two joins, each with a build side and a probe side]

Batch Scheduler
• More efficient scheduling by taking dependencies into account
• E.g. probe side is only scheduled after build side has been processed
[Figure: three sources (src) feeding two joins, each with a build side and a probe side; scheduling order annotated (1), (2), (2), (3)]
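
Dependency-aware scheduling can be sketched as grouping tasks into waves, where a task is only deployed once everything it has to wait for has finished. The sketch below is plain Python, not Flink's scheduler; `schedule_in_waves` and the task graph are hypothetical and only loosely modelled on the join example.

```python
from typing import Dict, List, Set


def schedule_in_waves(must_finish_first: Dict[str, Set[str]]) -> List[List[str]]:
    """Group tasks into waves; a task is ready once all its prerequisites finished."""
    remaining = {task: set(deps) for task, deps in must_finish_first.items()}
    finished: Set[str] = set()
    waves: List[List[str]] = []
    while remaining:
        ready = sorted(t for t, deps in remaining.items() if deps <= finished)
        if not ready:
            raise ValueError("cyclic dependencies, nothing can be scheduled")
        waves.append(ready)
        finished.update(ready)
        for task in ready:
            del remaining[task]
    return waves


# Hypothetical graph: src1 is the build side of join1, src2 its probe side;
# join1's result is the build side of join2, src3 its probe side.
dependencies = {
    "src1": set(),
    "src2": {"src1"},   # probe side waits for the build side
    "join1": {"src1"},
    "src3": {"join1"},  # probe side waits for the build side
    "join2": {"join1"},
}
waves = schedule_in_waves(dependencies)
print(waves)  # [['src1'], ['join1', 'src2'], ['join2', 'src3']]
```

Compared with deploying all tasks up front, the probe-side tasks do not hold resources while they would only be idling.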

Extendable Scheduler
• Make Flink’s scheduler extendable & pluggable
• Scheduler considers dependencies and reacts to signals from ExecutionGraph
• Specialized schedulers for different use cases
[Figure: Scheduler with the variants Streaming Scheduler, Batch Scheduler, Speculative Scheduler]

Flink’s Shuffle Service
• Tasks own produced result partitions
• Containers cannot be freed until result is consumed
• One implementation for streaming and batch loads
[Figure: Container, Result partition]

External & Persistent Shuffle Service
• Result partitions are written to an external shuffle service
• Containers can be freed early
• Different implementations based on use case
[Figure: External shuffle service (e.g. Yarn, DFS)]

End-to-end SQL Only Pipelines
• Support for external catalogs (Confluent Schema Registry, Hive Metastore)
• Data definition language (DDL)
[Figure: Hive Metastore, Table Source, SQL Query, Table Sink; labels: Input schema information, Output schema information]

TL;DL
• Flink 1.7.0 added many new features around SQL, connectors and state evolution
• A lot of new features in the pipeline
• Join the community!
  • Subscribe to mailing lists
  • Participate in Flink development
  • Become active
