Why and how to leverage the simplicity and power of SQL on Flink

TIMO WALTHER, SOFTWARE ENGINEER
DATAWORKS SUMMIT, SAN JOSE
JUNE 21, 2018
WHY AND HOW TO LEVERAGE
THE POWER AND SIMPLICITY OF
SQL ON APACHE FLINK®

© 2018 data Artisans2
ABOUT DATA ARTISANS
Original creators of
Apache Flink®
Open Source Apache Flink
+ dA Application Manager

DA PLATFORM
data-artisans.com/download

POWERED BY APACHE FLINK

FLINK’S POWERFUL ABSTRACTIONS
Process Function (events, state, time)
DataStream API (streams, windows)
SQL / Table API (dynamic tables)
Stream- & Batch
Data Processing
High-level
Analytics API
Stateful Event-
Driven Applications
val stats = stream
.keyBy("sensor")
.timeWindow(Time.seconds(5))
.sum((a, b) -> a.add(b))
def processElement(event: MyEvent, ctx: Context, out: Collector[Result]) = {
// work with event and state
(event, state.value) match { … }
out.collect(…) // emit events
state.update(…) // modify state
// schedule a timer callback
ctx.timerService.registerEventTimeTimer(event.timestamp + 500)
}
Layered abstractions to
navigate simple to complex use cases

APACHE FLINK’S RELATIONAL APIS
Unified APIs for batch & streaming data
A query specifies exactly the same result
regardless whether its input is
static batch data or streaming data.
tableEnvironment
.scan("clicks")
.groupBy('user)
.select('user, 'url.count as 'cnt)
SELECT user, COUNT(url) AS cnt
FROM clicks
GROUP BY user
LINQ-style Table APIANSI SQL

QUERY TRANSLATION
tableEnvironment
.scan("clicks")
.groupBy('user)
.select('user, 'url.count as 'cnt)
FROM clicks
GROUP BY user
Input data is
bounded
(batch)
Input data is
unbounded
(streaming)

WHAT IF “CLICKS” IS A FILE?
Clicks
user cTime url
Mary 12:00:00 https://…
Bob 12:00:00 https://…
Mary 12:00:02 https://…
Liz 12:00:03 https://…
user cnt
Mary 2
Bob 1
Liz 1
SELECT
user,
COUNT(url) as cnt
FROM clicks
GROUP BY user
Input data is
read at once
Result is produced
at once

WHAT IF “CLICKS” IS A STREAM?
user cTime url
user cnt
SELECT
user,
COUNT(url) as cnt
FROM clicks
GROUP BY user
Clicks
Mary 12:00:00 https://…
Bob 12:00:00 https://…
Mary 12:00:02 https://…
Liz 12:00:03 https://…
Bob 1
Liz 1
Mary 1Mary 2
Input data is
continuously read
Result is
continuously updated
The result is the same!

• Usability
‒ ANSI SQL syntax: No custom “StreamSQL” syntax.
‒ ANSI SQL semantics: No stream-specific results.
• Portability
‒ Run the same query on bounded and unbounded data
‒ Run the same query on recorded and real-time data
• How can we achieve SQL semantics on streams?
now
bounded query
unbounded query
past future
bounded query
start of the stream
unbounded query
WHY IS STREAM-BATCH UNIFICATION IMPORTANT?

• Materialized views (MV) are similar to regular views,
but persisted to disk or memory
‒Used to speed-up analytical queries
‒MVs need to be updated when the base tables change
• MV maintenance is very similar to SQL on streams
‒Base table updates are a stream of DML statements
‒MV definition query is evaluated on that stream
‒MV is query result and continuously updated
DATABASE SYSTEMS RUN QUERIES ON STREAMS

CONTINUOUS QUERIES IN FLINK
• Core concept is a “DynamicTable”
‒Dynamic tables are changing over time
• Queries on dynamic tables
‒produce new dynamic tables (which are updated based on input)
‒do not terminate
• Stream ↔ Dynamic table conversions
12

STREAM ↔ DYNAMIC TABLE CONVERSIONS
• Append Conversions
‒ Records are only inserted/appended
• Upsert Conversions
‒ Records are inserted/updated/deleted
‒ Records have a (composite) unique key
• Changelog Conversions
‒ Records are inserted/updated/deleted

SQL FEATURE SET IN FLINK 1.5.0
• SELECT FROMWHERE
• GROUP BY / HAVING
‒ Non-windowed,TUMBLE, HOP, SESSION windows
• JOIN
‒ Windowed INNER, LEFT / RIGHT / FULL OUTER JOIN
‒ Non-windowed INNER JOIN
• Scalar, aggregation, table-valued UDFs
• SQL CLI Client (beta)
• [streaming only] OVER /WINDOW
‒ UNBOUNDED / BOUNDED PRECEDING
• [batch only] UNION / INTERSECT / EXCEPT / IN / ORDER BY

WHAT CAN I BUILD WITH THIS?
• Data Pipelines
‒ Transform, aggregate, and move events in real-time
• Low-latency ETL
‒ Convert and write streams to file systems, DBMS, K-V stores, indexes, …
‒ Ingest appearing files to produce streams
• Stream & Batch Analytics
‒ Run analytical queries over bounded and unbounded data
‒ Query and compare historic and real-time data
• Power Live Dashboards
‒ Compute and update data to visualize in real-time

THE NEW YORK TAXI RIDES DATA SET
• The NewYork CityTaxi & Limousine Commission provides a public data set
about past taxi rides in NewYork City
• We can derive a streaming table from the data
• Table: TaxiRides
rideId: BIGINT // ID of the taxi ride
isStart: BOOLEAN // flag for pick-up (true) or drop-off (false) event
lon: DOUBLE // longitude of pick-up or drop-off location
lat: DOUBLE // latitude of pick-up or drop-off location
rowtime: TIMESTAMP // time of pick-up or drop-off event

SELECT cell,
isStart,
HOP_END(rowtime, INTERVAL '5' MINUTE, INTERVAL '15' MINUTE) AS hopEnd,
COUNT(*) AS cnt
FROM (SELECT rowtime, isStart, toCellId(lon, lat) AS cell
FROM TaxiRides)
GROUP BY cell,
isStart,
HOP(rowtime, INTERVAL '5' MINUTE, INTERVAL '15' MINUTE)
 Compute every 5 minutes for each location the
number of departing and arriving taxis
of the last 15 minutes.
IDENTIFY POPULAR PICK-UP / DROP-OFF LOCATIONS

SELECT pickUpCell,
AVG(TIMESTAMPDIFF(MINUTE, e.rowtime, s.rowtime) AS avgDuration
FROM (SELECT rideId, rowtime, toCellId(lon, lat) AS pickUpCell
FROM TaxiRides
WHERE isStart) s
JOIN
(SELECT rideId, rowtime
FROM TaxiRides
WHERE NOT isStart) e
ON s.rideId = e.rideId AND
e.rowtime BETWEEN s.rowtime AND s.rowtime + INTERVAL '1' HOUR
GROUP BY pickUpCell
 Join start ride and end ride events on rideId and
compute average ride duration per pick-up location.
AVERAGE RIDE DURATION PER PICK-UP LOCATION

BUILDING A DASHBOARD
Elastic
Search
Kafka
SELECT cell,
isStart,
HOP_END(rowtime, INTERVAL '5' MINUTE, INTERVAL '15' MINUTE) AS hopEnd,
COUNT(*) AS cnt
FROM (SELECT rowtime, isStart, toCellId(lon, lat) AS cell
FROM TaxiRides)
GROUP BY cell,
isStart,
HOP(rowtime, INTERVAL '5' MINUTE, INTERVAL '15' MINUTE)

SOUNDS GREAT! HOW CAN I USE IT?
• ATM, SQL queries must be embedded in Java/Scala code 
‒ Tight integration with DataStream and DataSet APIs
• Community focused on internals (until Flink 1.4.0)
‒ Operators, types, built-in functions, extensibility (UDFs, extern. catalog)
‒ Proven at scale by Alibaba, Huawei, and Uber
‒ All built their own submission system & connectors library
• Community neglected user interfaces
‒ No query submission client, no CLI
‒ No catalog integration
‒ Limited set ofTableSources andTableSinks

COMING IN FLINK 1.5.0 - SQL CLI
DemoTime!
That’s a nice toy, but …
... can I use it for anything serious?

FLIP-24 – A SQL QUERY SERVICE
• REST service to submit & manage SQL queries
‒ SELECT …
‒ INSERT INTO SELECT …
‒ CREATE MATERIALIZE VIEW …
• Serve results of “SELECT …” queries
• Provide a table catalog (integrated with external catalogs)
• Use cases
‒ Data exploration with notebooks like Apache Zeppelin
‒ Access to real-time data from applications
‒ Easy data routing / ETL from management consoles

CHALLENGE: SERVE DYNAMIC TABLES
Unbounded input yields unbounded results
FROM clicks
GROUP BY user
SELECT user, url
FROM clicks
WHERE url LIKE '%xyz.com'
Append-onlyTable
• Result rows are never changed
• Consume, buffer, or drop rows
Continuously updatingTable
• Result rows can be updated or
deleted
• Consume changelog or
periodically query result table
• Result table must be maintained
somewhere
(Serving bounded results is easy)

Application
Query Service
Catalog
Optimizer
Database /
HDFS
Event Log
External Catalog
(Schema Registry,
HCatalog, …)
Query
Results
Submit Query Job
State
REST
Result Server
Submit Query
REST
Database /
HDFS
Event Log
SELECT
user,
COUNT(url) AS cnt
FROM clicks
GROUP BY user
Results are served by Query Service via REST
+ Application does not need a special client
+ Works well in many network configurations
− Query service can become bottleneck

Application
Query Service
SELECT
user,
COUNT(url) AS cnt
FROM clicks
GROUP BY user Catalog
Optimizer
Database /
HDFS
Event Log
External Catalog
(Schema Registry,
HCatalog, …)
Query
Submit Query Job
State
REST
Result Server
Submit Query
REST
Database /
HDFS
Event Log
Serving
Library
Result Handle

WE WANT YOUR FEEDBACK!
• The design of SQL Query Service is not final yet.
• Check out FLIP-24 and FLINK-7594
• Share your ideas and feedback and discuss on
JIRA or dev@flink.apache.org.

SUMMARY
• Unification of stream and batch is important.
• Flink’s SQL solves many streaming and batch use cases.
• Runs in production at Alibaba, Uber, and others.
• The community is working on improving user interfaces.
• Get involved, discuss, and contribute!

THANK YOU!
Available on O’Reilly Early Release!

THANK YOU!
@twalthr
@dataArtisans
@ApacheFlink
WE ARE HIRING
data-artisans.com/careers

Why and how to leverage the simplicity and power of SQL on Flink

More Related Content

What's hot (20)

Similar to Why and how to leverage the simplicity and power of SQL on Flink (20)

More from DataWorks Summit (20)

Recently uploaded (20)

Why and how to leverage the simplicity and power of SQL on Flink

Editor's Notes