Fabian Hueske
@fhueske
Strata Data Conference, New York
September 27th, 2017
Stream Analytics with SQL
on Apache Flink®
About me
▪ Apache Flink PMC member
  • Contributing since day 1 at TU Berlin
  • Focusing on Flink’s relational APIs for 2 years
▪ Co-author of “Stream Processing with Apache Flink”
  • Work in progress…
▪ Co-founder & Software Engineer at data Artisans
Original creators of Apache Flink®
dA Platform 2: Open Source Apache Flink + dA Application Manager
Productionizing and operating stream processing made easy
The dA Platform 2
[Diagram: dA Platform 2]
▪ Apache Flink: stateful stream processing
▪ dA Application Manager: application lifecycle management
▪ Kubernetes: container platform
▪ Logging, Metrics, CI/CD
▪ Streams from Kafka, Kinesis, S3, HDFS, Databases, ...
▪ Real-time Analytics, Anomaly- & Fraud Detection, Real-time Data Integration, Reactive Microservices (and more)
What is Apache Flink?
Stateful Computations Over Data Streams
▪ Batch Processing: process static and historic data
▪ Data Stream Processing: realtime results from data streams
▪ Event-driven Applications: data-driven actions and services
What is Apache Flink?
Stateful computations over streams, real-time and historic
fast, scalable, fault tolerant, in-memory, event time, large state, exactly-once
[Diagram labels: Queries, Applications, Devices, etc.; Database, Stream, File / Object Storage; Historic Data; Streams; Application]
Hardened at scale
Streaming Platform Service
billions of messages per day
A lot of Stream SQL
Streaming Platform as a Service
3700+ containers running Flink,
1400+ nodes, 22k+ cores, 100s of jobs
Fraud detection
Streaming Analytics Platform
100s of jobs, 1000s of nodes, TBs of state,
metrics, analytics, real-time ML,
Streaming SQL as a platform
Powerful Abstractions
▪ SQL / Table API (dynamic tables): High-level Analytics API
▪ DataStream API (streams, windows): Stream- & Batch Data Processing
▪ Process Function (events, state, time): Stateful Event-Driven Applications
val stats = stream
  .keyBy("sensor")
  .timeWindow(Time.seconds(5))
  .reduce((a, b) => a.add(b)) // combine the elements of each window pairwise
def processElement(event: MyEvent, ctx: Context, out: Collector[Result]) = {
  // work with event and state
  (event, state.value) match { … }
  out.collect(…) // emit events
  state.update(…) // modify state
  // schedule a timer callback
  ctx.timerService.registerEventTimeTimer(event.timestamp + 500)
}
Layered abstractions to
navigate simple to complex use cases
Apache Flink’s relational APIs
▪ ANSI SQL & LINQ-style Table API
▪ Unified APIs for batch & streaming data
  A query specifies exactly the same result
  regardless of whether its input is
  static batch data or streaming data.
▪ Common translation layers
  • Optimization based on Apache Calcite
  • Type system & code-generation
  • Table sources & sinks
Show me some code!
tableEnvironment
  .scan("clicks")
  .filter('url.like("https://www.xyz.com%"))
  .groupBy('user)
  .select('user, 'url.count as 'cnt)

SELECT user, COUNT(url) AS cnt
FROM clicks
WHERE url LIKE 'https://www.xyz.com%'
GROUP BY user

“clicks” can be a
- file,
- database table,
- stream, …
What if “clicks” is a file?
clicks:
user  cTime     url
Mary  12:00:00  https://…
Bob   12:00:00  https://…
Mary  12:00:02  https://…
Liz   12:00:03  https://…

SELECT
  user,
  COUNT(url) as cnt
FROM clicks
GROUP BY user

result:
user  cnt
Mary  2
Bob   1
Liz   1

Q: What if we get more click data?
A: We run the query again.
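The batch case can be sketched with plain Scala collections, without Flink. This is only an illustration of what the query computes over a finite input. The `Click` case class and the `BatchClickCount` object are hypothetical names, and the sketch assumes `url` is never null, so counting rows matches `COUNT(url)`.

```scala
object BatchClickCount {
  // Hypothetical record type mirroring the columns of the "clicks" table.
  final case class Click(user: String, cTime: String, url: String)

  // Plain-collections equivalent of:
  //   SELECT user, COUNT(url) AS cnt FROM clicks GROUP BY user
  // Assumes url is never null, so the row count equals COUNT(url).
  def countByUser(clicks: Seq[Click]): Map[String, Int] =
    clicks.groupBy(_.user).map { case (user, rows) => user -> rows.size }

  def main(args: Array[String]): Unit = {
    val clicks = Seq(
      Click("Mary", "12:00:00", "https://…"),
      Click("Bob", "12:00:00", "https://…"),
      Click("Mary", "12:00:02", "https://…"),
      Click("Liz", "12:00:03", "https://…")
    )
    // The whole input is available up front, so one pass yields the final result.
    countByUser(clicks).foreach { case (user, cnt) => println(s"$user $cnt") }
  }
}
```

When more click data arrives, the batch approach simply calls `countByUser` again on the enlarged input.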
What if “clicks” is a stream?
▪ We want the same results as for batch input!
▪ Does SQL work on streams as well?
SQL was not designed for streams
▪ Relations are bounded (multi-)sets. ↔ Streams are infinite sequences.
▪ DBMS can access all data. ↔ Streaming data arrives over time.
▪ SQL queries return a result and complete. ↔ Streaming queries continuously emit results and never complete.
DBMSs run queries on streams
▪ Materialized views (MV) are similar to regular views, but persisted to disk or memory
  • Used to speed up analytical queries
  • MVs need to be updated when the base tables change
▪ MV maintenance is very similar to SQL on streams
  • Base table updates are a stream of DML statements
  • MV definition query is evaluated on that stream
  • MV is the query result and is continuously updated
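The analogy between view maintenance and stream processing can be made concrete with a small sketch. It treats base-table changes as a stream of `Insert` and `Delete` events and folds them into a per-user count view. All names here (`Dml`, `Insert`, `Delete`, `maintain`) are hypothetical and chosen for illustration; no real DBMS maintains views with this code.

```scala
object MaterializedViewSketch {
  // Hypothetical DML events on a base table "clicks(user, url)".
  sealed trait Dml
  final case class Insert(user: String, url: String) extends Dml
  final case class Delete(user: String, url: String) extends Dml

  // View definition: SELECT user, COUNT(url) AS cnt FROM clicks GROUP BY user
  // Each change adjusts one row of the view instead of recomputing it from scratch.
  def applyChange(view: Map[String, Long], change: Dml): Map[String, Long] =
    change match {
      case Insert(user, _) =>
        view.updated(user, view.getOrElse(user, 0L) + 1)
      case Delete(user, _) =>
        val cnt = view.getOrElse(user, 0L) - 1
        if (cnt <= 0) view - user else view.updated(user, cnt)
    }

  // The view is the fold of the query logic over the stream of changes.
  def maintain(changes: Seq[Dml]): Map[String, Long] =
    changes.foldLeft(Map.empty[String, Long])(applyChange)

  def main(args: Array[String]): Unit = {
    val changes = Seq(
      Insert("Mary", "./home"),
      Insert("Bob", "./cart"),
      Insert("Mary", "./prod?id=1"),
      Delete("Bob", "./cart")
    )
    println(maintain(changes))
  }
}
```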
Continuous Queries in Flink
▪ Core concept is a “Dynamic Table”
  • Dynamic tables change over time
▪ Queries on dynamic tables
  • produce new dynamic tables (which are updated based on input)
  • do not terminate
▪ Stream ↔ Dynamic table conversions
Stream → Dynamic Table
▪ Append mode
  • Stream records are appended to table
  • Table grows as more data arrives

Stream records:
Mary, 12:00:00, ./home
Bob, 12:00:00, ./cart
Mary, 12:00:05, ./prod?id=1
Liz, 12:01:00, ./home
Bob, 12:01:30, ./prod?id=3
Mary, 12:01:45, ./prod?id=7

Dynamic table:
user  cTime     url
Mary  12:00:00  ./home
Bob   12:00:00  ./cart
Mary  12:00:05  ./prod?id=1
Liz   12:01:00  ./home
Bob   12:01:30  ./prod?id=3
Mary  12:01:45  ./prod?id=7
…     …
Stream → Dynamic Table
▪ Upsert mode
  • Stream records have (composite) key attributes
  • Records are inserted, or update existing records with the same key

Stream records:
Mary, 2017-03-01
Bob, 2017-03-15
Mary, 2017-04-01
Liz, 2017-05-01
Bob, 2017-06-01
Mary, 2017-07-01

Dynamic table:
user  lastLogin
Mary  2017-07-01
Bob   2017-06-01
Liz   2017-05-01
…
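The two conversion modes differ only in how a new record changes the table. A minimal sketch, using an in-memory `Vector` and `Map` as stand-ins for a dynamic table (the names `StreamToTable`, `appendMode` and `upsertMode` are hypothetical, not Flink API):

```scala
object StreamToTable {
  // Append mode: every stream record becomes a new row, so the table only grows.
  def appendMode[A](stream: Seq[A]): Vector[A] =
    stream.foldLeft(Vector.empty[A])(_ :+ _)

  // Upsert mode: a record inserts a new row or overwrites the row with the same key.
  def upsertMode[K, V](stream: Seq[(K, V)]): Map[K, V] =
    stream.foldLeft(Map.empty[K, V]) { case (table, (key, value)) =>
      table.updated(key, value)
    }

  def main(args: Array[String]): Unit = {
    val logins = Seq(
      "Mary" -> "2017-03-01", "Bob" -> "2017-03-15", "Mary" -> "2017-04-01",
      "Liz" -> "2017-05-01", "Bob" -> "2017-06-01", "Mary" -> "2017-07-01"
    )
    println(appendMode(logins).size) // one row per record
    println(upsertMode(logins))      // one row per key, holding the latest value
  }
}
```

With the login records from the slide, append mode keeps all six rows, while upsert mode keeps one row per user.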
Querying a Dynamic Table
clicks:
user  cTime     url
Mary  12:00:00  ./home
Bob   12:00:00  ./cart
Mary  12:00:05  ./prod?id=1
Liz   12:01:00  ./home
Liz   12:01:30  ./prod?id=3
Mary  12:01:45  ./prod?id=7

SELECT
  user,
  COUNT(url) as cnt
FROM clicks
GROUP BY user

result:
user  cnt
Mary  1 → 2 → 3
Bob   1
Liz   1 → 2

Rows of result table are updated.
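How the result table evolves can be sketched by replaying the clicks one at a time and keeping every intermediate version of the result. This is a plain-Scala illustration of the semantics, not Flink's implementation; `ContinuousCount` and `resultVersions` are hypothetical names.

```scala
object ContinuousCount {
  // Returns the result table after each incoming click
  // (one version per input record, in arrival order).
  def resultVersions(users: Seq[String]): Seq[Map[String, Int]] =
    users.scanLeft(Map.empty[String, Int]) { (result, user) =>
      // An incoming click updates exactly one row of the result.
      result.updated(user, result.getOrElse(user, 0) + 1)
    }.tail // drop the initial empty table

  def main(args: Array[String]): Unit = {
    val arrivals = Seq("Mary", "Bob", "Mary", "Liz", "Liz", "Mary")
    resultVersions(arrivals).foreach(println)
  }
}
```

The last version equals what the batch query would return on the same six rows, which is the unified-semantics property the talk asks for.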
What about windows?
tableEnvironment
  .scan("clicks")
  .window(Tumble over 1.hour on 'cTime as 'w)
  .groupBy('w, 'user)
  .select('user, 'w.end as 'endT, 'url.count as 'cnt)

SELECT user,
  TUMBLE_END(cTime, INTERVAL '1' HOUR) AS endT,
  COUNT(url) AS cnt
FROM clicks
GROUP BY TUMBLE(cTime, INTERVAL '1' HOUR),
  user
Computing window aggregates

clicks:
user  cTime     url
Mary  12:00:00  ./home
Bob   12:00:00  ./cart
Mary  12:02:00  ./prod?id=2
Mary  12:55:00  ./home
Bob   13:01:00  ./prod?id=4
Liz   13:30:00  ./cart
Liz   13:59:00  ./home
Mary  14:00:00  ./prod?id=1
Liz   14:02:00  ./prod?id=8
Bob   14:30:00  ./prod?id=7
Bob   14:40:00  ./home

SELECT
  user,
  TUMBLE_END(cTime, INTERVAL '1' HOUR) AS endT,
  COUNT(url) AS cnt
FROM clicks
GROUP BY
  user,
  TUMBLE(cTime, INTERVAL '1' HOUR)

result:
user  endT      cnt
Mary  13:00:00  3
Bob   13:00:00  1
Bob   14:00:00  1
Liz   14:00:00  2
Mary  15:00:00  1
Bob   15:00:00  2
Liz   15:00:00  1

Rows are appended to result table.
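The window assignment behind this result can be sketched in plain Scala: each click is mapped to the end of the one-hour tumbling window that contains it, and clicks are counted per (user, window end). The sketch assumes times are `HH:mm:ss` strings within a single day; `TumblingWindowCount`, `windowEnd` and `countPerWindow` are hypothetical names.

```scala
object TumblingWindowCount {
  final case class Click(user: String, cTime: String, url: String)

  private val WindowSeconds = 3600 // one-hour tumbling windows

  private def toSeconds(hms: String): Int = {
    val parts = hms.split(":").map(_.toInt)
    parts(0) * 3600 + parts(1) * 60 + parts(2)
  }

  private def format(sec: Int): String =
    f"${sec / 3600}%02d:${sec % 3600 / 60}%02d:${sec % 60}%02d"

  // End of the window [start, end) containing cTime; 12:00:00 falls into [12:00, 13:00).
  def windowEnd(cTime: String): String =
    format((toSeconds(cTime) / WindowSeconds + 1) * WindowSeconds)

  // Count of clicks per (user, window end).
  def countPerWindow(clicks: Seq[Click]): Map[(String, String), Int] =
    clicks.groupBy(c => (c.user, windowEnd(c.cTime)))
      .map { case (key, rows) => key -> rows.size }

  def main(args: Array[String]): Unit = {
    val clicks = Seq(
      Click("Mary", "12:00:00", "./home"), Click("Bob", "12:00:00", "./cart"),
      Click("Mary", "12:02:00", "./prod?id=2"), Click("Mary", "12:55:00", "./home")
    )
    countPerWindow(clicks).foreach(println)
  }
}
```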
Why are the results not updated?
▪ cTime attribute is an event-time attribute
  • Guarded by watermarks
  • Internally represented as special type
  • User-facing as TIMESTAMP
▪ Special plans for queries that operate on event-time attributes
SELECT user,
  TUMBLE_END(cTime, INTERVAL '1' HOUR) AS endT,
  COUNT(url) AS cnt
FROM clicks
GROUP BY TUMBLE(cTime, INTERVAL '1' HOUR),
  user
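Why watermarks make the result append-only can be illustrated with a toy model. A window's count is held back until a watermark signals that event time has passed the window end, and it is then emitted exactly once. This is a simplified sketch and not Flink's actual trigger logic: late events are ignored, timestamps are assumed to be minutes since midnight, and every name (`WatermarkWindows`, `Element`, `run`) is hypothetical.

```scala
object WatermarkWindows {
  sealed trait Element
  // ts: event time in minutes since midnight (an assumption of this sketch).
  final case class Click(user: String, ts: Long) extends Element
  // A watermark asserts that no click with an earlier timestamp is expected.
  final case class Watermark(ts: Long) extends Element

  val WindowSize = 60L // one-hour tumbling windows, in minutes

  // Returns the emitted rows (user, windowEnd, cnt) in emission order.
  def run(input: Seq[Element]): Vector[(String, Long, Int)] = {
    val init = (Map.empty[(String, Long), Int], Vector.empty[(String, Long, Int)])
    val (_, emitted) = input.foldLeft(init) {
      case ((open, out), Click(user, ts)) =>
        // Accumulate into the still-open window; nothing is emitted yet.
        val key = (user, (ts / WindowSize + 1) * WindowSize)
        (open.updated(key, open.getOrElse(key, 0) + 1), out)
      case ((open, out), Watermark(wm)) =>
        // Simplified rule: a window is final once the watermark reaches its end.
        val (done, pending) = open.partition { case ((_, end), _) => end <= wm }
        val rows = done.toVector
          .map { case ((user, end), cnt) => (user, end, cnt) }
          .sortBy(row => (row._2, row._1))
        (pending, out ++ rows)
    }
    emitted
  }

  def main(args: Array[String]): Unit = {
    val input = Seq(Click("Mary", 720), Click("Bob", 720), Watermark(750),
      Click("Mary", 775), Watermark(780))
    run(input).foreach(println)
  }
}
```

Because a row leaves the operator only after its window is complete, it never needs to be updated afterwards, which is why the result can be appended.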
Dynamic Table → Stream
▪ Converting a dynamic table into a stream
  • Dynamic tables might update or delete existing rows
  • Updates must be encoded in outgoing stream
▪ Conversion of tables to streams inspired by DBMS logs
  • DBMSs use logs to restore databases (and tables)
  • REDO logs store new records to redo changes
  • UNDO logs store old records to undo changes
Dynamic Table → Stream: REDO/UNDO
clicks:
user  url
Mary  ./home
Bob   ./cart
Mary  ./prod?id=1
Liz   ./home
Bob   ./prod?id=3

SELECT
  user,
  COUNT(url) as cnt
FROM clicks
GROUP BY user

Output stream (+ INSERT / - DELETE), in emission order:
+ Mary,1
+ Bob,1
- Mary,1
+ Mary,2
+ Liz,1
- Bob,1
+ Bob,2
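The REDO/UNDO encoding can be sketched as follows: whenever a user's count changes, the old row is retracted (the UNDO part) and the new row is added (the REDO part). This is a plain-Scala illustration of the encoding for this one query. `RetractStream`, `Add` and `Retract` are hypothetical names, not Flink classes.

```scala
object RetractStream {
  sealed trait Change
  final case class Add(user: String, cnt: Int) extends Change     // "+" INSERT
  final case class Retract(user: String, cnt: Int) extends Change // "-" DELETE

  // Encodes the changes of
  //   SELECT user, COUNT(url) AS cnt FROM clicks GROUP BY user
  // as one message stream, given the users of the incoming clicks.
  def encode(users: Seq[String]): Vector[Change] = {
    val init = (Map.empty[String, Int], Vector.empty[Change])
    val (_, out) = users.foldLeft(init) { case ((counts, acc), user) =>
      val next = counts.getOrElse(user, 0) + 1
      // Retract the previous row for this user, if one was emitted before.
      val retraction = counts.get(user).map(old => Retract(user, old)).toVector
      (counts.updated(user, next), (acc ++ retraction) :+ Add(user, next))
    }
    out
  }

  def main(args: Array[String]): Unit =
    encode(Seq("Mary", "Bob", "Mary", "Liz", "Bob")).foreach(println)
}
```

A consumer needs no key information: it applies every `Add` and removes the row named by every `Retract`.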
Dynamic Table → Stream: REDO
clicks:
user  url
Mary  ./home
Bob   ./cart
Mary  ./prod?id=1
Liz   ./home
Liz   ./prod?id=3
Mary  ./prod?id=7

SELECT
  user,
  COUNT(url) as cnt
FROM clicks
GROUP BY user

Output stream (* UPSERT by KEY / - DELETE by KEY), in emission order:
* Mary,1
* Bob,1
* Mary,2
* Liz,1
* Liz,2
* Mary,3
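The REDO-only encoding needs a unique key on the result (here `user`), so one upsert message per change is enough. A minimal sketch, with hypothetical names (`UpsertStream`, `Upsert`, `materialize`) and without the delete-by-key message, which this query never produces:

```scala
object UpsertStream {
  // "*" UPSERT by key; the key of the result table is `user`.
  final case class Upsert(user: String, cnt: Int)

  // Emits one upsert message per incoming click.
  def encode(users: Seq[String]): Vector[Upsert] = {
    val init = (Map.empty[String, Int], Vector.empty[Upsert])
    val (_, out) = users.foldLeft(init) { case ((counts, acc), user) =>
      val next = counts.getOrElse(user, 0) + 1
      (counts.updated(user, next), acc :+ Upsert(user, next))
    }
    out
  }

  // A consumer rebuilds the table by letting later messages overwrite earlier ones.
  def materialize(messages: Seq[Upsert]): Map[String, Int] =
    messages.map(m => m.user -> m.cnt).toMap

  def main(args: Array[String]): Unit = {
    val messages = encode(Seq("Mary", "Bob", "Mary", "Liz", "Liz", "Mary"))
    messages.foreach(println)
    println(materialize(messages))
  }
}
```

Compared with the retraction encoding, this halves the number of messages per update, at the cost of requiring the consumer to know the key.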
Can we run any query on a dynamic table?
▪ No, there are space and computation constraints
▪ State size may not grow infinitely as more data arrives
  SELECT sessionId, COUNT(url) FROM clicks GROUP BY sessionId;
▪ A change of an input table may only trigger a partial re-computation of the result table
  SELECT user, RANK() OVER (ORDER BY lastLogin) FROM users;
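The computation constraint is easiest to see with the `RANK()` query: a single new input row can change many result rows. The sketch below recomputes ranks from scratch and reports which result rows differ. It ignores ties (which `RANK()` would give equal ranks), assumes ISO dates that sort as strings, and uses the hypothetical names `RankRecompute`, `ranks` and `changedRows`. The user "Ann" is made up for the example.

```scala
object RankRecompute {
  // Illustrates: SELECT user, RANK() OVER (ORDER BY lastLogin) FROM users
  // Input maps user -> lastLogin (ISO date string); ties are not handled.
  def ranks(users: Map[String, String]): Map[String, Int] =
    users.toSeq.sortBy(_._2).zipWithIndex
      .map { case ((user, _), index) => user -> (index + 1) }
      .toMap

  // Users whose result row is new or has a different rank than before.
  def changedRows(before: Map[String, Int], after: Map[String, Int]): Set[String] =
    after.collect { case (user, rank) if !before.get(user).contains(rank) => user }.toSet

  def main(args: Array[String]): Unit = {
    val users = Map("Mary" -> "2017-07-01", "Bob" -> "2017-06-01", "Liz" -> "2017-05-01")
    val before = ranks(users)
    // One inserted row with an early lastLogin shifts every existing rank.
    val after = ranks(users + ("Ann" -> "2017-01-01"))
    println(changedRows(before, after))
  }
}
```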
Bounding the size of query state
▪ Adapt the semantics of the query
  • Aggregate data of last 24 hours. Discard older data.
▪ Trade the accuracy of the result for size of state
  • Remove state for keys that became inactive.

SELECT sessionId, COUNT(url) AS cnt
FROM clicks
WHERE last(cTime, INTERVAL '1' DAY)
GROUP BY sessionId
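The second option, dropping state for inactive keys, can be sketched as a per-key count with a last-seen timestamp and an eviction step. It is an illustration of the trade-off only: the names (`IdleStateCleanup`, `KeyState`, `evictIdle`) and the abstract time units are assumptions, and Flink's own state cleanup may work differently.

```scala
object IdleStateCleanup {
  // Per-key state: the running count and the time of the last update.
  final case class KeyState(cnt: Long, lastSeen: Long)

  // Processes one click for `sessionId` at time `now`.
  def onClick(state: Map[String, KeyState], sessionId: String, now: Long): Map[String, KeyState] = {
    val cnt = state.get(sessionId).map(_.cnt).getOrElse(0L) + 1
    state.updated(sessionId, KeyState(cnt, now))
  }

  // Drops the state of every key that has been idle for longer than `retention`.
  def evictIdle(state: Map[String, KeyState], now: Long, retention: Long): Map[String, KeyState] =
    state.filter { case (_, s) => now - s.lastSeen <= retention }

  def main(args: Array[String]): Unit = {
    val s1 = onClick(Map.empty, "a", 0L)
    val s2 = onClick(s1, "b", 10L)
    val s3 = onClick(s2, "a", 20L)
    val cleaned = evictIdle(s3, 40L, 24L) // "b" has been idle for 30 > 24 time units
    println(cleaned)
    // A later click for "b" restarts at 1: state size is bounded, accuracy is lost.
    println(onClick(cleaned, "b", 41L))
  }
}
```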
Current state of SQL & Table API
▪ Flink’s relational APIs are rapidly evolving
  • Lots of interest by community and many contributors
  • Used in production at large scale by Alibaba, Uber, and others
▪ Features released in Flink 1.3
  • GroupBy & Over windowed aggregates
  • Non-windowed aggregates (with update changes)
  • User-defined aggregation functions
▪ Features coming with Flink 1.4
  • Windowed Joins
  • Reworked connectors APIs
What can be built with this?
▪ Continuous ETL & Streaming Analytics
  • Continuously ingest data
  • Process with transformations & window aggregates
  • Write to files (Parquet, ORC), Kafka, PostgreSQL, HBase, …
What can be built with this?
▪ Event-driven applications & Dashboards
  • Flink updates query results with low latency
  • Result can be written to KV store, DBMS, compacted Kafka topic
  • Maintain result table as queryable state
Wrap up!
▪ Used in production heavily at Alibaba, Uber, and others
▪ Unified Batch and Stream Processing
▪ Lots of great features
  • Continuously updating results like Materialized Views
  • Sophisticated event-time model with retractions
  • User-defined scalar, table & aggregation functions
▪ Check it out!
Thank you!
@fhueske
@ApacheFlink
@dataArtisans
Available on O’Reilly Early Release!
