TIMO WALTHER, SOFTWARE ENGINEER
DATAWORKS SUMMIT, SAN JOSE
JUNE 21, 2018
WHY AND HOW TO LEVERAGE
THE POWER AND SIMPLICITY OF
SQL ON APACHE FLINK®
© 2018 data Artisans
ABOUT DATA ARTISANS
Original creators of
Apache Flink®
Open Source Apache Flink
+ dA Application Manager
DA PLATFORM
data-artisans.com/download
POWERED BY APACHE FLINK
FLINK’S POWERFUL ABSTRACTIONS
Process Function (events, state, time): Stateful Event-Driven Applications
DataStream API (streams, windows): Stream- & Batch Data Processing
SQL / Table API (dynamic tables): High-level Analytics API
val stats = stream
  .keyBy("sensor")
  .timeWindow(Time.seconds(5))
  .reduce((a, b) => a.add(b))  // combine the elements of each 5-second window
def processElement(event: MyEvent, ctx: Context, out: Collector[Result]) = {
  // work with event and state
  (event, state.value) match { … }
  out.collect(…)   // emit events
  state.update(…)  // modify state
  // schedule a timer callback
  ctx.timerService.registerEventTimeTimer(event.timestamp + 500)
}
Layered abstractions to
navigate simple to complex use cases
APACHE FLINK’S RELATIONAL APIS
Unified APIs for batch & streaming data
A query specifies exactly the same result
regardless of whether its input is
static batch data or streaming data.
LINQ-style Table API:
tableEnvironment
  .scan("clicks")
  .groupBy('user)
  .select('user, 'url.count as 'cnt)
ANSI SQL:
SELECT user, COUNT(url) AS cnt
FROM clicks
GROUP BY user
QUERY TRANSLATION
tableEnvironment
  .scan("clicks")
  .groupBy('user)
  .select('user, 'url.count as 'cnt)
SELECT user, COUNT(url) AS cnt
FROM clicks
GROUP BY user
Input data is bounded (batch)
Input data is unbounded (streaming)
WHAT IF “CLICKS” IS A FILE?
Clicks:
user  cTime     url
Mary  12:00:00  https://…
Bob   12:00:00  https://…
Mary  12:00:02  https://…
Liz   12:00:03  https://…
Query:
SELECT
  user,
  COUNT(url) AS cnt
FROM clicks
GROUP BY user
Result:
user  cnt
Mary  2
Bob   1
Liz   1
Input data is read at once.
Result is produced at once.
WHAT IF “CLICKS” IS A STREAM?
Clicks (rows arrive one at a time):
user  cTime     url
Mary  12:00:00  https://…
Bob   12:00:00  https://…
Mary  12:00:02  https://…
Liz   12:00:03  https://…
Query:
SELECT
  user,
  COUNT(url) AS cnt
FROM clicks
GROUP BY user
Result (updated as rows arrive):
user  cnt
Mary  2 (updated from 1)
Bob   1
Liz   1
Input data is continuously read.
Result is continuously updated.
The result is the same!
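The claim on the last two slides can be checked with a small simulation. This is a minimal sketch in plain Python, not Flink code: the function names `batch_count` and `streaming_count` are made up for illustration, and the rows are the four clicks from the slides, with the URLs left elided as they are there.

```python
from collections import Counter

# The four rows of the "clicks" table shown on the slides.
CLICKS = [
    ("Mary", "12:00:00", "https://…"),
    ("Bob", "12:00:00", "https://…"),
    ("Mary", "12:00:02", "https://…"),
    ("Liz", "12:00:03", "https://…"),
]


def batch_count(rows):
    """Bounded input: read everything at once, produce the result at once."""
    return dict(Counter(user for user, _, _ in rows))


def streaming_count(rows):
    """Unbounded input: yield a snapshot of the result after every row."""
    result = {}
    for user, _, _ in rows:
        result[user] = result.get(user, 0) + 1
        yield dict(result)


snapshots = list(streaming_count(CLICKS))
print(snapshots[0])         # result after the first click
print(snapshots[-1])        # result after the last click seen so far
print(batch_count(CLICKS))  # batch result over the same rows
```

After the stream has delivered the same rows the file contains, the latest streaming snapshot equals the batch result. The only difference is that the streaming variant also exposes the intermediate states.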
WHY IS STREAM-BATCH UNIFICATION IMPORTANT?
• Usability
‒ ANSI SQL syntax: No custom “StreamSQL” syntax.
‒ ANSI SQL semantics: No stream-specific results.
• Portability
‒ Run the same query on bounded and unbounded data
‒ Run the same query on recorded and real-time data
• How can we achieve SQL semantics on streams?
[Figure: timeline from past to future, marking the start of the stream and now, with the ranges covered by bounded and unbounded queries]
DATABASE SYSTEMS RUN QUERIES ON STREAMS
• Materialized views (MV) are similar to regular views,
but persisted to disk or memory
‒ Used to speed up analytical queries
‒ MVs need to be updated when the base tables change
• MV maintenance is very similar to SQL on streams
‒ Base table updates are a stream of DML statements
‒ MV definition query is evaluated on that stream
‒ MV is the query result and is continuously updated
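To make the analogy concrete, the sketch below maintains a materialized view for the per-user count query by applying a stream of DML statements one at a time. It is illustrative Python only: the statement format, the `apply_dml` helper, and the sample statements are assumptions, and it does not reflect how any particular database implements view maintenance.

```python
def apply_dml(view, statement):
    """Incrementally maintain SELECT user, COUNT(url) ... GROUP BY user.

    Assumes every DELETE targets a row that was inserted earlier.
    """
    op, row = statement
    user = row["user"]
    if op == "INSERT":
        view[user] = view.get(user, 0) + 1
    elif op == "DELETE":
        view[user] -= 1
        if view[user] == 0:
            del view[user]  # a group without rows disappears from the view
    else:
        raise ValueError(f"unsupported DML operation: {op}")
    return view


# Hypothetical base-table changes, in the order they are applied.
dml_stream = [
    ("INSERT", {"user": "Mary", "url": "https://…"}),
    ("INSERT", {"user": "Bob", "url": "https://…"}),
    ("INSERT", {"user": "Mary", "url": "https://…"}),
    ("DELETE", {"user": "Bob", "url": "https://…"}),
]

view = {}
for statement in dml_stream:
    apply_dml(view, statement)
print(view)
```

The view is never recomputed from scratch. Each statement touches only the group it affects, which is the same incremental evaluation a continuous query performs on a stream.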
CONTINUOUS QUERIES IN FLINK
• Core concept is a “Dynamic Table”
‒ Dynamic tables change over time
• Queries on dynamic tables
‒ produce new dynamic tables (which are updated based on input)
‒ do not terminate
• Stream ↔ Dynamic table conversions
STREAM ↔ DYNAMIC TABLE CONVERSIONS
• Append Conversions
‒ Records are only inserted/appended
• Upsert Conversions
‒ Records are inserted/updated/deleted
‒ Records have a (composite) unique key
• Changelog Conversions
‒ Records are inserted/updated/deleted
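The difference between the changelog and upsert encodings is easiest to see on the per-user count query. The following Python sketch is an illustration under assumptions: the tuple layouts and the helper names `to_changelog` and `to_upserts` are invented here and are not Flink's message format.

```python
def to_changelog(users):
    """Encode the updating count table as insert (+) and delete (-) messages."""
    counts = {}
    for user in users:
        old = counts.get(user)
        if old is not None:
            yield ("-", user, old)       # delete the previous result row
        counts[user] = (old or 0) + 1
        yield ("+", user, counts[user])  # insert the new result row


def to_upserts(users):
    """Encode the same table as upserts on the unique key `user`."""
    counts = {}
    for user in users:
        counts[user] = counts.get(user, 0) + 1
        yield ("upsert", user, counts[user])


users = ["Mary", "Bob", "Mary", "Liz"]  # users of the incoming clicks
print(list(to_changelog(users)))
print(list(to_upserts(users)))
```

An update costs two messages in the changelog encoding and one in the upsert encoding, but the upsert encoding only works because the result has a unique key. An append conversion would not be enough for this query, since earlier result rows change.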
SQL FEATURE SET IN FLINK 1.5.0
• SELECT / FROM / WHERE
• GROUP BY / HAVING
‒ Non-windowed, TUMBLE, HOP, SESSION windows
• JOIN
‒ Windowed INNER, LEFT / RIGHT / FULL OUTER JOIN
‒ Non-windowed INNER JOIN
• Scalar, aggregation, table-valued UDFs
• SQL CLI Client (beta)
• [streaming only] OVER / WINDOW
‒ UNBOUNDED / BOUNDED PRECEDING
• [batch only] UNION / INTERSECT / EXCEPT / IN / ORDER BY
WHAT CAN I BUILD WITH THIS?
• Data Pipelines
‒ Transform, aggregate, and move events in real-time
• Low-latency ETL
‒ Convert and write streams to file systems, DBMS, K-V stores, indexes, …
‒ Ingest appearing files to produce streams
• Stream & Batch Analytics
‒ Run analytical queries over bounded and unbounded data
‒ Query and compare historic and real-time data
• Power Live Dashboards
‒ Compute and update data to visualize in real-time
THE NEW YORK TAXI RIDES DATA SET
• The New York City Taxi & Limousine Commission provides a public data set
about past taxi rides in New York City
• We can derive a streaming table from the data
• Table: TaxiRides
  rideId:  BIGINT     // ID of the taxi ride
  isStart: BOOLEAN    // flag for pick-up (true) or drop-off (false) event
  lon:     DOUBLE     // longitude of pick-up or drop-off location
  lat:     DOUBLE     // latitude of pick-up or drop-off location
  rowtime: TIMESTAMP  // time of pick-up or drop-off event
IDENTIFY POPULAR PICK-UP / DROP-OFF LOCATIONS
SELECT cell,
       isStart,
       HOP_END(rowtime, INTERVAL '5' MINUTE, INTERVAL '15' MINUTE) AS hopEnd,
       COUNT(*) AS cnt
FROM (SELECT rowtime, isStart, toCellId(lon, lat) AS cell
      FROM TaxiRides)
GROUP BY cell,
         isStart,
         HOP(rowtime, INTERVAL '5' MINUTE, INTERVAL '15' MINUTE)
Compute every 5 minutes, for each location, the number of departing and
arriving taxis of the last 15 minutes.
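With a 5-minute slide and a 15-minute size, every event falls into three overlapping windows. The sketch below shows that assignment and the resulting counts in plain Python. It is a simplification under assumptions: timestamps are integer minutes, windows are aligned to minute 0, the events are made up, and the function name `hop_windows` is hypothetical. It also ignores when results are emitted, which in Flink depends on event time progress.

```python
from collections import Counter


def hop_windows(ts, slide, size):
    """Return the [start, end) windows of length `size`, starting every
    `slide`, that contain timestamp `ts` (all values in minutes)."""
    last_start = ts - ts % slide
    starts = range(last_start, ts - size, -slide)
    return [(start, start + size) for start in reversed(starts)]


# Hypothetical events: (cell, is_start, minute of the event).
events = [
    ("A", True, 2),
    ("A", True, 7),
    ("A", False, 7),
    ("B", True, 17),
]

# Equivalent of GROUP BY cell, isStart, HOP(...) with COUNT(*),
# keyed by the window end like HOP_END in the query.
counts = Counter()
for cell, is_start, minute in events:
    for _, end in hop_windows(minute, slide=5, size=15):
        counts[(cell, is_start, end)] += 1

print(hop_windows(17, slide=5, size=15))
print(counts[("A", True, 10)])
```

Because the windows overlap, one pick-up contributes to three consecutive results. That is why the dashboard value changes every 5 minutes while still covering the last 15 minutes.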
AVERAGE RIDE DURATION PER PICK-UP LOCATION
SELECT pickUpCell,
       AVG(TIMESTAMPDIFF(MINUTE, s.rowtime, e.rowtime)) AS avgDuration
FROM (SELECT rideId, rowtime, toCellId(lon, lat) AS pickUpCell
      FROM TaxiRides
      WHERE isStart) s
JOIN
     (SELECT rideId, rowtime
      FROM TaxiRides
      WHERE NOT isStart) e
ON s.rideId = e.rideId AND
   e.rowtime BETWEEN s.rowtime AND s.rowtime + INTERVAL '1' HOUR
GROUP BY pickUpCell
Join start ride and end ride events on rideId and
compute average ride duration per pick-up location.
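The join condition bounds how far apart matching rows may be in time, which is what lets a streaming engine discard old rows. A minimal Python sketch of the same logic follows. The function name `avg_duration_per_cell`, the sample rides, and the date are assumptions for illustration, and durations are kept as fractional minutes rather than whole ones.

```python
from datetime import datetime, timedelta


def avg_duration_per_cell(starts, ends, max_ride=timedelta(hours=1)):
    """starts: (ride_id, time, pick_up_cell) rows; ends: (ride_id, time) rows."""
    ends_by_ride = {}
    for ride_id, end_time in ends:
        ends_by_ride.setdefault(ride_id, []).append(end_time)

    minutes = {}
    for ride_id, start_time, cell in starts:
        for end_time in ends_by_ride.get(ride_id, []):
            # e.rowtime BETWEEN s.rowtime AND s.rowtime + INTERVAL '1' HOUR
            if start_time <= end_time <= start_time + max_ride:
                duration = (end_time - start_time).total_seconds() / 60
                minutes.setdefault(cell, []).append(duration)
    return {cell: sum(values) / len(values) for cell, values in minutes.items()}


t0 = datetime(2018, 6, 21, 12, 0)  # arbitrary reference time
starts = [
    (1, t0, "A"),
    (2, t0 + timedelta(minutes=5), "A"),
    (3, t0, "B"),
]
ends = [
    (1, t0 + timedelta(minutes=10)),
    (2, t0 + timedelta(minutes=35)),
    (3, t0 + timedelta(hours=2)),  # outside the one-hour bound, never joined
]
result = avg_duration_per_cell(starts, ends)
print(result)
```

Ride 3 ends two hours after it starts, so it is dropped by the time bound and cell B does not appear in the result. This sketch buffers all drop-off rows, whereas a streaming join only needs to keep rows for the length of the bound.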
BUILDING A DASHBOARD
[Figure: dashboard pipeline connecting Kafka and Elasticsearch through the query below]
SELECT cell,
       isStart,
       HOP_END(rowtime, INTERVAL '5' MINUTE, INTERVAL '15' MINUTE) AS hopEnd,
       COUNT(*) AS cnt
FROM (SELECT rowtime, isStart, toCellId(lon, lat) AS cell
      FROM TaxiRides)
GROUP BY cell,
         isStart,
         HOP(rowtime, INTERVAL '5' MINUTE, INTERVAL '15' MINUTE)
SOUNDS GREAT! HOW CAN I USE IT?
• At the moment, SQL queries must be embedded in Java/Scala code
‒ Tight integration with DataStream and DataSet APIs
• Community focused on internals (until Flink 1.4.0)
‒ Operators, types, built-in functions, extensibility (UDFs, external catalog)
‒ Proven at scale by Alibaba, Huawei, and Uber
‒ All built their own submission system & connectors library
• Community neglected user interfaces
‒ No query submission client, no CLI
‒ No catalog integration
‒ Limited set of TableSources and TableSinks
COMING IN FLINK 1.5.0 - SQL CLI
Demo Time!
That’s a nice toy, but …
... can I use it for anything serious?
FLIP-24 – A SQL QUERY SERVICE
• REST service to submit & manage SQL queries
‒ SELECT …
‒ INSERT INTO SELECT …
‒ CREATE MATERIALIZED VIEW …
• Serve results of “SELECT …” queries
• Provide a table catalog (integrated with external catalogs)
• Use cases
‒ Data exploration with notebooks like Apache Zeppelin
‒ Access to real-time data from applications
‒ Easy data routing / ETL from management consoles
CHALLENGE: SERVE DYNAMIC TABLES
Unbounded input yields unbounded results
(Serving bounded results is easy)
Append-only Table
SELECT user, url
FROM clicks
WHERE url LIKE '%xyz.com'
• Result rows are never changed
• Consume, buffer, or drop rows
Continuously updating Table
SELECT user, COUNT(url) AS cnt
FROM clicks
GROUP BY user
• Result rows can be updated or deleted
• Consume changelog or periodically query result table
• Result table must be maintained somewhere
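For a continuously updating table, "maintained somewhere" can be pictured as a small keyed store that applies each change and answers point-in-time queries. The Python sketch below is one possible shape under assumptions: the `ResultTable` class and the `("+" / "-", key, value)` change format are invented for illustration and are not part of the FLIP-24 design.

```python
class ResultTable:
    """Keeps the latest result rows so clients can query them at any time."""

    def __init__(self):
        self.rows = {}

    def apply(self, change):
        op, key, value = change
        if op == "+":
            self.rows[key] = value
        elif op == "-":
            # Drop the row only if the deleted value is still the current one.
            if self.rows.get(key) == value:
                del self.rows[key]
        else:
            raise ValueError(f"unknown change type: {op}")

    def snapshot(self):
        """Answer a periodic query with a copy of the current rows."""
        return dict(self.rows)


# Hypothetical changelog of the per-user count query.
changelog = [
    ("+", "Mary", 1),
    ("+", "Bob", 1),
    ("-", "Mary", 1),
    ("+", "Mary", 2),
    ("+", "Liz", 1),
]

table = ResultTable()
for change in changelog[:2]:
    table.apply(change)
early = table.snapshot()  # what a client would see after two changes

for change in changelog[2:]:
    table.apply(change)
print(early)
print(table.snapshot())
```

A client can either consume the changelog itself or poll `snapshot()`. The second option is simpler for the client but requires some component to hold the table, which is the maintenance cost the slide points out.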
FLIP-24 – A SQL QUERY SERVICE
[Figure: architecture diagram. Labels: Application; Query Service with Catalog, Optimizer, and Result Server; External Catalog (Schema Registry, HCatalog, …); State; Database / HDFS and Event Log (shown twice); arrows labeled Submit Query (REST), Submit Query Job, and Query Results (REST)]
SELECT
  user,
  COUNT(url) AS cnt
FROM clicks
GROUP BY user
Results are served by Query Service via REST
+ Application does not need a special client
+ Works well in many network configurations
− Query service can become bottleneck
FLIP-24 – A SQL QUERY SERVICE
[Figure: architecture diagram. Labels: Application with Serving Library and Result Handle; Query Service with Catalog, Optimizer, and Result Server; External Catalog (Schema Registry, HCatalog, …); State; Database / HDFS and Event Log (shown twice); arrows labeled Submit Query (REST), Submit Query Job, and Query (REST)]
SELECT
  user,
  COUNT(url) AS cnt
FROM clicks
GROUP BY user
WE WANT YOUR FEEDBACK!
• The design of SQL Query Service is not final yet.
• Check out FLIP-24 and FLINK-7594
• Share your ideas and feedback and discuss on
JIRA or dev@flink.apache.org.
SUMMARY
• Unification of stream and batch is important.
• Flink’s SQL solves many streaming and batch use cases.
• Runs in production at Alibaba, Uber, and others.
• The community is working on improving user interfaces.
• Get involved, discuss, and contribute!
THANK YOU!
Available on O’Reilly Early Release!
THANK YOU!
@twalthr
@dataArtisans
@ApacheFlink
WE ARE HIRING
data-artisans.com/careers

Editor's Notes

  • #3: • data Artisans was founded by the original creators of Apache Flink • We provide dA Platform, a complete stream processing infrastructure with open-source Apache Flink
  • #4: • Also included is the Application Manager, which turns dA Platform into a self-service platform for stateful stream processing applications. • dA Platform is generally available, and you can download a free trial today!
  • #5: • These companies are among many users of Apache Flink, and during this conference you’ll meet folks from some of these companies as well as others using Flink. • If your company would like to be represented on the “Powered by Apache Flink” page, email me.
  • #6: Flink offers APIs at different levels of abstraction that can all be mixed and matched. At the lowest level, ProcessFunctions give precise control over state and time, i.e., when to process data. At the intermediate level, the DataStream API provides higher-level primitives, such as those for window processing. Finally, at the top level, the relational APIs, SQL and the Table API, are centered around the concept of dynamic tables. This is what I’m going to talk about next.