SlideShare a Scribd company logo
FEBRUARY9, 2017, WARSAW
Stream Analytics with SQL on Apache Flink®
Fabian Hueske | Apache Flink PMC member | Co-founder dataArtisans
FEBRUARY9, 2017, WARSAW
Streams are Everywhere
FEBRUARY9, 2017, WARSAW
Data Analytics on Streaming Data
• Periodic batch processing
• Lots of duct tape and baling wire
• It’s up to you to make
everything work… reliably!
• High latency
• Continuous stream processing
• Framework takes care of failures
• Low latency
FEBRUARY9, 2017, WARSAW
Stream Processing in Apache Flink
• Platform for scalable stream processing
• Fast
• Low latency and high throughput
• Accurate
• Stateful streaming processing in event time
• Reliable
• Exactly-once state guarantees
• Highly available cluster setup
FEBRUARY9, 2017, WARSAW
Streaming Applications Powered by Flink
30 Flink applications in production for more than
one year. 10 billion events (2TB) processed daily
Complex jobs of > 30 operators running 24/7,
processing 30 billion events daily, maintaining
state of 100s of GB with exactly-once guarantees
Largest job has > 20 operators, runs on > 5000
vCores in 1000-node cluster, processes millions of
events per second
FEBRUARY9, 2017, WARSAW
Stream Processing is not for Everybody, … yet
• APIs of open source stream processors target developers
• Implementing streaming applications requires knowledge & skill
• Stream processing concepts (time, state, windows, triggers, ...)
• Programming experience (Java / Scala APIs)
• Stream processing technology spreads rapidly
• There is a talent gap
FEBRUARY9, 2017, WARSAW
What about SQL?
• SQL is the most widely used language for data analytics
• Many good reasons to use SQL
• Declarative specification
• Optimization
• Efficient execution
• “Everybody” knows SQL
• SQL would make stream processing much more accessible, but…
FEBRUARY9, 2017, WARSAW
No OS Stream Processor Offers Decent SQL Support
• SQL was not designed with streaming data in mind
• Relations are sets. Streams are infinite sequences.
• Records arrive over time.
• Syntax
• Time-based operations are cumbersome to specify (aggregates, joins)
• Semantics
• A SQL query should compute the same result on a batch table and a stream
FEBRUARY9, 2017, WARSAW
• Standard SQL and LINQ-style Table API
• Unified APIs for batch & streaming data
• Common translation layers
• Optimization based on Apache Calcite
• Type system & code-generation
• Table sources & sinks
• Streaming SQL & Table API is work in
progress
Flink’s SQL Support & Table API
FEBRUARY9, 2017, WARSAW
What are the Use Cases for Stream SQL?
• Continuous ETL & Data Import
• Live Dashboards & Reports
• Ad-hoc Analytics & Exploration
FEBRUARY9, 2017, WARSAW
Dynamic Tables
• Core concept is a “Dynamic Table”
• Dynamic tables change over time
• Dynamic tables are treated like static batch tables
• Dynamic tables are queried with standard SQL
• A query returns another dynamic table
• Stream ←→ Dynamic Table conversions without information loss
• “Stream / Table Duality”
FEBRUARY9, 2017, WARSAW
Stream → Dynamic Table
• Append
• Replace by Key
time k
1 A
2 B
4 A
5 C
7 B
8 A
9 B
… …
time k
2, B4, A5, C7, B8, A9, B 1, A
2, B4, A5, C7, B8, A9, B 1, A
8 A
9 B
5 C
… …
FEBRUARY9, 2017, WARSAW
Querying a Dynamic Table
• Dynamic tables change over time
• A[t]: Table A at time t
• Dynamic tables are queried with regular SQL
• Result of a query changes as input table changes
• q(A[t]): Evaluate query q on table A at time t
• As time t progresses, the query result is continuously updated
• similar to maintaining a materialized view
• t is current event time
FEBRUARY9, 2017, WARSAW
Querying a Dynamic Table
time k
k cnt
A 3
B 2
C 1
9 B
k cnt
A 3
B 3
C 1
12 C
k cnt
A 3
B 3
C 2
A[8]
A[9]
A[12]
q(A[8])
q(A[9])
q(A[12])
Table A
q:
SELECT
k,
COUNT(k) as cnt
FROM A
GROUP BY k
1 A
2 B
4 A
5 C
7 B
8 A
FEBRUARY9, 2017, WARSAW
time k
A[5]
A[10]
A[15]
q(A[5])
q(A[10])
q(A[15])
Table A
Querying a Dynamic Table
7 B
8 A
9 B
11 A
12 C
14 C
15 A
k cnt endT
A 2 5
B 1 5
C 1 5
q(A)
A 1 10
B 2 10
A 2 15
C 2 15
q:
SELECT
k,
COUNT(k) AS cnt,
TUMBLE_END(
time,
INTERVAL '5' SECONDS)
AS endT
FROM A
GROUP BY
k,
TUMBLE(
time,
INTERVAL '5' SECONDS)
1 A
2 B
4 A
5 C
FEBRUARY9, 2017, WARSAW
Can We Run Any Query on Dynamic Tables?
• No 
• There are state and computation constraints
• State may not grow infinitely as more data arrives
• Clean-up timeout must be defined
• Input updates may only trigger partial re-computation of the result
• Queries with possibly unbounded state or computation are rejected
• Optimizer performs validation
FEBRUARY9, 2017, WARSAW
Bounding the State of a Query
• State grows infinitely with domain of grouping attribute
• Bound query input by time
• Query aggregates data of last 24 hours. Older data is discarded.
SELECT k, COUNT(k) AS cnt
FROM A
GROUP BY k
SELECT k, COUNT(k) AS cnt
FROM A
WHERE last(time, INTERVAL ‘1’ DAY)
GROUP BY k
STOP!
UNBOUNED
STATE!
FEBRUARY9, 2017, WARSAW
Updating Results and Late Arriving Data
• Sometimes emitted results need to be updated
• Results which are continuously updated
• Results for which relevant records arrived late
• Results that might be updated must be kept as state
• Clean-up timeout
• When a table is converted into a stream, updates must be propagated
• Update mode
• Add/Retract mode
FEBRUARY9, 2017, WARSAW
Dynamic Table → Stream: Update Mode
time k
Table A
B, 1A, 2C, 1B, 2A, 3 A, 1
SELECT
k,
COUNT(k) AS cnt
FROM A
GROUP BY k
1 A
2 B
4 A
5 C
7 B
8 A
… …
Update by Key
FEBRUARY9, 2017, WARSAW
Dynamic Table → Stream: Add/Retract Mode
time k
Table A
+ B, 1+ A, 2+ C, 1+ B, 2+ A, 3 + A, 1- A, 1- B, 1- A, 2
1 A
2 B
4 A
5 C
7 B
8 A
… …
SELECT
k,
COUNT(k) AS cnt
FROM A
GROUP BY k
Add (+) / Retract (-)
FEBRUARY9, 2017, WARSAW
Current State of SQL and Table API
• Huge interest and many contributors
• Current development efforts
• Adding more window operators
• Introducing dynamic tables
• And there is a lot more to do
• New operators and features for streaming and batch
• Performance improvements
• Tooling and integration
• Try it out, give feedback, and start contributing!
FEBRUARY9, 2017, WARSAW
Stream Analytics with SQL on Apache Flink
Fabian Hueske | @fhueske

More Related Content

What's hot (20)

PDF
Flink Forward SF 2017: Chinmay Soman - Real Time Analytics in the real World ...
Flink Forward
 
PDF
Flink Forward SF 2017: Stefan Richter - Improvements for large state and reco...
Flink Forward
 
PDF
Flink Forward Berlin 2017: Francesco Versaci - Integrating Flink and Kafka in...
Flink Forward
 
PDF
Flink Forward Berlin 2017: Stephan Ewen - The State of Flink and how to adopt...
Flink Forward
 
PDF
Apache Flink's Table & SQL API - unified APIs for batch and stream processing
Timo Walther
 
PDF
Flink Forward SF 2017: Joe Olson - Using Flink and Queryable State to Buffer ...
Flink Forward
 
PDF
Flink Forward Berlin 2017: Jörg Schad, Till Rohrmann - Apache Flink meets Apa...
Flink Forward
 
PPTX
Flink Forward SF 2017: David Hardwick, Sean Hester & David Brelloch - Dynami...
Flink Forward
 
PDF
Flink Forward SF 2017: Kenneth Knowles - Back to Sessions overview
Flink Forward
 
PDF
Flink forward SF 2017: Elizabeth K. Joseph and Ravi Yadav - Flink meet DC/OS ...
Flink Forward
 
PPTX
What's new in 1.9.0 blink planner - Kurt Young, Alibaba
Flink Forward
 
PPTX
Flink Forward SF 2017: Shaoxuan Wang_Xiaowei Jiang - Blinks Improvements to F...
Flink Forward
 
PDF
Flink Forward Berlin 2017: Mihail Vieru - A Materialization Engine for Data I...
Flink Forward
 
PDF
Flink Forward San Francisco 2018: Stefan Richter - "How to build a modern str...
Flink Forward
 
PPTX
Keynote: Stephan Ewen - Stream Processing as a Foundational Paradigm and Apac...
Ververica
 
PDF
I²: Interactive Real-Time Visualization for Streaming Data with Apache Flink ...
Jonas Traub
 
PDF
Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...
Flink Forward
 
PPTX
Flink Forward Berlin 2017: Dongwon Kim - Predictive Maintenance with Apache F...
Flink Forward
 
PDF
Stateful Distributed Stream Processing
Gyula Fóra
 
PPTX
Flink Forward Berlin 2017: Till Rohrmann - From Apache Flink 1.3 to 1.4
Flink Forward
 
Flink Forward SF 2017: Chinmay Soman - Real Time Analytics in the real World ...
Flink Forward
 
Flink Forward SF 2017: Stefan Richter - Improvements for large state and reco...
Flink Forward
 
Flink Forward Berlin 2017: Francesco Versaci - Integrating Flink and Kafka in...
Flink Forward
 
Flink Forward Berlin 2017: Stephan Ewen - The State of Flink and how to adopt...
Flink Forward
 
Apache Flink's Table & SQL API - unified APIs for batch and stream processing
Timo Walther
 
Flink Forward SF 2017: Joe Olson - Using Flink and Queryable State to Buffer ...
Flink Forward
 
Flink Forward Berlin 2017: Jörg Schad, Till Rohrmann - Apache Flink meets Apa...
Flink Forward
 
Flink Forward SF 2017: David Hardwick, Sean Hester & David Brelloch - Dynami...
Flink Forward
 
Flink Forward SF 2017: Kenneth Knowles - Back to Sessions overview
Flink Forward
 
Flink forward SF 2017: Elizabeth K. Joseph and Ravi Yadav - Flink meet DC/OS ...
Flink Forward
 
What's new in 1.9.0 blink planner - Kurt Young, Alibaba
Flink Forward
 
Flink Forward SF 2017: Shaoxuan Wang_Xiaowei Jiang - Blinks Improvements to F...
Flink Forward
 
Flink Forward Berlin 2017: Mihail Vieru - A Materialization Engine for Data I...
Flink Forward
 
Flink Forward San Francisco 2018: Stefan Richter - "How to build a modern str...
Flink Forward
 
Keynote: Stephan Ewen - Stream Processing as a Foundational Paradigm and Apac...
Ververica
 
I²: Interactive Real-Time Visualization for Streaming Data with Apache Flink ...
Jonas Traub
 
Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...
Flink Forward
 
Flink Forward Berlin 2017: Dongwon Kim - Predictive Maintenance with Apache F...
Flink Forward
 
Stateful Distributed Stream Processing
Gyula Fóra
 
Flink Forward Berlin 2017: Till Rohrmann - From Apache Flink 1.3 to 1.4
Flink Forward
 

Viewers also liked (20)

PDF
Feature Engineering
HJ van Veen
 
PPTX
How Humans See Data - Amazon Cut
John Rauser
 
PDF
Scalable Python with Docker, Kubernetes, OpenShift
Aarno Aukia
 
PPTX
Taking a look under the hood of Apache Flink's relational APIs.
Fabian Hueske
 
PDF
Beacosystem Talk @ MongoDB User Group Dublin @sos100
Sean O'Sullivan
 
PDF
Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForwa...
Till Rohrmann
 
PDF
Streaming analytics better than batch when and why - (Big Data Tech 2017)
GetInData
 
PDF
Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...
Data Con LA
 
PPTX
IoT Innovation Lab Berlin @relayr - Kay Lerch on Getting basics right for you...
Kay Lerch
 
PDF
The Data Dichotomy- Rethinking the Way We Treat Data and Services
confluent
 
PPTX
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
Aldrin Piri
 
PPTX
Comparison of various streaming technologies
Sachin Aggarwal
 
PPTX
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
Yahoo Developer Network
 
PPTX
Real-Time Event & Stream Processing on MS Azure
Khalid Salama
 
PDF
Introducing Kafka's Streams API
confluent
 
PDF
Lightbend Fast Data Platform
Lightbend
 
PPTX
How to Build Continuous Ingestion for the Internet of Things
Cloudera, Inc.
 
PDF
Power of the Log: LSM & Append Only Data Structures
confluent
 
PDF
Study: The Future of VR, AR and Self-Driving Cars
LinkedIn
 
PDF
UX, ethnography and possibilities: for Libraries, Museums and Archives
Ned Potter
 
Feature Engineering
HJ van Veen
 
How Humans See Data - Amazon Cut
John Rauser
 
Scalable Python with Docker, Kubernetes, OpenShift
Aarno Aukia
 
Taking a look under the hood of Apache Flink's relational APIs.
Fabian Hueske
 
Beacosystem Talk @ MongoDB User Group Dublin @sos100
Sean O'Sullivan
 
Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForwa...
Till Rohrmann
 
Streaming analytics better than batch when and why - (Big Data Tech 2017)
GetInData
 
Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...
Data Con LA
 
IoT Innovation Lab Berlin @relayr - Kay Lerch on Getting basics right for you...
Kay Lerch
 
The Data Dichotomy- Rethinking the Way We Treat Data and Services
confluent
 
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
Aldrin Piri
 
Comparison of various streaming technologies
Sachin Aggarwal
 
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
Yahoo Developer Network
 
Real-Time Event & Stream Processing on MS Azure
Khalid Salama
 
Introducing Kafka's Streams API
confluent
 
Lightbend Fast Data Platform
Lightbend
 
How to Build Continuous Ingestion for the Internet of Things
Cloudera, Inc.
 
Power of the Log: LSM & Append Only Data Structures
confluent
 
Study: The Future of VR, AR and Self-Driving Cars
LinkedIn
 
UX, ethnography and possibilities: for Libraries, Museums and Archives
Ned Potter
 
Ad

Similar to Fabian Hueske - Stream Analytics with SQL on Apache Flink (20)

PDF
Stream Analytics with SQL on Apache Flink - Fabian Hueske
Evention
 
PPTX
Stream Analytics with SQL on Apache Flink
Fabian Hueske
 
PDF
Timo Walther - Table & SQL API - unified APIs for batch and stream processing
Ververica
 
PPTX
Project_Plan-Datalake_v1.0_26-10-2022.pptx
ssuser3e2857
 
PDF
Learn from HomeAway Hadoop Development and Operations Best Practices
Driven Inc.
 
PDF
Streaming SQL Foundations: Why I ❤ Streams+Tables
C4Media
 
PDF
AWS Innovate: Running Databases in AWS- Russell Nash
Amazon Web Services Korea
 
PDF
Couchbase Chennai Meetup: Developing with Couchbase- made easy
Karthik Babu Sekar
 
PDF
Couchbase Singapore Meetup #2: Why Developing with Couchbase is easy !!
Karthik Babu Sekar
 
PDF
What's new in SQL Server 2017
Hasan Savran
 
PPT
NoSQL_Night
Clarence J M Tauro
 
PDF
Delta Architecture
Paulo Gutierrez
 
PDF
찾아가는 AWS 세미나(구로,가산,판교) - AWS 기반 빅데이터 활용 방법 (김일호 솔루션즈 아키텍트)
Amazon Web Services Korea
 
PPTX
Real-Time Analytics with Spark and MemSQL
SingleStore
 
PDF
Temporal-Joins in Kafka Streams and ksqlDB | Matthias Sax, Confluent
HostedbyConfluent
 
PDF
Streaming SQL
Julian Hyde
 
PPTX
EBS on Oracle Cloud
vasuballa
 
PPSX
Introducing the eDB360 Tool
Carlos Sierra
 
PDF
That won’t fit into RAM - Michał Brzezicki
Evention
 
PDF
What's new in Confluent 3.2 and Apache Kafka 0.10.2
confluent
 
Stream Analytics with SQL on Apache Flink - Fabian Hueske
Evention
 
Stream Analytics with SQL on Apache Flink
Fabian Hueske
 
Timo Walther - Table & SQL API - unified APIs for batch and stream processing
Ververica
 
Project_Plan-Datalake_v1.0_26-10-2022.pptx
ssuser3e2857
 
Learn from HomeAway Hadoop Development and Operations Best Practices
Driven Inc.
 
Streaming SQL Foundations: Why I ❤ Streams+Tables
C4Media
 
AWS Innovate: Running Databases in AWS- Russell Nash
Amazon Web Services Korea
 
Couchbase Chennai Meetup: Developing with Couchbase- made easy
Karthik Babu Sekar
 
Couchbase Singapore Meetup #2: Why Developing with Couchbase is easy !!
Karthik Babu Sekar
 
What's new in SQL Server 2017
Hasan Savran
 
NoSQL_Night
Clarence J M Tauro
 
Delta Architecture
Paulo Gutierrez
 
찾아가는 AWS 세미나(구로,가산,판교) - AWS 기반 빅데이터 활용 방법 (김일호 솔루션즈 아키텍트)
Amazon Web Services Korea
 
Real-Time Analytics with Spark and MemSQL
SingleStore
 
Temporal-Joins in Kafka Streams and ksqlDB | Matthias Sax, Confluent
HostedbyConfluent
 
Streaming SQL
Julian Hyde
 
EBS on Oracle Cloud
vasuballa
 
Introducing the eDB360 Tool
Carlos Sierra
 
That won’t fit into RAM - Michał Brzezicki
Evention
 
What's new in Confluent 3.2 and Apache Kafka 0.10.2
confluent
 
Ad

More from Ververica (20)

PDF
2020-05-06 Apache Flink Meetup London: The Easiest Way to Get Operational wit...
Ververica
 
PDF
Webinar: How to contribute to Apache Flink - Robert Metzger
Ververica
 
PDF
Webinar: Deep Dive on Apache Flink State - Seth Wiesman
Ververica
 
PDF
Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin Knauf
Ververica
 
PDF
Webinar: Detecting row patterns with Flink SQL - Dawid Wysakowicz
Ververica
 
PDF
Deploying Flink on Kubernetes - David Anderson
Ververica
 
PPTX
Webinar: Flink SQL in Action - Fabian Hueske
Ververica
 
PPTX
2018-04 Kafka Summit London: Stephan Ewen - "Apache Flink and Apache Kafka fo...
Ververica
 
PDF
2018-01 Seattle Apache Flink Meetup at OfferUp, Opening Remarks and Talk 2
Ververica
 
PPTX
Stephan Ewen - Experiences running Flink at Very Large Scale
Ververica
 
PDF
Tzu-Li (Gordon) Tai - Stateful Stream Processing with Apache Flink
Ververica
 
PPTX
Kostas Kloudas - Complex Event Processing with Flink: the state of FlinkCEP
Ververica
 
PDF
Aljoscha Krettek - Portable stateful big data processing in Apache Beam
Ververica
 
PDF
Aljoscha Krettek - Apache Flink® and IoT: How Stateful Event-Time Processing ...
Ververica
 
PDF
Apache Flink Meetup: Sanjar Akhmedov - Joining Infinity – Windowless Stream ...
Ververica
 
PPTX
Kostas Kloudas - Extending Flink's Streaming APIs
Ververica
 
PDF
Stefan Richter - A look at Flink 1.2 and beyond @ Berlin Meetup
Ververica
 
PPTX
Robert Metzger - Apache Flink Community Updates November 2016 @ Berlin Meetup
Ververica
 
PPTX
Aljoscha Krettek - Apache Flink for IoT: How Event-Time Processing Enables Ea...
Ververica
 
PPTX
Kostas Tzoumas - Stream Processing with Apache Flink®
Ververica
 
2020-05-06 Apache Flink Meetup London: The Easiest Way to Get Operational wit...
Ververica
 
Webinar: How to contribute to Apache Flink - Robert Metzger
Ververica
 
Webinar: Deep Dive on Apache Flink State - Seth Wiesman
Ververica
 
Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin Knauf
Ververica
 
Webinar: Detecting row patterns with Flink SQL - Dawid Wysakowicz
Ververica
 
Deploying Flink on Kubernetes - David Anderson
Ververica
 
Webinar: Flink SQL in Action - Fabian Hueske
Ververica
 
2018-04 Kafka Summit London: Stephan Ewen - "Apache Flink and Apache Kafka fo...
Ververica
 
2018-01 Seattle Apache Flink Meetup at OfferUp, Opening Remarks and Talk 2
Ververica
 
Stephan Ewen - Experiences running Flink at Very Large Scale
Ververica
 
Tzu-Li (Gordon) Tai - Stateful Stream Processing with Apache Flink
Ververica
 
Kostas Kloudas - Complex Event Processing with Flink: the state of FlinkCEP
Ververica
 
Aljoscha Krettek - Portable stateful big data processing in Apache Beam
Ververica
 
Aljoscha Krettek - Apache Flink® and IoT: How Stateful Event-Time Processing ...
Ververica
 
Apache Flink Meetup: Sanjar Akhmedov - Joining Infinity – Windowless Stream ...
Ververica
 
Kostas Kloudas - Extending Flink's Streaming APIs
Ververica
 
Stefan Richter - A look at Flink 1.2 and beyond @ Berlin Meetup
Ververica
 
Robert Metzger - Apache Flink Community Updates November 2016 @ Berlin Meetup
Ververica
 
Aljoscha Krettek - Apache Flink for IoT: How Event-Time Processing Enables Ea...
Ververica
 
Kostas Tzoumas - Stream Processing with Apache Flink®
Ververica
 

Recently uploaded (20)

PPTX
apidays Helsinki & North 2025 - Running a Successful API Program: Best Practi...
apidays
 
PDF
The Best NVIDIA GPUs for LLM Inference in 2025.pdf
Tamanna36
 
PPTX
Aict presentation on dpplppp sjdhfh.pptx
vabaso5932
 
PDF
Business implication of Artificial Intelligence.pdf
VishalChugh12
 
PPTX
apidays Helsinki & North 2025 - Agentic AI: A Friend or Foe?, Merja Kajava (A...
apidays
 
PDF
Data Science Course Certificate by Sigma Software University
Stepan Kalika
 
PPTX
apidays Singapore 2025 - Designing for Change, Julie Schiller (Google)
apidays
 
PPT
Growth of Public Expendituuure_55423.ppt
NavyaDeora
 
PDF
Research Methodology Overview Introduction
ayeshagul29594
 
PPTX
What Is Data Integration and Transformation?
subhashenia
 
PDF
apidays Singapore 2025 - From API Intelligence to API Governance by Harsha Ch...
apidays
 
PDF
Development and validation of the Japanese version of the Organizational Matt...
Yoga Tokuyoshi
 
PPTX
Powerful Uses of Data Analytics You Should Know
subhashenia
 
PDF
Driving Employee Engagement in a Hybrid World.pdf
Mia scott
 
PDF
InformaticsPractices-MS - Google Docs.pdf
seshuashwin0829
 
PDF
apidays Singapore 2025 - The API Playbook for AI by Shin Wee Chuang (PAND AI)
apidays
 
PPTX
apidays Helsinki & North 2025 - API access control strategies beyond JWT bear...
apidays
 
PDF
The European Business Wallet: Why It Matters and How It Powers the EUDI Ecosy...
Lal Chandran
 
PPTX
b6057ea5-8e8c-4415-90c0-ed8e9666ffcd.pptx
Anees487379
 
PPTX
03_Ariane BERCKMOES_Ethias.pptx_AIBarometer_release_event
FinTech Belgium
 
apidays Helsinki & North 2025 - Running a Successful API Program: Best Practi...
apidays
 
The Best NVIDIA GPUs for LLM Inference in 2025.pdf
Tamanna36
 
Aict presentation on dpplppp sjdhfh.pptx
vabaso5932
 
Business implication of Artificial Intelligence.pdf
VishalChugh12
 
apidays Helsinki & North 2025 - Agentic AI: A Friend or Foe?, Merja Kajava (A...
apidays
 
Data Science Course Certificate by Sigma Software University
Stepan Kalika
 
apidays Singapore 2025 - Designing for Change, Julie Schiller (Google)
apidays
 
Growth of Public Expendituuure_55423.ppt
NavyaDeora
 
Research Methodology Overview Introduction
ayeshagul29594
 
What Is Data Integration and Transformation?
subhashenia
 
apidays Singapore 2025 - From API Intelligence to API Governance by Harsha Ch...
apidays
 
Development and validation of the Japanese version of the Organizational Matt...
Yoga Tokuyoshi
 
Powerful Uses of Data Analytics You Should Know
subhashenia
 
Driving Employee Engagement in a Hybrid World.pdf
Mia scott
 
InformaticsPractices-MS - Google Docs.pdf
seshuashwin0829
 
apidays Singapore 2025 - The API Playbook for AI by Shin Wee Chuang (PAND AI)
apidays
 
apidays Helsinki & North 2025 - API access control strategies beyond JWT bear...
apidays
 
The European Business Wallet: Why It Matters and How It Powers the EUDI Ecosy...
Lal Chandran
 
b6057ea5-8e8c-4415-90c0-ed8e9666ffcd.pptx
Anees487379
 
03_Ariane BERCKMOES_Ethias.pptx_AIBarometer_release_event
FinTech Belgium
 

Fabian Hueske - Stream Analytics with SQL on Apache Flink

  • 1. FEBRUARY9, 2017, WARSAW Stream Analytics with SQL on Apache Flink® Fabian Hueske | Apache Flink PMC member | Co-founder dataArtisans
  • 3. FEBRUARY9, 2017, WARSAW Data Analytics on Streaming Data • Periodic batch processing • Lots of duct tape and baling wire • It’s up to you to make everything work… reliably! • High latency • Continuous stream processing • Framework takes care of failures • Low latency
  • 4. FEBRUARY9, 2017, WARSAW Stream Processing in Apache Flink • Platform for scalable stream processing • Fast • Low latency and high throughput • Accurate • Stateful streaming processing in event time • Reliable • Exactly-once state guarantees • Highly available cluster setup
  • 5. FEBRUARY9, 2017, WARSAW Streaming Applications Powered by Flink 30 Flink applications in production for more than one year. 10 billion events (2TB) processed daily Complex jobs of > 30 operators running 24/7, processing 30 billion events daily, maintaining state of 100s of GB with exactly-once guarantees Largest job has > 20 operators, runs on > 5000 vCores in 1000-node cluster, processes millions of events per second
  • 6. FEBRUARY9, 2017, WARSAW Stream Processing is not for Everybody, … yet • APIs of open source stream processors target developers • Implementing streaming applications requires knowledge & skill • Stream processing concepts (time, state, windows, triggers, ...) • Programming experience (Java / Scala APIs) • Stream processing technology spreads rapidly • There is a talent gap
  • 7. FEBRUARY9, 2017, WARSAW What about SQL? • SQL is the most widely used language for data analytics • Many good reasons to use SQL • Declarative specification • Optimization • Efficient execution • “Everybody” knows SQL • SQL would make stream processing much more accessible, but…
  • 8. FEBRUARY9, 2017, WARSAW No OS Stream Processor Offers Decent SQL Support • SQL was not designed with streaming data in mind • Relations are sets. Streams are infinite sequences. • Records arrive over time. • Syntax • Time-based operations are cumbersome to specify (aggregates, joins) • Semantics • A SQL query should compute the same result on a batch table and a stream
  • 9. FEBRUARY9, 2017, WARSAW • Standard SQL and LINQ-style Table API • Unified APIs for batch & streaming data • Common translation layers • Optimization based on Apache Calcite • Type system & code-generation • Table sources & sinks • Streaming SQL & Table API is work in progress Flink’s SQL Support & Table API
  • 10. FEBRUARY9, 2017, WARSAW What are the Use Cases for Stream SQL? • Continuous ETL & Data Import • Live Dashboards & Reports • Ad-hoc Analytics & Exploration
  • 11. FEBRUARY9, 2017, WARSAW Dynamic Tables • Core concept is a “Dynamic Table” • Dynamic tables change over time • Dynamic tables are treated like static batch tables • Dynamic tables are queried with standard SQL • A query returns another dynamic table • Stream ←→ Dynamic Table conversions without information loss • “Stream / Table Duality”
  • 12. FEBRUARY9, 2017, WARSAW Stream → Dynamic Table • Append • Replace by Key time k 1 A 2 B 4 A 5 C 7 B 8 A 9 B … … time k 2, B4, A5, C7, B8, A9, B 1, A 2, B4, A5, C7, B8, A9, B 1, A 8 A 9 B 5 C … …
  • 13. FEBRUARY9, 2017, WARSAW Querying a Dynamic Table • Dynamic tables change over time • A[t]: Table A at time t • Dynamic tables are queried with regular SQL • Result of a query changes as input table changes • q(A[t]): Evaluate query q on table A at time t • As time t progresses, the query result is continuously updated • similar to maintaining a materialized view • t is current event time
  • 14. FEBRUARY9, 2017, WARSAW Querying a Dynamic Table time k k cnt A 3 B 2 C 1 9 B k cnt A 3 B 3 C 1 12 C k cnt A 3 B 3 C 2 A[8] A[9] A[12] q(A[8]) q(A[9]) q(A[12]) Table A q: SELECT k, COUNT(k) as cnt FROM A GROUP BY k 1 A 2 B 4 A 5 C 7 B 8 A
  • 15. FEBRUARY9, 2017, WARSAW time k A[5] A[10] A[15] q(A[5]) q(A[10]) q(A[15]) Table A Querying a Dynamic Table 7 B 8 A 9 B 11 A 12 C 14 C 15 A k cnt endT A 2 5 B 1 5 C 1 5 q(A) A 1 10 B 2 10 A 2 15 C 2 15 q: SELECT k, COUNT(k) AS cnt, TUMBLE_END( time, INTERVAL '5' SECONDS) AS endT FROM A GROUP BY k, TUMBLE( time, INTERVAL '5' SECONDS) 1 A 2 B 4 A 5 C
  • 16. FEBRUARY9, 2017, WARSAW Can We Run Any Query on Dynamic Tables? • No  • There are state and computation constraints • State may not grow infinitely as more data arrives • Clean-up timeout must be defined • Input updates may only trigger partial re-computation of the result • Queries with possibly unbounded state or computation are rejected • Optimizer performs validation
  • 17. FEBRUARY9, 2017, WARSAW Bounding the State of a Query • State grows infinitely with domain of grouping attribute • Bound query input by time • Query aggregates data of last 24 hours. Older data is discarded. SELECT k, COUNT(k) AS cnt FROM A GROUP BY k SELECT k, COUNT(k) AS cnt FROM A WHERE last(time, INTERVAL ‘1’ DAY) GROUP BY k STOP! UNBOUNED STATE!
  • 18. FEBRUARY9, 2017, WARSAW Updating Results and Late Arriving Data • Sometimes emitted results need to be updated • Results which are continuously updated • Results for which relevant records arrived late • Results that might be updated must be kept as state • Clean-up timeout • When a table is converted into a stream, updates must be propagated • Update mode • Add/Retract mode
  • 19. FEBRUARY9, 2017, WARSAW Dynamic Table → Stream: Update Mode time k Table A B, 1A, 2C, 1B, 2A, 3 A, 1 SELECT k, COUNT(k) AS cnt FROM A GROUP BY k 1 A 2 B 4 A 5 C 7 B 8 A … … Update by Key
  • 20. FEBRUARY9, 2017, WARSAW Dynamic Table → Stream: Add/Retract Mode time k Table A + B, 1+ A, 2+ C, 1+ B, 2+ A, 3 + A, 1- A, 1- B, 1- A, 2 1 A 2 B 4 A 5 C 7 B 8 A … … SELECT k, COUNT(k) AS cnt FROM A GROUP BY k Add (+) / Retract (-)
  • 21. FEBRUARY9, 2017, WARSAW Current State of SQL and Table API • Huge interest and many contributors • Current development efforts • Adding more window operators • Introducing dynamic tables • And there is a lot more to do • New operators and features for streaming and batch • Performance improvements • Tooling and integration • Try it out, give feedback, and start contributing!
  • 22. FEBRUARY9, 2017, WARSAW Stream Analytics with SQL on Apache Flink Fabian Hueske | @fhueske