SlideShare a Scribd company logo
FEBRUARY9, 2017, WARSAW
Stream Analytics with SQL on Apache Flink®
Fabian Hueske | Apache Flink PMC member | Co-founder dataArtisans
FEBRUARY9, 2017, WARSAW
Streams are Everywhere
FEBRUARY9, 2017, WARSAW
Data Analytics on Streaming Data
• Periodic batch processing
• Lots of duct tape and baling wire
• It’s up to you to make
everything work… reliably!
• High latency
• Continuous stream processing
• Framework takes care of failures
• Low latency
FEBRUARY9, 2017, WARSAW
Stream Processing in Apache Flink
• Platform for scalable stream processing
• Fast
• Low latency and high throughput
• Accurate
• Stateful streaming processing in event time
• Reliable
• Exactly-once state guarantees
• Highly available cluster setup
FEBRUARY9, 2017, WARSAW
Streaming Applications Powered by Flink
30 Flink applications in production for more than
one year. 10 billion events (2TB) processed daily
Complex jobs of > 30 operators running 24/7,
processing 30 billion events daily, maintaining
state of 100s of GB with exactly-once guarantees
Largest job has > 20 operators, runs on > 5000
vCores in 1000-node cluster, processes millions of
events per second
FEBRUARY9, 2017, WARSAW
Stream Processing is not for Everybody, … yet
• APIs of open source stream processors target developers
• Implementing streaming applications requires knowledge & skill
• Stream processing concepts (time, state, windows, triggers, ...)
• Programming experience (Java / Scala APIs)
• Stream processing technology spreads rapidly
• There is a talent gap
FEBRUARY9, 2017, WARSAW
What about SQL?
• SQL is the most widely used language for data analytics
• Many good reasons to use SQL
• Declarative specification
• Optimization
• Efficient execution
• “Everybody” knows SQL
• SQL would make stream processing much more accessible, but…
FEBRUARY9, 2017, WARSAW
No OS Stream Processor Offers Decent SQL Support
• SQL was not designed with streaming data in mind
• Relations are sets. Streams are infinite sequences.
• Records arrive over time.
• Syntax
• Time-based operations are cumbersome to specify (aggregates, joins)
• Semantics
• A SQL query should compute the same result on a batch table and a stream
FEBRUARY9, 2017, WARSAW
• Standard SQL and LINQ-style Table API
• Unified APIs for batch & streaming data
• Common translation layers
• Optimization based on Apache Calcite
• Type system & code-generation
• Table sources & sinks
• Streaming SQL & Table API is work in
progress
Flink’s SQL Support & Table API
FEBRUARY9, 2017, WARSAW
What are the Use Cases for Stream SQL?
• Continuous ETL & Data Import
• Live Dashboards & Reports
• Ad-hoc Analytics & Exploration
FEBRUARY9, 2017, WARSAW
Dynamic Tables
• Core concept is a “Dynamic Table”
• Dynamic tables change over time
• Dynamic tables are treated like static batch tables
• Dynamic tables are queried with standard SQL
• A query returns another dynamic table
• Stream ←→ Dynamic Table conversions without information loss
• “Stream / Table Duality”
FEBRUARY9, 2017, WARSAW
Stream → Dynamic Table
• Append
• Replace by Key
time k
1 A
2 B
4 A
5 C
7 B
8 A
9 B
… …
time k
2, B4, A5, C7, B8, A9, B 1, A
2, B4, A5, C7, B8, A9, B 1, A
8 A
9 B
5 C
… …
FEBRUARY9, 2017, WARSAW
Querying a Dynamic Table
• Dynamic tables change over time
• A[t]: Table A at time t
• Dynamic tables are queried with regular SQL
• Result of a query changes as input table changes
• q(A[t]): Evaluate query q on table A at time t
• As time t progresses, the query result is continuously updated
• similar to maintaining a materialized view
• t is current event time
FEBRUARY9, 2017, WARSAW
Querying a Dynamic Table
time k
k cnt
A 3
B 2
C 1
9 B
k cnt
A 3
B 3
C 1
12 C
k cnt
A 3
B 3
C 2
A[8]
A[9]
A[12]
q(A[8])
q(A[9])
q(A[12])
Table A
q:
SELECT
k,
COUNT(k) as cnt
FROM A
GROUP BY k
1 A
2 B
4 A
5 C
7 B
8 A
FEBRUARY9, 2017, WARSAW
time k
A[5]
A[10]
A[15]
q(A[5])
q(A[10])
q(A[15])
Table A
Querying a Dynamic Table
7 B
8 A
9 B
11 A
12 C
14 C
15 A
k cnt endT
A 2 5
B 1 5
C 1 5
q(A)
A 1 10
B 2 10
A 2 15
C 2 15
q:
SELECT
k,
COUNT(k) AS cnt,
TUMBLE_END(
time,
INTERVAL '5' SECONDS)
AS endT
FROM A
GROUP BY
k,
TUMBLE(
time,
INTERVAL '5' SECONDS)
1 A
2 B
4 A
5 C
FEBRUARY9, 2017, WARSAW
Can We Run Any Query on Dynamic Tables?
• No 
• There are state and computation constraints
• State may not grow infinitely as more data arrives
• Clean-up timeout must be defined
• Input updates may only trigger partial re-computation of the result
• Queries with possibly unbounded state or computation are rejected
• Optimizer performs validation
FEBRUARY9, 2017, WARSAW
Bounding the State of a Query
• State grows infinitely with domain of grouping attribute
• Bound query input by time
• Query aggregates data of last 24 hours. Older data is discarded.
SELECT k, COUNT(k) AS cnt
FROM A
GROUP BY k
SELECT k, COUNT(k) AS cnt
FROM A
WHERE last(time, INTERVAL ‘1’ DAY)
GROUP BY k
STOP!
UNBOUNED
STATE!
FEBRUARY9, 2017, WARSAW
Updating Results and Late Arriving Data
• Sometimes emitted results need to be updated
• Results which are continuously updated
• Results for which relevant records arrived late
• Results that might be updated must be kept as state
• Clean-up timeout
• When a table is converted into a stream, updates must be propagated
• Update mode
• Add/Retract mode
FEBRUARY9, 2017, WARSAW
Dynamic Table → Stream: Update Mode
time k
Table A
B, 1A, 2C, 1B, 2A, 3 A, 1
SELECT
k,
COUNT(k) AS cnt
FROM A
GROUP BY k
1 A
2 B
4 A
5 C
7 B
8 A
… …
Update by Key
FEBRUARY9, 2017, WARSAW
Dynamic Table → Stream: Add/Retract Mode
time k
Table A
+ B, 1+ A, 2+ C, 1+ B, 2+ A, 3 + A, 1- A, 1- B, 1- A, 2
1 A
2 B
4 A
5 C
7 B
8 A
… …
SELECT
k,
COUNT(k) AS cnt
FROM A
GROUP BY k
Add (+) / Retract (-)
FEBRUARY9, 2017, WARSAW
Current State of SQL and Table API
• Huge interest and many contributors
• Current development efforts
• Adding more window operators
• Introducing dynamic tables
• And there is a lot more to do
• New operators and features for streaming and batch
• Performance improvements
• Tooling and integration
• Try it out, give feedback, and start contributing!
FEBRUARY9, 2017, WARSAW
Stream Analytics with SQL on Apache Flink
Fabian Hueske | @fhueske

More Related Content

What's hot (20)

PDF
Flink Forward SF 2017: Chinmay Soman - Real Time Analytics in the real World ...
Flink Forward
 
PDF
Flink Forward SF 2017: Stefan Richter - Improvements for large state and reco...
Flink Forward
 
PDF
Flink Forward Berlin 2017: Francesco Versaci - Integrating Flink and Kafka in...
Flink Forward
 
PDF
Flink Forward Berlin 2017: Stephan Ewen - The State of Flink and how to adopt...
Flink Forward
 
PDF
Apache Flink's Table & SQL API - unified APIs for batch and stream processing
Timo Walther
 
PDF
Flink Forward SF 2017: Joe Olson - Using Flink and Queryable State to Buffer ...
Flink Forward
 
PDF
Flink Forward Berlin 2017: Jörg Schad, Till Rohrmann - Apache Flink meets Apa...
Flink Forward
 
PPTX
Flink Forward SF 2017: David Hardwick, Sean Hester & David Brelloch - Dynami...
Flink Forward
 
PDF
Flink Forward SF 2017: Kenneth Knowles - Back to Sessions overview
Flink Forward
 
PDF
Flink forward SF 2017: Elizabeth K. Joseph and Ravi Yadav - Flink meet DC/OS ...
Flink Forward
 
PPTX
What's new in 1.9.0 blink planner - Kurt Young, Alibaba
Flink Forward
 
PPTX
Flink Forward SF 2017: Shaoxuan Wang_Xiaowei Jiang - Blinks Improvements to F...
Flink Forward
 
PDF
Flink Forward Berlin 2017: Mihail Vieru - A Materialization Engine for Data I...
Flink Forward
 
PDF
Flink Forward San Francisco 2018: Stefan Richter - "How to build a modern str...
Flink Forward
 
PPTX
Keynote: Stephan Ewen - Stream Processing as a Foundational Paradigm and Apac...
Ververica
 
PDF
I²: Interactive Real-Time Visualization for Streaming Data with Apache Flink ...
Jonas Traub
 
PDF
Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...
Flink Forward
 
PPTX
Flink Forward Berlin 2017: Dongwon Kim - Predictive Maintenance with Apache F...
Flink Forward
 
PDF
Stateful Distributed Stream Processing
Gyula Fóra
 
PPTX
Flink Forward Berlin 2017: Till Rohrmann - From Apache Flink 1.3 to 1.4
Flink Forward
 
Flink Forward SF 2017: Chinmay Soman - Real Time Analytics in the real World ...
Flink Forward
 
Flink Forward SF 2017: Stefan Richter - Improvements for large state and reco...
Flink Forward
 
Flink Forward Berlin 2017: Francesco Versaci - Integrating Flink and Kafka in...
Flink Forward
 
Flink Forward Berlin 2017: Stephan Ewen - The State of Flink and how to adopt...
Flink Forward
 
Apache Flink's Table & SQL API - unified APIs for batch and stream processing
Timo Walther
 
Flink Forward SF 2017: Joe Olson - Using Flink and Queryable State to Buffer ...
Flink Forward
 
Flink Forward Berlin 2017: Jörg Schad, Till Rohrmann - Apache Flink meets Apa...
Flink Forward
 
Flink Forward SF 2017: David Hardwick, Sean Hester & David Brelloch - Dynami...
Flink Forward
 
Flink Forward SF 2017: Kenneth Knowles - Back to Sessions overview
Flink Forward
 
Flink forward SF 2017: Elizabeth K. Joseph and Ravi Yadav - Flink meet DC/OS ...
Flink Forward
 
What's new in 1.9.0 blink planner - Kurt Young, Alibaba
Flink Forward
 
Flink Forward SF 2017: Shaoxuan Wang_Xiaowei Jiang - Blinks Improvements to F...
Flink Forward
 
Flink Forward Berlin 2017: Mihail Vieru - A Materialization Engine for Data I...
Flink Forward
 
Flink Forward San Francisco 2018: Stefan Richter - "How to build a modern str...
Flink Forward
 
Keynote: Stephan Ewen - Stream Processing as a Foundational Paradigm and Apac...
Ververica
 
I²: Interactive Real-Time Visualization for Streaming Data with Apache Flink ...
Jonas Traub
 
Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...
Flink Forward
 
Flink Forward Berlin 2017: Dongwon Kim - Predictive Maintenance with Apache F...
Flink Forward
 
Stateful Distributed Stream Processing
Gyula Fóra
 
Flink Forward Berlin 2017: Till Rohrmann - From Apache Flink 1.3 to 1.4
Flink Forward
 

Viewers also liked (20)

PDF
Feature Engineering
HJ van Veen
 
PPTX
How Humans See Data - Amazon Cut
John Rauser
 
PDF
Scalable Python with Docker, Kubernetes, OpenShift
Aarno Aukia
 
PPTX
Taking a look under the hood of Apache Flink's relational APIs.
Fabian Hueske
 
PDF
Beacosystem Talk @ MongoDB User Group Dublin @sos100
Sean O'Sullivan
 
PDF
Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForwa...
Till Rohrmann
 
PDF
Streaming analytics better than batch when and why - (Big Data Tech 2017)
GetInData
 
PDF
Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...
Data Con LA
 
PPTX
IoT Innovation Lab Berlin @relayr - Kay Lerch on Getting basics right for you...
Kay Lerch
 
PDF
The Data Dichotomy- Rethinking the Way We Treat Data and Services
confluent
 
PPTX
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
Aldrin Piri
 
PPTX
Comparison of various streaming technologies
Sachin Aggarwal
 
PPTX
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
Yahoo Developer Network
 
PPTX
Real-Time Event & Stream Processing on MS Azure
Khalid Salama
 
PDF
Introducing Kafka's Streams API
confluent
 
PDF
Lightbend Fast Data Platform
Lightbend
 
PPTX
How to Build Continuous Ingestion for the Internet of Things
Cloudera, Inc.
 
PDF
Power of the Log: LSM & Append Only Data Structures
confluent
 
PDF
Study: The Future of VR, AR and Self-Driving Cars
LinkedIn
 
PDF
UX, ethnography and possibilities: for Libraries, Museums and Archives
Ned Potter
 
Feature Engineering
HJ van Veen
 
How Humans See Data - Amazon Cut
John Rauser
 
Scalable Python with Docker, Kubernetes, OpenShift
Aarno Aukia
 
Taking a look under the hood of Apache Flink's relational APIs.
Fabian Hueske
 
Beacosystem Talk @ MongoDB User Group Dublin @sos100
Sean O'Sullivan
 
Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForwa...
Till Rohrmann
 
Streaming analytics better than batch when and why - (Big Data Tech 2017)
GetInData
 
Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...
Data Con LA
 
IoT Innovation Lab Berlin @relayr - Kay Lerch on Getting basics right for you...
Kay Lerch
 
The Data Dichotomy- Rethinking the Way We Treat Data and Services
confluent
 
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
Aldrin Piri
 
Comparison of various streaming technologies
Sachin Aggarwal
 
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
Yahoo Developer Network
 
Real-Time Event & Stream Processing on MS Azure
Khalid Salama
 
Introducing Kafka's Streams API
confluent
 
Lightbend Fast Data Platform
Lightbend
 
How to Build Continuous Ingestion for the Internet of Things
Cloudera, Inc.
 
Power of the Log: LSM & Append Only Data Structures
confluent
 
Study: The Future of VR, AR and Self-Driving Cars
LinkedIn
 
UX, ethnography and possibilities: for Libraries, Museums and Archives
Ned Potter
 
Ad

Similar to Fabian Hueske - Stream Analytics with SQL on Apache Flink (20)

PDF
Stream Analytics with SQL on Apache Flink - Fabian Hueske
Evention
 
PPTX
Why and how to leverage the power and simplicity of SQL on Apache Flink
Fabian Hueske
 
PPTX
Stream Analytics with SQL on Apache Flink
Fabian Hueske
 
PPTX
Flink Forward Berlin 2018: Timo Walther - "Flink SQL in Action"
Flink Forward
 
PPTX
Webinar: Flink SQL in Action - Fabian Hueske
Ververica
 
PPTX
Why and how to leverage the simplicity and power of SQL on Flink
DataWorks Summit
 
PPTX
Flink Forward San Francisco 2018: Fabian Hueske & Timo Walther - "Why and how...
Flink Forward
 
PDF
Timo Walther - Table & SQL API - unified APIs for batch and stream processing
Ververica
 
PDF
SQL Extensions to Support Streaming Data With Fabian Hueske | Current 2022
HostedbyConfluent
 
PPTX
Streaming SQL to unify batch and stream processing: Theory and practice with ...
Fabian Hueske
 
PDF
CDC Stream Processing with Apache Flink
Timo Walther
 
PDF
Streaming SQL
Julian Hyde
 
PDF
Julian Hyde - Streaming SQL
Flink Forward
 
PDF
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Julian Hyde
 
PDF
Streaming SQL
Julian Hyde
 
PPTX
Fabian Hueske - Taking a look under the hood of Apache Flink’s relational APIs
Flink Forward
 
PDF
CDC Stream Processing With Apache Flink With Timo Walther | Current 2022
HostedbyConfluent
 
PDF
ApacheCon 2020 - Flink SQL in 2020: Time to show off!
Timo Walther
 
PDF
Changelog Stream Processing with Apache Flink
Flink Forward
 
Stream Analytics with SQL on Apache Flink - Fabian Hueske
Evention
 
Why and how to leverage the power and simplicity of SQL on Apache Flink
Fabian Hueske
 
Stream Analytics with SQL on Apache Flink
Fabian Hueske
 
Flink Forward Berlin 2018: Timo Walther - "Flink SQL in Action"
Flink Forward
 
Webinar: Flink SQL in Action - Fabian Hueske
Ververica
 
Why and how to leverage the simplicity and power of SQL on Flink
DataWorks Summit
 
Flink Forward San Francisco 2018: Fabian Hueske & Timo Walther - "Why and how...
Flink Forward
 
Timo Walther - Table & SQL API - unified APIs for batch and stream processing
Ververica
 
SQL Extensions to Support Streaming Data With Fabian Hueske | Current 2022
HostedbyConfluent
 
Streaming SQL to unify batch and stream processing: Theory and practice with ...
Fabian Hueske
 
CDC Stream Processing with Apache Flink
Timo Walther
 
Streaming SQL
Julian Hyde
 
Julian Hyde - Streaming SQL
Flink Forward
 
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Julian Hyde
 
Streaming SQL
Julian Hyde
 
Fabian Hueske - Taking a look under the hood of Apache Flink’s relational APIs
Flink Forward
 
CDC Stream Processing With Apache Flink With Timo Walther | Current 2022
HostedbyConfluent
 
ApacheCon 2020 - Flink SQL in 2020: Time to show off!
Timo Walther
 
Changelog Stream Processing with Apache Flink
Flink Forward
 
Ad

More from Ververica (20)

PDF
2020-05-06 Apache Flink Meetup London: The Easiest Way to Get Operational wit...
Ververica
 
PDF
Webinar: How to contribute to Apache Flink - Robert Metzger
Ververica
 
PDF
Webinar: Deep Dive on Apache Flink State - Seth Wiesman
Ververica
 
PDF
Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin Knauf
Ververica
 
PDF
Webinar: Detecting row patterns with Flink SQL - Dawid Wysakowicz
Ververica
 
PDF
Deploying Flink on Kubernetes - David Anderson
Ververica
 
PPTX
2018-04 Kafka Summit London: Stephan Ewen - "Apache Flink and Apache Kafka fo...
Ververica
 
PDF
2018-01 Seattle Apache Flink Meetup at OfferUp, Opening Remarks and Talk 2
Ververica
 
PPTX
Stephan Ewen - Experiences running Flink at Very Large Scale
Ververica
 
PDF
Tzu-Li (Gordon) Tai - Stateful Stream Processing with Apache Flink
Ververica
 
PPTX
Kostas Kloudas - Complex Event Processing with Flink: the state of FlinkCEP
Ververica
 
PDF
Aljoscha Krettek - Portable stateful big data processing in Apache Beam
Ververica
 
PDF
Aljoscha Krettek - Apache Flink® and IoT: How Stateful Event-Time Processing ...
Ververica
 
PDF
Apache Flink Meetup: Sanjar Akhmedov - Joining Infinity – Windowless Stream ...
Ververica
 
PPTX
Kostas Kloudas - Extending Flink's Streaming APIs
Ververica
 
PDF
Stefan Richter - A look at Flink 1.2 and beyond @ Berlin Meetup
Ververica
 
PPTX
Robert Metzger - Apache Flink Community Updates November 2016 @ Berlin Meetup
Ververica
 
PPTX
Aljoscha Krettek - Apache Flink for IoT: How Event-Time Processing Enables Ea...
Ververica
 
PPTX
Kostas Tzoumas - Stream Processing with Apache Flink®
Ververica
 
PPTX
Kostas Tzoumas - Apache Flink®: State of the Union and What's Next
Ververica
 
2020-05-06 Apache Flink Meetup London: The Easiest Way to Get Operational wit...
Ververica
 
Webinar: How to contribute to Apache Flink - Robert Metzger
Ververica
 
Webinar: Deep Dive on Apache Flink State - Seth Wiesman
Ververica
 
Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin Knauf
Ververica
 
Webinar: Detecting row patterns with Flink SQL - Dawid Wysakowicz
Ververica
 
Deploying Flink on Kubernetes - David Anderson
Ververica
 
2018-04 Kafka Summit London: Stephan Ewen - "Apache Flink and Apache Kafka fo...
Ververica
 
2018-01 Seattle Apache Flink Meetup at OfferUp, Opening Remarks and Talk 2
Ververica
 
Stephan Ewen - Experiences running Flink at Very Large Scale
Ververica
 
Tzu-Li (Gordon) Tai - Stateful Stream Processing with Apache Flink
Ververica
 
Kostas Kloudas - Complex Event Processing with Flink: the state of FlinkCEP
Ververica
 
Aljoscha Krettek - Portable stateful big data processing in Apache Beam
Ververica
 
Aljoscha Krettek - Apache Flink® and IoT: How Stateful Event-Time Processing ...
Ververica
 
Apache Flink Meetup: Sanjar Akhmedov - Joining Infinity – Windowless Stream ...
Ververica
 
Kostas Kloudas - Extending Flink's Streaming APIs
Ververica
 
Stefan Richter - A look at Flink 1.2 and beyond @ Berlin Meetup
Ververica
 
Robert Metzger - Apache Flink Community Updates November 2016 @ Berlin Meetup
Ververica
 
Aljoscha Krettek - Apache Flink for IoT: How Event-Time Processing Enables Ea...
Ververica
 
Kostas Tzoumas - Stream Processing with Apache Flink®
Ververica
 
Kostas Tzoumas - Apache Flink®: State of the Union and What's Next
Ververica
 

Recently uploaded (20)

PPTX
apidays Helsinki & North 2025 - Agentic AI: A Friend or Foe?, Merja Kajava (A...
apidays
 
PDF
apidays Singapore 2025 - Surviving an interconnected world with API governanc...
apidays
 
PPTX
apidays Singapore 2025 - From Data to Insights: Building AI-Powered Data APIs...
apidays
 
PPTX
03_Ariane BERCKMOES_Ethias.pptx_AIBarometer_release_event
FinTech Belgium
 
PPTX
Aict presentation on dpplppp sjdhfh.pptx
vabaso5932
 
PDF
Technical-Report-GPS_GIS_RS-for-MSF-finalv2.pdf
KPycho
 
PPTX
05_Jelle Baats_Tekst.pptx_AI_Barometer_Release_Event
FinTech Belgium
 
PPTX
Powerful Uses of Data Analytics You Should Know
subhashenia
 
PDF
1750162332_Snapshot-of-Indias-oil-Gas-data-May-2025.pdf
sandeep718278
 
PDF
Optimizing Large Language Models with vLLM and Related Tools.pdf
Tamanna36
 
PPTX
apidays Helsinki & North 2025 - API access control strategies beyond JWT bear...
apidays
 
PPTX
thid ppt defines the ich guridlens and gives the information about the ICH gu...
shaistabegum14
 
PPTX
apidays Singapore 2025 - Generative AI Landscape Building a Modern Data Strat...
apidays
 
PDF
apidays Singapore 2025 - How APIs can make - or break - trust in your AI by S...
apidays
 
PDF
Data Science Course Certificate by Sigma Software University
Stepan Kalika
 
PDF
apidays Singapore 2025 - From API Intelligence to API Governance by Harsha Ch...
apidays
 
PDF
Business implication of Artificial Intelligence.pdf
VishalChugh12
 
PDF
The European Business Wallet: Why It Matters and How It Powers the EUDI Ecosy...
Lal Chandran
 
PDF
NIS2 Compliance for MSPs: Roadmap, Benefits & Cybersecurity Trends (2025 Guide)
GRC Kompas
 
PPTX
apidays Helsinki & North 2025 - APIs at Scale: Designing for Alignment, Trust...
apidays
 
apidays Helsinki & North 2025 - Agentic AI: A Friend or Foe?, Merja Kajava (A...
apidays
 
apidays Singapore 2025 - Surviving an interconnected world with API governanc...
apidays
 
apidays Singapore 2025 - From Data to Insights: Building AI-Powered Data APIs...
apidays
 
03_Ariane BERCKMOES_Ethias.pptx_AIBarometer_release_event
FinTech Belgium
 
Aict presentation on dpplppp sjdhfh.pptx
vabaso5932
 
Technical-Report-GPS_GIS_RS-for-MSF-finalv2.pdf
KPycho
 
05_Jelle Baats_Tekst.pptx_AI_Barometer_Release_Event
FinTech Belgium
 
Powerful Uses of Data Analytics You Should Know
subhashenia
 
1750162332_Snapshot-of-Indias-oil-Gas-data-May-2025.pdf
sandeep718278
 
Optimizing Large Language Models with vLLM and Related Tools.pdf
Tamanna36
 
apidays Helsinki & North 2025 - API access control strategies beyond JWT bear...
apidays
 
thid ppt defines the ich guridlens and gives the information about the ICH gu...
shaistabegum14
 
apidays Singapore 2025 - Generative AI Landscape Building a Modern Data Strat...
apidays
 
apidays Singapore 2025 - How APIs can make - or break - trust in your AI by S...
apidays
 
Data Science Course Certificate by Sigma Software University
Stepan Kalika
 
apidays Singapore 2025 - From API Intelligence to API Governance by Harsha Ch...
apidays
 
Business implication of Artificial Intelligence.pdf
VishalChugh12
 
The European Business Wallet: Why It Matters and How It Powers the EUDI Ecosy...
Lal Chandran
 
NIS2 Compliance for MSPs: Roadmap, Benefits & Cybersecurity Trends (2025 Guide)
GRC Kompas
 
apidays Helsinki & North 2025 - APIs at Scale: Designing for Alignment, Trust...
apidays
 

Fabian Hueske - Stream Analytics with SQL on Apache Flink

  • 1. FEBRUARY9, 2017, WARSAW Stream Analytics with SQL on Apache Flink® Fabian Hueske | Apache Flink PMC member | Co-founder dataArtisans
  • 3. FEBRUARY9, 2017, WARSAW Data Analytics on Streaming Data • Periodic batch processing • Lots of duct tape and baling wire • It’s up to you to make everything work… reliably! • High latency • Continuous stream processing • Framework takes care of failures • Low latency
  • 4. FEBRUARY9, 2017, WARSAW Stream Processing in Apache Flink • Platform for scalable stream processing • Fast • Low latency and high throughput • Accurate • Stateful streaming processing in event time • Reliable • Exactly-once state guarantees • Highly available cluster setup
  • 5. FEBRUARY9, 2017, WARSAW Streaming Applications Powered by Flink 30 Flink applications in production for more than one year. 10 billion events (2TB) processed daily Complex jobs of > 30 operators running 24/7, processing 30 billion events daily, maintaining state of 100s of GB with exactly-once guarantees Largest job has > 20 operators, runs on > 5000 vCores in 1000-node cluster, processes millions of events per second
  • 6. FEBRUARY9, 2017, WARSAW Stream Processing is not for Everybody, … yet • APIs of open source stream processors target developers • Implementing streaming applications requires knowledge & skill • Stream processing concepts (time, state, windows, triggers, ...) • Programming experience (Java / Scala APIs) • Stream processing technology spreads rapidly • There is a talent gap
  • 7. FEBRUARY9, 2017, WARSAW What about SQL? • SQL is the most widely used language for data analytics • Many good reasons to use SQL • Declarative specification • Optimization • Efficient execution • “Everybody” knows SQL • SQL would make stream processing much more accessible, but…
  • 8. FEBRUARY9, 2017, WARSAW No OS Stream Processor Offers Decent SQL Support • SQL was not designed with streaming data in mind • Relations are sets. Streams are infinite sequences. • Records arrive over time. • Syntax • Time-based operations are cumbersome to specify (aggregates, joins) • Semantics • A SQL query should compute the same result on a batch table and a stream
  • 9. FEBRUARY9, 2017, WARSAW • Standard SQL and LINQ-style Table API • Unified APIs for batch & streaming data • Common translation layers • Optimization based on Apache Calcite • Type system & code-generation • Table sources & sinks • Streaming SQL & Table API is work in progress Flink’s SQL Support & Table API
  • 10. FEBRUARY9, 2017, WARSAW What are the Use Cases for Stream SQL? • Continuous ETL & Data Import • Live Dashboards & Reports • Ad-hoc Analytics & Exploration
  • 11. FEBRUARY9, 2017, WARSAW Dynamic Tables • Core concept is a “Dynamic Table” • Dynamic tables change over time • Dynamic tables are treated like static batch tables • Dynamic tables are queried with standard SQL • A query returns another dynamic table • Stream ←→ Dynamic Table conversions without information loss • “Stream / Table Duality”
  • 12. FEBRUARY9, 2017, WARSAW Stream → Dynamic Table • Append • Replace by Key time k 1 A 2 B 4 A 5 C 7 B 8 A 9 B … … time k 2, B4, A5, C7, B8, A9, B 1, A 2, B4, A5, C7, B8, A9, B 1, A 8 A 9 B 5 C … …
  • 13. FEBRUARY9, 2017, WARSAW Querying a Dynamic Table • Dynamic tables change over time • A[t]: Table A at time t • Dynamic tables are queried with regular SQL • Result of a query changes as input table changes • q(A[t]): Evaluate query q on table A at time t • As time t progresses, the query result is continuously updated • similar to maintaining a materialized view • t is current event time
  • 14. FEBRUARY9, 2017, WARSAW Querying a Dynamic Table time k k cnt A 3 B 2 C 1 9 B k cnt A 3 B 3 C 1 12 C k cnt A 3 B 3 C 2 A[8] A[9] A[12] q(A[8]) q(A[9]) q(A[12]) Table A q: SELECT k, COUNT(k) as cnt FROM A GROUP BY k 1 A 2 B 4 A 5 C 7 B 8 A
  • 15. FEBRUARY9, 2017, WARSAW time k A[5] A[10] A[15] q(A[5]) q(A[10]) q(A[15]) Table A Querying a Dynamic Table 7 B 8 A 9 B 11 A 12 C 14 C 15 A k cnt endT A 2 5 B 1 5 C 1 5 q(A) A 1 10 B 2 10 A 2 15 C 2 15 q: SELECT k, COUNT(k) AS cnt, TUMBLE_END( time, INTERVAL '5' SECONDS) AS endT FROM A GROUP BY k, TUMBLE( time, INTERVAL '5' SECONDS) 1 A 2 B 4 A 5 C
  • 16. FEBRUARY9, 2017, WARSAW Can We Run Any Query on Dynamic Tables? • No  • There are state and computation constraints • State may not grow infinitely as more data arrives • Clean-up timeout must be defined • Input updates may only trigger partial re-computation of the result • Queries with possibly unbounded state or computation are rejected • Optimizer performs validation
  • 17. FEBRUARY9, 2017, WARSAW Bounding the State of a Query • State grows infinitely with domain of grouping attribute • Bound query input by time • Query aggregates data of last 24 hours. Older data is discarded. SELECT k, COUNT(k) AS cnt FROM A GROUP BY k SELECT k, COUNT(k) AS cnt FROM A WHERE last(time, INTERVAL ‘1’ DAY) GROUP BY k STOP! UNBOUNED STATE!
  • 18. FEBRUARY9, 2017, WARSAW Updating Results and Late Arriving Data • Sometimes emitted results need to be updated • Results which are continuously updated • Results for which relevant records arrived late • Results that might be updated must be kept as state • Clean-up timeout • When a table is converted into a stream, updates must be propagated • Update mode • Add/Retract mode
  • 19. FEBRUARY9, 2017, WARSAW Dynamic Table → Stream: Update Mode time k Table A B, 1A, 2C, 1B, 2A, 3 A, 1 SELECT k, COUNT(k) AS cnt FROM A GROUP BY k 1 A 2 B 4 A 5 C 7 B 8 A … … Update by Key
  • 20. FEBRUARY9, 2017, WARSAW Dynamic Table → Stream: Add/Retract Mode time k Table A + B, 1+ A, 2+ C, 1+ B, 2+ A, 3 + A, 1- A, 1- B, 1- A, 2 1 A 2 B 4 A 5 C 7 B 8 A … … SELECT k, COUNT(k) AS cnt FROM A GROUP BY k Add (+) / Retract (-)
  • 21. FEBRUARY9, 2017, WARSAW Current State of SQL and Table API • Huge interest and many contributors • Current development efforts • Adding more window operators • Introducing dynamic tables • And there is a lot more to do • New operators and features for streaming and batch • Performance improvements • Tooling and integration • Try it out, give feedback, and start contributing!
  • 22. FEBRUARY9, 2017, WARSAW Stream Analytics with SQL on Apache Flink Fabian Hueske | @fhueske