Streaming SQL
Julian Hyde
Apex Big Data World
Mountain View
2017/04/04
@julianhyde
SQL
Query planning
Query federation
OLAP
Streaming
Hadoop
Apache member
Original author of Apache Calcite
PMC Apache Arrow, Drill, Eagle,
Kylin
Data center
Why SQL?
Data in motion, data at
rest - it’s all just data
Stream / database
duality
SQL is the best
language ever created
for data (because it’s
declarative)
Building a streaming SQL standard via
consensus
Please! No more “SQL-like” languages!
Key technologies are open source (many are Apache projects)
Calcite is providing leadership: developing example queries, TCK
Complements Apache Beam’s work on a common streaming API/algebra
(Optional) Use Calcite’s framework to build a streaming SQL parser/planner for
your engine
Several projects are working with us: Apex, Flink, Samza, Storm. (Also
non-streaming SQL in Cassandra, Drill, Druid, Elasticsearch, Hive, Kylin, Phoenix.)
SQL in Apex
SQL support is part of Malhar
(malhar-sql) [1]
Disclaimer: Not everything I
describe today is in Apex
Operators: Scan, Filter, Project
Coming soon: Window operators
[1] https://www.datatorrent.com/blog/sql-apache-apex/
Simple queries
select *
from Products
where unitPrice < 20
➢ Traditional (non-streaming)
➢ Products is a table
➢ Retrieves records from -∞ to now
select stream *
from Orders
where units > 1000
➢ Streaming
➢ Orders is a stream
➢ Retrieves records from now to +∞
➢ Query never terminates
Stream-table duality
select *
from Orders
where units > 1000
➢ Yes, you can use a stream as
a table
➢ And you can use a table as a
stream
➢ Actually, Orders is both
➢ Use the stream keyword
➢ Where to actually find the
data? That’s up to the system
select stream *
from Orders
where units > 1000
Combining past and future
select stream *
from Orders as o
where units > (
select avg(units)
from Orders as h
where h.productId = o.productId
and h.rowtime > o.rowtime - interval '1' year)
➢ Orders is used as both stream and table
➢ System determines where to find the records
➢ Query is invalid if records are not available
Semantics of streaming queries
The replay principle:
A streaming query produces the same result as the corresponding
non-streaming query would if given the same data in a table.
Output must not rely on implicit information (arrival order, arrival time,
processing time, or watermarks/punctuations)
(Some triggering schemes allow records to be emitted early and re-stated if
incorrect.)
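The replay principle can be sketched in Python. This is a toy model, not any engine's implementation: a stream is assumed to be an iterator of rows and a table a list of rows, and the names `Order`, `table_query`, and `streaming_query` are made up for illustration.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Order(NamedTuple):
    rowtime: int      # event time in minutes since midnight (assumed unit)
    product_id: int
    units: int


def table_query(orders: List[Order]) -> List[Order]:
    """Non-streaming form: select * from Orders where units > 1000."""
    return [o for o in orders if o.units > 1000]


def streaming_query(orders: Iterable[Order]) -> Iterator[Order]:
    """Streaming form: select stream * from Orders where units > 1000.

    Rows are emitted one at a time, as they arrive.
    """
    for o in orders:
        if o.units > 1000:
            yield o


# Recorded history of the stream, stored as a table.
history = [Order(552, 100, 1500), Order(565, 130, 10), Order(599, 100, 2000)]

# Replaying the recorded rows through the streaming query gives the same
# rows as running the non-streaming query over the table.
replayed = list(streaming_query(iter(history)))
assert replayed == table_query(history)
```

The filter only looks at column values carried by each row, never at arrival time or arrival order, which is why the two results agree.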
Controlling when data is emitted
Early emission is the defining
characteristic of a streaming query.
The emit clause is a SQL extension
inspired by Apache Beam’s “trigger”
notion. (Still experimental… and
evolving.)
A relational (non-streaming) query is
just a query with the most conservative
possible emission strategy.
select stream productId,
count(*) as c
from Orders
group by productId,
floor(rowtime to hour)
emit at watermark,
early interval '2' minute,
late limit 1;
select *
from Orders
emit when complete;
Aggregation and windows
on streams
GROUP BY aggregates multiple rows into sub-totals
➢ In regular GROUP BY each row contributes to
exactly one sub-total
➢ In multi-GROUP BY (e.g. HOP, GROUPING
SETS) a row can contribute to more than one
sub-total
Window functions (OVER) leave the number of rows
unchanged, but compute extra expressions for
each row (based on neighboring rows)
[Figure: three diagrams labelled GROUP BY, Multi GROUP BY, and Window functions]
Tumbling, hopping & session windows in SQL
Tumbling window
select stream … from Orders
group by floor(rowtime to hour)
select stream … from Orders
group by tumble(rowtime, interval '1' hour)
Hopping window
select stream … from Orders
group by hop(rowtime, interval '1' hour,
interval '2' hour)
Session window
select stream … from Orders
group by session(rowtime, interval '1' hour)
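How a row is assigned to each kind of window can be sketched in Python. The sketch makes several assumptions that the slides do not state: timestamps are integer minutes, windows are half-open `[start, end)`, the second argument of `hop` is the interval between window starts and the third is the window length, and a session closes after more than `gap` minutes of inactivity. The function names mirror the SQL functions but are otherwise illustrative.

```python
from typing import List, Tuple

Window = Tuple[int, int]  # half-open interval [start, end), in minutes


def tumble(t: int, size: int) -> Window:
    """The single window of length `size` that contains t."""
    start = t - t % size
    return (start, start + size)


def hop(t: int, emit: int, retain: int) -> List[Window]:
    """All windows of length `retain`, starting every `emit` minutes, that contain t."""
    windows = []
    start = t - t % emit          # latest window start at or before t
    while start > t - retain:     # window [start, start + retain) still covers t
        windows.append((start, start + retain))
        start -= emit
    return sorted(windows)


def sessions(times: List[int], gap: int) -> List[List[int]]:
    """Group timestamps into sessions separated by more than `gap` of inactivity."""
    out: List[List[int]] = []
    for t in sorted(times):
        if out and t - out[-1][-1] <= gap:
            out[-1].append(t)     # continues the current session
        else:
            out.append([t])       # starts a new session
    return out


# 09:12 is minute 552 of the day.
one_window = tumble(552, 60)                  # (540, 600), i.e. 09:00-10:00
two_windows = hop(552, 60, 120)               # [(480, 600), (540, 660)]
grouped = sessions([552, 565, 599, 700], 60)  # [[552, 565, 599], [700]]
```

A row belongs to exactly one tumbling window but to `retain / emit` hopping windows, which is the "multi-GROUP BY" case; session boundaries depend on the data rather than on the clock.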
The “pie chart” problem
➢ Task: Write a web page summarizing
orders over the last hour
➢ Problem: The Orders stream only
contains the current few records
➢ Solution: Materialize short-term history
[Pie chart: Orders over the last hour. Beer 48%, Cheese 30%, Wine 22%]
select productId, count(*)
from Orders
where rowtime > current_timestamp - interval '1' hour
group by productId
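One way to materialize short-term history is a buffer of recent rows plus running counts. The Python sketch below assumes rows arrive in rowtime order and that timestamps are integer minutes; the class name `LastHourCounts` and its methods are hypothetical.

```python
from collections import Counter, deque
from typing import Deque, Dict, Tuple


class LastHourCounts:
    """Keeps the rows of the last `horizon` minutes and a count per product."""

    def __init__(self, horizon: int = 60) -> None:
        self.horizon = horizon
        self.rows: Deque[Tuple[int, str]] = deque()  # (rowtime, productId), oldest first
        self.counts: Counter = Counter()

    def add(self, rowtime: int, product_id: str) -> None:
        """Called for every row that arrives on the Orders stream."""
        self.rows.append((rowtime, product_id))
        self.counts[product_id] += 1

    def snapshot(self, now: int) -> Dict[str, int]:
        """Answers: select productId, count(*) ... where rowtime > now - horizon."""
        while self.rows and self.rows[0][0] <= now - self.horizon:
            _, expired = self.rows.popleft()   # row has left the window
            self.counts[expired] -= 1
            if self.counts[expired] == 0:
                del self.counts[expired]
        return dict(self.counts)


history = LastHourCounts()
for rowtime, product in [(552, "Beer"), (565, "Cheese"), (599, "Beer"), (617, "Wine")]:
    history.add(rowtime, product)

# At 10:20 (minute 620) the 09:12 order is more than an hour old and is dropped.
chart = history.snapshot(now=620)
```

The web page can then call `snapshot` whenever it renders, without the stream itself having to retain old records.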
Join stream to a table
Inputs are the Orders stream and the
Products table, output is a stream.
Acts as a “lookup”.
Execute by caching the table in a
hash map (if the table is not too large);
stream order is then preserved.
What if Products table is being
modified while query executes?
select stream *
from Orders as o
join Products as p
on o.productId = p.productId
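The lookup strategy can be sketched in Python as a build phase followed by a probe phase. The sketch assumes rows are dictionaries, that `productId` is unique in Products, and that the table does not change while the query runs; `lookup_join` and the sample column names `orderId` and `name` are illustrative.

```python
from typing import Dict, Iterable, Iterator


def lookup_join(orders: Iterable[dict], products: Iterable[dict]) -> Iterator[dict]:
    # Build phase: cache the (small) table in a hash map keyed on productId.
    by_id: Dict[int, dict] = {p["productId"]: p for p in products}
    # Probe phase: one lookup per arriving order, so output keeps stream order.
    for o in orders:
        p = by_id.get(o["productId"])
        if p is not None:           # inner join: drop orders with no matching product
            yield {**o, **p}


products = [{"productId": 100, "name": "Beer"}, {"productId": 130, "name": "Cheese"}]
orders = [
    {"orderId": 1, "productId": 100, "units": 5},
    {"orderId": 2, "productId": 999, "units": 1},   # no such product
    {"orderId": 3, "productId": 130, "units": 10},
]
joined = list(lookup_join(orders, products))
```

Because the function is a generator, each joined row is produced as soon as its order arrives, and the input can be an unbounded iterator.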
Join stream to a stream
We can join streams if the join
condition forces them into “lock
step”, within a window (in this case,
1 hour).
Which stream to put into a hash
table? It depends on relative rates,
outer joins, and how we’d like the
output sorted.
select stream *
from Orders as o
join Shipments as s
on o.productId = s.productId
and s.rowtime
between o.rowtime
and o.rowtime + interval '1' hour
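A windowed stream-to-stream join on productId can be sketched in Python by buffering only the orders that a future shipment could still match. The sketch assumes each input is already sorted by rowtime, timestamps are integer minutes, and rows are `(rowtime, productId)` pairs; `window_join` is a hypothetical name, and buffering the Orders side is just one of the choices discussed above.

```python
import heapq
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple

Row = Tuple[int, int]  # (rowtime in minutes, productId)


def window_join(orders: Iterable[Row], shipments: Iterable[Row],
                window: int = 60) -> Iterator[Tuple[Row, Row]]:
    # Merge both streams by rowtime. The tag makes an order (0) sort before
    # a shipment (1) that carries the same rowtime.
    merged = heapq.merge(((t, 0, pid) for t, pid in orders),
                         ((t, 1, pid) for t, pid in shipments))
    recent: Deque[Row] = deque()   # orders from the last `window` minutes, oldest first
    for t, kind, pid in merged:
        while recent and recent[0][0] < t - window:
            recent.popleft()       # too old: no later shipment can match it
        if kind == 0:
            recent.append((t, pid))
        else:
            for order in recent:   # shipment: probe the buffered orders
                if order[1] == pid:
                    yield (order, (t, pid))


orders = [(540, 100), (550, 130), (560, 100)]
shipments = [(545, 100), (605, 100), (700, 130)]
matches = list(window_join(orders, shipments))
```

The time bound in the join condition is what keeps `recent` finite: without it, every order would have to be retained forever.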
Other operations
Other relational operations make sense on streams (usually only if there is an
implicit time bound).
Examples:
● order by - E.g. Each hour emit the top 10 selling products
● union - E.g. Merge streams of orders and shipments
● insert, update, delete - E.g. Continuously insert into an external table
● exists, in sub-queries - E.g. Show me shipments of products for which
there has been no order in the last hour
● view - Expanded when query is parsed; zero runtime cost
● match_recognize - Complex event processing (CEP)
Planning queries
[Figure: query plan spanning MySQL and Splunk. Operators: scan (Table: splunk), scan (Table: products), join (Key: productId), filter (Condition: action = 'purchase'), group (Key: productName, Agg: count), sort (Key: c desc)]
select p.productName, count(*) as c
from splunk.splunk as s
join mysql.products as p
on s.productId = p.productId
where s.action = 'purchase'
group by p.productName
order by c desc
Optimized query
[Figure: optimized query plan spanning MySQL and Splunk. Operators: scan (Table: splunk), scan (Table: products), join (Key: productId), filter (Condition: action = 'purchase'), group (Key: productName, Agg: count), sort (Key: c desc)]
select p.productName, count(*) as c
from splunk.splunk as s
join mysql.products as p
on s.productId = p.productId
where s.action = 'purchase'
group by p.productName
order by c desc
Apache Calcite
Apache top-level project since October, 2015
Query planning framework
➢ Relational algebra, rewrite rules
➢ Cost model & statistics
➢ Federation via adapters
➢ Extensible
Packaging
➢ Library
➢ Optional SQL parser, JDBC server
➢ Community-authored rules, adapters
Embedded: Apache Drill, Apache Hive, Apache Kylin, Apache Phoenix*, Cascading Lingual
Adapters: Apache Cassandra, Apache Spark, CSV, Druid, Elasticsearch, In-memory, JDBC, JSON, MongoDB, Splunk, Web tables
Streaming: Apache Apex, Apache Flink, Apache Samza, Apache Storm
* Under development
Join the community!
Calcite and Apex are projects of the Apache
Software Foundation
The Apache Way: meritocracy, openness,
consensus, community
We welcome new contributors!
Thank you!
@julianhyde
@ApacheCalcite
http://calcite.apache.org
http://calcite.apache.org/docs/stream.html
Extra slides
Summary
Features of streaming SQL:
● Standard SQL over streams and relations
● Relational queries on streams, and vice versa
● Materialized views and standing queries
Benefits:
● Brings streaming data to DB tools and traditional users
● Brings historic data to message-oriented applications
● Lets the system optimize quality of service (QoS) and data location
Why SQL?
➢ API to your database
➢ Ask for what you want,
system decides how to get it
➢ Query planner (optimizer)
converts logical queries to
physical plans
➢ Mathematically sound
language (relational algebra)
➢ For all data, not just “flat”
data in a database
➢ Opportunity for novel data
organizations & algorithms
➢ Standard
Photo: https://www.flickr.com/photos/pere/523019984/ (CC BY-NC-SA 2.0)
Architecture
[Figure: architecture of a conventional database compared with Calcite]
Relational algebra (plus streaming)
Core operators:
➢ Scan
➢ Filter
➢ Project
➢ Join
➢ Sort
➢ Aggregate
➢ Union
➢ Values
Streaming operators:
➢ Delta (converts relation to
stream)
➢ Chi (converts stream to
relation)
In SQL, the STREAM keyword
signifies Delta
Streaming algebra
➢ Filter
➢ Route
➢ Partition
➢ Round-robin
➢ Queue
➢ Aggregate
➢ Merge
➢ Store
➢ Replay
➢ Sort
➢ Lookup
Optimizing streaming queries
The usual relational transformations still apply: push filters and projects towards
sources, eliminate empty inputs, etc.
The transformations for delta are mostly simple:
➢ Delta(Filter(r, predicate)) → Filter(Delta(r), predicate)
➢ Delta(Project(r, e0, ...)) → Project(Delta(r), e0, …)
➢ Delta(Union(r0, r1), ALL) → Union(Delta(r0), Delta(r1))
But not always:
➢ Delta(Join(r0, r1, predicate)) → Union(Join(r0, Delta(r1)), Join(Delta(r0), r1))
➢ Delta(Scan(aTable)) → Empty
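The filter and join rules can be checked on a small example in Python. This is a simplified model and not Calcite's definition of Delta: relations are assumed to be insert-only sets of rows, Delta(r) is modelled as the rows added since the previous snapshot, and `r0` and `r1` on the right-hand side of the join rule are read as the current contents. The helper names `join` and `filter_rows` are illustrative.

```python
def join(r0, r1):
    """Equi-join two sets of (key, value) rows on key."""
    return {(k0, v0, v1) for (k0, v0) in r0 for (k1, v1) in r1 if k0 == k1}


def filter_rows(r, predicate):
    """Keep the rows of r that satisfy predicate."""
    return {row for row in r if predicate(row)}


# Previous snapshots, and the rows that arrived since then (the deltas).
old0, delta0 = {(1, "a"), (2, "b")}, {(3, "c")}
old1, delta1 = {(1, "x")}, {(2, "y"), (3, "z")}
r0, r1 = old0 | delta0, old1 | delta1      # current contents


def key_at_least_2(row):
    return row[0] >= 2


# Delta(Filter(r, predicate)) == Filter(Delta(r), predicate)
delta_of_filter = filter_rows(r0, key_at_least_2) - filter_rows(old0, key_at_least_2)
filter_of_delta = filter_rows(delta0, key_at_least_2)

# Delta(Join(r0, r1)) == Union(Join(r0, Delta(r1)), Join(Delta(r0), r1))
delta_of_join = join(r0, r1) - join(old0, old1)
union_of_joins = join(r0, delta1) | join(delta0, r1)

assert delta_of_filter == filter_of_delta
assert delta_of_join == union_of_joins
```

The join case is the harder one because a new output row can come from a new row on either side: `(2, "b", "y")` pairs an old row of `r0` with a new row of `r1`, so pushing Delta into only one input would miss it.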
Sliding windows in SQL
select stream
sum(units) over w (partition by productId) as units1hp,
sum(units) over w as units1h,
rowtime, productId, units
from Orders
window w as (order by rowtime range interval '1' hour preceding)
rowtime  productId  units
09:12    100        5
09:25    130        10
09:59    100        3
10:17    100        10
units1hp  units1h  rowtime  productId  units
5         5        09:12    100        5
10        15       09:25    130        10
8         18       09:59    100        3
13        23       10:17    100        10
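The sliding-window sums for the four sample rows can be reproduced with a short Python sketch. It assumes rowtimes converted to minutes since midnight, rows already sorted by rowtime, and a window that includes rows exactly one hour old; `sliding_sums` is a hypothetical helper, not an engine API.

```python
from typing import List, Tuple

# (rowtime in minutes since midnight, productId, units)
rows = [
    (552, 100, 5),    # 09:12
    (565, 130, 10),   # 09:25
    (599, 100, 3),    # 09:59
    (617, 100, 10),   # 10:17
]


def sliding_sums(rows: List[Tuple[int, int, int]], window: int = 60):
    """For each row, sum units over the preceding hour, per product and overall."""
    out = []
    for i, (t, product_id, units) in enumerate(rows):
        # Rows up to and including the current one that fall inside the window.
        in_window = [r for r in rows[: i + 1] if r[0] >= t - window]
        units1h = sum(r[2] for r in in_window)
        units1hp = sum(r[2] for r in in_window if r[1] == product_id)
        out.append((units1hp, units1h, t, product_id, units))
    return out


result = sliding_sums(rows)
```

For the 10:17 row the window starts at 09:17, so the 09:12 order has dropped out: the per-product sum is 3 + 10 = 13 and the overall sum is 10 + 3 + 10 = 23. Unlike GROUP BY, the output has exactly one row per input row.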
Join stream to a changing table
Execution is more difficult if the
Products table is being changed
while the query executes.
To do things properly (e.g. to get the
same results when we re-play the
data), we’d need temporal database
semantics.
(Sometimes doing things properly is
too expensive.)
select stream *
from Orders as o
join Products as p
on o.productId = p.productId
and o.rowtime
between p.startEffectiveDate
and p.endEffectiveDate
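The temporal join can be sketched in Python by keeping every version of a product together with its validity interval. The sketch assumes rows are dictionaries, timestamps are integer minutes, and effective intervals are inclusive at both ends (as with `between`) and do not overlap; the `price` column and the name `temporal_join` are made up for illustration.

```python
from typing import Dict, Iterable, Iterator, List


def temporal_join(orders: Iterable[dict], product_versions: Iterable[dict]) -> Iterator[dict]:
    """Match each order to the product version in effect at the order's rowtime."""
    versions: Dict[int, List[dict]] = {}
    for v in product_versions:
        versions.setdefault(v["productId"], []).append(v)
    for o in orders:
        for v in versions.get(o["productId"], []):
            if v["startEffectiveDate"] <= o["rowtime"] <= v["endEffectiveDate"]:
                yield {**o, **v}


product_versions = [
    {"productId": 100, "price": 4, "startEffectiveDate": 0, "endEffectiveDate": 599},
    {"productId": 100, "price": 5, "startEffectiveDate": 600, "endEffectiveDate": 1439},
]
orders = [
    {"rowtime": 552, "productId": 100, "units": 5},    # 09:12, before the price change
    {"rowtime": 617, "productId": 100, "units": 10},   # 10:17, after the price change
]
priced = list(temporal_join(orders, product_versions))
```

Each order is matched by its own rowtime rather than by the time it happens to be processed, so replaying the same orders later gives the same prices. The cost is that old versions must be retained.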

