© Hortonworks Inc. 2016
Polyalgebra
Hadoop Summit, April 14, 2016

Who
Julian Hyde @julianhyde
Apache Calcite (VP), Drill, Kylin
Mondrian OLAP
Hortonworks
Tomer Shiran @tshiran
Apache Drill
Dremio
Thanks:
Jacques Nadeau @intjesus (Dremio/Drill)
Wes McKinney (Cloudera/Arrow)

Polyalgebra
An extended form of relational algebra that encompasses work with dynamically-typed data, complex records, streaming and machine learning, and that allows for a single optimization space.

Ecosystem
[Diagram: Calcite, Drill, Arrow, Ibis, Impala, Kudu, Splunk, Cassandra, JDBC, MongoDB, Spark, Flink]

Ecosystem - data stores
[Same diagram, repeated for this view]

Ecosystem - Engines
[Same diagram, repeated for this view]

Ecosystem - Focus of this talk
[Same diagram, repeated for this view]

Old world, new world
RDBMS
• Security
• Metadata
• SQL
• Query planning
• Data independence
Hadoop
• Scale
• Late schema
• Choice of front-end
• Choice of engines
• Workload: batch, interactive, streaming, ML, graph, …

Many front ends, many engines
[Diagram. Front ends: SQL, Spark, Storm, Cascading, HBase, Graph. Stages: Planning, Execution engine. Engines and runtimes: Map Reduce, Tez, User code in Yarn, Spark, MongoDB, External SQL, Hadoop.]

Relational algebra
Extension to mathematical set theory
Devised by E.F. Codd (IBM) in 1970
Defines the relational database
Operators: select, filter, join, sort, union, etc.
Intermediate format for query planning/optimization
[Diagram: SQL → relational algebra → (optimization) → runnable query plan]

Relational algebra
Query:
select d.name, COUNT(*) as c
from Emps as e
join Depts as d
on e.deptno = d.deptno
where e.age < 30
group by d.deptno
having count(*) > 5
order by c desc
Plan, leaves first:
Scan [Emps]    Scan [Depts]
Join [e.deptno = d.deptno]
Filter [e.age < 30]
Aggregate [deptno, COUNT(*) AS c]
Filter [c > 5]
Project [name, c]
Sort [c DESC]
(Column names are simplified. They would usually be ordinals, e.g. $0 is the first column of the left input.)

Relational algebra - Union and sub-query
Query:
select * from (
  select zipcode, state
  from Emps
  union all
  select zipcode, state
  from Customers)
where state in ('CA', 'TX')
Plan, leaves first:
Scan [Emps]    Scan [Customers]
Project [zipcode, state]    Project [zipcode, state]
Union [all]
Filter [state IN ('CA', 'TX')]

Relational algebra - Insert and Values
Query:
insert into Facts
values ('Meaning of life', 42),
  ('Clever as clever', 6)
Plan, leaves first:
Values [['Meaning of life', 42], ['Clever as clever', 6]]
Insert [Facts]

Expression tree
Query:
select p.productName, COUNT(*) as c
from splunk.splunk as s
join mysql.products as p
on s.productId = p.productId
where s.action = 'purchase'
group by p.productName
order by c desc
Tree, leaves first:
scan (Table: splunk), in Splunk    scan (Table: products), in MySQL
join (Key: product_id)
filter (Condition: action = 'purchase')
group (Key: product_name; Agg: count)
sort (Key: c DESC)

Expression tree (optimized)
Query:
select p.productName, COUNT(*) as c
from splunk.splunk as s
join mysql.products as p
on s.productId = p.productId
where s.action = 'purchase'
group by p.productName
order by c desc
Tree, leaves first:
scan (Table: splunk), then filter (Condition: action = 'purchase'), both in Splunk    scan (Table: products), in MySQL
join (Key: product_id)
group (Key: product_name; Agg: count)
sort (Key: c DESC)
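The step from the first expression tree to the optimized one can be sketched as a single rewrite rule. What follows is a toy model in plain Java, not Calcite's API: the Node class, its operator names and the pushFilterIntoJoin method are hypothetical, and the sketch assumes each filter is tagged with the one table whose columns its condition uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy operator tree; hypothetical classes, not Calcite's RelNode API.
class Node {
    final String kind;      // "scan", "filter" or "join"
    final String detail;    // table name, filter condition or join key
    final String table;     // for a filter: the table whose columns the condition uses
    final List<Node> inputs;

    Node(String kind, String detail, String table, Node... inputs) {
        this.kind = kind;
        this.detail = detail;
        this.table = table;
        this.inputs = new ArrayList<Node>(Arrays.asList(inputs));
    }

    static Node scan(String table) {
        return new Node("scan", table, table);
    }

    static Node join(String key, Node left, Node right) {
        return new Node("join", key, null, left, right);
    }

    static Node filter(String condition, String table, Node input) {
        return new Node("filter", condition, table, input);
    }

    // True if this subtree reads the given table.
    boolean reads(String t) {
        if (kind.equals("scan")) {
            return detail.equals(t);
        }
        for (Node input : inputs) {
            if (input.reads(t)) {
                return true;
            }
        }
        return false;
    }

    // Renders the tree in prefix form, e.g. join[k](scan[a], scan[b]).
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder(kind).append('[').append(detail).append(']');
        if (!inputs.isEmpty()) {
            sb.append('(');
            for (int i = 0; i < inputs.size(); i++) {
                if (i > 0) {
                    sb.append(", ");
                }
                sb.append(inputs.get(i));
            }
            sb.append(')');
        }
        return sb.toString();
    }
}

class FilterPushdown {
    // Rule: filter(join(l, r)) becomes join(filter(l), r) when the condition
    // only touches the left input, and symmetrically for the right input.
    // Returns the node unchanged when the pattern does not match.
    static Node pushFilterIntoJoin(Node n) {
        if (!n.kind.equals("filter") || !n.inputs.get(0).kind.equals("join")) {
            return n;
        }
        Node join = n.inputs.get(0);
        Node left = join.inputs.get(0);
        Node right = join.inputs.get(1);
        if (left.reads(n.table) && !right.reads(n.table)) {
            return Node.join(join.detail, Node.filter(n.detail, n.table, left), right);
        }
        if (right.reads(n.table) && !left.reads(n.table)) {
            return Node.join(join.detail, left, Node.filter(n.detail, n.table, right));
        }
        return n;
    }

    public static void main(String[] args) {
        Node plan = Node.filter("action = 'purchase'", "splunk",
            Node.join("product_id", Node.scan("splunk"), Node.scan("products")));
        System.out.println(plan);
        System.out.println(pushFilterIntoJoin(plan));
    }
}
```

After the rewrite the filter sits directly above the splunk scan, so in a federated setting it could be evaluated by the source system before any rows reach the join. A real planner would keep both the original and the rewritten tree and choose between them by cost.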

Demo
{sqlline, apache-calcite, .csv, CsvPushProjectOntoTableRule}

Calcite – APIs and SPIs
Cost, statistics
RelOptCost
RelOptCostFactory
RelMetadataProvider
• RelMdColumnUniqueness
• RelMdDistinctRowCount
• RelMdSelectivity
SQL parser
SqlNode
SqlParser
SqlValidator
Transformation rules
RelOptRule
• MergeFilterRule
• PushAggregateThroughUnionRule
• 100+ more
Global transformations
• Unification (materialized view)
• Column trimming
• De-correlation
Relational algebra
RelNode (operator)
• TableScan
• Filter
• Project
• Union
• Aggregate
• …
RelDataType (type)
RexNode (expression)
RelTrait (physical property)
• RelConvention (calling-convention)
• RelCollation (sortedness)
• TBD (bucketedness/distribution)
JDBC driver
Metadata
Schema
Table
Function
• TableFunction
• TableMacro
Lattice

Calcite Planning Process
[Flow: SQL parse tree (SqlNode) → Sql-to-Rel Converter → RelNode graph (logical plan; RelNode + RexNode) → Planner → best RelNode graph → translate to runtime]
1. Plan Graph
• Node for each node in Input Plan
• Each node is a Set of alternate Sub Plans
• Set further divided into Subsets: based on traits like sortedness
2. Rules
• Rule: specifies an Operator sub-graph to match and logic to generate equivalent ‘better’ sub-graph
• New and original sub-graph both remain in contention
3. Cost Model
• RelNodes have Cost & Cumulative Cost
4. Metadata Providers
- Used to plug in Schema, cost formulas
- Filter selectivity
- Join selectivity
- NDV calculations
Rule Match Queue
- Add Rule matches to Queue
- Apply Rule match transformations to plan graph
- Iterate for fixed iterations or until cost doesn’t change
- Match importance based on cost of RelNode and height
Based on “Volcano” & “Cascades” papers [G. Graefe]
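
Algebra builder API
import org.apache.calcite.plan.RelOptUtil;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.sql.fun.SqlStdOperatorTable;
import org.apache.calcite.tools.FrameworkConfig;
import org.apache.calcite.tools.RelBuilder;

// config must be initialized by the application before use (schema, parser settings)
final FrameworkConfig config;
final RelBuilder builder = RelBuilder.create(config);
final RelNode node = builder
    .scan("EMP")
    // group by DEPTNO, computing COUNT(*) as C and SUM(SAL) as S
    .aggregate(builder.groupKey("DEPTNO"),
        builder.count(false, "C"),
        builder.sum(false, "S", builder.field("SAL")))
    // keep only the groups where C > 10 (the HAVING clause)
    .filter(
        builder.call(SqlStdOperatorTable.GREATER_THAN,
            builder.field("C"),
            builder.literal(10)))
    .build();
System.out.println(RelOptUtil.toString(node));
produces
LogicalFilter(condition=[>($1, 10)])
  LogicalAggregate(group=[{7}], C=[COUNT()], S=[SUM($5)])
    LogicalTableScan(table=[[scott, EMP]])
Equivalent SQL:
select deptno,
  COUNT(*) as c,
  sum(sal) as s
from Emp
group by deptno
having COUNT(*) > 10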

Non-relational, post-relational
Non-relational stores:
• Document databases: MongoDB
• Key-value stores: HBase, Cassandra
• Graph databases: Neo4j
• Multidimensional OLAP: Microsoft Analysis Services, Mondrian
• Streams: Kafka, Storm
• Text, audio, video
Non-relational operators: data exploration, machine learning
Late or no schema

Complex data
Complex data, also known as nested or document-oriented data. Typically, it can be represented as JSON.
Two new operators are sufficient:
• UNNEST
• COLLECT aggregate function
employees: [
  {
    name: "Bob",
    age: 48,
    pets: [
      {name: "Jim", type: "Dog"},
      {name: "Frank", type: "Cat"}
    ]
  }, {
    name: "Stacy",
    age: 31,
    starSign: "taurus",
    pets: [
      {name: "Jack", type: "Cat"}
    ]
  }, {
    name: "Ken",
    age: 23
  }
]

UNNEST and Flatten
Flatten converts arrays of values to separate rows:
• New record for each list item
• An empty list removes the record
Flatten is actually just syntactic sugar for the UNNEST relational operator.
Input:
name   age  pets
Bob    48   [{name: Jim, type: dog}, {name: Frank, type: cat}]
Stacy  31   [{name: Jack, type: cat}]
Ken    23   []
Output:
name   age  pet
Bob    48   {name: Jim, type: dog}
Bob    48   {name: Frank, type: cat}
Stacy  31   {name: Jack, type: cat}
With flatten:
select e.name, e.age,
  flatten(e.pets)
from Employees as e
With UNNEST:
select e.name, e.age,
  row(a.name, a.type)
from Employees as e,
  unnest e.pets as a
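The two flatten rules (one output record per list item, no output for an empty list) can be sketched in a few lines of plain Java. This is an illustration only, not Drill's implementation: the Employee class and the flatten method are hypothetical, and pets are reduced to their names to keep the rows short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class FlattenSketch {
    // One employee record with a nested list of pet names (hypothetical shape).
    static class Employee {
        final String name;
        final int age;
        final List<String> pets;

        Employee(String name, int age, List<String> pets) {
            this.name = name;
            this.age = age;
            this.pets = pets;
        }
    }

    static List<Employee> sampleEmployees() {
        List<Employee> employees = new ArrayList<Employee>();
        employees.add(new Employee("Bob", 48, Arrays.asList("Jim", "Frank")));
        employees.add(new Employee("Stacy", 31, Arrays.asList("Jack")));
        employees.add(new Employee("Ken", 23, Collections.<String>emptyList()));
        return employees;
    }

    // Emits one "name,age,pet" row per list item. An employee whose list is
    // empty never enters the inner loop, so it contributes no rows.
    static List<String> flatten(List<Employee> employees) {
        List<String> rows = new ArrayList<String>();
        for (Employee e : employees) {
            for (String pet : e.pets) {
                rows.add(e.name + "," + e.age + "," + pet);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String row : flatten(sampleEmployees())) {
            System.out.println(row);
        }
    }
}
```

Ken has no pets, so he is absent from the output, which matches the flattened table on the slide.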

Optimizing UNNEST
As usual, to optimize, we write planner rules.
We can push filters into the non-nested side, so we write FilterUnnestTransposeRule.
(There are many other possible rules.)
Before:
select e.name, a.name
from Employees as e,
  unnest e.pets as a
where e.age < 30
After FilterUnnestTransposeRule:
select e.name, a.name
from (
  select *
  from Employees
  where age < 30) as e,
  unnest e.pets as a

Optimizing UNNEST (2)
We can also optimize projects.
If the table is stored in a column-oriented file format, this reduces disk reads significantly.
• Array wildcard projection through flatten
• Non-flattened column inclusion
Query:
select e.name,
  flatten(pets).name
from Employees as e
Original plan, leaves first:
scan(name, age, pets)
flatten(pets) as pet
project(name, pet.name)
After ProjectFlattenTransposeRule (project through flatten):
scan(name, age, pets)
project(name, pets[*].name)
flatten(pets) as pet
After ProjectScanRule (project into scan, less data read):
scan(name, pets[*].name)
flatten(pets) as pet

Late schema
Evolution:
• Oracle: Schema before write, strongly typed SQL (like Java)
• Hive: Schema before query, strongly typed SQL
• Drill: Schema on data, weakly typed SQL (like JavaScript)
Query:
select *
from Employees
Result:
name   age  starSign  pets
Bob    48             [{name: Jim, type: dog}, {name: Frank, type: cat}]
Stacy  31   Taurus    [{name: Jack, type: cat}]
Ken    23             []
Query:
select *
from Employees
where age < 30
Result (no starSign column!):
name   age  pets
Ken    23   []

Implementing schema-on-data
Expanding *
• Early schema databases expand * at planning time, based on schema
• Drill expands * during query execution
• Each operator needs to be able to propagate column names/types as well as data
Internally, Drill is strongly typed
• Strong typing means efficient code
• JavaScript engines do this too
• Infer type for each batch of records
• Throw away generated code if a column changes type in the next batch of records
Query as written:
select e.name
from Employees as e
where e.age < 30
Query as executed:
select e._map['name'] as name
from Employees as e
where cast(e._map['age'] as integer) < 30
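Expanding * at execution time means the output column list is discovered from the records themselves. The sketch below models each record as a map, mirroring the _map rewrite, and collects the columns of the rows that pass the filter. It is a hypothetical illustration in plain Java, not Drill's code: LateSchemaSketch, rowOf and starColumns are invented names, and real engines work on typed batches rather than one map per row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class LateSchemaSketch {
    // Builds one record from alternating key/value arguments (demo helper).
    static Map<String, Object> rowOf(Object... keyValues) {
        Map<String, Object> row = new LinkedHashMap<String, Object>();
        for (int i = 0; i < keyValues.length; i += 2) {
            row.put((String) keyValues[i], keyValues[i + 1]);
        }
        return row;
    }

    static List<Map<String, Object>> employees() {
        List<Map<String, Object>> rows = new ArrayList<Map<String, Object>>();
        rows.add(rowOf("name", "Bob", "age", 48,
            "pets", Arrays.asList("Jim", "Frank")));
        rows.add(rowOf("name", "Stacy", "age", 31, "starSign", "Taurus",
            "pets", Arrays.asList("Jack")));
        rows.add(rowOf("name", "Ken", "age", 23,
            "pets", new ArrayList<String>()));
        return rows;
    }

    // Models: select * from rows where cast(_map['age'] as integer) < maxAge.
    // The column list is the union of the keys of the surviving rows, in the
    // order they are first seen, so it is only known once the data is read.
    static List<String> starColumns(List<Map<String, Object>> rows, int maxAge) {
        Set<String> columns = new LinkedHashSet<String>();
        for (Map<String, Object> row : rows) {
            int age = ((Number) row.get("age")).intValue(); // the cast in the rewritten query
            if (age < maxAge) {
                columns.addAll(row.keySet());
            }
        }
        return new ArrayList<String>(columns);
    }

    public static void main(String[] args) {
        System.out.println(starColumns(employees(), 100)); // [name, age, pets, starSign]
        System.out.println(starColumns(employees(), 30));  // [name, age, pets]
    }
}
```

With the age < 30 filter only Ken survives, so starSign never appears in the result, which is the effect the late-schema slide points out.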

User-defined operators
A table function is a Java UDF that returns a relation.
• Its arguments may be relations or scalars.
• It appears in the execution plan.
• Annotations indicate whether it is safe to push filters and projects through.
A table macro is a Java function that takes a parse tree and returns a parse tree.
• Named after Lisp macros.
• It does not appear in the execution plan.
• Views (next slide) are a kind of table macro.
Use a table macro rather than a table function, if possible. Re-use existing optimizations.
select e.name
from table(
  my_sample(
    select * from Employees,
    0.15))
select e.name
from table(
  my_filter(
    select * from Employees,
    'age', '<', 30))

Views
View definition:
create view Managers as
select *
from Emps as e
where exists (
  select *
  from Emps as underling
  where underling.manager = e.id)
Query:
select deptno, min(salary)
from Managers
where age >= 50
group by deptno
View plan, leaves first:
Scan [Emps]    Scan [Emps] → Aggregate [manager]
Join [$0, $5]
Project [$0, $1, $2, $3]
Query plan, leaves first:
Scan [Managers]
Filter [age >= 50]
Aggregate [deptno, min(salary)]

Views (after expansion)
View definition:
create view Managers as
select *
from Emps as e
where exists (
  select *
  from Emps as underling
  where underling.manager = e.id)
Query:
select deptno, min(salary)
from Managers
where age >= 50
group by deptno
Plan, leaves first:
Scan [Emps]    Scan [Emps] → Aggregate [manager]
Join [$0, $5]
Project [$0, $1, $2, $3]
Filter [age >= 50]
Aggregate [deptno, min(salary)]

Views (after pushing down filter)
View definition:
create view Managers as
select *
from Emps as e
where exists (
  select *
  from Emps as underling
  where underling.manager = e.id)
Query:
select deptno, min(salary)
from Managers
where age >= 50
group by deptno
Plan, leaves first:
Scan [Emps] → Filter [age >= 50]    Scan [Emps]
Join [$0, $5]
Project [$0, $1, $2, $3]
Aggregate [deptno, min(salary)]

Materialized view
Materialized view definition:
create materialized view EmpSummary as
select deptno,
  gender,
  count(*) as c,
  sum(sal)
from Emps
group by deptno, gender
Query:
select count(*) as c
from Emps
where deptno = 10
and gender = 'M'
Scan [EmpSummary] = Scan [Emps] → Aggregate [deptno, gender, COUNT(*), SUM(sal)]
Query plan, leaves first:
Scan [Emps]
Filter [deptno = 10 AND gender = 'M']
Aggregate [COUNT(*)]

Materialized view, step 2: Rewrite query to match
Materialized view definition:
create materialized view EmpSummary as
select deptno,
  gender,
  count(*) as c,
  sum(sal)
from Emps
group by deptno, gender
Query:
select count(*) as c
from Emps
where deptno = 10
and gender = 'M'
Scan [EmpSummary] = Scan [Emps] → Aggregate [deptno, gender, COUNT(*), SUM(sal)]
Query plan, leaves first:
Scan [Emps]
Filter [deptno = 10 AND gender = 'M']
Aggregate [deptno, gender, COUNT(*) AS c, SUM(sal) AS s]
Project [c]

Materialized view, step 3: substitute table
Materialized view definition:
create materialized view EmpSummary as
select deptno,
  gender,
  count(*) as c,
  sum(sal)
from Emps
group by deptno, gender
Query:
select count(*) as c
from Emps
where deptno = 10
and gender = 'M'
Scan [EmpSummary] = Scan [Emps] → Aggregate [deptno, gender, COUNT(*), SUM(sal)]
Query plan, leaves first:
Scan [EmpSummary]
Filter [deptno = 10 AND gender = 'M']
Project [c]
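The substitution is valid because the summary already holds one pre-aggregated row per (deptno, gender), so filtering it and projecting c gives the same answer as counting the base rows. The sketch below checks that equivalence on a few made-up rows. Everything in it is hypothetical (class names, sample data, the string key used for grouping), and it says nothing about how Calcite's unification algorithm finds the match.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MaterializedViewSketch {
    static class Emp {
        final int deptno;
        final String gender;
        final int sal;

        Emp(int deptno, String gender, int sal) {
            this.deptno = deptno;
            this.gender = gender;
            this.sal = sal;
        }
    }

    // Made-up sample rows for the demo.
    static List<Emp> sampleEmps() {
        List<Emp> emps = new ArrayList<Emp>();
        emps.add(new Emp(10, "M", 100));
        emps.add(new Emp(10, "M", 200));
        emps.add(new Emp(10, "F", 150));
        emps.add(new Emp(20, "M", 50));
        return emps;
    }

    // EmpSummary: group by (deptno, gender), keeping {count(*), sum(sal)}.
    static Map<String, long[]> summarize(List<Emp> emps) {
        Map<String, long[]> summary = new LinkedHashMap<String, long[]>();
        for (Emp e : emps) {
            String key = e.deptno + "/" + e.gender;
            long[] agg = summary.get(key);
            if (agg == null) {
                agg = new long[2];
                summary.put(key, agg);
            }
            agg[0]++;          // count(*)
            agg[1] += e.sal;   // sum(sal)
        }
        return summary;
    }

    // Original plan: Scan [Emps], Filter, Aggregate [COUNT(*)].
    static long countFromBase(List<Emp> emps, int deptno, String gender) {
        long count = 0;
        for (Emp e : emps) {
            if (e.deptno == deptno && e.gender.equals(gender)) {
                count++;
            }
        }
        return count;
    }

    // Substituted plan: Scan [EmpSummary], Filter on the group key, Project [c].
    static long countFromSummary(Map<String, long[]> summary, int deptno, String gender) {
        long[] agg = summary.get(deptno + "/" + gender);
        return agg == null ? 0 : agg[0];
    }

    public static void main(String[] args) {
        List<Emp> emps = sampleEmps();
        Map<String, long[]> empSummary = summarize(emps);
        System.out.println(countFromBase(emps, 10, "M"));          // 2
        System.out.println(countFromSummary(empSummary, 10, "M")); // 2
    }
}
```

The second plan touches one summary row instead of every employee row, which is where the saving comes from when the base table is large.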

Streaming queries
Streaming queries run forever.
Stream appears in the FROM clause: Orders.
Without the stream keyword, Orders means the history of the stream (a table).
Calcite streaming SQL: in Samza, Storm, Flink.
Stream:
select stream *
from Orders
History of the stream (a table):
select *
from Orders

Combining past and future
• Orders is used as both stream and table
• System determines where to find the records
• Query is invalid if records are not available
select stream *
from Orders as o
where units > (
  select avg(units)
  from Orders as h
  where h.productId = o.productId
  and h.rowtime >
    o.rowtime - interval '1' year)

Hybrid systems
Hybrid systems combine more than one data source and/or engine.
Examples:
• Splunk join to MySQL
• User-defined table written in Python reading from an in-memory temporary table just created by Drill.
• Streaming query populating a table summarizing the last hour’s activity that will be used to populate a pie chart in a web dashboard.
Two challenges:
• Planning the query to take advantage of each system’s strengths.
• Efficient interchange of data at run time.

Hybrid algebra, hybrid run-time
[Diagram: Calcite, Drill, Arrow, Ibis, Impala, Kudu, Splunk, Cassandra, JDBC, MongoDB, Spark, Flink]

Arrow in a Slide
• New Top-level Apache Software Foundation project
– Announced Feb 17, 2016
• Focused on Columnar In-Memory Analytics
1. 10-100x speedup on many workloads
2. Common data layer enables companies to choose best of breed systems
3. Designed to work with any programming language
4. Support for both relational and complex data as-is
• Developers from 13+ major open source projects involved
– A significant % of the world’s data will be processed through Arrow!
Projects: Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R

Focus on CPU Efficiency
[Diagram: Traditional Memory Buffer compared with Arrow Memory Buffer]
• Cache Locality
• Super-scalar & vectorized operation
• Minimal Structure Overhead
• Constant value access
– With minimal structure overhead
• Operate directly on columnar compressed data

High Performance Sharing & Interchange
Today
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Similar functionality implemented in multiple projects
With Arrow
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g., Parquet-to-Arrow reader)

Summary
• Algebra-centric approach
• Optimize by applying transformation rules
• User-defined operators (table functions, table macros, custom RelNode classes)
• Complex data
• Late-schema queries
• Streaming queries
• Calcite enables planning hybrid queries
• Arrow enables hybrid runtime

Thanks!
@julianhyde
@tshiran
@ApacheCalcite
@ApacheDrill
@ApacheArrow
Get involved:
• http://calcite.apache.org
• http://drill.apache.org
• http://arrow.apache.org
Ad

More Related Content

What's hot (20)

Apache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them AllApache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them All
Michael Mior
 
SQL for NoSQL and how Apache Calcite can help
SQL for NoSQL and how  Apache Calcite can helpSQL for NoSQL and how  Apache Calcite can help
SQL for NoSQL and how Apache Calcite can help
Christian Tzolov
 
Adding measures to Calcite SQL
Adding measures to Calcite SQLAdding measures to Calcite SQL
Adding measures to Calcite SQL
Julian Hyde
 
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Julian Hyde
 
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...
Christian Tzolov
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
StreamNative
 
Unified Stream and Batch Processing with Apache Flink
Unified Stream and Batch Processing with Apache FlinkUnified Stream and Batch Processing with Apache Flink
Unified Stream and Batch Processing with Apache Flink
DataWorks Summit/Hadoop Summit
 
Morel, a Functional Query Language
Morel, a Functional Query LanguageMorel, a Functional Query Language
Morel, a Functional Query Language
Julian Hyde
 
Apache Calcite overview
Apache Calcite overviewApache Calcite overview
Apache Calcite overview
Julian Hyde
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
Databricks
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
colorant
 
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Julian Hyde
 
[2019] 바르게, 빠르게! Reactive를 품은 Spring Kafka
[2019] 바르게, 빠르게! Reactive를 품은 Spring Kafka[2019] 바르게, 빠르게! Reactive를 품은 Spring Kafka
[2019] 바르게, 빠르게! Reactive를 품은 Spring Kafka
NHN FORWARD
 
Redis for duplicate detection on real time stream
Redis for duplicate detection on real time streamRedis for duplicate detection on real time stream
Redis for duplicate detection on real time stream
Roberto Franchini
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
Databricks
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Databricks
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
Databricks
 
Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...
Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...
Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...
StreamNative
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark Summit
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data Alerts
Flink Forward
 
Apache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them AllApache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them All
Michael Mior
 
SQL for NoSQL and how Apache Calcite can help
SQL for NoSQL and how  Apache Calcite can helpSQL for NoSQL and how  Apache Calcite can help
SQL for NoSQL and how Apache Calcite can help
Christian Tzolov
 
Adding measures to Calcite SQL
Adding measures to Calcite SQLAdding measures to Calcite SQL
Adding measures to Calcite SQL
Julian Hyde
 
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Julian Hyde
 
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...
Christian Tzolov
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
StreamNative
 
Morel, a Functional Query Language
Morel, a Functional Query LanguageMorel, a Functional Query Language
Morel, a Functional Query Language
Julian Hyde
 
Apache Calcite overview
Apache Calcite overviewApache Calcite overview
Apache Calcite overview
Julian Hyde
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
Databricks
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
colorant
 
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Julian Hyde
 
[2019] 바르게, 빠르게! Reactive를 품은 Spring Kafka
[2019] 바르게, 빠르게! Reactive를 품은 Spring Kafka[2019] 바르게, 빠르게! Reactive를 품은 Spring Kafka
[2019] 바르게, 빠르게! Reactive를 품은 Spring Kafka
NHN FORWARD
 
Redis for duplicate detection on real time stream
Redis for duplicate detection on real time streamRedis for duplicate detection on real time stream
Redis for duplicate detection on real time stream
Roberto Franchini
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
Databricks
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Databricks
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
Databricks
 
Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...
Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...
Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...
StreamNative
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark Summit
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data Alerts
Flink Forward
 

Viewers also liked (20)

Apache Calcite: One planner fits all
Apache Calcite: One planner fits allApache Calcite: One planner fits all
Apache Calcite: One planner fits all
Julian Hyde
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
Julian Hyde
 
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteCost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
Julian Hyde
 
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Julian Hyde
 
Calcite meetup-2016-04-20
Calcite meetup-2016-04-20Calcite meetup-2016-04-20
Calcite meetup-2016-04-20
Josh Elser
 
SQL on everything, in memory
SQL on everything, in memorySQL on everything, in memory
SQL on everything, in memory
Julian Hyde
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
Julian Hyde
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
Julian Hyde
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
Julian Hyde
 
Polyalgebra
PolyalgebraPolyalgebra
Polyalgebra
DataWorks Summit/Hadoop Summit
 
Alternatives to Apache Accumulo’s Java API
Alternatives to Apache Accumulo’s Java APIAlternatives to Apache Accumulo’s Java API
Alternatives to Apache Accumulo’s Java API
Josh Elser
 
phoenix-on-calcite-nyc-meetup
phoenix-on-calcite-nyc-meetupphoenix-on-calcite-nyc-meetup
phoenix-on-calcite-nyc-meetup
Maryann Xue
 
Discardable In-Memory Materialized Queries With Hadoop
Discardable In-Memory Materialized Queries With HadoopDiscardable In-Memory Materialized Queries With Hadoop
Discardable In-Memory Materialized Queries With Hadoop
Julian Hyde
 
The twins that everyone loved too much
The twins that everyone loved too muchThe twins that everyone loved too much
The twins that everyone loved too much
Julian Hyde
 
What's new in Mondrian 4?
What's new in Mondrian 4?What's new in Mondrian 4?
What's new in Mondrian 4?
Julian Hyde
 
Optiq: A dynamic data management framework
Optiq: A dynamic data management frameworkOptiq: A dynamic data management framework
Optiq: A dynamic data management framework
Julian Hyde
 
Why you care about
 relational algebra (even though you didn’t know it)
Why you care about
 relational algebra (even though you didn’t know it)Why you care about
 relational algebra (even though you didn’t know it)
Why you care about
 relational algebra (even though you didn’t know it)
Julian Hyde
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14
Julian Hyde
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14
Julian Hyde
 
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
 Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Julian Hyde
 
Apache Calcite: One planner fits all
Apache Calcite: One planner fits allApache Calcite: One planner fits all
Apache Calcite: One planner fits all
Julian Hyde
 
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteCost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
Julian Hyde
 
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Julian Hyde
 
Calcite meetup-2016-04-20
Calcite meetup-2016-04-20Calcite meetup-2016-04-20
Calcite meetup-2016-04-20
Josh Elser
 
SQL on everything, in memory
SQL on everything, in memorySQL on everything, in memory
SQL on everything, in memory
Julian Hyde
 
Alternatives to Apache Accumulo’s Java API
Alternatives to Apache Accumulo’s Java APIAlternatives to Apache Accumulo’s Java API
Alternatives to Apache Accumulo’s Java API
Josh Elser
 
phoenix-on-calcite-nyc-meetup
phoenix-on-calcite-nyc-meetupphoenix-on-calcite-nyc-meetup
phoenix-on-calcite-nyc-meetup
Maryann Xue
 
Discardable In-Memory Materialized Queries With Hadoop
Discardable In-Memory Materialized Queries With HadoopDiscardable In-Memory Materialized Queries With Hadoop
Discardable In-Memory Materialized Queries With Hadoop
Julian Hyde
 
The twins that everyone loved too much
The twins that everyone loved too muchThe twins that everyone loved too much
The twins that everyone loved too much
Julian Hyde
 
What's new in Mondrian 4?
What's new in Mondrian 4?What's new in Mondrian 4?
What's new in Mondrian 4?
Julian Hyde
 
Optiq: A dynamic data management framework
Optiq: A dynamic data management frameworkOptiq: A dynamic data management framework
Optiq: A dynamic data management framework
Julian Hyde
 
Why you care about
 relational algebra (even though you didn’t know it)
Why you care about
 relational algebra (even though you didn’t know it)Why you care about
 relational algebra (even though you didn’t know it)
Why you care about
 relational algebra (even though you didn’t know it)
Julian Hyde
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14
Julian Hyde
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14
Julian Hyde
 
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
 Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Julian Hyde
 
Ad

Planning with Polyalgebra: Bringing Together Relational, Complex and Machine Learning Algebra

  • 1. © Hortonworks Inc. 2016 Polyalgebra Hadoop Summit, April 14, 2016
  • 2. Who
 Julian Hyde (@julianhyde): Apache Calcite (VP), Drill, Kylin; Mondrian OLAP; Hortonworks
 Tomer Shiran (@tshiran): Apache Drill; Dremio
 Thanks: Jacques Nadeau (@intjesus, Dremio/Drill), Wes McKinney (Cloudera/Arrow)
  • 3. Polyalgebra
 An extended form of relational algebra that encompasses work with dynamically-typed data, complex records, streaming and machine learning, and that allows for a single optimization space.
  • 4. Ecosystem: Calcite, Drill, Arrow, Ibis, Impala, Kudu, Splunk, Cassandra, JDBC, MongoDB, JDBC, Spark, Flink
  • 5. Ecosystem - data stores (same diagram as slide 4)
  • 6. Ecosystem - Engines (same diagram as slide 4)
  • 7. Ecosystem - Focus of this talk (same diagram as slide 4)
  • 8. Old world, new world
 RDBMS:
 • Security
 • Metadata
 • SQL
 • Query planning
 • Data independence
 Hadoop:
 • Scale
 • Late schema
 • Choice of front-end
 • Choice of engines
 • Workload: batch, interactive, streaming, ML, graph, …
  • 9. Many front ends, many engines
 [Diagram. Labels: SQL, planning, execution engine, user code, MapReduce, Tez, user code in YARN, Spark, MongoDB, Hadoop, external SQL, Storm, Cascading, HBase, graph]
  • 10. Relational algebra
 • Extension to mathematical set theory
 • Devised by E.F. Codd (IBM) in 1970
 • Defines the relational database
 • Operators: select, filter, join, sort, union, etc.
 • Intermediate format for query planning/optimization
 Pipeline: SQL → relational algebra → (optimization) → runnable query plan
  • 11. Relational algebra
 select d.name, COUNT(*) as c
 from Emps as e
 join Depts as d
 on e.deptno = d.deptno
 where e.age < 30
 group by d.deptno
 having count(*) > 5
 order by c desc
 Operator tree, inputs first:
 Scan [Emps], Scan [Depts]
 → Join [e.deptno = d.deptno]
 → Filter [e.age < 30]
 → Aggregate [deptno, COUNT(*) AS c]
 → Filter [c > 5]
 → Project [name, c]
 → Sort [c DESC]
 (Column names are simplified. They would usually be ordinals, e.g. $0 is the first column of the left input.)
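 The operator tree on slide 11 can be read as a pipeline in which each operator consumes the rows produced by the one before it. The following is a minimal sketch in plain Python, not Calcite code. The sample rows are invented (the slide gives no data), and the aggregate groups by both deptno and name so that the name survives to the final projection, which is an assumption that name is determined by deptno.

```python
from collections import Counter

# Hypothetical sample rows; the slide gives no data.
EMPS = [{"name": f"emp{i}", "age": 25, "deptno": 10} for i in range(6)] + [
    {"name": "old1", "age": 45, "deptno": 10},
    {"name": "young20", "age": 22, "deptno": 20},
]
DEPTS = [{"deptno": 10, "name": "Sales"}, {"deptno": 20, "name": "Ops"}]


def join(left, right, key):
    """Inner equi-join; right-side columns are prefixed with 'd_'."""
    return [
        {**l, **{"d_" + k: v for k, v in r.items()}}
        for l in left for r in right if l[key] == r[key]
    ]


def run_plan(emps, depts):
    rows = join(emps, depts, "deptno")            # Join [e.deptno = d.deptno]
    rows = [r for r in rows if r["age"] < 30]     # Filter [e.age < 30]
    # Aggregate [deptno, COUNT(*) AS c]; name is carried along (assumption).
    counts = Counter((r["d_deptno"], r["d_name"]) for r in rows)
    # Project [name, c]
    groups = [{"name": name, "c": c} for (_, name), c in counts.items()]
    groups = [g for g in groups if g["c"] > 5]    # Filter [c > 5]
    return sorted(groups, key=lambda g: g["c"], reverse=True)  # Sort [c DESC]


result = run_plan(EMPS, DEPTS)
print(result)
```

 With this data only the Sales department has more than five employees under 30, so it is the only row returned.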
  • 12. Relational algebra - Union and sub-query
 select * from (
 select zipcode, state
 from Emps
 union all
 select zipcode, state
 from Customers)
 where state in ('CA', 'TX')
 Operator tree, inputs first:
 Scan [Emps] → Project [zipcode, state]
 Scan [Customers] → Project [zipcode, state]
 → Union [all]
 → Filter [state IN ('CA', 'TX')]
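 Once a query is in algebraic form, a planner can move operators around without changing the result. As an illustration (the slide itself does not show this rewrite), the sketch below uses Python's built-in sqlite3 module to run the query from slide 12 and a variant in which the Filter is applied to each input of the Union. The table contents are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical rows; the slide defines only the column names.
conn.executescript("""
    CREATE TABLE Emps (zipcode TEXT, state TEXT);
    CREATE TABLE Customers (zipcode TEXT, state TEXT);
    INSERT INTO Emps VALUES ('94105', 'CA'), ('10001', 'NY');
    INSERT INTO Customers VALUES ('73301', 'TX'), ('98101', 'WA'), ('94105', 'CA');
""")

# Query as written on the slide: Filter sits above Union.
filter_above = conn.execute("""
    SELECT * FROM (
        SELECT zipcode, state FROM Emps
        UNION ALL
        SELECT zipcode, state FROM Customers)
    WHERE state IN ('CA', 'TX')
""").fetchall()

# Rewritten form: the same Filter applied to each input of the Union.
filter_below = conn.execute("""
    SELECT zipcode, state FROM Emps WHERE state IN ('CA', 'TX')
    UNION ALL
    SELECT zipcode, state FROM Customers WHERE state IN ('CA', 'TX')
""").fetchall()

print(sorted(filter_above) == sorted(filter_below))
```

 Both forms return the same three rows; the second gives each data source the chance to discard rows before they are combined.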
  • 13. Relational algebra - Insert and Values
 insert into Facts
 values ('Meaning of life', 42),
 ('Clever as clever', 6)
 Operator tree, inputs first:
 Values [['Meaning of life', 42], ['Clever as clever', 6]]
 → Insert [Facts]
  • 14. Expression tree
 select p.productName, COUNT(*) as c
 from splunk.splunk as s
 join mysql.products as p
 on s.productId = p.productId
 where s.action = 'purchase'
 group by p.productName
 order by c desc
 Operator tree, inputs first:
 scan [Table: splunk] (Splunk), scan [Table: products] (MySQL)
 → join [Key: product_id]
 → filter [Condition: action = 'purchase']
 → group [Key: product_name, Agg: count]
 → sort [Key: c DESC]
  • 15. Expression tree (optimized)
 select p.productName, COUNT(*) as c
 from splunk.splunk as s
 join mysql.products as p
 on s.productId = p.productId
 where s.action = 'purchase'
 group by p.productName
 order by c desc
 Operator tree, inputs first:
 Splunk: scan [Table: splunk] → filter [Condition: action = 'purchase']
 MySQL: scan [Table: products]
 → join [Key: product_id]
 → group [Key: product_name, Agg: count]
 → sort [Key: c DESC]
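 The difference between the trees on slides 14 and 15 is where the filter runs. A minimal sketch in plain Python, assuming the optimization is a filter pushed below the join onto the event side: both orderings give the same answer, but the second feeds fewer rows into the join. The rows are invented stand-ins for the Splunk and MySQL tables, and counting rows into the join is only a crude proxy for cost.

```python
# Hypothetical rows standing in for the Splunk events and MySQL products tables.
EVENTS = [
    {"productId": 1, "action": "purchase"},
    {"productId": 1, "action": "view"},
    {"productId": 2, "action": "purchase"},
    {"productId": 1, "action": "purchase"},
    {"productId": 2, "action": "view"},
]
PRODUCTS = [
    {"productId": 1, "productName": "widget"},
    {"productId": 2, "productName": "gadget"},
]


def join(events, products):
    """Inner join on productId; also returns how many event rows were fed in."""
    by_id = {p["productId"]: p for p in products}
    out = [{**e, **by_id[e["productId"]]} for e in events if e["productId"] in by_id]
    return out, len(events)


def group_count_sorted(rows):
    """group by productName, count(*), order by count descending."""
    counts = {}
    for r in rows:
        counts[r["productName"]] = counts.get(r["productName"], 0) + 1
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)


def is_purchase(row):
    return row["action"] == "purchase"


# Original tree: join first, then filter.
joined, rows_into_join_before = join(EVENTS, PRODUCTS)
before = group_count_sorted([r for r in joined if is_purchase(r)])

# Optimized tree: filter the event side first, then join.
joined, rows_into_join_after = join([e for e in EVENTS if is_purchase(e)], PRODUCTS)
after = group_count_sorted(joined)

print(before == after, rows_into_join_before, rows_into_join_after)
```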
  • 16. Demo {sqlline, apache-calcite, .csv, CsvPushProjectOntoTableRule}
  • 17. Calcite – APIs and SPIs
 Cost, statistics: RelOptCost, RelOptCostFactory, RelMetadataProvider
 • RelMdColumnUniqueness
 • RelMdDistinctRowCount
 • RelMdSelectivity
 SQL parser: SqlNode, SqlParser, SqlValidator
 Transformation rules: RelOptRule
 • MergeFilterRule
 • PushAggregateThroughUnionRule
 • 100+ more
 Global transformations
 • Unification (materialized view)
 • Column trimming
 • De-correlation
 Relational algebra
 RelNode (operator)
 • TableScan
 • Filter
 • Project
 • Union
 • Aggregate
 • …
 RelDataType (type)
 RexNode (expression)
 RelTrait (physical property)
 • RelConvention (calling-convention)
 • RelCollation (sortedness)
 • TBD (bucketedness/distribution)
 JDBC driver
 Metadata: Schema, Table, Function (TableFunction, TableMacro), Lattice
  • 18. Calcite Planning Process
 Flow: SQL parse tree → Sql-to-Rel Converter (SqlNode → RelNode + RexNode) → logical plan → Planner (RelNode graph) → best RelNode graph → translate to runtime
 1. Plan Graph
 • Node for each node in Input Plan
 • Each node is a Set of alternate Sub Plans
 • Set further divided into Subsets: based on traits like sortedness
 2. Rules
 • Rule: specifies an Operator sub-graph to match and logic to generate equivalent 'better' sub-graph
 • New and original sub-graph both remain in contention
 3. Cost Model
 • RelNodes have Cost & Cumulative Cost
 4. Metadata Providers
 - Used to plug in Schema, cost formulas
 - Filter selectivity
 - Join selectivity
 - NDV calculations
 Rule Match Queue
 - Add Rule matches to Queue
 - Apply Rule match transformations to plan graph
 - Iterate for fixed iterations or until cost doesn't change
 - Match importance based on cost of RelNode and height
 Based on "Volcano" & "Cascades" papers [G. Graefe]
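 The loop described on slide 18 (match a rule, keep both the original and the rewritten plan in contention, compare cumulative cost, stop when the cost no longer improves) can be illustrated with a toy planner. This is a sketch under heavy assumptions, not the Volcano algorithm or Calcite's implementation: plans are nested tuples, the single rule matches only at the root and assumes the filter references only the left join input, and every row count and selectivity is invented.

```python
# Plans are nested tuples: ("scan", table), ("filter", keep_percent, child),
# ("join", left, right). Row counts and selectivities are made up.
ROW_COUNTS = {"events": 1000, "products": 10}


def rows(plan):
    """Estimated output row count (stands in for a metadata provider)."""
    kind = plan[0]
    if kind == "scan":
        return ROW_COUNTS[plan[1]]
    if kind == "filter":
        return plan[1] * rows(plan[2]) // 100
    # join: assume each left row matches exactly one right row
    return rows(plan[1])


def cost(plan):
    """Cumulative cost: rows produced by this node plus the cost of its inputs."""
    return rows(plan) + sum(cost(c) for c in plan[1:] if isinstance(c, tuple))


def push_filter_into_join(plan):
    """Rule: Filter(Join(l, r)) -> Join(Filter(l), r); None if there is no match."""
    if plan[0] == "filter" and plan[2][0] == "join":
        _, keep_percent, (_, left, right) = plan
        return ("join", ("filter", keep_percent, left), right)
    return None


def optimize(plan, rules, max_iterations=10):
    best = plan
    for _ in range(max_iterations):
        # The original and every rewritten plan remain in contention.
        candidates = [best] + [p for p in (r(best) for r in rules) if p is not None]
        cheapest = min(candidates, key=cost)
        if cost(cheapest) >= cost(best):
            break  # cost stopped improving
        best = cheapest
    return best


original = ("filter", 10, ("join", ("scan", "events"), ("scan", "products")))
best = optimize(original, [push_filter_into_join])
print(cost(original), cost(best), best)
```

 A real planner keeps many equivalent sub-plans per node and prioritizes matches in a queue; this sketch keeps only the single cheapest plan per iteration.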
  • 19. Algebra builder API
 final FrameworkConfig config; // configured elsewhere
 final RelBuilder builder = RelBuilder.create(config);
 final RelNode node = builder
   .scan("EMP")
   .aggregate(builder.groupKey("DEPTNO"),
       builder.count(false, "C"),
       builder.sum(false, "S", builder.field("SAL")))
   .filter(
       builder.call(SqlStdOperatorTable.GREATER_THAN,
           builder.field("C"),
           builder.literal(10)))
   .build();
 System.out.println(RelOptUtil.toString(node));
 produces
 LogicalFilter(condition=[>($1, 10)])
   LogicalAggregate(group=[{7}], C=[COUNT()], S=[SUM($5)])
     LogicalTableScan(table=[[scott, EMP]])
 Equivalent SQL:
 select deptno,
 COUNT(*) as c, sum(sal) as s
 from Emp
 group by deptno
 having COUNT(*) > 10
  • 20. Non-relational, post-relational
 Non-relational stores:
 • Document databases: MongoDB
 • Key-value stores: HBase, Cassandra
 • Graph databases: Neo4j
 • Multidimensional OLAP: Microsoft Analysis Services, Mondrian
 • Streams: Kafka, Storm
 • Text, audio, video
 Non-relational operators: data exploration, machine learning
 Late or no schema
  • 21. Complex data
 Complex data, also known as nested or document-oriented data. Typically, it can be represented as JSON.
 Two new operators are sufficient:
 • UNNEST
 • COLLECT aggregate function
 employees: [
   { name: "Bob", age: 48, pets: [ {name: "Jim", type: "Dog"}, {name: "Frank", type: "Cat"} ] },
   { name: "Stacy", age: 31, starSign: "taurus", pets: [ {name: "Jack", type: "Cat"} ] },
   { name: "Ken", age: 23 }
 ]
  • 22. UNNEST and Flatten
 Flatten converts arrays of values to separate rows:
 • New record for each list item
 • An empty list removes the record
 Flatten is actually just syntactic sugar for the UNNEST relational operator.
 Input:
 name  | age | pets
 Bob   | 48  | [{name: Jim, type: dog}, {name: Frank, type: dog}]
 Stacy | 31  | [{name: Jack, type: cat}]
 Ken   | 23  | []
 Output:
 name  | age | pet
 Bob   | 48  | {name: Jim, type: dog}
 Bob   | 48  | {name: Frank, type: dog}
 Stacy | 31  | {name: Jack, type: cat}
 With flatten:
 select e.name, e.age,
 flatten(e.pets)
 from Employees as e
 With UNNEST:
 select e.name, e.age,
 row(a.name, a.type)
 from Employees as e,
 unnest e.pets as a
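 A minimal sketch of the two operators on plain Python dictionaries, using the employee records from the Complex data slide. The function names `unnest` and `collect` and their parameters are illustrative only; they are not Drill or Calcite APIs.

```python
employees = [
    {"name": "Bob", "age": 48,
     "pets": [{"name": "Jim", "type": "Dog"}, {"name": "Frank", "type": "Cat"}]},
    {"name": "Stacy", "age": 31, "starSign": "taurus",
     "pets": [{"name": "Jack", "type": "Cat"}]},
    {"name": "Ken", "age": 23},
]


def unnest(rows, column, alias):
    """Emit one row per element of row[column].

    Rows whose list is empty or missing produce no output at all.
    """
    for row in rows:
        for item in row.get(column, []):
            out = {k: v for k, v in row.items() if k != column}
            out[alias] = item
            yield out


def collect(rows, key, column, alias):
    """Aggregate: group by `key` and gather the `column` values into a list."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row[column])
    return [{key: k, alias: v} for k, v in groups.items()]


flat = list(unnest(employees, "pets", "pet"))
nested_again = collect(flat, "name", "pet", "pets")
```

 Ken has no pets, so he disappears after `unnest` and does not come back after `collect`; the round trip is lossy for empty lists.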
  • 23. Optimizing UNNEST
 As usual, to optimize, we write planner rules. We can push filters into the non-nested side, so we write FilterUnnestTransposeRule. (There are many other possible rules.)
 Before:
 select e.name, a.name
 from Employees as e,
 unnest e.pets as a
 where e.age < 30
 After FilterUnnestTransposeRule:
 select e.name, a.name
 from (
 select *
 from Employees
 where age < 30) as e,
 unnest e.pets as a
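 Why the transposition is safe can be checked on a small example: a filter on a non-nested column gives the same rows whether it runs before or after the unnest, but running it first means fewer rows are expanded. The data below (including the employee "Ann") and the helper names are hypothetical.

```python
employees = [
    {"name": "Bob", "age": 48, "pets": [{"name": "Jim"}, {"name": "Frank"}]},
    {"name": "Stacy", "age": 31, "pets": [{"name": "Jack"}]},
    {"name": "Ken", "age": 23, "pets": []},
    {"name": "Ann", "age": 25, "pets": [{"name": "Rex"}, {"name": "Tom"}]},
]


def unnest(rows, column, alias):
    """One output row per list element; empty lists produce nothing."""
    for row in rows:
        for item in row.get(column, []):
            out = {k: v for k, v in row.items() if k != column}
            out[alias] = item
            yield out


def unnest_then_filter(rows):
    """Original plan: expand every employee, then filter on age."""
    flat = list(unnest(rows, "pets", "pet"))
    return [r for r in flat if r["age"] < 30], len(flat)


def filter_then_unnest(rows):
    """Plan after the rule: filter employees first, then expand."""
    young = [r for r in rows if r["age"] < 30]
    flat = list(unnest(young, "pets", "pet"))
    return flat, len(flat)


before, expanded_before = unnest_then_filter(employees)
after, expanded_after = filter_then_unnest(employees)
```

 Both plans return Ann's two pets; the original plan expands five rows to get there, the rewritten plan only two.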
  • 24. Optimizing UNNEST (2)
 We can also optimize projects. If the table is stored in a column-oriented file format, this reduces disk reads significantly.
 • Array wildcard projection through flatten
 • Non-flattened column inclusion
 select e.name,
 flatten(pets).name
 from Employees as e
 Original Plan: scan(name, age, pets) → flatten(pets) as pet → project(name, pet.name)
 Project through Flatten (ProjectFlattenTransposeRule): scan(name, age, pets) → project(name, pets[*].name) → flatten(pets) as pet
 Project into scan, less data read (ProjectScanRule): scan(name, pets[*].name) → flatten(pets) as pet
  • 25. Late schema
 Evolution:
 • Oracle: Schema before write, strongly typed SQL (like Java)
 • Hive: Schema before query, strongly typed SQL
 • Drill: Schema on data, weakly typed SQL (like JavaScript)
 select *
 from Employees
 name  | age | starSign | pets
 Bob   | 48  |          | [{name: Jim, type: dog}, {name: Frank, type: dog}]
 Stacy | 31  | Taurus   | [{name: Jack, type: cat}]
 Ken   | 23  |          | []
 select *
 from Employees
 where age < 30
 name | age | pets
 Ken  | 23  | []
 no starSign column!
  • 26. Implementing schema-on-data
 Expanding *
 • Early schema databases expand * at planning time, based on schema
 • Drill expands * during query execution
 • Each operator needs to be able to propagate column names/types as well as data
 Internally, Drill is strongly typed
 • Strong typing means efficient code
 • JavaScript engines do this too
 • Infer type for each batch of records
 • Throw away generated code if a column changes type in the next batch of records
 select e.name
 from Employees as e
 where e.age < 30
 Conceptually equivalent to:
 select e._map["name"] as name
 from Employees as e
 where cast(e._map["age"] as integer) < 30
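 The per-batch behaviour can be sketched as follows. This is not Drill's code: a Python closure stands in for generated code, and the class and function names are invented for illustration. The schema is inferred for each batch, and the "compiled" projector is thrown away and rebuilt only when the schema differs from the previous batch.

```python
def infer_schema(batch):
    """Return ((column, type name), ...) inferred from the records in a batch."""
    schema = {}
    for record in batch:
        for name, value in record.items():
            schema.setdefault(name, type(value).__name__)
    return tuple(sorted(schema.items()))


class StarProjector:
    """Expands `select *` at execution time, one batch at a time."""

    def __init__(self):
        self.schema = None
        self.compiled = None
        self.compilations = 0

    def _compile(self, schema):
        # Stand-in for code generation specialised to this batch's columns.
        columns = [name for name, _ in schema]
        self.compilations += 1
        return lambda record: tuple(record.get(c) for c in columns)

    def process(self, batch):
        schema = infer_schema(batch)
        if schema != self.schema:  # schema changed: discard the old code
            self.schema = schema
            self.compiled = self._compile(schema)
        columns = [name for name, _ in schema]
        return columns, [self.compiled(r) for r in batch]


projector = StarProjector()
cols1, rows1 = projector.process([{"name": "Ken", "age": 23}])
cols2, rows2 = projector.process([{"name": "Ann", "age": 25}])
compilations_after_two = projector.compilations
cols3, rows3 = projector.process([{"name": "Stacy", "age": 31, "starSign": "taurus"}])
```

 The first two batches share one compiled projector; the third batch introduces starSign, so the column list grows and the projector is rebuilt.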
  • 27. User-defined operators
 A table function is a Java UDF that returns a relation.
 • Its arguments may be relations or scalars.
 • It appears in the execution plan.
 • Annotations indicate whether it is safe to push filters and projects through it.
 A table macro is a Java function that takes a parse tree and returns a parse tree.
 • Named after Lisp macros.
 • It does not appear in the execution plan.
 • Views (next slide) are a kind of table macro.
 Use a table macro rather than a table function, if possible, so that existing optimizations are re-used.
 select e.name
 from table(
 my_sample(
 select * from Employees, 0.15)) as e
 select e.name
 from table(
 my_filter(
 select * from Employees, 'age', '<', 30)) as e
  • 28. Views
 create view Managers as
 select *
 from Emps as e
 where exists (
 select *
 from Emps as underling
 where underling.manager = e.id)
 select deptno, min(salary)
 from Managers
 where age >= 50
 group by deptno
 Plan operators (view definition): Scan [Emps], Scan [Emps], Aggregate [manager], Join [$0, $5], Project [$0, $1, $2, $3]
 Plan operators (query): Scan [Managers], Filter [age >= 50], Aggregate [deptno, min(salary)]
  • 29. Views (after expansion)
 create view Managers as
 select *
 from Emps as e
 where exists (
 select *
 from Emps as underling
 where underling.manager = e.id)
 select deptno, min(salary)
 from Managers
 where age >= 50
 group by deptno
 Plan operators: Scan [Emps], Scan [Emps], Aggregate [manager], Join [$0, $5], Project [$0, $1, $2, $3], Filter [age >= 50], Aggregate [deptno, min(salary)]
  • 30. Views (after pushing down filter)
 create view Managers as
 select *
 from Emps as e
 where exists (
 select *
 from Emps as underling
 where underling.manager = e.id)
 select deptno, min(salary)
 from Managers
 where age >= 50
 group by deptno
 Plan operators: Scan [Emps], Scan [Emps], Join [$0, $5], Project [$0, $1, $2, $3], Filter [age >= 50], Aggregate [deptno, min(salary)]
  • 31. Materialized view
 create materialized view
 EmpSummary as
 select deptno,
 gender,
 count(*) as c,
 sum(sal)
 from Emps
 group by deptno, gender
 select count(*) as c
 from Emps
 where deptno = 10
 and gender = 'M'
 Materialized view plan: Scan [Emps] → Aggregate [deptno, gender, COUNT(*), SUM(sal)] = Scan [EmpSummary]
 Query plan: Scan [Emps] → Filter [deptno = 10 AND gender = 'M'] → Aggregate [COUNT(*)]
  • 32. Materialized view, step 2: Rewrite query to match
 create materialized view
 EmpSummary as
 select deptno,
 gender,
 count(*) as c,
 sum(sal)
 from Emps
 group by deptno, gender
 select count(*) as c
 from Emps
 where deptno = 10
 and gender = 'M'
 Materialized view plan: Scan [Emps] → Aggregate [deptno, gender, COUNT(*), SUM(sal)] = Scan [EmpSummary]
 Query plan: Scan [Emps] → Aggregate [deptno, gender, COUNT(*) AS c, SUM(sal) AS s] → Filter [deptno = 10 AND gender = 'M'] → Project [c]
  • 33. Materialized view, step 3: substitute table
 create materialized view
 EmpSummary as
 select deptno,
 gender,
 count(*) as c,
 sum(sal)
 from Emps
 group by deptno, gender
 select count(*) as c
 from Emps
 where deptno = 10
 and gender = 'M'
 Materialized view plan: Scan [Emps] → Aggregate [deptno, gender, COUNT(*), SUM(sal)] = Scan [EmpSummary]
 Query plan: Scan [EmpSummary] → Filter [deptno = 10 AND gender = 'M'] → Project [c]
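 The three steps can be checked on a handful of rows. The sketch below builds EmpSummary from a made-up Emps table, then answers the query once from the base table and once from the summary. Row values and function names are hypothetical.

```python
from collections import defaultdict

emps = [  # hypothetical rows
    {"deptno": 10, "gender": "M", "sal": 100},
    {"deptno": 10, "gender": "M", "sal": 150},
    {"deptno": 10, "gender": "F", "sal": 200},
    {"deptno": 20, "gender": "M", "sal": 120},
]


def build_emp_summary(rows):
    """Scan [Emps] -> Aggregate [deptno, gender, COUNT(*) AS c, SUM(sal) AS s]."""
    acc = defaultdict(lambda: [0, 0])
    for r in rows:
        cell = acc[(r["deptno"], r["gender"])]
        cell[0] += 1
        cell[1] += r["sal"]
    return [{"deptno": d, "gender": g, "c": c, "s": s}
            for (d, g), (c, s) in acc.items()]


def count_from_base(rows):
    """Scan [Emps] -> Filter -> Aggregate [COUNT(*)]."""
    return sum(1 for r in rows if r["deptno"] == 10 and r["gender"] == "M")


def count_from_summary(summary):
    """Scan [EmpSummary] -> Filter -> Project [c]."""
    return [r["c"] for r in summary
            if r["deptno"] == 10 and r["gender"] == "M"]


emp_summary = build_emp_summary(emps)
from_base = count_from_base(emps)
from_summary = count_from_summary(emp_summary)
```

 The summary has three rows instead of four and yields the same count. One caveat of this sketch: if no employee matched, the base query would return a single row containing 0 while the filtered summary would return no rows, so a real rewrite has to account for that case.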
  • 34. Streaming queries
 Streaming queries run forever.
 The stream appears in the FROM clause: Orders.
 select stream *
 from Orders
 Without the stream keyword, Orders means the history of the stream (a table).
 select *
 from Orders
 Calcite streaming SQL: in Samza, Storm, Flink.
  • 35. Combining past and future
 • Orders is used as both stream and table
 • System determines where to find the records
 • Query is invalid if records are not available
 select stream *
 from Orders as o
 where units > (
 select avg(units)
 from Orders as h
 where h.productId = o.productId
 and h.rowtime >
 o.rowtime - interval '1' year)
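 One way to read this query operationally is sketched below: for each arriving order, keep a per-product window of the last year of orders and emit the order if its units exceed the window average. This is an interpretation, not how any particular engine executes it. It assumes orders arrive in rowtime order, that the history includes the current order, and that a year is 365 days; the names are hypothetical.

```python
from collections import defaultdict, deque

YEAR = 365 * 24 * 3600  # seconds; simplification of `interval '1' year`


def above_average(orders):
    """Yield orders whose units exceed the trailing one-year average
    for the same product. `orders` must be sorted by rowtime."""
    history = defaultdict(deque)  # productId -> deque of (rowtime, units)
    for o in orders:
        window = history[o["productId"]]
        window.append((o["rowtime"], o["units"]))
        # Keep only rows with h.rowtime > o.rowtime - 1 year.
        while window[0][0] <= o["rowtime"] - YEAR:
            window.popleft()
        average = sum(units for _, units in window) / len(window)
        if o["units"] > average:
            yield o


orders = [
    {"rowtime": 0, "productId": 1, "units": 10},
    {"rowtime": 10, "productId": 1, "units": 20},
    {"rowtime": 20, "productId": 2, "units": 5},
    {"rowtime": YEAR + 10, "productId": 1, "units": 12},
    {"rowtime": YEAR + 20, "productId": 1, "units": 30},
]
emitted = list(above_average(orders))
```

 Only a bounded window per product has to be retained, which is what makes the "history" part of the query answerable while the stream keeps running.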
  • 36. Hybrid systems
 Hybrid systems combine more than one data source and/or engine.
 Examples:
 • Splunk join to MySQL
 • User-defined table written in Python reading from an in-memory temporary table just created by Drill.
 • Streaming query populating a table summarizing the last hour's activity that will be used to populate a pie chart in a web dashboard.
 Two challenges:
 • Planning the query to take advantage of each system's strengths.
 • Efficient interchange of data at run time.
  • 37. Hybrid algebra, hybrid run-time
 Diagram: Calcite, Drill, Arrow, Ibis, Impala, Kudu, Splunk, Cassandra, JDBC, MongoDB, JDBC, Spark, Flink
  • 38. Arrow in a Slide
 • New top-level Apache Software Foundation project
   – Announced Feb 17, 2016
 • Focused on Columnar In-Memory Analytics
   1. 10-100x speedup on many workloads
   2. Common data layer enables companies to choose best of breed systems
   3. Designed to work with any programming language
   4. Support for both relational and complex data as-is
 • Developers from 13+ major open source projects involved
   – A significant % of the world's data will be processed through Arrow!
 Projects: Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R
  • 39. Focus on CPU Efficiency
 Diagram: Traditional Memory Buffer vs. Arrow Memory Buffer
 • Cache Locality
 • Super-scalar & vectorized operation
 • Minimal Structure Overhead
 • Constant value access
   – With minimal structure overhead
 • Operate directly on columnar compressed data
  • 40. High Performance Sharing & Interchange
 Today:
 • Each system has its own internal memory format
 • 70-80% CPU wasted on serialization and deserialization
 • Similar functionality implemented in multiple projects
 With Arrow:
 • All systems utilize the same memory format
 • No overhead for cross-system communication
 • Projects can share functionality (e.g., Parquet-to-Arrow reader)
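 The idea of a shared columnar buffer can be illustrated with the Python standard library alone. This is an analogy, not the Arrow format or the Arrow API: a typed `array` stands in for a column, and a `memoryview` stands in for a second system reading the same memory without copying or serializing it.

```python
from array import array

# Row-oriented records, as a traditional engine might hold them.
rows = [("Bob", 48), ("Stacy", 31), ("Ken", 23)]

# Columnar layout: one contiguous, typed buffer per column.
names = [name for name, _ in rows]
ages = array("q", (age for _, age in rows))  # signed 64-bit integers

# A second "system" receives a view of the same buffer: no copy is made.
shared = memoryview(ages)
total_age = sum(shared)
buffer_bytes = shared.nbytes
```

 Because `shared` is only a view, a value written through `ages` is immediately visible through `shared`; nothing is re-encoded on the way.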
  • 41. Summary
 • Algebra-centric approach
 • Optimize by applying transformation rules
 • User-defined operators (table functions, table macros, custom RelNode classes)
 • Complex data
 • Late-schema queries
 • Streaming queries
 • Calcite enables planning hybrid queries
 • Arrow enables hybrid runtime
  • 42. Thanks!
 @julianhyde @tshiran @ApacheCalcite @ApacheDrill @ApacheArrow
 Get involved:
 • http://calcite.apache.org
 • http://drill.apache.org
 • http://arrow.apache.org