Coral & Transport
Portable SQL & UDFs
For the interoperability of
Spark and other engines
Walaa Eldin Moustafa
Senior Staff Software Engineer, LinkedIn
Wenye Zhang
Senior Software Engineer, LinkedIn
Modern Data Lake Architectures
Variety of query engines
Modern Data Lake Architectures
Variety of query languages
• Spark SQL
• Hive QL
• Presto SQL
• Trino SQL
• Flink SQL
• Other: Gremlin, SPARQL, Spark Scala, PySpark
Modern Data Lake Architectures
Variety of Data Sources
Tables
• Hive tables
• Delta Lake tables
• Iceberg tables
• Hudi tables
• Various file formats
  • Avro
  • ORC
  • Parquet
Views
• Different query languages
• Different UDF APIs
Modern Data Lake Architectures
Even more data sources..
[Figure: additional data sources reached through APIs and SQL; labels: API, SQL, Data, SQL-aware data source]
Composable Data Architectures
[Figure: layers of a composable data stack: Query Exec, Table Formats, Storage]
Composable Data Architectures
But not quite there yet..
[Figure: layers that are not yet composable: Query Exec, Query Languages, Query rewrite rules, View Catalogs]
Composable Data Architectures
Logic interoperability
Common representation to capture
• Different SQL dialects
• View definitions
• Different engine plan representations
• SQL pushdown between engines
• Common query transformations
Adapters to transform
• From an input representation
• To an output representation
Composable Data Architectures
Coral
Common representation to capture
• Different SQL dialects
• View definitions
• Different engine plan representations
• SQL pushdown between engines
• Common query transformations
Adapters to transform
• From an input representation
• To an output representation
Composable Data Architectures
Transport
Common API to express
• UDF semantics
• Type validation and inference
Adapters to transform
• To any engine UDF
Coral
• Open-source project since 2020
• https://github.com/linkedin/coral
• Extends Calcite logical plan to represent logic
• Intermediate representation called Coral IR
Coral
• Coral IR captures query semantics using standard operators
• Supported Transformations
• Hive QL (optionally Spark SQL) to Coral IR
• Trino SQL to Coral IR (WIP)
• Coral IR to Trino SQL
• Coral IR to Spark SQL (optionally Hive QL)
• Coral IR to Avro schema
IR, Transformations
[Figure: Coral IR shown as a relational operator tree with joins (⋈) and a selection (σ) over tables R, S, and T]
Example
Spark SQL
Example Query
SELECT instr(R.x[0], 'foo')
FROM R
WHERE ! y
Operators
• instr(a, b): returns the 1-based position of b in a (0 if not found)
• x[i]: returns element i in array x, 0-based index
• ! y: negates y
Example
Trino SQL
Example Query
SELECT strpos(element_at(R.x, 1), 'foo')
FROM R
WHERE NOT y
Operators
• strpos(a, b): returns the 1-based position of b in a (0 if not found)
• element_at(x, i): returns element i in array x, 1-based index
• NOT y: negates y
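To see why the two example queries are equivalent, the operator semantics can be emulated in a few lines of Python. This is only an illustrative sketch: the sample rows and the helper names `spark_index`, `run_spark`, and `run_trino` are made up here, and NULL handling and negative indexes are ignored.

```python
def instr(s: str, sub: str) -> int:
    """Spark/Hive instr: 1-based position of sub in s, 0 if absent."""
    return s.find(sub) + 1


# Trino's strpos follows the same 1-based, 0-if-absent convention.
strpos = instr


def spark_index(arr, i):
    """Spark x[i]: 0-based element access."""
    return arr[i]


def element_at(arr, i):
    """Trino element_at(x, i): 1-based element access (positive i only here)."""
    return arr[i - 1]


# Hypothetical contents of table R: x is an array of strings, y a boolean.
rows = [
    {"x": ["a foo", "bar"], "y": False},
    {"x": ["foo", "baz"], "y": True},
    {"x": ["nothing", "foo"], "y": False},
]


def run_spark(rows):
    # SELECT instr(R.x[0], 'foo') FROM R WHERE ! y
    return [instr(spark_index(r["x"], 0), "foo") for r in rows if not r["y"]]


def run_trino(rows):
    # SELECT strpos(element_at(R.x, 1), 'foo') FROM R WHERE NOT y
    return [strpos(element_at(r["x"], 1), "foo") for r in rows if not r["y"]]


print(run_spark(rows), run_trino(rows))
```

Both functions select the first array element of every row where y is false, so they return the same list even though one dialect indexes from 0 and the other from 1.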
Transformations
Spark SQL to Coral IR conversion
Spark SQL      Coral IR
instr(x, y)    instr(x, y)
x[i]           x[i+1]
!x             NOT x
Transformations
Coral IR to Trino SQL conversion
Coral IR       Trino SQL
instr(x, y)    strpos(x, y)
x[i]           element_at(x, i)
NOT x          NOT x
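The two conversion tables can be read as a pair of rewrite passes over an expression tree. The following is a toy sketch of that idea, not Coral's actual API or IR: the `Col`, `Lit`, and `Call` node classes and the operator names `index` and `item` are assumptions invented for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class Lit:
    value: object


@dataclass(frozen=True)
class Call:
    op: str
    args: tuple


def shift(idx, n):
    """Add n to an index expression, folding integer literals."""
    if isinstance(idx, Lit) and isinstance(idx.value, int):
        return Lit(idx.value + n)
    return Call("+", (idx, Lit(n)))


def spark_to_ir(e):
    """Spark SQL expression -> toy IR (1-based array access, NOT for negation)."""
    if not isinstance(e, Call):
        return e
    args = tuple(spark_to_ir(a) for a in e.args)
    if e.op == "index":  # Spark x[i] is 0-based, so x[i] becomes x[i+1]
        arr, idx = args
        return Call("item", (arr, shift(idx, 1)))
    if e.op == "!":
        return Call("NOT", args)
    return Call(e.op, args)  # e.g. instr stays instr


def ir_to_trino(e) -> str:
    """Toy IR -> Trino SQL text."""
    if isinstance(e, Col):
        return e.name
    if isinstance(e, Lit):
        return repr(e.value) if isinstance(e.value, str) else str(e.value)
    args = [ir_to_trino(a) for a in e.args]
    if e.op == "instr":
        return f"strpos({args[0]}, {args[1]})"
    if e.op == "item":
        return f"element_at({args[0]}, {args[1]})"
    if e.op == "NOT":
        return f"NOT {args[0]}"
    if e.op == "+":
        return f"({args[0]} + {args[1]})"
    raise ValueError(f"unsupported operator: {e.op}")


# instr(R.x[0], 'foo') from the Spark SQL example
spark_expr = Call("instr", (Call("index", (Col("R.x"), Lit(0))), Lit("foo")))
print(ir_to_trino(spark_to_ir(spark_expr)))
print(ir_to_trino(spark_to_ir(Call("!", (Col("y"),)))))
```

Keeping the index shift in the input adapter and the function renaming in the output adapter means each dialect only has to agree with the intermediate form, not with every other dialect.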
Transformations
More complex transformations
• Lateral view joins
• User defined table functions
• Window functions
• Common table expressions
Integrations
Notable integrations
• OSS Trino
• Resolve Hive views in Trino
• LinkedIn’s fork of Spark
• Access Hive and Trino views (Trino in WIP)
• Preserve view dataframe nullability, casing through inference
• Perform schema evolution automatically
• Register view UDFs automatically
• Spark Dataset API
• Through Avro Specific record classes
• Blog post: Advanced schema management for Spark applications at scale
• https://engineering.linkedin.com/blog/2020/advanced-schema-management-for-spark
Apache Spark Integration
SPARK-31357
• Spark improvement to introduce top-level view abstractions
• ViewCatalog API
• View API
• Enable custom implementations for view SQL and schema resolution
• Envision Coral integration to Apache Spark through this API
SPIP: Catalog API for view metadata
interface View {
  /**
   * A name to identify this view.
   */
  String name();

  /**
   * The view query SQL text.
   */
  String sql();
  ...
}
Standalone mode
Coral-as-a-service
$ curl --header "Content-Type: application/json" \
  --request POST \
  --data '{
    "fromLanguage": "hive",
    "toLanguage": "trino",
    "query": "SELECT * FROM db1.airport"
  }' http://localhost:8080/api/translations/translate
Try it today! https://github.com/linkedin/coral
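The same request can be issued from a program. Below is a minimal sketch using only the Python standard library; it assumes a Coral service is running locally on port 8080 as in the curl example. The function names are hypothetical, and since the slide does not show the response format, the body is returned as raw text.

```python
import json
import urllib.request


def build_translation_request(query, from_language="hive", to_language="trino",
                              base_url="http://localhost:8080"):
    """Build (but do not send) the POST request shown in the curl example."""
    payload = {
        "fromLanguage": from_language,
        "toLanguage": to_language,
        "query": query,
    }
    return urllib.request.Request(
        url=f"{base_url}/api/translations/translate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def translate(query, **kwargs):
    """Send the request and return the raw response body as text."""
    request = build_translation_request(query, **kwargs)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


request = build_translation_request("SELECT * FROM db1.airport")
print(request.get_method(), request.full_url)
```

Calling `translate("SELECT * FROM db1.airport")` performs the actual network call and only succeeds when the service is up.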
Future Extensions
• Spark catalyst plan to Coral IR
• POC in Coral-Spark-Plan
• Enables translation of all Spark APIs
• Scala
• Java
• Python
• Common query rewrites
• Materialized view substitution
• Incremental view maintenance
• Data governance (e.g., automatic obfuscation of PII)
Future Extensions
SPARK-37960
• Spark data source integration
• Push functions to data sources
• Delta Lake
• Iceberg
• Push SQL expressions to SQL data sources
• Trino
• Presto
• Pinot
Transport
Translatable, Portable UDFs
Motivation
• SQL has a well-understood IR: relational algebra
  • Scan, Filter, Project, Join, Group By, etc.
• UDFs
  • Opaque
  • Use imperative language
  • Not portable or translatable
UDF Denormalization
Duplication: Multiple versions of the same UDF. Not clear which is the source of truth.
Inconsistency: Duplicate implementations can diverge, causing data inconsistency.
Low Productivity: Developers need to learn multiple APIs and implement the same logic multiple times.
Low Performance: In some cases, tuple conversion adapters are used to enable portability.
A UDF Primer
UDFs 101
Example Hive UDF
Example Trino UDF
UDF APIs
• API Complexity
• APIs expose low-level details of engines
• Data types may not intuitively map to SQL type-system
• API Disparity
• APIs differ in what to expect from developer
• APIs differ in features they can provide
Transport UDFs
Then What?
> gradle build
> ls build/my-udfs/libs
my-udfs-trino.jar
my-udfs-hive.jar
my-udfs-spark.jar
Auto-generated UDFs
Transport Gradle Plugin
Code Analysis – Metadata Generation
Autogenerated Engine UDFs
Trino Hive Spark …
Autogenerated UDF JARs
Trino Hive Spark …
User-defined Transport UDF
Architecture
[Figure: Engine, engine-specific autogenerated UDF wrapper, user-defined Transport UDF, and engine-specific data and types, with steps numbered 1 to 3]
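The wrapper idea in this architecture can be sketched in Python. This is not the Transport API (which is Java, with wrappers generated by the Gradle plugin); every class and function name below is hypothetical, the two "engines" are invented data layouts, and the wrappers are written by hand purely for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any


class ArrayView(ABC):
    """Engine-neutral view over an array; the UDF logic only sees this."""

    @abstractmethod
    def size(self) -> int: ...

    @abstractmethod
    def get(self, i: int) -> Any: ...


class ListArrayView(ArrayView):
    """Hypothetical engine A: arrays are plain Python lists."""

    def __init__(self, native: list):
        self._native = native

    def size(self) -> int:
        return len(self._native)

    def get(self, i: int) -> Any:
        return self._native[i]


class SliceArrayView(ArrayView):
    """Hypothetical engine B: an array is a slice of a shared value buffer."""

    def __init__(self, values: list, offset: int, length: int):
        self._values, self._offset, self._length = values, offset, length

    def size(self) -> int:
        return self._length

    def get(self, i: int) -> Any:
        if not 0 <= i < self._length:
            raise IndexError(i)
        return self._values[self._offset + i]


def first_matching_index(arr: ArrayView, needle: Any) -> int:
    """User-defined logic, written once: 0-based index of needle, or -1."""
    for i in range(arr.size()):
        if arr.get(i) == needle:
            return i
    return -1


# Engine-specific wrappers: wrap native data in a view, then call the shared logic.
def engine_a_udf(native_list, needle):
    return first_matching_index(ListArrayView(native_list), needle)


def engine_b_udf(values, offset, length, needle):
    return first_matching_index(SliceArrayView(values, offset, length), needle)


print(engine_a_udf(["a", "b", "c"], "c"))
print(engine_b_udf(["x", "a", "b", "c", "y"], 1, 3, "c"))
```

Because each view reads the engine's native representation in place, the shared logic runs without first copying rows into a common tuple format, which is the cost the "Low Performance" point on the UDF Denormalization slide refers to.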
Conclusions
Transport UDFs API
Simple: Only implement what is needed to define logic. No boilerplate code.
Feature-rich: Declarative type signatures with generics. getRequiredFiles() support.
Translatable: Can run on multiple platforms. Code specific to platform is auto-generated.
Performant: Direct access to native platform data.
