Coral-and-Transport_Portable-SQL-and-UDFs-for-the-Interoperability-of-Spark-and-Other-Engines.pdf

Coral & Transport
Portable SQL & UDFs for the interoperability of Spark and other engines
Walaa Eldin Moustafa
Senior Staff Software Engineer, LinkedIn
Wenye Zhang
Senior Software Engineer, LinkedIn

Modern Data Lake Architectures
Variety of query engines

Variety of query languages
• Spark SQL
• Hive QL
• Presto SQL
• Trino SQL
• Flink SQL
• Other: Gremlin, SPARQL, Spark Scala, PySpark

Variety of Data Sources
Tables
• Hive tables
• Delta Lake tables
• Iceberg tables
• Hudi tables
• Various file formats
• Avro
• ORC
• Parquet
Views
• Different query languages
• Different UDF APIs

Even more data sources..
[Figure: diagram with the labels API, SQL, Data, and SQL-aware data source]

Composable Data Architectures
[Figure: layered stack with the labels Query Exec, Table Formats, and Storage]

But not quite there yet..
[Figure: diagram with the labels Query Exec, Query Languages, Query rewrite rules, and View Catalogs]

Logic interoperability
Common representation to capture
• Different SQL dialects
• View definitions
• Different engine plan representations
• SQL pushdown between engines
• Common query transformations
Adapters to transform
• From an input representation
• To an output representation

Coral
Common representation to capture
• Different SQL dialects
• View definitions
• Different engine plan representations
• SQL pushdown between engines
• Common query transformations
Adapters to transform
• From an input representation
• To an output representation

Transport
Common API to express
• UDF semantics
• Type validation and inference
Adapters to transform
• To any engine UDF

Coral
• Open-source project since 2020
• https://github.com/linkedin/coral
• Extends Calcite logical plan to represent logic
• Intermediate representation called Coral IR

Coral
IR, Transformations
• Coral IR captures query semantics using standard operators
• Supported Transformations
• Hive QL (optionally Spark SQL) to Coral IR
• Trino SQL to Coral IR (WIP)
• Coral IR to Trino SQL
• Coral IR to Spark SQL (optionally Hive QL)
• Coral IR to Avro schema
[Figure: Coral IR operator tree built from joins (⋈) and a selection (σ) over relations R, S, and T]

Example
Spark SQL
Example Query
SELECT instr(R.x[0], 'foo')
FROM R
WHERE ! y
Operators
• instr(a, b): returns index of b in a
• x[i]: returns element i in array x, 0-based index
• ! y: negates y

Example
Trino SQL
Example Query
SELECT strpos(element_at(R.x, 1), 'foo')
FROM R
WHERE NOT y
Operators
• strpos(a, b): returns index of b in a
• element_at(x, i): returns element i in array x, 1-based index
• NOT y: negates y

Transformations
Spark SQL to Coral IR conversion
Spark SQL       Coral IR
instr(x, y)     instr(x, y)
x[i]            x[i+1]
!x              NOT x

Transformations
Coral IR to Trino SQL conversion
Coral IR        Trino SQL
instr(x, y)     strpos(x, y)
x[i]            element_at(x, i)
NOT x           NOT x
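The two conversion tables above can be read as a pair of rewrite passes around a shared IR. The following is a minimal Python sketch of that idea on a toy expression tree. The node classes (Col, Lit, Call, Index, Not) and the functions spark_to_ir and ir_to_trino are hypothetical names made up for illustration; they are not Coral's API, which builds on Calcite. Only the three rules from the tables are implemented.

```python
from dataclasses import dataclass
from typing import Tuple, Union


# Toy expression nodes (hypothetical, for illustration only).
@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class Lit:
    value: Union[int, str]


@dataclass(frozen=True)
class Call:
    func: str
    args: Tuple["Expr", ...]


@dataclass(frozen=True)
class Index:
    # 0-based in a Spark-side tree, 1-based once converted to the IR.
    array: "Expr"
    index: "Expr"


@dataclass(frozen=True)
class Not:
    operand: "Expr"


Expr = Union[Col, Lit, Call, Index, Not]


def spark_to_ir(e):
    """Spark SQL tree -> IR tree: instr stays, x[i] -> x[i+1], !x -> NOT x."""
    if isinstance(e, Call):
        return Call(e.func, tuple(spark_to_ir(a) for a in e.args))
    if isinstance(e, Index):
        idx = spark_to_ir(e.index)
        if isinstance(idx, Lit) and isinstance(idx.value, int):
            idx = Lit(idx.value + 1)          # fold constant indexes
        else:
            idx = Call("+", (idx, Lit(1)))    # keep i + 1 symbolic
        return Index(spark_to_ir(e.array), idx)
    if isinstance(e, Not):
        return Not(spark_to_ir(e.operand))
    return e  # columns and literals are unchanged


def ir_to_trino(e):
    """IR tree -> Trino SQL text: instr -> strpos, x[i] -> element_at(x, i)."""
    if isinstance(e, Col):
        return e.name
    if isinstance(e, Lit):
        if isinstance(e.value, int):
            return str(e.value)
        return "'" + e.value.replace("'", "''") + "'"
    if isinstance(e, Call):
        args = [ir_to_trino(a) for a in e.args]
        if e.func == "+":
            return "(" + args[0] + " + " + args[1] + ")"
        name = {"instr": "strpos"}.get(e.func, e.func)
        return name + "(" + ", ".join(args) + ")"
    if isinstance(e, Index):
        return "element_at(" + ir_to_trino(e.array) + ", " + ir_to_trino(e.index) + ")"
    if isinstance(e, Not):
        return "NOT " + ir_to_trino(e.operand)
    raise TypeError("unsupported node: " + repr(e))


# The Spark SQL example: instr(R.x[0], 'foo') with filter ! y
select_expr = Call("instr", (Index(Col("R.x"), Lit(0)), Lit("foo")))
where_expr = Not(Col("y"))

print("SELECT", ir_to_trino(spark_to_ir(select_expr)))
print("WHERE", ir_to_trino(spark_to_ir(where_expr)))
```

Keeping the index shift in the Spark-to-IR pass means the IR-to-Trino pass only has to rename operators, which is the division of work the two tables suggest.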

Transformations
More complex transformations
• Lateral view joins
• User defined table functions
• Window functions
• Common table expressions

Integrations
Notable integrations
• OSS Trino
• Resolve Hive views in Trino
• LinkedIn’s fork of Spark
• Access Hive and Trino views (Trino in WIP)
• Preserve view dataframe nullability, casing through inference
• Perform schema evolution automatically
• Register view UDFs automatically
• Spark Dataset API
• Through Avro Specific record classes
• Blog post: Advanced schema management for Spark applications at scale
• https://engineering.linkedin.com/blog/2020/advanced-schema-management-for-spark

Apache Spark Integration
SPARK-31357
• Spark improvement to introduce top-level view abstractions
• ViewCatalog API
• View API
• Enable custom implementations for view SQL and schema resolution
• Envision Coral integration to Apache Spark through this API
SPIP: Catalog API for view metadata
interface View {
  /**
   * A name to identify this view.
   */
  String name();

  /**
   * The view query SQL text.
   */
  String sql();
  ...
}

Standalone mode
Coral-as-a-service
$ curl --header "Content-Type: application/json" \
    --request POST \
    --data '{
      "fromLanguage": "hive",
      "toLanguage": "trino",
      "query": "SELECT * FROM db1.airport"
    }' \
    http://localhost:8080/api/translations/translate
Try it today! https://github.com/linkedin/coral
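The same request can also be issued from a program. The sketch below builds the POST request with Python's standard library. The endpoint, field names, and query come from the curl command above; the helper names build_translation_request and translate are made up for illustration. The slides do not describe the response format, so translate simply returns the raw response text, and it assumes a Coral service is actually listening on localhost:8080.

```python
import json
import urllib.request

# Endpoint taken from the curl example; adjust host and port to your deployment.
CORAL_URL = "http://localhost:8080/api/translations/translate"


def build_translation_request(query, from_language="hive",
                              to_language="trino", url=CORAL_URL):
    """Build (but do not send) the JSON POST request for a translation."""
    payload = {
        "fromLanguage": from_language,
        "toLanguage": to_language,
        "query": query,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def translate(query, **kwargs):
    """Send the request and return the raw response body as text.

    Requires a running service; the response shape is not specified
    in the slides, so no parsing is attempted here.
    """
    request = build_translation_request(query, **kwargs)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With the service running, translate("SELECT * FROM db1.airport") would send the same request as the curl command.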

Future Extensions
• Spark Catalyst plan to Coral IR
• POC in Coral-Spark-Plan
• Enables translation of all Spark APIs
• Scala
• Java
• Python
• Common query rewrites
• Materialized view substitution
• Incremental view maintenance
• Data governance (e.g., automatic obfuscation of PII)

Future Extensions
SPARK-37960
• Spark data source integration
• Push functions to data sources
• Delta Lake
• Iceberg
• Push SQL expressions to SQL data sources
• Trino
• Presto
• Pinot

Transport
Translatable, Portable UDFs
Motivation
• SQL has a well-understood IR: relational algebra
• Scan, Filter, Project, Join, Group By, etc.
• UDFs
• Opaque
• Written in imperative languages
• Not portable or translatable

UDF Denormalization
Duplication
Multiple versions of the same UDF. Not clear which is the source of truth.
Inconsistency
Duplicate implementations can diverge, causing data inconsistency.
Low Productivity
Developers need to learn multiple APIs and implement the same logic multiple times.
Low Performance
In some cases, tuple conversion adapters are used to enable portability, which adds overhead.

UDF APIs
• API Complexity
• APIs expose low-level details of engines
• Data types may not map intuitively to the SQL type system
• API Disparity
• APIs differ in what they expect from the developer
• APIs differ in the features they can provide

Then What?
> gradle build
> ls build/my-udfs/libs
my-udfs-trino.jar
my-udfs-hive.jar
my-udfs-spark.jar

Auto-generated UDFs
[Figure: User-defined Transport UDF → Transport Gradle Plugin (Code Analysis – Metadata Generation) → Autogenerated Engine UDFs (Trino, Hive, Spark, …) → Autogenerated UDF JARs (Trino, Hive, Spark, …)]

Architecture
[Figure: architecture diagram with the labels Engine, Engine-specific Autogenerated UDF Wrapper, User-defined Transport UDF, and Engine-specific Data and Types, with numbered callouts 1 to 3]
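The layering in this diagram can be illustrated with a small wrapper-pattern sketch. This is not the Transport API: Transport UDFs are written against its own API and the wrappers are generated by the Gradle plugin, whereas every class and method name below is hypothetical. The sketch only shows the idea that UDF logic is written once against an engine-neutral data interface, while thin per-engine wrappers hand it views over the engine's native values.

```python
from abc import ABC, abstractmethod


class StdString(ABC):
    """Engine-neutral view of a string value (hypothetical interface)."""

    @abstractmethod
    def get(self) -> str: ...


class StdFactory(ABC):
    """Creates engine-native values for UDF results (hypothetical)."""

    @abstractmethod
    def create_string(self, value: str) -> StdString: ...


class ReverseUdf:
    """User-defined UDF: written once, knows nothing about any engine."""

    def __init__(self, factory: StdFactory):
        self.factory = factory

    def eval(self, s: StdString) -> StdString:
        return self.factory.create_string(s.get()[::-1])


# Toy engine A: strings are UTF-8 bytes.
class BytesString(StdString):
    def __init__(self, raw: bytes):
        self.raw = raw

    def get(self) -> str:
        return self.raw.decode("utf-8")


class BytesFactory(StdFactory):
    def create_string(self, value: str) -> StdString:
        return BytesString(value.encode("utf-8"))


class BytesEngineWrapper:
    """Stands in for an autogenerated wrapper for engine A."""

    def __init__(self, udf_class):
        self.udf = udf_class(BytesFactory())

    def call(self, raw: bytes) -> bytes:
        # Wrap the native value, run the shared logic, unwrap the result.
        return self.udf.eval(BytesString(raw)).raw


# Toy engine B: strings are plain Python str objects.
class PlainString(StdString):
    def __init__(self, value: str):
        self.value = value

    def get(self) -> str:
        return self.value


class PlainFactory(StdFactory):
    def create_string(self, value: str) -> StdString:
        return PlainString(value)


class PlainEngineWrapper:
    """Stands in for an autogenerated wrapper for engine B."""

    def __init__(self, udf_class):
        self.udf = udf_class(PlainFactory())

    def call(self, value: str) -> str:
        return self.udf.eval(PlainString(value)).value


print(BytesEngineWrapper(ReverseUdf).call(b"coral"))
print(PlainEngineWrapper(ReverseUdf).call("coral"))
```

The same ReverseUdf class serves both toy engines; only the wrapper and the data views differ per engine.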

Conclusions
Transport UDFs API
Simple
Only implement what is needed to define logic. No boilerplate code.
Feature-rich
Declarative type signatures with generics. getRequiredFiles() support.
Translatable
Can run on multiple platforms. Code specific to platform is auto-generated.
Performant
Direct access to native platform data.
