Brought to you by
High-speed Database Throughput Using Apache Arrow Flight SQL
Kyle Porter
Architect at Dremio
James Duong
Architect at Dremio
Introduction to Arrow Flight
Introduction to Apache Arrow
■ A columnar, in-memory data format and supporting libraries
■ Supported in many languages including C++, Java, Python, Go
■ Data is strongly typed. Each row has the same schema.
■ Includes libraries for working with the format:
● Computation engine (Acero) utilizing SIMD operations for vectorized data analysis.
● Interprocess communication.
● Serialization / deserialization from file formats.
■ Fully open source with a permissive license.
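To make the columnar, strongly typed layout concrete, here is a minimal plain-Python sketch. It uses only the standard library `array` module as a stand-in and is not Arrow's API; the field names `id` and `price` are invented for illustration.

```python
from array import array

# Row-oriented: one tuple per record, values of different types interleaved.
rows = [(1, 9.5), (2, 20.0), (3, 0.5)]

# Column-oriented: one typed, contiguous buffer per field.
# 'q' = signed 64-bit integers, 'd' = 64-bit floats.
columns = {
    "id": array("q", (r[0] for r in rows)),
    "price": array("d", (r[1] for r in rows)),
}

# An aggregate touches a single contiguous buffer instead of every record.
total_price = sum(columns["price"])

# Strong typing: a value of the wrong type is rejected by the column.
try:
    columns["id"].append("four")
    type_checked = False
except TypeError:
    type_checked = True

print(total_price, type_checked)
```

Arrow applies the same idea with its own typed arrays and schemas, so every record batch of a table shares one schema.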
Apache Arrow Adoption
■ Arrow powers dozens of open source & commercial technologies
■ 10+ programming languages supported
■ >70M downloads per month
Why is Arrow Flight Needed?
■ An open protocol that the community can support.
■ Designed for data in the modern world
● Older protocols are row oriented and geared towards large numbers of columns and low
numbers of rows.
● Arrow’s columnar format is oriented towards high compressibility and large numbers of rows.
■ Supports distributed computing as a client-side concept
● A data request can return multiple endpoints to a client.
● The client can retrieve from each endpoint in parallel.
Arrow Way: Data is sent, transported and
received in the Arrow format
Arrow Flight
■ Protocol for serialization-free transport of Arrow data
● This is particularly efficient if the client application will just work with Arrow data directly.
[Figure: two data paths between a column-based database and a column-based client]
Status quo (JDBC/ODBC connector): column based → convert → row based → convert → column based, serializing/deserializing data at each step.
Arrow Flight connector: column based → column based, transporting data in Arrow format.
Distributed Computing:
Single Node with Arrow Flight
[Figure: a client and a single coordinator/executor node, each with CPU and memory]
1 - Client to server: GetFlightInfo(<query>)
2 - Server to client: FlightInfo<Schema, Endpoints>, where Endpoint = {location, ticket}
3 - Client to server: DoGet(<ticket>)
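The three-step exchange can be sketched with an in-process toy server. This is an illustration in plain Python, not the Flight library: `ToyServer`, the canned batches and the ticket format are all hypothetical, and only the names `FlightInfo`, `Endpoint`, `location` and `ticket` come from the slide.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List


@dataclass(frozen=True)
class Endpoint:
    location: str  # where to fetch from
    ticket: bytes  # opaque token identifying the result stream


@dataclass(frozen=True)
class FlightInfo:
    schema: List[str]
    endpoints: List[Endpoint]


class ToyServer:
    """In-process stand-in for a coordinator/executor node (hypothetical)."""

    def __init__(self) -> None:
        self._results: Dict[bytes, List[dict]] = {}

    def get_flight_info(self, query: str) -> FlightInfo:
        # Steps 1-2: "plan" the query and hand back a schema plus a ticket.
        ticket = f"ticket-{len(self._results)}".encode()
        self._results[ticket] = [{"n": [1, 2]}, {"n": [3]}]  # canned batches
        return FlightInfo(schema=["n"], endpoints=[Endpoint("node-1", ticket)])

    def do_get(self, ticket: bytes) -> Iterator[dict]:
        # Step 3: stream the batches identified by the ticket.
        yield from self._results[ticket]


server = ToyServer()
info = server.get_flight_info("SELECT n FROM t")
batches = list(server.do_get(info.endpoints[0].ticket))
values = [v for batch in batches for v in batch["n"]]
print(info.schema, values)
```

The client never interprets the ticket; it only passes it back to the location that issued it.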
Distributed Computing:
Multiple Nodes with Arrow Flight
[Figure: one client and Nodes 1 through N, each with CPU and memory. The client sends DoGet(<ticket>) to every node in parallel. The GetFlightInfo call is omitted from the figure.]
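A minimal sketch of retrieving from several endpoints in parallel, using a thread pool from the Python standard library. The `PARTITIONS` mapping and the `do_get` function are hypothetical stand-ins for real nodes and real DoGet calls.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions: endpoint location -> rows held by that node.
PARTITIONS = {
    "node-1": [1, 2, 3],
    "node-2": [4, 5],
    "node-3": [6],
}


def do_get(location: str) -> list:
    """Stand-in for DoGet(<ticket>) against one node."""
    return list(PARTITIONS[location])


def fetch_all(locations: list) -> list:
    # One worker per endpoint; map() yields results in the order of `locations`.
    with ThreadPoolExecutor(max_workers=len(locations)) as pool:
        parts = pool.map(do_get, locations)
        return [row for part in parts for row in part]


rows = fetch_all(["node-1", "node-2", "node-3"])
print(rows)
```

Because each endpoint is independent, the same pattern could spread the fetches across processes or client machines instead of threads.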
Arrow Flight as a Development Framework
■ Includes a fully-built client library
■ Includes a high-performance, scalable server
● Built on top of Google’s gRPC technology and compatible with existing tooling.
● Server implementation details such as thread-pooling, asynchronous IO, request cancellation
are already implemented.
■ Server deployment is a matter of implementing a few RPC request handlers.
Flight SQL Enhancements
for Arrow Flight
Why Extend Arrow Flight?
■ Client sends a byte stream, server sends a result
● The content of the byte stream is opaque in the interface.
● It only has meaning for a particular server.
● Example - Dremio interprets the byte stream to be a UTF-8 encoded SQL query.
■ Catalog information is not part of Arrow Flight’s design
● There is no RPC call to describe how to build the byte stream the client sends.
● Generic tools cannot be built.
■ Arrow Flight is meant to serve any tabular data from any source.
■ ODBC/JDBC standardize query execution and catalog access, but have
drawbacks.
■ Enter Arrow Flight SQL.
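The opacity problem can be illustrated with two toy servers that give different meanings to the command bytes. Both functions and the JSON request shape are invented for this sketch; only the UTF-8 SQL interpretation is taken from the Dremio example above.

```python
import json


def sql_server(command: bytes) -> str:
    # This server treats the bytes as UTF-8 SQL text.
    return "SQL: " + command.decode("utf-8")


def json_server(command: bytes) -> str:
    # Another server expects a JSON document with a "table" key.
    request = json.loads(command)
    return "SCAN: " + request["table"]


sql_command = "SELECT 1".encode("utf-8")
json_command = json.dumps({"table": "orders"}).encode("utf-8")

ok_sql = sql_server(sql_command)
ok_json = json_server(json_command)

# The same bytes sent to the other server are meaningless to it.
try:
    json_server(sql_command)
    mismatch_rejected = False
except json.JSONDecodeError:
    mismatch_rejected = True

print(ok_sql, ok_json, mismatch_rejected)
```

A generic tool cannot know which encoding a given server expects, which is the gap a standardized command set closes.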
What is Arrow Flight SQL?
■ Initiative to allow databases to use Arrow Flight as the transport protocol
● Leverage the performance of Arrow and Flight for database access.
■ Extended set of RPC calls to standardize a SQL interface on Flight
● Query execution
● Prepared statements
● Database catalog metadata
● SQL syntax capabilities
■ Generic client libraries
● A Flight SQL application can be used against any Flight SQL server without code changes.
● ODBC and JDBC clients provided on top.
Common Tool Workflow
[Figure: a client and a server, each with CPU and memory]
Listing tables:
1 - GetFlightInfo(GetTables)
2 - FlightInfo<Schema, Endpoints>
3 - DoGet(<ticket>)
4 - Arrow record batches
Retrieving query data:
5 - GetFlightInfo(StatementExecute)
6 - FlightInfo<Schema, Endpoints>
7 - DoGet(<ticket>)
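The workflow can be sketched as a generic tool that only knows the standardized commands. Everything here is a simplified, hypothetical model in plain Python: `ToyFlightSqlServer` keeps tables in memory, understands only `SELECT * FROM <table>`, and returns a bare ticket instead of a full FlightInfo.

```python
import itertools
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class GetTables:
    pass


@dataclass(frozen=True)
class StatementExecute:
    query: str


Command = Union[GetTables, StatementExecute]


class ToyFlightSqlServer:
    """Hypothetical in-memory server; tables map name -> list of rows."""

    def __init__(self, tables: Dict[str, List[tuple]]) -> None:
        self._tables = tables
        self._tickets: Dict[int, List[tuple]] = {}
        self._next_ticket = itertools.count()

    def get_flight_info(self, command: Command) -> int:
        # Steps 1-2 and 5-6: resolve the command and return a ticket.
        if isinstance(command, GetTables):
            result = [(name,) for name in sorted(self._tables)]
        elif isinstance(command, StatementExecute):
            # Toy "SQL": the table name is the last word of the query.
            result = list(self._tables[command.query.split()[-1]])
        else:
            raise TypeError(f"unsupported command: {command!r}")
        ticket = next(self._next_ticket)
        self._tickets[ticket] = result
        return ticket

    def do_get(self, ticket: int) -> List[tuple]:
        # Steps 3-4 and 7: return the rows behind the ticket.
        return self._tickets.pop(ticket)


def browse(server: ToyFlightSqlServer) -> Dict[str, List[tuple]]:
    """Generic tool: list the tables, then read each one."""
    names = [row[0] for row in server.do_get(server.get_flight_info(GetTables()))]
    result = {}
    for name in names:
        ticket = server.get_flight_info(StatementExecute(f"SELECT * FROM {name}"))
        result[name] = server.do_get(ticket)
    return result


server = ToyFlightSqlServer({"users": [(1, "ann")], "orders": [(7, 1)]})
print(browse(server))
```

`browse` contains no server-specific logic, so it would work unchanged against any server that implements the same two commands.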
Flight SQL vs. Legacy
Legacy (ODBC / JDBC)
■ Each database vendor must implement,
maintain, and distribute a driver.
■ Each database vendor must implement their
entire server.
■ Implementation details may be closed source.
■ Protocol is usually proprietary.
Flight SQL
■ Single client that works against any Flight SQL
server.
■ Server implementation is part of Flight. Only
RPC handlers need to be implemented.
■ Flight and Arrow components are open and the
community is actively improving them.
■ Protocol is open and integrates with gRPC and
Arrow tooling.
Flight SQL Status
■ Initial version released with Arrow 7.0.0
● Includes support for C++ and Java clients and servers
■ Enhancements to column and data type metadata have been accepted into
more recent versions of Arrow.
■ Support for transactions and query cancellation has been accepted.
■ Open for contributions
● Support for additional languages (Python, Go, C#, etc.).
● More features such as small result enhancements.
Flight SQL Status
■ JDBC Driver
● Connect legacy JDBC applications to databases with the Flight SQL protocol with no code changes.
■ Examples: DBeaver, DbVisualizer
● Merged into apache/arrow master. To be released in Arrow 10.0.0.
■ ODBC Driver
● Released by Dremio.
● Connect ODBC applications such as Tableau, pyodbc, and Power BI to Flight SQL-enabled databases.
Performance
Practical Example: pyodbc vs. PyArrow
● PyArrow is columnar
■ Consumes the columnar data returned over Arrow Flight without deserialization costs.
● pyodbc is row-oriented
■ All data values must be converted to scalars to expose them to the Python application.
■ This process incurs significant deserialization costs.
Practical Example: pyodbc vs. PyArrow
● Comparison: 500,000 rows queried from a remote server. (No parallelism).
■ pyodbc: 8.00s. PyArrow: 0.900s.
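Much of that gap comes from per-value conversion. The sketch below is a standard-library illustration of the effect, not the benchmark from the slide and not either driver's real code path: it decodes the same buffer once value by value and once by reinterpreting it as a typed column. The size `N` is arbitrary, and the timings it prints will vary by machine.

```python
import struct
import timeit
from array import array

N = 100_000
values = list(range(N))

# The same payload as a contiguous buffer of native 64-bit integers.
payload = array("q", values).tobytes()


def row_style(buf: bytes) -> list:
    # Decode one value at a time, creating a Python object per cell,
    # the way a row-oriented driver hands scalars to the application.
    return [struct.unpack_from("q", buf, off)[0] for off in range(0, len(buf), 8)]


def columnar_style(buf: bytes) -> memoryview:
    # Reinterpret the buffer as a typed column without copying it.
    return memoryview(buf).cast("q")


row_result = row_style(payload)
col_result = columnar_style(payload)

row_time = timeit.timeit(lambda: row_style(payload), number=5)
col_time = timeit.timeit(lambda: columnar_style(payload), number=5)
print(f"row-style: {row_time:.4f}s  columnar: {col_time:.6f}s")
```

Both paths expose the same values; the columnar one simply defers or avoids creating a Python object per cell.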
Query Execution: pyodbc vs. PyArrow
pyodbc (ODBC)
# connection is an already-open pyodbc connection; sql is the query text
cursor = connection.cursor()
cursor.execute(sql)
data = cursor.fetchall()
PyArrow (Arrow Flight SQL)
from pyarrow import flight
# client is a connected flight.FlightClient; headers holds any call headers (for example, authentication)
options = flight.FlightCallOptions(headers=headers)
descriptor = flight.FlightDescriptor.for_command(sql)
flight_info = client.get_flight_info(descriptor, options)
reader = client.do_get(flight_info.endpoints[0].ticket, options)
data = reader.read_all()  # read_chunk() would return only a single record batch
■ ODBC requires all data to be retrieved from a single entry point (the cursor in the above example).
■ Arrow Flight lets the server expose multiple endpoints that host separate partitions of the data. Data
can be retrieved in parallel and even from separate processes or client nodes.
Arrow Client Design Tips
■ Minimize copying of data.
■ Avoid manual calculations on data.
● Prefer library calls using the Compute library to analyze data (for
example, arithmetic or aggregation on Arrow data).
● Arrow libraries use SIMD instructions for high-performance calculations!
■ Arrow provides fast file serialization to JSON, CSV, Parquet, ORC, and
uncompressed Arrow files. Avoid serializing Arrow data by hand.
References
■ Arrow Flight SQL Announcement:
https://arrow.apache.org/blog/2022/02/16/introducing-arrow-flight-sql/
■ Arrow Flight SQL ODBC Driver: https://github.com/dremio/flightsql-odbc and
https://github.com/dremio/warpdrive
■ Arrow Flight SQL JDBC Driver:
https://github.com/apache/arrow/tree/master/java/flight/flight-sql-jdbc-driver
■ Arrow Flight SQL JDBC Driver Improvements:
https://issues.apache.org/jira/browse/ARROW-17729
Kyle Porter
kporter@dremio.com
James Duong
jduong@dremio.com

Two Leading Approaches to Data Virtualization, and Which Scales Better? by Da...Two Leading Approaches to Data Virtualization, and Which Scales Better? by Da...
Two Leading Approaches to Data Virtualization, and Which Scales Better? by Da...
ScyllaDB
 
Scaling a Beast: Lessons from 400x Growth in a High-Stakes Financial System b...
Scaling a Beast: Lessons from 400x Growth in a High-Stakes Financial System b...Scaling a Beast: Lessons from 400x Growth in a High-Stakes Financial System b...
Scaling a Beast: Lessons from 400x Growth in a High-Stakes Financial System b...
ScyllaDB
 
Object Storage in ScyllaDB by Ran Regev, ScyllaDB
Object Storage in ScyllaDB by Ran Regev, ScyllaDBObject Storage in ScyllaDB by Ran Regev, ScyllaDB
Object Storage in ScyllaDB by Ran Regev, ScyllaDB
ScyllaDB
 
Lessons Learned from Building a Serverless Notifications System by Srushith R...
Lessons Learned from Building a Serverless Notifications System by Srushith R...Lessons Learned from Building a Serverless Notifications System by Srushith R...
Lessons Learned from Building a Serverless Notifications System by Srushith R...
ScyllaDB
 
A Dist Sys Programmer's Journey into AI by Piotr Sarna
A Dist Sys Programmer's Journey into AI by Piotr SarnaA Dist Sys Programmer's Journey into AI by Piotr Sarna
A Dist Sys Programmer's Journey into AI by Piotr Sarna
ScyllaDB
 
High Availability: Lessons Learned by Paul Preuveneers
High Availability: Lessons Learned by Paul PreuveneersHigh Availability: Lessons Learned by Paul Preuveneers
High Availability: Lessons Learned by Paul Preuveneers
ScyllaDB
 
How Natura Uses ScyllaDB and ScyllaDB Connector to Create a Real-time Data Pi...
How Natura Uses ScyllaDB and ScyllaDB Connector to Create a Real-time Data Pi...How Natura Uses ScyllaDB and ScyllaDB Connector to Create a Real-time Data Pi...
How Natura Uses ScyllaDB and ScyllaDB Connector to Create a Real-time Data Pi...
ScyllaDB
 
Persistence Pipelines in a Processing Graph: Mutable Big Data at Salesforce b...
Persistence Pipelines in a Processing Graph: Mutable Big Data at Salesforce b...Persistence Pipelines in a Processing Graph: Mutable Big Data at Salesforce b...
Persistence Pipelines in a Processing Graph: Mutable Big Data at Salesforce b...
ScyllaDB
 
Database Migration Strategies and Pitfalls by Patrick Bossman
Database Migration Strategies and Pitfalls by Patrick BossmanDatabase Migration Strategies and Pitfalls by Patrick Bossman
Database Migration Strategies and Pitfalls by Patrick Bossman
ScyllaDB
 


High-speed Database Throughput Using Apache Arrow Flight SQL

  • 1. Brought to you by High-speed Database Throughput Using Apache Arrow Flight SQL Kyle Porter Architect at Dremio James Duong Architect at Dremio
  • 3. Introduction to Apache Arrow ■ A columnar, in-memory data format and supporting libraries ■ Supported in many languages including C++, Java, Python, Go ■ Data is strongly typed. Each row has the same schema. ■ Includes libraries for working with the format: ● Computation engine (Acero) utilizing SIMD operations for vectorized data analysis. ● Interprocess communication. ● Serialization / deserialization from file formats. ■ Fully open source with a permissive license.
  • 4. Arrow powers dozens of open source & commercial technologies 10+ programming languages supported
  • 6. Why is Arrow Flight Needed? ■ An open protocol that the community can support. ■ Designed for data in the modern world ● Older protocols are row oriented and geared towards large numbers of columns and low numbers of rows. ● Arrow’s columnar format is oriented towards high compressibility and large numbers of rows. ■ Supports distributed computing as a client-side concept ● A data request can return multiple endpoints to a client. ● The client can retrieve from each endpoint in parallel.
  • 7. Arrow Flight ■ Protocol for serialization-free transport of Arrow data ● This is particularly efficient if the client application will just work with Arrow data directly. [Diagram: Status Quo (JDBC/ODBC Connector): a column-based database and a column-based client, with data serialized/deserialized (converted to and from row-based form) at each step. Arrow Way (Arrow Flight Connector transporting data in Arrow format): data is sent, transported and received in the Arrow format.]
  • 8. Distributed Computing: Single Node with Arrow Flight [Diagram: CLIENT and a single Coordinator / Executor node.] 1 - GetFlightInfo(<query>) 2 - FlightInfo<Schema, Endpoints> 3 - DoGet(<ticket>) ■ Endpoint = {location, ticket}
  • 9. Distributed Computing: Multiple Nodes with Arrow Flight [Diagram: CLIENT issues a separate DoGet(<ticket>) to each of Node 1, Node 2, and Node N. Omitting GetFlightInfo call...]
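
The multi-node flow on slides 8 and 9 can be sketched on the client side roughly as follows. This is a minimal illustration, not code from the talk: `fetch_endpoint`, `fetch_all`, and the caller-supplied `connect` helper are hypothetical names. It assumes each endpoint exposes `ticket` and `locations` (as PyArrow's `FlightEndpoint` does) and that the client's `do_get(ticket)` returns a reader with `read_all()`.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_endpoint(endpoint, connect):
    """Fetch the partition behind one endpoint (one DoGet call).

    `connect` is a hypothetical caller-supplied function mapping a
    location to a Flight client. An endpoint with no locations means
    the ticket must be redeemed on the service that issued it, so
    `connect` receives None in that case.
    """
    location = endpoint.locations[0] if endpoint.locations else None
    client = connect(location)
    return client.do_get(endpoint.ticket).read_all()


def fetch_all(flight_info, connect, max_workers=4):
    """Issue DoGet against every endpoint in parallel.

    Results come back in endpoint order, one per endpoint.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(
            pool.map(lambda ep: fetch_endpoint(ep, connect), flight_info.endpoints)
        )
```

With PyArrow, `connect` could wrap `pyarrow.flight.connect(location)` and fall back to the original client when the location is None. Threads are used here only for brevity; separate processes or client nodes would also work, since each ticket can be redeemed independently.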
  • 10. Arrow Flight as a Development Framework ■ Includes a fully-built client library ■ Includes a high-performance, scalable server ● Built on top of Google’s gRPC technology and compatible with existing tooling. ● Server implementation details such as thread-pooling, asynchronous IO, request cancellation are already implemented. ■ Server deployment is a matter of implementing a few RPC request handlers.
  • 12. Why Extend Arrow Flight? ■ Client sends a byte stream, server sends a result ● The content of the byte stream is opaque in the interface. ● It only has meaning for a particular server. ● Example - Dremio interprets the byte stream to be a UTF-8 encoded SQL query. ■ Catalog information is not part of Arrow Flight’s design ● There is no RPC call to describe how to build the byte stream the client sends. ● Generic tools cannot be built. ■ Arrow Flight is meant to serve any tabular data from any source. ■ ODBC/JDBC standardize query execution and catalog access, but have drawbacks. ■ Enter Arrow Flight SQL.
  • 13. What is Arrow Flight SQL? ■ Initiative to allow databases to use Arrow Flight as the transport protocol ● Leverage the performance of Arrow and Flight for database access. ■ Extended set of RPC calls to standardize a SQL interface on Flight ● Query execution ● Prepared statements ● Database catalog metadata ● SQL syntax capabilities ■ Generic client libraries ● A Flight SQL application can be used against any Flight SQL server without code changes. ● ODBC and JDBC clients provided on top.
  • 14. Common Tool Workflow [Diagram: CLIENT and SERVER.] ■ Listing tables (GetTables, DoGet): 1 - GetFlightInfo(GetTables) 2 - FlightInfo<Schema, Endpoints> 3 - DoGet(<ticket>) 4 - Arrow record batches ■ Retrieving query data (Execute, DoGet): 5 - GetFlightInfo(StatementExecute) 6 - FlightInfo<Schema, Endpoints> 7 - DoGet(<ticket>)
  • 15. Flight SQL vs. Legacy Legacy (ODBC / JDBC) ■ Each database vendor must implement, maintain, and distribute a driver. ■ Each database vendor must implement their entire server. ■ Implementation details may be closed source. ■ Protocol is usually proprietary. Flight SQL ■ Single client that works against any Flight SQL server. ■ Server implementation is part of Flight. Only RPC handlers need to be implemented. ■ Flight and Arrow components are open and the community is actively improving them. ■ Protocol is open and integrates with gRPC and Arrow tooling.
  • 16. Flight SQL Status ■ Initial version released with Arrow 7.0.0 ● Includes support for C++ and Java clients and servers ■ Enhancements to column and data type metadata have been accepted into more recent versions of Arrow. ■ Support for transactions and query cancellation has been accepted. ■ Open for contributions ● Support for additional languages (Python, Go, C#, etc.). ● More features such as small result enhancements.
  • 17. Flight SQL Status ■ JDBC Driver ● Connect legacy JDBC applications (for example DBeaver, DbVisualizer) to databases with the Flight SQL protocol with no code changes. ● Merged into Apache/master. To be released in Arrow 10.0.0 ■ ODBC Driver ● Released by Dremio. ● Connect ODBC applications such as Tableau, pyodbc, Power BI to Flight SQL-enabled databases.
  • 19. Practical Example: pyodbc vs. PyArrow ● PyArrow is columnar ■ Consume columnar data returned using the Arrow Flight without deserialization costs. ● pyodbc is row-oriented ■ All data values must be converted to scalars to expose to the python application. ■ This process incurs significant deserialization costs.
  • 20. Practical Example: pyodbc vs. PyArrow ● Comparison: 500,000 rows queried from a remote server. (No parallelism). ■ pyodbc: 8.00s. PyArrow: 0.900s.
  • 21. Query Execution: pyodbc vs. PyArrow
    pyodbc (ODBC):
        cursor = connection.cursor()
        cursor.execute(sql)
        data = cursor.fetchall()
    PyArrow (Arrow Flight SQL):
        options = flight.FlightCallOptions(headers=headers)
        descriptor = flight.FlightDescriptor.for_command(sql)
        flight_info = client.get_flight_info(descriptor, options)
        reader = client.do_get(flight_info.endpoints[0].ticket, options)
        data = reader.read_chunk()
    ■ ODBC requires all data to be retrieved from a single entry point (the cursor in the above example).
    ■ Arrow Flight lets the server expose multiple endpoints that host separate partitions of the data. Data can be retrieved in parallel and even from separate processes or client nodes.
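
The PyArrow snippet on slide 21 reads one chunk from the first endpoint. As a rough sketch of draining a whole stream, the loop below calls `read_chunk()` until it raises `StopIteration`, which is how PyArrow's Flight readers signal the end of a stream. The function name `read_all_batches` is hypothetical, and `reader` is assumed to behave like the object returned by `client.do_get(...)`, where each chunk has a `data` attribute holding a record batch.

```python
def read_all_batches(reader):
    """Collect every record batch from a Flight stream reader.

    Assumes `reader.read_chunk()` returns an object with a `.data`
    attribute and raises StopIteration once the stream is exhausted.
    """
    batches = []
    while True:
        try:
            chunk = reader.read_chunk()
        except StopIteration:
            break  # end of stream
        if chunk.data is not None:  # a chunk may carry only metadata
            batches.append(chunk.data)
    return batches
```

When incremental processing is not needed, the reader's `read_all()` method returns the whole stream as a single Arrow table. The explicit loop is mainly useful for handling batches as they arrive.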
  • 22. Arrow Client Design Tips ■ Minimize copying of data. ■ Avoid manual calculations on data. ● Prefer library calls using the Compute library to analyze data (for example, arithmetic or aggregation on Arrow data). ● Arrow libraries use SIMD instructions for high-performance calculations! ■ Arrow provides fast file serialization to JSON, CSV, Parquet, ORC, and uncompressed Arrow files. Avoid serializing Arrow data by hand.
  • 23. References ■ Arrow Flight SQL Announcement: https://arrow.apache.org/blog/2022/02/16/introducing-arrow-flight-sql/ ■ Arrow Flight SQL ODBC Driver: https://github.com/dremio/flightsql-odbc and https://github.com/dremio/warpdrive ■ Arrow Flight SQL JDBC Driver: https://github.com/apache/arrow/tree/master/java/flight/flight-sql-jdbc-driver ■ Arrow Flight SQL JDBC Driver Improvements: https://issues.apache.org/jira/browse/ARROW-17729