New Directions for Apache Arrow

Wes McKinney
@wesmckinn
September 10, 2021
New York R Conference

Apache Arrow
Multi-language toolbox for accelerated
data interchange and in-memory processing
● Founded in 2016 by a group of developers of open source data projects
● Provides a shared foundation for data analytics
● Enables unification of database and data science technology stacks
● Thriving user and developer community
● Adopted by numerous projects and products in the data ecosystem

2018: Ursa Labs
Founded with a not-for-profit mission
● Build cross-language, open libraries for data analytics
● Grow the Apache Arrow ecosystem
● Employ a team of full-time developers
Supported by sponsors and partners

2020: Ursa Computing
Founded to support enterprise applications of Arrow
● Empower teams to accelerate data workflows
● Work with enterprises to enhance data platforms
● Enable organizations to get more out of their data
A venture-backed startup

2021: Voltron Data
Joining forces for an Arrow-native future
● Ursa joined forces with GPU-accelerated computing pioneers
● Together we are creating a unified foundation for the future of analytical computing
○ Optimized for diverse hardware
○ Compatible across languages
○ Fast and efficient
○ Based on Apache Arrow
● Ursa Labs is now Voltron Labs

Apache Arrow
● Specifies a columnar format for how data is stored in memory
● Provides implementations or bindings in numerous languages
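
As a rough illustration of what "columnar" means here, the sketch below stores each column as one contiguous buffer plus a validity bitmap (bit i set, least-significant bit first, when slot i is non-null). It uses only the Python standard library rather than an Arrow implementation; the helper names to_column and is_valid and the sample values are assumptions made for this example.

```python
from array import array

# Row-oriented input: one tuple per record; None marks a missing value.
rows = [(1, 13.6), (2, None), (3, 11.1)]


def to_column(values, typecode):
    """Pack values into a contiguous buffer plus a validity bitmap.

    Bit i of the bitmap (least-significant bit first) is 1 when slot i
    holds a value and 0 when it is null.
    """
    data = array(typecode, (0 if v is None else v for v in values))
    bitmap = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i // 8] |= 1 << (i % 8)
    return data, bitmap


def is_valid(bitmap, i):
    """Return True when slot i is marked non-null in the bitmap."""
    return bool(bitmap[i // 8] >> (i % 8) & 1)


# Each column becomes its own buffer (hypothetical helper, not an Arrow API).
passenger_count, pc_valid = to_column([r[0] for r in rows], "i")
avg_tip_pct, tip_valid = to_column([r[1] for r in rows], "d")

# A column scan reads one contiguous buffer and skips null slots.
total = sum(v for i, v in enumerate(avg_tip_pct) if is_valid(tip_valid, i))
```

Because every language binding can agree on buffers laid out this way, data can be handed between libraries without converting it row by row.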

Arrow Flight
High-performance data transport protocol
● Provides a framework for sending and receiving Arrow data natively
● Built using gRPC, Protocol Buffers, and the Arrow columnar format
● Designed to move large-scale data with excellent speed and efficiency
● Enables seamless interoperability across networks
Arrow Flight SQL is a next-generation standard for data access using SQL
● Adds SQL semantics to Arrow Flight
● Enables ODBC/JDBC-style data access at the speed of Flight

Arrow R Package
Exposes an interface to the Arrow C++ library
● Low-level access to the Arrow C++ API
● Higher-level access through a dplyr backend
Install the latest release from CRAN:
install.packages("arrow")
Install the latest nightly development build:
install.packages("arrow", repos =
  c("https://arrow-r-nightly.s3.amazonaws.com", getOption("repos")))

Parquet file (~10 million rows, ~250 MB), read into an R data frame.
Shown with the development version of arrow.

library(arrow)
library(dplyr)

# Read the whole file into an R data frame, then run the pipeline in dplyr
read_parquet("nyc-taxi/2015/09/data.parquet", as_data_frame = TRUE) %>%
  filter(total_amount > 100) %>%
  select(tip_amount, total_amount, passenger_count) %>%
  mutate(tip_pct = tip_amount / total_amount * 100) %>%
  group_by(passenger_count) %>%
  summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
  filter(n > 500) %>%
  arrange(desc(avg_tip_pct))
#> # A tibble: 3 x 3
#>   passenger_count avg_tip_pct     n
#>             <int>       <dbl> <int>
#> 1               1        13.6 11714
#> 2               2        11.9  2892
#> 3               3        11.1   709

system.time(...)
#>    user  system elapsed
#>   4.762   0.806   1.612

Parquet file (~10 million rows, ~250 MB), read into an Arrow Table.
collect() returns the result as an R data frame.
Shown with the development version of arrow.

# The select(), mutate(), group_by(), and summarize() steps from the
# first example are omitted here.
read_parquet("nyc-taxi/2015/09/data.parquet", as_data_frame = FALSE) %>%
  filter(total_amount > 100) %>%
  filter(n > 500) %>%
  arrange(desc(avg_tip_pct)) %>%
  collect()
#> 1 1 13.6 11714
#> 2 2 11.9  2892
#> 3 3 11.1   709

system.time(...)
#>    user  system elapsed
#>   3.446   1.012   0.605

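125 Parquet files (~2 billion rows, ~40 GB), opened as a partitioned dataset.
Shown with the development version of arrow.

# The select(), mutate(), group_by(), and summarize() steps from the
# first example are omitted here.
open_dataset("nyc-taxi", partitioning = c("year", "month")) %>%
  filter(total_amount > 100 & year == 2015) %>%
  filter(n > 5000) %>%
  collect()
#> 1 5 16.8   5806
#> 2 1 13.5 143087
#> 3 2 12.6  34418
#> 4 3 11.9   8922

system.time(...)
#>    user  system elapsed
#>   3.319   0.247   1.111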

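# The select(), mutate(), group_by(), and summarize() steps from the
# first example are omitted here.
open_dataset("nyc-taxi", partitioning = c("year", "month")) %>%
  filter(total_amount > 100 & year == 2015) %>%
  to_duckdb() %>%
  filter(n > 5000) %>%
  collect()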
● Creates a virtual DuckDB table backed by an Arrow data object
○ No data is loaded until collect() is called
○ Returns a dbplyr object for use in dplyr pipelines
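
The deferred behavior described in these bullets can be pictured with a toy pipeline that only records its steps and reads the data when collect() is called. This is a minimal standard-library sketch of the idea, not how arrow, DuckDB, or dbplyr implement it; the LazyQuery class, the load_taxi function, and the sample rows are all hypothetical.

```python
class LazyQuery:
    """Records pipeline steps; touches the data source only in collect()."""

    def __init__(self, load, steps=()):
        self._load = load    # zero-argument callable that reads the data
        self._steps = steps  # functions applied in order, rows -> rows

    def filter(self, predicate):
        step = lambda rows: [r for r in rows if predicate(r)]
        return LazyQuery(self._load, self._steps + (step,))

    def mutate(self, **new_columns):
        step = lambda rows: [
            {**r, **{name: fn(r) for name, fn in new_columns.items()}}
            for r in rows
        ]
        return LazyQuery(self._load, self._steps + (step,))

    def collect(self):
        rows = self._load()  # data is read here, not before
        for step in self._steps:
            rows = step(rows)
        return rows


load_calls = []  # tracks how many times the source was actually read


def load_taxi():
    """Hypothetical data source standing in for a dataset on disk."""
    load_calls.append(1)
    return [
        {"tip_amount": 20.0, "total_amount": 120.0},
        {"tip_amount": 2.0, "total_amount": 15.0},
    ]


query = (
    LazyQuery(load_taxi)
    .filter(lambda r: r["total_amount"] > 100)
    .mutate(tip_pct=lambda r: r["tip_amount"] / r["total_amount"] * 100)
)
calls_before_collect = len(load_calls)  # building the query loads nothing
result = query.collect()                # the source is read exactly once
```

Building the query is cheap because each verb returns a new object holding a longer list of steps; the cost is paid once, at collect().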
Shown with the development version of arrow.

Coming Soon
Upcoming Arrow releases will bring additional
query execution capabilities to the Arrow C++ engine and R package
● Joins
● Window functions
● More scalar and aggregate functions
● Performance and efficiency improvements

Ibis
● The Arrow C++ engine currently lacks a high-level Python API
● Ibis can fill this gap
import ibis

# taxi is assumed to be an existing Ibis table expression
result = (
    taxi
    .filter(taxi.total_amount > 100)
    .projection(['tip_amount', 'total_amount', 'passenger_count'])
    .mutate(tip_pct=taxi.tip_amount / taxi.total_amount * 100)
    .group_by('passenger_count')
    .aggregate(n=lambda x: x.count(), avg_tip=lambda x: x.tip_pct.mean())
    .filter(lambda x: x.n > 500)
    .sort_by(ibis.desc('avg_tip'))
    .execute()
)

Engines and Interfaces
There are multiple efforts underway to develop Arrow-native query engines
● Arrow C++ engine
● Arrow DataFusion
● DuckDB
● …
Users want fluent interfaces to these engines from their preferred languages
● Python
● R
● JavaScript
● …
Users also want to run SQL queries on these engines

Compute Intermediate Representation (IR)
The Arrow community has launched a collaboration to establish a Compute IR
● A standard serialized representation of compute expressions
● A common layer connecting APIs (front ends) and engines (back ends)
○ Is produced by APIs
○ Is consumed by engines
● Follow this initiative at substrait.io
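
To make the producer/consumer split concrete, here is a minimal sketch in which a front end builds a filter expression and serializes it, and a separate back end deserializes and evaluates it. The JSON encoding, the node shapes, and the function names (field, literal, call, evaluate) are invented for illustration and are not the format defined by the Compute IR effort.

```python
import json
import operator


# --- Front end (API): builds an expression tree and serializes it ---
def field(name):
    return {"op": "field", "name": name}


def literal(value):
    return {"op": "literal", "value": value}


def call(function, *args):
    return {"op": "call", "function": function, "args": list(args)}


# Equivalent in spirit to filter(total_amount > 100)
expr = call("greater", field("total_amount"), literal(100))
serialized = json.dumps(expr)  # the representation handed to the engine

# --- Back end (engine): deserializes and evaluates against rows ---
FUNCTIONS = {
    "greater": operator.gt,
    "divide": operator.truediv,
    "multiply": operator.mul,
}


def evaluate(node, row):
    """Recursively evaluate a deserialized expression for one row."""
    if node["op"] == "field":
        return row[node["name"]]
    if node["op"] == "literal":
        return node["value"]
    if node["op"] == "call":
        args = [evaluate(arg, row) for arg in node["args"]]
        return FUNCTIONS[node["function"]](*args)
    raise ValueError(f"unknown node type: {node['op']}")


plan = json.loads(serialized)
rows = [{"total_amount": 120.0}, {"total_amount": 15.0}]
matches = [row for row in rows if evaluate(plan, row)]
```

Because the front end and back end share only the serialized form, any API that emits it could in principle drive any engine that understands it.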

Thank you
Wes McKinney
@wesmckinn
arrow.apache.org
voltrondata.com
We’re hiring!
voltrondata.com/careers
