Apache Arrow: Present and Future @ ScaledML 2020

February 26, 2020 -- ScaledML
Wes McKinney
Apache Arrow
Present and Future

Computational tools for
data preparation, analytics, and
feature engineering

• Director of Ursa Labs, not-for-profit dev group
working on Apache Arrow
• Created Python pandas project (~2008), lead
developer/maintainer until 2013
• PMC Apache Arrow, Apache Parquet, ASF Member
• Wrote Python for Data Analysis (1e 2012, 2e
2017)
• Formerly: Two Sigma, Cloudera, DataPad, AQR
Wes McKinney

Some Partners
● https://ursalabs.org
● Apache Arrow-powered
Data Science Tools
● Funded by corporate
partners
● Built in collaboration with
RStudio

Aside: Increasing dialogue amongst OSS devs
• Many important discussions either do not happen or
only happen in face-to-face meetings
• New Discourse forum: discuss.ossdata.org
• Topics: design & architecture, developer tools,
collaboration opportunities, ...


Overview
• Review Arrow Mission Statement
• “Shallow dive” into columnar format + protocol
• Some success stories
• Update on latest projects + initiatives for 2020 and
beyond

Apache Arrow
● Open source community project launched in 2016
● Intersection of database systems, big data, and data
science tools
● Purpose: Language-independent open standards and
libraries to accelerate and simplify in-memory computing
● https://github.com/apache/arrow

Personal motivations
● Interoperability problems with other data processing
systems
● Awareness of fundamental computational problems in
pandas or R data frames
○ Limited data types
○ Memory use problems
○ Slow processing efficiency
○ Difficulty with larger-than-memory datasets

Apache Arrow Big Picture
● Language-agnostic in-memory columnar format for
analytical query engines, data frames
● Binary protocol for IPC / RPC
● “Batteries included” development platform for building
data processing applications

Downstream applications
● Eliminate serialization overhead in data interchange
● Improve CPU/GPU in-memory processing efficiency
● Simplify architectures
● Promote code reuse

2020 Development Status
● 16 major releases
● Over 400 unique contributors
● Over 50M package installs in
2019
● ASF roster: 50 committers, 28
PMC members
● 11 programming languages
represented

Relationship with data science libraries
● Retrofit existing packages with faster IO or faster
processing code
● Superior computational foundation for new projects
● Aside: many “scalable data frame” projects have
limited investments in improving single-node
processing efficiency

Arrow Columnar Format and
Binary Protocol

Arrow’s Columnar Memory Format
• Runtime memory format for analytical query processing
• Ideal companion to columnar storage like Apache Parquet
• “Fully shredded” columnar, supports flat and nested schemas
• Organized for cache-efficient access on CPUs/GPUs
• Optimized for data locality, SIMD, parallel processing
• Accommodates both random access and scan workloads
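The layout ideas above can be illustrated with a minimal sketch in plain Python. It is not the Arrow library: it only assumes the format's convention of a validity bitmap (least-significant-bit numbering) next to a contiguous little-endian values buffer, and it omits the buffer padding that real implementations add. The function names are hypothetical.

```python
import struct

def build_int32_column(values):
    """Return (validity_bitmap, values_buffer) for a list of ints or None."""
    bitmap = bytearray((len(values) + 7) // 8)
    data = bytearray()
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i // 8] |= 1 << (i % 8)  # bit i set means slot i is valid
        data += struct.pack("<i", 0 if v is None else v)  # null slots still occupy 4 bytes
    return bytes(bitmap), bytes(data)

def get_value(bitmap, data, i):
    """Random access: constant-time lookup by slot index."""
    if not (bitmap[i // 8] >> (i % 8)) & 1:
        return None
    return struct.unpack_from("<i", data, 4 * i)[0]

def column_sum(bitmap, data, length):
    """Scan: one pass over the contiguous values buffer, skipping nulls."""
    values = struct.unpack_from(f"<{length}i", data)
    return sum(v for i, v in enumerate(values) if (bitmap[i // 8] >> (i % 8)) & 1)

validity, data = build_int32_column([1, None, 3, 4])
```

Because every value has a fixed width and sits next to its neighbours, both the indexed lookup and the sequential scan reduce to simple offset arithmetic.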

Arrow Binary Protocol
• Record batch: ordered collection of named arrays
• Streaming wire format for transferring datasets between address
spaces
• Intended for both IPC / shared memory and RPC use cases
[Diagram: stream from sender to receiver, in order: SCHEMA, DICTIONARY, DICTIONARY, RECORD BATCH, RECORD BATCH, ...]

Encapsulated protocol (“IPC”) messages
• Serialization wire format suitable for stream-based parsing
[Diagram: message layout, in order: metadata size or end-of-stream marker; “Message” Flatbuffer metadata (see format/Message.fbs); padding; body]
• Metadata contains memory addresses within body to reconstruct data structures
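The framing described above can be sketched in plain Python. This is an illustration under stated assumptions, not the Arrow implementation: JSON stands in for the “Message” Flatbuffer, the `body_length` key is a hypothetical stand-in for the body length recorded in the real metadata, and the prefix follows the layout used by recent format versions (a 4-byte continuation marker 0xFFFFFFFF followed by a little-endian int32 metadata size, with size 0 marking end-of-stream).

```python
import io
import json
import struct

CONTINUATION = 0xFFFFFFFF

def _pad8(n):
    """Round n up to the next multiple of 8."""
    return (n + 7) & ~7

def write_message(stream, metadata, body):
    body = body.ljust(_pad8(len(body)), b"\x00")
    # JSON is only a stand-in for the Flatbuffer-encoded Message.
    meta = json.dumps(dict(metadata, body_length=len(body))).encode()
    padded = _pad8(len(meta))
    stream.write(struct.pack("<Ii", CONTINUATION, padded))
    stream.write(meta.ljust(padded, b"\x00"))
    stream.write(body)

def write_end_of_stream(stream):
    stream.write(struct.pack("<Ii", CONTINUATION, 0))

def read_messages(stream):
    """Yield (metadata, body) pairs until the end-of-stream marker."""
    while True:
        prefix = stream.read(8)
        if len(prefix) < 8:
            return
        marker, meta_size = struct.unpack("<Ii", prefix)
        if marker != CONTINUATION:
            raise ValueError("missing continuation marker")
        if meta_size == 0:
            return
        metadata = json.loads(stream.read(meta_size).rstrip(b"\x00"))
        yield metadata, stream.read(metadata["body_length"])

buf = io.BytesIO()
write_message(buf, {"type": "schema"}, b"")
write_message(buf, {"type": "record_batch"}, b"12345678")
write_end_of_stream(buf)
buf.seek(0)
messages = list(read_messages(buf))
```

The reader never needs to know the total stream length in advance: each prefix says how much metadata to read, and the metadata says how much body follows.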

Record Batch serialization
• IPC message body contains buffers
concatenated end-to-end
• Serialized metadata records memory
offset and size of each buffer, for
later pointer arithmetic
schema {
a: int32,
b: list<item: binary>
}
[Diagram: BODY, buffers in order: a: buffer 0, a: buffer 1, b: buffer 0, b: buffer 1, b.item: buffer 0, b.item: buffer 1, b.item: buffer 2]
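As a rough sketch of the slide's schema, the code below lays out seven buffers end-to-end and records an (offset, length) pair for each. It is plain Python, not Arrow code. The buffer names are hypothetical labels: they assume buffer 0 is a validity bitmap, buffer 1 holds int32 values or offsets, and b.item buffer 2 holds the raw bytes. All values are assumed valid, so the bitmaps are written but not consulted when reading.

```python
import struct

def pack_body(buffers):
    """Concatenate (name, bytes) pairs, 8-byte aligned; return (body, layout)."""
    body = bytearray()
    layout = []  # one (name, offset, length) entry per buffer
    for name, buf in buffers:
        layout.append((name, len(body), len(buf)))
        body += buf
        body += b"\x00" * (-len(body) % 8)  # pad to the next 8-byte boundary
    return bytes(body), layout

def view_buffers(body, layout):
    """Rebuild buffer views from offsets alone; memoryview slices do not copy."""
    mv = memoryview(body)
    return {name: mv[off:off + length] for name, off, length in layout}

def read_b(bufs, i):
    """Decode row i of column b (list<binary>) using offset arithmetic only."""
    lo, hi = struct.unpack_from("<2i", bufs["b.offsets"], 4 * i)
    item_offsets = struct.unpack_from(f"<{hi - lo + 1}i", bufs["b.item.offsets"], 4 * lo)
    data = bufs["b.item.data"]
    return [bytes(data[s:e]) for s, e in zip(item_offsets, item_offsets[1:])]

# Two rows: a = [1, 2], b = [[b"x", b"yz"], [b"w"]]
body, layout = pack_body([
    ("a.validity", b"\x03"),
    ("a.values", struct.pack("<2i", 1, 2)),
    ("b.validity", b"\x03"),
    ("b.offsets", struct.pack("<3i", 0, 2, 3)),
    ("b.item.validity", b"\x07"),
    ("b.item.offsets", struct.pack("<4i", 0, 1, 3, 4)),
    ("b.item.data", b"xyzw"),
])
bufs = view_buffers(body, layout)
```

The receiver reconstructs every column from the layout table and one contiguous body, which is what makes shared-memory and memory-mapped reads possible without parsing or copying the data.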

Value type metadata
• Reasonably comprehensive set of built-in value types
• Application-defined logical types can be defined and transmitted
using special custom_metadata fields in the Schema
• Extension data stored using a built-in type
• Examples
• UUID stored as FixedSizeBinary<16>
• LatitudeLongitude stored as struct<x: double, y: double>
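The UUID example can be sketched as follows, in plain Python rather than through an Arrow library. The field is represented as a dictionary for illustration only. The keys `ARROW:extension:name` and `ARROW:extension:metadata` are the reserved custom_metadata keys in the Arrow format, while the extension name `example.uuid` and the helper functions are hypothetical.

```python
import uuid

# Storage type is a built-in type; the logical type travels in custom_metadata.
uuid_field = {
    "name": "id",
    "storage_type": "fixed_size_binary[16]",
    "custom_metadata": {
        "ARROW:extension:name": "example.uuid",  # hypothetical extension name
        "ARROW:extension:metadata": "",
    },
}

def encode_uuids(values):
    """Pack UUIDs into one contiguous buffer of 16-byte slots."""
    return b"".join(u.bytes for u in values)

def decode_column(buf, field):
    """Decode as UUIDs if the extension is recognized, else as raw storage."""
    slots = [buf[i:i + 16] for i in range(0, len(buf), 16)]
    if field["custom_metadata"].get("ARROW:extension:name") == "example.uuid":
        return [uuid.UUID(bytes=s) for s in slots]
    return slots  # a reader unaware of the extension still sees valid binary data

ids = [uuid.UUID(int=1), uuid.UUID(int=2)]
storage = encode_uuids(ids)
decoded = decode_column(storage, uuid_field)
```

The design point is the fallback branch: a consumer that does not know the extension can still read the column as its storage type.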

Columnar Format Future Directions
• In-memory encoding, compression, sparseness
• e.g. run-length encoding
• See mailing list discussions, we need your
feedback!
• Expansion of logical types

https://medium.com/google-cloud/announcing-google-cloud-bigquery-version-1-17-0-1fc428512171

Some active development
initiatives

Arrow C++ development platform
[Diagram: components of the platform; in the original slide, red means planned / under construction work]
• Allocators and Buffers
• Columnar Data Structures and Builders
• File Format Interfaces: PARQUET, CSV, JSON, ORC, AVRO
• Binary IPC Protocol
• Gandiva: LLVM Expr Compiler
• Compute Kernels
• IO / Filesystem Platform: localfs, AWS S3, HDFS, mmap, GCP, Azure
• Plasma: Shared Mem Object Store
• Multithreading Runtime
• Datasets Framework
• Data Frame Interface
• Embeddable Query Engine
• Compressor Interfaces
• CUDA Interop
• Flight RPC
• … and much more

Arrow C++ Platform
[Diagram: platform components: Multi-core Work Scheduler; Core Data Platform; Query Processing; Datasets Framework; Arrow Flight RPC; Network; Storage]
Example: use in R libraries
flights %>%
  group_by(year, month, day) %>%
  select(arr_delay, dep_delay) %>%
  summarise(
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)
• The input (flights) can be a massive Arrow dataset
• dplyr verbs can be translated to Arrow computation graphs, executed by parallel runtime
• R expressions can be JIT-compiled with LLVM

Arrow Flight Overview
• A gRPC-based framework for defining custom data
services that send and receive Arrow columnar data
natively
• Uses Protocol Buffers v3 for client protocol
• Pluggable command execution layer, authentication
• Low-level gRPC optimizations to avoid unnecessary
serialization

Arrow Flight - Parallel Get
[Diagram: the client sends GetFlightInfo to the planner and receives a FlightInfo; it then issues DoGet requests to the data nodes, each of which returns FlightData]
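The parallel get pattern can be simulated in-process with plain Python. This sketch does not use gRPC or the pyarrow.flight API: the planner, the data nodes, and the function names `get_flight_info`, `do_get`, and `parallel_get` are hypothetical stand-ins that only mirror the call sequence in the diagram.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data nodes: location -> ticket -> list of record batches.
DATA_NODES = {
    "node-a": {"ticket-0": [[1, 2], [3]]},
    "node-b": {"ticket-1": [[4, 5]]},
}

def get_flight_info(dataset_name):
    """Planner stand-in: list the (location, ticket) endpoints for a dataset."""
    return [(loc, ticket)
            for loc, tickets in sorted(DATA_NODES.items())
            for ticket in tickets]

def do_get(endpoint):
    """Data node stand-in: return the stream of batches behind one ticket."""
    location, ticket = endpoint
    return DATA_NODES[location][ticket]

def parallel_get(dataset_name):
    """Ask the planner once, then fetch every endpoint concurrently."""
    endpoints = get_flight_info(dataset_name)
    with ThreadPoolExecutor(max_workers=len(endpoints)) as pool:
        streams = list(pool.map(do_get, endpoints))  # map keeps endpoint order
    return [batch for stream in streams for batch in stream]

batches = parallel_get("example")
```

The planner only hands out endpoints, so the data itself never funnels through a single coordinator.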

Arrow Flight - Efficient gRPC transport
[Diagram: the client issues DoGet to a data node, which streams back FlightData messages, each carrying a row batch]
Data transported in a Protocol
Buffer, but reads can be made
zero-copy by writing a custom
gRPC “deserializer”

Demo: Build simple Flight service
in Python

Getting involved
• Join dev@arrow.apache.org
• Development https://github.com/apache/arrow
• Non-project-specific discussions
https://discuss.ossdata.org
