In-Memory Analytics with Apache Arrow: Accelerate data analytics for efficient processing of flat and hierarchical data structures
By Matthew Topol and Wes McKinney
In-Memory Analytics with Apache Arrow
Copyright © 2024 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Group Product Manager: Apeksha Shetty
Publishing Product Manager: Chayan Majumdar
Book Project Manager: Aparna Nair
Senior Content Development Editor: Shreya Moharir
Technical Editor: Sweety Pagaria
Copy Editor: Safis Editing
Proofreader: Shreya Moharir
Indexer: Pratik Shirodkar
Production Designer: Prafulla Nikalje
DevRel Marketing Executive: Nivedita Singh
First published: June 2022
Second edition: September 2024
Production reference: 1060924
Published by Packt Publishing Ltd.
Grosvenor House
11 St Paul’s Square
Birmingham
B3 1RB, UK
ISBN 978-1-83546-122-8
www.packtpub.com
For my family, Kat and Haley, who managed to tolerate me the entire time I was writing this.
Also for Logan and Penny, my fuzzy coding companions who got me through so much. Their memory is a blessing.
Foreword
Since launching as an open source project in 2016, Apache Arrow has rapidly become the de facto standard for interoperability and accelerated in-memory processing for tabular data. We have broadened support to a dozen programming languages while expanding substantially beyond the project’s initial goal of defining a standardized columnar data format to create a true multi-language developer toolbox for creating high-performance data applications. While Arrow has helped greatly with improving interoperability and performance in heterogeneous systems (such as across programming languages or different kinds of execution engines), it is also increasingly being chosen as the foundation for building new data processing systems and databases. With Dremio as the first true Arrow-native system, we hope that many more production systems will become Arrow-compatible or Arrow-native over the coming years.
Part of Arrow’s success and the rapid growth of its developer community comes from the passion and time investment of its early adopters and most prolific core contributors. Matt Topol has been a driving force in the Go libraries for Arrow, and with this new book, he has made a significant contribution to making the whole project a lot more accessible to newcomers. The book goes in depth into the details of how different pieces of Arrow work while highlighting the many different building blocks that could be employed by an Arrow user to accelerate or simplify their application.
I am thrilled to see this updated second edition of this book as the Arrow project and its open source ecosystem continue to expand in new, impactful directions, even more than eight years since the project started. This was the first true Arrow book since the project’s founding, and it is a valuable resource for developers who want to explore different areas in depth and to learn how to apply new tools in their projects. I’m always happy to recommend it to new users of Arrow as well as existing users who are looking to deepen their knowledge by learning from an expert like Matt.
– Wes McKinney
Co-founder of Voltron Data and Principal Architect at Posit
Co-creator and PMC for Apache Arrow
Contributors
About the author
Matthew Topol is a member of the Apache Arrow Project Management Committee (PMC) and a staff software engineer at Voltron Data, Inc. Matt has worked in infrastructure, application development, and large-scale distributed system analytical processing for financial data. At Voltron Data, Matt’s primary responsibilities have been working on and enhancing the Apache Arrow libraries and associated sub-projects. In his spare time, Matt likes to bash his head against a keyboard, develop and run delightfully demented fantasy games for his victims—er—friends, and share his knowledge and experience with anyone interested enough to listen.
A very special thanks go out to my friends Hope and Stan, whose encouragement is the only reason I wrote a book in the first place. Finally, thanks go to my parents, who beam with pride every time I talk about this book. Thank you for your support and for being there through everything.
About the reviewers
Weston Pace is a maintainer for the Apache Arrow project and a member of the Arrow PMC and Substrait SMC. He has worked closely with the C++, Python, and Rust implementations of Apache Arrow. He has developed components in several of the systems described in this book, such as datasets and Acero. Weston is currently employed at LanceDB, where he is working on new Arrow-compatible storage formats to enable even more Arrow-native technology.
Jacob Wujciak-Jens is an Apache Arrow committer and an elected member of the Apache Software Foundation. His work at Voltron Data as a senior software release engineer has included pivotal roles in the Apache Arrow and Velox projects. During his tenure, he has developed a deep knowledge of the release processes, build systems, and inner workings of these high-profile open source software projects. Jacob has a passion for open source and its use, both in the open source community and industry. Holding a Master of Education in computer science and public health, he loves to share his knowledge, enriching the community and enhancing collaborative projects.
Raúl Cumplido is a member of the Apache Arrow PMC and has been the release manager for the project for more than 10 releases now. He has worked on several areas of the project and has always been involved with open source communities, contributing mainly to Python-related projects. He’s one of the cofounders of the Python Spanish Association and has also been involved in the organization of several EuroPython and PyCon ES conferences. He currently works as a senior software release engineer at Voltron Data, where he contributes to the Apache Arrow project.
Table of Contents
Preface
Part 1: Overview of What Arrow is, Its Capabilities, Benefits, and Goals
1
Getting Started with Apache Arrow
Technical requirements
Understanding the Arrow format and specifications
Why does Arrow use a columnar in-memory format?
Learning the terminology and physical memory layout
Quick summary of physical layouts, or TL;DR
How to speak Arrow
Arrow format versioning and stability
Would you download a library? Of course!
Setting up your shooting range
Using PyArrow for Python
C++ for the 1337 coders
Go, Arrow, go!
Summary
References
2
Working with Key Arrow Specifications
Technical requirements
Playing with data, wherever it might be!
Working with Arrow tables
Accessing data files with PyArrow
Accessing data files with Arrow in C++
Bears firing arrows
Putting pandas in your quiver
Making pandas run fast
Keeping pandas from running wild
Polar bears use Rust-y arrows
Sharing is caring… especially when it’s your memory
Diving into memory management
Managing buffers for performance
Crossing boundaries
Summary
3
Format and Memory Handling
Technical requirements
Storage versus runtime in-memory versus message-passing formats
Long-term storage formats
In-memory runtime formats
Message-passing formats
Summing up
Passing your Arrows around
What is this sorcery?!
Producing and consuming Arrows
Learning about memory cartography
The base case
Parquet versus CSV
Mapping data into memory
Too long; didn’t read (TL;DR) – computers are magic
Leaving the CPU – using device memory
Starting with a few pointers
Device-agnostic buffer handling
Summary
Part 2: Interoperability with Arrow: The Power of Open Standards
4
Crossing the Language Barrier with the Arrow C Data API
Technical requirements
Using the Arrow C data interface
The ArrowSchema structure
The ArrowArray structure
Example use cases
Using the C data API to export Arrow-formatted data
Importing Arrow data with Python
Exporting Arrow data with the C Data API from Python to Go
Streaming Arrow data between Python and Go
What about non-CPU device data?
The ArrowDeviceArray struct
Using ArrowDeviceArray
Other use cases
Some exercises
Summary
5
Acero: A Streaming Arrow Execution Engine
Technical requirements
Letting Acero do the work for you
Input shaping
Value casting
Types of functions in Acero
Invoking functions
Using the C++ compute library
Using the compute library in Python
Picking the right tools
Adding a constant value to an array
Compute Add function
A simple for loop
Using std::for_each and reserve space
Divide and conquer
Always have a plan
Where does Acero fit?
Acero’s core concepts
Let’s get streaming!
Simplifying complexity
Summary
6
Using the Arrow Datasets API
Technical requirements
Querying multifile datasets
Creating a sample dataset
Discovering dataset fragments
Filtering data programmatically
Expressing yourself – a quick detour
Using expressions for filtering data
Deriving and renaming columns (projecting)
Using the Datasets API in Python
Creating our sample dataset
Discovering the dataset
Using different file formats
Filtering and projecting columns with Python
Streaming results
Working with partitioned datasets
Writing partitioned data
Connecting everything together
Summary
7
Exploring Apache Arrow Flight RPC
Technical requirements
The basics and complications of gRPC
Building modern APIs for data
Efficiency and streaming are important
Arrow Flight’s building blocks
Horizontal scalability with Arrow Flight
Adding your business logic to Flight
Other bells and whistles
Understanding the Flight Protobuf definitions
Using Flight, choose your language!
Building a Python Flight server
Building a Go Flight server
What is Flight SQL?
Setting up a performance test
Everyone gets a containerized development environment!
Running the performance test
Flight SQL, the new kid on the block
Summary
8
Understanding Arrow Database Connectivity (ADBC)
Technical requirements
ODBC takes an Arrow to the knee
Lost in translation
Arrow adoption in ODBC drivers
The benefits of standards around connectivity
The ADBC specification
ADBC databases
ADBC connections
ADBC statements
ADBC error handling
Using ADBC for performance and adaptability
ADBC with C/C++
Using ADBC with Python
Using ADBC with Go
Summary
9
Using Arrow with Machine Learning Workflows
Technical requirements
SPARKing new ideas on Jupyter
Understanding the integration of Arrow in Spark
Containerization makes life easier
SPARKing joy with Arrow and PySpark
Facehuggers implanting data
Setting up your environment
Proving the benefits by checking resource usage
Using Arrow with the standard tools for ML
More GPU, more speed!
Summary
Part 3: Real-World Examples, Use Cases, and Future Development
10
Powered by Apache Arrow
Swimming in data with Dremio Sonar
Clarifying Dremio Sonar’s architecture
The library of the gods…of data analysis
Spicing up your data workflows
Arrow in the browser using JavaScript
Gaining a little perspective
Taking flight with Falcon
An Influx of connectivity
Summary
11
How to Leave Your Mark on Arrow
Technical requirements
Contributing to open source projects
Communication is key
You don’t necessarily have to contribute code
There are a lot of reasons why you should contribute!
Preparing your first pull request
Creating and navigating GitHub issues
Setting up Git
Orienting yourself in the code base
Building the Arrow libraries
Creating the pull request
Understanding Archery and the CI configuration
Find your interest and expand on it
Getting that sweet, sweet approval
Finishing up with style!
C++ code styling
Python code styling
Go code styling
Summary
12
Future Development and Plans
Globetrotting with data – GeoArrow and GeoParquet
Collaboration breeds success
Expanding ADBC adoption
Final words
Index
Other Books You May Enjoy
Preface
To quote a famous blue hedgehog, Gotta Go Fast! When it comes to data, speed is important. Whether you’re collecting data, analyzing it, or developing utilities for others to do so, performance and efficiency are going to be huge factors in your technology choices, not just in the efficiency of the software itself, but also in development time. You need the right tools and the right technology, or you’re dead in the water.
The Apache Arrow ecosystem is developer-centric, and this book is no different. Get started with understanding what Arrow is and how it works, then learn how to utilize it in your projects. You’ll find code examples, explanations, and diagrams here, all with the express purpose of helping you learn. You’ll integrate your data sources with Python DataFrame libraries such as pandas or NumPy and utilize Arrow Flight to create efficient data services.
With real-world datasets, you’ll learn how to leverage Apache Arrow with Apache Spark and other technologies. Apache Arrow’s format is language-independent and organized so that analytical operations are performed extremely quickly on modern CPU and GPU hardware. Join the industry adoption of this open source data format and save yourself valuable development time creating high-performance, memory-efficient analytical workflows.
This book has been a labor of love to share knowledge. I hope you learn a lot from it! I sure did when writing it.
Who this book is for
This book is for developers, data analysts, and data scientists looking to explore the capabilities of Apache Arrow from the ground up. This book will also be useful for any engineers who are working on building utilities for data analytics, query engines, or otherwise working with tabular data, regardless of the language they are programming in.
What this book covers
Chapter 1, Getting Started with Apache Arrow, introduces you to the basic concepts underpinning Apache Arrow. It introduces and explains the Arrow format and the data types it supports, along with how they are represented in memory. Afterward, you’ll set up your development environment and run some simple code examples showing the basic operation of Arrow libraries.
Chapter 2, Working with Key Arrow Specifications, continues your introduction to Apache Arrow by explaining how to read both local and remote data files using different formats. You’ll learn how to integrate Arrow with the Python pandas and Polars libraries and how to utilize the zero-copy aspects of Arrow to share memory for performance.
Chapter 3, Format and Memory Handling, discusses the relationships between Apache Arrow and Apache Parquet, Feather, Protocol Buffers, JSON, and CSV data, along with when and why to use these different formats. Following this, the Arrow IPC format is introduced and described, along with an explanation of using memory mapping to further improve performance. Finally, we wrap up with some basic leveraging of Arrow on a GPU.
Chapter 4, Crossing the Language Barrier with the Arrow C Data API, introduces the titular C Data API for efficiently passing Apache Arrow data between different language runtimes and devices. This chapter will cover the struct definitions utilized for this interface along with describing use cases that make it beneficial.
Chapter 5, Acero: A Streaming Arrow Execution Engine, describes how to utilize the reference implementation of an Arrow computation engine named Acero. You’ll learn when and why you should use the compute engine to perform analytics rather than implementing something yourself and why we’re seeing Arrow showing up in many popular execution engines.
Chapter 6, Using the Arrow Datasets API, demonstrates querying, filtering, and otherwise interacting with multi-file datasets that can potentially be across multiple sources. Partitioned datasets are also covered, along with utilizing Acero to perform streaming filtering and other operations on the data.
Chapter 7, Exploring Apache Arrow Flight RPC, examines the Flight RPC protocol and its benefits. You will be walked through building a simple Flight server and client in multiple languages to produce and consume tabular data.
Chapter 8, Understanding Arrow Database Connectivity (ADBC), introduces and explains an Apache Arrow-based alternative to ODBC/JDBC and why it matters for the ecosystem. You will be walked through several examples with sample code that interact with multiple database systems such as DuckDB and PostgreSQL.
Chapter 9, Using Arrow with Machine Learning Workflows, integrates multiple concepts that have been covered to explain the various ways that Apache Arrow can be utilized to improve parts of data pipelines and the performance of machine learning model training. It will describe how Arrow’s interoperability and defined standards make it ideal for use with Spark, GPU compute, and many other tools.
Chapter 10, Powered by Apache Arrow, provides a few examples of current real-world usage of Apache Arrow, such as Dremio, Spice.AI, and InfluxDB.
Chapter 11, How to Leave Your Mark on Arrow, provides a brief introduction to contributing to open source projects in general, but specifically how to contribute to the Arrow project itself. You will be walked through finding starter issues, setting up your first pull request to contribute, and what to expect when doing so. To that end, this chapter also contains various instructions on locally building Arrow C++, Python, and Go libraries from source to test your contributions.
Chapter 12, Future Development and Plans, wraps up the book by examining the features that are still in development at the time of writing. This includes geospatial integrations with GeoArrow and GeoParquet along with expanding Arrow Database Connectivity (ADBC) adoption. Finally, there are some parting words and a challenge from me to you.
To get the most out of this book
It is assumed that you have a basic understanding of writing code in at least one of C++, Python, or Go to benefit from and use the code snippets. You should know how to compile and run code in the desired language. Some familiarity with basic concepts of data analysis will help you get the most out of the scenarios and use cases explained in this book. Beyond this, concepts such as tabular data and installing software on your machine are assumed to be understood rather than explained.
The sample data is in the book’s GitHub repository. You’ll need to use Git Large File Storage (LFS) or a browser to download the large data files. There are also a couple of larger sample data files in publicly accessible AWS S3 buckets. The book will provide a link to download the files when necessary. Code examples are provided in C++, Python, and Go.
If you are using the digital version of this book, we advise you to access the complete code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.
Take your time, enjoy, and experiment in all kinds of ways, and please, have fun with the exercises!
Download the example code files
You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/In-Memory-Analytics-with-Apache-Arrow-Second-Edition. If there’s an update to the code, it will be updated in the GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Conventions used
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: We’re using PyArrow in this example, but if you have the ArrowDeviceArray struct definition, you could create and populate the struct without ever needing to directly include or link against the Arrow libraries!
A block of code is set as follows:
>>> import numba.cuda
>>> import pyarrow as pa
>>> from pyarrow import cuda
>>> import numpy as np
>>> from pyarrow.cffi import ffi
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
std::unique_ptr<arrow::ArrayBuilder> tmp;
// returns a status, handle the error case
arrow::MakeBuilder(arrow::default_memory_pool(), st_type, &tmp);
std::shared_ptr<arrow::StructBuilder> builder;
builder.reset(static_cast<arrow::StructBuilder*>(tmp.release()));
Any command-line input or output is written as follows:
$ mkdir arrow_chapter1 && cd arrow_chapter1
$ go mod init arrow_chapter1
$ go get -u github.com/apache/arrow/go/v17/arrow@latest
Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: You'll notice that for the Filter and Project nodes in the figure, since they each use a compute expression, there is a sub-tree of the execution graph representing the expression tree.
Tips or important notes
Appear like this.
Get in touch
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Share Your Thoughts
Once you’ve read In-Memory Analytics with Apache Arrow, Second Edition, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.
Download a free PDF copy of this book
Thanks for purchasing this book!
Do you like to read on the go but are unable to carry your print books everywhere?
Is your eBook purchase not compatible with the device of your choice?
Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.
The perks don’t stop there: you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.
Follow these simple steps to get the benefits:
Scan the QR code or visit the link below
https://packt.link/free-ebook/9781835461228
Submit your proof of purchase
That’s it! We’ll send your free PDF and other benefits to your email directly
Part 1: Overview of What Arrow is, Its Capabilities, Benefits, and Goals
This section is an introduction to Apache Arrow as a format specification and a project, the benefits it claims, and the goals it’s trying to achieve. You’ll also find a high-level overview of basic use cases and examples.
This part has the following chapters:
Chapter 1, Getting Started with Apache Arrow
Chapter 2, Working with Key Arrow Specifications
Chapter 3, Format and Memory Handling
1
Getting Started with Apache Arrow
Regardless of whether you’re a data scientist/engineer, a machine learning (ML) specialist, or a software engineer trying to build something to perform data analytics, you’ve probably heard of or read about something called Apache Arrow and either looked for more information or wondered what it was. Hopefully, this book can serve as a springboard in understanding what Apache Arrow is and isn’t, as well as a reference book to be continuously utilized so that you can supercharge your analytical capabilities.
For now, we’ll start by explaining what Apache Arrow is and what you will use it for. Following that, we will walk through the Arrow specifications, set up a development environment where you can play around with the various Apache Arrow libraries, and walk through a few simple exercises so that you can get a feel for how to use them.
In this chapter, we’re going to cover the following topics:
Understanding the Arrow format and specifications
Why does Arrow use a columnar in-memory format?
Learning the terminology and the physical memory layout
Arrow format versioning and stability
Setting up your shooting range
Technical requirements
For the portion of this chapter that describes how to set up a development environment for working with various Arrow libraries, you’ll need the following:
Your preferred integrated development environment (IDE) – for example, VS Code, Sublime, Emacs, or Vim
Plugins for your desired language (optional but highly recommended)
An interpreter or toolchain for your desired language(s):
Python 3.8+: pip and venv and/or pipenv
Go 1.21+
C++ Compiler (capable of compiling C++17 or newer)
Understanding the Arrow format and specifications
The Apache Arrow documentation states the following [1]:
Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.
Well, that’s a lot of technical jargon! Let’s start from the top. Apache Arrow (just Arrow for brevity) is an open source project from the Apache Software Foundation (https://apache.org) that is released under the Apache License, version 2.0 [2]. It was co-created by Jacques Nadeau and Wes McKinney, the creator of pandas, and first released in 2016. Simply put, Arrow is a collection of libraries and specifications that make it easy to build high-performance software utilities for processing and transporting large datasets. It consists of a collection of libraries related to in-memory data processing, including specifications for memory layouts and protocols for sharing and efficiently transporting data between systems and processes. When we’re talking about in-memory data processing, we’re talking exclusively about processing data in RAM and eliminating slow data access (as well as redundantly copying and converting data) wherever possible to improve performance. This is where Arrow excels and provides libraries to support this with utilities for streaming and transportation to speed up data access.
When working with data, there are two primary situations to consider, and each has different needs: the in-memory format and the on-disk format. When data is stored on disk, the biggest concerns are the size of the data and the input/output (I/O) cost to read it into main memory before you can operate on it. As a result, formats for data on disk tend to focus much more on increasing I/O throughput, such as compressing the data to make it smaller and faster to read into memory. One example is the Apache Parquet format, a columnar on-disk file format. Arrow, by contrast, focuses on the in-memory format, which targets processing efficiency through tactics such as cache locality and vectorization of computation.
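To make the distinction concrete, here is a minimal, illustrative PyArrow sketch (the filename is hypothetical, and this is only a preview of APIs covered properly later in the book). Reading a Parquet file decompresses and decodes the on-disk representation into Arrow’s in-memory columnar format, which can then be processed directly:
import pyarrow.parquet as pq
# Hypothetical file path, purely for illustration
table = pq.read_table("measurements.parquet")
# The data now lives in Arrow's in-memory columnar format
print(table.schema)
print(table.num_rows)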
The primary goal of Arrow is to become the lingua franca of data analytics and processing – the One Format to Rule Them All, so to speak. Different databases, programming languages, and libraries tend to implement and use separate internal formats for managing data, which means that any time you’re moving data between these components for different uses, you’re paying a cost to serialize and deserialize that data every time. Not only that but lots of time and resources get spent reimplementing common algorithms and processing in those different data formats over and over. If we can standardize on an efficient, feature-rich internal data format that can be widely adopted and used instead, this excess computation and development time is no longer necessary. Figure 1.1 shows a simplified diagram of multiple systems, each with their own data formats, having to be copied and/or converted for the different components to work with each other:
Figure 1.1 – Copy and convert components
In many cases, the serialization and deserialization processes can end up taking nearly 90% of the processing time in such a system and prevent you from being able to spend that CPU on analytics. Alternatively, if every component is using Arrow’s in-memory format, you end up with a system similar to the one shown in Figure 1.2, where the data can be transferred between components at little-to-no cost. All the components can either share memory directly or send the data as-is without having to convert between different formats:
Figure 1.2 – Sharing Arrow memory between components
At this point, there’s no need for the different components and systems to implement custom connectors or re-implement common algorithms and utilities. The same libraries and connectors can be utilized, even across programming languages and process barriers, by sharing memory directly so that it refers to the same data rather than copying multiple times between them. An example of this idea will be covered in Chapter 8, Understanding Arrow Database Connectivity (ADBC), where we’ll consider a specification for leveraging common database drivers in a cross-platform way to enable efficient interactions using Arrow-formatted data.
Most data processing systems now use distributed processing by breaking the data into chunks and sending those chunks across the network to various workers. So, even if we can share memory across processes on a box, there’s still the cost of sending data across the network. This brings us to the final piece of the puzzle: the format of raw Arrow data on the wire is the same as it is in memory. You can avoid having to deserialize that data before you can use it (skipping a copy), or you can reference the memory buffers you were operating on and send them across the network without having to serialize them first. Only a small amount of metadata needs to be sent along with the raw data buffers, and interfaces that perform zero copies can be created to achieve performance benefits by reducing memory usage and improving throughput. We’ll cover this more directly in Chapter 3, Format and Memory Handling, so look forward to it!
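As a rough preview of that idea, here is a minimal sketch using PyArrow’s IPC stream APIs: the bytes written to an Arrow stream can be read back as Arrow arrays without any row-by-row deserialization step.
import pyarrow as pa
table = pa.table({"id": [1, 2, 3], "value": [0.5, 1.5, 2.5]})
# Serialize the table as an Arrow IPC stream; the buffers on the wire
# have the same layout as the in-memory representation
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
# "Receiving" those bytes simply maps them back into Arrow arrays,
# with no row-by-row decoding step
received = pa.ipc.open_stream(sink.getvalue()).read_all()
assert received.equals(table)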
Let’s quickly recap the features of the Arrow format we just described before moving on:
Using the same high-performance internal format across components allows for much more code reuse in libraries instead of the need to reimplement common workflows.
The Arrow libraries provide mechanisms to directly share memory buffers to reduce copying between processes by using the same internal representation, regardless of the language. This is what’s being referred to whenever you see the term zero-copy (see the short sketch after this list).
The wire format is the same as the in-memory format to eliminate serialization and deserialization costs when sending data across networks between components of a system.
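For instance, here is a tiny, illustrative PyArrow sketch of the zero-copy idea: slicing an array produces a new array that references the same underlying data buffer rather than copying the values.
import pyarrow as pa
arr = pa.array([1, 2, 3, 4, 5], type=pa.int64())
# Slicing does not copy the values; the slice references the same
# underlying data buffer with an adjusted offset and length
sliced = arr.slice(1, 3)
# Both arrays point at the same data buffer in memory
print(arr.buffers()[1].address == sliced.buffers()[1].address)  # True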
Now, you might be thinking, Well, this sounds too good to be true!
And of course, being skeptical of promises like this is always a good idea. The community around Arrow has done a ton of work over the years to bring these ideas and concepts to fruition. The project itself provides and distributes libraries in a variety of different programming languages so that projects that want to incorporate and/or support the Arrow format don’t need to implement it themselves. Above and beyond the interaction with Arrow-formatted data, the libraries provide a significant amount of utility in assisting with common processes such as data access and I/O-related optimizations. As a result, the Arrow libraries can be useful for projects, even if they don’t utilize the Arrow format themselves.
Here’s just a quick sample of use cases where using Arrow as the internal/intermediate data format can be very beneficial:
SQL execution engines (such as Dremio Sonar, InfluxDB, or Apache DataFusion)
Data analysis utilities and pipelines (such as pandas or Apache Spark)
Streaming and message queue systems (such as Apache Kafka or Storm)
Storage systems and formats (such as Apache Parquet, Cassandra, and Kudu)
As for how Arrow can help you, it depends on which piece of the data puzzle you work with. The following are a few different roles that work with data and show how using Arrow could potentially be beneficial; it’s by no means a complete list though:
If you’re a data scientist:
You can utilize Arrow via Polars or pandas and NumPy integration to significantly improve the performance of your data manipulations.
If the tools you use integrate Arrow support, you can gain significant speed-ups for your queries and computations by using Arrow directly to reduce copies and/or serialization costs.
If you’re a data engineer specializing in extract, transform, and load (ETL):
The higher adoption of Arrow as an internal and externally-facing format can make it easier to integrate with many different utilities.
By using Arrow, data can be shared between processes and tools, with shared memory increasing the tools available to you for building pipelines, regardless of the language you’re operating in. You could take data from Python, use it in Spark, and then pass it directly to the Java virtual machine (JVM) without paying the cost of copying between them.
If you’re a software engineer or ML specialist building computation tools and utilities for data analysis:
Arrow, as an internal format, can be used to improve your memory usage and performance by reducing serialization and deserialization between components.
Understanding how to best utilize the data transfer protocols can improve your ability to parallelize queries and access your data, wherever it might be.
Because Arrow can be used for any sort of tabular data, it can be integrated into many different areas of data analysis and computation pipelines and is versatile enough to be beneficial as an internal and data transfer format, regardless of the shape of your data.
Now that you know what Arrow is, let’s dig into its design and how it delivers on the aforementioned promises of high-performance analytics, zero-copy sharing, and network communication without serialization costs. First, you’ll see why a column-oriented memory representation was chosen for Arrow’s internal format. In later chapters, we’ll cover specific integration points, explicit examples, and transfer protocols.
Why does Arrow use a columnar in-memory format?
There is often a lot of debate surrounding whether a database should be row-oriented or column-oriented, but this primarily refers to the on-disk format of the underlying storage files. Arrow’s data format is different from most cases discussed so far since it uses a columnar organization of data structures in memory directly. If you’re not familiar with columnar as a term, let’s take a look at what it means. First, imagine the following table of data:
Figure 1.3 – Sample data table
Traditionally, if you were to read this table into memory, you’d likely have some structure to represent a row and then read the data in one row at a