In-Memory Analytics with Apache Arrow: Accelerate data analytics for efficient processing of flat and hierarchical data structures
Language: English
Publisher: Packt Publishing
Release date: Sep 30, 2024
ISBN: 9781835469682
Author: Matthew Topol


    In-Memory Analytics with Apache Arrow

    Copyright © 2024 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    Group Product Manager: Apeksha Shetty

    Publishing Product Manager: Chayan Majumdar

    Book Project Manager: Aparna Nair

    Senior Content Development Editor: Shreya Moharir

    Technical Editor: Sweety Pagaria

    Copy Editor: Safis Editing

    Proofreader: Shreya Moharir

    Indexer: Pratik Shirodkar

    Production Designer: Prafulla Nikalje

    DevRel Marketing Executive: Nivedita Singh

    First published: June 2022

    Second edition: September 2024

    Production reference: 1060924

    Published by Packt Publishing Ltd.

    Grosvenor House

    11 St Paul’s Square

    Birmingham

    B3 1RB, UK

    ISBN 978-1-83546-122-8

    www.packtpub.com

    For my family, Kat and Haley, who managed to tolerate me the entire time I was writing this.

    Also for Logan and Penny, my fuzzy coding companions who got me through so much. Their memory is a blessing.

    Foreword

    Since launching as an open source project in 2016, Apache Arrow has rapidly become the de facto standard for interoperability and accelerated in-memory processing for tabular data. We have broadened support to a dozen programming languages while expanding substantially beyond the project’s initial goal of defining a standardized columnar data format to create a true multi-language developer toolbox for creating high-performance data applications. While Arrow has helped greatly with improving interoperability and performance in heterogeneous systems (such as across programming languages or different kinds of execution engines), it is also increasingly being chosen as the foundation for building new data processing systems and databases. With Dremio as the first true Arrow-native system, we hope that many more production systems will become Arrow-compatible or Arrow-native over the coming years.

    Part of Arrow’s success and the rapid growth of its developer community comes from the passion and time investment of its early adopters and most prolific core contributors. Matt Topol has been a driving force in the Go libraries for Arrow, and with this new book, he has made a significant contribution to making the whole project a lot more accessible to newcomers. The book goes in depth into the details of how different pieces of Arrow work while highlighting the many different building blocks that could be employed by an Arrow user to accelerate or simplify their application.

    I am thrilled to see this updated second edition of this book as the Arrow project and its open source ecosystem continue to expand in new, impactful directions, even more than eight years since the project started. This was the first true Arrow book since the project’s founding, and it is a valuable resource for developers who want to explore different areas in depth and to learn how to apply new tools in their projects. I’m always happy to recommend it to new users of Arrow as well as existing users who are looking to deepen their knowledge by learning from an expert like Matt.

    Wes McKinney

    Co-founder of Voltron Data and Principal Architect at Posit

    Co-creator and PMC for Apache Arrow

    Contributors

    About the author

    Matthew Topol is a member of the Apache Arrow Project Management Committee (PMC) and a staff software engineer at Voltron Data, Inc. Matt has worked in infrastructure, application development, and large-scale distributed system analytical processing for financial data. At Voltron Data, Matt’s primary responsibilities have been working on and enhancing the Apache Arrow libraries and associated sub-projects. In his spare time, Matt likes to bash his head against a keyboard, develop and run delightfully demented fantasy games for his victims—er—friends, and share his knowledge and experience with anyone interested enough to listen.

A very special thanks goes out to my friends Hope and Stan, whose encouragement is the only reason I wrote a book in the first place. Finally, thanks go to my parents, who beam with pride every time I talk about this book. Thank you for your support and for being there through everything.

    About the reviewers

    Weston Pace is a maintainer for the Apache Arrow project and a member of the Arrow PMC and Substrait SMC. He has worked closely with the C++, Python, and Rust implementations of Apache Arrow. He has developed components in several of the systems described in this book, such as datasets and Acero. Weston is currently employed at LanceDB, where he is working on new Arrow-compatible storage formats to enable even more Arrow-native technology.

    Jacob Wujciak-Jens is an Apache Arrow committer and an elected member of the Apache Software Foundation. His work at Voltron Data as a senior software release engineer has included pivotal roles in the Apache Arrow and Velox projects. During his tenure, he has developed a deep knowledge of the release processes, build systems, and inner workings of these high-profile open source software projects. Jacob has a passion for open source and its use, both in the open source community and industry. Holding a Master of Education in computer science and public health, he loves to share his knowledge, enriching the community and enhancing collaborative projects.

Raúl Cumplido is a PMC member of the Apache Arrow project and has been the release manager for the project for more than 10 releases now. He has worked on several areas of the project. He has always been involved with open source communities, contributing mainly to Python-related projects. He’s one of the cofounders of the Python Spanish Association and has also been involved in the organization of several EuroPython and PyCon ES conferences. He currently works as a senior software release engineer at Voltron Data, where he contributes to the Apache Arrow project.

    Table of Contents

    Preface

    Part 1: Overview of What Arrow is, Its Capabilities, Benefits, and Goals

    1

    Getting Started with Apache Arrow

    Technical requirements

    Understanding the Arrow format and specifications

    Why does Arrow use a columnar in-memory format?

    Learning the terminology and physical memory layout

    Quick summary of physical layouts, or TL;DR

    How to speak Arrow

    Arrow format versioning and stability

    Would you download a library? Of course!

    Setting up your shooting range

    Using PyArrow for Python

    C++ for the 1337 coders

    Go, Arrow, go!

    Summary

    References

    2

    Working with Key Arrow Specifications

    Technical requirements

    Playing with data, wherever it might be!

    Working with Arrow tables

    Accessing data files with PyArrow

    Accessing data files with Arrow in C++

    Bears firing arrows

    Putting pandas in your quiver

    Making pandas run fast

    Keeping pandas from running wild

    Polar bears use Rust-y arrows

    Sharing is caring… especially when it’s your memory

    Diving into memory management

    Managing buffers for performance

    Crossing boundaries

    Summary

    3

    Format and Memory Handling

    Technical requirements

    Storage versus runtime in-memory versus message-passing formats

    Long-term storage formats

    In-memory runtime formats

    Message-passing formats

    Summing up

    Passing your Arrows around

    What is this sorcery?!

    Producing and consuming Arrows

    Learning about memory cartography

    The base case

    Parquet versus CSV

    Mapping data into memory

    Too long; didn’t read (TL;DR) – computers are magic

    Leaving the CPU – using device memory

    Starting with a few pointers

    Device-agnostic buffer handling

    Summary

    Part 2: Interoperability with Arrow: The Power of Open Standards

    4

    Crossing the Language Barrier with the Arrow C Data API

    Technical requirements

    Using the Arrow C data interface

    The ArrowSchema structure

    The ArrowArray structure

    Example use cases

    Using the C data API to export Arrow-formatted data

    Importing Arrow data with Python

    Exporting Arrow data with the C Data API from Python to Go

    Streaming Arrow data between Python and Go

    What about non-CPU device data?

    The ArrowDeviceArray struct

    Using ArrowDeviceArray

    Other use cases

    Some exercises

    Summary

    5

    Acero: A Streaming Arrow Execution Engine

    Technical requirements

    Letting Acero do the work for you

    Input shaping

    Value casting

    Types of functions in Acero

    Invoking functions

    Using the C++ compute library

    Using the compute library in Python

    Picking the right tools

    Adding a constant value to an array

    Compute Add function

    A simple for loop

    Using std::for_each and reserve space

    Divide and conquer

    Always have a plan

    Where does Acero fit?

    Acero’s core concepts

    Let’s get streaming!

    Simplifying complexity

    Summary

    6

    Using the Arrow Datasets API

    Technical requirements

    Querying multifile datasets

    Creating a sample dataset

    Discovering dataset fragments

    Filtering data programmatically

    Expressing yourself – a quick detour

    Using expressions for filtering data

    Deriving and renaming columns (projecting)

    Using the Datasets API in Python

    Creating our sample dataset

    Discovering the dataset

    Using different file formats

    Filtering and projecting columns with Python

    Streaming results

    Working with partitioned datasets

    Writing partitioned data

    Connecting everything together

    Summary

    7

    Exploring Apache Arrow Flight RPC

    Technical requirements

    The basics and complications of gRPC

    Building modern APIs for data

    Efficiency and streaming are important

    Arrow Flight’s building blocks

    Horizontal scalability with Arrow Flight

    Adding your business logic to Flight

    Other bells and whistles

    Understanding the Flight Protobuf definitions

    Using Flight, choose your language!

    Building a Python Flight server

    Building a Go Flight server

    What is Flight SQL?

    Setting up a performance test

    Everyone gets a containerized development environment!

    Running the performance test

    Flight SQL, the new kid on the block

    Summary

    8

    Understanding Arrow Database Connectivity (ADBC)

    Technical requirements

    ODBC takes an Arrow to the knee

    Lost in translation

    Arrow adoption in ODBC drivers

    The benefits of standards around connectivity

    The ADBC specification

    ADBC databases

    ADBC connections

    ADBC statements

    ADBC error handling

    Using ADBC for performance and adaptability

    ADBC with C/C++

    Using ADBC with Python

    Using ADBC with Go

    Summary

    9

    Using Arrow with Machine Learning Workflows

    Technical requirements

    SPARKing new ideas on Jupyter

    Understanding the integration of Arrow in Spark

    Containerization makes life easier

    SPARKing joy with Arrow and PySpark

    Facehuggers implanting data

    Setting up your environment

    Proving the benefits by checking resource usage

    Using Arrow with the standard tools for ML

    More GPU, more speed!

    Summary

    Part 3: Real-World Examples, Use Cases, and Future Development

    10

    Powered by Apache Arrow

    Swimming in data with Dremio Sonar

    Clarifying Dremio Sonar’s architecture

    The library of the gods…of data analysis

    Spicing up your data workflows

    Arrow in the browser using JavaScript

    Gaining a little perspective

    Taking flight with Falcon

    An Influx of connectivity

    Summary

    11

    How to Leave Your Mark on Arrow

    Technical requirements

    Contributing to open source projects

    Communication is key

    You don’t necessarily have to contribute code

    There are a lot of reasons why you should contribute!

    Preparing your first pull request

    Creating and navigating GitHub issues

    Setting up Git

    Orienting yourself in the code base

    Building the Arrow libraries

    Creating the pull request

    Understanding Archery and the CI configuration

    Find your interest and expand on it

    Getting that sweet, sweet approval

    Finishing up with style!

    C++ code styling

    Python code styling

    Go code styling

    Summary

    12

    Future Development and Plans

    Globetrotting with data – GeoArrow and GeoParquet

    Collaboration breeds success

    Expanding ADBC adoption

    Final words

    Index

    Other Books You May Enjoy

    Preface

To quote a famous blue hedgehog, Gotta Go Fast! When it comes to data, speed is important. Whether you’re collecting or analyzing data or developing utilities for others to do so, performance and efficiency are going to be huge factors in your technology choices, not just in the efficiency of the software itself, but also in development time. You need the right tools and the right technology, or you’re dead in the water.

    The Apache Arrow ecosystem is developer-centric, and this book is no different. Get started with understanding what Arrow is and how it works, then learn how to utilize it in your projects. You’ll find code examples, explanations, and diagrams here, all with the express purpose of helping you learn. You’ll integrate your data sources with Python DataFrame libraries such as pandas or NumPy and utilize Arrow Flight to create efficient data services.

With real-world datasets, you’ll learn how to leverage Apache Arrow with Apache Spark and other technologies. Apache Arrow’s format is language-independent and organized so that analytical operations are performed extremely quickly on modern CPU and GPU hardware. Join the industry adoption of this open source data format and save yourself valuable development time when creating high-performance, memory-efficient analytical workflows.

    This book has been a labor of love to share knowledge. I hope you learn a lot from it! I sure did when writing it.

    Who this book is for

    This book is for developers, data analysts, and data scientists looking to explore the capabilities of Apache Arrow from the ground up. This book will also be useful for any engineers who are working on building utilities for data analytics, query engines, or otherwise working with tabular data, regardless of the language they are programming in.

    What this book covers

Chapter 1, Getting Started with Apache Arrow, introduces you to the basic concepts underpinning Apache Arrow. It explains the Arrow format and the data types it supports, along with how they are represented in memory. Afterward, you’ll set up your development environment and run some simple code examples showing the basic operation of the Arrow libraries.

Chapter 2, Working with Key Arrow Specifications, continues your introduction to Apache Arrow by explaining how to read both local and remote data files using different formats. You’ll learn how to integrate Arrow with the Python pandas and Polars libraries and how to utilize the zero-copy aspects of Arrow to share memory for performance.

Chapter 3, Format and Memory Handling, discusses the relationships between Apache Arrow and Apache Parquet, Feather, Protocol Buffers, JSON, and CSV data, along with when and why to use these different formats. Following this, the Arrow IPC format is introduced and described, along with an explanation of using memory mapping to further improve performance. Finally, we wrap up with some basic leveraging of Arrow on a GPU.

Chapter 4, Crossing the Language Barrier with the Arrow C Data API, introduces the titular C Data API for efficiently passing Apache Arrow data between different language runtimes and devices. This chapter covers the struct definitions utilized for this interface, along with the use cases that make it beneficial.

Chapter 5, Acero: A Streaming Arrow Execution Engine, describes how to utilize the reference implementation of an Arrow computation engine named Acero. You’ll learn when and why you should use the compute engine to perform analytics rather than implementing something yourself, and why we’re seeing Arrow show up in many popular execution engines.

Chapter 6, Using the Arrow Datasets API, demonstrates querying, filtering, and otherwise interacting with multi-file datasets that may be spread across multiple sources. Partitioned datasets are also covered, along with utilizing Acero to perform streaming filtering and other operations on the data.

Chapter 7, Exploring Apache Arrow Flight RPC, examines the Flight RPC protocol and its benefits. You will be walked through building a simple Flight server and client in multiple languages to produce and consume tabular data.

Chapter 8, Understanding Arrow Database Connectivity (ADBC), introduces and explains an Apache Arrow-based alternative to ODBC/JDBC and why it matters for the ecosystem. You will be walked through several examples with sample code that interact with multiple database systems, such as DuckDB and PostgreSQL.

Chapter 9, Using Arrow with Machine Learning Workflows, integrates multiple concepts that have been covered to explain the various ways that Apache Arrow can be utilized to improve parts of data pipelines and the performance of machine learning model training. It describes how Arrow’s interoperability and defined standards make it ideal for use with Spark, GPU compute, and many other tools.

Chapter 10, Powered by Apache Arrow, provides a few examples of current real-world usage of Apache Arrow, such as Dremio, Spice.AI, and InfluxDB.

Chapter 11, How to Leave Your Mark on Arrow, provides a brief introduction to contributing to open source projects in general, and specifically how to contribute to the Arrow project itself. You will be walked through finding starter issues, setting up your first pull request, and what to expect when doing so. To that end, this chapter also contains instructions on locally building the Arrow C++, Python, and Go libraries from source to test your contributions.

Chapter 12, Future Development and Plans, wraps up the book by examining the features that are still in development at the time of writing. This includes geospatial integrations with GeoArrow and GeoParquet, along with expanding Arrow Database Connectivity (ADBC) adoption. Finally, there are some parting words and a challenge from me to you.

    To get the most out of this book

    It is assumed that you have a basic understanding of writing code in at least one of C++, Python, or Go to benefit from and use the code snippets. You should know how to compile and run code in the desired language. Some familiarity with basic concepts of data analysis will help you get the most out of the scenarios and use cases explained in this book. Beyond this, concepts such as tabular data and installing software on your machine are assumed to be understood rather than explained.

    The sample data is in the book’s GitHub repository. You’ll need to use Git Large File Storage (LFS) or a browser to download the large data files. There are also a couple of larger sample data files in publicly accessible AWS S3 buckets. The book will provide a link to download the files when necessary. Code examples are provided in C++, Python, and Go.

If you are using the digital version of this book, we advise you to access the complete code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

    Take your time, enjoy, and experiment in all kinds of ways, and please, have fun with the exercises!

    Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/In-Memory-Analytics-with-Apache-Arrow-Second-Edition. If there’s an update to the code, it will be updated in the GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

    Conventions used

    There are a number of text conventions used throughout this book.

    Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: We’re using PyArrow in this example, but if you have the ArrowDeviceArray struct definition, you could create and populate the struct without ever needing to directly include or link against the Arrow libraries!

    A block of code is set as follows:

    >>> import numba.cuda

    >>> import pyarrow as pa

    >>> from pyarrow import cuda

    >>> import numpy as np

    >>> from pyarrow.cffi import ffi

    When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

std::unique_ptr<arrow::ArrayBuilder> tmp;
// returns a status, handle the error case
arrow::MakeBuilder(arrow::default_memory_pool(), st_type, &tmp);

std::shared_ptr<arrow::StructBuilder> builder;
builder.reset(static_cast<arrow::StructBuilder*>(tmp.release()));

    Any command-line input or output is written as follows:

    $ mkdir arrow_chapter1 && cd arrow_chapter1

    $ go mod init arrow_chapter1

    $ go get -u github.com/apache/arrow/go/v17/arrow@latest

    Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: You'll notice that for the Filter and Project nodes in the figure, since they each use a compute expression, there is a sub-tree of the execution graph representing the expression tree.

    Tips or important notes

    Appear like this.

    Get in touch

    Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

    Share Your Thoughts

Once you’ve read In-Memory Analytics with Apache Arrow, Second Edition, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

    Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

    Download a free PDF copy of this book

    Thanks for purchasing this book!

    Do you like to read on the go but are unable to carry your print books everywhere?

    Is your eBook purchase not compatible with the device of your choice?

    Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.

    Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.

The perks don’t stop there; you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.

    Follow these simple steps to get the benefits:

    Scan the QR code or visit the link below

    https://packt.link/free-ebook/9781835461228

    Submit your proof of purchase

    That’s it! We’ll send your free PDF and other benefits to your email directly

    Part 1: Overview of What Arrow is, Its Capabilities, Benefits, and Goals

    This section is an introduction to Apache Arrow as a format specification and a project, the benefits it claims, and the goals it’s trying to achieve. You’ll also find a high-level overview of basic use cases and examples.

    This part has the following chapters:

Chapter 1, Getting Started with Apache Arrow

Chapter 2, Working with Key Arrow Specifications

Chapter 3, Format and Memory Handling

    1

    Getting Started with Apache Arrow

    Regardless of whether you’re a data scientist/engineer, a machine learning (ML) specialist, or a software engineer trying to build something to perform data analytics, you’ve probably heard of or read about something called Apache Arrow and either looked for more information or wondered what it was. Hopefully, this book can serve as a springboard in understanding what Apache Arrow is and isn’t, as well as a reference book to be continuously utilized so that you can supercharge your analytical capabilities.

    For now, we’ll start by explaining what Apache Arrow is and what you will use it for. Following that, we will walk through the Arrow specifications, set up a development environment where you can play around with the various Apache Arrow libraries, and walk through a few simple exercises so that you can get a feel for how to use them.

    In this chapter, we’re going to cover the following topics:

    Understanding the Arrow format and specifications

    Why does Arrow use a columnar in-memory format?

    Learning the terminology and the physical memory layout

    Arrow format versioning and stability

    Setting up your shooting range

    Technical requirements

    For the portion of this chapter that describes how to set up a development environment for working with various Arrow libraries, you’ll need the following:

    Your preferred integrated development environment (IDE) – for example, VS Code, Sublime, Emacs, or Vim

    Plugins for your desired language (optional but highly recommended)

    An interpreter or toolchain for your desired language(s):

Python 3.8+: pip and venv and/or pipenv (a minimal setup sketch follows this list)

    Go 1.21+

    C++ Compiler (capable of compiling C++17 or newer)
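If you want to follow along in Python right away, a minimal setup might look like the following sketch. The directory name is just an example, and any virtual environment tool (venv, pipenv, or conda) will work equally well:

$ mkdir arrow_playground && cd arrow_playground
$ python -m venv .venv && source .venv/bin/activate
$ pip install pyarrow

The Setting up your shooting range section later in this chapter walks through the full environment setup for each language.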

    Understanding the Arrow format and specifications

    The Apache Arrow documentation states the following [1]:

    Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.

Well, that’s a lot of technical jargon! Let’s start from the top. Apache Arrow (just Arrow for brevity) is an open source project from the Apache Software Foundation (https://apache.org) that is released under the Apache License, version 2.0 [2]. It was co-created by Jacques Nadeau and Wes McKinney, the creator of pandas, and first released in 2016. Simply put, Arrow is a collection of libraries and specifications that make it easy to build high-performance software utilities for processing and transporting large datasets: specifications for in-memory layouts and protocols for sharing and efficiently transporting data between systems and processes, plus libraries that implement them. When we talk about in-memory data processing, we mean processing data in RAM while eliminating slow data access (as well as redundant copying and converting of data) wherever possible to improve performance. This is where Arrow excels, providing libraries with utilities for streaming and transport that speed up data access.

    When working with data, there are two primary situations to consider, and each has different needs: the in-memory format and the on-disk format. When data is stored on disk, the biggest concerns are the size of the data and the input/output (I/O) cost to read it into the main memory before you can operate on it. As a result, formats for data on disk tend to focus much more on increasing I/O throughput, such as compressing the data to make it smaller and faster to read into memory. One example of this might be the Apache Parquet format, which is a columnar on-disk file format. Instead of being an on-disk format, Arrow’s focus is on the in-memory format, which targets processing efficiency, with numerous tactics such as cache locality and vectorization of computation.
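To make the on-disk versus in-memory distinction concrete, here is a small sketch using PyArrow (the file name is hypothetical): it reads a compressed Parquet file from disk and decodes it into Arrow’s in-memory columnar format, where the columns can then be operated on directly:

import pyarrow.parquet as pq

# Decompress and decode the on-disk Parquet representation into
# Arrow's in-memory columnar format (a pyarrow.Table).
table = pq.read_table("example.parquet")  # hypothetical file name

# Each column now lives in RAM as typed Arrow arrays, ready for
# analytic operations without any further conversion.
print(table.schema)
print(table.num_rows)

The same table could just as easily have come from CSV, a database, or another process; once it’s in Arrow’s format, downstream processing code doesn’t care where it originated.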

The primary goal of Arrow is to become the lingua franca of data analytics and processing – the One Format to Rule Them All, so to speak. Different databases, programming languages, and libraries tend to implement and use separate internal formats for managing data, which means that any time you’re moving data between these components, you’re paying a cost to serialize and deserialize it. Not only that, but lots of time and resources get spent reimplementing common algorithms and processing in those different data formats over and over. If we can standardize on an efficient, feature-rich internal data format that can be widely adopted and used instead, this excess computation and development time is no longer necessary. Figure 1.1 shows a simplified diagram of multiple systems, each with their own data formats, having to be copied and/or converted for the different components to work with each other:

Figure 1.1 – Copy and convert components

    In many cases, the serialization and deserialization processes can end up taking nearly 90% of the processing time in such a system and prevent you from being able to spend that CPU on analytics. Alternatively, if every component is using Arrow’s in-memory format, you end up with a system similar to the one shown in Figure 1.2, where the data can be transferred between components at little-to-no cost. All the components can either share memory directly or send the data as-is without having to convert between different formats:

Figure 1.2 – Sharing Arrow memory between components

At this point, there’s no need for the different components and systems to implement custom connectors or re-implement common algorithms and utilities. The same libraries and connectors can be utilized, even across programming languages and process barriers, by sharing memory directly so that it refers to the same data rather than copying it multiple times between them. An example of this idea will be covered in Chapter 8, Understanding Arrow Database Connectivity (ADBC), where we’ll consider a specification for leveraging common database drivers in a cross-platform way to enable efficient interactions using Arrow-formatted data.

Most data processing systems now use distributed processing by breaking the data into chunks and sending those chunks across the network to various workers. So, even if we can share memory across processes on a box, there’s still the cost of sending it across the network. This brings us to the final piece of the puzzle: the format of raw Arrow data on the wire is the same as it is in memory. You can avoid having to deserialize that data before you can use it (skipping a copy), or you can reference the memory buffers you were operating on and send them across the network without having to serialize them first. Only a bit of metadata needs to be sent along with the raw data buffers, and interfaces that perform zero copies can be created to achieve performance benefits by reducing memory usage and improving throughput. We’ll cover this more directly in Chapter 3, Format and Memory Handling, so look forward to it!
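As a small taste of what Chapter 3 covers, the following sketch (using PyArrow and a made-up three-row table) streams data through the Arrow IPC format; the bytes written to the sink are essentially the same buffers that were sitting in memory, preceded by a little schema metadata:

import pyarrow as pa

# Build a small in-memory table (hypothetical sample data).
table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Write it as an Arrow IPC stream: the record batch buffers go onto
# the wire as-is, prefixed by schema and metadata messages.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)

# Reading it back does not require decoding the values themselves.
received = pa.ipc.open_stream(sink.getvalue()).read_all()
assert received.equals(table)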

    Let’s quickly recap the features of the Arrow format we just described before moving on:

    Using the same high-performance internal format across components allows for much more code reuse in libraries instead of the need to reimplement common workflows.

The Arrow libraries provide mechanisms to directly share memory buffers to reduce copying between processes by using the same internal representation, regardless of the language. This is what’s being referred to whenever you see the term zero-copy (a small sketch follows this list).

    The wire format is the same as the in-memory format to eliminate serialization and deserialization costs when sending data across networks between components of a system.
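To illustrate what zero-copy looks like in practice, here’s a tiny PyArrow sketch; with zero_copy_only=True, PyArrow will only hand back a NumPy array if it can be a view over the existing Arrow buffer rather than a copy (which works for primitive arrays without nulls):

import pyarrow as pa

# A primitive Arrow array with no nulls, backed by a single buffer.
arr = pa.array([1, 2, 3, 4], type=pa.int64())

# The resulting NumPy array is a view over the same memory buffer,
# not a copy; PyArrow raises an error if a copy would be required.
view = arr.to_numpy(zero_copy_only=True)
print(view)        # [1 2 3 4]
print(view.dtype)  # int64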

    Now, you might be thinking, Well, this sounds too good to be true! And of course, being skeptical of promises like this is always a good idea. The community around Arrow has done a ton of work over the years to bring these ideas and concepts to fruition. The project itself provides and distributes libraries in a variety of different programming languages so that projects that want to incorporate and/or support the Arrow format don’t need to implement it themselves. Above and beyond the interaction with Arrow-formatted data, the libraries provide a significant amount of utility in assisting with common processes such as data access and I/O-related optimizations. As a result, the Arrow libraries can be useful for projects, even if they don’t utilize the Arrow format themselves.

    Here’s just a quick sample of use cases where using Arrow as the internal/intermediate data format can be very beneficial:

    SQL execution engines (such as Dremio Sonar, InfluxDB, or Apache DataFusion)

    Data analysis utilities and pipelines (such as pandas or Apache Spark)

    Streaming and message queue systems (such as Apache Kafka or Storm)

    Storage systems and formats (such as Apache Parquet, Cassandra, and Kudu)

    As for how Arrow can help you, it depends on which piece of the data puzzle you work with. The following are a few different roles that work with data and show how using Arrow could potentially be beneficial; it’s by no means a complete list though:

    If you’re a data scientist:

    You can utilize Arrow via Polars or pandas and NumPy integration to significantly improve the performance of your data manipulations.

    If the tools you use integrate Arrow support, you can gain significant speed-ups for your queries and computations by using Arrow directly to reduce copies and/or serialization costs.

    If you’re a data engineer specializing in extract, transform, and load (ETL):

    The higher adoption of Arrow as an internal and externally-facing format can make it easier to integrate with many different utilities.

    By using Arrow, data can be shared between processes and tools, with shared memory increasing the tools available to you for building pipelines, regardless of the language you’re operating in. You could take data from Python, use it in Spark, and then pass it directly to the Java virtual machine (JVM) without paying the cost of copying between them.

    If you’re a software engineer or ML specialist building computation tools and utilities for data analysis:

    Arrow, as an internal format, can be used to improve your memory usage and performance by reducing serialization and deserialization between components.

    Understanding how to best utilize the data transfer protocols can improve your ability to parallelize queries and access your data, wherever it might be.

    Because Arrow can be used for any sort of tabular data, it can be integrated into many different areas of data analysis and computation pipelines and is versatile enough to be beneficial as an internal and data transfer format, regardless of the shape of your data.

    Now that you know what Arrow is, let’s dig into its design and how it delivers on the aforementioned promises of high-performance analytics, zero-copy sharing, and network communication without serialization costs. First, you’ll see why a column-oriented memory representation was chosen for Arrow’s internal format. In later chapters, we’ll cover specific integration points, explicit examples, and transfer protocols.

    Why does Arrow use a columnar in-memory format?

    There is often a lot of debate surrounding whether a database should be row-oriented or column-oriented, but this primarily refers to the on-disk format of the underlying storage files. Arrow’s data format is different from most cases discussed so far since it uses a columnar organization of data structures in memory directly. If you’re not familiar with columnar as a term, let’s take a look at what it means. First, imagine the following table of data:

Figure 1.3 – Sample data table

    Traditionally, if you were to read this table into memory, you’d likely have some structure to represent a row and then read the data in one row at a
