
I spent 6 hours learning Apache Arrow: Overview

blog.det.life/i-spent-6-hours-learning-apache-arrow-overview-e7f3b8ee85b2

Vu Trinh October 27, 2024


Why do we need a standard memory format for analytics workloads?

This article was published in my newsletter 14 days ago. Subscribe for free at
https://vutr.substack.com to get my writing delivered straight to your inbox sooner.

Intro
This week, we will explore one of the most exciting data-related projects at the moment:
Apache Arrow. The article will be structured as follows: first, we will understand what Arrow is
and the motivation behind it. Then, we will learn about the physical data layout of the Arrow
array and how Arrow data is serialized. Finally, we will explore how Arrow can bring immense
value to the analytics world.

What?
I will bring the definition from Arrow’s official documentation here:

The Arrow columnar format includes a language-agnostic in-memory data structure
specification, metadata serialization, and a protocol for serialization and generic data
transport. It provides analytical performance and data locality guarantees in exchange
for comparatively more expensive mutation operations.

The Apache Arrow format project began in February 2016, focusing on columnar in-memory
analytics workloads. Unlike file formats such as Parquet or CSV, which specify how data is
organized on disk, Arrow focuses on how data is organized in memory.

Image created by Canva Image Generator.

The creators set out to make Arrow a community-standard in-memory format for analytics
workloads. This goal attracted many contributors from projects such as Pandas, Spark,
Cassandra, Apache Calcite, Dremio, and Ibis.

Apache Arrow aims to achieve two things:

- Efficient data processing for analytics workloads, by designing the format to take advantage of modern CPU characteristics.
- Sharing data between systems at low or zero cost.

When two systems communicate, each converts its data into a standard format before
transferring it. However, this process incurs serialization and deserialization costs. The idea
behind Apache Arrow is to provide a format that is highly efficient for processing within a single
system. As more systems adopt this data representation, they can share data at a very low
cost, potentially even at zero cost through shared memory. This is the core of Arrow’s
design: it is a library that can be embedded in many systems, such as execution engines,
analytics tools, or storage layers.

Terminology
Before going further, let’s check out some terminology in the Arrow world:

- Array: a sequence of values with a defined length, all sharing the same type.
- Slot: a single logical value within an array of a specific data type.
- Buffer (contiguous memory region): a sequential virtual address space with a fixed length, where any byte can be accessed via a pointer offset within the region’s length.
- Physical layout: the underlying memory structure of an array, without considering its value semantics.
- Logical type: an application-level value type implemented using a specific physical layout. For example, Decimal128 values are stored as 16 bytes in a fixed-size binary layout, while a timestamp might be stored in a 64-bit fixed-size layout.
- Primitive type: a data type with no child types, such as fixed bit-width, variable-size binary, and null types.
- Nested type: a data type whose structure depends on one or more child types.

For the rest of this article, I will illustrate the concepts using arrays of primitive types. For
other data types, you can check Arrow’s documentation.

Array Physical Memory Layout


A few pieces of metadata and data define arrays:

Image created by the author.

- The array’s length, a 64-bit signed integer, and the null count, also a 64-bit signed integer.
- A data type.
- An optional dictionary for dictionary-encoded arrays.
- A sequence of buffers:
  - Validity bitmap: almost all array types have a dedicated memory buffer, known as the validity bitmap, which encodes the null information for each of the array’s slots.
  - Offsets buffer: some array types, such as the variable-size binary layout, have an offsets buffer to locate the start position of each slot in the data buffer.
  - Data buffer: the buffer(s) containing the array’s data.
  - Some complex types have additional buffers, such as a sizes buffer or a types buffer.

Memory Alignment
When working with Apache Arrow, memory should be allocated at aligned addresses —
typically in multiples of 8 or 64 bytes. Additionally, padding (over-allocating memory) is
encouraged to ensure the total length is a multiple of 8 or 64 bytes.

Memory alignment refers to a memory address that is a multiple of a specific value, known as
the alignment boundary, such as 4, 8, or 64 bytes. Aligned memory is crucial for
performance because CPUs are optimized to handle data on these boundaries,
allowing faster access. Misaligned data forces the CPU to perform extra operations,
slowing things down.

Padding in memory refers to the practice of adding extra, unused bytes between data elements
or at the end of a data block to ensure proper alignment. This is often done to make
sure that subsequent data starts at a correctly aligned memory address, adhering to
alignment boundaries such as 8 or 64 bytes. Padding helps maintain efficient memory
access; in return, it increases memory usage.

This alignment follows Intel’s performance guidelines, which suggest matching memory
alignment to SIMD register widths, particularly for the AVX-512 architecture.

SIMD (Single Instruction, Multiple Data) is a processing technique that allows a CPU to
perform the same operation on multiple data points simultaneously. This is achieved
through specialized instructions and registers that can handle multiple values at once.
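As a rough illustration, the sketch below (assuming pyarrow is installed; the 64-byte boundary reflects pyarrow's default allocator and may differ in other builds) checks that an array's data buffer is aligned and padded:

```python
import pyarrow as pa

arr = pa.array(list(range(10)), type=pa.int32())

# buffers()[0] is the validity bitmap (None here, since there are no nulls);
# buffers()[1] is the data buffer.
data_buffer = arr.buffers()[1]

print(data_buffer.address % 64 == 0)  # True with the default 64-byte-aligned allocator
print(data_buffer.size)               # at least 10 * 4 bytes; may be padded further
```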

An example of the Fixed-size Primitive Array layout


A primitive value array represents values with the same slot width.

Image created by the author.

An example of the Variable-size Binary Array layout

Each value in this layout consists of 0 or more bytes. A variable-size binary array has an
additional buffer, called the offsets buffer, in addition to the data buffer.

The offset buffer’s length equals the value array’s length + 1. This buffer encodes each slot’s
start position in the data buffer. The value length in each slot is computed using the
difference between the offset at that slot’s index and the subsequent offset.

Image created by the author.

A null value may have a positive slot length and take non-empty memory space in the data
buffer. In such cases, the content of the corresponding memory space is undefined. Offsets
must increase monotonically, even for null slots, ensuring that all values’ locations are valid
and well-defined. Typically, the first slot in the offsets array is 0, and the last slot is the length
of the values array.
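Here is a minimal sketch (assuming pyarrow and numpy are installed; the sample strings are arbitrary) that exposes the offsets buffer of a string array:

```python
import numpy as np
import pyarrow as pa

arr = pa.array(["a", "bcd", None, "ef"])  # string arrays use 32-bit offsets

validity, offsets_buf, data_buf = arr.buffers()

# length + 1 offsets; slot i's length = offsets[i + 1] - offsets[i]
offsets = np.frombuffer(offsets_buf, dtype=np.int32)[: len(arr) + 1]
print(offsets.tolist())                      # [0, 1, 4, 4, 6] -- monotonically increasing
print(data_buf.to_pybytes()[: offsets[-1]])  # b'abcdef' -- the concatenated values
```

Note how the null slot keeps the same offset as its neighbor (4), so it occupies no space in the data buffer here.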

For more layouts of different array types, you can check Arrow Official Documentation.

Serialization and Interprocess Communication (IPC)

This section describes the Arrow protocol for efficiently transferring and processing
data between processes.

The unit of serialized data in Arrow is the “record batch.” A record batch is a collection of
arrays, known as its fields, each with a potentially different data type. The field names and
types collectively form the batch’s schema.

Image created by the author.
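For illustration, a minimal sketch (assuming pyarrow is installed; the column names are arbitrary) of building a record batch and reading its schema:

```python
import pyarrow as pa

batch = pa.record_batch(
    [
        pa.array([1, 2, 3], type=pa.int64()),
        pa.array(["a", "b", "c"]),
    ],
    names=["id", "label"],
)

print(batch.schema)    # id: int64, label: string
print(batch.num_rows)  # 3 -- all fields are equal-length arrays
```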

The Arrow protocol utilizes a one-way stream of binary messages of these types:

- Schema: defines the structure of the data. It consists of a list of fields, each with a name and a data type (int, float, string, etc.). A serialized Schema does not contain any data buffers.
- RecordBatch: contains the actual data buffers. A RecordBatch contains a collection of equal-length arrays, each corresponding to a column described in the schema. The metadata for this message provides the location and size of each buffer, allowing the arrays to be reconstructed using pointer arithmetic and thus avoiding memory copying. The serialized form of a record batch has a data header and a body: the body includes the arrays’ memory buffers, while the header contains the length and null count for each flattened field, plus the memory offset and size of each buffer within the record batch’s body.
- DictionaryBatch: a specialized batch used for dictionary encoding, an efficient way to store categorical data. It contains a dictionary, or lookup table, where the unique values are stored. Dictionary-encoded fields refer to indices in this dictionary rather than storing the full values directly, saving space and improving performance (see the sketch below).
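Dictionary encoding is easy to see in code. A minimal sketch (assuming pyarrow is installed; the color values are arbitrary):

```python
import pyarrow as pa

colors = pa.array(["red", "green", "red", "red", "green"])
encoded = colors.dictionary_encode()

print(encoded.dictionary)  # ["red", "green"] -- the unique values
print(encoded.indices)     # [0, 1, 0, 0, 1]  -- references into the dictionary
```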

Arrow supports two binary formats for serializing RecordBatches:

- Streaming format: used for sending an arbitrary-length sequence of record batches. This format must be processed sequentially from start to end and does not support random access. The schema appears first in the stream. If any fields in the schema are dictionary-encoded, one or more DictionaryBatch messages will be included.
- File (random access) format: used for serializing a fixed number of record batches, with support for random access. The file begins and ends with the magic string “ARROW1.” The file contents are otherwise identical to the streaming format. At the end of the file is a footer containing a redundant copy of the schema along with the memory offsets and sizes of each data block, which allows random access to any record batch within the file (see the sketch below).
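A minimal sketch (assuming pyarrow is installed; the in-memory sinks are just for demonstration) that writes the same batch in both formats and checks the file format’s magic string:

```python
import pyarrow as pa

batch = pa.record_batch([pa.array([1, 2, 3])], names=["x"])

# Streaming format: schema first, then record batches; sequential access only.
stream_sink = pa.BufferOutputStream()
with pa.ipc.new_stream(stream_sink, batch.schema) as writer:
    writer.write_batch(batch)
stream_buf = stream_sink.getvalue()

# File (random access) format: framed by the "ARROW1" magic string, footer at the end.
file_sink = pa.BufferOutputStream()
with pa.ipc.new_file(file_sink, batch.schema) as writer:
    writer.write_batch(batch)
file_buf = file_sink.getvalue()

print(file_buf.to_pybytes()[:6])              # b'ARROW1'
print(pa.ipc.open_file(file_buf).read_all())  # random-access reader over the file format
```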

How does Apache Arrow bring value?

Performance
Arrow positions itself for adoption in data analytics workloads that access subsets of
attributes (columns) rather than individual data records.

As mentioned, Arrow organizes data in a column-by-column format within a record batch.


This design is highly advantageous for data analytics workloads, which typically focus on a
subset of columns at a time and scan through large numbers of rows to aggregate values.
Storing data in a columnar fashion enables high-performance, sequential access patterns
ideal for these tasks.

Additionally, storing data column-by-column offers further benefits for analytical workloads,
such as enabling SIMD acceleration and improving compression rates. One additional factor
that ensures Arrow provides processing efficiency is memory alignment.
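As a small sketch of that access pattern (assuming pyarrow is installed; the column contents are arbitrary), an aggregation touches only the one contiguous column buffer it needs:

```python
import pyarrow as pa
import pyarrow.compute as pc

batch = pa.record_batch(
    [pa.array([1.0, 2.5, 4.0, 7.5]), pa.array(["a", "b", "c", "d"])],
    names=["amount", "label"],
)

# Scans the contiguous "amount" buffer; the "label" column is never touched.
print(pc.sum(batch.column(0)))  # 15.0
```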

Interoperability
Traditionally, when moving data from one system to another, we had to rewrite the data into a
simpler intermediate representation. This representation would then be passed
to the other system, where it would be rewritten to fit that system’s proprietary format. Rewriting data
before export is called “serialization,” and rewriting it back on import is called
“deserialization.” These serialization and deserialization CPU costs were unavoidable when
moving data between systems.

Before Arrow, each system used its own internal memory format, which wasted significant CPU
resources on serialization and deserialization. With Arrow, this changes: systems that adopt
the same memory format can share data without that cross-system conversion overhead.
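A minimal sketch of that low-cost sharing (assuming pyarrow and numpy are installed): converting an Arrow array to a NumPy array can be a view over the same buffer, not a copy:

```python
import pyarrow as pa

arr = pa.array([1.5, 2.5, 3.5], type=pa.float64())

# zero_copy_only=True raises an error if a copy would be required,
# so success here means the NumPy array is a view over the Arrow buffer.
np_view = arr.to_numpy(zero_copy_only=True)
print(np_view)  # [1.5 2.5 3.5]
```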

Image created by the author.

Apache Arrow promises low-cost or zero-cost data sharing between systems by providing an
IPC format (IPC stream and IPC file) that allows data to be passed seamlessly between
processes without re-serialization, making inter-process communication faster and more
efficient.

Arrow IPC files can be memory-mapped, allowing us to work with datasets that exceed the
available memory. This enables seamless data sharing across different languages and
processes.

A memory-mapped file is a segment of virtual memory that has been assigned a
direct byte-for-byte correlation with some portion of a file or file-like resource. The
benefit of memory mapping a file is increasing I/O performance, especially when used
on large files.

— Wikipedia —
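Here is a minimal sketch (assuming pyarrow is installed; /tmp/example.arrow is a hypothetical path) of writing an IPC file and reading it back through a memory map:

```python
import pyarrow as pa

batch = pa.record_batch([pa.array(range(1_000_000))], names=["x"])

# Write the file (random access) format to disk.
with pa.OSFile("/tmp/example.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, batch.schema) as writer:
        writer.write_batch(batch)

# Memory-map it: record batches reference the mapped pages directly
# instead of being copied into process memory up front.
with pa.memory_map("/tmp/example.arrow", "r") as source:
    table = pa.ipc.open_file(source).read_all()
    print(table.num_rows)  # 1000000
```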

Arrow also excels at moving data over the network. The format supports serializing and
transferring columnar data across the network or other streaming transports. Apache Spark,
for instance, uses Arrow as a data interchange format. Big names like Google BigQuery,
TensorFlow, and AWS Athena also use Arrow to streamline data operations.

Moreover, the Arrow project defines Flight, a client-server RPC framework. Flight helps
users build robust services for exchanging data based on application-specific needs, making
data handling even more efficient and customizable.
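As a hedged sketch of what a Flight consumer looks like (assuming pyarrow is built with Flight support; the endpoint grpc://localhost:8815 and the ticket name are hypothetical placeholders, and a server must already be running there):

```python
import pyarrow.flight as flight

client = flight.connect("grpc://localhost:8815")           # hypothetical server address
reader = client.do_get(flight.Ticket(b"example-dataset"))  # hypothetical ticket
table = reader.read_all()  # Arrow record batches streamed over gRPC, no re-serialization
print(table.schema)
```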

To see Arrow’s ubiquity, you can visit the list of projects that leverage Apache Arrow here.
Some notable projects include Spark, AWS Data Wrangler, ClickHouse, Dask, Dremio,
InfluxDB IOx, MATLAB, pandas, Polars, and Ray.

Outro

In this article, we explored an overview of Apache Arrow, from its definition and motivation to
its physical memory layout and how its data is serialized.

Thank you for reading this far. If you notice any points needing correction or want to discuss
more about Arrow, feel free to leave a comment.

Now, it’s time to say goodbye. See you in the next blog!
