Arrow Cookbook
Recipes related to reading and writing data from disk using Apache Arrow.
Given an array with 100 numbers, from 0 to 99:

import numpy as np
import pyarrow as pa

arr = pa.array(np.arange(100))
print(f"{arr[0]} .. {arr[-1]}")
0 .. 99
To write it to a Parquet file, we must first create a pyarrow.Table out of it, as Parquet is a format that stores multiple named columns; this gives us a single-column table that can then be written to a Parquet file.
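A minimal sketch of that conversion (the col1 column name is an assumption, chosen to match the read examples below):

# Wrap the single array into a one-column table.
table = pa.Table.from_arrays([arr], names=["col1"])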
Once we have a table, it can be written to a Parquet file using the functions provided by the pyarrow.parquet module:
import pyarrow.parquet as pq

pq.write_table(table, "example.parquet")
The file can be read back into a pyarrow.Table with pyarrow.parquet.read_table():

import pyarrow.parquet as pq

table = pq.read_table("example.parquet")
The resulting table will contain the same columns that existed in the Parquet file, each as a ChunkedArray:
print(table)
pyarrow.Table
col1: int64
----
col1: [[0,1,2,3,4,...,95,96,97,98,99]]
When reading a Parquet file it is also possible to load only a subset of the data, by restricting the columns and filtering the rows:

import pyarrow.parquet as pq

table = pq.read_table("example.parquet",
                      columns=["col1"],
                      filters=[
                          ("col1", ">", 5),
                          ("col1", "<", 10),
                      ])
The resulting table will contain only the projected columns and the filtered rows. Refer to the pyarrow.parquet.read_table() documentation for details about the filters syntax.
print(table)
pyarrow.Table
col1: int64
----
col1: [[6,7,8,9]]
Arrays can also be saved in the Arrow IPC format, which allows them to be loaded back (or memory-mapped) directly from disk. Starting again from an array with 100 numbers, from 0 to 99:

import numpy as np
import pyarrow as pa

arr = pa.array(np.arange(100))
print(f"{arr[0]} .. {arr[-1]}")
0 .. 99
We can save the array by making a pyarrow.RecordBatch out of it and writing the record batch to
disk.
schema = pa.schema([
    pa.field('nums', arr.type)
])
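A sketch of the write itself, assuming an arraydata.arrow output file:

with pa.OSFile("arraydata.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, schema=schema) as writer:
        # Wrap the array in a record batch matching the schema and write it.
        batch = pa.record_batch([arr], schema=schema)
        writer.write(batch)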
If we were to save multiple arrays into the same file, we would just have to adapt the schema
accordingly and add them all to the record_batch call.
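To load the arrays back, the file can be memory-mapped and opened with pyarrow.ipc.open_file(); a sketch, assuming the arraydata.arrow file from the previous sketch:

with pa.memory_map("arraydata.arrow", "r") as source:
    # read_all() returns a table whose columns are the saved arrays.
    loaded_arrays = pa.ipc.open_file(source).read_all()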
arr = loaded_arrays[0]
print(f"{arr[0]} .. {arr[‐1]}")
0 .. 99
A pyarrow.Table can be written to a CSV file using pyarrow.csv.write_csv():

import pyarrow.csv

pa.csv.write_csv(table, "table.csv",
                 write_options=pa.csv.WriteOptions(include_header=True))
It is equally possible to write pyarrow.RecordBatch objects by passing them just as you would a table.
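For example (a sketch: the batch here is just the first record batch of the table, and the batch.csv file name is an assumption):

# Writing a record batch works exactly like writing a table.
batch = table.to_batches()[0]
pa.csv.write_csv(batch, "batch.csv",
                 write_options=pa.csv.WriteOptions(include_header=True))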
The CSV file can be read back into a pyarrow.Table using pyarrow.csv.read_csv():

import pyarrow.csv

table = pa.csv.read_csv("table.csv")
Arrow will do its best to infer data types; the conversion can be tuned by passing a pyarrow.csv.ConvertOptions instance as the convert_options argument of pyarrow.csv.read_csv().
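For example, a sketch that overrides the inferred type of col1 (kept in a separate variable so the table printed below is unaffected):

# Hypothetical: read col1 as float64 instead of the inferred int64.
convert_options = pa.csv.ConvertOptions(column_types={"col1": pa.float64()})
table_as_float = pa.csv.read_csv("table.csv", convert_options=convert_options)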
print(table)
pyarrow.Table
col1: int64
----
col1: [[0,1,2,3,4,...,95,96,97,98,99]]
When a dataset is large, it often makes sense to split it into multiple files, a so-called partitioned dataset. The partitioning argument of pyarrow.dataset.write_dataset() lets you specify the columns by which the data should be split. For example, given a table of 100 (day, month, year) rows spanning the years 2000 to 2009:
import numpy.random

data = pa.table({"day": numpy.random.randint(1, 31, size=100),
                 "month": numpy.random.randint(1, 12, size=100),
                 "year": [2000 + x // 10 for x in range(100)]})
Then we could partition the data by the year column so that it gets saved in 10 different files:
import pyarrow as pa
import pyarrow.dataset as ds
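A sketch of the write; the ./partitioned output directory is an assumption that matches the listing below:

# Create one subdirectory per distinct year value.
ds.write_dataset(data, "./partitioned", format="parquet",
                 partitioning=ds.partitioning(pa.schema([("year", pa.int16())])))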
Arrow partitions datasets into subdirectories by default, which here results in 10 directories named after the value of the partitioning column, each containing a file with the subset of the data for that partition:
from pyarrow import fs

localfs = fs.LocalFileSystem()
partitioned_dir_content = localfs.get_file_info(fs.FileSelector("./partitioned", recursive=True))
files = sorted((f.path for f in partitioned_dir_content if f.type == fs.FileType.File))

for file in files:
    print(file)
./partitioned/2000/part-0.parquet
./partitioned/2001/part-0.parquet
./partitioned/2002/part-0.parquet
./partitioned/2003/part-0.parquet
./partitioned/2004/part-0.parquet
./partitioned/2005/part-0.parquet
./partitioned/2006/part-0.parquet
./partitioned/2007/part-0.parquet
./partitioned/2008/part-0.parquet
./partitioned/2009/part-0.parquet
A dataset might also be composed of multiple separate files, each containing a piece of the data. In that case the pyarrow.dataset.dataset() function provides an interface to discover and read all those files as a single big dataset. For example, given a directory structure like:
examples/
├── dataset1.parquet
├── dataset2.parquet
└── dataset3.parquet
Then pointing the pyarrow.dataset.dataset() function at the examples directory will discover those Parquet files and expose them all as a single pyarrow.dataset.Dataset:
import pyarrow.dataset as ds

dataset = ds.dataset("./examples", format="parquet")
The whole dataset can be viewed as a single big table using pyarrow.dataset.Dataset.to_table().
While each parquet file contains only 10 rows, converting the dataset to a table will expose them
as a single Table.
table = dataset.to_table()
print(table)
pyarrow.Table
col1: int64
----
col1: [[0,1,2,3,4,5,6,7,8,9],[10,11,12,13,14,15,16,17,18,19],[20,21,22,23,24,25,26,27,28,29]]
Notice that converting to a table will force all data to be loaded in memory, which for big datasets is usually not what you want. It is therefore often better to iterate over the dataset with pyarrow.dataset.Dataset.to_batches(), which loads one record batch at a time.
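A sketch of such an iteration, printing the range of col1 in each record batch (this is what produces the output below):

# Stream the dataset batch by batch instead of materialising it all at once.
for record_batch in dataset.to_batches():
    col1 = record_batch.column("col1")
    print(f"col1 = {col1[0]} .. {col1[-1]}")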
col1 = 0 .. 9
col1 = 10 .. 19
col1 = 20 .. 29
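Partitioned data can also live on remote filesystems such as S3. As a sketch (assuming the public ursa-labs-taxi-data bucket, anonymously readable in the us-east-2 region), the files for 2011 can be listed as follows:

from pyarrow import fs

# Browse the bucket anonymously, rooted at the bucket name so that the
# returned paths are relative to it.
s3 = fs.SubTreeFileSystem("ursa-labs-taxi-data",
                          fs.S3FileSystem(region="us-east-2", anonymous=True))
for entry in s3.get_file_info(fs.FileSelector("2011", recursive=True)):
    if entry.type == fs.FileType.File:
        print(entry.path)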
2011/01/data.parquet
2011/02/data.parquet
2011/03/data.parquet
2011/04/data.parquet
2011/05/data.parquet
2011/06/data.parquet
2011/07/data.parquet
2011/08/data.parquet
2011/09/data.parquet
2011/10/data.parquet
2011/11/data.parquet
2011/12/data.parquet
The data in the bucket can be loaded as a single big dataset partitioned by month using:

dataset = ds.dataset("s3://ursa-labs-taxi-data/2011",
                     partitioning=["month"])

for f in dataset.files[:10]:
    print(f)
print("...")
ursa-labs-taxi-data/2011/01/data.parquet
ursa-labs-taxi-data/2011/02/data.parquet
ursa-labs-taxi-data/2011/03/data.parquet
ursa-labs-taxi-data/2011/04/data.parquet
ursa-labs-taxi-data/2011/05/data.parquet
ursa-labs-taxi-data/2011/06/data.parquet
ursa-labs-taxi-data/2011/07/data.parquet
ursa-labs-taxi-data/2011/08/data.parquet
ursa-labs-taxi-data/2011/09/data.parquet
ursa-labs-taxi-data/2011/10/data.parquet
...
Note:
It is also possible to load partitioned data in the Arrow IPC format or in the Feather format.
Warning:
If the above code throws an error, the most likely reason is that your AWS credentials are not set. Follow these instructions to get an AWS Access Key Id and an AWS Secret Access Key: AWS Credentials.
These typically go in the ~/.aws/credentials file:

[default]
aws_access_key_id=<YOUR_AWS_ACCESS_KEY_ID>
aws_secret_access_key=<YOUR_AWS_SECRET_ACCESS_KEY>
The same kind of array can also be written to a Feather file. Starting again from an array with 100 numbers, from 0 to 99:

import numpy as np
import pyarrow as pa

arr = pa.array(np.arange(100))
print(f"{arr[0]} .. {arr[-1]}")
0 .. 99
To write it to a Feather file, we must first create a pyarrow.Table out of it, as Feather stores multiple named columns; this gives us a single-column table that can then be written to a Feather file.
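A minimal sketch of that conversion; as before, the col1 column name is an assumption matching the read example below:

# Wrap the single array into a one-column table.
table = pa.Table.from_arrays([arr], names=["col1"])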
Once we have a table, it can be written to a Feather file using the functions provided by the pyarrow.feather module:
import pyarrow.feather as ft
ft.write_feather(table, 'example.feather')
The file can be read back into a pyarrow.Table with pyarrow.feather.read_table():

import pyarrow.feather as ft

table = ft.read_table("example.feather")
The resulting table will contain the same columns that existed in the Feather file, each as a ChunkedArray:
print(table)
pyarrow.Table
col1: int64
----
col1: [[0,1,2,3,4,...,95,96,97,98,99]]
Given some data in a file where each line is a JSON object containing a row of data:
import tempfile
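# Sketch: create such a file with four JSON lines. The exact values are an
# assumption, chosen to match the output shown below.
with tempfile.NamedTemporaryFile(delete=False, mode="w+", suffix=".json") as f:
    f.write('{"a": 1, "b": 2.0, "c": 1}\n')
    f.write('{"a": 3, "b": 3.0, "c": 2}\n')
    f.write('{"a": 5, "b": 4.0, "c": 3}\n')
    f.write('{"a": 7, "b": 5.0, "c": 4}\n')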
The content of the file can be read back to a pyarrow.Table using pyarrow.json.read_json():
import pyarrow as pa
import pyarrow.json
table = pa.json.read_json(f.name)
print(table.to_pydict())
{'a': [1, 3, 5, 7], 'b': [2.0, 3.0, 4.0, 5.0], 'c': [1, 2, 3, 4]}
Given a table:
table = pa.table([
    pa.array([1, 2, 3, 4, 5])
], names=["numbers"])
Writing compressed Parquet or Feather data is driven by the compression argument to the
pyarrow.feather.write_feather() and pyarrow.parquet.write_table() functions:
pa.feather.write_feather(table, "compressed.feather",
                         compression="lz4")
pa.parquet.write_table(table, "compressed.parquet",
                       compression="lz4")
You can refer to each of those functions’ documentation for a complete list of supported
compression formats.
Note:
Arrow actually uses compression by default when writing Parquet or Feather files.
Feather is compressed using lz4 by default and Parquet uses snappy by default.
For formats that don’t support compression natively, like CSV, it’s possible to save compressed
data using pyarrow.CompressedOutputStream:
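A sketch, writing a gzip-compressed compressed.csv.gz file (the same name the reading example at the end of this page uses):

# Wrap the target file in a compressing stream and write the CSV into it.
with pa.CompressedOutputStream("compressed.csv.gz", "gzip") as out:
    pa.csv.write_csv(table, out)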
This requires decompressing the file when reading it back, which can be done using
pyarrow.CompressedInputStream as explained in the next recipe.
Reading compressed formats that have native support for compression doesn’t require any
special handling. We can for example read back the Parquet and Feather files we wrote in the
previous recipe by simply invoking pyarrow.feather.read_table() and
pyarrow.parquet.read_table():
table_feather = pa.feather.read_table("compressed.feather")
print(table_feather)
pyarrow.Table
numbers: int64
----
numbers: [[1,2,3,4,5]]
table_parquet = pa.parquet.read_table("compressed.parquet")
print(table_parquet)
pyarrow.Table
numbers: int64
----
numbers: [[1,2,3,4,5]]
Reading data from formats that don’t have native support for compression instead involves
decompressing them before decoding them. This can be done using the
pyarrow.CompressedInputStream class which wraps files with a decompress operation before the
result is provided to the actual read function.
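A sketch for the gzip-compressed CSV written earlier:

# Wrap the compressed file in a decompressing stream before parsing it.
with pa.CompressedInputStream(pa.OSFile("compressed.csv.gz"), "gzip") as stream:
    table_csv = pa.csv.read_csv(stream)

print(table_csv)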
pyarrow.Table
numbers: int64
----
numbers: [[1,2,3,4,5]]
Note:
In the case of CSV, Arrow is actually smart enough to try to detect compressed files from the file extension: if your file is named *.gz or *.bz2, the pyarrow.csv.read_csv() function will try to decompress it accordingly.
table_csv2 = pa.csv.read_csv("compressed.csv.gz")
print(table_csv2)
pyarrow.Table
numbers: int64
----
numbers: [[1,2,3,4,5]]