Cloud-Optimized HDF5 Files - Current Status

Aleksandar Jelenak
NASA EED-3 / HDF Group
2024 ESIP Summer Meeting
GOVERNMENT RIGHTS NOTICE
This work was authored by employees of The HDF Group under Contract No. 80GSFC21CA001 with the National Aeronautics and Space
Administration. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United
States Government retains a non-exclusive, paid-up, irrevocable, worldwide license to reproduce, prepare derivative works, distribute copies to
the public, and perform publicly and display publicly, or allow others to do so, for United States Government purposes. All other rights are
reserved by the copyright owner.
©2024 Raytheon Company. All rights reserved.

Glossary
HDF5: Hierarchical Data Format Version 5
netCDF-4: Network Common Data Form version 4
COH5: Cloud optimized HDF5
S3: Simple Storage Service
EOSDIS: NASA Earth Observing System Data and Information System
MB: megabyte (10^6 bytes)
kB: kilobyte (10^3 bytes)
MiB: mebibyte (2^20 bytes)
kiB: kibibyte (2^10 bytes)
LIDAR: laser imaging, detection, and ranging
URI: uniform resource identifier

What are cloud optimized HDF5 files?
● Valid HDF5 files. Not a new file format or convention.
● Larger dataset chunk sizes.
● Internal file metadata consolidated into bigger contiguous blocks.
● The total number of required S3 requests is significantly reduced, which
directly improves performance.
● For detailed information, see my 2023 ESIP Summer talk.
From “HDF at the Speed of Zarr” by Luis Lopez, NASA NSIDC.

Current Advice for COH5 File Settings

Larger dataset chunk sizes
● Term clarification:
○ chunk shape = number of array elements in each chunk dimension
○ chunk size = number of bytes
(number of array elements in a chunk multiplied by byte size of one array
element)
● Chunk size is measured before any filtering (compression, etc.) is applied.
● Not enough testing so far:
○ EOSDIS granules with larger dataset chunks are rare.
○ h5repack tool is not easy to use for large rechunking jobs.
● Larger chunks = fewer of them = less internal file metadata.
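The distinction between chunk shape and chunk size, and the effect of larger chunks on the chunk count, can be sketched with a little arithmetic. The dataset shape, chunk shapes, and 4-byte element size below are made-up illustration values, not taken from any EOSDIS product.

```python
from math import ceil, prod


def chunk_size_bytes(chunk_shape, itemsize):
    """Chunk size in bytes before any filter (e.g. compression) is applied."""
    return prod(chunk_shape) * itemsize


def number_of_chunks(dataset_shape, chunk_shape):
    """Chunks needed to cover the dataset (edge chunks count as whole chunks)."""
    return prod(ceil(d / c) for d, c in zip(dataset_shape, chunk_shape))


# Hypothetical 2-D dataset of 4-byte floats.
dataset_shape = (10_000, 50)
small_chunk = (1_000, 50)
large_chunk = (5_000, 50)

print(chunk_size_bytes(small_chunk, 4), number_of_chunks(dataset_shape, small_chunk))
print(chunk_size_bytes(large_chunk, 4), number_of_chunks(dataset_shape, large_chunk))
```

Making the chunk five times larger cuts the number of chunks, and so the number of chunk index entries, by the same factor.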

Consolidation of internal metadata
● Three different consolidation methods (see the YouTube video on slide #3).
● In practice, only one of them has been tested: files created with the paged
aggregation file space management strategy. (Easier to pronounce: paged files.)
● An HDF5 file is divided into pages. Page size set at file creation.
● Each page holds either internal metadata or data (chunks).
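A minimal sketch of creating a paged file from Python, assuming an h5py build that accepts the `fs_strategy` and `fs_page_size` keyword arguments of `h5py.File`. The helper names and the 8 MiB default are illustrative choices, not part of the talk.

```python
MIB = 1024 * 1024


def paged_file_options(page_size_mib=8):
    """Keyword arguments for h5py.File that request the paged strategy."""
    if page_size_mib <= 0:
        raise ValueError("page size must be positive")
    return {"fs_strategy": "page", "fs_page_size": page_size_mib * MIB}


def create_paged_file(path, page_size_mib=8):
    """Create a new paged HDF5 file (hypothetical helper).

    The page size is fixed at creation time and cannot be changed later
    without repacking the file.
    """
    import h5py  # imported lazily so the options helper works without h5py

    return h5py.File(path, "w", **paged_file_options(page_size_mib))


print(paged_file_options(8))
```

An existing file can be converted with h5repack instead of being rewritten from Python.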

Paged file: pros and cons
● The HDF5 library reads entire pages, which yields its best cloud performance.
● It also has a special cache for these pages, called the page buffer. Its size must
be set prior to opening a file.
● One file page can hold more than one chunk = fewer overall S3 requests.
● Paged files tend to be larger than their non-paged versions because of extra
unused space in each page.
○ Think of a file page as a box filled with different sized objects.

Current Advice: Chunks
● Chunk size needs to account for speed of applying filters (e.g.,
decompression) when chunks are read.
● NASA satellite data is predominantly compressed with the zlib (a.k.a., gzip,
deflate) method.
● Need to explore other compression methods for optimal speed vs.
compression ratio.
● Smaller compressed chunks fill file pages better.
● Suggested chunk sizes: 100k(i)B to 2M(i)B.

Current Advice: Paged files
● Tested file pages of 4, 8, and 16 MiB sizes.
● 8 MiB file page produced slightly better performance, with tolerable (<5%) file
size increase.
● The majority of tested files had their internal metadata in one 8 MiB file page.
● Don’t worry about unused space in that one internal metadata file page.
● The majority of datasets in the tested files were stored in a single file page.
● Consider a minimum of four chunks per file page when choosing a dataset’s
chunk size.
● If writing data to a paged file in more than one open-close session, enable
reuse of the file’s free space when creating it.
○ Otherwise, the file may end up much larger than needed.
○ h5repack can produce a defragmented version of the file.
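The four-chunks-per-page guideline translates into an upper bound on the stored (after compression) chunk size for each tested page size. The sketch below does that arithmetic. It also shows one way to request persistent free-space tracking at creation time through h5py's `fs_persist` keyword argument, which is assumed here to be the setting meant by "re-use of file's free space". The function names are hypothetical.

```python
MIB = 1024 * 1024


def max_stored_chunk_bytes(page_size, min_chunks_per_page=4):
    """Largest stored chunk that still lets a page hold the minimum chunk count."""
    return page_size // min_chunks_per_page


def create_multi_session_paged_file(path, page_size=8 * MIB):
    """Create a paged file that keeps its free-space info across sessions."""
    import h5py  # imported lazily so the arithmetic above works without h5py

    return h5py.File(
        path,
        "w",
        fs_strategy="page",
        fs_page_size=page_size,
        fs_persist=True,  # assumption: this enables reuse of free space
    )


for page_mib in (4, 8, 16):
    limit = max_stored_chunk_bytes(page_mib * MIB)
    print(f"{page_mib} MiB page -> stored chunks of at most {limit // MIB} MiB")
```

For the 8 MiB page this gives 2 MiB, which matches the upper end of the suggested chunk size range if the chunk does not compress at all.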

What happens to chunks in a paged file?

Example: GEDI Level 2A granule
● The Global Ecosystem Dynamics Investigation (GEDI) instrument is on the
International Space Station.
● It is a full-waveform LIDAR system for high-resolution observations of forests’
vertical structure.
● Example granule:
○ 1,518,338,048 bytes
○ 136 contiguous datasets
○ 4,184 chunked datasets compressed with the zlib filter
● Repacked into a paged file version with 8 MiB file page size.
● No chunk was “hurt” (i.e., rechunked) during repacking.
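A quick check of the numbers above: the quoted granule size is an exact multiple of the 8 MiB page size. That is consistent with the figure being the size of the paged version, although the slide does not say which version it refers to.

```python
PAGE_SIZE = 8 * 1024 * 1024    # 8 MiB file page, as used for the repacked file
GRANULE_BYTES = 1_518_338_048  # file size quoted for the example granule

full_pages, remainder = divmod(GRANULE_BYTES, PAGE_SIZE)
print(full_pages, remainder)  # prints: 181 0
```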

Number of stored dataset chunks

Dataset chunk spread across file pages

Extra file pages compared to dataset total size

Dataset cache size for all chunks?

HDF5 Library Improvements for Cloud Data Access

HDF5 library
● Applies to version 1.14.4 only.
● Released in May 2024.
● All other maintenance releases of the library – 1.8.*, 1.10.*, and 1.12.* – are
deprecated now.
● Native method for S3 data access: Read-Only S3 (ROS3) virtual file driver
(VFD).
○ Not always available – build dependent.
○ The conda-forge hdf5 package has it, but h5py from PyPI does not.
● For Python users: fsspec via h5py.
○ fsspec is connected to the library through its virtual file layer API.
○ Lacks communication of important information from the library.
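Since ROS3 availability depends on the build, Python code may need to pick an access path at run time. The following is a rough sketch, assuming h5py's `registered_drivers()` lists `"ros3"` when the driver is built in, an HTTPS object URL that allows anonymous reads, and fsspec with an HTTP backend installed for the fallback. The helper names and the 32 MiB page buffer are hypothetical.

```python
def choose_access_method(registered_drivers):
    """Prefer ROS3 when the HDF5 build provides it, else fall back to fsspec."""
    return "ros3" if "ros3" in registered_drivers else "fsspec"


def open_s3_hdf5(https_url, page_buf_size=32 * 1024 * 1024):
    """Open a publicly readable paged HDF5 object for reading."""
    import h5py  # imported lazily so the helper above works without h5py

    if choose_access_method(h5py.registered_drivers()) == "ros3":
        # Native path: the library talks to S3 through its ROS3 driver.
        return h5py.File(https_url, "r", driver="ros3", page_buf_size=page_buf_size)

    import fsspec  # assumption: fsspec and its HTTP dependencies are installed

    # Fallback path: h5py reads through a Python file-like object.
    file_obj = fsspec.open(https_url, mode="rb").open()
    return h5py.File(file_obj, "r", page_buf_size=page_buf_size)


print(choose_access_method({"sec2", "core"}))
```

With library versions older than 1.14.4, passing a page buffer size when opening a non-paged file is expected to fail, so the `page_buf_size` argument should be dropped for such files.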

Notable improvements
● ROS3 caches the first 16 MiB of the file on open.
● ROS3 support for AWS temporary session tokens.
● Set-and-forget page buffer size. Opening non-paged files will not cause an
error.
● Fixed chunk file location info to account for file’s user block size.
● Fixed an h5repack bug for datasets with variable-length data. Important when
repacking netCDF-4 string variables.
● Next release: Build with zlib-ng. This is a newer open-source implementation
of the standard zlib compression library and is ~2x faster.
● Next release: A new command-line option for page buffer size in h5repack,
h5ls, h5dump, and h5stat. This will enable much improved performance for
cloud optimized files in S3.
● Next release: ROS3 support for relevant AWS environment variables.
● Next release: Support for S3 object URIs (s3://bucket-name/object-name).

This work was supported by NASA/GSFC under
Raytheon Company contract number
80GSFC21CA001
