Aleksandar Jelenak
NASA EED-3 / HDF Group
2024 ESIP Summer Meeting
Cloud Optimized HDF5 Files:
Current Status
GOVERNMENT RIGHTS NOTICE
This work was authored by employees of The HDF Group under Contract No. 80GSFC21CA001 with the National Aeronautics and Space
Administration. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United
States Government retains a non-exclusive, paid-up, irrevocable, worldwide license to reproduce, prepare derivative works, distribute copies to
the public, and perform publicly and display publicly, or allow others to do so, for United States Government purposes. All other rights are
reserved by the copyright owner.
©2024 Raytheon Company. All rights reserved.
Glossary
HDF5: Hierarchical Data Format Version 5
netCDF-4: Network Common Data Form version 4
COH5: Cloud optimized HDF5
S3: Simple Storage Service
EOSDIS: NASA Earth Observing System Data and Information System
MB: megabyte (10^6 bytes)
kB: kilobyte (10^3 bytes)
MiB: mebibyte (2^20 bytes)
kiB: kibibyte (2^10 bytes)
LIDAR: laser imaging, detection, and ranging
URI: uniform resource identifier
What are cloud optimized HDF5 files?
● Valid HDF5 files. Not a new file format or convention.
● Larger dataset chunk sizes.
● Internal file metadata consolidated into bigger contiguous blocks.
● Total number of required S3 requests is significantly reduced, which directly
improves performance.
● For detailed information, see my 2023 ESIP Summer talk.
From “HDF at the Speed of Zarr” by Luis Lopez, NASA NSIDC.
Current Advice for COH5
File Settings
Larger dataset chunk sizes
● Term clarification:
○ chunk shape = number of array elements in each chunk dimension
○ chunk size = number of bytes
(number of array elements in a chunk multiplied by byte size of one array
element)
● Chunk size is measured before any filters (compression, etc.) are applied.
● Not enough testing so far:
○ EOSDIS granules with larger dataset chunks are rare.
○ h5repack tool is not easy to use for large rechunking jobs.
● Larger chunks = fewer of them = less internal file metadata.
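The shape-versus-size distinction above can be illustrated with a small calculation. This is a minimal sketch; the function name and the example dataset (float32 elements, 512 x 512 chunk shape) are hypothetical and not taken from the talk.

```python
from math import prod


def chunk_size_bytes(chunk_shape, itemsize):
    """Uncompressed chunk size: number of elements times bytes per element."""
    return prod(chunk_shape) * itemsize


# Hypothetical dataset: float32 elements (4 bytes each), 512 x 512 chunk shape.
size = chunk_size_bytes((512, 512), 4)
print(size, "bytes =", size / 2**20, "MiB")
```

The result is the size before any filter is applied; the stored (compressed) chunk is usually smaller.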
Consolidation of internal metadata
● Three different consolidation methods (see the YouTube video on slide #3).
● Practically only one of them tested: files created with paged aggregation file
space management strategy. (Easier to pronounce: paged files.)
● An HDF5 file is divided into pages. Page size set at file creation.
● Each page holds either internal metadata or data (chunks).
Paged file: pros and cons
● HDF5 library reads entire pages, which yields its best cloud performance.
● It also has a special cache for these pages, called page buffer. Its size must
be set prior to opening a file.
● One file page can have more than one chunk = fewer S3 requests overall.
● Paged files tend to be larger than their non-paged versions because of
extra unused space in each page.
○ Think of a file page as a box filled with different sized objects.
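For Python users, setting the page buffer before opening a file might look like the sketch below. It assumes h5py's `page_buf_size` keyword of `h5py.File`; the helper names, the default of eight pages, and the rounding to a power of two (which h5py's documentation asks for, together with a buffer at least as large as the file page) are illustrative choices, not advice from the talk.

```python
MIB = 1024 * 1024


def page_buffer_size(page_size, pages=8):
    """Buffer size in bytes for `pages` file pages, rounded up to a power of two."""
    wanted = page_size * pages
    return 1 << (wanted - 1).bit_length()


def open_paged(path, page_size=8 * MIB, pages=8):
    """Open a paged HDF5 file read-only with the page buffer enabled."""
    import h5py  # imported lazily so the helper above works without h5py

    return h5py.File(path, "r", page_buf_size=page_buffer_size(page_size, pages))


print(page_buffer_size(8 * MIB) // MIB, "MiB page buffer")
```

A larger buffer keeps more pages in memory at the cost of RAM, so the number of pages is worth tuning per workload.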
Current Advice: Chunks
● Chunk size needs to account for speed of applying filters (e.g.,
decompression) when chunks are read.
● NASA satellite data predominantly compressed with the zlib (a.k.a., gzip,
deflate) method.
● Need to explore other compression methods for optimal speed vs.
compression ratio.
● Smaller compressed chunks fill file pages better.
● Suggested chunk sizes: 100k(i)B to 2M(i)B.
Current Advice: Paged files
● Tested file pages of 4, 8, and 16 MiB sizes.
● 8 MiB file page produced slightly better performance, with tolerable (<5%) file
size increase.
● Majority of tested files had their internal metadata in one 8 MiB file page.
● Don’t worry about unused space in that one internal metadata file page.
● Majority of datasets in the tested files were stored in a single file page.
● Consider a minimum of four chunks per file page when choosing a dataset’s
chunk size.
● If writing data to a paged file in more than one open-close session, enable
re-use of file’s free space when creating it.
○ Otherwise, the file may end up much larger than needed.
○ h5repack can produce a defragmented version of the file.
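The advice above can be sketched in Python. The example assumes h5py's `fs_strategy`, `fs_page_size`, and `fs_persist` keywords of `h5py.File`; the function names are hypothetical, and the chunks-per-page estimate ignores any per-page overhead, so treat it as a rough check rather than an exact layout prediction.

```python
MIB = 1024 * 1024


def chunks_per_page(page_size, stored_chunk_size):
    """Whole stored (filtered) chunks of the given size that fit in one page."""
    if stored_chunk_size <= 0:
        raise ValueError("stored_chunk_size must be positive")
    return page_size // stored_chunk_size


def create_paged_file(path, page_size=8 * MIB):
    """Create a paged HDF5 file that keeps track of free space across sessions."""
    import h5py  # imported lazily so the estimate above works without h5py

    return h5py.File(
        path,
        "w",
        fs_strategy="page",      # paged aggregation file space strategy
        fs_page_size=page_size,  # fixed when the file is created
        fs_persist=True,         # allow re-use of free space in later sessions
    )


# Rough check of the "at least four chunks per page" advice for an 8 MiB page.
for stored in (512 * 1024, 2 * MIB, 3 * MIB):
    n = chunks_per_page(8 * MIB, stored)
    print(f"{stored / MIB:.1f} MiB chunks: {n} per page, four or more: {n >= 4}")
```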
What happens to chunks
in a paged file?
Example: GEDI Level 2A granule
● Global Ecosystem Dynamics Investigation (GEDI) instrument is on the
International Space Station.
● A full-waveform LIDAR system for high-resolution observations of forests’
vertical structure.
● Example granule:
○ 1,518,338,048 bytes
○ 136 contiguous datasets
○ 4,184 chunked datasets compressed with the zlib filter
● Repacked into a paged file version with 8 MiB file page size.
● No chunk was “hurt” (i.e., rechunked) during repacking.
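A repacking step like this one could be scripted as shown below. The talk does not give the exact command used; this sketch assumes the h5repack options `-S` (file space strategy) and `-G` (file space page size), and the file names are placeholders. No layout option is passed, so existing chunking is left as it is.

```python
import subprocess


def h5repack_paged_cmd(src, dst, page_size=8 * 1024 * 1024):
    """Build an h5repack command that rewrites src as a paged file."""
    return ["h5repack", "-S", "PAGE", "-G", str(page_size), src, dst]


def repack_paged(src, dst, page_size=8 * 1024 * 1024):
    """Run h5repack; requires the HDF5 command-line tools on PATH."""
    subprocess.run(h5repack_paged_cmd(src, dst, page_size), check=True)


# Placeholder file names, for illustration only.
cmd = h5repack_paged_cmd("granule.h5", "granule_paged.h5")
print(" ".join(cmd))
```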
Chunk sizes
Number of stored dataset chunks
Dataset chunk spread across file pages
Extra file pages compared to dataset total size
Dataset cache size for all chunks?
HDF5 Library Improvements
for Cloud Data Access
HDF5 library
● Applies to version 1.14.4 only.
● Released in May 2024.
● All other maintenance releases of the library – 1.8.*, 1.10.*, and 1.12.* – are
deprecated now.
● Native method for S3 data access: Read-Only S3 (ROS3) virtual file driver
(VFD).
○ Not always available – build dependent.
○ The conda-forge hdf5 package includes it, but h5py installed from PyPI does not.
● For Python users: fsspec via h5py.
○ fsspec is connected to the library through its virtual file layer API.
○ Lacks communication of important information from the library.
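The two access paths might be used from Python roughly as follows. This is a sketch under assumptions: it relies on `h5py.registered_drivers()` to detect ROS3, on `h5py.File(..., driver="ros3")` taking an HTTPS object URL, and on `fsspec.open()` with an S3 backend such as s3fs installed. The function names are hypothetical, and credential handling is left out.

```python
def pick_access_method(registered_drivers):
    """Prefer the native ROS3 driver when this HDF5 build provides it."""
    return "ros3" if "ros3" in registered_drivers else "fsspec"


def open_with_ros3(https_url):
    """Open an S3 object through the ROS3 virtual file driver."""
    import h5py

    if "ros3" not in h5py.registered_drivers():
        raise RuntimeError("this HDF5 build does not include the ROS3 driver")
    return h5py.File(https_url, "r", driver="ros3")


def open_with_fsspec(s3_uri, **storage_options):
    """Open an S3 object by handing h5py a file-like object from fsspec."""
    import fsspec
    import h5py

    fileobj = fsspec.open(s3_uri, mode="rb", **storage_options).open()
    return h5py.File(fileobj, "r")


print(pick_access_method({"sec2", "core"}))
```

With fsspec, h5py sees only a Python file-like object, which is one way to picture the limited information flow mentioned in the last bullet.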
Notable improvements
● ROS3 caches the first 16 MiB of the file on open.
● ROS3 support for AWS temporary session token.
● Set-and-forget page buffer size. Opening non-paged files will not cause an
error.
● Fixed chunk file location info to account for file’s user block size.
● Fixed an h5repack bug for datasets with variable-length data. Important when
repacking netCDF-4 string variables.
● Next release: Build with zlib-ng. This is a newer open-source implementation
of the standard zlib compression library and is ~2x faster.
● Next release: h5repack, h5ls, h5dump, and h5stat new command-line
option for page buffer size. This will enable much improved performance for
cloud optimized files in S3.
● Next release: ROS3 support for relevant AWS environment variables.
● Next release: Support for S3 object URIs (s3://bucket-name/object-name).
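Where a library version does not accept S3 object URIs, one possible workaround is to translate them into virtual-hosted-style HTTPS URLs first. The sketch below assumes that URL style and a caller-supplied AWS region; the function name is hypothetical, and object names with special characters would need URL encoding that is not handled here.

```python
from urllib.parse import urlparse


def s3_uri_to_https(uri, region):
    """Convert s3://bucket-name/object-name to a virtual-hosted-style HTTPS URL."""
    parts = urlparse(uri)
    if parts.scheme != "s3" or not parts.netloc or not parts.path.strip("/"):
        raise ValueError(f"not an S3 object URI: {uri!r}")
    return f"https://{parts.netloc}.s3.{region}.amazonaws.com{parts.path}"


# The region is an assumption made for this example.
print(s3_uri_to_https("s3://bucket-name/object-name", "us-west-2"))
```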
This work was supported by NASA/GSFC under
Raytheon Company contract number
80GSFC21CA001