This document discusses how to optimize HDF5 files for efficient access in cloud object stores. Key optimizations include using large dataset chunk sizes of 1-4 MiB, consolidating internal file metadata, and minimizing variable-length datatypes. The document recommends creating files with paged aggregation and storing file content information in the user block to enable fast discovery of file contents when stored in object stores.
2. Glossary
HDF5: Hierarchical Data Format Version 5
netCDF-4: Network Common Data Form version 4
COH5: Cloud optimized HDF5
S3: Simple Storage Service
EOSDIS: NASA Earth Observing System Data and Information System
MB: megabyte (10^6 bytes)
kB: kilobyte (10^3 bytes)
MiB: mebibyte (2^20 bytes)
kiB: kibibyte (2^10 bytes)
LIDAR: laser imaging, detection, and ranging
URI: uniform resource identifier
3. What are cloud optimized HDF5 files?
● Valid HDF5 files. Not a new file format or convention.
● Larger dataset chunk sizes.
● Internal file metadata consolidated into bigger contiguous blocks.
● The total number of required S3 requests is significantly reduced, which directly
improves performance.
● For detailed information, see my 2023 ESIP Summer talk.
From “HDF at the Speed of Zarr” by Luis Lopez, NASA NSIDC.
5. Larger dataset chunk sizes
● Term clarification:
○ chunk shape = number of array elements in each chunk dimension
○ chunk size = number of bytes (the number of array elements in a chunk
multiplied by the byte size of one array element); see the sketch after this list
● Chunk size is measured before any filters (compression, etc.) are applied.
● Not enough testing so far:
○ EOSDIS granules with larger dataset chunks are rare.
○ h5repack tool is not easy to use for large rechunking jobs.
● Larger chunks = fewer of them = less internal file metadata.
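A minimal sketch of the chunk shape vs. chunk size distinction, using h5py and NumPy; the file name, dataset name, array shape, and chunk shape below are illustrative, not taken from any tested granule.

```python
import numpy as np
import h5py

# Create a dataset with an explicit chunk shape and report its chunk size.
with h5py.File("example.h5", "w") as f:
    dset = f.create_dataset(
        "temperature",
        shape=(10000, 3600),
        dtype="float32",
        chunks=(250, 1800),  # chunk shape: array elements per dimension
    )
    # Chunk size: elements per chunk times bytes per element,
    # measured before any compression filter is applied.
    chunk_bytes = int(np.prod(dset.chunks)) * dset.dtype.itemsize
    print(f"chunk shape {dset.chunks} -> chunk size {chunk_bytes / 2**20:.2f} MiB")
```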
6. Consolidation of internal metadata
● Three different consolidation methods (see the YouTube video on slide #3).
● In practice, only one of them has been tested: files created with the paged
aggregation file space management strategy (easier to say: paged files); see the
creation sketch after this list.
● An HDF5 file is divided into pages. Page size set at file creation.
● Each page holds either internal metadata or data (chunks).
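A minimal sketch of creating a paged file with h5py, assuming h5py 3.x built against HDF5 1.10.1 or newer; the fs_* keywords are h5py's mapping of the HDF5 file space management properties, and the names and sizes are illustrative.

```python
import h5py

MiB = 2 ** 20

# Create a file with the paged aggregation file space management strategy.
# The strategy and page size are fixed at creation time; existing files can be
# converted with h5repack (see slide 9).
with h5py.File(
    "paged_example.h5",
    "w",
    fs_strategy="page",    # paged aggregation ("paged file")
    fs_persist=True,       # keep free-space tracking so space can be reused
                           # across open-close sessions (see slide 9)
    fs_page_size=8 * MiB,  # file page size
) as f:
    f.create_dataset("data", shape=(4000, 4000), dtype="float64",
                     chunks=(500, 250), compression="gzip")
```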
7. Paged file: pros and cons
● The HDF5 library reads entire pages, which yields its best cloud performance.
● It also has a special cache for these pages, called the page buffer. Its size must
be set before opening a file (see the sketch after this list).
● One file page can hold more than one chunk = fewer overall S3 requests.
● Paged files tend to be larger than their non-paged versions because of unused
space in each page.
○ Think of a file page as a box filled with different-sized objects.
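A minimal sketch of setting the page buffer when opening a file, assuming h5py 3.3 or newer, which exposes the page buffer size as a file access keyword; the file name is illustrative.

```python
import h5py

MiB = 2 ** 20

# Open the file with a 16 MiB page buffer; whole pages are read and cached,
# so repeated reads from the same page avoid extra S3 requests.
# With HDF5 1.14.4+ this also works on non-paged files ("set-and-forget",
# see slide 19); older library versions raise an error in that case.
with h5py.File("paged_example.h5", "r", page_buf_size=16 * MiB) as f:
    subset = f["data"][:100, :100]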
8. Current Advice: Chunks
● Chunk size needs to account for the speed of applying filters (e.g.,
decompression) when chunks are read.
● NASA satellite data are predominantly compressed with the zlib (a.k.a. gzip,
deflate) method.
● Need to explore other compression methods for optimal speed vs.
compression ratio.
● Smaller compressed chunks fill file pages better (a sketch for inspecting
compressed chunk sizes follows this list).
● Suggested chunk sizes: 100k(i)B to 2M(i)B.
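A minimal sketch for checking compressed (on-disk) chunk sizes with h5py's low-level chunk query API, assuming h5py 3.x and HDF5 1.10.5 or newer; the file and dataset names are illustrative.

```python
import h5py

# Report the on-disk (compressed) chunk sizes of a dataset, which is what
# determines how well chunks pack into file pages.
with h5py.File("paged_example.h5", "r") as f:
    dset = f["data"]
    num_chunks = dset.id.get_num_chunks()
    sizes = [dset.id.get_chunk_info(i).size for i in range(num_chunks)]
    if sizes:
        print(f"{num_chunks} chunks; compressed sizes "
              f"{min(sizes) / 1024:.0f}-{max(sizes) / 1024:.0f} KiB, "
              f"mean {sum(sizes) / len(sizes) / 1024:.0f} KiB")
```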
9. Current Advice: Paged files
● Tested file pages of 4, 8, and 16 MiB sizes.
● An 8 MiB file page produced slightly better performance, with a tolerable (<5%)
file size increase.
● The majority of tested files had their internal metadata in one 8 MiB file page.
● Don’t worry about unused space in that one internal metadata file page.
● Majority of datasets in the tested files were stored in a single file page.
● Consider a minimum of four chunks per file page when choosing a dataset’s
chunk size.
● If writing data to a paged file in more than one open-close session, enable
re-use of the file's free space when creating it.
○ Otherwise, the file may end up much larger than needed.
○ h5repack can produce a defragmented version of the file (see the sketch after
this list).
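A minimal sketch of converting an existing granule into a paged file with h5repack, called here through Python's subprocess module; the input and output file names are illustrative, and flag spellings should be checked against `h5repack --help` for your HDF5 version.

```python
import subprocess

# Repack an existing file into a paged-aggregation file with an 8 MiB page size:
#   -S PAGE      file space management strategy: paged aggregation
#   -G 8388608   file space page size in bytes (8 MiB)
# Repacking also defragments the file, recovering unused free space.
subprocess.run(
    ["h5repack", "-S", "PAGE", "-G", str(8 * 2 ** 20),
     "granule.h5", "granule_paged.h5"],
    check=True,
)
```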
11. Example: GEDI Level 2A granule
● Global Ecosystem Dynamics Investigation (GEDI) instrument is on the
International Space Station.
● A full-waveform LIDAR system for high-resolution observations of forests’
vertical structure.
● Example granule:
○ 1,518,338,048 bytes
○ 136 contiguous datasets
○ 4,184 chunked datasets compressed with the zlib filter
● Repacked into a paged file version with an 8 MiB file page size.
● No chunk was “hurt” (i.e., rechunked) during repacking.
18. HDF5 library
● Applies to version 1.14.4 only.
● Released in May 2024.
● All other maintenance releases of the library – 1.8.*, 1.10.*, and 1.12.* – are
deprecated now.
● Native method for S3 data access: Read-Only S3 (ROS3) virtual file driver
(VFD).
○ Not always available – build dependent.
○ The conda-forge hdf5 package has it, but the h5py wheels from PyPI do not.
● For Python users: fsspec via h5py (see the sketch after this list).
○ fsspec is connected to the library through its virtual file layer API.
○ This path lacks communication of important information from the library.
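A minimal sketch of the two Python access routes described above; the bucket and object names are hypothetical, the ROS3 route requires an HDF5/h5py build with the driver enabled (e.g. conda-forge), and the fsspec route requires the s3fs package.

```python
import h5py
import s3fs

MiB = 2 ** 20
URL = "https://example-bucket.s3.us-west-2.amazonaws.com/granule_paged.h5"  # hypothetical

# Route 1: native ROS3 virtual file driver (anonymous access shown).
# page_buf_size assumes HDF5 1.14.4+ ("set-and-forget", slide 19).
with h5py.File(URL, "r", driver="ros3", page_buf_size=16 * MiB) as f:
    print(list(f))

# Route 2: an fsspec/s3fs file-like object handed to h5py; works with the PyPI
# h5py wheels, which are built without ROS3.
fs = s3fs.S3FileSystem(anon=True)
with fs.open("s3://example-bucket/granule_paged.h5", "rb") as s3file:
    with h5py.File(s3file, "r") as f:
        print(list(f))
```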
19. Notable improvements
● ROS3 caches the first 16 MiB of the file on open.
● ROS3 support for AWS temporary session token.
● Set-and-forget page buffer size. Opening non-paged files will not cause an
error.
● Fixed chunk file location info to account for the file's user block size.
● Fixed an h5repack bug for datasets with variable-length data. Important when
repacking netCDF-4 string variables.
● Next release: build with zlib-ng, a newer open-source implementation of the
standard zlib compression library that is ~2x faster.
● Next release: a new command-line option for page buffer size in h5repack,
h5ls, h5dump, and h5stat. This will enable much better performance for
cloud-optimized files in S3.
● Next release: ROS3 support for relevant AWS environment variables.
● Next release: Support for S3 object URIs (s3://bucket-name/object-name).
20. This work was supported by NASA/GSFC under
Raytheon Company contract number
80GSFC21CA001