
THEME ARTICLE: TRANSFORMING SCIENCE THROUGH SOFTWARE: IMPROVING WHILE DELIVERING 100X

Then and Now: Improving Software Portability, Productivity, and 100× Performance
Hartwig Anzt, University of Tennessee, Knoxville, TN, 37996, USA
Axel Huebl and Xiaoye S. Li, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA

The U.S. Exascale Computing Project (ECP) has succeeded in preparing applications to
run efficiently on the first reported exascale supercomputers in the world. To achieve
this, it modernized the whole leadership software stack, from libraries to simulation
codes. In this article, we contrast selected leadership software before and after the ECP.
We discuss how sustainable research software development for leadership computing
can embrace the conversation with the hardware vendors, leadership computing
facilities, software community, and domain scientists who are the application
developers and integrators of software products. We elaborate on how software needs
to take portability as a central design principle and to benefit from interdependent
teams; we also demonstrate how moving to programming languages with high
momentum, like modern C++, can help improve the sustainability, interoperability, and
performance of research software. Finally, we showcase how cross-institutional efforts
can enable algorithm advances that are beyond incremental performance optimization.

High-performance computing (HPC) enables innovation for scientists and engineers, across exploration and discovery science, design and optimization, or validation of theories about the fundamental laws of nature. Supercomputers enable largest-scale data analysis as well as modeling and simulation to study systems that would be impossible to study at the same level of detail in the real world, e.g., due to the size, complexity, physical danger, or cost involved. In 2016, the U.S. Exascale Computing Project (ECP; https://www.exascaleproject.org) started on its mission to accelerate the delivery of a capable exascale computing ecosystem that delivers 50× the application performance of the leading 20-petafloating point operations per second (petaflops) systems. In 2022, the Frontier supercomputer at the Oak Ridge National Laboratory was the first machine benchmarked to compute the LINPACK benchmark at an execution rate of 1 exaflops—thereby fulfilling the ambitious goal of a 5× performance improvement over Summit (https://www.olcf.ornl.gov/summit/). With the sunset of the ECP in December 2023, it is time to look at the software side and compare the capability status before and after the ECP.

For this purpose, we describe several software projects that are rooted in mathematical libraries and the application space, and we investigate their performance improvements and sustainability: the Extreme-Scale Scientific Software Development Kit (xSDK) and its constituent libraries, such as Ginkgo, Software for Linear Algebra Targeting Exascale (SLATE), SuperLU, and the laser–plasma modeling application WarpX (https://warpx.readthedocs.io). We will describe critical facets of how software development methodologies and interdisciplinary teams have been transformed, leading to improvements in the software itself, and why these advances are essential for next-generation science.

© 2024 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/. Digital Object Identifier 10.1109/MCSE.2024.3387302. Date of publication 10 April 2024; date of current version 5 July 2024.


SOFTWARE ENGINEERING: THEN AND NOW

Prior to the ECP, many HPC software stacks used for scientific research in the U.S. Department of Energy (DOE) were developed and grew in response to needs of time-limited domain science projects. There was not much coordination among different software teams and products. For example, multiple libraries could not even be built and linked into a single application, e.g., due to name space issues. The naturally growing software stacks also often did not have a defined software development cycle or quality standards to adhere to. Sometimes, even ad hoc solutions implemented to serve certain requirements became an essential part of a major software ecosystem.

The concept of making software sustainable, productive, and reliable through a defined software development process and a culture of collaborative software engineering became popular (and required) only during the ECP. Among the most successful and impactful measures on the side of mathematical software is xSDK (https://xsdk.info). Its community efforts have implemented a set of standards on software quality and interoperability, deployed a federated continuous integration (CI) infrastructure that allows for rigorous software testing on various hardware architectures, and taken the challenge of defining software packages that contain compatible versions of a plethora of independent but interoperable software libraries. xSDK pioneered a set of key elements that addresses the shortcomings from the past:

• Community policies: There is a set of mandatory policies [including topics of configuring, installing, testing, message passing interface (MPI) usage, portability, contact and version information, open source licensing, name spacing, and repository access] that a software package must satisfy to be considered xSDK compatible. Also presented are recommended policies (including public repository access, error handling, freeing system resources, and library dependencies), which are encouraged but not required.
• Interoperability (see Figure 1): This enables a collection of related and complementary software packages to be able to call each other so that they can be used simultaneously to solve a complex problem.
• Easy installation via the Spack package manager (https://spack.io): The release process uses the Spack pull request process for all xSDK-related changes that go into the Spack package manager.
• Continued systematic testing: All xSDK packages go through Spack build test cycles on various commonly used workstations. The testing is also extended to multiple DOE Leadership Computing Facility machines. The Gitlab CI (pipeline) infrastructure is used to perform daily runs of multiple tests on different systems (https://gitlab.com/xsdk-project/spack-xsdk/-/pipelines).
• Performance autotuner GPTune (https://gptune.lbl.gov): Each library in xSDK has tunable parameters that may greatly affect the code performance on the actual machine. GPTune uses Bayesian optimization based on Gaussian process regression to find the best parameter configurations. It supports advanced features, such as multitask learning, transfer learning, multifidelity/objective tuning, and parameter sensitivity analysis.

Since the inception of the ECP, the number of libraries in the xSDK collection has grown to 26. Figure 1 illustrates the dependencies among some of the libraries. As shown in this hierarchy, some libraries at the lower level provide commonly used building blocks that are needed by the higher level math libraries and applications.

As an example, the Ginkgo software stack developed under the ECP currently employs 45 CI pipelines on CPU and GPU architectures from AMD, Intel, and Nvidia and has 91% unit test coverage. Likewise, the WarpX application performs CI tests with code reviews on the three major operating systems (Linux, macOS, and Windows) and deploys to three major CPU (x86, ARM, and PPC) and three major GPU (Nvidia, AMD, and Intel) architectures. Reusing the mathematical software in xSDK, the AMReX library (https://amrex-codes.github.io/amrex/) became a central dependency of WarpX for its data structures, communication routines, portability, and third-party solvers.

However, it is not the community agreeing on standards, reuse, and the technical realization of rigorous software testing alone that enabled higher productivity and collaboration across institutions. Equally important is the recognition of research software engineering as a profession, establishing career paths and understanding the culture around it (https://us-rse.org, https://society-rse.org). It is an open research software engineering culture across projects and dependencies that drives successful, productive, and resilient software ecosystems, sharing and evolving the practices described in this section.



FIGURE 1. xSDK libraries and interoperability tests included in xsdk-examples 0.4.0. Peach ovals represent newly added libraries and red arrows newly featured interoperability. (Courtesy of Ulrike Yang, Satish Balay, and other xSDK developers.) SLATE: Software for Linear Algebra Targeting Exascale; xSDK: Extreme-Scale Scientific Software Development Kit.

ALGORITHMS: THEN AND NOW

Exascale machines offer unprecedented degrees of parallelism on the order of tens of millions. This is achieved in the combination of thousands of compute nodes and thousands of compute threads on each node. However, most of the existing algorithms were limited to small to medium degrees of parallelism. Throughout the ECP, we made significant efforts to develop algorithms to better utilize this massively parallel computing power. In this section, we will give several examples to illustrate our algorithm innovations.

To use many compute nodes, an algorithm needs to distribute the data and computing tasks to different nodes, and multiple nodes perform local computation and communicate among each other to finish the entire computation. On exascale machines, the local computation speed is very fast, but communicating a word between two compute nodes is orders of magnitude slower than performing one floating point operation. Therefore, we redesigned many algorithms to reduce the amount of communication.
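A simple way to make this tradeoff explicit is the classical latency–bandwidth cost model (a textbook abstraction added here for illustration; it is not a formula from the SuperLU work cited below). Along an algorithm's critical path,

    T \;\approx\; \alpha\, n_{\mathrm{msg}} \;+\; \beta\, n_{\mathrm{words}} \;+\; \gamma\, n_{\mathrm{flops}}, \qquad \alpha \gg \beta \gg \gamma,

where α is the per-message latency, β is the per-word transfer time, and γ is the time per floating point operation. Communication-avoiding algorithms accept some redundant storage and computation (a larger flop term) in exchange for asymptotically fewer messages and words moved.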
One example is avoiding communication in the sparse direct linear solver SuperLU (https://portal.nersc.gov/project/sparse/superlu/). The SuperLU team developed the first communication-avoiding 3-D algorithm framework for a sparse lower-upper (LU) factorization and sparse triangular solution. The algorithm novelty involves a "3-D" process organization and judicious duplication of data between computers to effectively reduce communication by up to several orders of magnitude, depending on the input matrix. The new 3-D code can effectively use 10× more processes than the earlier 2-D algorithm. The sparse LU achieved up to 27× speedup on 24,000 cores of a Cray XC30 [Edison at the National Energy Research Scientific Computing Center (NERSC)]. When combined with the GPU offloading, the new 3-D code achieves up to 24× speedup on 4096 nodes of a Cray XK7 (Titan at the Oak Ridge Leadership Computing Facility) with 32,768 CPU cores and 4096 Nvidia K20x GPUs [1]. The new 3-D sparse triangular solution code outperformed the earlier 2-D code by up to 7.2× when run on 12,000 cores of a Cray XC30 machine. On the Perlmutter GPU machine at NERSC, the new 3-D sparse triangular solution scaled to 256 GPUs, while the earlier 2-D code can only scale up to four GPUs [2].

An example where the design of a new algorithm class enabled scientific advances is batched iterative solvers. Batched methods are designed to process many problems of small dimension in a data-parallel fashion. They became popular as the hardware parallelism exceeded the problem parallelism, and processing the problems in sequence would be inefficient. Situations where many small systems need to be handled in parallel are common in combustion and plasma simulations but also play a central role in machine learning (ML) methods based on deep neural networks.

Prior to the ECP, batched direct solvers had been developed and used in various applications, but no need for batched iterative methods was formulated. It was the interaction with ECP application specialists that identified the potential of batched iterative methods as approximate solvers for linear problems as part of a nonlinear solver. A cross-institutional task force succeeded in designing batched iterative methods that are performance-portable and suitable for a wide range of applications. The batched iterative functionality deployed in PETSc, Ginkgo, and Kokkos-Kernels is now used in hydrodynamics simulations and plasma simulations, among others. Acknowledging the hardware parallelism exceeding the problem concurrency in many scenarios has resulted in a paradigm change in making algorithms more resource efficient.
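As a concrete illustration of the batched idea, the following C++ sketch (illustrative only, not the interface of Ginkgo, PETSc, or Kokkos-Kernels) applies an independent Jacobi iteration to each entry of a batch of small, diagonally dominant systems; the outer loop over the batch is the data-parallel dimension that a GPU implementation would map onto thread blocks.

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Illustrative batched iterative solve: every entry of the batch is an
// independent small, diagonally dominant dense system A_k x_k = b_k,
// solved by Jacobi iterations. The loop over the batch is data parallel.
struct BatchedSystems {
    int batch_size;         // number of independent small systems
    int n;                  // dimension of each system
    std::vector<double> A;  // batch_size * n * n entries, row-major per system
    std::vector<double> b;  // batch_size * n right-hand sides
    std::vector<double> x;  // batch_size * n initial guesses / solutions
};

void batched_jacobi(BatchedSystems& s, int max_iters, double tol) {
    #pragma omp parallel for
    for (int k = 0; k < s.batch_size; ++k) {
        const double* A = &s.A[k * s.n * s.n];
        const double* b = &s.b[k * s.n];
        double* x = &s.x[k * s.n];
        std::vector<double> x_new(s.n);
        for (int it = 0; it < max_iters; ++it) {
            double diff = 0.0;
            for (int i = 0; i < s.n; ++i) {
                double sigma = 0.0;
                for (int j = 0; j < s.n; ++j)
                    if (j != i) sigma += A[i * s.n + j] * x[j];
                x_new[i] = (b[i] - sigma) / A[i * s.n + i];
                diff = std::max(diff, std::abs(x_new[i] - x[i]));
            }
            for (int i = 0; i < s.n; ++i) x[i] = x_new[i];
            if (diff < tol) break;  // systems may converge after different iteration counts
        }
    }
}

int main() {
    const int batch_size = 1000, n = 3;
    BatchedSystems s{batch_size, n,
                     std::vector<double>(batch_size * n * n),
                     std::vector<double>(batch_size * n),
                     std::vector<double>(batch_size * n, 0.0)};
    for (int k = 0; k < batch_size; ++k)
        for (int i = 0; i < n; ++i) {
            s.b[k * n + i] = 1.0 + i;
            for (int j = 0; j < n; ++j)
                s.A[(k * n + i) * n + j] = (i == j) ? 4.0 : 1.0;  // diagonally dominant
        }
    batched_jacobi(s, 100, 1e-12);
    std::printf("first system solution: %.6f %.6f %.6f\n", s.x[0], s.x[1], s.x[2]);
    return 0;
}

The ECP libraries named above expose this pattern through dedicated batched solver interfaces; the sketch only conveys the parallelization structure, not their APIs.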
While the new batched sparse linear algebra development is driven by application needs, the ECP's new multiprecision algorithm effort is driven by hardware features. In recent years, a new hardware trend has been to employ low-precision, special-function units tailoring to the demands of AI workloads. The lower precision floating point arithmetic can be done several times faster than its higher precision counterparts. In 2020, the ECP community created a new multiprecision effort to design and develop new numerical algorithms that can exploit the speed provided by the lower precision hardware while maintaining a sufficient level of accuracy that is required by numerical modeling and simulations. Examples include the following:

• Mixed-precision iterative refinement for a dense LU factorization in SLATE and a sparse LU factorization in SuperLU achieved 1.8× and 1.5× speedups, respectively.
• Mixed-precision generalized minimum residual method (GMRES) with iterative refinement in Trilinos achieved 1.4× speedup.
• Compressed basis (CB) GMRES in Ginkgo achieved 1.4× speedup, and mixed-precision sparse approximate inverse preconditioners achieved an average speedup of 1.2× [3].

These speedups coming from the mixed-precision algorithms are "here to stay," as they will carry over to future hardware architectures.
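The structure these mixed-precision iterative refinement variants share can be summarized in a short sketch. The code below is a simplified, self-contained illustration (not the SLATE, SuperLU, or Trilinos implementation): the expensive factorization and solve run in single precision, while residuals and the solution update are accumulated in double precision to recover full accuracy.

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// In-place LU factorization (no pivoting), row-major n x n, single precision.
// For clarity the example matrix below is diagonally dominant, so pivoting is not needed.
void lu_factor(std::vector<float>& A, int n) {
    for (int k = 0; k < n; ++k)
        for (int i = k + 1; i < n; ++i) {
            A[i * n + k] /= A[k * n + k];
            for (int j = k + 1; j < n; ++j)
                A[i * n + j] -= A[i * n + k] * A[k * n + j];
        }
}

// Solve LU d = r by forward/backward substitution in single precision.
std::vector<float> lu_solve(const std::vector<float>& LU, int n, std::vector<float> r) {
    for (int i = 1; i < n; ++i)
        for (int j = 0; j < i; ++j) r[i] -= LU[i * n + j] * r[j];
    for (int i = n - 1; i >= 0; --i) {
        for (int j = i + 1; j < n; ++j) r[i] -= LU[i * n + j] * r[j];
        r[i] /= LU[i * n + i];
    }
    return r;
}

int main() {
    const int n = 4;
    std::vector<double> A = {10, 1, 2, 0,
                              1, 12, 0, 3,
                              2,  0, 9, 1,
                              0,  3, 1, 11};
    std::vector<double> b = {13, 16, 12, 15};  // exact solution is (1, 1, 1, 1)

    // Factorize a single-precision copy of A once (the expensive, low-precision step).
    std::vector<float> LU(A.begin(), A.end());
    lu_factor(LU, n);

    std::vector<double> x(n, 0.0);
    for (int iter = 0; iter < 10; ++iter) {
        // Residual r = b - A x in double precision.
        std::vector<double> r(n);
        double rnorm = 0.0;
        for (int i = 0; i < n; ++i) {
            r[i] = b[i];
            for (int j = 0; j < n; ++j) r[i] -= A[i * n + j] * x[j];
            rnorm = std::max(rnorm, std::abs(r[i]));
        }
        std::printf("iteration %d: residual norm %.2e\n", iter, rnorm);
        if (rnorm < 1e-12) break;
        // Correction solved in single precision, applied in double precision.
        std::vector<float> rf(r.begin(), r.end());
        std::vector<float> d = lu_solve(LU, n, rf);
        for (int i = 0; i < n; ++i) x[i] += static_cast<double>(d[i]);
    }
    std::printf("solution: %.6f %.6f %.6f %.6f\n", x[0], x[1], x[2], x[3]);
    return 0;
}

The same outer structure underlies the GMRES-based variants: the inner solve runs in reduced precision, while the outer residual and update are kept in the working precision.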
On the domain science level, in applications such as WarpX, algorithmic innovation enabled overcoming traditional limitations in numerical time stepping and improved convergence rates with resolution. Examples include the boosted frame method (using a favorable Lorentz-boosted reference system for simulations); mesh refinement (MR) [4]; advanced pseudo-spectral, GPU-accelerated solvers using GPU-accelerated fast Fourier transforms (FFTs) [5], [6]; and accelerated, C++-friendly linear algebra in ECP SLATE (https://icl.utk.edu/slate/).

PERFORMANCE: THEN AND NOW

Through redesign of the algorithms and software implementations, we saw more than 100× gains in speed. In this section, we use the math library SLATE and application code WarpX to illustrate performance differences.

Dense linear algebra operations are ubiquitously needed in modeling, simulation, and AI/ML applications. They are also core kernel operations in many sparse computations. For more than two decades, the Scalable Linear Algebra PACKage (ScaLAPACK) library has become the industry standard in distributed memory environments. However, as ScaLAPACK can hardly be retrofitted to support hardware accelerators, the ECP invested in designing a modern replacement—SLATE.

SLATE modernizes the algorithms and software in several ways. It uses a 2-D tiled cyclic data distribution, which enables dynamic scheduling and communication overlapping. It incorporates a number of communication-avoiding algorithms. It relies on standard computation kernels (BLAS++ and LAPACK++) and parallel programming (OpenMP and MPI) for portability. The BLAS++ is an abstraction layer to access multiple GPU back ends: cuBLAS, hip/rocBLAS, and oneAPI.
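To make the 2-D tiled cyclic layout concrete, the small C++ sketch below (illustrative only, not SLATE's internal data structure) computes which process in a p × q grid owns a given tile and lists the tiles owned by one process; cycling tiles over the grid in both dimensions balances the load and bounds the number of communication partners per process.

#include <cstdio>

// Owner of tile (i, j) in a 2-D cyclic distribution over a p x q process grid.
// Process coordinates are (i mod p, j mod q); the linear rank is row-major.
int tile_owner(int i, int j, int p, int q) {
    return (i % p) * q + (j % q);
}

int main() {
    const int p = 2, q = 3;    // 2 x 3 process grid
    const int mt = 6, nt = 6;  // 6 x 6 grid of tiles
    const int rank = 4;        // list the tiles owned by this rank
    std::printf("tiles owned by rank %d:", rank);
    for (int i = 0; i < mt; ++i)
        for (int j = 0; j < nt; ++j)
            if (tile_owner(i, j, p, q) == rank)
                std::printf(" (%d,%d)", i, j);
    std::printf("\n");
    return 0;
}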
FIGURE 2. Performance of the GPU routines in SLATE using four nodes of Spock with 16 AMD MI100 GPUs. flop: floating point operation.

The speed advantage of SLATE using GPUs over the CPU-based ScaLAPACK code is tremendous. For example, on the pre-exascale system Summit, using 16 nodes with Nvidia V100 and 672 IBM POWER9 cores, SLATE double-precision matrix-matrix multiplication (DGEMM) (dimension 175,000) outperforms ScaLAPACK DGEMM by 65×. Using the same machine configuration, for Cholesky factorization (dimension 250,000), the SLATE code achieved 19× speedup over the ScaLAPACK code. Moving to the exascale systems with the AMD GPUs, Figure 2 shows the GPU-over-CPU speedups of the three routines in SLATE: Cholesky, LU, and QR factorizations. These were run on Spock, an earlier system prior to the Frontier exascale machine. Each Spock node consists of a 64-core AMD EPYC 7662 CPU and four AMD MI100 GPUs.


The Cholesky factorization achieved 37× speedup, reaching 17% of the machine's peak performance. When the exascale early access system Crusher became available in early 2023, the SLATE team further improved their codes for the newer AMD GPU, i.e., MI250X [7]. Figure 3 shows up to 2.7× speedup when compared to the Summit Nvidia V100 GPUs.

FIGURE 3. SLATE's matrix multiply and Cholesky factorizations scaled and achieved 1.9×–2.7× performance improvement per node on Crusher/Frontier nodes with AMD MI250X GPUs compared to Summit nodes with NVIDIA V100 GPUs.

Performing detailed performance analysis (including roofline modeling of arithmetic intensity) and systematic benchmarking often led to significant rewrites to enable portable performance on new architectures. For instance, the WarpX code already had specially vectorized routines for its core algorithms for Intel Xeon Phi (https://github.com/ECP-WarpX/picsar), following the traditional strategy of "porting" per architecture. In the ECP, reimplemented and continuously tuned core routines received an overall 5× improvement, benchmarked in a "figure-of-merit" test (FOM), defined as the weighted particle and cell updates per second, by rewriting them using AMReX's built-in performance portability layer. As an intended side effect, depending on the well-established communication routines in AMReX also ensured scalability and load balancing when running on the scale of full HPC systems. Over the time of the ECP, another 100× of FOM improvement was achieved by combining hardware advances and tuning of co-designed software [8].

PORTABILITY: THEN AND NOW

Traditionally, high-performance software products were designed for CPU-only supercomputers using MPI for parallelization. It was also common that manually tuned, vectorized code was written for different CPU architectures. Over the last decade, the physical limitations and high power draw of general-purpose CPUs resulted in an increasing number of supercomputers incorporating GPU accelerators. Today, virtually all leadership supercomputers feature accelerators of some form, mostly GPUs from one of the major vendors. Unfortunately, the CPU-centric MPI programming model of the past does not map well to modern GPU-centric supercomputers, and an additional level of parallelism for highly asynchronous execution on modern GPUs is needed. Furthermore, the distinct vendors all prioritize their own programming model over a platform-portable programming language. Finally, GPUs of different generations differ in their compute capabilities and special-function units. This results in a challenge for the software developers to support a wide range of hardware architectures from different vendors and programming models that feature different compute capabilities.

One distinguishes between different levels of portability [10]. The first distinct level is no portability, where the code compiles and runs for only one type of system. Unfortunately, much of the scientific legacy software falls into this category, and adding portability to an existing software stack can be challenging. The next level is partial software portability [9]. An application using such an approach will be dependent on some platform model abstraction. For example, the model could expect any CPU type combined with one or more accelerators, either from AMD or NVIDIA. In such a case, a hybrid programming approach featuring a CPU programming model, like OpenMP, is combined with an accelerator programming model, like HIP, to ensure portability (and possibly good performance) on the machine. As a more advanced case, one might consider full software portability, where the application can execute and run on most platforms, including a reasonable set of hypothetical future machines that might feature field-programmable gate arrays (FPGAs). In this case, a practical example is the SYCL (https://www.khronos.org/sycl/) programming model, which features compiler back ends that support some FPGAs, all mainstream HPC accelerators, and ARM-based hardware.

Finally, and especially important for HPC applications, there is the level of performance portability, which means that not only will the code compile and run on target platforms, but it will also achieve high efficiency by providing performance close to the machine's total capabilities. To achieve performance portability, one needs good software design practices (e.g., code portability) and full command and understanding of the problems inherent in computing unit granularity versus problem granularity. The latter requires using specific programming techniques to fully express an application's parallelism and scheduling to spread


the workload dynamically, depending on the machine hardware's computing units.

FIGURE 4. Overview of the Ginkgo library design using the back-end model for platform portability [9]. High-level algorithms are contained in the library core and composed of algorithm-specific kernels coded for the different hardware back ends.

Different strategies exist to tackle the challenge of platform portability. Among the most popular and successful ones are the concept of a portability layer and the back-end model. The idea behind a portability layer is that the user writes the code once in a high-level language, and the code is then mapped to a source code tailored for a specific architecture and its ecosystem before being executed thanks to an abstraction. Popular examples of portability layers are Kokkos (https://kokkos.github.io), RAJA (https://raja.readthedocs.io), and SYCL (https://www.khronos.org/sycl/). Relying on a portability layer removes the burden of platform portability from the library developers and allows them to focus exclusively on the development of sophisticated algorithms. This convenience comes at the price of a strong dependency on the portability layer, and moving to another programming model or portability layer is usually difficult or even impossible. Furthermore, relying on a portability layer naturally implies that the performance of algorithms and applications is determined by the quality, expressiveness, and hardware-specific optimization of the portability layer.
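To illustrate the single-source style that such layers enable, the short Kokkos example below writes one parallel loop that can be compiled for OpenMP, CUDA, HIP, or SYCL back ends without source changes (a minimal sketch; it is not taken from any of the ECP libraries discussed here).

#include <Kokkos_Core.hpp>
#include <cstdio>

// A single-source AXPY kernel, y = a*x + y: the same code runs on the host or
// on a GPU, depending on the back end Kokkos was configured with at build time.
int main(int argc, char* argv[]) {
    Kokkos::initialize(argc, argv);
    {
        const int n = 1 << 20;
        const double a = 2.0;
        // Views are allocated in the default execution space's memory space.
        Kokkos::View<double*> x("x", n);
        Kokkos::View<double*> y("y", n);
        Kokkos::deep_copy(x, 1.0);
        Kokkos::deep_copy(y, 2.0);

        // The lambda is compiled for the selected back end (OpenMP, CUDA, HIP, SYCL, ...).
        Kokkos::parallel_for("axpy", n, KOKKOS_LAMBDA(const int i) {
            y(i) = a * x(i) + y(i);
        });

        // Reduce on the device to verify the result without copying data back.
        double sum = 0.0;
        Kokkos::parallel_reduce("sum", n, KOKKOS_LAMBDA(const int i, double& local) {
            local += y(i);
        }, sum);
        std::printf("sum = %.1f (expected %.1f)\n", sum, 4.0 * n);
    }
    Kokkos::finalize();
    return 0;
}

The application kernel stays architecture-agnostic; the tradeoff discussed above is that its performance and feature coverage are then bounded by the layer itself.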
As an alternative, the idea behind the back-end model is to embrace portability in the software design. The hardware-specific kernels are written separately for the different types of hardware targeted [9]; see Figure 4 visualizing the back-end model used in the Ginkgo library design (note that Ginkgo uses SYCL, itself a portability layer, as one of its back ends). Several libraries are using this back-end model effectively, like deal.II (https://www.dealii.org) and Ginkgo [9]. To use this model, a library must be designed with modularity and extensibility in mind. Only a library design that enforces the separation of concerns between the parallel algorithm and the different hardware back ends can allow for extensibility in the back-end model. The different back ends need to be managed by a specific interface layer between algorithms and kernels. However, the price of the higher performance potential is high: the library developers have to synchronize several hardware back ends; monitor and react to changes in compilers, tools, and build systems; and adopt new hardware back ends and programming models. The effort of maintaining multiple hardware back ends and keeping them synchronized usually results in a significant workload that can easily exceed the developers' resources [9].
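A minimal C++ sketch of the back-end idea is shown below (illustrative only; it mirrors the separation of concerns in Figure 4 but is not Ginkgo's actual class hierarchy): the algorithm is written once against an abstract kernel interface, and each hardware back end supplies its own implementation behind that interface.

#include <cstdio>
#include <memory>
#include <vector>

// Abstract kernel interface: the "core" algorithms only talk to this class.
struct Backend {
    virtual ~Backend() = default;
    virtual const char* name() const = 0;
    // One representative kernel; a real library would expose many of these.
    virtual double dot(const std::vector<double>& a, const std::vector<double>& b) const = 0;
};

// Reference (sequential CPU) back end.
struct ReferenceBackend : Backend {
    const char* name() const override { return "reference"; }
    double dot(const std::vector<double>& a, const std::vector<double>& b) const override {
        double s = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
        return s;
    }
};

// OpenMP back end: same interface, hardware-specific implementation.
struct OmpBackend : Backend {
    const char* name() const override { return "omp"; }
    double dot(const std::vector<double>& a, const std::vector<double>& b) const override {
        double s = 0.0;
        #pragma omp parallel for reduction(+ : s)
        for (long i = 0; i < static_cast<long>(a.size()); ++i) s += a[i] * b[i];
        return s;
    }
};
// A CUDA/HIP/SYCL back end would be another subclass, compiled in its own file.

// "Core" algorithm: written once, independent of the hardware back end.
double norm_squared(const Backend& backend, const std::vector<double>& v) {
    return backend.dot(v, v);
}

int main() {
    std::vector<double> v(1000, 2.0);
    std::unique_ptr<Backend> backend = std::make_unique<OmpBackend>();
    std::printf("back end %s: ||v||^2 = %.1f\n", backend->name(), norm_squared(*backend, v));
    return 0;
}

The interface layer between algorithms and kernels mentioned above plays the role of this dispatch: adding a new architecture means adding one more back end rather than touching the algorithms.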

66 Computing in Science & Engineering January-March 2024


TRANSFORMING SCIENCE THROUGH SOFTWARE: IMPROVING WHILE DELIVERING 100X

reason for adopting such a hybrid approach is that not application, an approach relying on subsequent accel-
all building blocks are as performance critical or as eration in stages of laser wakefield accelerators was
complex to optimize as others. For those kernels, rely- investigated, an advanced plasma particle acceleration
ing on a performance portability layer allows for reduc- approach that can provide orders of magnitude higher
ing the code maintenance and testing complexities as accelerating fields than currently available particle
well as focusing on the more performance-critical accelerator elements. This and related science drivers
aspects of the library. One example using the hybrid in plasma, accelerator, beam, and fusion physics can
approach is the PETSc library.11 The main data objects be simulated in full fidelity with WarpX. Modeling in
in PETSc are Vector and Matrix. The PETSc design sep- these domains continues to benefit significantly from
arates the front-end programming model used by the scaling and more compute, as it enables, among
application and the back-end implementations. Users others, the following:
can access PETSc’s Vector, Matrix, and the operations
in their preferred programming model, such as Kokkos, Higher grid resolution: Modeling of larger sys-
RAJA, SYCL, HIP, CUDA, or OpenCL. The back end tems and higher plasma densities.
More particles: Improved sampling of kinetic,
heavily relies on the GPU vendors’ libraries or Kokkos-
nonequilibrium particle distributions.
Kernelss to provide higher level solver functions oper-
Includes more microscopic physics: Better investiga-
ating on the Vector and Matrix objects.
tion of collisional, quantum, and high field effects.
With the rewrite and advancement from the prede-
Transition from 2-D to 3-D: Covering the full
cessor Fortran code in Warp to Cþþ in WarpX, WarpX
geometric effects enables the quantitative pre-
also benefited from AMReX’s performance portability
diction of particle energies and the study of
layer, enabling the developers to write most new algo-
particle accelerator stability.
rithms in a lambda-based, single-source implementa-
Can use long-term stable, advanced solvers:
tion that supports all targeted architectures. Notably,
Modeling of longer physical time scales.
the performance portability layer in AMReX itself uses
a back-end model. Besides portable WarpX perfor- For the laser–plasma physics modeled with WarpX,
mance for HPC, this approach also improved midscale 3-D domain decomposition is used for multinode paral-
and entry-level user experience: Warp supported lelism. Besides computation, multiple communication
only CPU architectures—WarpX runs on major CPU calls between neighboring domains are needed for the
architectures and three different GPU vendors.8 This time evolution in every simulation step. With its suc-
enabled WarpX to target all scales of computing that are cessful scalable implementation, WarpX science runs
important for scientific modeling: from laptop to HPC. achieved near-ideal weak scaling over a large variety of
Based on analysis of the products in the DOE’s soft- CPU and GPU hardware, winning the 2022 ACM Gor-
ware portfolio, platform portability has become a cen- don Bell Prize. This included runs on nearly the full
tral design principle, thereby increasing the productivity scale of Frontier and Fugaku, then brand-new TOP1
and sustainability of the individual software stacks and TOP2 in the world.8
significantly as well as hardening the resilience of the WarpX improved in several ways over its predeces-
overall software ecosystem to architectural changes. sor Warp and addressed design challenges. Centrally,
many of its earlier mentioned advanced algorithms
LARGE-SCALE SIMULATION could be implemented and maintained productively,
CODE: THEN AND NOW such as MR, because AMReX provided an excellent
Many scientific problems targeted by application soft- framework to solve domain decomposition, MR book-
ware require excellent weak scaling to the largest avail- keeping, inherent load balancing, and performance por-
able supercomputers. Weak scaling means that, e.g., a tability. Thus, application developers could focus on
1000 larger problem can be solved in the same time implementing a large set of advanced algorithms and
as a 1000 smaller base case if 1000 more theoretical optimize their performance.
flops are also provided in parallel hardware.
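Written as a formula (a standard definition added here for clarity, not one taken from the article's references), weak-scaling efficiency compares the runtime of the base problem of size W on one node with the runtime of a problem of size N·W on N nodes:

    E_{\mathrm{weak}}(N) \;=\; \frac{T(1,\, W)}{T(N,\, N\,W)}

Ideal weak scaling means E_weak(N) stays close to 1 as N grows; the 1000× example above corresponds to E_weak(1000) ≈ 1.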
Designing advanced particle accelerators for high-energy physics electron–positron colliders was the primary science driver for WarpX, the advanced electromagnetic particle-in-cell code in the ECP. For this application, an approach relying on subsequent acceleration in stages of laser wakefield accelerators was investigated, an advanced plasma particle acceleration approach that can provide orders of magnitude higher accelerating fields than currently available particle accelerator elements. This and related science drivers in plasma, accelerator, beam, and fusion physics can be simulated in full fidelity with WarpX. Modeling in these domains continues to benefit significantly from scaling and more compute, as it enables, among others, the following:

• Higher grid resolution: Modeling of larger systems and higher plasma densities.
• More particles: Improved sampling of kinetic, nonequilibrium particle distributions.
• Includes more microscopic physics: Better investigation of collisional, quantum, and high field effects.
• Transition from 2-D to 3-D: Covering the full geometric effects enables the quantitative prediction of particle energies and the study of particle accelerator stability.
• Can use long-term stable, advanced solvers: Modeling of longer physical time scales.

For the laser–plasma physics modeled with WarpX, 3-D domain decomposition is used for multinode parallelism. Besides computation, multiple communication calls between neighboring domains are needed for the time evolution in every simulation step. With its successful scalable implementation, WarpX science runs achieved near-ideal weak scaling over a large variety of CPU and GPU hardware, winning the 2022 ACM Gordon Bell Prize. This included runs on nearly the full scale of Frontier and Fugaku, then brand-new TOP1 and TOP2 in the world [8].

WarpX improved in several ways over its predecessor Warp and addressed design challenges. Centrally, many of its earlier mentioned advanced algorithms could be implemented and maintained productively, such as MR, because AMReX provided an excellent framework to solve domain decomposition, MR bookkeeping, inherent load balancing, and performance portability. Thus, application developers could focus on implementing a large set of advanced algorithms and optimize their performance.


Beyond the Producer–Consumer Relationship of Scientific Software Development

WarpX development embraced the team-of-teams approach lived in the ECP [12] by depending on co-design centers (like AMReX) and software technology partners for libraries instead of "rolling" their own implementations, as in Warp. Relying on many dependencies did not come without risks since the breakage of a single required dependency on one of many target machines would render WarpX unusable—and, thus, reliability, responsiveness, trust, and often also the possibility to quickly ship a patch to a dependency via open source pull requests are essential to sustain such software relationships. These workflows had to be established, e.g., including upstream testing of WarpX in AMReX changes or providing a ready-to-use developer Docker container for the whole ECP Alpine visualization stack for WarpX CI. Contributions across projects maintained by the dependee and the dependent had to be normalized, forming research software engineering teams with computational physicists, applied mathematicians, and computer scientists.

The dramatic shift in software architecture from Warp to WarpX is shown in a simplified software stack in Figure 5. The box "Math" includes FFT libraries (usually on device from vendors and recently also multidevice, such as heFFTe/FFTX) and node-local linear algebra for a quasi-cylindrical geometry option (from SLATE's BLAS++ and LAPACK++). The "IO" stack contains the openPMD metadata abstraction (https://www.openPMD.org) on top of parallel HDF5 and ADIOS2; data compressors; and in situ visualization libraries, such as Conduit/Ascent (https://ascent.readthedocs.io) and SENSEI (https://sensei-insitu.org). In greater detail, the Spack dependency tree of WarpX (https://packages.spack.io/package.html?name=warpx, https://packages.spack.io/package.html?name=py-warpx), enabling all possible features, includes up to 27 direct dependencies and many transitive packages that are used by the AMReX (1–3), IO (26), and Math (3) boxes. Including all transitive packages and Python dependencies, WarpX currently relies on 101 software packages, according to Spack.

FIGURE 5. Warp to WarpX software stack evolution from 2016 to 2023. Modularization enabled rapid development, performance portability to CPUs and three flavors of GPUs, advanced features, efficient code sharing, and collaboration and spawned off specialized sibling codes in laser–plasma physics (HiPACE++), beam dynamics modeling (ImpactX), and microelectronics (Artemis).

Package management is essential to deploy such a software stack, and so is a modern build system used by package managers. Targeting multiplatform desktop to HPC users, WarpX consequently uses CMake as a platform-agnostic build system and deploys primarily via Spack and Conda, while parts of the system also support PyPI/Pip and Brew.

Software development practices include extensive test coverage; automation for CI and from-source online documentation (Sphinx, ReadTheDocs, Doxygen, and Breathe); formal code review for all changes, with the approval by at least one maintainer; and a weekly developer meeting. Figure 5 also shows the strategic split of the application layer based on core algorithms and assumptions into libraries and sibling apps, even on the domain science level, to enable easier use and faster development cycles. Code compatibility beyond library sharing is ensured with application programming interface (https://github.com/picmi-standard) and data standards (openPMD).

In the ECP, WarpX development was able to evolve from earlier short-term projects rooted solely in the domain science application needs, where application R&D had no software sustainability and quality goals on its own, to a strategic multiyear, multidomain HPC effort. This enabled the following:

• Integration: Iterations with vendor and ECP libraries over multiple release cycles and deployment through packages to all target platforms.
• Risk mitigation: Redesigned input–output with scalable methods from ECP libraries (ADIOS2 and HDF5).
• Redesigns: Adjustments were possible, e.g., the following:
  - Adopting the GPU strategy: When CPUs and pragma-based compilers lagged behind, WarpX migrated from Fortran to C++.
  - A Python infrastructure overhaul to GPU-capable methods.
  - Performance optimizations led to the redesign of particle data structures in AMReX.


• Trust and collaboration: Attracted national and international contributors, who trust that their open source contributions will be maintained, and enables leveraging of investment in WarpX spin-off/follow-up projects.

FUTURE DIRECTIONS

WarpX
WarpX is used as a blueprint to implement compatible, specialized beam plasma and particle accelerator modeling codes (https://blast.lbl.gov), enabling advances in the R&D of particle accelerators, light sources, plasma and fusion devices, astrophysical plasmas, microelectronics, and more. The team continues to address opportunities via novel numerical algorithms, increased massive parallelism, and anticipated novel compute hardware and addresses data challenges with increased in situ processing over traditional postprocessing workflows.

For sustainability, open source development practices will continue to evolve, and WarpX intends to adopt a more formal, open governance model with its national and international partners from national laboratories, academia, and industry (https://hpsf.io).

Math Software
Building upon the success of the teams' collaboration through the xSDK community platform, the math software developers will continue to innovate the algorithms and software for future architecture and application needs. In particular, the novel algorithms for GPU acceleration paved the way for algorithm design targeting future, more heterogeneous architectures with a variety of accelerators. The math teams will expand their algorithm portfolio to meet the needs of AI for science. In addition to new developments, as part of post-ECP work on software stewardship, the math teams will ensure that the libraries developed in the ECP will be sustained for many years to come. This will be achieved through close collaboration and a common governance model across collaborating software stewardship organizations.

ACKNOWLEDGMENTS
This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (the Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, in support of the nation's exascale computing imperative. We thank Mark Gates for providing some performance data of SLATE on Summit, a pre-exascale machine. We thank Michael Heroux, Lois McInnes, and Jean-Luc Vay for providing very valuable feedback on the draft manuscripts.

REFERENCES
1. P. Sao, R. Vuduc, and X. Li, "A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems," J. Parallel Distrib. Comput., vol. 131, pp. 218–234, Sep. 2019, doi: 10.1016/j.jpdc.2019.03.004.
2. Y. Liu, N. Ding, P. Sao, S. Williams, and X. S. Li, "Unified communication optimization strategies for sparse triangular solver on CPU and GPU clusters," in Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal. (SC23), Denver, CO, USA, Nov. 13–17, 2023, pp. 1–15, doi: 10.1145/3581784.3607092.
3. A. Abdelfattah et al., "Advances in mixed precision algorithms: 2021 edition," Lawrence Livermore National Lab., Livermore, CA, USA, Tech. Rep. LLNL-TR-825909 1040257, Aug. 2021. [Online]. Available: https://www.osti.gov/biblio/1814677
4. J.-L. Vay, D. P. Grote, R. H. Cohen, and A. Friedman, "Novel methods in the particle-in-cell accelerator code-framework warp," Comput. Sci. Discovery, vol. 5, no. 1, Dec. 2012, Art. no. 014019, doi: 10.1088/1749-4699/5/1/014019.
5. J.-L. Vay et al., "Modeling of a chain of three plasma accelerator stages with the WarpX electromagnetic PIC code on GPUs," Phys. Plasmas, vol. 28, no. 2, Feb. 2021, Art. no. 023105, doi: 10.1063/5.0028512.
6. E. Zoni et al., "A hybrid nodal-staggered pseudo-spectral electromagnetic particle-in-cell method with finite-order centering," Comput. Phys. Commun., vol. 279, Oct. 2022, Art. no. 108457, doi: 10.1016/j.cpc.2022.108457.
7. A. YarKhan, M. A. Farhan, D. Sukkari, M. Gates, and J. Dongarra, "SLATE performance report: Updates to Cholesky and LU factorizations," ICL, Univ. of Tennessee, Knoxville, TN, USA, Tech. Rep. ICL-UT-20-14, 2020. [Online]. Available: https://icl.utk.edu/files/publications/2020/icl-utk-1418-2020.pdf
8. L. Fedeli et al., "Pushing the Frontier in the design of laser-based electron accelerators with groundbreaking mesh-refined particle-in-cell simulations on exascale-class supercomputers," in Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal., 2022, pp. 1–12, doi: 10.1109/SC41404.2022.00008.


9. T. Cojean, Y.-H. M. Tsai, and H. Anzt, "Ginkgo—A math library designed for platform portability," Parallel Comput., vol. 111, Jul. 2022, Art. no. 102902, doi: 10.1016/j.parco.2022.102902.
10. A. Dubey et al., "Performance portability in the exascale computing project: Exploration through a panel series," Comput. Sci. Eng., vol. 23, no. 5, pp. 46–54, 2021, doi: 10.1109/MCSE.2021.3098231.
11. R. T. Mills et al., "Toward performance-portable PETSc for GPU-based exascale systems," Parallel Comput., vol. 108, Dec. 2021, Art. no. 102831, doi: 10.1016/j.parco.2021.102831.
12. E. M. Raybourn, J. D. Moulton, and A. Hungerford, "Scaling productivity and innovation on the path to exascale with a 'team of teams' approach," in HCI in Business, Government and Organizations. Information Systems and Analytics, F. F.-H. Nah and K. Siau, Eds., Cham, Switzerland: Springer International Publishing, 2019, pp. 408–421.

HARTWIG ANZT is the chair of Computational Mathematics at the Technical University of Munich, 80333 Munich, Germany, and a research associate professor at the Innovative Computing Laboratory at the University of Tennessee, Knoxville, TN, 37996, USA. His research interests include mixed precision numerical linear algebra, sustainable software, and energy-efficient computing. Anzt received his Ph.D. degree in applied mathematics from the Karlsruhe Institute of Technology. He is a Member of IEEE, GAMM, and SIAM. Contact him at [email protected].

AXEL HUEBL is a computational physicist at Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA, 94720, USA. His research interests include the interface of high-performance computing, laser–plasma physics, and advanced particle accelerator research. Huebl received his Ph.D. degree in physics from Technical University Dresden, Germany. He is a Member of IEEE, APS, and ACM. Contact him at [email protected].

XIAOYE S. LI is a senior scientist at LBNL, Berkeley, CA, 94720, USA. Her research interests include high-performance computing, numerical linear algebra, Bayesian optimization, and scientific machine learning. Li received her Ph.D. degree in computer science from the University of California at Berkeley. She is a fellow of the Society for Industrial and Applied Mathematics and a senior member of the Association for Computing Machinery. Contact her at [email protected].
Knoxville, TN, 37996, USA. His research interests include tion for Computing Machinery. Contact her at [email protected].
