Then and Now: Improving Software Portability, Productivity, and 100× Performance
The U.S. Exascale Computing Project (ECP) has succeeded in preparing applications to
run efficiently on the first reported exascale supercomputers in the world. To achieve
this, it modernized the whole leadership software stack, from libraries to simulation
codes. In this article, we contrast selected leadership software before and after the ECP.
We discuss how sustainable research software development for leadership computing
can embrace the conversation with the hardware vendors, leadership computing
facilities, software community, and domain scientists who are the application
developers and integrators of software products. We elaborate on how software needs
to take portability as a central design principle and to benefit from interdependent
teams; we also demonstrate how moving to programming languages with high
momentum, like modern C++, can help improve the sustainability, interoperability, and
performance of research software. Finally, we showcase how cross-institutional efforts
can enable algorithm advances that are beyond incremental performance optimization.
High-performance computing (HPC) enables innovation for scientists and engineers, across exploration and discovery science, design and optimization, or validation of theories about the fundamental laws of nature. Supercomputers enable largest-scale data analysis as well as modeling and simulation to study systems that would be impossible to study at the same level of detail in the real world, e.g., due to the size, complexity, physical danger, or cost involved.

In 2016, the U.S. Exascale Computing Project (ECP)a started on its mission to accelerate the delivery of a capable exascale computing ecosystem that delivers 50× the application performance of the leading 20-petafloating point operations per second (petaflops) systems. In 2022, the Frontier supercomputer at Oak Ridge National Laboratory was the first machine benchmarked to compute the LINPACK benchmark at an execution rate of 1 exaflops, thereby fulfilling the ambitious goal of a 5× performance improvement over Summit.b With the sunset of the ECP in December 2023, it is time to look at the software side and compare the capability status before and after the ECP.

For this purpose, we describe several software projects that are rooted in mathematical libraries and the application space, and we investigate their performance improvements and sustainability: the Extreme-Scale Scientific Software Development Kit (xSDK) and its constituent libraries, such as Ginkgo, Software for Linear Algebra Targeting Exascale (SLATE), SuperLU, and the laser–plasma modeling application WarpX.c We will describe critical facets of how software development methodologies and interdisciplinary teams have been transformed, leading to improvements in the software itself, and why these advances are essential for next-generation science.

a https://www.exascaleproject.org
b https://www.olcf.ornl.gov/summit/
c https://warpx.readthedocs.io

© 2024 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Digital Object Identifier 10.1109/MCSE.2024.3387302
Date of publication 10 April 2024; date of current version 5 July 2024.
SOFTWARE ENGINEERING: THEN AND NOW

Prior to the ECP, many HPC software stacks used for scientific research in the U.S. Department of Energy (DOE) were developed and grew in response to the needs of time-limited domain science projects. There was not much coordination among different software teams and products. For example, multiple libraries could not even be built and linked into a single application, e.g., due to name space issues. The naturally growing software stacks also often did not have a defined software development cycle or quality standards to adhere to. Sometimes, even ad hoc solutions implemented to serve certain requirements became an essential part of a major software ecosystem.
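As a minimal illustration of the kind of name space clash that prevented libraries from being linked into one application (the library and symbol names below are hypothetical), consider two solver libraries that both export an unqualified solve symbol; wrapping each interface in its own C++ namespace, as later required by community policies, resolves the conflict:

// Hypothetical example: two independently developed solver libraries.
// If both export a global symbol named "solve", linking one application
// against both produces duplicate-symbol errors.
//
// lib_a.hpp -- before: void solve(std::vector<double>&);  // clashes with lib_b
// lib_b.hpp -- before: void solve(std::vector<double>&);  // clashes with lib_a
//
// After adopting a name spacing policy, each library wraps its interface
// in a unique namespace, and both can coexist in the same executable.
#include <vector>

namespace lib_a {
void solve(std::vector<double>& x) { x.assign(x.size(), 0.0); /* placeholder for A's algorithm */ }
}  // namespace lib_a

namespace lib_b {
void solve(std::vector<double>& x) { x.assign(x.size(), 1.0); /* placeholder for B's algorithm */ }
}  // namespace lib_b

int main() {
  std::vector<double> x(100, 2.0);
  lib_a::solve(x);  // no symbol clash: the names are qualified
  lib_b::solve(x);
  return 0;
}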
The concept of making software sustainable, productive, and reliable through a defined software development process and a culture of collaborative software engineering became popular (and required) only during the ECP. Among the most successful and impactful measures on the side of mathematical software is xSDK.d Its community efforts have implemented a set of standards on software quality and interoperability, deployed a federated continuous integration (CI) infrastructure that allows for rigorous software testing on various hardware architectures, and taken on the challenge of defining software packages that contain compatible versions of a plethora of independent but interoperable software libraries. xSDK pioneered a set of key elements that addresses the shortcomings from the past:

› Community policies: There is a set of mandatory policies [including topics of configuring, installing, testing, message passing interface (MPI) usage, portability, contact and version information, open source licensing, name spacing, and repository access] that a software package must satisfy to be considered xSDK compatible. Also presented are recommended policies (including public repository access, error handling, freeing system resources, and library dependencies), which are encouraged but not required.

› Interoperability (see Figure 1): This enables a collection of related and complementary software packages to be able to call each other so that they can be used simultaneously to solve a complex problem.

› Easy installation via the Spack package manager:e The release process uses the Spack pull request process for all xSDK-related changes that go into the Spack package manager.

› Continued systematic testing: All xSDK packages go through Spack build test cycles on various commonly used workstations. The testing is also extended to multiple DOE Leadership Computing Facility machines. The GitLab CI (pipeline) infrastructure is used to perform daily runs of multiple tests on different systems.f

› Performance autotuner GPTune:g Each library in xSDK has tunable parameters that may greatly affect the code performance on the actual machine. GPTune uses Bayesian optimization based on Gaussian process regression to find the best parameter configurations. It supports advanced features, such as multitask learning, transfer learning, multifidelity/objective tuning, and parameter sensitivity analysis.

Since the inception of the ECP, the number of libraries in the xSDK collection has grown to 26. Figure 1 illustrates the dependencies among some of the libraries. As shown in this hierarchy, some libraries at the lower level provide commonly used building blocks that are needed by the higher level math libraries and applications.

As an example, the Ginkgo software stack developed under the ECP currently employs 45 CI pipelines on CPU and GPU architectures from AMD, Intel, and Nvidia and has 91% unit test coverage. Likewise, the WarpX application performs CI tests with code reviews on the three major operating systems (Linux, macOS, and Windows) and deploys to three major CPU (x86, ARM, and PPC) and three major GPU (Nvidia, AMD, and Intel) architectures. Reusing the mathematical software in xSDK, the AMReX libraryh became a central dependency of WarpX for its data structures, communication routines, portability, and third-party solvers.

However, it is not the community agreeing on standards, reuse, and the technical realization of rigorous software testing alone that enabled higher productivity and collaboration across institutions. Equally important is the recognition of research software engineering as a profession, establishing career paths, and understanding the culture around it.i It is an open research software engineering culture across projects and dependencies that drives successful, productive, and resilient software ecosystems, sharing and evolving the practices described in this section.

d https://xsdk.info
e https://spack.io
f https://gitlab.com/xsdk-project/spack-xsdk/-/pipelines
g https://gptune.lbl.gov
h https://amrex-codes.github.io/amrex/
i https://us-rse.org and https://society-rse.org
FIGURE 1. xSDK libraries and interoperability tests included in xsdk-examples 0.4.0. Peach ovals represent newly added libraries and red arrows newly featured interoperability. (Courtesy of Ulrike Yang, Satish Balay, and other xSDK developers.) SLATE: Software for Linear Algebra Targeting Exascale; xSDK: Extreme-Scale Scientific Software Development Kit.
ALGORITHMS: THEN AND NOW

Exascale machines offer unprecedented degrees of parallelism on the order of tens of millions. This is achieved by combining thousands of compute nodes with thousands of compute threads on each node. However, most of the existing algorithms were limited to small to medium degrees of parallelism. Throughout the ECP, we made significant efforts to develop algorithms that better utilize this massively parallel computing power. In this section, we will give several examples to illustrate our algorithm innovations.

To use many compute nodes, an algorithm needs to distribute the data and computing tasks to different nodes, and multiple nodes perform local computation and communicate among each other to finish the entire computation. On exascale machines, the local computation speed is very fast, but communicating a word between two compute nodes is orders of magnitude slower than performing one floating point operation. Therefore, we redesigned many algorithms to reduce the amount of communication.

One example is avoiding communication in the sparse direct linear solver SuperLU.j The SuperLU team developed the first communication-avoiding 3-D algorithm framework for a sparse lower–upper (LU) factorization and sparse triangular solution. The algorithmic novelty involves a "3-D" process organization and judicious duplication of data between computers to effectively reduce communication by up to several orders of magnitude, depending on the input matrix. The new 3-D code can effectively use 10× more processes than the earlier 2-D algorithm. The sparse LU achieved up to 27× speedup on 24,000 cores of a Cray XC30 [Edison at the National Energy Research Scientific Computing Center (NERSC)]. When combined with GPU offloading, the new 3-D code achieves up to 24× speedup on 4096 nodes of a Cray XK7 (Titan at the Oak Ridge Leadership Computing Facility) with 32,768 CPU cores and 4096 Nvidia K20x GPUs.1 The new 3-D sparse triangular solution code outperformed the earlier 2-D code by up to 7.2× when run on 12,000 cores of a Cray XC30 machine. On the Perlmutter GPU machine at NERSC, the new 3-D sparse triangular solution scaled to 256 GPUs, while the earlier 2-D code could only scale up to four GPUs.2
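The sketch below is not SuperLU code; it only illustrates, under simplifying assumptions, the basic idea of a "3-D" process organization in MPI: the ranks are arranged into a Px × Py × Pz Cartesian grid, and communicators for the 2-D layers (within which the usual 2-D algorithm runs) and along the replication dimension (across which data is duplicated and later reduced) are derived from it.

// Illustrative sketch only (not the SuperLU implementation): organize MPI
// ranks into a 3-D process grid and split it into 2-D "layer" communicators
// plus a communicator along the replication (z) dimension.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  int nprocs = 0, rank = 0;
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // Let MPI choose a balanced Px x Py x Pz factorization of nprocs.
  int dims[3] = {0, 0, 0};
  MPI_Dims_create(nprocs, 3, dims);

  int periods[3] = {0, 0, 0};
  MPI_Comm grid3d;
  MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, /*reorder=*/1, &grid3d);

  // 2-D layer communicator (x-y plane): ranks sharing the same z coordinate.
  int keep_xy[3] = {1, 1, 0};
  MPI_Comm layer2d;
  MPI_Cart_sub(grid3d, keep_xy, &layer2d);

  // Communicator along z: used, e.g., to replicate data across layers and
  // to reduce partial results back to one layer.
  int keep_z[3] = {0, 0, 1};
  MPI_Comm zcomm;
  MPI_Cart_sub(grid3d, keep_z, &zcomm);

  if (rank == 0) {
    std::printf("3-D process grid: %d x %d x %d\n", dims[0], dims[1], dims[2]);
  }

  MPI_Comm_free(&zcomm);
  MPI_Comm_free(&layer2d);
  MPI_Comm_free(&grid3d);
  MPI_Finalize();
  return 0;
}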
An example where the design of a new algorithm class enabled scientific advances is batched iterative solvers. Batched methods are designed to process many problems of small dimension in a data-parallel fashion. They became popular as the hardware parallelism exceeded the problem parallelism, and processing the problems in sequence would be inefficient. Situations where many small systems need to be handled in parallel are common in combustion and plasma simulations but also play a central role in machine learning (ML) methods based on deep neural networks.
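As a minimal sketch of the batched idea (illustrative only, not the interface of Ginkgo or any other ECP library), the loop below applies the same Jacobi iteration independently to every system in a batch of small dense problems; on a GPU, each system would typically map to its own thread block, while here the batch loop is simply annotated for OpenMP:

// Minimal batched iterative solver sketch: solve many small dense systems
// A[b] x[b] = b[b] independently with a fixed number of Jacobi sweeps.
#include <cstddef>
#include <vector>

struct SmallSystem {
  int n;                  // dimension of this (small) system
  std::vector<double> A;  // row-major n x n matrix, nonzero diagonal assumed
  std::vector<double> b;  // right-hand side
  std::vector<double> x;  // solution vector, e.g., initialized to zero
};

void batched_jacobi(std::vector<SmallSystem>& batch, int iters) {
  // Each system is independent -> data parallelism over the batch.
  #pragma omp parallel for
  for (std::ptrdiff_t s = 0; s < static_cast<std::ptrdiff_t>(batch.size()); ++s) {
    SmallSystem& sys = batch[s];
    std::vector<double> xnew(sys.n, 0.0);
    for (int k = 0; k < iters; ++k) {
      for (int i = 0; i < sys.n; ++i) {
        double sigma = 0.0;
        for (int j = 0; j < sys.n; ++j) {
          if (j != i) sigma += sys.A[i * sys.n + j] * sys.x[j];
        }
        xnew[i] = (sys.b[i] - sigma) / sys.A[i * sys.n + i];
      }
      sys.x.swap(xnew);  // accept the sweep and continue
    }
  }
}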
Prior to the ECP, batched direct solvers had been developed and used in various applications, but no need for batched iterative methods was formulated. It was the interaction with ECP application specialists that identified the potential of batched iterative methods as approximate solvers for linear problems as part of a nonlinear solver. A cross-institutional task force succeeded in designing batched iterative methods that are performance-portable and suitable for a wide range of applications. The batched iterative functionality

j https://portal.nersc.gov/project/sparse/superlu/
l https://github.com/ECP-WarpX/picsar
m https://www.khronos.org/sycl/
FIGURE 4. Overview of the Ginkgo library design using the back-end model for platform portability.9 High-level algorithms are
contained in the library core and composed of algorithm-specific kernels coded for the different hardware back ends.
the workload dynamically, depending on the machine hardware's computing units.

Different strategies exist to tackle the challenge of platform portability. Among the most popular and successful ones are the concept of a portability layer and the back-end model. The idea behind a portability layer is that the user writes the code once in a high-level language, and the code is then mapped to source code tailored for a specific architecture and its ecosystem before being executed, thanks to an abstraction. Popular examples of portability layers are Kokkos,n RAJA,o and SYCL.p Relying on a portability layer removes the burden of platform portability from the library developers and allows them to focus exclusively on the development of sophisticated algorithms. This convenience comes at the price of a strong dependency on the portability layer, and moving to another programming model or portability layer is usually difficult or even impossible. Furthermore, relying on a portability layer naturally implies that the performance of algorithms and applications is determined by the quality, expressiveness, and hardware-specific optimization of the portability layer.
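To make the single-source idea concrete, the sketch below shows the lambda-based pattern such layers expose; portable::parallel_for is a hypothetical stand-in rather than the exact Kokkos, RAJA, or SYCL API, but the structure, one loop body written once and compiled for whichever back end is enabled at build time, is the same:

// Illustrative single-source kernel in the style of a portability layer.
// "portable::parallel_for" is a hypothetical stand-in for constructs such as
// Kokkos::parallel_for or RAJA::forall; the user writes the loop body once as
// a lambda, and the layer maps it to CUDA, HIP, SYCL, OpenMP, ... back ends.
#include <cstddef>
#include <vector>

namespace portable {
// Trivial host implementation; a real portability layer would dispatch the
// body to the device programming model selected when the library was built.
template <class Body>
void parallel_for(std::size_t n, Body body) {
  for (std::size_t i = 0; i < n; ++i) body(i);
}
}  // namespace portable

void axpy(double alpha, const std::vector<double>& x, std::vector<double>& y) {
  const double* xp = x.data();
  double* yp = y.data();
  // Single-source loop body: the same lambda could run on CPU threads or a GPU.
  portable::parallel_for(y.size(), [=](std::size_t i) {
    yp[i] += alpha * xp[i];
  });
}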
As an alternative, the idea behind the back-end model is to embrace portability in the software design. The hardware-specific kernels are written separately for the different types of hardware targeted9; see Figure 4, visualizing the back-end model used in the Ginkgo library design.q Several libraries are using this back-end model effectively, like deal.IIr and Ginkgo.9 To use this model, a library must be designed with modularity and extensibility in mind. Only a library design that enforces the separation of concerns between the parallel algorithm and the different hardware back ends can allow for extensibility in the back-end model. The different back ends need to be managed by a specific interface layer between algorithms and kernels. However, the price of the higher performance potential is high: the library developers have to synchronize several hardware back ends; monitor and react to changes in compilers, tools, and build systems; and adopt new hardware back ends and programming models. The effort of maintaining multiple hardware back ends and keeping them synchronized usually results in a significant workload that can easily exceed the developers' resources.9
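A minimal sketch of such a back-end split follows; the class and function names are hypothetical (they do not reproduce Ginkgo's or deal.II's actual interfaces), but they capture the separation between an algorithm in the library core and hardware-specific kernels selected through an interface layer:

// Illustrative back-end model (hypothetical names): the high-level algorithm
// in the library core calls an abstract kernel interface, and each hardware
// back end provides its own kernel implementation.
#include <cstddef>
#include <vector>

// Interface layer between algorithms and hardware-specific kernels.
class Backend {
 public:
  virtual ~Backend() = default;
  virtual void spmv(const std::vector<double>& values,
                    const std::vector<int>& col_idx,
                    const std::vector<int>& row_ptr,
                    const std::vector<double>& x,
                    std::vector<double>& y) const = 0;
};

// Reference (CPU) back end; CUDA/HIP/SYCL back ends would live in separate
// translation units compiled with their respective tool chains.
class ReferenceBackend : public Backend {
 public:
  void spmv(const std::vector<double>& values, const std::vector<int>& col_idx,
            const std::vector<int>& row_ptr, const std::vector<double>& x,
            std::vector<double>& y) const override {
    for (std::size_t row = 0; row + 1 < row_ptr.size(); ++row) {
      double sum = 0.0;
      for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
        sum += values[k] * x[col_idx[k]];
      y[row] = sum;
    }
  }
};

// High-level algorithm in the library core: hardware agnostic, expressed only
// in terms of the kernel interface (here a simple Richardson iteration on a
// CSR matrix).
void richardson(const Backend& exec, const std::vector<double>& values,
                const std::vector<int>& col_idx, const std::vector<int>& row_ptr,
                const std::vector<double>& b, std::vector<double>& x,
                double omega, int iters) {
  std::vector<double> Ax(x.size(), 0.0);
  for (int it = 0; it < iters; ++it) {
    exec.spmv(values, col_idx, row_ptr, x, Ax);  // back-end specific kernel
    for (std::size_t i = 0; i < x.size(); ++i)   // core-level update
      x[i] += omega * (b[i] - Ax[i]);
  }
}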
The two strategies for achieving platform portability presented here are not necessarily exclusive, and the usage of a hybrid model can be more efficient.9

n https://kokkos.github.io
o https://raja.readthedocs.io
p https://www.khronos.org/sycl/
q Ginkgo uses SYCL as one of its back ends, which is a portability layer.
r https://www.dealii.org
One reason for adopting such a hybrid approach is that not all building blocks are as performance critical or as complex to optimize as others. For those kernels, relying on a performance portability layer allows for reducing the code maintenance and testing complexities as well as focusing on the more performance-critical aspects of the library. One example using the hybrid approach is the PETSc library.11 The main data objects in PETSc are Vector and Matrix. The PETSc design separates the front-end programming model used by the application and the back-end implementations. Users can access PETSc's Vector, Matrix, and the operations in their preferred programming model, such as Kokkos, RAJA, SYCL, HIP, CUDA, or OpenCL. The back end heavily relies on the GPU vendors' libraries or Kokkos Kernelss to provide higher level solver functions operating on the Vector and Matrix objects.
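The fragment below sketches what this separation looks like from the application side; it uses the public PETSc C API (also usable from C++), and the point is that the same code can run on different back ends, with the vector implementation selected through the options database (e.g., a -vec_type option); which back-end types are actually available depends on how the PETSc installation was configured.

// Sketch of PETSc's front-end/back-end separation from the user's view.
// The application is written once against Vec; the concrete implementation
// (CPU, CUDA, Kokkos, ...) can be chosen at run time via options such as
// "-vec_type <type>", subject to the configured PETSc build.
#include <petscvec.h>

int main(int argc, char** argv) {
  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));

  Vec x, y;
  PetscCall(VecCreate(PETSC_COMM_WORLD, &x));
  PetscCall(VecSetSizes(x, PETSC_DECIDE, 1000000));
  PetscCall(VecSetFromOptions(x));   // back end selected via options database
  PetscCall(VecDuplicate(x, &y));

  PetscCall(VecSet(x, 1.0));
  PetscCall(VecSet(y, 2.0));
  PetscCall(VecAXPY(y, 3.0, x));     // y <- y + 3*x, executed by the back end

  PetscReal norm;
  PetscCall(VecNorm(y, NORM_2, &norm));
  PetscCall(PetscPrintf(PETSC_COMM_WORLD, "||y|| = %g\n", (double)norm));

  PetscCall(VecDestroy(&x));
  PetscCall(VecDestroy(&y));
  PetscCall(PetscFinalize());
  return 0;
}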
With the rewrite and advancement from the predecessor Fortran code in Warp to C++ in WarpX, WarpX also benefited from AMReX's performance portability layer, enabling the developers to write most new algorithms in a lambda-based, single-source implementation that supports all targeted architectures. Notably, the performance portability layer in AMReX itself uses a back-end model. Besides portable WarpX performance for HPC, this approach also improved midscale and entry-level user experience: Warp supported only CPU architectures; WarpX runs on major CPU architectures and three different GPU vendors.8 This enabled WarpX to target all scales of computing that are important for scientific modeling: from laptop to HPC.

Based on analysis of the products in the DOE's software portfolio, platform portability has become a central design principle, thereby increasing the productivity and sustainability of the individual software stacks significantly as well as hardening the resilience of the overall software ecosystem to architectural changes.

LARGE-SCALE SIMULATION CODE: THEN AND NOW

Many scientific problems targeted by application software require excellent weak scaling to the largest available supercomputers. Weak scaling means that, e.g., a 1000× larger problem can be solved in the same time as a 1000× smaller base case if 1000× more theoretical flops are also provided in parallel hardware.
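Expressed as a formula (a generic definition, stated here in our own notation, not taken from the WarpX papers): let $T(P, W)$ denote the wall-clock time to solve a problem of total size $W$ on $P$ parallel processing elements. Weak scaling keeps the work per processing element $W/P$ fixed, and the weak-scaling efficiency relative to a base case $(P_0, W_0)$ is

\[
E_{\text{weak}}(P) \;=\; \frac{T(P_0,\, W_0)}{T\!\left(P,\; W_0 \, P/P_0\right)},
\]

so ideal weak scaling, $E_{\text{weak}}(P) \approx 1$, means that the 1000× larger problem on 1000× more hardware finishes in (nearly) the same time.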
Designing advanced particle accelerators for high-energy physics electron–positron colliders was the primary science driver for WarpX, the advanced electromagnetic particle-in-cell code in the ECP. For this application, an approach relying on subsequent acceleration in stages of laser wakefield accelerators was investigated, an advanced plasma particle acceleration approach that can provide orders of magnitude higher accelerating fields than currently available particle accelerator elements. This and related science drivers in plasma, accelerator, beam, and fusion physics can be simulated in full fidelity with WarpX. Modeling in these domains continues to benefit significantly from scaling and more compute, as it enables, among others, the following:

› Higher grid resolution: Modeling of larger systems and higher plasma densities.

› More particles: Improved sampling of kinetic, nonequilibrium particle distributions.

› Includes more microscopic physics: Better investigation of collisional, quantum, and high-field effects.

› Transition from 2-D to 3-D: Covering the full geometric effects enables the quantitative prediction of particle energies and the study of particle accelerator stability.

› Can use long-term stable, advanced solvers: Modeling of longer physical time scales.

For the laser–plasma physics modeled with WarpX, 3-D domain decomposition is used for multinode parallelism. Besides computation, multiple communication calls between neighboring domains are needed for the time evolution in every simulation step. With its successful scalable implementation, WarpX science runs achieved near-ideal weak scaling over a large variety of CPU and GPU hardware, winning the 2022 ACM Gordon Bell Prize. This included runs on nearly the full scale of Frontier and Fugaku, then the brand-new TOP1 and TOP2 systems in the world.8
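As a rough illustration of the communication pattern implied by 3-D domain decomposition (generic MPI halo-exchange code, not WarpX's implementation, which delegates these operations to AMReX), each rank owns one subdomain and exchanges guard-cell data with its neighbors along every dimension in each step:

// Generic halo-exchange sketch for a 3-D domain decomposition
// (illustrative only; WarpX/AMReX handle this internally).
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  int nprocs = 0;
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  // Arrange ranks in a 3-D grid, one subdomain per rank.
  int dims[3] = {0, 0, 0}, periods[3] = {1, 1, 1};
  MPI_Dims_create(nprocs, 3, dims);
  MPI_Comm cart;
  MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, /*reorder=*/1, &cart);

  const int nguard = 1000;  // stand-in for one face worth of guard-cell data
  std::vector<double> send_lo(nguard, 1.0), send_hi(nguard, 2.0);
  std::vector<double> recv_lo(nguard), recv_hi(nguard);

  // One exchange per dimension and direction, repeated every time step.
  for (int dim = 0; dim < 3; ++dim) {
    int lo_nbr, hi_nbr;
    MPI_Cart_shift(cart, dim, 1, &lo_nbr, &hi_nbr);

    // Send local boundary data "up", receive guard cells from "below", ...
    MPI_Sendrecv(send_hi.data(), nguard, MPI_DOUBLE, hi_nbr, 0,
                 recv_lo.data(), nguard, MPI_DOUBLE, lo_nbr, 0,
                 cart, MPI_STATUS_IGNORE);
    // ... and vice versa.
    MPI_Sendrecv(send_lo.data(), nguard, MPI_DOUBLE, lo_nbr, 1,
                 recv_hi.data(), nguard, MPI_DOUBLE, hi_nbr, 1,
                 cart, MPI_STATUS_IGNORE);
  }

  MPI_Comm_free(&cart);
  MPI_Finalize();
  return 0;
}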
WarpX improved in several ways over its predecessor Warp and addressed design challenges. Centrally, many of its earlier mentioned advanced algorithms could be implemented and maintained productively, such as mesh refinement (MR), because AMReX provided an excellent framework to solve domain decomposition, MR bookkeeping, inherent load balancing, and performance portability. Thus, application developers could focus on implementing a large set of advanced algorithms and optimize their performance.

Beyond the Producer–Consumer Relationship of Scientific Software Development

WarpX development embraced the team-of-teams approach lived in the ECP12 by depending on co-design centers (like AMReX) and software technology

s https://kokkos.org/about/kernels/
t https://www.openPMD.org
u https://ascent.readthedocs.io
v https://sensei-insitu.org
w https://packages.spack.io/package.html?name=warpx
x https://packages.spack.io/package.html?name=py-warpx
y https://github.com/picmi-standard
› Trust and collaboration: Attracted national and international contributors, who trust that their open source contributions will be maintained, and enables leveraging of investment in WarpX spin-off/follow-up projects.
FUTURE DIRECTIONS

WarpX

WarpX is used as a blueprint to implement compatible, specialized beam, plasma, and particle accelerator modeling codes,z enabling advances in the R&D of particle accelerators, light sources, plasma and fusion devices, astrophysical plasmas, microelectronics, and more. The team continues to address opportunities via novel numerical algorithms, increased massive parallelism, and anticipated novel compute hardware, and it addresses data challenges with increased in situ processing over traditional postprocessing workflows.

For sustainability, open source development practices will continue to evolve, and WarpX intends to adopt a more formal, open governance model with its national and international partners from national laboratories, academia, and industry.aa
Math Software

Building upon the success of the teams' collaboration through the xSDK community platform, the math software developers will continue to innovate the algorithms and software for future architecture and application needs. In particular, the novel algorithms for GPU acceleration paved the way for algorithm design targeting future, more heterogeneous architectures with a variety of accelerators. The math teams will expand their algorithm portfolio to meet the needs of AI for science. In addition to new developments, as part of post-ECP work on software stewardship, the math teams will ensure that the libraries developed in the ECP will be sustained for many years to come. This will be achieved through close collaboration and a common governance model across collaborating software stewardship organizations.
to Cholesky and LU factorizations,” ICL, Univ. of
Tennessee, Knoxville, TN, USA, Tech. Rep. ICL-UT-20-
ACKNOWLEDGMENTS
14, 2020. [Online]. Available: https://icl.utk.edu/files/
This research was supported by the Exascale Comput-
publications/2020/icl-utk-1418-2020.pdf
ing Project (17-SC-20-SC), a collaborative effort of two
8. L. Fedeli et al., “Pushing the Frontier in the design of
U.S. Department of Energy organizations (the Office of
laser-based electron accelerators with
Science and the National Nuclear Security Administra-
groundbreaking mesh-refined particle-in-cell
tion) responsible for the planning and preparation of a
simulations on exascale-class supercomputers,” in
z
https://blast.lbl.gov Proc. Int. Conf. High Perform. Comput., Netw., Storage
aa
https://hpsf.io Anal., 2022, pp. 1–12, doi: 10.1109/SC41404.2022.00008.
REFERENCES

1. P. Sao, R. Vuduc, and X. Li, "A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems," J. Parallel Distrib. Comput., vol. 131, Sep. 2019, pp. 218–234, doi: 10.1016/j.jpdc.2019.03.004.
2. Y. Liu, N. Ding, P. Sao, S. Williams, and X. S. Li, "Unified communication optimization strategies for sparse triangular solver on CPU and GPU clusters," in Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal. (SC23), Denver, CO, USA, Nov. 13–17, 2023, pp. 1–15, doi: 10.1145/3581784.3607092.
3. A. Abdelfattah et al., "Advances in mixed precision algorithms: 2021 edition," Lawrence Livermore National Lab., Livermore, CA, USA, Tech. Rep. LLNL-TR-825909 1040257, Aug. 2021. [Online]. Available: https://www.osti.gov/biblio/1814677
4. J.-L. Vay, D. P. Grote, R. H. Cohen, and A. Friedman, "Novel methods in the particle-in-cell accelerator code-framework Warp," Comput. Sci. Discovery, vol. 5, no. 1, Dec. 2012, Art. no. 014019, doi: 10.1088/1749-4699/5/1/014019.
5. J.-L. Vay et al., "Modeling of a chain of three plasma accelerator stages with the WarpX electromagnetic PIC code on GPUs," Phys. Plasmas, vol. 28, no. 2, Feb. 2021, Art. no. 023105, doi: 10.1063/5.0028512.
6. E. Zoni et al., "A hybrid nodal-staggered pseudo-spectral electromagnetic particle-in-cell method with finite-order centering," Comput. Phys. Commun., vol. 279, Oct. 2022, Art. no. 108457, doi: 10.1016/j.cpc.2022.108457.
7. A. YarKhan, M. A. Farhan, D. Sukkari, M. Gates, and J. Dongarra, "SLATE performance report: Updates to Cholesky and LU factorizations," ICL, Univ. of Tennessee, Knoxville, TN, USA, Tech. Rep. ICL-UT-20-14, 2020. [Online]. Available: https://icl.utk.edu/files/publications/2020/icl-utk-1418-2020.pdf
8. L. Fedeli et al., "Pushing the Frontier in the design of laser-based electron accelerators with groundbreaking mesh-refined particle-in-cell simulations on exascale-class supercomputers," in Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal., 2022, pp. 1–12, doi: 10.1109/SC41404.2022.00008.
9. T. Cojean, Y.-H. M. Tsai, and H. Anzt, "Ginkgo—A math library designed for platform portability," Parallel Comput., vol. 111, Jul. 2022, Art. no. 102902, doi: 10.1016/j.parco.2022.102902.
10. A. Dubey et al., "Performance portability in the exascale computing project: Exploration through a panel series," Comput. Sci. Eng., vol. 23, no. 5, pp. 46–54, 2021, doi: 10.1109/MCSE.2021.3098231.
11. R. T. Mills et al., "Toward performance-portable PETSc for GPU-based exascale systems," Parallel Comput., vol. 108, Dec. 2021, Art. no. 102831, doi: 10.1016/j.parco.2021.102831.
12. E. M. Raybourn, J. D. Moulton, and A. Hungerford, "Scaling productivity and innovation on the path to exascale with a 'team of teams' approach," in HCI in Business, Government and Organizations. Information Systems and Analytics, F. F.-H. Nah and K. Siau, Eds., Cham, Switzerland: Springer International Publishing, 2019, pp. 408–421.

HARTWIG ANZT is the chair of Computational Mathematics at the Technical University of Munich, 80333 Munich, Germany, and a research associate professor at the Innovative Computing Laboratory at the University of Tennessee, Knoxville, TN, 37996, USA. His research interests include mixed precision numerical linear algebra, sustainable software, and energy-efficient computing. Anzt received his Ph.D. degree in applied mathematics from the Karlsruhe Institute of Technology. He is a Member of IEEE, GAMM, and SIAM. Contact him at [email protected].

AXEL HUEBL is a computational physicist at Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA, 94720, USA. His research interests include the interface of high-performance computing, laser–plasma physics, and advanced particle accelerator research. Huebl received his Ph.D. degree in physics from Technical University Dresden, Germany. He is a Member of IEEE, APS, and ACM. Contact him at [email protected].

XIAOYE S. LI is a senior scientist at LBNL, Berkeley, CA, 94720, USA. Her research interests include high-performance computing, numerical linear algebra, Bayesian optimization, and scientific machine learning. Li received her Ph.D. degree in computer science from the University of California at Berkeley. She is a fellow of the Society for Industrial and Applied Mathematics and a senior member of the Association for Computing Machinery. Contact her at [email protected].