Accelerator Programming
Using Directives
6th International Workshop, WACCPD 2019
Denver, CO, USA, November 18, 2019
Revised Selected Papers
Lecture Notes in Computer Science 12017
Founding Editors
Gerhard Goos
Karlsruhe Institute of Technology, Karlsruhe, Germany
Juris Hartmanis
Cornell University, Ithaca, NY, USA
Editors
Sandra Wienke, RWTH Aachen University, Aachen, Germany
Sridutt Bhalachandra, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
Preface
The program co-chairs invited Dr. Nicholas James Wright from Lawrence Berkeley
National Laboratory (LBL) to give a keynote address on “Perlmutter – A 2020
Pre-Exascale GPU-accelerated System for NERSC: Architecture and Application
Performance Optimization.” Dr. Nicholas J. Wright is the Perlmutter chief architect and
the Advanced Technologies Group lead in the National Energy Research Scientific
Computing (NERSC) center at LBL. He led the effort to optimize the architecture of the
Perlmutter machine, the first NERSC platform designed to meet the needs of both
large-scale simulation and data analysis from experimental facilities. Nicholas has a
PhD from the University of Durham in computational chemistry and has been with
NERSC since 2009.
Robert Henschel from Indiana University gave an invited talk titled “The
SPEC ACCEL Benchmark – Results and Lessons Learned.” Robert Henschel is the
director of Research Software and Solutions at Indiana University. He is responsible for
providing advanced scientific applications to researchers at Indiana University and
national partners as well as providing support for computational research to the Indiana
University School of Medicine. Henschel serves as the chair of the Standard Perfor-
mance Evaluation Corporation (SPEC) High-Performance Group and in this role leads
the development of production quality benchmarks for HPC systems. He also serves as
the treasurer of the OpenACC organization. Henschel has a deep background in HPC
and his research interests focus on performance analysis of parallel applications.
The workshop concluded with a panel “Convergence, Divergence, or New
Approaches? – The Future of Software-Based Abstractions for Heterogeneous
Supercomputing” moderated by Fernanda Foertter from NVIDIA. The panelists
included:
– Christian Trott, Sandia National Laboratories, USA
– Michael Wolfe, NVIDIA, USA
– Jack Deslippe, Lawrence Berkeley National Laboratory, USA
– Jeff Hammond, Intel, USA
– Johannes Doerfert, Argonne National Laboratory, USA
Based on rigorous reviews and the ranking scores of all papers reviewed, the following
paper won the Best Paper Award. Its authors also included reproducibility results in
their paper, which the WACCPD workshop organizers had indicated as a criterion for
eligibility for the Best Paper Award.
– Hongzhang Shan and Zhengji Zhao from Lawrence Berkeley National Laboratory,
and Marcus Wagner from Cray: “Accelerating the Performance of Modal Aerosol
Module of E3SM Using OpenACC”
Emphasizing the importance of using directives for legacy scientific applications,
the keynote/invited speakers, panelists, and Best Paper Award winners were each given a
copy of the book "OpenACC for Programmers: Concepts & Strategies."
Steering Committee
Barbara Chapman Stony Brook, USA
Duncan Poole OpenACC, USA
Kuan-Ching Li Providence University, Taiwan
Oscar Hernandez ORNL, USA
Jeffrey Vetter ORNL, USA
Program Co-chairs
Sandra Wienke RWTH Aachen University, Germany
Sridutt Bhalachandra Lawrence Berkeley National Laboratory, USA
Publicity Chair
Neelima Bayyapu NMAM Institute of Technology, Karnataka, India
Web Chair
Shu-Mei Tseng University of California, Irvine, USA
Program Committee
Adrian Jackson Edinburgh Parallel Computing Centre,
University of Edinburgh, UK
Andreas Herten Forschungszentrum Jülich, Germany
Arpith Jacob Google, USA
Cheng Wang Microsoft, USA
Christian Iwainsky Technische Universität Darmstadt, Germany
Christian Terboven RWTH Aachen University, Germany
Christopher Daley Lawrence Berkeley National Laboratory, USA
C. J. Newburn NVIDIA, USA
David Bernholdt Oak Ridge National Laboratory, USA
Giuseppe Congiu Argonne National Laboratory, USA
Haoqiang Jin NASA Ames Research Center, USA
Jeff Larkin NVIDIA, USA
Kelvin Li IBM, USA
Manisha Gajbe Intel, USA
Michael Wolfe NVIDIA/PGI, USA
Ray Sheppard Indiana University, USA
Ron Lieberman AMD, USA
Sophisticated Implicit Finite Element Solver Using OpenACC (T. Yamaguchi et al.)

1 Introduction
Nowadays, computer architectures are becoming increasingly diverse and new
hardware, including heterogeneous systems, is released every year. Software
Electronic supplementary material: The online version of this chapter
(https://doi.org/10.1007/978-3-030-49943-3_1) contains supplementary material, which
is available to authorized users.
\[
A^n \, \delta u^n = b^n, \qquad (1)
\]
where
\[
A^n = \frac{4}{dt^2} M + \frac{2}{dt} C^n + K^n, \qquad
b^n = f^n - q^{n-1} + C^n v^{n-1} + M \left( a^{n-1} + \frac{4}{dt} v^{n-1} \right).
\]
Fig. 1. Extraction of part of the problem having bad convergence using AI. (The figure
shows buildings and an underground structure embedded in soft, medium, and hard ground
layers.)
Finally, we map this result to a second-order finite element model and use it
as an initial solution for the solver on FEMmodel (Algorithm 1, line 15), and
further use the results for the search direction z in the outer iteration.
By setting the tolerance of each preconditioning solver to a suitable value, we
can devote more effort to the parts of the problem with bad convergence while
solving the majority of the problem, which converges well, less intensively. This
leads to a reduction in the computational cost compared to a solver that treats
the entire domain uniformly. Even if the selection of FEMmodel_cp by the ANN is
slightly altered, the effects are absorbed by the other preconditioning solvers
(PreCG_c and PreCG); therefore, the solver becomes highly robust.
The training and inference of the AI for extracting FEMmodel_cp are conducted
offline using commercial neural network packages on a few GPUs, and are performed
only once prior to the time-history earthquake simulation.
3. Use of low-precision arithmetic in the preconditioner
While the solution of the entire solver is required in double precision, we
can use transprecision computing [15] in the preconditioner because it is only
used to obtain rough solutions. We can use not only FP32 but also other data
types, such as FP21, which has the same dynamic range as FP32 and an accuracy
between those of FP32 and FP16, to reduce the data transfer cost and the memory
footprint. As mentioned later, all vectors can be in FP21 on CPUs, while FP32
must be used for some vectors on GPUs. Introducing FP21 data types in both the
CPU and GPU implementations would make maintenance of the entire code and
performance evaluation more complex; thus, for simplicity, we use the custom
data type only in the GPU implementation.
4. Use of time-parallel time integration in the solver
Although AI with a transprecision-computing solver appears to be highly
complicated, it is merely a combination of conjugate gradient-based solvers
solved using simple PCGE methods. Therefore, the majority of its computational
costs consist of matrix-vector products. However, because the com-
Because the approximated methods are only used in the preconditioner or are
used to obtain the initial solutions of the iterative solver, the obtained solution
δu^i (i = 1, 2, ...) is the same as that of the double-precision PCGE method within
the solver error tolerance. Further, because most of the computational cost is
in matrix-vector products, we can maintain load balance by allocating an equal
number of elements to each process/thread, which leads to high scalability for
large-scale systems.
Using the SC18GBF solver algorithm, the FLOP count is reduced 5.56-fold
compared to the standard PCGE solver for an earthquake wave propagation
problem in a ground region with a buried concrete structure. Because
mixed-precision arithmetic and highly efficient SIMD arithmetic can be used,
we expected an additional speedup beyond the reduction in the arithmetic count.
Indeed, we obtained a 9.09-fold speedup over the PCGE method [8] when measured
on the CPU-based K computer system [18].
Our solver algorithm, as described in the previous section, is suitable not only for
CPUs but also for GPUs. For example, the introduction of time-parallel computation
circumvents random accesses to the global vector in the matrix-vector multiplication
kernel, which greatly improves performance on GPUs. In addition, the PreCG_cp
computation reduces the data transfer size as well as the computational amount;
accordingly, this solver is well suited to GPUs, because data transfer is a major
bottleneck in GPU computations. We therefore expect our solver to be accelerated
even by a straightforward GPU implementation. In this section, we first describe a
baseline OpenACC implementation and then optimize its performance using
lower-precision data types and other tunings.
via coloring; therefore, we use atomic operations for this part. As shown in
Fig. 2, we can enable atomic operations by adding the directive #pragma acc
atomic (a minimal sketch is given after this list).
3. Control of data transfer between CPUs and GPUs
Without explicit instructions, OpenACC automatically transfers the necessary
data from the CPUs to the GPUs prior to the GPU computation and from the GPUs
to the CPUs following the GPU computation to obtain the expected results. When
data are transferred too frequently, performance greatly diminishes; therefore,
we add directives to control the data transfer, as described in Fig. 3, to
minimize these data transfer costs (see the second sketch after this list).
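The first sketch below illustrates the kind of atomic accumulation referred to in Fig. 2; it is a minimal, hypothetical example, and the array and loop names are assumptions rather than the paper's actual code.

```c
/* Hypothetical sketch of adding element-wise results to the global vector
 * with an OpenACC atomic update (names are illustrative, not from Fig. 2). */
#define NODES_PER_ELEMENT 4

void add_to_global(int num_elements, int num_nodes, const int *connectivity,
                   const float *element_value, float *global_vector)
{
  #pragma acc parallel loop \
      present(connectivity[0:num_elements*NODES_PER_ELEMENT], \
              element_value[0:num_elements*NODES_PER_ELEMENT], \
              global_vector[0:num_nodes])
  for (int e = 0; e < num_elements; e++) {
    for (int j = 0; j < NODES_PER_ELEMENT; j++) {
      int node = connectivity[e * NODES_PER_ELEMENT + j];
      /* Several elements can share a node, so the update must be atomic. */
      #pragma acc atomic update
      global_vector[node] += element_value[e * NODES_PER_ELEMENT + j];
    }
  }
}
```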
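The second sketch illustrates the kind of explicit data management referred to in Fig. 3, under the assumption that the arrays stay resident on the GPU across kernel launches and only the final result is copied back; the variable names and the loop body are placeholders, not the paper's code.

```c
/* Hypothetical sketch of explicit OpenACC data management (not the actual
 * Fig. 3). Device copies of the arrays live for the whole iteration loop;
 * only the final result is transferred back to the host. */
void iterate(int n, const float *rhs, float *solution, int max_iterations)
{
  #pragma acc enter data copyin(rhs[0:n]) create(solution[0:n])
  for (int iter = 0; iter < max_iterations; iter++) {
    #pragma acc parallel loop present(rhs[0:n], solution[0:n])
    for (int i = 0; i < n; i++) {
      /* Placeholder for the real kernels; they operate on device-resident
       * data, so no implicit per-launch transfers occur. */
      solution[i] = 2.0f * rhs[i];
    }
  }
  #pragma acc exit data copyout(solution[0:n]) delete(rhs[0:n])
}
```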
In addition, the original code is designed for MPI parallelization, which allows us
to use multiple GPUs by assigning one GPU to each MPI process. Point-to-point
communication requires data transfer between GPUs; for this we use GPUDirect. We
issue MPI_Isend/MPI_Irecv calls that access GPU memory directly by adding the
corresponding directives, as shown in Fig. 4.
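A minimal sketch of this pattern (not the paper's actual Fig. 4; the buffer and rank names are assumed) uses host_data use_device so that a CUDA-aware MPI receives device pointers and GPUDirect can transfer GPU memory without staging through the host:

```c
#include <mpi.h>

/* Hypothetical halo-exchange sketch (not Fig. 4): inside host_data
 * use_device, send_buf/recv_buf resolve to device addresses, so a
 * CUDA-aware MPI library can use GPUDirect for the transfer. */
void exchange_halo(float *send_buf, float *recv_buf, int count, int neighbor_rank)
{
  MPI_Request requests[2];
  #pragma acc host_data use_device(send_buf, recv_buf)
  {
    MPI_Isend(send_buf, count, MPI_FLOAT, neighbor_rank, 0, MPI_COMM_WORLD, &requests[0]);
    MPI_Irecv(recv_buf, count, MPI_FLOAT, neighbor_rank, 0, MPI_COMM_WORLD, &requests[1]);
  }
  MPI_Waitall(2, requests, MPI_STATUSES_IGNORE);
}
```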
We refer to these implementations as the baseline OpenACC implementation.
To improve the performance, we introduce lower-precision data types and modify
a few parts of the code that would otherwise degrade performance.
Fig. 5. Bit alignments for the sign, exponent, and fraction parts in each data type;
each cell describes one bit. FP32 (single precision, 32 bits): 1-bit sign + 8-bit
exponent + 23-bit fraction. FP16 (half precision, 16 bits): 1-bit sign + 5-bit
exponent + 10-bit fraction. FP21 (21 bits): 1-bit sign + 8-bit exponent + 12-bit
fraction.
Therefore, we define our custom 21-bit data type as shown in Fig. 5. Hereafter, we
refer to this data type as FP21. FP21 has the advantage of the same dynamic range
as FP32 and bfloat16, and a better accuracy than FP16 or bfloat16. In addition,
the border between the sign bit and the exponent bits and the border between the
exponent bits and the fraction bits in FP21 are the same as those in FP32 numbers;
therefore, conversions between FP21 and FP32 are easier than conversions between
other combinations of data types. To facilitate the bit operations, we store three
FP21 numbers in one component of the 64-bit arrays, leaving 1 bit unused.
Our proposed data type is not supported by our targeted hardware; therefore, we use
it only when storing data in memory. We convert the FP21 data types into FP32 prior
to computation in FP32 and convert the FP32 results back into FP21 numbers following
the computation. Figure 6 shows an implementation of the data type conversion. Only
addition or subtraction operations and bit operations are required for this
conversion, and they can be implemented entirely within OpenACC. If these conversion
functions are called as ordinary functions with stack frames, they decrease
performance; therefore, they have to be inlined into all related computations.
When we convert FP32 data types into FP21, we can simply remove the lower 11 bits
of the fraction part; however, rounding to the nearest number can halve the rounding
error compared to dropping the lower bits. We obtain rounded numbers as follows.
First, we remove the last 11 bits of the original FP32 number a and obtain the FP21
number ā. Then, we obtain the result by removing the last 11 bits of a + (a − ā)
computed in FP32.
Here, we are targeting a 3D problem; therefore, we have three components, x, y, and
z, per node. Using FP21 for this problem enables us to assign one component of the
64-bit arrays to one node, packing its x, y, and z components in FP21; therefore, we
can easily handle memory access to the FP21 numbers.
Fig. 6. Mock code for the FP21 implementation. These functions convert FP21
numbers into FP32 numbers and are inlined into all computations requiring FP21.
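Because the mock code of Fig. 6 is not reproduced here, the following is a minimal sketch of what such conversion helpers could look like, assuming the layout described above (an FP21 value is the upper 21 bits of the FP32 bit pattern, and three values are packed per 64-bit word). The function names and the packing order are assumptions, not the paper's actual implementation, and in device code these helpers must be inlined (or declared with #pragma acc routine seq).

```c
#include <stdint.h>

/* Hypothetical FP21 helpers based on the description in the text (not the
 * actual Fig. 6 code). An FP21 value keeps the upper 21 bits of the FP32
 * bit pattern: 1-bit sign + 8-bit exponent + 12-bit fraction. */
static inline uint32_t fp32_to_fp21(float a)
{
  union { float f; uint32_t u; } v = { a };  /* reinterpret FP32 bits        */
  return v.u >> 11;                          /* drop lower 11 fraction bits  */
}

static inline float fp21_to_fp32(uint32_t p)
{
  union { float f; uint32_t u; } v;
  v.u = p << 11;                             /* restore FP32 layout, low = 0 */
  return v.f;
}

/* Round-to-nearest variant following the text: truncate once, add the
 * truncation error back in FP32, then truncate again. */
static inline uint32_t fp32_to_fp21_rounded(float a)
{
  float a_bar = fp21_to_fp32(fp32_to_fp21(a));
  return fp32_to_fp21(a + (a - a_bar));
}

/* Pack the x/y/z components of one node into one 64-bit word
 * (3 x 21 bits, 1 bit unused), as described in the text. */
static inline uint64_t pack_node(float x, float y, float z)
{
  return ((uint64_t)fp32_to_fp21_rounded(x) << 42)
       | ((uint64_t)fp32_to_fp21_rounded(y) << 21)
       |  (uint64_t)fp32_to_fp21_rounded(z);
}
```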
Fig. 7. Example code for computing dot products for multiple vectors in OpenACC.
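A rough sketch of this kind of kernel is shown below (not the paper's actual Fig. 7; the names and the flat array layout are assumptions). Because OpenACC supports reductions only on scalars, all partial products are accumulated into a single scalar; computing several dot products at once requires one scalar reduction variable per result.

```c
/* Hypothetical weighted dot product over three components per node
 * (not the actual Fig. 7). x and y store three floats per node contiguously. */
float dot_product(int n, const float *x, const float *y, const float *z)
{
  float alpha = 0.0f;
  #pragma acc parallel loop reduction(+:alpha) present(x[0:3*n], y[0:3*n], z[0:n])
  for (int i = 0; i < n; i++) {
    alpha += (x[3*i]     * y[3*i]
            + x[3*i + 1] * y[3*i + 1]
            + x[3*i + 2] * y[3*i + 2]) * z[i];
  }
  return alpha;
}
```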
Fig. 8. Example code to call the dot product kernel in CUDA from the OpenACC
codes.
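The interoperability pattern this caption refers to can be sketched as follows (not the paper's actual Fig. 8; cuda_dot_product is an assumed wrapper compiled separately with nvcc):

```c
/* Assumed CUDA wrapper, defined in a separate .cu file; it launches a
 * hand-written dot product kernel on raw device pointers. */
extern float cuda_dot_product(int n, const float *d_x, const float *d_y,
                              const float *d_z);

/* Hypothetical sketch (not the actual Fig. 8): host_data use_device hands
 * the device addresses of the OpenACC-managed arrays to the CUDA wrapper,
 * so no extra host/device copies are needed. */
float dot_product_via_cuda(int n, const float *x, const float *y, const float *z)
{
  float alpha = 0.0f;
  #pragma acc host_data use_device(x, y, z)
  {
    alpha = cuda_dot_product(n, x, y, z);
  }
  return alpha;
}
```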
4 Performance Measurement
In this section, we evaluate the performance of our proposed solver on the GPU-based
supercomputer ABCI [2], which is operated by the National Institute of Advanced
Industrial Science and Technology. Each compute node of ABCI has four NVIDIA Tesla
V100 GPUs and two Intel Xeon Gold 6148 CPUs (20 cores each). Its peak performance in
double precision is 7.8 TFLOPS × 4 = 31.2 TFLOPS on the GPUs and 1.53 TFLOPS × 2 =
3.07 TFLOPS on the CPUs. In addition, its theoretical memory bandwidth is
900 GB/s × 4 = 3600 GB/s on the GPUs and 128 GB/s × 2 = 256 GB/s on the CPUs. The
GPUs in each compute node are connected via NVLink, with a bandwidth of 50 GB/s in
each direction.
We generated a finite element model assuming a small-scale city problem. The problem
settings were nearly the same as those of our previous performance measurement in
Ref. [8], except for the domain size and the number of MPI processes. The target
domain included two soil layers and a layer with material properties similar to
concrete. This problem had 39,191,319 degrees of freedom. In addition, PreCG_cp,
PreCG_c, and PreCG had 659,544, 5,118,339, and 39,191,319 degrees of freedom,
respectively. The target domain was decomposed into four sub-domains, and four MPI
processes were used in the computation. We used 10 OpenMP threads per MPI process
when using CPUs so that all CPU cores on an ABCI compute node were used. We applied
semi-infinite absorbing boundary conditions on the sides and bottom of the domain.
We can incorporate any constitutive law into our proposed solver; here, we used the
modified RO model [9] and the Masing rule [16]. Kobe waves observed during the 1995
Southern Hyogo Earthquake [10] were input at the bottom of the model. The time
increment was 0.01 s, and we computed 25 time steps. Convergence in the conjugate
gradient loops was judged using a tolerance of 1.0 × 10⁻⁸. In addition, the
tolerances in PreCG_cp, PreCG_c, and PreCG were set to 0.05, 0.7, and 0.25,
respectively, following Ref. [8].
speedup; therefore, we confirmed that the computational cost for the data type
conversion was negligible.
Second, we measured the performance of a dot product. The target kernel computes
$\alpha = \sum_i \big( (x(1,i)\,y(1,i) + x(2,i)\,y(2,i) + x(3,i)\,y(3,i))\, z(i) \big)$,
where the arrays $x(\cdot,i)$ and $y(\cdot,i)$ are in FP32 or FP21 and the array
$z(i)$ is in FP32. The expected performance ratio was (CPU) : (baseline OpenACC) :
(proposed) = 1/((32 × 7)/63.9) : 1/((32 × 7)/900) : 1/((21 × 6 + 32)/900) =
1 : 14.1 : 20.0. Compared to the AXPY kernel, the measured memory bandwidth of the
baseline OpenACC implementation decreased because OpenACC cannot use the reduction
option for arrays, which causes strided memory access to the vectors. Conversely,
our proposed implementation with CUDA attained nearly the same bandwidth as the
AXPY kernel.
Finally, we show the performance of the matrix-vector multiplication kernel in
Table 1. The simple implementation and our proposed method obtained 15.0-fold and
14.8-fold speedups, respectively, over our CPU-based kernel. The performance of
these kernels on the GPUs reached 4 TFLOPS. The bottleneck of this kernel is not
memory bandwidth but the atomic addition to the global vector and the element-wise
multiplication; therefore, we were unable to observe a significant difference in
performance even when using FP21 data types for the input vectors. For this kernel,
the data conversion between FP32 and FP21 in our proposed method is a possible
reason for the slight performance gap between the two kernels.
In this section, we evaluate the elapsed time for the entire solver. We com-
pare the original CPU-based solver, a solver simply ported using OpenACC,
a solver simply ported using CUDA, our proposed solver based on OpenACC,
and the SC18GBF solver [8]. The SC18GBF solver improved its performance at the cost
of portability. For example, shared memory on the V100 GPU was used to aggregate the
element-wise computation results and reduce the number of atomic operations in the
element-by-element kernel, and two-way packed FP16 computations on the V100 GPU were
also applied. Moreover, matrix-vector multiplication and point-to-point communication
were reordered, as described in Ref. [17], so that computationally expensive data
transfers could be overlapped.
The SC18GBF solver, designed for large-scale computers, conducted further
reductions in the data transfer cost by splitting the four time steps into two sets
of two vectors and overlapping point-to-point communications with other vector
operations. However, we compared the performance of the solver using only one
compute node in this paper. Each GPU in the compute node was connected via
NVLink; therefore, the data transfer cost was lower. Considering these problem
settings, we computed the four time step vectors without splitting. In the GPU
computations, we used atomic operations when the element-wise results were
added to the global vector; therefore, numerical errors occur due to differences
in the computation order. The final results of the analysis are consistent within
the tolerance of the conjugate gradient solver; however, the number of iterations
in the solver differs every time we run the program. Accordingly, we took the
average of 10 trials for each solver.
The elapsed time for each solver is described in Table 2. The test took 781.8 s
when using only CPUs on an ABCI compute node. Conversely, we reduced the
computation time to 66.71 s via the simple implementation of OpenACC, result-
ing in a speedup ratio of 11.7. It took 61.02 s using the simple implementa-
tion with CUDA. This gap in performance between OpenACC and CUDA is
attributed to the following three factors. The first is the performance decline
in the dot product kernels. The second is that kernels that conduct complex
computations and require many variables cause register spilling, which does not
occur in CUDA implementations. The third is that OpenACC has a larger over-
head for launching each kernel than CUDA, which resulted in a large gap in
PreCG_cp. Our proposed solver based on OpenACC used the FP21 data types
and introduced techniques to circumvent the overhead in the OpenACC kernels.
The elapsed time of this solver was 55.84 s; it was 9% faster than the baseline
OpenACC implementation as well as faster than the simple implementation using
CUDA. Therefore, we confirmed that the introduction of the FP21 data types was
beneficial in accelerating the solver. Our proposed solver attained approximately
86% of the SC18GBF solver performance. The performance gap in PreCG_cp between our
proposed solver and the SC18GBF solver was larger than those in PreCG_c and PreCG.
This was because PreCG_cp has fewer degrees of freedom than the other
preconditioning solvers, so its relative data transfer cost was higher; this cost
was mostly overlapped in the SC18GBF solver. The performance
of our proposed solver is very good from a practical point of view considering
the portability provided by OpenACC.
Table 2. Elapsed time for the entire solver measured on ABCI. The total elapsed time
includes the output of the analysis results. The performance of the preconditioning
solvers is summarized in order of their appearance in the CG solver. The number of
iterations in each solver is also shown in parentheses.