Sandra Wienke
Sridutt Bhalachandra (Eds.)
LNCS 12017

Accelerator Programming
Using Directives
6th International Workshop, WACCPD 2019
Denver, CO, USA, November 18, 2019
Revised Selected Papers
Lecture Notes in Computer Science 12017

Founding Editors
Gerhard Goos
Karlsruhe Institute of Technology, Karlsruhe, Germany
Juris Hartmanis
Cornell University, Ithaca, NY, USA

Editorial Board Members


Elisa Bertino
Purdue University, West Lafayette, IN, USA
Wen Gao
Peking University, Beijing, China
Bernhard Steffen
TU Dortmund University, Dortmund, Germany
Gerhard Woeginger
RWTH Aachen, Aachen, Germany
Moti Yung
Columbia University, New York, NY, USA
More information about this series at http://www.springer.com/series/7408
Sandra Wienke Sridutt Bhalachandra (Eds.)

Accelerator Programming
Using Directives
6th International Workshop, WACCPD 2019
Denver, CO, USA, November 18, 2019
Revised Selected Papers

Editors
Sandra Wienke, RWTH Aachen University, Aachen, Germany
Sridutt Bhalachandra, Lawrence Berkeley National Laboratory, Berkeley, CA, USA

ISSN 0302-9743 ISSN 1611-3349 (electronic)


Lecture Notes in Computer Science
ISBN 978-3-030-49942-6 ISBN 978-3-030-49943-3 (eBook)
https://doi.org/10.1007/978-3-030-49943-3
LNCS Sublibrary: SL2 – Programming and Software Engineering

© Springer Nature Switzerland AG 2020


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the
material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now
known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are
believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors
give a warranty, express or implied, with respect to the material contained herein or for any errors or
omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface

The ever-increasing heterogeneity in supercomputing applications has given rise to
complex compute node architectures offering multiple, heterogeneous levels of massive
parallelism. As a result, the ‘X’ in MPI+X demands more focus. Exploiting the
maximum available parallelism out of such systems necessitates sophisticated pro-
gramming approaches that can provide scalable as well as portable solutions without
compromising on performance. A programmer’s expectation from the scientific com-
munity is to deliver solutions that would allow maintenance of a single code base
whenever possible, avoiding duplicate effort.
Raising the abstraction of the code is one of the effective methodologies to reduce
the burden on the programmer while improving productivity. Software
abstraction-based programming models, such as OpenMP and OpenACC, have been
serving this purpose over the past several years as the compiler technology steadily
improves. These programming models address the ‘X’ component by providing pro-
grammers with high-level directive-based approaches to accelerate and port scientific
applications to heterogeneous platforms.
These proceedings contain the papers accepted for presentation at the 6th Workshop
on Accelerator Programming using Directives (WACCPD 2019) – http://waccpd.org/.
WACCPD is one of the major forums for bringing together users, developers, and the
software and tools community to share knowledge and experiences when programming
emerging complex parallel computing systems.
Recent architectural trends indicate a heavy reliance of future exascale machines on
accelerators for performance. Toward this end, the workshop highlighted improve-
ments to the state of the art through the accepted papers and prompted discussion
through keynotes/panels that drew the community’s attention to key areas that will
facilitate the transition to accelerator-based high-performance computing (HPC). The
workshop aimed to showcase all aspects of heterogeneous systems discussing inno-
vative high-level language features, lessons learned while using directives to migrate
scientific legacy code to parallel processors, compilation and runtime scheduling
techniques, among others.
The WACCPD 2019 workshop received 13 submissions out of which 7 were
accepted to be presented at the workshop and published in these proceedings. The
Program Committee of the workshop comprised 24 members spanning universities,
national laboratories, and industries. Each paper received an average of five reviews.
For 2019, we encouraged all authors to add the Artifact Description (AD) to their
submissions. Two additional pages were made available to authors (however without
obligations) to make their code and data publicly available (e.g. on GitHub, Zenodo,
Code Ocean, etc.) in support of the reproducibility initiative. As a further push, only
papers with AD were considered for the Best Paper Award.
Of the 7 accepted papers, 86% had reproducibility information and these manu-
scripts are highlighted with an ‘artifacts available’ logo in this book.

The program co-chairs invited Dr. Nicholas James Wright from Lawrence Berkeley
National Laboratory (LBL) to give a keynote address on “Perlmutter – A 2020
Pre-Exascale GPU-accelerated System for NERSC: Architecture and Application
Performance Optimization.” Dr. Nicholas J. Wright is the Perlmutter chief architect and
the Advanced Technologies Group lead in the National Energy Research Scientific
Computing (NERSC) center at LBL. He led the effort to optimize the architecture of the
Perlmutter machine, the first NERSC platform designed to meet the needs of both
large-scale simulation and data analysis from experimental facilities. Nicholas has a
PhD from the University of Durham in computational chemistry and has been with
NERSC since 2009.
Robert Henschel from Indiana University gave an invited talk titled “The
SPEC ACCEL Benchmark – Results and Lessons Learned.” Robert Henschel is the
director of Research Software and Solutions at Indiana University. He is responsible for
providing advanced scientific applications to researchers at Indiana University and
national partners as well as providing support for computational research to the Indiana
University School of Medicine. Henschel serves as the chair of the Standard Perfor-
mance Evaluation Corporation (SPEC) High-Performance Group and in this role leads
the development of production quality benchmarks for HPC systems. He also serves as
the treasurer of the OpenACC organization. Henschel has a deep background in HPC
and his research interests focus on performance analysis of parallel applications.
The workshop concluded with a panel “Convergence, Divergence, or New
Approaches? – The Future of Software-Based Abstractions for Heterogeneous
Supercomputing” moderated by Fernanda Foertter from NVIDIA. The panelists
included:
– Christian Trott, Sandia National Laboratories, USA
– Michael Wolfe, Nvidia, USA
– Jack Deslippe, Lawrence Berkeley National Laboratory, USA
– Jeff Hammond, Intel, USA
– Johannes Doerfert, Argonne National Laboratory, USA
Based on rigorous reviews and ranking scores of all papers reviewed, the following
paper won the Best Paper Award. The authors of the Best Paper Award also included
reproducibility results in their paper, which the WACCPD workshop organizers had
indicated as a criterion for eligibility to compete for the Best Paper Award.
– Hongzhang Shan and Zhengji Zhao from Lawrence Berkeley National Laboratory,
and Marcus Wagner from Cray: “Accelerating the Performance of Modal Aerosol
Module of E3SM Using OpenACC”
Emphasizing the importance of using directives for legacy scientific applications,
the keynote/invited speakers, panelists, and Best Paper Award winners were each given a
book on “OpenACC for Programmers: Concepts & Strategies.”

April 2020 Sandra Wienke


Sridutt Bhalachandra
Organization

Steering Committee
Barbara Chapman Stony Brook University, USA
Duncan Poole OpenACC, USA
Kuan-Ching Li Providence University, Taiwan
Oscar Hernandez ORNL, USA
Jeffrey Vetter ORNL, USA

Program Co-chairs
Sandra Wienke RWTH Aachen University, Germany
Sridutt Bhalachandra Lawrence Berkeley National Laboratory, USA

Publicity Chair
Neelima Bayyapu NMAM Institute of Technology, Karnataka, India

Web Chair
Shu-Mei Tseng University of California, Irvine, USA

Program Committee
Adrian Jackson Edinburgh Parallel Computing Centre,
University of Edinburgh, UK
Andreas Herten Forschungszentrum Jülich, Germany
Arpith Jacob Google, USA
Cheng Wang Microsoft, USA
Christian Iwainsky Technische Universität Darmstadt, Germany
Christian Terboven RWTH Aachen University, Germany
Christopher Daley Lawrence Berkeley National Laboratory, USA
C. J. Newburn NVIDIA, USA
David Bernholdt Oak Ridge National Laboratory, USA
Giuseppe Congiu Argonne National Laboratory, USA
Haoqiang Jin NASA Ames Research Center, USA
Jeff Larkin NVIDIA, USA
Kelvin Li IBM, USA
Manisha Gajbe Intel, USA
Michael Wolfe NVIDIA/PGI, USA
Ray Sheppard Indiana University, USA
Ron Lieberman AMD, USA

Ronan Keryell Xilinx, USA


Seyong Lee Oak Ridge National Laboratory, USA
Simon Hammond Sandia National Laboratories, USA
Sameer Shende University of Oregon, USA
Thomas Schwinge Mentor Graphics, Germany
Tom Scogland Lawrence Livermore National Laboratory, USA
William Sawyer Swiss National Supercomputing Centre, Switzerland

Held in conjunction with the International Conference for High Performance
Computing, Networking, Storage and Analysis (SC 2019), Denver, USA.
Contents

Porting Scientific Applications to Heterogeneous Architectures Using Directives

GPU Implementation of a Sophisticated Implicit Low-Order Finite Element
Solver with FP21-32-64 Computation Using OpenACC . . . . . . . . . . . . . . . 3
Takuma Yamaguchi, Kohei Fujita, Tsuyoshi Ichimura, Akira Naruse,
Maddegedara Lalith, and Muneo Hori

Acceleration in Acoustic Wave Propagation Modelling Using
OpenACC/OpenMP and Its Hybrid for the Global Monitoring System . . . . . . 25
Noriyuki Kushida, Ying-Tsong Lin, Peter Nielsen, and Ronan Le Bras

Accelerating the Performance of Modal Aerosol Module of E3SM
Using OpenACC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Hongzhang Shan, Zhengji Zhao, and Marcus Wagner

Evaluation of Directive-Based GPU Programming Models on a Block
Eigensolver with Consideration of Large Sparse Matrices . . . . . . . . . . . . . 66
Fazlay Rabbi, Christopher S. Daley, Hasan Metin Aktulga,
and Nicholas J. Wright

Directive-Based Programming for Math Libraries

Performance of the RI-MP2 Fortran Kernel of GAMESS on GPUs
via Directive-Based Offloading with Math Libraries . . . . . . . . . . . . . . . . 91
JaeHyuk Kwack, Colleen Bertoni, Buu Pham, and Jeff Larkin

Performance Portability for Heterogeneous Architectures

Performance Portable Implementation of a Kinetic Plasma Simulation
Mini-App . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Yuuichi Asahi, Guillaume Latu, Virginie Grandgirard, and Julien Bigot

A Portable SIMD Primitive Using Kokkos for Heterogeneous Architectures . . . 140
Damodar Sahasrabudhe, Eric T. Phipps, Sivasankaran Rajamanickam,
and Martin Berzins

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165


Porting Scientific Applications to
Heterogeneous Architectures Using
Directives
GPU Implementation of a Sophisticated
Implicit Low-Order Finite Element Solver
with FP21-32-64 Computation Using
OpenACC

Takuma Yamaguchi1(B), Kohei Fujita1,2, Tsuyoshi Ichimura1, Akira Naruse3,
Maddegedara Lalith1, and Muneo Hori4
1 The University of Tokyo, Yayoi, Bunkyo, Tokyo, Japan
  {yamaguchi,fujita,ichimura,lalith}@eri.u-tokyo.ac.jp
2 Center for Computational Science, RIKEN, Minatojima-minamimachi, Chuo, Kobe, Japan
3 NVIDIA Corporation, Akasaka, Minato, Tokyo, Japan
  [email protected]
4 Japan Agency for Marine-Earth Science and Technology, Kanazawa, Yokohama, Kanagawa, Japan
  [email protected]

Abstract. Accelerating applications with portability and maintainability
is one of the big challenges in science and engineering. Previously, we
have developed a fast implicit low-order three-dimensional finite element
solver, which has a complicated algorithm including artificial intelligence
and transprecision computing. In addition, all possible tunings for the
target architecture were implemented; accordingly, the solver has infe-
rior portability and maintainability. In this paper, we apply OpenACC
to the solver. The directive-based implementation of OpenACC enables
GPU computation to be introduced with a smaller developmental cost
even for complex codes. In performance measurements on AI Bridging
Cloud Infrastructure (ABCI), we confirmed that a reasonable speedup
was attained on GPUs: the elapsed time of the entire solver was reduced
to 1/14 of that of the original CPU-based implementation. Our proposed
template to use transprecision computing with
our custom FP21 data type is available to the public; therefore, it can
provide a successful example for other scientific computing applications.

Keywords: OpenACC · Finite element analysis · Conjugate gradient solver ·
Transprecision computing · Lower-precision data types

1 Introduction
Nowadays, computer architectures are becoming increasingly diverse and new
hardware, including heterogeneous systems, is released every year. Software
Electronic supplementary material The online version of this chapter
(https://doi.org/10.1007/978-3-030-49943-3_1) contains supplementary material, which is
available to authorized users.
© Springer Nature Switzerland AG 2020
S. Wienke and S. Bhalachandra (Eds.): WACCPD 2019, LNCS 12017, pp. 3–24, 2020.
https://doi.org/10.1007/978-3-030-49943-3_1

needs to keep up with this rapid development of hardware. Unfortunately, devel-
oping codes for every type of architecture leads to huge developmental costs. In
addition, handling all the maintenance in such a case becomes increasingly dif-
ficult. These two factors in particular have a marked influence on sophisticated
algorithms, which leads to long lines of codes.
Reflecting this situation, OpenACC [20] is in widespread use. OpenACC is a
programming model that offloads computations onto GPUs or multi-core CPUs
by inserting a few directives. Reference [3] demonstrated that simple codes can
easily be ported using OpenACC. For various scientific applications, porting
more complex algorithms to GPUs using OpenACC can be a successful example.
In this paper, we target a finite element analysis. We use implicit time integra-
tion for stability and low-order elements for complicated geometries; therefore,
the code tends to be complex and the performance decreases due to random
memory accesses. This analysis is regarded as a de-facto standard for manufac-
turing and Earth sciences; therefore, its acceleration is beneficial to these fields.
We demonstrated in WACCPD 2016 [5] and WACCPD 2017 [25] that finite ele-
ment solvers designed for CPU-based computers can be ported using OpenACC
and that such ported solvers exhibit reasonable performances.
Meanwhile, a solver extremely tuned for better performance on GPU-based
supercomputers was proposed [8]. Hereafter, we refer to this solver as the SC18-
GBF solver. It has a sophisticated algorithm including artificial intelligence (AI)
and transprecision computing with lower-precision data types. Moreover, its per-
formance is thoroughly optimized when using specialized hardware in the tar-
geted architecture, e.g., two-way packed half-precision computations on NVIDIA
Tesla V100 GPUs [19]. Therefore, the developed code lacks portability and main-
tainability.
We apply OpenACC to the SC18GBF solver to improve its compatibility for
portability and its performance. We show that our target application achieves
a reasonable speedup with a smaller developmental cost in a directive-based
method even though our solver includes a non-standard data type. Our sample
codes to use the lower-precision data type FP21 are available to the public [26];
thus, it could prove beneficial to other scientific computing applications.
The remainder of this paper is organized as follows. Section 2 describes the
baseline solver on CPU-based computers, and Sect. 3 describes the GPU imple-
mentation with FP21-32-64 data types using OpenACC. In Sect. 4, we show
the effectiveness of our proposed method via performance measurements on
AI Bridging Cloud Infrastructure (ABCI). In addition, we show an application
example on the supercomputer Summit. Section 5 provides our conclusions.

2 Baseline Solver on CPU-based Computers

In this paper, we target a low-order unstructured implicit finite element method


used for solving complex shaped three-dimensional (3D) domains. When solv-
ing this type of problem, solver programs often become complex due to the use
of sophisticated preconditioners in iterative solvers. Further, it is difficult to

attain a good computational performance because the computation of unstruc-
tured elements requires a large amount of random memory accesses. The target
SC18GBF solver is further complicated compared to standard solvers due to its
use of AI and transprecision arithmetic in its preconditioner. Below we pose the
target problem and explain the solver algorithm and its CPU implementation.

2.1 The Target Problem

Earthquake simulations involve large-domain nonlinear time-evolution problems
with locally complex structures. Therefore, we solve the target dynamic nonlin-
ear continuum mechanics problem using a nonlinear dynamic 3D finite element
method with second-order tetrahedral elements because such a method is suit-
able for modeling complex geometries and analytically satisfies the traction-free
boundary condition at the surface. The target equation using the Newmark-β
method (β = 1/4, δ = 1/2) for time integration is

$A_n \delta u_n = b_n$,   (1)

where

$A_n = \frac{4}{dt^2} M + \frac{2}{dt} C_n + K_n$,

$b_n = f_n - q_{n-1} + C_n v_{n-1} + M \left( a_{n-1} + \frac{4}{dt} v_{n-1} \right)$.

Here, δu, u, v, a, q, and f are the incremental displacement, displacement,
velocity, acceleration, internal force, and external force vectors, respectively, M,
C, and K are the consistent mass, damping, and stiffness matrices, respectively,
dt is the time increment, and n is the time step. We use Rayleigh damping [1]
for C. After solving Eq. 1, q, u, v, and a are updated using


$q_n = q_{n-1} + K_n \delta u_n$,
$u_n = u_{n-1} + \delta u_n$,
$v_n = -v_{n-1} + \frac{2}{dt} \delta u_n$,
$a_n = -a_{n-1} - \frac{4}{dt} v_{n-1} + \frac{4}{dt^2} \delta u_n$.   (2)

In summary, the time-history response u_n is computed by repeating the following
steps.

1. Read the boundary conditions.
2. Evaluate Cn and Kn based on the constitutive relationships and the strain
at the time step n − 1.
3. Obtain δun by solving Eq. 1.
4. Update Eq. 2 using δun .
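
To make the control flow of these four steps concrete, a minimal sketch of the implied
time-stepping driver is shown below; all type and function names are hypothetical
placeholders, not the authors' API.

    // Hypothetical driver for the time-history analysis (steps 1-4 above).
    struct State;                                // holds u, v, a, q, C_n, K_n, ...
    void read_boundary_conditions(State&, int);  // step 1
    void evaluate_C_and_K(State&, int);          // step 2: constitutive relations, strain at n-1
    void solve_eq1_for_du(State&, double);       // step 3: obtain delta u_n (dominant cost)
    void update_eq2(State&, double);             // step 4: update q, u, v, and a

    void run_time_history(State& s, int n_steps, double dt) {
      for (int n = 1; n <= n_steps; ++n) {
        read_boundary_conditions(s, n);
        evaluate_C_and_K(s, n - 1);
        solve_eq1_for_du(s, dt);
        update_eq2(s, dt);
      }
    }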

Because most of the computational cost is incurred when solving Eq. 1, we
explain the details of the linear equation solver in the next subsection.

2.2 The Solver Algorithm

Although it is sparse, the symmetric positive definite matrix A in Eq. 1 becomes
large in scale. Therefore, it is difficult to store A or variants of A directly in fast
memory; consequently, matrix-free matrix-vector products are often used in iter-
ative solvers for solving Eq. 1. For example, the PCGE method, which combines a
matrix-free matrix-vector product [24] with 3 × 3 block diagonal preconditioned
conjugate gradient solver is often used. This method solves the entire target
domain uniformly in double precision and, therefore, is robust for solving a wide
range of problems. However, its convergence rate is often slow, which makes
it computationally expensive. The efficiency of the conjugate gradient solver is
improved in the SC18GBF solver by changing the intensity of the computation
according to the mathematical properties of the target problem and, further, by
using AI methods considering the convergence characteristics. Below, we explain
the solver algorithm in detail following Algorithm 1.

1. Use of an adaptive conjugate gradient method
We first use an adaptive conjugate gradient method [6]. Instead of using a
fixed matrix approximating the inverse matrix A−1 in the preconditioner
of each conjugate gradient iteration, the preconditioning equation z = Ar
is solved using another conjugate gradient solver. We refer to the solving
of the preconditioning equation as the inner iteration (Algorithm 1, lines 5–
17), while we refer to the original conjugate gradient iteration as the outer
iteration (Algorithm 1, lines 18–28). By setting suitable thresholds for the
tolerances of the preconditioning solvers, we can shift most of the computa-
tional cost to the inner iterations. Because the preconditioning equation only
needs to be roughly solved, this allows for flexibility in the algorithm design
combining different methods with varying accuracies and computational costs
in the preconditioner.
2. Use of AI in a preconditioner
Data analytics exemplified by AI is often faster in inference than equation-
based methods; however, its accuracy is often not as high [11]. Therefore, the
direct use of data analytics in equation-based methods may lead to a degrada-
tion of the accuracy of the result or a divergence in the solution. Therefore, to
use AI in linear equation solvers, an algorithm design considering the solver
robustness is required. Here we focus on the heterogeneity of the convergence
characteristics of a target matrix A; that is, we develop a preconditioner
algorithm that can coarsen or refine the solving process according to the con-
vergence characteristics at each local domain and consider using AI to guess
these convergence characteristics. Using this approach, even if the inference
of the convergence characteristics is not perfectly accurate, it is only used in
the preconditioner; therefore, the robustness of the solver and the accuracy
of the solution are maintained and only the computational performance is
affected.
First, in preparation for the training with AI, we uniformly coarsen the target
second-order tetrahedral finite element model (FEMmodel shown in Fig. 1a)

Algorithm 1. SC18GBF solver algorithm for solving Ax = b on FEMmodel.
The matrix-vector product $Ay = (\frac{4}{dt^2} M + \frac{2}{dt} C + K) y$ is computed using matrix-free
matrix-vector products (i.e., the element-by-element method): $\sum_{i}^{N} (\frac{4}{dt^2} M_i + \frac{2}{dt} C_i + K_i) y_i$,
where dt is the time increment, M, C, and K are the consistent mass, damping, and
stiffness matrices, respectively, and subscript i indicates the i-th element. diag[ ], (¯),
and ϵ indicate the 3×3 block Jacobi of [ ], a single-precision variable, and the tolerance
for the relative error, respectively. ( )_c and ( )_cp indicate calculations related to
FEMmodelc and FEMmodelcp, respectively, while the others are the related calculations
on FEMmodel. P̄ is a mapping matrix from FEMmodelc to FEMmodel, defined by
interpolating the displacement in each element of FEMmodelc. p, q, r, and z are
temporary vectors, α, β, and ρ are scalars in the conjugate gradient method, and i
denotes the iteration number.
1: r ⇐ b − Ax, where x is the initial solution
2: β ⇐ 0, i ⇐ 1
3: (* outer loop start *)
4: while ‖r‖₂/‖b‖₂ ≥ ϵ do
5:   (* inner loop start *)
6:   r̄ ⇐ r
7:   z̄ ⇐ diag[A]⁻¹ r
8:   r̄c ⇐ P̄ᵀ r̄
9:   z̄c ⇐ P̄ᵀ z̄
10:  Solve r̄c = Āc z̄c (* PreCGc: solved on FEMmodelc by PCGE with ϵ_in^c and initial solution z̄c *)
11:  Extract z̄cp from z̄c and r̄cp from r̄c
12:  Solve r̄cp = Ācp z̄cp (* PreCGcp: solved on FEMmodelcp by PCGE with ϵ_in^cp and initial solution z̄cp, with Dirichlet boundary condition of z̄c at the boundary *)
13:  Update z̄c with z̄cp
14:  z̄ ⇐ P̄ z̄c
15:  Solve r̄ = Ā z̄ (* PreCG: solved on FEMmodel by PCGE with ϵ_in and initial solution z̄ *)
16:  z ⇐ z̄
17:  (* inner loop end *)
18:  if i > 1 then
19:    β ⇐ (z, q)/ρ
20:  end if
21:  p ⇐ z + βp
22:  q ⇐ Ap (* computed by matrix-free matrix-vector multiplication *)
23:  ρ ⇐ (z, r)
24:  α ⇐ ρ/(p, q)
25:  q ⇐ −αq
26:  r ⇐ r + q
27:  x ⇐ x + αp
28:  i ⇐ i + 1
29: end while
30: (* outer loop end *)

Algorithm 2. Standard time-integration algorithm for solving A_i x_i = b_i (i = 0, ..., n − 1).
Values with over bars (¯) indicate approximate values, while values without over bars
indicate exact values.
1: Set x_{−1} ⇐ 0
2: for i = 0; i < n; i = i + 1 do
3:   Guess x̄_i using a standard predictor
4:   Set A_i and b_i using x_{i−1}
5:   Solve A_i x_i = b_i with error tolerance |A_i x_i − b_i|/|b_i| ≤ ϵ using initial solution x̄_i:
     computed using an iterative solver with the matrix-free matrix-vector multiplication
     kernel (1 vector)
6: end for

Algorithm 3. Time-parallel time-integration algorithm for solving A_i x_i = b_i (i = 0, ..., n − 1).
Values with over bars (¯) indicate approximate values, while values without over bars
indicate exact values. Algorithm 1 is used to solve m sets of linear systems of equations
in line 9 in parallel.
1: Set x_{−1} ⇐ 0 and x̄_i ⇐ 0 (i = 0, ..., m − 2)
2: for i = 0; i < n; i = i + 1 do
3:   Guess x̄_{i+m−1} using a standard predictor
4:   Set A_i and b_i using x_{i−1}
5:   Ā_i ⇐ A_i
6:   b̄_i ⇐ b_i
7:   while |A_i x̄_i − b_i|/|b_i| > ϵ do
8:     Guess Ā_j and b̄_j using x̄_{j−1} (j = i + 1, ..., i + m − 1)
9:     Refine the solutions of Ā_j x̄_j = b̄_j with initial solutions x̄_j (j = i, ..., i + m − 1):
       computed using an iterative solver with the matrix-free matrix-vector multiplication
       kernel (m vectors)
10:  end while
11:  x_i ⇐ x̄_i
12: end for

using a geometric multi-grid [21] to obtain a first-order tetrahedral finite ele-
ment model (FEMmodelc shown in Fig. 1b). Next, we obtain the error history
distribution of a small-scale problem with similar characteristics to the target
large-scale problem using a standard PCGE solver. Using this error distribu-
tion data, we train an artificial neural network (ANN) that inputs mesh infor-
mation at a target node (i.e., the element connectivity, material property, and
element sizes) and outputs the level of error at that node. Using this ANN, we
infer the error levels at each node of the large-scale target problem using the
element connectivity, material property, and element size as input. The nodes
that are guessed to have large error levels (i.e., bad convergence) are included
in FEMmodelcp , as shown in Fig. 1c. We use a solver on FEMmodelcp (Algo-
rithm 1, lines 11–13) to refine the rough solution obtained by the solver on the
uniformly coarsened FEMmodelc (Algorithm 1, line 10) in the preconditioner.

Fig. 1. Extraction of part of the problem having bad convergence using AI: (a) the whole
target problem (FEMmodel, with second-order tetrahedral elements), containing buildings,
an underground structure, and soft, medium, and hard ground layers; (b) the geometrically
coarsened whole target problem (FEMmodelc, with first-order tetrahedral elements); and
(c) the extracted part of the city model with bad convergence (FEMmodelcp, with first-order
tetrahedral elements). FP64 is used for the computation and communication of the outer
loop (low-precision data types are used for PreCG), while low-precision data types are used
for the computation and communication of PreCGc and PreCGcp. The domain is partitioned
among MPI ranks (rank #0, rank #1).

Finally, we map this result to a second-order finite element model and use it
as an initial solution for the solver on FEMmodel (Algorithm 1, line 15), and
further use the results for the search direction z in the outer iteration.
By setting the tolerance of each preconditioning solver to a suitable value, we
can solve parts of the problem with bad convergence extensively while solv-
ing most of the problem with good convergence less extensively. This leads
to a reduction in the computational cost compared to a solver that solves
the entire domain uniformly. Even if the selection of FEMmodelcp by ANN is
slightly altered, the effects are absorbed by the other preconditioning solvers
(PreCGc and PreCG); therefore, the solver becomes highly robust.
The training and reference of the AI for extracting FEMmodelcp are con-
ducted offline using commercial neural network packages on a few GPUs, and
are conducted only once prior to the time-history earthquake simulation.
3. Use of low-precision arithmetic in the preconditioner
While the solution of the entire solver is required in double precision, we
can use transprecision computing [15] in the preconditioner because it is only
used to obtain rough solutions. We can use not only FP32 but also other data
types, such as FP21, which has an intermediate range and the accuracy of
FP32 and FP16 to reduce the data transfer cost and the memory footprint.
As mentioned later, all vectors can be in FP21 on CPUs while FP32 must be
used for some vectors on GPUs. The introduction of FP21 data types in both
CPU and GPU implementations makes maintenance of the entire code and
performance evaluation more complex; thus, we use the custom data type only
in the GPU implementation for simplicity.
4. Use of time-parallel time integration in the solver
Although AI with a transprecision-computing solver appears to be highly
complicated, it is merely a combination of conjugate gradient-based solvers
solved using simple PCGE methods. Therefore, the majority of its compu-
tational costs consist of matrix-vector products. However, because the com-

putation of matrix-vector products becomes dominated by random memory
accesses in unstructured finite element methods, it has become difficult to
attain high performance on recent computational architectures regardless of
the use of CPU or GPU. Using the fact that mesh connectivity is invari-
able in the time domain even for an unstructured finite element method, the
SC18GBF solver uses a time-parallel time-integration algorithm [7] to improve
the computational efficiency. In standard time integration, each step is solved
step by step (Algorithm 2), while in the time-parallel solver, several steps,
including future time steps, are solved simultaneously (Algorithm 3). When
indicating the number of time steps solved simultaneously as m, the arith-
metic count for computing a single iteration of the iterative solver becomes
m times that of a standard solver. However, the results obtained by the
time-parallel solver can be used as high-precision initial solutions for future
time steps; therefore, the total number of iterations is reduced to approximately
1/m of that of a standard solver. Accordingly, the total arithmetic count becomes
approximately the same as that of a standard time-integration method. The advantage
of using a time-parallel method is that random accesses are reduced to 1/m of those
of a standard solver by placing time-directional nodal variables
consecutively in memory. This leads to the efficient use of single-instruction
multiple-data (SIMD) units, which leads to a short time-to-solution for the
entire solver. Typically, m = 4 is used because enlarging m leads to an increase
in the total arithmetic count due to the degradation in the prediction accuracy
of future time steps.

Because the approximated methods are only used in the preconditioner or are
used to obtain the initial solutions of the iterative solver, the obtained solution
δu_i (i = 1, 2, ...) is the same as that of the double-precision PCGE method within
the solver error tolerance ϵ. Further, because most of the computational cost is
in matrix-vector products, we can maintain load balance by allocating an equal
number of elements to each process/thread, which leads to high scalability for
large-scale systems.

2.3 Implementation of Solver for CPU Systems


Because the innermost loop of the solver becomes the length m = 4 with con-
secutive data accesses, the current algorithm can be implemented using packed
SIMD units with width 4. Furthermore, for systems with AVX-512 instruction
units, loop blocking and splitting are applied for use of the 8-wide FP64 and
the 16-wide FP32 SIMD units in the computation of matrix-vector products.
We avoid data recurrence in multi-core computation of matrix-vector products
by coloring of elements for each core. See Ref. [4] for details of the SIMD and
multi-core implementation of matrix-vector products. This leads to an imple-
mentation of the solver with most of its computation using SIMD instructions
on multi-cores. For simplicity of implementation, we use FP32 for the compu-
tations and communication in the inner loop solvers (PreCGc, PreCGcp, and
PreCG) in the CPU version.
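
As an illustration of this memory layout only (a minimal sketch with hypothetical array
names, using OpenMP directives for illustration; the actual solver applies the same idea
inside the matrix-vector product with further AVX-512 loop blocking), a vector update
over m = 4 consecutively stored time-step values maps directly onto 4-wide SIMD units:

    #include <stddef.h>

    /* x <- x + alpha * y for m = 4 time-step vectors stored consecutively per
       degree of freedom; the contiguous innermost loop becomes one SIMD operation. */
    void axpy_time_parallel(size_t n_dof, float alpha,
                            const float (*restrict y)[4], float (*restrict x)[4]) {
      #pragma omp parallel for
      for (size_t i = 0; i < n_dof; ++i) {
        #pragma omp simd
        for (int t = 0; t < 4; ++t)
          x[i][t] += alpha * y[i][t];
      }
    }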

Using the SC18GBF solver algorithm, the FLOP count is reduced 5.56-
fold compared to the standard PCGE solver for an earthquake wave propa-
gation problem in a ground region with a buried concrete structure. Because
mixed-precision arithmetic and highly efficient SIMD arithmetic can be used,
we expected an additional speedup from the reduction in the arithmetic count.
Indeed, we obtained a 9.09-fold speedup over the PCGE method [8] when mea-
sured on the CPU-based K computer system [18].

3 GPU Implementation Using OpenACC

Our solver algorithm, as described in the previous section, is suitable not only for
CPUs but also for GPUs. For example, the introduction of time-parallel compu-
tation circumvents random accesses to the global vector in a matrix-vector multi-
plication kernel, which greatly improves the performance on GPUs. In addition,
PreCGcp computation can reduce the data transfer size as well as the compu-
tational amount; accordingly, this solver is appropriate for GPUs because data
transfer is a major bottleneck in GPU computations. We assume that our solver
will be accelerated even by a straightforward implementation of GPU compu-
tations. In this section, we first describe a baseline OpenACC implementation
and then optimize its performance using lower-precision data types and other
tunings.

3.1 Baseline Implementation

We apply OpenACC to our CPU-based code following the general procedures
given below.

1. Define where to apply OpenACC
In our solver, all computations are performed for each node or each element
and are easily parallelized by GPUs. Therefore, we target the entire solver to
be ported to the GPUs. Conversely, the training and reference of the AI for
extracting FEMmodelcp are conducted only once and their computational cost is
negligible. Accordingly, we do not port this part of the code.
2. Insert directive to parallelize loops
We can compute targeting loops on GPUs by adding the corresponding direc-
tives, as shown in Fig. 2. OpenACC has three levels of parallelism: gang,
worker, and vector. On NVIDIA GPUs, gang and vector correspond to block
and thread, respectively, and usually the worker level is ignored. We insert
directives so that the expected granularity of the parallelization can be
attained. Figure 2 describes an outline of the implementation in a matrix-
vector multiplication kernel. Loops for elements and time steps are collapsed
to enable further parallelism. Each thread on the NVIDIA GPU is assigned to
one element and its element-wise results are added to the global vector, which
may cause data race conditions between threads. A previous study [5] showed
that addition via atomic operations is much faster than explicit reordering

Fig. 2. Porting example of the matrix-vector multiplication kernel on a tetrahedral
second-order mesh.

Fig. 3. Example code for data transfer in a conjugate gradient loop.

via coloring; therefore, we use atomic operations for this part. As shown in
Fig. 2, we can enable atomic operations by adding the #pragma acc atomic
directive; a simplified sketch of such a kernel is given after this list.
3. Control data transfer between CPUs and GPUs
Without explicit instructions, OpenACC automatically transfers the neces-
sary data from the CPUs to the GPUs prior to the GPU computation and
from the GPUs to the CPUs following the GPU computation to obtain the
expected results. When data are transferred too frequently, the performance
greatly diminishes; therefore, we add directives to control the data transfer,
as described in Fig. 3, to minimize these data transfer costs.
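
Because Fig. 2 is reproduced as an image in the original paper, the following simplified
sketch only illustrates the pattern it describes: the element and time-step loops are
collapsed into one parallel loop, and element-wise results are accumulated into the global
vector with atomic updates. All array names are hypothetical, the element matrices are
assumed precomputed (the actual solver generates them on the fly in a matrix-free
manner), and a scalar degree of freedom per node is used for brevity.

    /* Simplified element-by-element matrix-vector product in OpenACC.
       cny[e*NN+k]       : global node index of local node k of element e
       Ke[(e*NN+i)*NN+j] : dense NN x NN element matrix (scalar DOF for brevity)
       u, r              : input/output vectors, 4 time steps stored per node   */
    #define NN 10   /* nodes per second-order tetrahedral element */

    void ebe_matvec(int n_elem, const int *restrict cny, const float *restrict Ke,
                    const float (*restrict u)[4], float (*restrict r)[4]) {
      #pragma acc parallel loop collapse(2) present(cny, Ke, u, r)
      for (int e = 0; e < n_elem; ++e) {
        for (int t = 0; t < 4; ++t) {                  /* collapsed with element loop */
          float ue[NN], re[NN];
          for (int k = 0; k < NN; ++k)                 /* gather nodal values        */
            ue[k] = u[cny[e * NN + k]][t];
          for (int i = 0; i < NN; ++i) {               /* element matrix-vector part */
            float s = 0.0f;
            for (int j = 0; j < NN; ++j) s += Ke[(e * NN + i) * NN + j] * ue[j];
            re[i] = s;
          }
          for (int i = 0; i < NN; ++i) {               /* scatter-add to global vector */
            #pragma acc atomic update
            r[cny[e * NN + i]][t] += re[i];
          }
        }
      }
    }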

Fig. 4. Example code for point-to-point communication.

In addition, the original code is designed for MPI parallelization, which allows us to
use multiple GPUs and assign one GPU to each MPI process. Point-to-point com-
munication requires data transfer between GPUs; we use GPUDirect. We issue
MPI_Isend/MPI_Irecv to access GPU memory directly by adding the corresponding
directives, as shown in Fig. 4.
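
Figures 3 and 4 are likewise reproduced as images; the sketch below (hypothetical
names, simplified to a single neighbor) combines the two patterns they describe: a
structured data region keeps the solver vectors resident on the GPU across iterations,
and host_data use_device passes device addresses directly to MPI_Isend/MPI_Irecv so
that GPUDirect can be used.

    #include <mpi.h>

    /* Hypothetical names: u, r, p, q are solver vectors; sendbuf/recvbuf are
       halo-exchange buffers already allocated on the host.                  */
    void cg_with_device_data(float *u, float *r, float *p, float *q, int n,
                             float *sendbuf, float *recvbuf, int n_halo,
                             int neighbor, MPI_Comm comm) {
      #pragma acc data copy(u[0:n], r[0:n]) create(p[0:n], q[0:n], \
                            sendbuf[0:n_halo], recvbuf[0:n_halo])
      {
        /* ... conjugate gradient iterations run here with OpenACC loops,
               without host-device transfers of the large vectors ...        */

        MPI_Request req[2];
        #pragma acc host_data use_device(sendbuf, recvbuf)
        {
          /* device addresses are handed to MPI, enabling GPUDirect */
          MPI_Isend(sendbuf, n_halo, MPI_FLOAT, neighbor, 0, comm, &req[0]);
          MPI_Irecv(recvbuf, n_halo, MPI_FLOAT, neighbor, 0, comm, &req[1]);
        }
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
      }
    }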
We refer to these implementations as the baseline OpenACC implementation.
To improve the performance, we introduce lower-precision data types and modify
a few parts of the code that would otherwise degrade the performance.

3.2 Introduction of Lower-Precision Data Types

Our proposed solver can introduce transprecision computing to preconditioning
conjugate gradient solvers. These solvers include many memory-bound compu-
tations. Therefore, we can reduce the computational cost simply by reducing the
number of bits in each variable and reducing the footprint. In CPU-based imple-
mentations, a single-precision data type (FP32) is used in P reCGc , P reCGcp ,
and P reCG. Typical floating-point numbers including FP32 are standardized in
IEEE 754 as x = (−1)sign × (1.f raction) × 2exponent−bias for normalized num-
bers [13]. The sign bit determines the sign of the number, the exponent width
influences the dynamic range of the number, and the fraction width defines the
accuracy of the data type. Recently, data types with lower precision than FP32
have become widely supported on various types of hardware. The half-precision
number, FP16, is a major example of such data types. It shortens the number of
exponent bits and fraction bits compared to FP32 data types. It is not difficult
to use FP16 for applications that do not require very high accuracy, e.g., deep
learning [14]; however, using it for general scientific computations is challenging
due to its narrow dynamic range. For our iterative solver, more exponent bits
are required. Another data type, bfloat16, was proposed in Ref. [23]. It has the
same width of exponent bits as FP32; therefore, it can avoid overflow/underflow
in more general computations. However, it cuts the fraction down to
only 7 bits; accordingly, its machine epsilon becomes $1/2^7 = 1/128$. This low
accuracy may lead to poor convergence.

Fig. 5. Bit alignments for the sign, exponent, and fraction parts in each data type
(each cell in the original figure describes one bit):
  FP32 (single precision), 32 bits: 1-bit sign + 8-bit exponent + 23-bit fraction
  FP21, 21 bits:                    1-bit sign + 8-bit exponent + 12-bit fraction
  bfloat16, 16 bits:                1-bit sign + 8-bit exponent + 7-bit fraction
  FP16 (half precision), 16 bits:   1-bit sign + 5-bit exponent + 10-bit fraction

Therefore, we define our custom 21-bit data type in Fig. 5. Hereafter, we refer
to this data type as FP21. FP21 has the advantage of the same dynamic range
as FP32 and bfloat16 and a better accuracy than FP16 or bfloat16. In addition,
the border between the sign bit and exponent bits and the border between the
exponent bits and fraction bits in FP21 are the same as those in FP32 num-
bers; therefore, conversions between FP21 and FP32 are easier than conversions
between other combinations of data types. To facilitate the bit operations, we
store three FP21 numbers in one component of the 64-bit arrays, leaving 1 bit unused.
Our proposed data type is not supported on our targeted hardware; therefore,
we use it only when storing into memory. We convert the FP21 data types into
FP32 prior to computation in FP32 and convert the results in FP32 into FP21
numbers following the computation. Figure 6 shows an implementation of the
data type conversion. Only addition or subtraction operations and bit opera-
tions are required for this conversion, and they can be implemented entirely
within OpenACC. If these functions are called with stack frames, they decrease
the performance. Therefore, they have to be inlined in all related computations.
When we convert FP32 data types into FP21, we can remove the lower 11-bits
in the fraction parts; however, rounding to the nearest number can halve the
rounding error compared to dropping the lower-bits. We obtain rounded num-
bers as follows. First, we remove the last 11 bits of the original FP32 number a
and obtain the FP21 number ā. Then, we can obtain the result by removing the
last 11 bits of a + (a − ā) in FP32.
Here, we are targeting a 3D problem; therefore, we have three components,
x, y, and z, per node. Using FP21 for this problem enables us to assign one
component of the 64-bit arrays to one node, holding its x, y, and z components
in FP21; therefore, we can easily handle memory access to the FP21 numbers.
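
The sketch below shows one possible realization of these conversions (our own naming,
not the authors' Fig. 6 code): the top 21 bits of an FP32 word are kept (sign, 8 exponent
bits, 12 fraction bits), and the three nodal components are packed into one 64-bit word.
For brevity it truncates the lower 11 fraction bits instead of applying the round-to-nearest
refinement described above.

    #include <stdint.h>
    #include <string.h>

    /* Bit-cast helpers (names are ours, not the authors'). */
    static inline uint32_t f32_bits(float a) { uint32_t u; memcpy(&u, &a, 4); return u; }
    static inline float    bits_f32(uint32_t u) { float a; memcpy(&a, &u, 4); return a; }

    /* FP32 -> FP21: keep sign, 8 exponent bits, and the top 12 fraction bits. */
    static inline uint32_t fp21_from_fp32(float a) { return f32_bits(a) >> 11; }

    /* FP21 -> FP32: restore the dropped low fraction bits as zeros. */
    static inline float fp32_from_fp21(uint32_t p) { return bits_f32(p << 11); }

    /* Pack the x/y/z components of one node into a single 64-bit word
       (3 x 21 bits = 63 bits; the remaining bit is left unused).       */
    static inline uint64_t fp21x3_pack(float x, float y, float z) {
      return ((uint64_t)fp21_from_fp32(x) << 42) |
             ((uint64_t)fp21_from_fp32(y) << 21) |
              (uint64_t)fp21_from_fp32(z);
    }
    static inline void fp21x3_unpack(uint64_t w, float *x, float *y, float *z) {
      *x = fp32_from_fp21((uint32_t)(w >> 42) & 0x1FFFFFu);
      *y = fp32_from_fp21((uint32_t)(w >> 21) & 0x1FFFFFu);
      *z = fp32_from_fp21((uint32_t) w        & 0x1FFFFFu);
    }

When such helpers are used inside OpenACC device loops, they must be inlined (e.g.,
declared with #pragma acc routine seq and forced inline) so that no stack frames are
created, as noted above.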

Fig. 6. Mock code for the FP21 implementation. These functions convert FP21
numbers into FP32 numbers and are in-line for all computations requiring FP21
computations.

Note that atomic operations used in matrix-free matrix-vector multiplication
are supported only for FP16/32/64 and that the output vector of this kernel must
be in FP32. Therefore, vectors in FP21 and FP32 are mixed in the precondi-
tioning solvers.

3.3 Miscellaneous Optimizations in the Solver

The introduction of FP21 data types is expected to reduce the computational
time of memory-bound computations compared to the baseline implementation
using OpenACC; however, our solver algorithm includes several operations that
perform much worse in OpenACC than in a low-level description, e.g.,
CUDA. We avoid this performance decrease via the following modifications.

1. Dot product targeting multiple vectors
Originally, dot products could be computed on OpenACC by adding the
option reduction to the loop directive #pragma acc loop. However, the
current version of OpenACC does not allow us to specify arrays for the target
of the reduction, which prevents the parallelization of the inner loops for
four time steps. We can compute dot products by creating multiple scalar
variables and corresponding loops, as described in Fig. 7. However, such an
implementation leads to strides in the memory accesses and a decline in the

Fig. 7. Example code for computing dot products for multiple vectors in OpenACC.

Fig. 8. Example code to call the dot product kernel in CUDA from the OpenACC
codes.

performance. Therefore, we use a CUDA kernel to compute dot products. We
can call a CUDA-based kernel from the OpenACC-based code via a wrapper,
as shown in Fig. 8 (a sketch of this pattern follows this list), and improve the
performance of this computation.
2. Overheads for launching kernels
OpenACC has larger overheads for launching kernels than CUDA. The
degrees of freedom in PreCGc and PreCGcp in our solver become smaller
than in the original problem; therefore, the relative overhead cost increases for
computations with shorter loop lengths. To reduce overhead costs, we modify
several kernels. In particular, we add options #pragma acc async(1) and
#pragma acc wait(1) for kernels that can be computed asynchronously to
overlap the overhead cost. Moreover, local arrays in OpenACC loops are some-
times stored in local memory instead of in registers. When local memory
is used, memory allocation is required and this increases the overhead for
launching kernels; therefore, we redefine these local arrays as scalar variables.
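
Since Figs. 7 and 8 are reproduced as images, the following sketch shows the wrapper
pattern they describe under our own naming: a CUDA reduction kernel is compiled
separately with nvcc and called from the OpenACC code, which passes device addresses.
The kernel below reduces a single pair of vectors; the solver's actual kernel reduces the
three components and four time-step values per node.

    // dot_cuda.cu -- compiled with nvcc and linked with the OpenACC code.
    #include <cuda_runtime.h>

    __global__ void dot_kernel(const float *x, const float *y, double *result, int n) {
      __shared__ double buf[256];
      double acc = 0.0;
      for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
           i += gridDim.x * blockDim.x)
        acc += (double)x[i] * (double)y[i];           // grid-stride partial sums
      buf[threadIdx.x] = acc;
      __syncthreads();
      for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // block-level tree reduction
        if (threadIdx.x < s) buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
      }
      if (threadIdx.x == 0) atomicAdd(result, buf[0]); // FP64 atomicAdd (sm_60+)
    }

    extern "C" void dot_gpu(const float *x, const float *y, double *result, int n) {
      cudaMemset(result, 0, sizeof(double));
      dot_kernel<<<160, 256>>>(x, y, result, n);
    }

On the OpenACC side, the device addresses of the already-present vectors are passed by
surrounding the call with #pragma acc host_data use_device(x, y, result), and the
scalar is copied back afterwards (e.g., with #pragma acc update self) if it is needed on
the host.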
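For the kernel-launch overhead, a minimal sketch of the async/wait pattern mentioned
above (hypothetical kernel bodies): queuing small kernels on the same asynchronous
queue lets the host issue the next launch immediately, hiding launch latency behind the
execution of the preceding kernel.

    void small_vector_ops(int n, float *restrict a, const float *restrict b,
                          float *restrict c, const float *restrict d) {
      #pragma acc parallel loop async(1) present(a, b)
      for (int i = 0; i < n; ++i) a[i] += b[i];   /* small kernel, queued on stream 1     */

      #pragma acc parallel loop async(1) present(c, d)
      for (int i = 0; i < n; ++i) c[i] *= d[i];   /* launched immediately, no host wait   */

      #pragma acc wait(1)                         /* synchronize once, before results are used */
    }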

4 Performance Measurement
In this section, we evaluate the performance of our proposed solver using GPU-
based supercomputer ABCI [2], which is operated by the National Institute of

Advanced Industrial Science and Technology. Each compute node of ABCI has
four NVIDIA Tesla V100 GPUs and two Intel Xeon Gold 6148 CPUs (20 cores).
Its peak performance in double precision is 7.8 TFLOPS × 4 = 31.2 TFLOPS
on the GPUs and 1.53 TFLOPS × 2 = 3.07 TFLOPS on the CPUs. In addition,
its theoretical memory bandwidth is 900 GB/s × 4 = 3600 GB/s on the GPUs
and 128 GB/s × 2 = 256 GB/s on the CPUs. The GPUs in each compute node
are connected via NVLink, with a bandwidth of 50 GB/s in each
direction.
We generated a finite element model assuming a small-scale city problem. The
problem settings were nearly the same as those of our previous performance mea-
surement in Ref. [8] except for the domain size and the number of MPI processes.
The target domain included two soil layers and a layer with material properties
similar to concrete. This problem had 39,191,319 degrees of freedom. In addi-
tion, PreCGcp, PreCGc, and PreCG had 659,544, 5,118,339, and 39,191,319
degrees of freedom, respectively. The target domain was decomposed into four
sub-domains, and four MPI processes were used in the computation. We used 10
OpenMP threads per MPI process when using CPUs so that all CPU cores on
an ABCI compute node were used. We applied semi-infinite absorbing bound-
ary conditions on the sides and bottom of the domain. We can incorporate any
constitutive laws into our proposed solver. Here, we used the modified RO model [9]
and the Masing rule [16]. Kobe waves observed during the 1995 Southern Hyogo
Earthquake [10] were input at the bottom of the model. The time increment
was 0.01 s, and we computed 25 time steps. Convergence in the conjugate
gradient loops was judged using a tolerance value of ϵ = 1.0 × 10⁻⁸. In addition,
the tolerances in PreCGcp, PreCGc, and PreCG were set to 0.05, 0.7, and 0.25,
respectively, according to Ref. [8].

4.1 Performance Evaluation of FP21 Computation

We evaluated the performance of each computation in the solver. The elapsed
time was measured using MPI_Wtime. In this section, we compared the original
CPU-based implementation, the baseline implementation using OpenACC, and
our proposed implementation.
First, we measured the performance of the real Alpha X Plus Y (AXPY) oper-
ation. We extracted a computation in the PreCG solver of x(i) = x(i) + αy(i),
where the arrays x(i) and y(i) are in FP32 or FP21 and the coefficient α
is in FP32. The elapsed times of all the implementations are described in
Table 1. This computation was a memory-bound computation. Given that the
theoretical memory bandwidths of the CPUs and GPUs per MPI process are
63.9 GB/s and 900 GB/s, the expected performance ratio was (CPU):(baseline
OpenACC):(proposed) = 1/(32/63.9):1/(32/900):1/(21/900) = 1:14.1:21.5.
Judging from this ratio, our GPU implementation achieved a reasonable speedup.
In addition, the measured bandwidth was close to the results of another bench-
mark [12]: 900 GB/s × 83.3% = 750 GB/s; therefore, we concluded that our per-
formance was reasonable. In addition, using FP21 variables resulted in a 1.5-fold

speedup; therefore, we confirmed that the computational cost for the data type
conversion was negligible.
Second, we measured the performance of a dot product. The target ker-
nel computes $\alpha = \sum_i \big( (x(1,i) \times y(1,i) + x(2,i) \times y(2,i) + x(3,i) \times y(3,i)) \times z(i) \big)$,
where the arrays x(·, i) and y(·, i) are in FP32 or FP21 and the array
z(i) is in FP32. The expected performance ratio was (CPU):(baseline Ope-
nACC):(proposed) = 1/((32 × 7)/63.9):1/((32 × 7)/900):1/((21 × 6 + 32)/900) =
1:14.1:20.0. Compared to the AXPY kernel, the measured memory bandwidth in
the baseline OpenACC implementation decreased because OpenACC cannot use
the reduction option for arrays and causes stride memory access to the vectors.
Conversely, our proposed implementation with CUDA attained nearly the same
bandwidth as the AXPY kernel.
Finally, we show the performance of the matrix-vector multiplication kernel
in Table 1. The simple implementation and our proposed method obtained 15.0-
fold and 14.8-fold speedups over our CPU-based kernel. The performance of
these kernels on the GPUs reached 4 TFLOPS. The bottleneck of this kernel
is not memory bandwidth but the atomic addition to the global vector and the
element-wise multiplication; therefore, we were unable to observe a significant
difference in the performance even when using FP21 data types for the input
vectors. Regarding this kernel, the data conversion between FP32 and FP21
in our proposed method was a possible reason for the slight performance gap
between these two kernels.

Table 1. Performance of each kernel in the solver.

                                              CPU-based     Baseline OpenACC   Proposed
Precision                                     FP32          FP32               FP32/21
AXPY                   Elapsed time           9.61 ms       0.605 ms           0.401 ms
                       Measured bandwidth     50.2 GB/s     797.1 GB/s         802.2 GB/s
                       Speedup ratio          1             15.8               24.0
Dot product            Elapsed time           6.20 ms       0.456 ms           0.277 ms
                       Measured bandwidth     54.0 GB/s     735.1 GB/s         822.9 GB/s
                       Speedup ratio          1             13.6               22.4
Matrix-vector product  Elapsed time           54.61 ms      3.65 ms            3.69 ms
                       Measured FLOPS         0.27 TFLOPS   4.11 TFLOPS        4.07 TFLOPS
                       per MPI process
                       Speedup ratio          1             15.0               14.8

4.2 Performance Evaluation of the Entire Solver

In this section, we evaluate the elapsed time of the entire solver. We compare
the original CPU-based solver, a solver simply ported using OpenACC, a solver
simply ported using CUDA, our proposed solver based on OpenACC, and the SC18GBF
solver [8]. The SC18GBF solver improved its performance at the cost of portability.
For example, shared memory on the V100 GPU was used to aggregate the element-wise
computation results and reduce the number of atomic operations in the
element-by-element kernel, and two-way packed FP16 computations on the V100 GPU
were also applied. Moreover, matrix-vector multiplication and point-to-point
communication were reordered as described in Ref. [17] so that the computationally
expensive data transfers could be overlapped. The SC18GBF solver, designed for
large-scale computers, further reduced the data transfer cost by splitting the
four time-step vectors into two sets of two vectors and overlapping point-to-point
communications with other vector operations. In this paper, however, we compared
the performance of the solvers using only one compute node. Each GPU in the compute
node is connected via NVLink; therefore, the data transfer cost is lower. Considering
these problem settings, we computed the four time-step vectors without splitting.
In the GPU computations, we used atomic operations when the element-wise results
were added to the global vector; therefore, numerical errors occur due to differences
in the summation order. The final results of the analysis are consistent within
the tolerance of the conjugate gradient solver; however, the number of iterations
in the solver differs every time we run the program. Accordingly, we took the
average of 10 trials for each solver.
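The accumulation pattern responsible for this run-to-run variation is sketched below. The kernel is a minimal stand-in rather than the authors' code; the names, the connectivity layout, and the per-element results are placeholders. It only illustrates how element-wise results are scattered into the shared global vector with atomicAdd in an order that differs between runs.

```cuda
// Placeholder element-by-element accumulation: the order in which threads
// reach atomicAdd is nondeterministic, so floating-point rounding, and hence
// the CG iteration count, can differ slightly from run to run.
__global__ void scatter_add(const float *elem_result,   // per-element nodal values
                            const int *connectivity,    // element -> global node id
                            int nodes_per_elem, int n_elem,
                            float *global_vec)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= n_elem) return;
    for (int a = 0; a < nodes_per_elem; ++a) {
        int node = connectivity[e * nodes_per_elem + a];
        atomicAdd(&global_vec[node], elem_result[e * nodes_per_elem + a]);
    }
}
```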
The elapsed time of each solver is given in Table 2. The solver took 781.8 s
when using only CPUs on an ABCI compute node. In contrast, the simple OpenACC
implementation reduced the computation time to 66.71 s, a speedup ratio of 11.7,
and the simple CUDA implementation took 61.02 s. This performance gap between
OpenACC and CUDA is attributed to the following three factors. The first is the
performance decline in the dot product kernels. The second is that kernels that
perform complex computations and require many variables cause register spilling
under OpenACC, which does not occur in the CUDA implementation. The third is that
OpenACC has a larger overhead for launching each kernel than CUDA, which resulted
in a large gap in PreCGcp. Our proposed solver based on OpenACC used the FP21 data
types and introduced techniques to circumvent the overhead of the OpenACC kernels.
The elapsed time of this solver was 55.84 s; it was 9% faster than the baseline
OpenACC implementation and also faster than the simple implementation using CUDA.
Therefore, we confirmed that the introduction of the FP21 data types was beneficial
in accelerating the solver. Our proposed solver attained approximately 86% of the
SC18GBF solver performance. The performance gap in PreCGcp between our proposed
solver and the SC18GBF solver was larger than those in PreCGc and PreCG. This is
because the number of degrees of freedom in PreCGcp is smaller than in the other
preconditioning solvers and the data transfer cost is relatively higher; this cost
is mostly overlapped in the SC18GBF solver. The performance
of our proposed solver is very good from a practical point of view considering
the portability provided by OpenACC.

Table 2. Elapsed time for the entire solver measured on ABCI. The total elapsed time
includes the output of the analysis results. The performance of the preconditioning
solvers is summarized in the order of their appearance in the CG solver. The numbers
of iterations in each solver are shown in parentheses.

                            CPU-based    Baseline OpenACC  Baseline CUDA  Proposed    SC18GBF
Precision in PreCGc,
PreCGcp, and PreCG          FP32         FP32              FP32           FP32/21     FP32/21/16
PreCGc                      161.4 s      14.89 s           14.21 s        9.79 s      7.47 s
                            (6199)       (6300)            (6210)         (4751)      (4308)
PreCGcp                     69.9 s       15.94 s           12.20 s        13.22 s     8.98 s
                            (28830)      (28272)           (28491)        (28861)     (26887)
PreCG                       372.0 s      22.90 s           22.30 s        18.27 s     16.98 s
                            (2674)       (2729)            (2735)         (2575)      (2797)
CG                          83.9 s       5.77 s            4.57 s         5.89 s      8.32 s
                            (91)         (89)              (89)           (122)       (129)
Other                       94.8 s       7.21 s            7.73 s         8.66 s      5.99 s
Total                       781.8 s      66.71 s           61.02 s        55.84 s     47.75 s
Speedup ratio               1            11.7              12.8           14.0        16.4

When we used lower-precision numbers, e.g., FP16 or FP21, the convergence
characteristics of the solver changed. When we replaced the FP21 computation with
a bfloat16 computation for comparison, the solver failed to converge. These results
indicate that more fraction bits than provided by bfloat16 were required for our
problem settings. A detailed verification of the convergence behavior when using
lower-precision data types will be a future task.
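As a rough check on why bfloat16 falls short, one can compare unit roundoffs; the FP21 fraction width used below (12 explicit fraction bits, with sign and 8 exponent bits) is our assumption for illustration, as the format is not restated in this section:

\[
u = 2^{-(p+1)}:\quad
u_{\mathrm{bfloat16}} = 2^{-8} \approx 3.9\times10^{-3},\quad
u_{\mathrm{FP21}} = 2^{-13} \approx 1.2\times10^{-4},\quad
u_{\mathrm{FP32}} = 2^{-24} \approx 6.0\times10^{-8},
\]

where p is the number of explicit fraction bits (7 for bfloat16, 12 assumed for FP21, 23 for FP32). The roughly 30-fold finer resolution of the assumed FP21 format over bfloat16 is consistent with the observed difference in convergence.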
Finally, we solved problems with complicated geometry comprising the ground and
underground structures, with 16,291,917,564 degrees of freedom and 3,961,851,160
elements, as demonstrated in Ref. [8]. Here we used 384 compute nodes of the
supercomputer Summit [22]. We computed 2,500 time steps with a time increment of
dt = 0.001 s. As shown in Fig. 9, we obtained a displacement distribution that
reflects the complex geometry, demonstrating the practical importance of our method.
Regarding the development cost, the introduction of CUDA required an additional
18,342 lines of code (our original code had 33,527 lines in total), whereas our
OpenACC implementation required the addition of only 9,300 lines of code; therefore,
maintenance of our code is expected to be easier when using OpenACC.