
Masaaki Geshi, Editor

The Art of High Performance Computing for Computational Science, Vol. 2
Advanced Techniques and Examples for Materials Science
Editor
Masaaki Geshi
Osaka University
Toyonaka, Japan

ISBN 978-981-13-9801-8 ISBN 978-981-13-9802-5 (eBook)


https://doi.org/10.1007/978-981-13-9802-5
© Springer Nature Singapore Pte Ltd. 2019
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, expressed or implied, with respect to the material contained
herein or for any errors or omissions that may have been made. The publisher remains neutral with regard
to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Preface

This is the second of two volumes on the basics of parallelization, the foundations of numerical analysis, and related techniques. Although we speak of foundations, we do not assume complete novices in this field; readers who want to learn programming from the very beginning should consult a book suited to that purpose. We assume readers who have studied physics, chemistry, biology, or related fields (earth sciences, space science, weather, disaster prevention, manufacturing, etc.), who use numerical calculation and simulation as research methods, and, in particular, who develop software code. Many such readers have not studied programming and numerical calculation systematically, so many parts of this book cover material that information science experts would regard as undergraduate level.
This Volume 2 presents advanced techniques based on concrete software applications in several fields, in particular the field of materials science. Chapter 1 outlines supercomputers, including a brief explanation of the history of the hardware. Chapter 2 details a program tuning procedure. Chapter 3 describes concrete tuning results on the K computer for several software applications: RSDFT [1] and PHASE [2] (whose official name is now PHASE/0) in materials science, nanoscience, and nanotechnology; Seism3D in earth science; and FrontFlow/Blue in engineering. These chapters are more practical than Chaps. 1–5 of Volume 1. Chapter 4 explains how to reduce the computational cost of density functional theory (DFT) calculations from O(N³) to O(N), the so-called order-N method; this method is implemented in the software application OpenMX [3]. Chapter 5 explains acceleration techniques for classical molecular dynamics (MD) simulations, for example, general techniques for hierarchical parallelization on the latest general-purpose supercomputers, in particular those connected by a three-dimensional torus network; these techniques are implemented in the software application MODYLAS [4]. This chapter also introduces the software application GENESIS [5], which is developed for investigating the long-term dynamics of biomolecules by simulating huge biomolecular systems with efficient structure search methods such as the extended ensemble method. Chapter 6 explains techniques for large-scale quantum chemical calculations, including the order-N method; these techniques are implemented in the software applications DC-DFTB-K [6] and SMASH [7]. You can download and use some of these programs. MateriApps [8] is useful for finding software applications in the field of materials science; this website introduces software applications from around the world as well as those made in Japan.
This book is revised and updated from the Japanese book published by Osaka University Press in 2017, The Art of High Performance Computing for Computational Science 2 (Masaaki Geshi, Ed., 2017). It is based on the lectures "Advanced Computational Science A" and "Advanced Computational Science B", broadcast to as many as 17 campuses through videoconference systems since 2013. All the texts and videos are published on websites (only in Japanese). These lectures were part of the human resource development programs that we undertook within the Computational Materials Science Initiative (CMSI), which organized field 2 <New materials/energy creation> of the Strategic Programs for Innovative Research (SPIRE) of the Ministry of Education, Culture, Sports, Science and Technology, the so-called K computer project. The lectures aim to contribute to developing young human resources, centering on basic techniques that will remain valid for a long time even though computers progress day by day. They were offered from the Institute for NanoScience Design, Osaka University, and each lecture has gathered up to about 150 participants; the total number of participants has exceeded 6500 over the past 6 years. The videos of the lectures and the lecture materials are open to the public on the web, and anyone can study the content at any time in Japanese. We now use a cloud-based video meeting service that connects many users across different devices, making it easier for more people to participate. The lectures continue to be offered, with a slightly changed organizational structure, even after the end of the project.
As the editor, I would like to express my deep appreciation to the authors who cooperated in writing this series of books. I would also like to thank the staff of Springer for publishing the English version. I believe that the techniques cultivated in Japan's K computer project contain much that will be useful for future HPC. I hope this knowledge will be shared around the world and will contribute to the development of HPC and of science.

Osaka, Japan
April 2019

Masaaki Geshi

References

1. https://github.com/j-iwata/RSDFT
2. https://azuma.nims.go.jp/software
3. http://www.openmx-square.org/
4. http://www.modylas.org/

5. https://www.r-ccs.riken.jp/labs/cbrt/
6. http://www.chem.waseda.ac.jp/nakai/?page_id=147&lang=en
7. http://smash-qc.sourceforge.net/
8. https://ma.issp.u-tokyo.ac.jp/en/
Contents

1 Supercomputers and Application Performance ..... 1
  Kazuo Minami
2 Performance Optimization of Applications ..... 11
  Kazuo Minami
3 Case Studies of Performance Optimization of Applications ..... 41
  Kazuo Minami and Kiyoshi Kumahata
4 O(N) Methods ..... 89
  Taisuke Ozaki
5 Acceleration of Classical Molecular Dynamics Simulations ..... 117
  Y. Andoh, N. Yoshii, J. Jung and Y. Sugita
6 Large-Scale Quantum Chemical Calculation ..... 159
  Kazuya Ishimura and Masato Kobayashi

Index ..... 203

Chapter 1
Supercomputers and Application
Performance

Kazuo Minami

Abstract Before the advent of modern supercomputers, single processors were approaching the limit of improvement in operating frequency, and there was a memory wall problem: even if the computing capacity of a single processor could be increased, the data supply capacity of the memory could not match it. The increase in operating frequency also caused another problem: power consumption increased faster than performance. In other words, the limit of performance improvement of a single processor was becoming apparent. To solve these problems, parallel architectures, in which many single processors are connected by a communication mechanism, have been adopted. Two points, "programming conscious of parallelism" and "programming conscious of execution performance", are very important for users, researchers, and programmers who want to make effective use of present supercomputers equipped with tens of thousands of processors.

1.1 What Is a Supercomputer?

In Sect. 1.1, we first describe the development of computers and the changes in the usage technologies of supercomputers, and in Sect. 1.2 we describe two important points for developing high-performance applications. Computational science, which elucidates scientific phenomena by using numerical simulation, has long been described as the third pillar of science alongside theory and experiment, and in recent years, innovative scientific and technological research and development using supercomputers has been active all over the world. In Japan, the K computer (http://www.riken.jp/en/research/environment/kcomputer/) developed by RIKEN (The Institute of Physical and Chemical Research, http://www.riken.jp/en/) in 2011 won the top 500 (https://www.top500.org/) ranking for two consecutive terms.
The application of supercomputers in Japan is an innovative way to elucidate var-
ious natural phenomena over a vast scale from the extremely fine quantum world to
the universe, including an enormous number of galaxies, and the discoveries made
are expected to contribute to society. For example, on the very small scale of ten to
the minus several powers of meters, we expect to understand the behavior of viruses,
liposomes (consisting of several hundred thousand atoms), and other organic phe-
nomena through long-running simulations, and this is expected to contribute to the
medical field, inexpensive biofuels, and new energy fields. On a slightly larger scale,
we expect to accelerate innovation in next-generation electronics through design
simulation of entire next-generation semiconductor nanodevices and the creation of
new functional nonsilicon materials such as nanocarbons. On the scale of human
society from several tens of meters to several hundreds of kilometers, we expect to
contribute to detailed disaster-prevention planning by seismic simulation, combin-
ing seismic wave propagation and structure response. On a global scale of several
thousand kilometers to several tens of thousands of kilometers, we expect to present
high-resolution global weather forecasts and accurate predictions of the course and
intensity of typhoons by climate simulation, and to contribute to climate change
research. On the larger scale of more than 10 to the 20th power of meters, we expect
to elucidate cosmic phenomena, such as the generation of stars and analysis of the
behaviors of galaxies.
As described earlier, various applications are expected for supercomputers, but
what is a supercomputer in the first place?
Although there is no clear definition of a supercomputer, it is regarded as a com-
puter with extremely high speed and outstanding computing capacity, compared with
the general computers of its era. For example, a supercomputer is defined in present
government procurement in Japan (2016) as a computer capable of 50 trillion or more
floating point operations per second (50 TFLOPS)4: this number is reviewed as necessary. In the mid-1940s, one of the first digital computers, named the ENIAC (an
abbreviation of Electronic Numerical Integrator and Computer), appeared. In 1976,
the CRAY-1, which was described as the world’s first supercomputer, appeared; its
theoretical computing performance was 160 MFLOPS. The performance of a per-
sonal computer using a Pentium IV in 2002 was about 6.4 GFLOPS: about 40 times
the performance of the CRAY-1. At that time, the performance of the Earth Sim-
ulator, which was Japan’s fastest supercomputer in 2002, was 40 TFLOPS, which
was about 250,000 times the performance of the CRAY-1. The K computer, which
achieved the world’s fastest performance in 2011, achieves 10 PFLOPS, about 62.5
million times the performance of CRAY-1.
Computers have achieved these drastic performance improvements, but how?

4 FLOPS denote a unit of calculation speed. One FLOPS is the execution of one floating point
calculation per second. Thus, 160 MFLOPS is equivalent to 160 million floating point operations
per second.

Early computers had a single processor. After the technology changed from vac-
uum tubes to semiconductors, the improvement of the frequency of the semiconduc-
tor devices promoted the performance improvement of the CPU and improved the
performance of the computer. The memory was composed of one or more memory
banks, and a memory bank could not be accessed until a certain time had elapsed
after a previous access. The waiting time for memory accesses remains several tens
of nanoseconds even now. However, until the 1970s, because the operating frequency
of the computer was low and the operation of the computing unit was slow, memory
access times were not a major problem. It was an era when the computation speed
was the bottleneck rather than the memory transfer performance.
Since then, while the waiting time for memory access of several tens of nanosec-
onds has not reduced much, the number of cycles the CPU must wait for memory
access has increased with the improvement in CPU operating frequency. Moreover,
because of the miniaturization of the semiconductor process, more computing units
can be mounted in one CPU. As a result, the data transfer capability of the memory limits the effective computing capacity of the computing units. This is called the memory wall problem.
Between the latter half of the 1970s and the 1980s, although it was based on a single
processor, a vector architecture was developed that enabled high-speed computation
using vector pipelines, treating data that could be processed in parallel by paying
attention to the parallelism within loops. To solve the memory wall problem in the
vector computer, the number of memory banks was increased, and the CPU read
data from different memory banks cyclically to supply data to the computing unit
continuously. The problem of the waiting time to access the same memory bank was
thus overcome. By adopting this mechanism, it was possible for the vector computer
to balance the data supply capability of the memory and the calculation ability of the
computing unit.
At that time, the processor’s operating frequency was several tens of MHz or
more, and as described earlier, more computing units could be added because of the
progress in the miniaturization of the semiconductor process. This increase, together
with the increased operating frequency of the computing unit, contributed to the
realization of high-speed vector computers. Although the vector system was fast,
the manufacturing cost and power consumption increased because of the expensive
memory bank mechanism described above.
Around the same time as the development of the vector architecture, the computing
unit of the scalar architecture also became RISC5 and was pipelined, taking advantage
of the increases in processing units and operating frequency. The scalar architecture
evolved into a superscalar architecture with multiple computing units. Furthermore,
by using SIMD6 and other techniques, high-speed computation was made possible
by utilizing advanced parallelism hidden in the program.

5 Reduced instruction set computing.


6 Single instruction multiple data: A class of parallel processing that applies one instruction to multiple data.

In the scalar architecture, to cope with the memory wall problem, a countermea-
sure other than the vector architecture was taken: a cache with high data supply
capability was placed between the memory and the computing unit without increas-
ing the number of memory banks. As much data as possible was placed in the
cache, and the data were reused to compensate for the limited data supply capacity of
the memory to the computing unit. Although this method has performance disadvan-
tages, it has benefits in terms of cost and power consumption over the multiple-bank
method of the vector architecture.
The single processor was approaching the limits of improvement of the operating
frequency and the memory wall problem remained. Even if the computing capacity
of the single processor could be increased, the data supply capacity of the memory
could not catch up with the computing capacity. Furthermore, with the increases in
operating frequency, we also faced the problem of power consumption increasing
faster than the improvement in performance. In other words, the limits of performance
improvement of single processors were becoming apparent.
To solve this problem, a parallel architecture in which many single processors are
connected by a communication mechanism has appeared. Without this development,
it would be impossible to obtain the necessary computing power with realistic power
consumption.
At present, hybrid, massively parallel computers are emerging, in which multiple
calculation cores are built in a processor and thousands to tens of thousands of
processors are connected by a communication network.
Although each node of a supercomputer is basically the same as an ordinary
computer, computing capacity and computing performance in total are extremely
high, and high-speed interconnection performance is required. Further, because low-
power performance of the total system is required, it is essential that power saving
is implemented at the processor level. Because the number of parts constituting the
system is very large, extremely high reliability is required for individual parts, and
high reliability of the total system is required.
Up to this point, we have explained the development of hardware. As the hardware
evolved, how has its usage changed so that the performance of the hardware can be
fully utilized?
In the early days of computers, the processing speed was a bottleneck compared
with the memory transfer performance with a single processor, so development envi-
ronments such as high-level languages and compilers emerged. It was common for
researchers and programmers to reproduce formalized and discretized theoretical
model equations in code. High-speed processing was realized by developing com-
pilers that could interpret the parallelism hidden in the program.
From the latter half of the 1970s to the 1980s, when vector architectures used multiple
memory banks to cope with the memory wall problem, the parallel nature of loop
indices was exploited and parallel-processable data were pipelined.
As a programming technique in the age of vector architecture, it was necessary to guarantee the parallel nature of loops, and eliminating recurrence (references to the results of previous iterations) became an essential performance optimization technique.
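As a minimal illustration (a generic sketch, not an example from the book), the following C fragment contrasts a loop whose iterations are independent, and can therefore be vectorized, with a loop that carries a recurrence and cannot:

```c
/* Illustrative sketch only: a vectorizable loop versus a loop carrying a
   recurrence (a dependence on the result of the previous iteration).     */
void scale_add(int n, double *a, const double *b, const double *c)
{
    /* No dependence between iterations: a vector (or SIMD) compiler can
       process many iterations of this loop in parallel.                  */
    for (int i = 0; i < n; i++)
        a[i] = b[i] + 2.0 * c[i];
}

double prefix_sum(int n, double *a)
{
    /* Recurrence: a[i] depends on a[i-1], so the iterations cannot be
       executed independently; such loops had to be rewritten, or the
       recurrence eliminated, to obtain vector performance.               */
    for (int i = 1; i < n; i++)
        a[i] = a[i] + a[i - 1];
    return a[n - 1];
}
```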

In the scalar architecture, as described earlier, the memory wall problem was addressed by providing a cache with high data supply capability. In addition, the scalar architecture introduced SIMD and made effective use of simplified instructions (RISC), superscalar execution with multiple operation pipelines, and software pipelining by the compiler.
Similar programming changes were required for the scalar architecture as well: it was necessary to guarantee the parallelism of loop indices, SIMD vectorization, which likewise requires eliminating recurrence, became a key performance optimization technique, and efficient cache usage became indispensable.
After the limits of performance improvement of a single processor became apparent and supercomputers changed to parallel architectures, parallelism among the several to several hundred cores within a CPU and among thousands to tens of thousands of CPUs has been realized by expressing it explicitly in programs. In other words, the programmer must parallelize the code in consideration of the parallelism among the cores and among the CPUs, and must program with consideration of how the data used for calculation are distributed among the cores and CPUs. In addition, a communication usage technique that exploits the network topology between the nodes where the processes are located has become necessary.
As described earlier, modern computers still have the memory wall problem in
which the computing capacity of the computing unit is increased but the data supply
capacity of the memory is relatively insufficient. To cope with this problem, cache
memories (level 1, level 2, and level 3) with high data supply capability are provided,
and the data are placed in the cache and reused many times while performing a
calculation.7 Thus, compared with programming on older computers, the necessity
of programming with attention to multilevel memory structures such as cache became
obvious. However, many programs cannot reuse data as described here. Because the
capacity of the computing unit cannot then be fully used, programs may require the
use of high-speed data access mechanisms such as prefetch.

1.2 What Is a High-Performance Application?

The two points mentioned in Sect. 1.1, “programming conscious of parallelism” and
“programming conscious of execution performance” must be recognized by users,
researchers, and programmers who use the present supercomputers equipped with
tens of thousands of processors and containing various enhancements and new func-
tions. Thus, high-performance applications require “performance optimization with
high parallelism” and “performance optimization of single CPUs”. Chapters 1–3 deal with these performance optimization techniques, which are used to exploit the performance of modern supercomputers.

7 This approach is called cache blocking.



1.2.1 Important Points in Optimizing High Parallelism

Parallelization is briefly described first. The basic idea is simple. As shown in Fig. 1.1,
if a problem that is sequentially computed using one processor is computed in parallel
using four processors, the computation should be four times faster and it should be
executed in a quarter of the original calculation time.
In simulations of fluids and structural analysis, a mesh is constructed in the spatial
direction, and calculations are performed for each mesh point. High parallelization
is briefly explained by using this example.
To parallelize the calculation, the mesh is divided into multiple regions. These
regions are distributed among the processors and the calculations are performed in
parallel. Such a parallelization method is called a domain decomposition method
and is depicted in Fig. 1.2. In this figure, after executing the calculation using four
processors, data are exchanged using the communication network to achieve con-
sistency of the calculations proceeding in parallel; these steps are then repeated to
continue the calculation. As described earlier, in the parallel computation by domain
decomposition, adjacent communications are performed to exchange the data of a
part of the domain with the adjacent processors. When an inner-product calculation
is performed over all the domains, global communication is required to obtain the
sum of the data for all processors. An important point in achieving high parallelism
is to minimize the amounts of adjacent and global communications mentioned here.
It is also important to make the calculation times for each processor as equal as
possible. Differences in calculation time are called load imbalances.
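A minimal sketch of this pattern is shown below (illustrative only; MPI is assumed, and the one-dimensional mesh, update rule, and sizes are made up, not taken from the applications discussed in later chapters). Each process updates its own mesh points and exchanges one layer of boundary data with its neighbors by adjacent communication:

```c
/* Sketch of domain decomposition with halo (adjacent) exchange.
   A 1-D mesh is split evenly among the MPI processes.                    */
#include <mpi.h>
#include <stdlib.h>

#define NLOCAL 1000            /* mesh points owned by each process       */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* u[0] and u[NLOCAL+1] are halo cells holding the neighbors' data.   */
    double *u    = calloc(NLOCAL + 2, sizeof(double));
    double *unew = calloc(NLOCAL + 2, sizeof(double));
    int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

    for (int step = 0; step < 100; step++) {
        /* Adjacent communication: exchange one boundary value with each
           neighbor; the amount does not grow with the number of processes. */
        MPI_Sendrecv(&u[NLOCAL], 1, MPI_DOUBLE, right, 0,
                     &u[0],      1, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[1],          1, MPI_DOUBLE, left,  1,
                     &u[NLOCAL + 1], 1, MPI_DOUBLE, right, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Local computation on the owned mesh points (simple smoothing).  */
        for (int i = 1; i <= NLOCAL; i++)
            unew[i] = 0.5 * (u[i - 1] + u[i + 1]);
        double *tmp = u; u = unew; unew = tmp;

        /* An inner product over all domains would additionally need global
           communication, e.g. MPI_Allreduce, whose cost grows with nprocs. */
    }
    free(u);
    free(unew);
    MPI_Finalize();
    return 0;
}
```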

Fig. 1.1 Parallel calculation (schematic: a sequential calculation on one processor versus the same calculation divided among four processors, reducing the elapsed time to about one quarter)


Fig. 1.2 Parallel calculation by domain decomposition (each of four processors computes its own region during the calculation time, then exchanges data during the communication time, and the steps repeat)

Next, the parallelization rate and parallelization efficiency will be described.


Assume that 99% of a sequential calculation can be performed as parallel calculations
but that the remaining 1% of the sequential calculation cannot be made parallel; the
parallelization rate of this calculation is 99%. Assume that the sequential calculation
time is 100 s in total, and the parallel computation is performed using 100 processors.
The parallel computation time for the parallelizable component will then be 99 s/100
= 0.99 s; 1 s of nonparallel computing time must then be added, so the calculation
time with 100 processors is 1.99 s. Thus, the calculation that originally took 100 s takes about 1/50 of that time with 100 processors, so the parallelization efficiency is about 50%.
In the same way, for 1000 processors, the calculation time is about 1/91. That is,
if even 1% of the calculation is nonparallel, efficient parallel calculation cannot be
performed no matter how many processors are used. This limit on the benefits of
parallelization is known as Amdahl’s law. From this, as well as minimizing the com-
munication time, it is important to minimize the nonparallel computing component
as much as possible.
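The numbers in this example can be reproduced with a few lines of C (a sketch; the 99 s / 1 s split is the one used above):

```c
/* Amdahl's law for the example in the text: 100 s of sequential work of
   which 99 s can be parallelized and 1 s cannot.                         */
#include <stdio.h>

int main(void)
{
    const double t_parallel = 99.0;   /* parallelizable part [s]          */
    const double t_serial   = 1.0;    /* nonparallel part [s]             */
    const int nprocs[] = {1, 100, 1000, 10000};

    for (int k = 0; k < 4; k++) {
        int p = nprocs[k];
        double t = t_parallel / p + t_serial;            /* elapsed time  */
        double speedup    = (t_parallel + t_serial) / t;
        double efficiency = speedup / p * 100.0;
        printf("p=%6d  time=%7.3f s  speedup=%7.2f  efficiency=%5.1f%%\n",
               p, t, speedup, efficiency);
    }
    return 0;
}
```

For p = 100 this prints a speedup of about 50 (50% efficiency), and for p = 1000 a speedup of about 91, matching the 1/91 figure above.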

1.2.2 Important Points for Single-CPU Performance Optimization

From the viewpoint of single-CPU performance, applications can be roughly divided into two types according to the ratio of the number of bytes of memory accessed to the number of floating point operations required to execute the application. In one type, the data transfer requests (in bytes) are small compared with the number of floating point operations. Such calculations are called computations with small required byte/FLOP (B/F) values. For example, for the matrix–matrix product calculation shown in Fig. 1.3, in principle the B/F value is 1/N when the data movement amount is counted in elements. In general, because the amount of data movement is expressed in bytes, for double precision calculations the movement amount is multiplied by eight, giving 8/N, and in principle the B/F value becomes smaller as N increases. In real programs (see Fig. 1.4), once N reaches a certain size, if the access direction of (a) is contiguous in memory, (a) stays in the cache but (b) does not. Therefore, using a technique called blocking, we divide the matrices into small blocks so that both (a) and (b) fit in the cache. Applications with small required B/F values, like the matrix–matrix product shown here, can achieve high single-CPU performance by reusing the data placed in the cache many times.
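As a generic illustration of the blocking technique described here (a sketch, not one of the tuned kernels discussed in Chap. 3), a matrix–matrix product C = A × B can be tiled so that the small blocks corresponding to (a) and (b) both stay in cache and are reused:

```c
/* Cache blocking (tiling) of C = A * B for N x N row-major matrices.
   BS is chosen so that three BS x BS blocks fit in cache; C is assumed
   to be zero-initialized by the caller.                                  */
#define BS 64

void matmul_blocked(int n, const double *A, const double *B, double *C)
{
    for (int ib = 0; ib < n; ib += BS)
        for (int kb = 0; kb < n; kb += BS)
            for (int jb = 0; jb < n; jb += BS)
                /* Multiply one BS x BS block; the blocks of A and B are
                   reused many times while they remain in cache.          */
                for (int i = ib; i < ib + BS && i < n; i++)
                    for (int k = kb; k < kb + BS && k < n; k++) {
                        double aik = A[i * n + k];
                        for (int j = jb; j < jb + BS && j < n; j++)
                            C[i * n + j] += aik * B[k * n + j];
                    }
}
```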

Fig. 1.3 Example of matrix–matrix product (1): 2N³ calculations on two arrays (a) and (b) of N² pieces of data each (small required B/F value)

Fig. 1.4 Example of matrix–matrix product (2)



Fig. 1.5 Example of matrix–vector product: 2N² calculations on a matrix (a) of N² pieces of data and a vector (b) of N pieces of data

In the other type of application, the data transfer requests from memory are large compared with the number of floating point operations required to execute the application. These are called calculations with large required B/F values. Such calculations have difficulty exploiting the high performance of the CPU because it is hard to use the cache effectively. For example, for the matrix–vector product calculation shown in Fig. 1.5, in principle the B/F value is approximately 1/2 when the data movement is counted in elements. As above, for a double precision calculation the movement amount is multiplied by 8 bytes, giving 8/2 = 4; therefore, the B/F value is large compared with the matrix–matrix product. In this way, as a viewpoint for improving single-CPU performance, the required B/F value of the application is important.
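The required B/F values quoted for the two kernels can be checked with a short calculation (a sketch assuming double precision and counting each element moved once):

```c
/* Required byte/FLOP (B/F) estimates for double-precision kernels.
   Data volume is counted as (number of elements moved) * 8 bytes.        */
#include <stdio.h>

int main(void)
{
    double n = 1000.0;

    /* Matrix-matrix product: 2N^3 operations on 2N^2 input elements.     */
    double bf_mm = (2.0 * n * n * 8.0) / (2.0 * n * n * n);   /* = 8/N    */

    /* Matrix-vector product: 2N^2 operations on about N^2 + N elements.  */
    double bf_mv = ((n * n + n) * 8.0) / (2.0 * n * n);       /* ~= 4     */

    printf("matrix-matrix B/F ~ %.4f, matrix-vector B/F ~ %.4f\n",
           bf_mm, bf_mv);
    return 0;
}
```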
Exercises
1. Describe the memory wall problem, which is an important problem for single
processors, and describe the characteristics of recent supercomputers.
2. There are various benchmark tests (BMTs) to evaluate the performance of super-
computers. The most famous BMT is the top 500 (https://www.top500.org/),
which is evaluated by the performance of LINPACK, but there are others as well.
For some other BMTs, discuss the relationship between the evaluation method
and the field in which the evaluation is important.
Chapter 2
Performance Optimization
of Applications

Kazuo Minami

Abstract In this chapter, we present procedures for performance evaluation that we have used for practical applications. The method is outlined in Sect. 2.1, and
Sects. 2.2–2.6 describe the details of the method. The classification of problems
related to high parallelism and the classification of applications from the viewpoint
of single-CPU performance are described. Sections 2.7 and 2.8 then describe perfor-
mance optimization techniques for each problem pattern related to high parallelism
and the techniques for single-CPU performance optimization according to applica-
tion classification.

2.1 Performance Evaluation Method

The performance evaluation of an application is divided into two parts: "highly parallel performance optimization" and "single-CPU performance optimization." For each part, the evaluation has two working phases: "current state recognition" and "understanding the problems." The "current state recognition" phase is common to both parts and is divided into "source code investigation," which analyzes the structure of the source code, and "measurement of elapsed time," which establishes the current state of application performance. The final procedure in "current state recognition" is "calculation/communication kernel analysis," in which we evaluate the results of the "source code investigation" and the "measurement of elapsed time." The "understanding the problems" phase then begins with the "problem evaluation method." In the "problem evaluation method" for "highly parallel performance optimization," the problems related to high parallelization are classified into six patterns. In the "problem evaluation method" for "single-CPU performance optimization," applications are also classified into six patterns.
Our approach is summarized in Table 2.1.


Table 2.1 Outline of performance optimization method

                               Highly parallel performance    Single-CPU performance
                               optimization                   optimization
Current state recognition      Source code investigation
                               Measurement of elapsed time
                               Calculation/communication kernel analysis
Understanding the problems     Problem evaluation methods

2.2 Current State Recognition: Source Code Investigation

As the first step in current state recognition, we investigate the source code of the
application. We investigate the structure of the source code and analyze the call struc-
ture of subroutines and functions. We also analyze the subroutines, the loop structure
in the functions, and the control structure of the IF blocks, and organize and visualize
the structure of the entire program. The visualized source code is divided into blocks
of calculation and communication processing according to the algorithms of physics
and mathematics used in the program, and the blocks are organized. We understand
the physical/mathematical processing content of each processing block. By com-
paring these aspects of the processing blocks with the results of the investigated
source code, the calculation characteristics for each calculation block are obtained.
The calculation characteristics describe whether the processing of a calculation block is nonparallel, completely parallel, or partially parallel; identify the calculation index (e.g., the number of atoms or the number of mesh points); and record whether the calculation amount in the block is proportional to N or to N² when the calculation index is N, and so on. We also investigate the communication characteristics of each communication block: whether the processing of the block is global communication or adjacent communication, and whether the communication amount depends on the calculation index. These investigations are shown in Fig. 2.1.
The purpose of the investigation of the source code is to understand the charac-
teristics of each processing block in the program. However, the visualization of the
loop structure from the start to the end of the program and that of the entire control
structure of the IF blocks mentioned here are large tasks if done manually. Therefore,
we use a visualization tool for program structure, such as K-scope [1, 2].

2.3 Current State Recognition: Measurement Methods

In the sequential calculation before parallelization, the calculation is performed sequentially for each calculation unit, such as a mesh point. To parallelize a code, we divide the set of calculation units into multiple subsets, distribute the subsets among the processors, and perform the calculations in parallel. In such parallel computation, adjacent communications are performed in every calculation step to exchange data for parts of areas with neighboring processors. In addition, when calculating inner products of scalar values over all areas, global communication between all processors is required. An important point in achieving high parallelism is to make the adjacent and global communication times as small as possible.

Fig. 2.1 Investigation of source code
As described in Sect. 1.2.1, in parallel computation it is important not only to reduce the communication time but also to make the nonparallel computing parts as small as possible.
The next step of current state recognition is application performance measurement. It is important to conduct performance measurements that clarify the parallel characteristics of the application: specifically, what behavior the adjacent and global communication times show during highly parallel calculation, which calculation parts remain nonparallel, and how those nonparallel parts influence the application's behavior in highly parallel execution. To clarify these parallel characteristics, where possible, the performance measurement is carried out as follows.
First, we define the problem to be solved, determine the number of parallel paths
in the problem, and create a test problem that has the same problem size with one
processor as the target problem that can be run with several levels of parallelism.
Next, we perform the performance measurement using the prepared test problem. In
the performance measurement, the execution time is measured for each process for
each calculation block and communication block, as defined in the previous section.
The parallel characteristics during parallel computation cannot be fully clarified by
measuring the entire application. Each processing block’s influence on the paral-
lel characteristics differs depending on whether it includes a nonparallel part and
the number of parallel paths, and whether the communication time changes. There-
fore, it is essential to measure the performance of each processing block for each
process separately. These measurements allow us to identify the processing blocks
that degrade parallel performance. In addition, because the communication behav-
ior during parallel execution differs between adjacent and global communication,
it is necessary to measure them separately. The adjacent communication time has
the same value if the communication amount is the same, as described later, but
the global communication time tends to increase as the number of parallel paths
increases, even if the communication volume stays the same. Furthermore, because
communication times may include waiting times caused by load imbalance, it is also
important to measure the waiting time and the net communication time separately,
thus allowing us to distinguish whether the problem is caused by communication or
load imbalance. With respect to computation, simultaneously with the computation
time, the amount of computation and the computation performance are also measured
for each processing block in each process.
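One possible way to realize the per-block, per-process measurement described here is sketched below (an illustrative MPI fragment, not the actual instrumentation used on the K computer). A barrier is inserted before the communication so that the waiting time caused by load imbalance can be recorded separately from the net communication time; note that the barrier itself slightly perturbs the run:

```c
/* Sketch of per-block timing that separates waiting time (load imbalance)
   from net communication time. t_calc, t_wait, and t_comm are accumulated
   per process and per block and written out at the end of the run.       */
#include <mpi.h>

double t_calc = 0.0, t_wait = 0.0, t_comm = 0.0;

void timed_step(double *sendbuf, double *recvbuf, int count)
{
    double t0 = MPI_Wtime();
    /* ... calculation block ... */
    double t1 = MPI_Wtime();

    /* The barrier absorbs the load imbalance: fast processes wait here.  */
    MPI_Barrier(MPI_COMM_WORLD);
    double t2 = MPI_Wtime();

    /* Global communication block (e.g., the sum for an inner product).   */
    MPI_Allreduce(sendbuf, recvbuf, count, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);
    double t3 = MPI_Wtime();

    t_calc += t1 - t0;      /* computation time of this block             */
    t_wait += t2 - t1;      /* waiting time due to load imbalance         */
    t_comm += t3 - t2;      /* net global communication time              */
}
```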

2.4 Current State Recognition: Determination of Computation and Communication Kernels

The analysis of the source code shows the correspondence between the physi-
cal/mathematical processing contents of each processing block and the source code,
and the calculation characteristics of each calculation block and the communica-
tion characteristics of each communication block. By matching these results with
measurement results, the calculation kernel and the communication kernel can be
identified.
For example, suppose there is a parameter N that determines the amount of com-
putation. Assume that the coefficient of computation amount proportional to the third
power of N is m1, the coefficient of computation amount proportional to N is m2,
and that m2 is considerably larger than m1. When N is relatively small, the amount
of computation for the two parts may be about the same. However, as N increases,
the amount of computation for the part proportional to the third power of N becomes
significantly larger, and the amount of computation for the part proportional to N
may become negligible.
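A two-line check with made-up coefficients (m1 = 1 and m2 = 10⁶ are hypothetical values, not from the text) shows how the part proportional to the third power of N overtakes the part proportional to N once N exceeds the square root of m2/m1:

```c
/* Illustrative only: with hypothetical coefficients m1 = 1 and m2 = 1.0e6,
   the contributions m1*N^3 and m2*N are equal at N = sqrt(m2/m1) = 1000;
   beyond that, the N^3 part dominates and the N part becomes negligible. */
#include <stdio.h>

int main(void)
{
    const double m1 = 1.0, m2 = 1.0e6;
    for (double n = 100.0; n <= 1.0e5; n *= 10.0)
        printf("N=%8.0f  m1*N^3=%12.3e  m2*N=%12.3e\n",
               n, m1 * n * n * n, m2 * n);
    return 0;
}
```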
Both the amount of computation and the computation time also vary depending
on the level of performance1 that can be obtained relative to the theoretical peak
performance. The essentially nonparallel parts may remain because of the adopted
parallelization method. By considering in this way the size of the parameters of the problem to be solved, the parallelization method used, the parallelization method that may be adopted in the future, the prospects for effective performance, and so on, the kernels to be evaluated are determined (see Fig. 2.2). The kernels selected here can be reviewed at later stages of the evaluation.

1 Performance obtained by dividing the measured amount of computation by the execution time.

Fig. 2.2 Identifying kernel of calculation and communication

2.5 Understanding the Problems: Evaluation of High-Parallelism Problems

We explain how to evaluate the problems of high parallelism by carrying out the
measurements shown in Sect. 2.3 and how to measure parallel performance from
several parallel processes to about 100, about 1000, or several thousand, step by
step. There are two kinds of methods for measuring the performance by gradually
increasing the number of parallel processes: strong scaling measurement and weak
scaling measurement. Strong scaling measurement is a method of fixing the scale
of the problem to be solved and increasing the number of parallel processes: for
example, if the problem scale is fixed to N = 10,000, the number of parallel pro-
cesses and the problem size per processor change is 1 and 10,000, 2 and 5000, 4 and
2500, and so on, respectively. In contrast, weak scaling measurement is a method of
fixing the scale of the problem solved by each processor and increasing the number
of parallel processes. For example, if the problem scale per processor is fixed at N = 1000, then with 2 parallel processes the total problem scale is N = 2000, and with 4 parallel processes the total problem scale increases to N = 4000. The feature of weak scaling measurement is, ide-
ally, that even when the number of parallel processes is increased, because the same
computation is performed, and the adjacent communication amount is not changed,
the execution time of the computation parts and the execution time of the adjacent
communications are not changed. When a nonparallel part is included in the com-
putation part, a significant increase in computation time should be measured, as the
number of parallel processes becomes large in weak scaling measurement.

For example, assume that the execution time of the parallelizable part during
sequential execution is Tp and the execution time of the nonparallelizable part dur-
ing sequential execution is Ts . The execution time T0 during sequential execution
is represented by T0 = Tp + Ts . The execution time when this problem is mul-
tiplied by N and executed sequentially is represented by N × T0 = N × Tp +
N × Ts . When this problem is executed in N parallel processes, it corresponds to
what we performed with weak scaling. If the execution time when executed in N
parallel processes is Twn , the parallelizable portion becomes N times faster but the
nonparallelizable portion does not become faster, so Twn = Tp + N × Ts , and the
term N × Ts increases. Incidentally, if Tsn is the execution time when run with
strong scaling, then Tsn = Ts + Tp /N.
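These formulas can be tabulated directly (a sketch; the values of Tp and Ts are hypothetical):

```c
/* Predicted execution times for weak and strong scaling,
   T_weak(N) = Tp + N*Ts  and  T_strong(N) = Tp/N + Ts.                   */
#include <stdio.h>

int main(void)
{
    const double Tp = 99.0;   /* parallelizable part of sequential run [s] */
    const double Ts = 1.0;    /* nonparallelizable part [s]                */

    for (int N = 1; N <= 10000; N *= 10) {
        double t_weak   = Tp + N * Ts;      /* problem size grows with N   */
        double t_strong = Tp / N + Ts;      /* problem size fixed          */
        printf("N=%6d  weak=%10.2f s  strong=%8.3f s\n", N, t_weak, t_strong);
    }
    return 0;
}
```

The weak scaling column makes the N × Ts term easy to spot: any nonparallel part shows up as a steadily growing execution time even though the work per process is constant.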
Even when the adjacent communication time increases in accordance with the
number of parallel processes, it is easy to see that there are some problems in the
corresponding adjacent communications. The global communication time generally
increases in accordance with the number of parallel processes, and the increase can
be predicted from the data on the basic communication performance by comparing
the degree of increase with the predicted value. This can show whether there are
some problems in the corresponding global communications.
The method described here is shown in Fig. 2.3. The reason for using weak
scaling measurement in this way is that it is easy to find problems. However, in weak
scaling measurement, it is necessary to prepare separate execution data according
to the number of parallel processes, which may be troublesome. In a simulation in
which the amount of computation is proportional to the second or third power of
the problem size N, weak scaling measurement is sometimes difficult. In such a
case, strong scaling measurement is performed. For strong scaling measurement, it
is necessary to model the computation and communication times with the number
of parallel processes as a parameter, to predict these, and to compare the predictions
with the actual measured times so as to find any nonparallel parts or communication
problems. However, unlike weak scaling measurement, it is not necessary to prepare

execution data according to the number of parallel processes. It is sufficient to prepare only one type of data.

Fig. 2.3 Measurement with weak scaling (horizontal axis: parallel number)
For the calculation kernels, we compare and evaluate the trend of the predicted
computation amount as clarified in the investigation of the source code and the com-
putation amounts from the measurement results. For example, assume the computa-
tional amount of the kernel is proportional to the problem size N and is completely
parallelized and measured with weak scaling. Because the computational amount for
each process is constant regardless of the number of parallel processes, the calcula-
tion time for the total system is constant. In this case, the value obtained by dividing
the computational amount of the measurement result by N is the proportional coef-
ficient. If it can be evaluated with weak scaling, as described above, and if it is
possible to completely parallelize and there are no problems in the communication
part, even if the number of parallel processes is increased, the execution time of the
computation part will be constant, and the adjacent communication time will also
be constant and should not increase. If the execution time of the computation part
increases remarkably with the number of parallel processes, it is likely that some
nonparallel parts remain in the operation kernel. In addition, if the adjacent commu-
nication time increases significantly according to the number of parallels, it is likely
that some processing that is not adjacent communication is included in the com-
munication kernel. For example, there may be some global communication that is
used instead of adjacent communication to simplify programming. As for the global
communication, as described at the beginning of this section, its communication time
also increases as the number of parallel processes increases. However, because the
extent of the increase can be predicted from the basic communication performance
data, if the communication time increases significantly more than predicted, we can
consider that there is some problem in the corresponding global communication. In
the discussion of measurement methods, we described the method of measuring the
communication time and waiting time separately. This measured waiting time often
indicates some imbalance included in the computation part and communication part.
In parallel computing, some processing imbalance is physically unavoidable.
However, where the extent of the imbalance is remarkably large or if the imbalance
increases with the number of parallel processes, it is likely that some problem caus-
ing imbalance was introduced in the programming stage. In evaluations using strong
scaling, as described above, the existence of nonparallel parts and communication
problems are found by comparing the predicted computation and communication
times with the measured times. The predicted times are obtained by modeling them
with the parallel number N as a parameter.
For example, suppose that the computation of the kernel is proportional to the
third power of the system parameter N and the computation kernel is completely
parallelized. When measured with strong scaling, if the number of parallel processes
is doubled, then the computation of each process should be halved.
The total computational amount is the measured computational amount of each process multiplied by the parallel number M. The total computational amount divided by the third power of the system parameter N is the proportionality coefficient of the N³ term. The investigation of these computational amounts

and proportional coefficients is performed using many parallel measurement results,


and the evaluation is made as to whether the predicted value is consistent with the
measurement results. If the evaluation results are consistent, it means that the source
code is written according to the theory. If the evaluation results are not consistent
and there is an increase in computational amounts with the increase in the number
of parallel processes, it is likely that there is some problem such as the existence of
nonparallel parts in the source code. A similar evaluation is required for the adjacent
communications. For example, when a rectangular parallelepiped area is calculated
using twice the number of parallel processes, the length of one side of the area allocated to each processor becomes (1/2)^(1/3) times as long, and the adjacent communication amount for the adjacent faces becomes (1/2)^(2/3) times as large. Therefore, assuming the same communication performance, the communication time should also become (1/2)^(2/3) times as long. For the global communication, a similarly modeled evaluation is required. As
for the evaluation of the imbalance, it is necessary to evaluate the results as for weak
scaling.
As repeatedly described, it is essential to carry out the evaluation shown here for
each computation and communication kernel. Some tools provided by manufacturers
have functions to measure the execution time, the amount of computation, and the
computation performance for each subroutine or function, and it is usual to measure
performance using these tools. However, the subroutines and the functions of the
application do not generally match the range of the block, and because a function
may be called from different blocks several times in different ways, these tools may
not yield accurate measurement results for each block. Therefore, it is better to
perform measurements on each block. However, this does not apply if the subroutine
or function is configured to match the block.

2.5.1 Classification of Problems Related to High Parallelism

In the HPCC benchmark, applications are classified by using two axes. The first axis
is defined by the locality versus nonlocality in the spatial direction of the data divided
among the processors. The second axis is defined by the locality versus nonlocality
in the temporal direction of data in the processors [3].
In addition, the "Berkeley 13 dwarfs" study of application classification classified applications along the two axes of communication and calculation patterns [4]. In that study, applications were classified into seven dwarfs in the HPC field, extended to 13 dwarfs by adding other fields.
In promoting performance optimization, we also classify the application and orga-
nize the execution performance optimization methods for applications based on the
classification. For high parallelism, the locality versus nonlocality of the data is
considered in the HPCC as one axis, and in the Berkeley 13 dwarfs, the pattern of
communication is considered as one axis. In this section, we focus on the kinds of
problems that occur and how we deal with those problems when optimizing the per-
formance of existing applications, and we classify them according to highly parallel

patterns. The problems relating to high parallelism are classified into six patterns, as
shown in Table 2.2.
The main problems relating to high parallelism are caused by calculations and
communication. The first, second, and sixth problems are caused by calculation, and
the third, fourth, and fifth problems are caused by communication. The six patterns
are described as follows.
The first pattern is the mismatch of the degree of parallelism between applications
and hardware. Researchers want to solve a problem within a certain time; suppose
that to do so, it is necessary to use tens of thousands of parallel nodes on a super-
computer. For example, the K computer makes it possible to use more than 80,000
parallel nodes in terms of parallelism of the hardware. However, sometimes only
thousands of parallel nodes can be used because of the limitations of the application
parallelization. This is the mismatch of degree of parallelism between the application
and the hardware. When approaching the limitation of the parallelism of the appli-
cation, the computation time becomes extremely small, whereas the proportion of
communication time increases, leading to a deterioration of the parallel efficiency.
The second pattern is the presence of nonparallel parts. As mentioned at the begin-
ning of this chapter, we can see that the parallel performance deteriorates because
of Amdahl’s law if nonparallel parts remain in the computation. Here, assuming that
the execution time of a certain application at the time of sequential execution is Ts
and the parallelization rate of the application is α, the nonparallelization ratio of the
application is 1 – α. When this application is executed using n parallel processes, the
execution time Tn is expressed as Tn = Ts (α/n + (1 – α)). For a parallelization effi-
ciency of 50%, the parallelization ratio α is required to be 99.99% when n = 10,000.
The easiest way to find remaining nonparallel parts is to measure the increase in
execution time of the calculation part using weak scaling measurement as described
above.
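The figures quoted above can be checked with a few lines of C; the sequential time Ts is normalized to 1 here.

/* A short check of the Amdahl's-law figures quoted above:
 * Tn = Ts*(alpha/n + (1 - alpha)), parallel efficiency = Ts/(n*Tn). */
#include <stdio.h>

int main(void) {
    const double Ts = 1.0;          /* sequential execution time (normalized) */
    const double n  = 10000.0;      /* number of parallel processes */
    const double alpha = 0.9999;    /* parallelization ratio of 99.99% */

    double Tn  = Ts * (alpha / n + (1.0 - alpha));
    double eff = Ts / (n * Tn);
    printf("alpha = %.4f  Tn = %.6f  efficiency = %.1f%%\n", alpha, Tn, eff * 100.0);
    /* prints an efficiency of about 50%, as stated in the text */
    return 0;
}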
The third pattern is the occurrence of large communication sizes and frequent
global communication. Communication times, particularly for global communica-
tions, have a large impact on the parallel performance. Consider an example of
implementing the ALLREDUCE communication of M (bytes) between N nodes.
Assume that the ALLREDUCE communication is performed using a binary tree

Table 2.2 Bottlenecks in parallel performance

1 Mismatch of the number of parallel processes between application and hardware (insufficient parallelism of applications)
2 Presence of nonparallel parts
3 Large communication size and frequent occurrence of global communication
4 Global communication among all nodes
5 Large communication size and large number of communications in adjacent communication
6 Load imbalances

algorithm and the communication performance is Pt (bytes/s). The communication time Tg for acquiring the total amount of M (bytes) after all nodes have communicated is Tg = M × log2 N / Pt. To compare the global communications with the adjacent
communications, we consider an example in which N nodes perform the adjacent
communications of M (bytes) to the next rank. When the communication performance
is matched with the above conditions, the communication time Ta to complete the
communication of M (bytes) for all nodes is calculated by Ta = M / Pt. When comparing
global and adjacent communications, it is found that the global communication time
is larger by a factor of log2 N. Global communication should therefore be kept to a minimum.
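The comparison can be made concrete with a rough model following these formulas; the message size M and the communication performance Pt below are illustrative values, not measurements.

/* A rough model, following the formulas above, of global (ALLREDUCE with a
 * binary-tree algorithm) versus adjacent communication time. */
#include <math.h>
#include <stdio.h>

int main(void) {
    const double M  = 8.0e6;   /* message size in bytes */
    const double Pt = 5.0e9;   /* communication performance in bytes/s */

    for (int N = 1024; N <= 82944; N *= 3) {
        double Tg = M * log2((double)N) / Pt;   /* global, binary tree */
        double Ta = M / Pt;                     /* adjacent, next rank only */
        printf("N = %6d  Tg = %.4f s  Ta = %.4f s  Tg/Ta = %.1f\n",
               N, Tg, Ta, Tg / Ta);
    }
    /* Tg/Ta grows as log2(N), which is why global communication should be
     * kept to a minimum. */
    return 0;
}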
The fourth pattern is the occurrence of global communications among all nodes.
As described above, when the ALLREDUCE communication of M (bytes) is per-
formed between N nodes using the binary tree algorithm, the communication time
T is T = M × log2 N / Pt, assuming the communication performance to be Pt (bytes/s).
Because the communication time increases as the number of nodes N increases, it is
better to limit global communication among all nodes as much as possible. However,
calculation of inner products is inevitable in the iterative solution of simultaneous
linear equations and other problems, so it is impossible to eliminate all-node global
communication.
The fifth pattern is the occurrence of a large communication size and a large
number of communications in the adjacent communication. In terms of the commu-
nication time, adjacent communication tends to be faster than global communication.
However, wasteful adjacent communication, such as communicating the data for an entire local area to the adjacent region when only one mesh layer is needed, is sometimes performed. Such code should be reviewed so that only the data on the adjacent surface are communicated.
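A minimal sketch of such surface-only communication is shown below. It assumes, purely for illustration, a one-dimensional domain decomposition in the x direction with periodic neighbors; each process exchanges a single NY-by-NZ face plane with its neighbor instead of its whole NX-by-NY-by-NZ block.

/* A minimal sketch (1-D decomposition in x, periodic neighbours, sizes
 * invented) of exchanging only the adjacent face plane, NY*NZ values,
 * rather than the whole local array. */
#include <mpi.h>
#include <stdlib.h>

#define NX 64
#define NY 64
#define NZ 64

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* local block plus one ghost plane on each side in x (index i = 0..NX+1) */
    double *u    = calloc((size_t)(NX + 2) * NY * NZ, sizeof(double));
    double *send = malloc(sizeof(double) * NY * NZ);   /* one face only */
    double *recv = malloc(sizeof(double) * NY * NZ);

    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;

    /* pack the rightmost interior plane (i = NX) into the face buffer */
    for (int j = 0; j < NY; j++)
        for (int k = 0; k < NZ; k++)
            send[j * NZ + k] = u[((size_t)NX * NY + j) * NZ + k];

    /* exchange only NY*NZ values with the neighbour, not NX*NY*NZ */
    MPI_Sendrecv(send, NY * NZ, MPI_DOUBLE, right, 0,
                 recv, NY * NZ, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* unpack into the left ghost plane (i = 0) */
    for (int j = 0; j < NY; j++)
        for (int k = 0; k < NZ; k++)
            u[((size_t)j) * NZ + k] = recv[j * NZ + k];

    free(u); free(send); free(recv);
    MPI_Finalize();
    return 0;
}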
The sixth pattern is the occurrence of load imbalances. Differences in the amount
of calculation for each node may occur, causing some load imbalance among nodes.
When the load imbalance deteriorates as the number of nodes increases, or when the
load imbalance is extremely large over a small number of nodes, it is a problem.
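A simple way to quantify such an imbalance is to reduce the per-rank computation time to its maximum and average, as in the following sketch; the rank-dependent dummy work merely stands in for the real computation.

/* A small sketch of quantifying a load imbalance: the computation time of each
 * rank is reduced to its maximum and average; their ratio shows how far the
 * slowest rank lags behind. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double t0 = MPI_Wtime();
    volatile double s = 0.0;                     /* dummy work, cost grows with rank */
    for (long i = 0; i < 2000000L * (rank + 1); i++)
        s += 1.0e-9 * (double)i;
    double t = MPI_Wtime() - t0;

    double tmax, tsum;
    MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&t, &tsum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        double tavg = tsum / size;
        printf("max = %.4f s, avg = %.4f s, max/avg = %.2f\n",
               tmax, tavg, tavg > 0.0 ? tmax / tavg : 1.0);
    }
    MPI_Finalize();
    return 0;
}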

2.6 Understanding the Problems: Evaluation Methods for Problems in Single-CPU Performance

2.6.1 Application Classification for Single-CPU Performance

As mentioned in Sect. 2.5, the developers of the HPCC benchmark [3] and the
Berkeley 13 dwarfs [4] classified applications. For the HPCC, applications were
classified using locality versus nonlocality of data in the temporal direction with
regard to the single-CPU performance. For the Berkeley 13 dwarfs, applications
were classified using the calculation pattern.
Similarly, in promoting the study of performance optimization, we also classify
applications and organize the application execution performance optimization tech-
niques based on the classification. In Sect. 1.2, from the viewpoint of the single-CPU

performance, we mentioned that applications can roughly be classified into two types,
one with a low required B/F value and one with a high required B/F value. This idea
is close to the classification used for the HPCC. In this section, we will develop this
view and show the classification of applications into six types as shown in Table 2.3.
The calculations for which the required B/F value is small are the first to the fourth
types. The performance greatly varies depending on whether the DGEMM library
or manual cache blocking can be used, even for calculations with small required
B/F values. When cache blocking can be used, the performance varies depending
on whether the data structure and loop structure are simple, or the data structure is
slightly complicated such as using list vector indexing by integer arrays. Applications
with more complex loop structures often fail to achieve high performance. These
considerations led to the four types of calculations with small required B/F values.
The first type includes applications that can be rewritten as matrix–matrix product calculations. This type has a small required B/F value because, in principle, loading data proportional to the square of n (an n-by-n matrix) from memory allows calculations proportional to the third power of n. An example of this type of calculation is a first-principles quantum calculation based on density functional theory.
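The data-versus-computation argument can be made concrete with the following self-contained C sketch of an n-by-n matrix product; in production code this loop would of course be replaced by a vendor-tuned DGEMM call.

/* A minimal illustration of why type-1 calculations have a small required
 * B/F value: loading O(n^2) data allows O(n^3) floating-point operations. */
#include <stdlib.h>

void matmul(int n, const double *A, const double *B, double *C) {
    /* C += A * B, all matrices stored row-major, n*n each */
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double aik = A[i * n + k];
            for (int j = 0; j < n; j++)
                C[i * n + j] += aik * B[k * n + j];   /* 2*n^3 flops in total */
        }
}

int main(void) {
    int n = 512;
    double *A = calloc((size_t)n * n, sizeof(double));
    double *B = calloc((size_t)n * n, sizeof(double));
    double *C = calloc((size_t)n * n, sizeof(double));
    matmul(n, A, B, C);          /* about 6 MB of data, about 0.27 Gflop of work */
    free(A); free(B); free(C);
    return 0;
}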
The second type includes applications that allow cache blocking although they
are not rewritable to the matrix–matrix product, but still have small required B/F
values. The calculation of the Coulomb interactions of molecular dynamics and the
calculation of the gravity interaction of the gravitational multiple-body problem are
examples. In both cases, by loading the data for n particles and performing cache
blocking, calculations proportional to the square of n can be performed, so that the
required B/F value is small. This type often uses list vector indexing by integer arrays
for the particle access, and the loop body2 is somewhat complicated.
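A schematic illustration of this kind of cache blocking is shown below; one-dimensional coordinates and a simplified force law are used purely for illustration. The j-block is chosen small enough to stay in cache while every particle i interacts with it.

/* A schematic sketch of cache blocking for pairwise interactions: the data
 * for a block of particles j stays in cache and is reused by every i, so the
 * required B/F value becomes small. */
#include <math.h>
#include <stdlib.h>

#define NP    8192   /* number of particles */
#define BLOCK 256    /* particles per cache block */

int main(void) {
    double *x = malloc(sizeof(double) * NP);
    double *f = calloc(NP, sizeof(double));
    for (int i = 0; i < NP; i++) x[i] = 0.01 * i;

    for (int jb = 0; jb < NP; jb += BLOCK)             /* block of j kept in cache */
        for (int i = 0; i < NP; i++) {
            double xi = x[i], fi = 0.0;
            for (int j = jb; j < jb + BLOCK; j++) {    /* reuse of the cached block */
                if (i == j) continue;
                double r = xi - x[j];
                fi += r / (fabs(r * r * r) + 1.0e-30); /* 1/r^2-like interaction */
            }
            f[i] += fi;
        }
    free(x);
    free(f);
    return 0;
}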
The third type contains examples such as special high-precision stencil calcula-
tions,3 which make it possible to use the cache effectively, so the required B/F value

Table 2.3 Classification of applications from the standpoint of single-CPU performance

1 Rewritable to matrix–matrix products: density functional theory calculations
2 Cache blocking is possible: molecular dynamics, many-body gravity problems
3 The required B/F value is small, and the loop bodies are simple: special stencil calculations
4 The required B/F value is small, but the loop body is complex: plasmas, physical processes of meteorology, quantum chemical calculations
5 The required B/F value is large: mechanical processes of meteorology, fluids, earthquakes, nuclear fusion
6 The required B/F value is large and list accesses are used: structural calculations using finite-element methods, fluid calculations

2 The code contained in the loop.
3 The calculation using subscripts for differences such as i, i – 1 appearing in difference calculations.

is small and the loop body is a simple calculation. Although this type of calculation
gives good performance, unfortunately there are few examples.
In the fourth type of calculations, the required B/F value is small, but the loop
body is complex. Some weather calculations have mechanical processes to calculate
the motion of a fluid and physical processes to calculate the microphysics of clouds;
these physical processes correspond to the fourth type of calculation. By using small
amounts of data loaded from the memory, complex and in-cache calculations are
performed, but the loop body tends to be long and complicated. The calculation of
the PIC method4 used for plasma calculations is also of this type. In this technique,
although the mesh data around the particle are cached, list vector indexing by integer
array is commonly used to access the particle data, resulting in complex program
codes. The body of the calculation loop also tends to be long. For this type, we expect
high performance because the data are cached, but in many cases we cannot obtain
the expected performance because of the complexity of the program code.
The fifth and sixth types of calculation have high required B/F values. Even for
program codes that have the same high required B/F values, the performance varies
greatly, depending on whether discontinuous access to lists is required. This is the
basis for classifying calculations with high required B/F values into the fifth and
sixth types.
The fifth type of calculation has a high required B/F value and does not use list accesses. Many of the usual stencil calculations are of this type, and there are many other examples, such as the mechanical (dynamical) processes in weather calculations described earlier, fluid calculations, and calculations of earthquakes. The
sixth type of computation has high required B/F values and uses list accesses. Such
calculations occur frequently in engineering; examples are structural analysis and
fluid calculations using finite-element methods. List accessing is the weak point for
the modern scalar computer architecture because random accesses are required for
each element.
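The difference between the fifth and sixth types can be pictured with the following sketch, in which the first loop performs a contiguous stencil-like update and the second performs the same arithmetic through a list vector; the array sizes and the index pattern are arbitrary.

/* A small sketch contrasting the fifth and sixth types: a contiguous
 * stencil-like update versus the same arithmetic driven by a list vector,
 * as occurs in finite-element codes.  The second loop forces random access. */
#include <stdlib.h>

#define N 1000000

int main(void) {
    double *a = calloc(N, sizeof(double));
    double *b = calloc(N, sizeof(double));
    int *list  = malloc(sizeof(int) * N);
    for (int i = 0; i < N; i++)
        list[i] = (int)(((long)i * 7919L) % N);   /* scattered access order */

    /* fifth type: direct, contiguous stencil-like access */
    for (int i = 1; i < N - 1; i++)
        b[i] = 0.5 * (a[i - 1] + a[i + 1]);

    /* sixth type: the same arithmetic addressed through a list vector */
    for (int i = 1; i < N - 1; i++) {
        int k = list[i];
        if (k >= 1 && k <= N - 2)
            b[k] = 0.5 * (a[k - 1] + a[k + 1]);
    }
    free(a);
    free(b);
    free(list);
    return 0;
}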
In general, single-CPU performance decreases in the order from type 1 to type 6
calculations. However, there is usually little difference between types 2 and 3.

2.6.2 Evaluation by Cutting Out the Computation Kernel

First, we cut out the calculation kernel to form an independent test program that can
be executed in one process. In cutting out the kernel, the following steps are carried
out.
(A) Dump the necessary data while executing the original program, in order to prepare the arrays and other data required for running the test program. The data used by conditional statements are particularly important. For the data used purely in the calculation, if only the performance is of concern, appropriate values may be set without using the dumped data, as sketched below.

4 Particle-in-cell method. A method for arranging particles in the calculation lattice.
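A minimal sketch of step (A) is shown below; the file name, the array layout, and the division into flag data and calculation data are assumptions made for illustration. The function dump_kernel_input would be called once from the original application just before the kernel, and the main program is the stand-alone single-process test driver.

/* A minimal sketch of step (A): the original program dumps the data the kernel
 * needs, and a separate single-process test program reads it back. */
#include <stdio.h>
#include <stdlib.h>

/* Called once from the original application just before the kernel:
 * dump the control data (used by IF statements) and the input arrays. */
void dump_kernel_input(const double *a, const int *flags, int n) {
    FILE *fp = fopen("kernel_input.bin", "wb");
    if (!fp) return;
    fwrite(&n, sizeof(int), 1, fp);
    fwrite(flags, sizeof(int), (size_t)n, fp);
    fwrite(a, sizeof(double), (size_t)n, fp);
    fclose(fp);
}

/* Stand-alone, single-process test driver for the cut-out kernel. */
int main(void) {
    FILE *fp = fopen("kernel_input.bin", "rb");
    int n = 0;
    if (!fp || fread(&n, sizeof(int), 1, fp) != 1) return 1;

    int    *flags = malloc(sizeof(int) * (size_t)n);
    double *a     = malloc(sizeof(double) * (size_t)n);
    if (fread(flags, sizeof(int), (size_t)n, fp) != (size_t)n ||
        fread(a, sizeof(double), (size_t)n, fp) != (size_t)n) return 1;
    fclose(fp);

    /* ... the cut-out computation kernel runs here on a and flags,
     *     surrounded by timers for the performance measurement ... */

    free(flags);
    free(a);
    return 0;
}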

