The Art of High Performance Computing for Computational Science, Vol. 2: Advanced Techniques and Examples for Materials Science. Edited by Masaaki Geshi.
Editor
Masaaki Geshi
Osaka University
Toyonaka, Japan
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Preface
This is the second of two volumes written about the basics of parallelization, the foundations of numerical analysis, and related techniques. Although we call these the basics, we do not assume a complete novice in this field; readers who want to learn programming from the very beginning should start with another book suited to that purpose. We assume readers who have studied physics, chemistry, biology, or related fields (earth sciences, space science, weather, disaster prevention, manufacturing, etc.), and who use numerical calculation and simulation as research methods. In particular, we assume those who develop software code. Many of them have not learned systematically about programming and numerical calculation, so from the viewpoint of information science experts, many parts of these volumes are undergraduate-level content.
This Volume 2 presents advanced techniques based on concrete examples of software applications in several fields, in particular the field of materials science. Chapter 1 outlines supercomputers, including a brief explanation of the history of the hardware. Chapter 2 details a program tuning procedure. Chapter 3 describes concrete tuning results on the K computer for several software applications: RSDFT [1] and PHASE [2] (now officially renamed PHASE/0) in materials science, nanoscience, and nanotechnology; Seism3D in earth science; and FrontFlow/Blue in engineering. These chapters are more practical than Chaps. 1–5 of Volume 1. Chapter 4 explains how to reduce the computational cost of density functional theory (DFT) calculations from O(N³) to O(N), the so-called order-N method. This method is implemented in the software application OpenMX [3]. Chapter 5 explains acceleration techniques for classical molecular dynamics (MD) simulations, for example, general techniques for hierarchical parallelization on the latest general-purpose supercomputers, in particular those connected by a three-dimensional torus network. These techniques are implemented in the software application MODYLAS [4]. This chapter also introduces the software application GENESIS [5], which is developed for investigating the long-term dynamics of biomolecules by simulating huge biomolecular systems with efficient structure search methods such as the extended ensemble method. Chapter 6 explains techniques for large-scale quantum chemical calculations.
References
1. https://github.com/j-iwata/RSDFT
2. https://azuma.nims.go.jp/software
3. http://www.openmx-square.org/
4. http://www.modylas.org/
5. https://www.r-ccs.riken.jp/labs/cbrt/
6. http://www.chem.waseda.ac.jp/nakai/?page_id=147&lang=en
7. http://smash-qc.sourceforge.net/
8. https://ma.issp.u-tokyo.ac.jp/en/
Chapter 1
Supercomputers and Application
Performance
Kazuo Minami
In this chapter, we first describe the development of computers and the changes in the usage technologies of supercomputers in Sect. 1.1, and in Sect. 1.2 we describe two important points for developing high-performance applications. Computational science, which elucidates scientific phenomena by using numerical simulation, has long been described as the third science alongside theory and experiment, and in recent years, innovative scientific and technological research and development using supercomputers has been active all over the world. In Japan, the K computer¹ developed by RIKEN² in 2011 won the TOP500³ ranking for two consecutive terms.

¹ http://www.riken.jp/en/research/environment/kcomputer/.
² The Institute of Physical and Chemical Research (http://www.riken.jp/en/).
³ https://www.top500.org/.

K. Minami (B)
RIKEN Center for Computational Science, RIKEN, Kobe, Hyogo, Japan
e-mail: [email protected]
The application of supercomputers in Japan is an innovative way to elucidate var-
ious natural phenomena over a vast scale from the extremely fine quantum world to
the universe, including an enormous number of galaxies, and the discoveries made
are expected to contribute to society. For example, on the very small scale of ten to
the minus several powers of meters, we expect to understand the behavior of viruses,
liposomes (consisting of several hundred thousand atoms), and other organic phe-
nomena through long-running simulations, and this is expected to contribute to the
medical field, inexpensive biofuels, and new energy fields. On a slightly larger scale,
we expect to accelerate innovation in next-generation electronics through design
simulation of entire next-generation semiconductor nanodevices and the creation of
new functional nonsilicon materials such as nanocarbons. On the scale of human
society from several tens of meters to several hundreds of kilometers, we expect to
contribute to detailed disaster-prevention planning by seismic simulation, combin-
ing seismic wave propagation and structure response. On a global scale of several
thousand kilometers to several tens of thousands of kilometers, we expect to present
high-resolution global weather forecasts and accurate predictions of the course and
intensity of typhoons by climate simulation, and to contribute to climate change
research. On the larger scale of more than 10 to the 20th power of meters, we expect
to elucidate cosmic phenomena, such as the generation of stars and analysis of the
behaviors of galaxies.
As described earlier, various applications are expected for supercomputers, but
what is a supercomputer in the first place?
Although there is no clear definition of a supercomputer, it is regarded as a com-
puter with extremely high speed and outstanding computing capacity, compared with
the general computers of its era. For example, a supercomputer is defined in present
government procurement in Japan (2016) as a computer capable of 50 trillion or more
floating point operations per second (50 TFLOPS)⁴; this number is reviewed as necessary. In the mid-1940s, one of the first digital computers, named the ENIAC (an
abbreviation of Electronic Numerical Integrator and Computer), appeared. In 1976,
the CRAY-1, which was described as the world’s first supercomputer, appeared; its
theoretical computing performance was 160 MFLOPS. The performance of a per-
sonal computer using a Pentium IV in 2002 was about 6.4 GFLOPS: about 40 times
the performance of the CRAY-1. At that time, the performance of the Earth Sim-
ulator, which was Japan’s fastest supercomputer in 2002, was 40 TFLOPS, which
was about 250,000 times the performance of the CRAY-1. The K computer, which
achieved the world’s fastest performance in 2011, achieves 10 PFLOPS, about 62.5
million times the performance of CRAY-1.
Computers have achieved these drastic performance improvements, but how?
4 FLOPS denotes a unit of calculation speed. One FLOPS is the execution of one floating point
calculation per second. Thus, 160 MFLOPS is equivalent to 160 million floating point operations
per second.
Early computers had a single processor. After the technology changed from vac-
uum tubes to semiconductors, increases in the operating frequency of the semiconductor devices improved the performance of the CPU and thus of the computer as a whole. The memory was composed of one or more memory
banks, and a memory bank could not be accessed until a certain time had elapsed
after a previous access. The waiting time for memory accesses remains several tens
of nanoseconds even now. However, until the 1970s, because the operating frequency
of the computer was low and the operation of the computing unit was slow, memory
access times were not a major problem. It was an era when the computation speed
was the bottleneck rather than the memory transfer performance.
Since then, while the waiting time for memory access of several tens of nanosec-
onds has not reduced much, the number of cycles the CPU must wait for memory
access has increased with the improvement in CPU operating frequency. Moreover,
because of the miniaturization of the semiconductor process, more computing units
can be mounted in one CPU. As a result, the data transfer capability of the memory cannot keep up with the computing capacity of the computing units, and the effective performance is limited by the memory. This is called the memory wall problem.
Between the latter half of the 1970s and the 1980s, although it was still based on a single
processor, a vector architecture was developed that enabled high-speed computation
using vector pipelines, treating data that could be processed in parallel by paying
attention to the parallelism within loops. To solve the memory wall problem in the
vector computer, the number of memory banks was increased, and the CPU read
data from different memory banks cyclically to supply data to the computing unit
continuously. The problem of the waiting time to access the same memory bank was
thus overcome. By adopting this mechanism, it was possible for the vector computer
to balance the data supply capability of the memory and the calculation ability of the
computing unit.
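As a rough illustration of this bank-interleaving idea, the following toy model (a minimal sketch, not taken from this book; the bank count, busy time, and access stream are assumptions) assigns consecutive elements to banks cyclically, so that a new access can be issued every cycle even though each individual bank needs many cycles before it can be accessed again:

#include <stdio.h>

#define NBANKS    16   /* number of memory banks (assumption)                  */
#define BANK_BUSY 10   /* cycles a bank is unavailable after an access         */

int main(void)
{
    int next_free[NBANKS] = {0};   /* cycle at which each bank becomes free    */
    int cycle = 0;

    for (int i = 0; i < 32; ++i) { /* stream of consecutive element accesses   */
        int bank = i % NBANKS;     /* cyclic assignment of elements to banks   */
        if (cycle < next_free[bank])
            cycle = next_free[bank];          /* stall only if the bank is busy */
        next_free[bank] = cycle + BANK_BUSY;
        printf("element %2d -> bank %2d, issued at cycle %2d\n", i, bank, cycle);
        ++cycle;                   /* one new access can be issued per cycle    */
    }
    return 0;
}

Because NBANKS is at least as large as BANK_BUSY here, the stream never stalls; with fewer banks the same loop would show waiting cycles.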
At that time, the processor’s operating frequency was several tens of MHz or
more, and as described earlier, more computing units could be added because of the
progress in the miniaturization of the semiconductor process. This increase, together
with the increased operating frequency of the computing unit, contributed to the
realization of high-speed vector computers. Although the vector system was fast,
the manufacturing cost and power consumption increased because of the expensive
memory bank mechanism described above.
Around the same time as the development of the vector architecture, the computing unit of the scalar architecture also adopted RISC designs and was pipelined, taking advantage of the increases in the number of processing units and in operating frequency. The scalar architecture evolved into a superscalar architecture with multiple computing units. Furthermore, by using SIMD and other techniques, high-speed computation was made possible by utilizing advanced parallelism hidden in the program.
In the scalar architecture, to cope with the memory wall problem, a countermea-
sure other than the vector architecture was taken: a cache with high data supply
capability was placed between the memory and the computing unit without increas-
ing the number of memory banks. As much data as possible was placed in the
cache, and the data were reused to compensate for the limited data supply capacity of
the memory to the computing unit. Although this method has performance disadvan-
tages, it has benefits in terms of cost and power consumption over the multiple-bank
method of the vector architecture.
The single processor was approaching the limits of improvement of the operating
frequency and the memory wall problem remained. Even if the computing capacity
of the single processor could be increased, the data supply capacity of the memory
could not catch up with the computing capacity. Furthermore, with the increases in
operating frequency, we also faced the problem of power consumption increasing
faster than the improvement in performance. In other words, the limits of performance
improvement of single processors were becoming apparent.
To solve this problem, a parallel architecture appeared in which many single processors are connected by a communication mechanism. Without this development,
it would be impossible to obtain the necessary computing power with realistic power
consumption.
At present, hybrid, massively parallel computers are emerging, in which multiple
calculation cores are built in a processor and thousands to tens of thousands of
processors are connected by a communication network.
Although each node of a supercomputer is basically the same as an ordinary computer, the total computing capacity and performance are extremely high, and high-speed interconnection performance is required. Further, because low power consumption of the total system is required, it is essential that power saving be implemented at the processor level. Because the number of parts constituting the system is very large, extremely high reliability is required of the individual parts and of the total system.
Up to this point, we have explained the development of hardware. As the hardware
evolved, how has its usage changed so that the performance of the hardware can be
fully utilized?
In the early days of computers, with a single processor, the processing speed rather than the memory transfer performance was the bottleneck, and development environments such as high-level languages and compilers emerged. It was common for
researchers and programmers to reproduce formalized and discretized theoretical
model equations in code. High-speed processing was realized by developing com-
pilers that could interpret the parallelism hidden in the program.
From the latter half of the 1970s to the 1980s, when vector architectures used multiple
memory banks to cope with the memory wall problem, the parallel nature of loop
indices was exploited and parallel-processable data were pipelined.
As a programming technique in the age of vector architecture, it was necessary to
guarantee the parallel nature of loops. Eliminating recurrences (loop-carried dependences)
became an essential performance optimization technique.
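As a small illustration (a sketch, not from this book), the first loop below contains a recurrence, because each iteration depends on the result of the previous one, and cannot be processed by vector pipelines, whereas the second loop has independent iterations and can be vectorized:

#include <stdio.h>

#define N 1000

int main(void)
{
    double a[N], b[N];
    for (int i = 0; i < N; ++i) { a[i] = 1.0; b[i] = 2.0; }

    /* Recurrence: a[i] depends on a[i-1], so iterations cannot run in parallel. */
    for (int i = 1; i < N; ++i)
        a[i] = a[i - 1] + b[i];

    /* No recurrence: every iteration is independent and can be vectorized.      */
    for (int i = 0; i < N; ++i)
        b[i] = 2.0 * b[i] + 1.0;

    printf("a[N-1] = %f, b[N-1] = %f\n", a[N - 1], b[N - 1]);
    return 0;
}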
In the scalar architecture, as described earlier, the memory wall problem was addressed by introducing a cache with high data supply capability. In addition, the scalar architecture introduced SIMD and made effective use of simplified instructions with RISC, superscalar execution with multiple operation pipelines, and software pipelining by the compiler.
Similar programming techniques became necessary for the scalar architecture as well. It was necessary to guarantee the parallelism of the loop index, and SIMD vectorization became a performance optimization technique that requires eliminating recurrences. Efficient cache usage techniques also became indispensable.
After the limits of performance improvement of a single processor became apparent and supercomputers changed to a parallel architecture, parallelism among several to several hundred cores in a CPU and among thousands to tens of thousands of CPUs has been realized by expressing it explicitly in programs. In other words, it has become necessary for the programmer to parallelize the code in consideration of the parallelism among the cores and the parallelism among the CPUs, and to program with consideration of the distribution of the data used by the cores and CPUs for calculation. In addition, communication techniques that exploit the network topology between the nodes where the processes are located have become necessary.
As described earlier, modern computers still have the memory wall problem in
which the computing capacity of the computing unit is increased but the data supply
capacity of the memory is relatively insufficient. To cope with this problem, cache
memories (level 1, level 2, and level 3) with high data supply capability are provided,
and the data are placed in the cache and reused many times while performing a
calculation. Thus, compared with programming on older computers, the necessity
of programming with attention to multilevel memory structures such as cache became
obvious. However, many programs cannot reuse data as described here. Because the
capacity of the computing unit cannot then be fully used, programs may require the
use of high-speed data access mechanisms such as prefetch.
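To make this idea concrete, the following is a minimal sketch of loop blocking for a matrix–matrix product (not code from this book; the matrix size and block size are arbitrary assumptions), in which the blocks currently being worked on stay resident in the cache and are reused many times:

#include <stdlib.h>

#define N  1024
#define NB   64   /* block size chosen so that the working blocks fit in cache */

/* Blocked (tiled) matrix-matrix product C += A * B.
   Each NB x NB tile is small enough to remain in cache while it is reused. */
static void dgemm_blocked(const double *A, const double *B, double *C)
{
    for (int ib = 0; ib < N; ib += NB)
        for (int kb = 0; kb < N; kb += NB)
            for (int jb = 0; jb < N; jb += NB)
                for (int i = ib; i < ib + NB; ++i)
                    for (int k = kb; k < kb + NB; ++k) {
                        double a = A[i * N + k];       /* reused across the j loop */
                        for (int j = jb; j < jb + NB; ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
}

int main(void)
{
    double *A = calloc((size_t)N * N, sizeof *A);
    double *B = calloc((size_t)N * N, sizeof *B);
    double *C = calloc((size_t)N * N, sizeof *C);
    if (!A || !B || !C) return 1;
    dgemm_blocked(A, B, C);
    free(A); free(B); free(C);
    return 0;
}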
The two points mentioned in Sect. 1.1, “programming conscious of parallelism” and
“programming conscious of execution performance” must be recognized by users,
researchers, and programmers who use the present supercomputers equipped with
tens of thousands of processors and containing various enhancements and new func-
tions. Thus, high-performance applications require “performance optimization with
high parallelism” and “performance optimization of single CPUs”. Chapters 1–3 treat these as the performance optimization techniques needed to exploit the performance of modern supercomputers.
Parallelization is briefly described first. The basic idea is simple. As shown in Fig. 1.1,
if a problem that is sequentially computed using one processor is computed in parallel
using four processors, the computation should be four times faster and it should be
executed in a quarter of the original calculation time.
In simulations of fluids and structural analysis, a mesh is constructed in the spatial
direction, and calculations are performed for each mesh point. High parallelization
is briefly explained by using this example.
To parallelize the calculation, the mesh is divided into multiple regions. These
regions are distributed among the processors and the calculations are performed in
parallel. Such a parallelization method is called a domain decomposition method
and is depicted in Fig. 1.2. In this figure, after executing the calculation using four
processors, data are exchanged using the communication network to achieve con-
sistency of the calculations proceeding in parallel; these steps are then repeated to
continue the calculation. As described earlier, in the parallel computation by domain
decomposition, adjacent communications are performed to exchange the data of a
part of the domain with the adjacent processors. When an inner-product calculation
is performed over all the domains, global communication is required to obtain the
sum of the data for all processors. An important point in achieving high parallelism
is to minimize the amounts of adjacent and global communications mentioned here.
It is also important to make the calculation times for each processor as equal as
possible. Differences in calculation time are called load imbalances.
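As a minimal illustration of this scheme (a sketch under assumptions, not code from any of the applications discussed in this book; the one-dimensional decomposition, mesh size, and update formula are placeholders), each process updates its own subdomain, exchanges boundary data with its neighbors by adjacent communication, and uses global communication for an inner product:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NLOCAL 1000            /* mesh points owned by each process (assumption) */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* local array with one halo point on each side */
    double *u = calloc(NLOCAL + 2, sizeof *u);
    for (int i = 1; i <= NLOCAL; ++i) u[i] = (double)rank;

    int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

    for (int step = 0; step < 10; ++step) {
        /* adjacent communication: exchange halo values with neighbors */
        MPI_Sendrecv(&u[NLOCAL], 1, MPI_DOUBLE, right, 0,
                     &u[0],      1, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[1],          1, MPI_DOUBLE, left,  1,
                     &u[NLOCAL + 1], 1, MPI_DOUBLE, right, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* local computation on the subdomain (placeholder 3-point update) */
        for (int i = 1; i <= NLOCAL; ++i)
            u[i] = 0.5 * u[i] + 0.25 * (u[i - 1] + u[i + 1]);
    }

    /* global communication: inner product over all domains */
    double local = 0.0, global = 0.0;
    for (int i = 1; i <= NLOCAL; ++i) local += u[i] * u[i];
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) printf("global inner product = %f\n", global);
    free(u);
    MPI_Finalize();
    return 0;
}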
[Fig. 1.1 Sequential calculation on one processor versus parallel calculation on four processors over the elapsed time. Fig. 1.2 Domain decomposition: each of the four processors repeats a calculation time followed by a communication time.]
[Figure residue: matrix–matrix product, 2N³ calculations on (a) N² and (b) N² pieces of data. Fig. 1.5 Matrix–vector product, 2N² calculations on (a) N² and (b) N pieces of data.]
In the other type of application, the data transfer requests from memory are large compared with the number of floating point operations required to execute the application. These calculations are called calculations with large required B/F values. Such calculations have problems in effectively using the high performance of the CPU because it is difficult to use the cache effectively. For example, for the matrix–vector product calculation shown in Fig. 1.5, the B/F value is in principle approximately 1/2 when the data movement is expressed by the number of elements. For a double precision calculation, the movement amount is multiplied by 8 bytes, so the B/F value becomes 8 × 1/2 = 4; therefore, the B/F value is large compared with that of the matrix–matrix product. In this way, the required B/F value of the application is an important viewpoint for improving the performance of a single CPU.
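The following small calculation (a sketch; the matrix dimension n is an arbitrary assumption) makes this comparison explicit by computing the required B/F value, that is, the bytes moved from memory per floating point operation, for a matrix–vector product and a matrix–matrix product:

#include <stdio.h>

int main(void)
{
    double n     = 1000.0;   /* matrix dimension (assumption)                     */
    double bytes = 8.0;      /* double precision element size                     */

    /* matrix-vector product: load ~n*n + n elements, perform 2*n*n operations    */
    double bf_mv = (n * n + n) * bytes / (2.0 * n * n);

    /* matrix-matrix product: load ~2*n*n elements, perform 2*n*n*n operations    */
    double bf_mm = (2.0 * n * n) * bytes / (2.0 * n * n * n);

    printf("required B/F, matrix-vector: %.3f\n", bf_mv);  /* approximately 4     */
    printf("required B/F, matrix-matrix: %.3f\n", bf_mm);  /* about 8/n, tiny     */
    return 0;
}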
Exercises
1. Describe the memory wall problem, which is an important problem for single
processors, and describe the characteristics of recent supercomputers.
2. There are various benchmark tests (BMTs) to evaluate the performance of super-
computers. The most famous BMT is the TOP500 (https://www.top500.org/), which ranks systems by the measured performance of LINPACK, but there are others as well.
For some other BMTs, discuss the relationship between the evaluation method
and the field in which the evaluation is important.
Chapter 2
Performance Optimization
of Applications
Kazuo Minami
The performance evaluation of an application is divided into two parts: “highly parallel performance optimization” and “single-CPU performance optimization.” For each part, the performance evaluation method has two working phases: “current state recognition” and “understanding the problems.” The “current state recognition” phase is common to both parts and is divided into “source code investigation,” for analyzing the structure of the source code, and “measurement of elapsed time,” to understand the current state of application performance. The final procedure in “current state recognition” is “calculation/communication kernel analysis,” in which we evaluate the results of the “source code investigation” and the “measurement of elapsed time.” The next working phase, “understanding the problems,” begins with the “problem evaluation method.” In the “problem evaluation method” for “highly parallel performance optimization,” the problems related to high parallelization are classified into six patterns. In the “problem evaluation method” for “single-CPU performance optimization,” applications are also classified into six patterns.
Our approach is summarized in Table 2.1.
K. Minami (B)
RIKEN Center for Computational Science, RIKEN, Kobe, Hyogo, Japan
e-mail: [email protected]
As the first step in current state recognition, we investigate the source code of the
application. We investigate the structure of the source code and analyze the call struc-
ture of subroutines and functions. We also analyze the subroutines, the loop structure
in the functions, and the control structure of the IF blocks, and organize and visualize
the structure of the entire program. The visualized source code is divided into blocks
of calculation and communication processing according to the algorithms of physics
and mathematics used in the program, and the blocks are organized. We understand
the physical/mathematical processing content of each processing block. By com-
paring these aspects of the processing blocks with the results of the investigated
source code, the calculation characteristics for each calculation block are obtained.
The calculation characteristics describe the processing of a calculation block as non-
parallel, completely parallel, or partially parallel, and identify the calculation index
(e.g., number of atoms or number of meshes), whether the calculation amount in the
calculation block is proportional to N or proportional to N² when the calculation
index is N, and so on. We also investigate the communication characteristics of each
communication block: whether the processing of the communication block is global
communication, adjacent communication, or whether the communication amount
depends on the calculation index. These investigations are shown in Fig. 2.1.
The purpose of the investigation of the source code is to understand the charac-
teristics of each processing block in the program. However, the visualization of the
loop structure from the start to the end of the program and that of the entire control
structure of the IF blocks mentioned here are large tasks if done manually. Therefore,
we use a visualization tool for program structure, such as K-scope [1, 2].
As described in Sect. 1.2.1, in parallel computation by domain decomposition, adjacent communication is performed at each step to exchange data for parts of areas with neighboring processors. In addition,
when calculating inner products of scalar values for all areas, global communication
between all processors is required. An important point in achieving high parallelism
is to make the adjacent and global communication times as small as possible.
As described in Sect. 1.2.1, in parallel computation it is important to make the nonparallel computing parts as small as possible, just as it is important to reduce the communication time.
In conducting application performance measurement as the next step of current
status recognition, it is important to conduct performance measurements that clarify
the parallel characteristics of applications: specifically, what kind of behaviors the
adjacent global communication times display during the highly parallel calculation,
which calculation parts are nonparallel, and how the nonparallel parts influence the
application’s behavior in highly parallel execution. For clarification of parallel char-
acteristics, where possible, the performance measurement is carried out as follows.
First, we define the problem to be solved, determine the number of parallel paths in the problem, and create a test problem that has the same problem size per processor as the target problem and that can be run at several degrees of parallelism.
Next, we perform the performance measurement using the prepared test problem. In
the performance measurement, the execution time is measured for each process for
each calculation block and communication block, as defined in the previous section.
The parallel characteristics during parallel computation cannot be fully clarified by
measuring the entire application. Each processing block’s influence on the paral-
lel characteristics differs depending on whether it includes a nonparallel part and
the number of parallel paths, and whether the communication time changes. There-
fore, it is essential to measure the performance of each processing block for each
process separately. These measurements allow us to identify the processing blocks
that degrade parallel performance. In addition, because the communication behav-
ior during parallel execution differs between adjacent and global communication,
it is necessary to measure them separately. The adjacent communication time has
the same value if the communication amount is the same, as described later, but
the global communication time tends to increase as the number of parallel paths
increases, even if the communication volume stays the same. Furthermore, because
communication times may include waiting times caused by load imbalance, it is also
important to measure the waiting time and the net communication time separately,
thus allowing us to distinguish whether the problem is caused by communication or
load imbalance. With respect to computation, simultaneously with the computation
time, the amount of computation and the computation performance are also measured
for each processing block in each process.
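As a minimal sketch of this kind of per-block measurement (an illustration under assumptions, not the actual measurement tooling used on the K computer), the waiting time caused by load imbalance can be separated from the net communication time by timing a barrier placed immediately before the communication call:

#include <mpi.h>
#include <stdio.h>

/* Time one communication block, separating load-imbalance waiting time from
   the net communication time. calc() stands for the preceding calculation block. */
static void calc(int rank) { (void)rank; /* placeholder for the calculation block */ }

int main(int argc, char **argv)
{
    int rank;
    double local = 1.0, global = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    calc(rank);                                   /* calculation block             */

    double t0 = MPI_Wtime();
    MPI_Barrier(MPI_COMM_WORLD);                  /* absorbs the load imbalance    */
    double t1 = MPI_Wtime();
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, /* global communication block    */
                  MPI_SUM, MPI_COMM_WORLD);
    double t2 = MPI_Wtime();

    printf("rank %d: waiting time %.6f s, net communication time %.6f s\n",
           rank, t1 - t0, t2 - t1);

    MPI_Finalize();
    return 0;
}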
The analysis of the source code shows the correspondence between the physi-
cal/mathematical processing contents of each processing block and the source code,
and the calculation characteristics of each calculation block and the communica-
tion characteristics of each communication block. By matching these results with
measurement results, the calculation kernel and the communication kernel can be
identified.
For example, suppose there is a parameter N that determines the amount of com-
putation. Assume that the coefficient of computation amount proportional to the third
power of N is m1, the coefficient of computation amount proportional to N is m2,
and that m2 is considerably larger than m1. When N is relatively small, the amount
of computation for the two parts may be about the same. However, as N increases,
the amount of computation for the part proportional to the third power of N becomes
significantly larger, and the amount of computation for the part proportional to N
may become negligible.
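A small numerical illustration (the coefficients are arbitrary assumptions) shows this crossover between the part proportional to N and the part proportional to the third power of N:

#include <stdio.h>

int main(void)
{
    double m1 = 1.0, m2 = 1.0e6;   /* assumed coefficients, with m2 >> m1        */

    for (int k = 1; k <= 5; ++k) {
        double N = 10.0;
        for (int i = 1; i < k; ++i) N *= 10.0;   /* N = 10, 100, ..., 100000     */
        double cubic  = m1 * N * N * N;          /* part proportional to N^3     */
        double linear = m2 * N;                  /* part proportional to N       */
        printf("N = %8.0f: m1*N^3 = %.3e, m2*N = %.3e\n", N, cubic, linear);
    }
    return 0;
}

For small N the linear term dominates, the two are comparable near N = 1000, and for large N the cubic term is overwhelmingly larger.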
Both the amount of computation and the computation time also vary depending on the level of performance¹ that can be obtained relative to the theoretical peak performance. Essentially nonparallel parts may also remain because of the adopted parallelization method. By considering in this way the size of the parameters of the problem to be solved, the parallelization method used, the parallelization method that may be adopted in the future, the prospects for effective performance, and so on, the kernels to be evaluated are determined (see Fig. 2.2). The kernels selected here can be reviewed at later stages of the evaluation.

¹ Performance obtained by dividing the measured amount of computation by the execution time.
We explain how to evaluate the problems of high parallelism by carrying out the
measurements shown in Sect. 2.3 and how to measure parallel performance from
several parallel processes to about 100, about 1000, or several thousand, step by
step. There are two kinds of methods for measuring the performance by gradually
increasing the number of parallel processes: strong scaling measurement and weak
scaling measurement. Strong scaling measurement is a method of fixing the scale of the problem to be solved and increasing the number of parallel processes: for example, if the problem scale is fixed at N = 10,000, then as the number of parallel processes grows from 1 to 2 to 4, the problem size per processor shrinks from 10,000 to 5000 to 2500, and so on. In contrast, weak scaling measurement is a method of fixing the scale of the problem solved by each processor and increasing the number of parallel processes. For example, if the problem size per processor is fixed at N = 1000, then with 2 parallel processes the total problem scale is N = 2000, and with 4 parallel processes it increases to N = 4000. The feature of weak scaling measurement is that, ideally, even when the number of parallel processes is increased, the execution times of the computation parts and of the adjacent communications do not change, because each process performs the same computation and the adjacent communication amount is unchanged. When a nonparallel part is included in the computation, a significant increase in computation time should be measured as the number of parallel processes becomes large in weak scaling measurement.
For example, assume that the execution time of the parallelizable part during sequential execution is Tp and the execution time of the nonparallelizable part during sequential execution is Ts. The execution time T0 during sequential execution is represented by T0 = Tp + Ts. The execution time when this problem is multiplied by N and executed sequentially is represented by N × T0 = N × Tp + N × Ts. When this problem is executed in N parallel processes, it corresponds to what we performed with weak scaling. If the execution time when executed in N parallel processes is Twn, the parallelizable portion becomes N times faster but the nonparallelizable portion does not become faster, so Twn = Tp + N × Ts, and the term N × Ts increases. Incidentally, if Tsn is the execution time when run with strong scaling, then Tsn = Ts + Tp/N.
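The following short calculation (a sketch; Tp and Ts are arbitrary assumed values) evaluates these expressions and shows how the term N × Ts comes to dominate the weak scaling time Twn, while the strong scaling time Tsn saturates at Ts:

#include <stdio.h>

int main(void)
{
    double Tp = 100.0;   /* parallelizable part, sequential execution (assumed)    */
    double Ts = 1.0;     /* nonparallelizable part, sequential execution (assumed) */

    printf("%8s %12s %12s\n", "N", "Twn (weak)", "Tsn (strong)");
    for (int N = 1; N <= 10000; N *= 10) {
        double Twn = Tp + N * Ts;      /* weak scaling: problem size grows with N  */
        double Tsn = Ts + Tp / N;      /* strong scaling: fixed total problem size */
        printf("%8d %12.2f %12.2f\n", N, Twn, Tsn);
    }
    return 0;
}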
Even when the adjacent communication time increases in accordance with the
number of parallel processes, it is easy to see that there are some problems in the
corresponding adjacent communications. The global communication time generally
increases in accordance with the number of parallel processes, and the increase can
be predicted from the data on the basic communication performance by comparing
the degree of increase with the predicted value. This can show whether there are
some problems in the corresponding global communications.
The method described here is shown in Fig. 2.3. The reason for using weak
scaling measurement in this way is that it is easy to find problems. However, in weak
scaling measurement, it is necessary to prepare separate execution data according
to the number of parallel processes, which may be troublesome. In a simulation in
which the amount of computation is proportional to the second or third power of
the problem size N, weak scaling measurement is sometimes difficult. In such a
case, strong scaling measurement is performed. For strong scaling measurement, it
is necessary to model the computation and communication times with the number
of parallel processes as a parameter, to predict these, and to compare the predictions
with the actual measured times so as to find any nonparallel parts or communication
problems. However, unlike weak scaling measurement, it is not necessary to prepare separate execution data according to the number of parallel processes.
[Fig. 2.3; the horizontal axis of the figure is the parallel number.]
In the HPCC benchmark, applications are classified by using two axes. The first axis
is defined by the locality versus nonlocality in the spatial direction of the data divided
among the processors. The second axis is defined by the locality versus nonlocality
in the temporal direction of data in the processors [3].
In addition, a study of application classification, the “Berkeley 13 dwarfs,” classi-
fied applications by the two axes of the communication and calculation patterns [4].
In this study, applications were classified among seven dwarfs in the HPC field, and
13 dwarfs by adding other fields.
In promoting performance optimization, we also classify the application and orga-
nize the execution performance optimization methods for applications based on the
classification. For high parallelism, the locality versus nonlocality of the data is
considered in the HPCC as one axis, and in the Berkeley 13 dwarfs, the pattern of
communication is considered as one axis. In this section, we focus on the kinds of
problems that occur and how we deal with those problems when optimizing the per-
formance of existing applications, and we classify them according to highly parallel
patterns. The problems relating to high parallelism are classified into six patterns, as
shown in Table 2.2.
The main problems relating to high parallelism are caused by calculations and
communication. The first, second, and sixth problems are caused by calculation, and
the third, fourth, and fifth problems are caused by communication. The six patterns
are described as follows.
The first pattern is the mismatch of the degree of parallelism between applications
and hardware. Researchers want to solve a problem within a certain time; suppose
that to do so, it is necessary to use tens of thousands of parallel nodes on a super-
computer. For example, the K computer makes it possible to use more than 80,000
parallel nodes in terms of parallelism of the hardware. However, sometimes only
thousands of parallel nodes can be used because of the limitations of the application
parallelization. This is the mismatch of degree of parallelism between the application
and the hardware. When approaching the limitation of the parallelism of the appli-
cation, the computation time becomes extremely small, whereas the proportion of
communication time increases, leading to a deterioration of the parallel efficiency.
The second pattern is the presence of nonparallel parts. As mentioned at the begin-
ning of this chapter, we can see that the parallel performance deteriorates because
of Amdahl’s law if nonparallel parts remain in the computation. Here, assuming that
the execution time of a certain application at the time of sequential execution is Ts
and the parallelization rate of the application is α, the nonparallelization ratio of the
application is 1 – α. When this application is executed using n parallel processes, the
execution time Tn is expressed as Tn = Ts (α/n + (1 – α)). For a parallelization effi-
ciency of 50%, the parallelization ratio α is required to be 99.99% when n = 10,000.
The easiest way to find remaining nonparallel parts is to measure the increase in
execution time of the calculation part using weak scaling measurement as described
above.
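A quick check of this statement (a minimal sketch; the efficiency target and process count follow the text above) computes the parallel efficiency Ts/(n × Tn) from Tn = Ts(α/n + (1 − α)):

#include <stdio.h>

int main(void)
{
    double alpha = 0.9999;   /* parallelization ratio of 99.99%                    */
    double n     = 10000.0;  /* number of parallel processes                       */

    /* Tn / Ts = alpha / n + (1 - alpha); efficiency = Ts / (n * Tn)               */
    double tn_over_ts = alpha / n + (1.0 - alpha);
    double efficiency = 1.0 / (n * tn_over_ts);

    printf("alpha = %.4f, n = %.0f -> parallel efficiency = %.3f\n",
           alpha, n, efficiency);   /* prints approximately 0.5                    */
    return 0;
}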
The third pattern is the occurrence of large communication sizes and frequent
global communication. Communication times, particularly for global communica-
tions, have a large impact on the parallel performance. Consider an example of
implementing the ALLREDUCE communication of M (bytes) between N nodes.
Assume that the ALLREDUCE communication is performed using a binary tree
As mentioned in Sect. 2.5, the developers of the HPCC benchmark [3] and the
Berkeley 13 dwarfs [4] classified applications. For the HPCC, applications were
classified using locality versus nonlocality of data in the temporal direction with
regard to the single-CPU performance. For the Berkeley 13 dwarfs, applications
were classified using the calculation pattern.
Similarly, in promoting the study of performance optimization, we also classify
applications and organize the application execution performance optimization tech-
niques based on the classification. In Sect. 1.2, from the viewpoint of the single-CPU
performance, we mentioned that applications can roughly be classified into two types,
one with a low required B/F value and one with a high required B/F value. This idea
is close to the classification used for the HPCC. In this section, we will develop this
view and show the classification of applications into six types as shown in Table 2.3.
The calculations for which the required B/F value is small are the first to the fourth
types. The performance greatly varies depending on whether the DGEMM library
or manual cache blocking can be used, even for calculations with small required
B/F values. When cache blocking can be used, the performance varies depending
on whether the data structure and loop structure are simple, or the data structure is
slightly complicated such as using list vector indexing by integer arrays. Applications
with more complex loop structures often fail to achieve high performance. These
considerations led to the four types of calculations with small required B/F values.
The first type includes applications that can be rewritten as matrix–matrix product calculations. This type has small required B/F values because, in principle, a calculation proportional to the third power of n can be performed after loading data of size n squared from memory. An example of this type is first-principles quantum calculation based on density functional theory.
The second type includes applications that allow cache blocking although they
are not rewritable to the matrix–matrix product, but still have small required B/F
values. The calculation of the Coulomb interactions of molecular dynamics and the
calculation of the gravity interaction of the gravitational multiple-body problem are
examples. In both cases, by loading the data for n particles and performing cache
blocking, calculations proportional to the square of n can be performed, so that the
required B/F value is small. This type often uses list vector indexing by integer arrays
for the particle access, and the loop body is somewhat complicated.
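As a minimal sketch of this kind of blocking for pairwise interactions (an illustration under assumptions; the interaction formula, particle count, and block size are placeholders, and this is not code from MODYLAS or any application named in this book), a block of particles is loaded once and reused against all other particles:

#include <stdio.h>
#include <stdlib.h>

#define NP 4096          /* number of particles (assumption)                    */
#define NB  256          /* block size chosen so a block stays in cache         */

/* Blocked evaluation of pairwise interactions: the j block is loaded once and
   reused for every particle i, so roughly n^2 operations are performed per n
   data loaded, giving a small required B/F value. */
int main(void)
{
    double *x = malloc(NP * sizeof *x);
    double *f = calloc(NP, sizeof *f);
    if (!x || !f) return 1;
    for (int i = 0; i < NP; ++i) x[i] = (double)i;

    for (int jb = 0; jb < NP; jb += NB)          /* block of "source" particles */
        for (int i = 0; i < NP; ++i)             /* all "target" particles      */
            for (int j = jb; j < jb + NB; ++j) {
                if (i == j) continue;
                double r = x[i] - x[j];
                f[i] += r / (r * r * r);         /* placeholder interaction     */
            }

    printf("f[0] = %f\n", f[0]);
    free(x); free(f);
    return 0;
}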
The third type contains examples such as special high-precision stencil calcula-
tions, which make it possible to use the cache effectively, so the required B/F value
is small and the loop body is a simple calculation. Although this type of calculation
gives good performance, unfortunately there are few examples.
In the fourth type of calculations, the required B/F value is small, but the loop
body is complex. Some weather calculations have mechanical processes to calculate
the motion of a fluid and physical processes to calculate the microphysics of clouds;
this physical process corresponds to the fourth type of calculation. By using small
amounts of data loaded from the memory, complex and in-cache calculations are
performed, but the loop body tends to be long and complicated. The calculation of
the PIC (particle-in-cell) method used for plasma calculations is also of this type. In this technique,
although the mesh data around the particle are cached, list vector indexing by integer
array is commonly used to access the particle data, resulting in complex program
codes. The body of the calculation loop also tends to be long. For this type, we expect
high performance because the data are cached, but in many cases we cannot obtain
the expected performance because of the complexity of the program code.
The fifth and sixth types of calculation have high required B/F values. Even for
program codes that have the same high required B/F values, the performance varies
greatly, depending on whether discontinuous access to lists is required. This is the
basis for classifying calculations with high required B/F values into the fifth and
sixth types.
The fifth type of calculation has a high required B/F value and does not use list accesses. There are many calculations of this type among the usual stencil calculations,
and there are many other examples such as the dynamic processes in weather cal-
culations described earlier, fluid calculations and calculations of earthquakes. The
sixth type of computation has high required B/F values and uses list accesses. Such
calculations occur frequently in engineering; examples are structural analysis and
fluid calculations using finite-element methods. List accessing is the weak point for
the modern scalar computer architecture because random accesses are required for
each element.
In general, single-CPU performance decreases in the order from type 1 to type 6
calculations. However, there is usually little difference between types 2 and 3.
First, we cut out the calculation kernel to form an independent test program that can be executed in one process. In cutting out the kernel, the following steps are carried out.
(A) Dump the necessary data while executing the original program, to prepare the arrays and other data required to run the test program (a minimal sketch is given below). The data used by conditional statements are important. For the data used only in the calculation, when performance alone is at issue, appropriate data may be set without using dumped data.
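A minimal sketch of step (A) is the following (an illustration under assumptions, not code from the applications discussed here; the file name and array size are placeholders): the original program dumps the kernel's input array to a binary file, and the cut-out test program reads it back before calling the kernel.

#include <stdio.h>
#include <stdlib.h>

#define NDATA 1024   /* array size used by the kernel (assumption) */

/* In the original program: dump the input array of the kernel to a binary file. */
static void dump_array(const char *fname, const double *a, size_t n)
{
    FILE *fp = fopen(fname, "wb");
    if (!fp) { perror(fname); exit(1); }
    fwrite(a, sizeof *a, n, fp);
    fclose(fp);
}

/* In the cut-out test program: read the dumped data back before calling the kernel. */
static void load_array(const char *fname, double *a, size_t n)
{
    FILE *fp = fopen(fname, "rb");
    if (!fp) { perror(fname); exit(1); }
    if (fread(a, sizeof *a, n, fp) != n) { fprintf(stderr, "short read\n"); exit(1); }
    fclose(fp);
}

int main(void)
{
    double *a = malloc(NDATA * sizeof *a);
    if (!a) return 1;
    for (size_t i = 0; i < NDATA; ++i) a[i] = (double)i;  /* stands in for real data */

    dump_array("kernel_input.bin", a, NDATA);   /* done in the original program      */
    load_array("kernel_input.bin", a, NDATA);   /* done in the independent test code */

    /* ... call the cut-out kernel here and measure its performance ... */
    printf("a[NDATA-1] = %f\n", a[NDATA - 1]);
    free(a);
    return 0;
}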