Masaaki Geshi (Editor)

The Art of High Performance Computing for Computational Science, Vol. 2
Advanced Techniques and Examples for Materials Science
Editor
Masaaki Geshi
Osaka University
Toyonaka, Japan

ISBN 978-981-13-9801-8 ISBN 978-981-13-9802-5 (eBook)


https://doi.org/10.1007/978-981-13-9802-5
© Springer Nature Singapore Pte Ltd. 2019
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, expressed or implied, with respect to the material contained
herein or for any errors or omissions that may have been made. The publisher remains neutral with regard
to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Preface

This is the second of two volumes on the basics of parallelization, the foundations of numerical analysis, and related techniques. Although we describe these topics as foundations, we do not assume complete novices in this field; readers who wish to start from the very beginning of programming should first consult a more introductory book. We assume readers with a background in physics, chemistry, biology, or related fields (earth sciences, space science, weather, disaster prevention, manufacturing, etc.) who use numerical calculation and simulation as research methods, and in particular those who develop software code. Many such readers have not studied programming and numerical calculation systematically, so, from the viewpoint of information science experts, many parts of this book cover undergraduate-level material.
This Volume 2 presents advanced techniques based on concrete software applications in several fields, in particular materials science. Chapter 1 outlines supercomputers, including a brief history of the hardware. Chapter 2 details a program tuning procedure. Chapter 3 describes concrete tuning results on the K computer for several software applications: RSDFT [1] and PHASE [2] (now officially renamed PHASE/0) in materials science, nanoscience, and nanotechnology; Seism3D in earth science; and FrontFlow/Blue in engineering. These chapters are more practical than Chaps. 1–5 of Volume 1. Chapter 4 explains how to reduce the computational cost of density functional theory (DFT) calculations from O(N³) to O(N), the so-called order-N method, which is implemented in the software application OpenMX [3]. Chapter 5 explains acceleration techniques for classical molecular dynamics (MD) simulations, for example, general techniques for hierarchical parallelization on the latest general-purpose supercomputers, in particular those connected by a three-dimensional torus network. These techniques are implemented in the software application MODYLAS [4]. This chapter also introduces the software application GENESIS [5], which is developed for investigating the long-term dynamics of biomolecules by simulating huge biomolecular systems with efficient structure search methods such as the extended ensemble method. Chapter 6 explains techniques for large-scale quantum chemical calculations
including the order-N method, which are implemented in the software applications DC-DFTB-K [6] and SMASH [7]. You can download and use some of these applications. MateriApps [8] is useful for finding software applications in the field of materials science; this website introduces applications from around the world as well as those made in Japan.
This book is a revised and updated English edition of the Japanese book The Art of High Performance Computing for Computational Science 2 (Masaaki Geshi, Ed.), published by Osaka University Press in 2017. It is based on the lectures "Advanced Computational Science A" and "Advanced Computational Science B", which have been broadcast to as many as 17 campuses through videoconference systems since 2013. All the texts and videos are published on websites (in Japanese only). These lectures were part of the human resource development programs of the Computational Materials Science Initiative (CMSI), organized under Field 2 <New materials/energy creation> of the Strategic Programs for Innovative Research (SPIRE) of the Ministry of Education, Culture, Sports, Science and Technology, the so-called K computer project. The lectures aim to contribute to developing young researchers, centering on basic techniques that will remain valid for a long time even as computers progress day by day. They have been offered by the Institute for NanoScience Design, Osaka University, and have gathered up to about 150 participants per lecture; total participation has exceeded 6,500 people over the past 6 years. The videos of the lectures and the lecture materials are open to the public on the web, so anyone can learn the content at any time (in Japanese). We now use a cloud-based video meeting service that connects many users across different devices, making it easier for more people to participate. The lectures continue to be distributed, with a slightly changed organizational structure, even after the project has ended.
As the editor, I would like to express my deep appreciation to the authors who cooperated in writing this series of books. I would also like to thank the staff of Springer for publishing the English version. I believe that the techniques cultivated in Japan's K computer project contain much that will be useful for future HPC, and I hope they will be shared with the world and contribute to the development of HPC and of science.

Osaka, Japan
April 2019
Masaaki Geshi

References

1. https://github.com/j-iwata/RSDFT
2. https://azuma.nims.go.jp/software
3. http://www.openmx-square.org/
4. http://www.modylas.org/
5. https://www.r-ccs.riken.jp/labs/cbrt/
6. http://www.chem.waseda.ac.jp/nakai/?page_id=147&lang=en
7. http://smash-qc.sourceforge.net/
8. https://ma.issp.u-tokyo.ac.jp/en/
Contents

1 Supercomputers and Application Performance (Kazuo Minami)
2 Performance Optimization of Applications (Kazuo Minami)
3 Case Studies of Performance Optimization of Applications (Kazuo Minami and Kiyoshi Kumahata)
4 O(N) Methods (Taisuke Ozaki)
5 Acceleration of Classical Molecular Dynamics Simulations (Y. Andoh, N. Yoshii, J. Jung and Y. Sugita)
6 Large-Scale Quantum Chemical Calculation (Kazuya Ishimura and Masato Kobayashi)
Index
Chapter 1
Supercomputers and Application Performance

Kazuo Minami

Abstract Before the advent of modern supercomputers, single processors were approaching the limit of improvement in operating frequency, and there was a memory wall problem: even if the computing capacity of a single processor could be increased, the data supply capacity of the memory could not match it. Increasing the operating frequency also caused power consumption to grow faster than performance improved. In other words, the limit of performance improvement of a single processor was becoming apparent. To solve these problems, parallel architectures, in which many single processors are connected by a communication mechanism, have been adopted. Two points, "programming conscious of parallelism" and "programming conscious of execution performance", are very important for users, researchers, and programmers seeking to make effective use of present supercomputers equipped with tens of thousands of processors.

1.1 What Is a Supercomputer?

In this chapter, we first describe the development of computers and the changes in the usage technologies of supercomputers (Sect. 1.1), and then describe two important points for developing high-performance applications (Sect. 1.2). Computational science, which elucidates scientific phenomena by using numerical simulation, has long been described as the third science alongside theory and experiment, and in recent years, innovative scientific and technological research and development using supercomputers has been active all over the world. In Japan, the K computer¹ developed by RIKEN² in 2011 won the TOP500³ ranking for two consecutive terms.

1 http://www.riken.jp/en/research/environment/kcomputer/.
2 The Institute of Physical and Chemical Research (http://www.riken.jp/en/).
3 https://www.top500.org/.

K. Minami (B)
RIKEN Center for Computational Science, RIKEN, Kobe, Hyogo, Japan
e-mail: [email protected]

© Springer Nature Singapore Pte Ltd. 2019
M. Geshi (ed.), The Art of High Performance Computing for Computational Science, Vol. 2,
https://doi.org/10.1007/978-981-13-9802-5_1
In Japan, supercomputers are applied in innovative ways to elucidate various natural phenomena over a vast range of scales, from the extremely fine quantum world to the universe with its enormous number of galaxies, and the discoveries made are expected to contribute to society. For example, on the very small scale of ten to
the minus several powers of meters, we expect to understand the behavior of viruses,
liposomes (consisting of several hundred thousand atoms), and other organic phe-
nomena through long-running simulations, and this is expected to contribute to the
medical field, inexpensive biofuels, and new energy fields. On a slightly larger scale,
we expect to accelerate innovation in next-generation electronics through design
simulation of entire next-generation semiconductor nanodevices and the creation of
new functional nonsilicon materials such as nanocarbons. On the scale of human
society from several tens of meters to several hundreds of kilometers, we expect to
contribute to detailed disaster-prevention planning by seismic simulation, combin-
ing seismic wave propagation and structure response. On a global scale of several
thousand kilometers to several tens of thousands of kilometers, we expect to present
high-resolution global weather forecasts and accurate predictions of the course and
intensity of typhoons by climate simulation, and to contribute to climate change
research. On the larger scale of more than 10 to the 20th power of meters, we expect
to elucidate cosmic phenomena, such as the generation of stars and analysis of the
behaviors of galaxies.
As described earlier, various applications are expected for supercomputers, but
what is a supercomputer in the first place?
Although there is no clear definition of a supercomputer, it is regarded as a com-
puter with extremely high speed and outstanding computing capacity, compared with
the general computers of its era. For example, a supercomputer is defined in present
government procurement in Japan (2016) as a computer capable of 50 trillion or more
floating point operations per second (50 TFLOPS);⁴ this number is reviewed as necessary. In the mid-1940s, one of the first digital computers, named the ENIAC (an
abbreviation of Electronic Numerical Integrator and Computer), appeared. In 1976,
the CRAY-1, which was described as the world’s first supercomputer, appeared; its
theoretical computing performance was 160 MFLOPS. The performance of a per-
sonal computer using a Pentium IV in 2002 was about 6.4 GFLOPS: about 40 times
the performance of the CRAY-1. At that time, the performance of the Earth Sim-
ulator, which was Japan’s fastest supercomputer in 2002, was 40 TFLOPS, which
was about 250,000 times the performance of the CRAY-1. The K computer, which
achieved the world’s fastest performance in 2011, achieves 10 PFLOPS, about 62.5
million times the performance of CRAY-1.
Computers have achieved these drastic performance improvements, but how?

4 FLOPS denotes a unit of calculation speed. One FLOPS is the execution of one floating point calculation per second. Thus, 160 MFLOPS is equivalent to 160 million floating point operations per second.

Early computers had a single processor. After the technology changed from vacuum tubes to semiconductors, increases in the operating frequency of semiconductor devices drove CPU performance and thus overall computer performance. The memory was composed of one or more memory banks, and a memory bank could not be accessed again until a certain time had elapsed after a previous access. The waiting time for memory accesses remains several tens of nanoseconds even now. However, until the 1970s, because the operating frequency of the computer was low and the computing unit was slow, memory access times were not a major problem: computation speed, rather than memory transfer performance, was the bottleneck.
Since then, while the waiting time for memory access of several tens of nanoseconds has not decreased much, the number of cycles the CPU must wait for a memory access has increased with the improvement in CPU operating frequency. Moreover, because of the miniaturization of the semiconductor process, more computing units can be mounted in one CPU. As a result, the data transfer capability of the memory falls short of the computing capacity of the computing units. This is called the memory wall problem.
Between the latter half of the 1970s and the 1980s, vector architectures were developed that, although still based on a single processor, enabled high-speed computation using vector pipelines by exploiting the parallelism within loops to process independent data elements. To solve the memory wall problem in the vector computer, the number of memory banks was increased, and the CPU read data from different memory banks cyclically to supply data to the computing unit continuously. The waiting time for repeated access to the same memory bank was thus hidden. By adopting this mechanism, the vector computer could balance the data supply capability of the memory with the calculation capability of the computing unit.
At that time, the processor's operating frequency was several tens of MHz or more, and, as described earlier, more computing units could be added because of progress in the miniaturization of the semiconductor process. This increase, together with the increased operating frequency of the computing unit, contributed to the realization of high-speed vector computers. Although the vector system was fast, its manufacturing cost and power consumption increased because of the expensive memory bank mechanism described above.
Around the same time as the development of the vector architecture, the computing unit of the scalar architecture adopted RISC⁵ and was pipelined, taking advantage of the increases in processing units and operating frequency. The scalar architecture evolved into a superscalar architecture with multiple computing units. Furthermore, by using SIMD⁶ and other techniques, high-speed computation was made possible by exploiting the parallelism hidden in the program.

5 Reduced instruction set computing.
6 Single instruction, multiple data: a class of parallel processing that applies one instruction to multiple data.

In the scalar architecture, a different countermeasure to the memory wall problem was taken: instead of increasing the number of memory banks, a cache with high data supply capability was placed between the memory and the computing unit. As much data as possible was placed in the cache and reused, to compensate for the limited data supply capacity of the memory to the computing unit. Although this method has performance disadvantages, it has benefits in terms of cost and power consumption over the multiple-bank approach of the vector architecture.
The single processor was approaching the limits of improvement of the operating
frequency and the memory wall problem remained. Even if the computing capacity
of the single processor could be increased, the data supply capacity of the memory
could not catch up with the computing capacity. Furthermore, with the increases in
operating frequency, we also faced the problem of power consumption increasing
faster than the improvement in performance. In other words, the limits of performance
improvement of single processors were becoming apparent.
To solve this problem, parallel architectures, in which many single processors are connected by a communication mechanism, appeared. Without this development, it would be impossible to obtain the necessary computing power with realistic power consumption.
At present, hybrid, massively parallel computers are emerging, in which multiple
calculation cores are built in a processor and thousands to tens of thousands of
processors are connected by a communication network.
Although each node of a supercomputer is basically the same as an ordinary
computer, computing capacity and computing performance in total are extremely
high, and high-speed interconnection performance is required. Further, because low-
power performance of the total system is required, it is essential that power saving
is implemented at the processor level. Because the number of parts constituting the
system is very large, extremely high reliability is required for individual parts, and
high reliability of the total system is required.
Up to this point, we have explained the development of hardware. As the hardware
evolved, how has its usage changed so that the performance of the hardware can be
fully utilized?
In the early days of computers, with a single processor, the processing speed was the bottleneck rather than the memory transfer performance, and development environments such as high-level languages and compilers emerged. It was common for researchers and programmers to reproduce formalized and discretized theoretical model equations in code, and high-speed processing was realized by developing compilers that could exploit the parallelism hidden in the program.
From the latter half of the 1970s to the 1980s, when vector architectures used multiple memory banks to cope with the memory wall problem, the parallel nature of loop indices was exploited and parallel-processable data were pipelined. As a programming technique in the age of vector architectures, it was necessary to guarantee the parallel nature of loops, and eliminating recurrences (loop-carried dependences) became an essential performance optimization technique.

In the scalar architecture, as described earlier, the memory wall problem was addressed with a cache having high data supply capability. In addition, the scalar architecture introduced SIMD and made effective use of simplified RISC instructions, superscalar execution with multiple operation pipelines, and software pipelining by the compiler. Similar programming changes were required for the scalar architecture as well: it was necessary to guarantee the parallelism of the loop index, SIMD vectorization became a performance optimization technique that requires eliminating recurrences, and efficient cache usage techniques became indispensable.
After the limits of performance improvement of a single processor became apparent and supercomputers changed to parallel architectures, parallelism among several to several hundred cores within a CPU and among thousands to tens of thousands of CPUs has been realized by expressing it explicitly in programs. In other words, the programmer must parallelize the code in consideration of the parallelism among cores and among CPUs, and must program with consideration of how the data used for calculation are distributed among the cores and CPUs. In addition, techniques for using the communication system that exploit the network topology between the nodes where the processes are located have become necessary.
As described earlier, modern computers still have the memory wall problem: the computing capacity of the computing unit has increased, but the data supply capacity of the memory is relatively insufficient. To cope with this problem, cache memories (level 1, level 2, and level 3) with high data supply capability are provided, and data are placed in the cache and reused many times while performing a calculation.⁷ Thus, compared with programming on older computers, the necessity of programming with attention to multilevel memory structures such as caches became obvious. However, many programs cannot reuse data in this way; because the capacity of the computing unit then cannot be fully used, such programs may require high-speed data access mechanisms such as prefetching.

1.2 What Is a High-Performance Application?

Users, researchers, and programmers who use present supercomputers, equipped with tens of thousands of processors and containing various enhancements and new functions, must recognize the two points mentioned in Sect. 1.1: "programming conscious of parallelism" and "programming conscious of execution performance". Thus, high-performance applications require "performance optimization with high parallelism" and "performance optimization of single CPUs". Chapters 1–3 cover these performance optimization techniques, which are used to exploit the performance of modern supercomputers.

7 This approach is called cache blocking.



1.2.1 Important Points in Optimizing High Parallelism

Parallelization is briefly described first. The basic idea is simple. As shown in Fig. 1.1,
if a problem that is sequentially computed using one processor is computed in parallel
using four processors, the computation should be four times faster and it should be
executed in a quarter of the original calculation time.
In simulations of fluids and structural analysis, a mesh is constructed in the spatial
direction, and calculations are performed for each mesh point. High parallelization
is briefly explained by using this example.
To parallelize the calculation, the mesh is divided into multiple regions. These
regions are distributed among the processors and the calculations are performed in
parallel. Such a parallelization method is called a domain decomposition method
and is depicted in Fig. 1.2. In this figure, after executing the calculation using four
processors, data are exchanged using the communication network to achieve con-
sistency of the calculations proceeding in parallel; these steps are then repeated to
continue the calculation. As described earlier, in the parallel computation by domain
decomposition, adjacent communications are performed to exchange the data of a
part of the domain with the adjacent processors. When an inner-product calculation
is performed over all the domains, global communication is required to obtain the
sum of the data for all processors. An important point in achieving high parallelism
is to minimize the amounts of adjacent and global communications mentioned here.
It is also important to make the calculation times for each processor as equal as
possible. Differences in calculation time are called load imbalances.

Fig. 1.1 Parallel calculation: the same work executed sequentially on one processor versus split across four processors, each taking roughly a quarter of the elapsed time



Fig. 1.2 Parallel calculation by domain decomposition: four processors alternate between calculation phases and communication phases

Next, the parallelization rate and parallelization efficiency will be described. Assume that 99% of a sequential calculation can be performed in parallel but that the remaining 1% cannot; the parallelization rate of this calculation is 99%. Assume that the sequential calculation time is 100 s in total and that the parallel computation is performed using 100 processors. The computation time for the parallelizable component is then 99 s/100 = 0.99 s; adding the 1 s of nonparallel computing time, the calculation time with 100 processors is 1.99 s. The speedup over the original 100 s is thus only about 50-fold, for a parallelization efficiency of about 50%.
In the same way, with 1000 processors the calculation time is about 1/91 of the sequential time. That is, if even 1% of the calculation is nonparallel, efficient parallel calculation cannot be performed no matter how many processors are used. This limit on the benefits of parallelization is known as Amdahl's law. Therefore, as well as minimizing the communication time, it is important to minimize the nonparallel computing component as much as possible.

1.2.2 Important Points for Single-CPU Performance Optimization

From the viewpoint of single-CPU performance, applications can be roughly divided


into two types according to the ratio of the number of bytes of memory accessed to
the number of floating point operations required to execute the application. In one
type, the data transfer requests (in bytes) are small compared with the number of
floating point operations (in FLOPS). Such calculations are called computations
with small required byte/FLOP (B/F) values. For example, for the matrix–matrix
product calculation shown in Fig. 1.3, in principle the B/F value is 1/N when the data
movement amount is represented by the number of elements. In general, because
the amount of data movement is represented by the number of bytes, for double
precision calculations, the movement amount is multiplied by eight, so that it is 8/N,
and in principle the B/F value becomes smaller as N increases. In real programs (see
Fig. 1.4), once N exceeds a certain size, because the access direction of (a) is
contiguous in memory, (a) stays in the cache but (b) does not. Therefore, using a technique
called blocking, we divide the matrix into small matrices so that both (a) and (b)
are in the cache. Applications with small required B/F values like the matrix–matrix
product shown can achieve high single-CPU performance by reusing the data placed
in the cache many times.
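The blocking technique mentioned above can be sketched as follows. This is a minimal illustrative Python version (a real HPC code would use Fortran or C with tuned block sizes); the function names and the block size are assumptions, not from the original text:

```python
def matmul_naive(a, b, n):
    """Reference triple loop: for each row of C, all of B is traversed, so
    for large N matrix B (accessed down columns) falls out of the cache."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += a[i][k] * b[k][j]
            c[i][j] = s
    return c

def matmul_blocked(a, b, n, block=4):
    """Cache blocking: the matrices are processed in block x block tiles so
    that each pair of tiles stays resident in the cache and is reused."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, n)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + block, n)):
                            c[i][j] += aik * b[k][j]
    return c

n = 6
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float(i * j + 1) for j in range(n)] for i in range(n)]
print(matmul_blocked(a, b, n) == matmul_naive(a, b, n))  # True
```

The blocked version performs the same arithmetic in the same order per element, so the results agree exactly; only the memory access pattern changes.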

[Figure: matrix–matrix product with small required B/F value; 2N^3 calculations using N^2 pieces of data for each of matrices (a) and (b).]
Fig. 1.3 Example of matrix–matrix product (1)


Fig. 1.4 Example of matrix–matrix product (2)


1 Supercomputers and Application Performance 9

[Figure: matrix–vector product; 2N^2 calculations using N^2 pieces of data for the matrix (a) and N pieces of data for the vector (b).]

Fig. 1.5 Example of matrix–vector product

In the other type of application, the data transfer requests from memory are large
compared with the number of floating point operations required to execute the appli-
cation. These calculations are called calculations with large required B/F values. Such
calculations have problems in effectively using the high performance of the CPU
because it is difficult to use the cache effectively. For example, for the matrix–vector
product calculation shown in Fig. 1.5, in principle, the B/F value is approximately
1/2 when the data movement is expressed by the number of elements. As above, for
a double precision calculation, the movement amount is multiplied by eight, so the
B/F value becomes 8/2 = 4; therefore, the B/F value is large compared with that of
the matrix–matrix product. In this way, as a viewpoint for improving the performance of the
single CPU, the required B/F value of the application is important.
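The required B/F values discussed above can be computed with a small helper; the function name and the choice N = 1000 are illustrative:

```python
BYTES_PER_ELEMENT = 8  # double precision

def required_bf(elements_moved, flops):
    """Required B/F value: bytes moved from memory per floating point operation."""
    return BYTES_PER_ELEMENT * elements_moved / flops

n = 1000  # illustrative problem size
# Matrix-matrix product: 2N^3 operations on two N^2-element matrices.
bf_matmat = required_bf(2 * n * n, 2 * n ** 3)
# Matrix-vector product: 2N^2 operations on an N^2 matrix and an N vector.
bf_matvec = required_bf(n * n + n, 2 * n * n)
print(bf_matmat)  # 8/N = 0.008
print(bf_matvec)  # about 8/2 = 4 (the extra N vector elements add 1/N)
```

The contrast is the point: the matrix–matrix product needs only 8/N bytes per flop, while the matrix–vector product needs about 4 regardless of N.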
Exercises
1. Describe the memory wall problem, which is an important problem for single
processors, and describe the characteristics of recent supercomputers.
2. There are various benchmark tests (BMTs) to evaluate the performance of super-
computers. The most famous BMT is the top 500 (https://www.top500.org/),
which is evaluated by the performance of LINPACK, but there are others as well.
For some other BMTs, discuss the relationship between the evaluation method
and the field in which the evaluation is important.
Chapter 2
Performance Optimization
of Applications

Kazuo Minami

Abstract In this chapter, we present procedures for performance evaluation that
we have used for practical applications. The method is outlined in Sect. 2.1, and
Sects. 2.2–2.6 describe the details of the method. The classification of problems
related to high parallelism and the classification of applications from the viewpoint
of single-CPU performance are described. Sections 2.7 and 2.8 then describe perfor-
mance optimization techniques for each problem pattern related to high parallelism
and the techniques for single-CPU performance optimization according to applica-
tion classification.

2.1 Performance Evaluation Method

The performance evaluation of an application is divided into two parts: “highly par-
allel performance optimization” and “single-CPU performance optimization.” For
each part, the performance evaluation method has two working phases: “current state
recognition” and “understanding the problems.” The “current state recognition” is
common to both working phases and is divided into “source code investigation”
for analyzing the structure of source code, and “measurement of elapsed time” to
understand the current state of application performance. The final procedure in the
“current state recognition” is “calculation/communication kernel analysis,” in which
we evaluate the results of “source code investigation” and “measurement of elapsed
time.” The next working phase, “understanding the problems,” begins with the “problem
evaluation method.” In the “problem
evaluation method” for “highly parallel performance optimization,” the problems
related to high parallelization are classified into six patterns. In the “problem eval-
uation method” for “single-CPU performance optimization,” applications are also
classified into six patterns.
Our approach is summarized in Table 2.1.

K. Minami (B)
RIKEN Center for Computational Science, RIKEN, Kobe, Hyogo, Japan
e-mail: [email protected]

© Springer Nature Singapore Pte Ltd. 2019 11


M. Geshi (ed.), The Art of High Performance Computing
for Computational Science, Vol. 2,
https://doi.org/10.1007/978-981-13-9802-5_2

Table 2.1 Outline of performance optimization method
(common to highly parallel performance optimization and single-CPU performance optimization)

  Current state recognition:  source code investigation;
                              measurement of elapsed time;
                              calculation/communication kernel analysis
  Understanding the problems: problem evaluation methods

2.2 Current State Recognition: Source Code Investigation

As the first step in current state recognition, we investigate the source code of the
application. We investigate the structure of the source code and analyze the call struc-
ture of subroutines and functions. We also analyze the subroutines, the loop structure
in the functions, and the control structure of the IF blocks, and organize and visualize
the structure of the entire program. The visualized source code is divided into blocks
of calculation and communication processing according to the algorithms of physics
and mathematics used in the program, and the blocks are organized. We understand
the physical/mathematical processing content of each processing block. By com-
paring these aspects of the processing blocks with the results of the investigated
source code, the calculation characteristics for each calculation block are obtained.
The calculation characteristics describe the processing of a calculation block as non-
parallel, completely parallel, or partially parallel, and identify the calculation index
(e.g., number of atoms or number of meshes), whether the calculation amount in the
calculation block is proportional to N or proportional to N^2 when the calculation
index is N, and so on. We also investigate the communication characteristics of each
communication block: whether the processing of the communication block is global
communication, adjacent communication, or whether the communication amount
depends on the calculation index. These investigations are shown in Fig. 2.1.
The purpose of the investigation of the source code is to understand the charac-
teristics of each processing block in the program. However, the visualization of the
loop structure from the start to the end of the program and that of the entire control
structure of the IF blocks mentioned here are large tasks if done manually. Therefore,
we use a visualization tool for program structure, such as K-scope [1, 2].

2.3 Current State Recognition: Measurement Methods

In the sequential calculation before parallelization in the simulation, the calculation
is sequentially performed for each calculation unit such as a mesh. To parallelize a
code, we divide a set of calculation units such as meshes into multiple subsets, distribute
the subsets among the processors, and perform the calculations in parallel. In such
parallel computation, adjacent communications are performed in every calculation

Fig. 2.1 Investigation of source code

step to exchange data for parts of areas with neighboring processors. In addition,
when calculating inner products of scalar values for all areas, global communication
between all processors is required. An important point in achieving high parallelism
is to make the adjacent and global communication times as small as possible.
As described in Sect. 1.2.1, in parallel computation it is important not only to reduce
the communication time but also to make the nonparallel computing parts as small
as possible.
In conducting application performance measurement as the next step of current
state recognition, it is important to conduct measurements that clarify the parallel
characteristics of the application: specifically, what kind of behaviors the adjacent
and global communication times display during highly parallel calculation, which
calculation parts are nonparallel, and how the nonparallel parts influence the
application's behavior in highly parallel execution. For clarification of these parallel
characteristics, where possible, the performance measurement is carried out as follows.
First, we define the problem to be solved, determine the number of parallel paths
in the problem, and create a test problem that has the same per-processor problem
size as the target problem and that can be run with several levels of parallelism.
Next, we perform the performance measurement using the prepared test problem. In
the performance measurement, the execution time is measured for each process for
each calculation block and communication block, as defined in the previous section.
The parallel characteristics during parallel computation cannot be fully clarified by

measuring the entire application. Each processing block’s influence on the paral-
lel characteristics differs depending on whether it includes a nonparallel part and
the number of parallel paths, and whether the communication time changes. There-
fore, it is essential to measure the performance of each processing block for each
process separately. These measurements allow us to identify the processing blocks
that degrade parallel performance. In addition, because the communication behav-
ior during parallel execution differs between adjacent and global communication,
it is necessary to measure them separately. The adjacent communication time has
the same value if the communication amount is the same, as described later, but
the global communication time tends to increase as the number of parallel paths
increases, even if the communication volume stays the same. Furthermore, because
communication times may include waiting times caused by load imbalance, it is also
important to measure the waiting time and the net communication time separately,
thus allowing us to distinguish whether the problem is caused by communication or
load imbalance. With respect to computation, simultaneously with the computation
time, the amount of computation and the computation performance are also measured
for each processing block in each process.
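A per-block, per-process measurement like the one described here can be sketched as follows. The block names and workloads are placeholders, and a real MPI application would keep one such table per rank, with separate timers for net communication and waiting time:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Elapsed time accumulated per named processing block; in a real MPI code
# each process (rank) would keep its own table.
elapsed = defaultdict(float)

@contextmanager
def block_timer(name):
    """Accumulate wall-clock time for the named calculation or communication block."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        elapsed[name] += time.perf_counter() - t0

# Stand-ins for a calculation block and an adjacent-communication block.
with block_timer("calc:stencil"):
    total = sum(i * i for i in range(100_000))
with block_timer("comm:adjacent"):
    time.sleep(0.01)

for name, t in sorted(elapsed.items()):
    print(f"{name}: {t:.6f} s")
```

Accumulating by block name rather than by subroutine matches the recommendation in Sect. 2.5 that measurements follow the block structure, not the call structure.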

2.4 Current State Recognition: Determination of Computation and Communication Kernels

The analysis of the source code shows the correspondence between the physi-
cal/mathematical processing contents of each processing block and the source code,
and the calculation characteristics of each calculation block and the communica-
tion characteristics of each communication block. By matching these results with
measurement results, the calculation kernel and the communication kernel can be
identified.
For example, suppose there is a parameter N that determines the amount of com-
putation. Assume that the coefficient of computation amount proportional to the third
power of N is m1, the coefficient of computation amount proportional to N is m2,
and that m2 is considerably larger than m1. When N is relatively small, the amount
of computation for the two parts may be about the same. However, as N increases,
the amount of computation for the part proportional to the third power of N becomes
significantly larger, and the amount of computation for the part proportional to N
may become negligible.
Both the amount of computation and the computation time also vary depending
on the level of performance1 that can be obtained relative to the theoretical peak
performance. The essentially nonparallel parts may remain because of the adopted
parallelization method. By considering the size of the parameters of the problem to
be solved in this way, the parallelization method used, the parallelization method
that may be adopted in the future, the prospects for effective performance, and so on,

1 Performance obtained by dividing the measured amount of computation by the execution time.

Fig. 2.2 Identifying kernel of calculation and communication

and the kernels to be evaluated, are determined (see Fig. 2.2). The kernels selected
here can be reviewed at later stages of the evaluation.

2.5 Understanding the Problems: Evaluation of High-Parallelism Problems

We explain how to evaluate the problems of high parallelism by carrying out the
measurements shown in Sect. 2.3, measuring parallel performance step by step from
a few parallel processes up to about 100, about 1000, and several thousand.
There are two kinds of methods for measuring the performance by gradually
increasing the number of parallel processes: strong scaling measurement and weak
scaling measurement. Strong scaling measurement is a method of fixing the scale
of the problem to be solved and increasing the number of parallel processes: for
example, if the problem scale is fixed at N = 10,000, then as the number of parallel
processes increases through 1, 2, 4, and so on, the problem size per processor
becomes 10,000, 5000, 2500, and so on, respectively. In contrast, weak scaling measurement is a method of
fixing the scale of the problem solved by each processor and increasing the number
of parallel processes. For example, if the problem scale per processor is fixed at N =
1000, then with two parallel processes the total problem scale is N = 2000, and with
four parallel processes it increases to N = 4000. The feature of weak scaling
measurement is that, ideally, even when the number of parallel processes is increased, because the same
computation is performed, and the adjacent communication amount is not changed,
the execution time of the computation parts and the execution time of the adjacent
communications are not changed. When a nonparallel part is included in the com-
putation part, a significant increase in computation time should be measured, as the
number of parallel processes becomes large in weak scaling measurement.

For example, assume that the execution time of the parallelizable part during
sequential execution is Tp and the execution time of the nonparallelizable part dur-
ing sequential execution is Ts . The execution time T0 during sequential execution
is represented by T0 = Tp + Ts . The execution time when this problem is mul-
tiplied by N and executed sequentially is represented by N × T0 = N × Tp +
N × Ts . When this problem is executed in N parallel processes, it corresponds to
what we performed with weak scaling. If the execution time when executed in N
parallel processes is Twn , the parallelizable portion becomes N times faster but the
nonparallelizable portion does not become faster, so Twn = Tp + N × Ts , and the
term N × Ts increases. Incidentally, if Tsn is the execution time when run with
strong scaling, then Tsn = Ts + Tp /N.
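The two execution-time models above can be written directly in code; the values Tp = 99 s and Ts = 1 s are illustrative:

```python
def weak_scaling_time(t_parallel, t_serial, n):
    """Twn = Tp + N * Ts: with weak scaling, the nonparallelizable part
    grows with the number of parallel processes N."""
    return t_parallel + n * t_serial

def strong_scaling_time(t_parallel, t_serial, n):
    """Tsn = Ts + Tp / N: with strong scaling, only the parallelizable
    part shrinks as N grows."""
    return t_serial + t_parallel / n

Tp, Ts = 99.0, 1.0  # illustrative sequential-execution times
for n in (1, 10, 100):
    print(n, weak_scaling_time(Tp, Ts, n), strong_scaling_time(Tp, Ts, n))
```

The weak-scaling time grows linearly because of the N × Ts term, which is exactly why a hidden nonparallel part is easy to spot with weak scaling measurement.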
Even when the adjacent communication time increases in accordance with the
number of parallel processes, it is easy to see that there are some problems in the
corresponding adjacent communications. The global communication time generally
increases in accordance with the number of parallel processes, and the increase can
be predicted from the data on the basic communication performance by comparing
the degree of increase with the predicted value. This can show whether there are
some problems in the corresponding global communications.
The method described here is shown in Fig. 2.3. The reason for using weak
scaling measurement in this way is that it is easy to find problems. However, in weak
scaling measurement, it is necessary to prepare separate execution data according
to the number of parallel processes, which may be troublesome. In a simulation in
which the amount of computation is proportional to the second or third power of
the problem size N, weak scaling measurement is sometimes difficult. In such a
case, strong scaling measurement is performed. For strong scaling measurement, it
is necessary to model the computation and communication times with the number
of parallel processes as a parameter, to predict these, and to compare the predictions
with the actual measured times so as to find any nonparallel parts or communication
problems. However, unlike weak scaling measurement, it is not necessary to prepare


Fig. 2.3 Measurement with weak scaling


execution data according to the number of parallel processes. It is sufficient to prepare
only one type of data.
For the calculation kernel, we compare and evaluate the trend of the predicted
computation amount as clarified in the investigation of the source code and the com-
putation amounts from the measurement results. For example, assume the computa-
tional amount of the kernel is proportional to the problem size N and is completely
parallelized and measured with weak scaling. Because the computational amount for
each process is constant regardless of the number of parallel processes, the calcula-
tion time for the total system is constant. In this case, the value obtained by dividing
the computational amount of the measurement result by N is the proportional coef-
ficient. If it can be evaluated with weak scaling, as described above, and if it is
possible to completely parallelize and there are no problems in the communication
part, even if the number of parallel processes is increased, the execution time of the
computation part will be constant, and the adjacent communication time will also
be constant and should not increase. If the execution time of the computation part
increases remarkably with the number of parallel processes, it is likely that some
nonparallel parts remain in the operation kernel. In addition, if the adjacent commu-
nication time increases significantly with the number of parallel processes, it is likely
that some processing that is not adjacent communication is included in the com-
munication kernel. For example, there may be some global communication that is
used instead of adjacent communication to simplify programming. As for the global
communication, as described at the beginning of this section, its communication time
also increases as the number of parallel processes increases. However, because the
extent of the increase can be predicted from the basic communication performance
data, if the communication time increases significantly more than predicted, we can
consider that there is some problem in the corresponding global communication. In
the discussion of measurement methods, we described the method of measuring the
communication time and waiting time separately. This measured waiting time often
indicates some imbalance included in the computation part and communication part.
In parallel computing, some processing imbalance is physically unavoidable.
However, where the extent of the imbalance is remarkably large or if the imbalance
increases with the number of parallel processes, it is likely that some problem caus-
ing imbalance was introduced in the programming stage. In evaluations using strong
scaling, as described above, the existence of nonparallel parts and communication
problems are found by comparing the predicted computation and communication
times with the measured times. The predicted times are obtained by modeling them
with the parallel number N as a parameter.
For example, suppose that the computation of the kernel is proportional to the
third power of the system parameter N and the computation kernel is completely
parallelized. When measured with strong scaling, if the number of parallel processes
is doubled, then the computation of each process should be halved.
The total computational amount is the measured computational amount of each
process multiplied by the parallel number M. Dividing the total computational
amount by the third power of N gives the proportional coefficient for the third
power of N. The investigation of these computational amounts
and proportional coefficients is performed using many parallel measurement results,
and the evaluation is made as to whether the predicted value is consistent with the
measurement results. If the evaluation results are consistent, it means that the source
code is written according to the theory. If the evaluation results are not consistent
and there is an increase in computational amounts with the increase in the number
of parallel processes, it is likely that there is some problem such as the existence of
nonparallel parts in the source code. A similar evaluation is required for the adjacent
communications. For example, when a rectangular parallelepiped area is calculated
using twice the number of parallel processes, the length of one side of the allocated
area of each processor becomes (1/2) to the power of 1/3 times its original length.
The adjacent communication amount for the adjacent faces becomes (1/2) to the
power of 2/3 times its original amount. Therefore, assuming the same communication
performance, the communication time should also be multiplied by (1/2) to the power
of 2/3. For the global communication, a similarly modeled evaluation is required. As
for the evaluation of the imbalance, it is necessary to evaluate the results as for weak
scaling.
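The strong-scaling communication model for a three-dimensional domain can be sketched as follows; the factor-of-two process ratio is just an example, and the function name is illustrative:

```python
def subdomain_factors(process_ratio):
    """When a 3D rectangular domain is split among process_ratio times as
    many processes (strong scaling), each side of a subdomain shrinks by
    process_ratio**(-1/3) and each face area by process_ratio**(-2/3)."""
    side = process_ratio ** (-1.0 / 3.0)
    face = process_ratio ** (-2.0 / 3.0)
    return side, face

side, face = subdomain_factors(2)          # doubling the process count
print(f"side length factor: {side:.4f}")   # (1/2)**(1/3), about 0.7937
print(f"comm time factor:   {face:.4f}")   # (1/2)**(2/3), about 0.6300
```

If the measured adjacent communication time does not shrink by roughly this face factor, that discrepancy signals a communication problem, as described in the text.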
As repeatedly described, it is essential to carry out the evaluation shown here for
each computation and communication kernel. Some tools provided by manufacturers
have functions to measure the execution time, the amount of computation, and the
computation performance for each subroutine or function, and it is usual to measure
performance using these tools. However, the subroutines and the functions of the
application do not generally match the range of the block, and because a function
may be called from different blocks several times in different ways, these tools may
not yield accurate measurement results for each block. Therefore, it is better to
perform measurements on each block. However, this does not apply if the subroutine
or function is configured to match the block.

2.5.1 Classification of Problems Related to High Parallelism

In the HPCC benchmark, applications are classified by using two axes. The first axis
is defined by the locality versus nonlocality in the spatial direction of the data divided
among the processors. The second axis is defined by the locality versus nonlocality
in the temporal direction of data in the processors [3].
In addition, a study of application classification, the “Berkeley 13 dwarfs,” classi-
fied applications by the two axes of the communication and calculation patterns [4].
In this study, applications were classified among seven dwarfs in the HPC field, and
13 dwarfs by adding other fields.
In promoting performance optimization, we also classify the application and orga-
nize the execution performance optimization methods for applications based on the
classification. For high parallelism, the locality versus nonlocality of the data is
considered in the HPCC as one axis, and in the Berkeley 13 dwarfs, the pattern of
communication is considered as one axis. In this section, we focus on the kinds of
problems that occur and how we deal with those problems when optimizing the per-
formance of existing applications, and we classify them according to highly parallel

patterns. The problems relating to high parallelism are classified into six patterns, as
shown in Table 2.2.
The main problems relating to high parallelism are caused by calculations and
communication. The first, second, and sixth problems are caused by calculation, and
the third, fourth, and fifth problems are caused by communication. The six patterns
are described as follows.
The first pattern is the mismatch of the degree of parallelism between applications
and hardware. Researchers want to solve a problem within a certain time; suppose
that to do so, it is necessary to use tens of thousands of parallel nodes on a super-
computer. For example, the K computer makes it possible to use more than 80,000
parallel nodes in terms of parallelism of the hardware. However, sometimes only
thousands of parallel nodes can be used because of the limitations of the application
parallelization. This is the mismatch of degree of parallelism between the application
and the hardware. When approaching the limitation of the parallelism of the appli-
cation, the computation time becomes extremely small, whereas the proportion of
communication time increases, leading to a deterioration of the parallel efficiency.
The second pattern is the presence of nonparallel parts. As mentioned at the begin-
ning of this chapter, we can see that the parallel performance deteriorates because
of Amdahl’s law if nonparallel parts remain in the computation. Here, assuming that
the execution time of a certain application at the time of sequential execution is Ts
and the parallelization rate of the application is α, the nonparallelization ratio of the
application is 1 – α. When this application is executed using n parallel processes, the
execution time Tn is expressed as Tn = Ts (α/n + (1 – α)). For a parallelization effi-
ciency of 50%, the parallelization ratio α is required to be 99.99% when n = 10,000.
The easiest way to find remaining nonparallel parts is to measure the increase in
execution time of the calculation part using weak scaling measurement as described
above.
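The relation between the parallelization ratio and the efficiency can be checked numerically. The helper required_alpha below inverts the efficiency formula and is an illustrative addition, not from the original text:

```python
def parallel_time(t_seq, alpha, n):
    """Tn = Ts * (alpha / n + (1 - alpha)), as given in the text."""
    return t_seq * (alpha / n + (1.0 - alpha))

def required_alpha(n, efficiency):
    """Invert efficiency = Ts / (n * Tn) to obtain the parallelization
    ratio alpha needed to reach a target efficiency on n processes."""
    return (n - 1.0 / efficiency) / (n - 1.0)

alpha = required_alpha(10_000, 0.5)
print(f"{alpha:.6f}")  # 0.999900: a 99.99% parallelization ratio is required
```

Plugging alpha back into parallel_time confirms that the efficiency on 10,000 processes is exactly 50%.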
The third pattern is the occurrence of large communication sizes and frequent
global communication. Communication times, particularly for global communica-
tions, have a large impact on the parallel performance. Consider an example of
implementing the ALLREDUCE communication of M (bytes) between N nodes.
Assume that the ALLREDUCE communication is performed using a binary tree

Table 2.2 Bottlenecks in parallel performance


Bottleneck
1 Mismatch of the number of parallel processes between application and hardware
  (insufficient parallelism of applications)
2 Presence of nonparallel parts
3 Large communication size and frequent global communication
4 Global communication among all nodes
5 Large communication size and a large number of communications in adjacent
  communication
6 Load imbalances
algorithm and the communication performance is Pt (bytes/s). The communication
time Tg for acquiring the total amount of M (bytes) after all nodes have communicated
is Tg = (M × log2 N)/Pt. To compare the global communications with the adjacent
communications, we consider an example in which N nodes perform the adjacent
communications of M (bytes) to the next rank. When the communication performance
is matched with the above conditions, the communication time Ta to complete the
communication of M (bytes) for all nodes is Ta = M/Pt. Comparing global and
adjacent communications, the global communication time is larger by a factor of
log2 N. Global communication should therefore be kept to a minimum.
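A sketch of this comparison, with assumed message size and link performance (both values are illustrative):

```python
import math

def allreduce_time(m_bytes, n_nodes, pt):
    """Tg = M * log2(N) / Pt for a binary-tree ALLREDUCE among N nodes."""
    return m_bytes * math.log2(n_nodes) / pt

def adjacent_time(m_bytes, pt):
    """Ta = M / Pt: all nodes exchange M bytes with a neighbor in parallel."""
    return m_bytes / pt

M = 8 * 1024 ** 2   # assumed message size: 8 MiB
Pt = 5e9            # assumed link performance: 5 GB/s
for n in (64, 1024, 65536):
    ratio = allreduce_time(M, n, Pt) / adjacent_time(M, Pt)
    print(f"N = {n:6d}: global/adjacent time ratio = {ratio:.1f}")
```

The ratio equals log2 N, so it grows from 6 at 64 nodes to 16 at 65,536 nodes, which is why global communication becomes the dominant cost at high parallelism.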
The fourth pattern is the occurrence of global communications among all nodes.
As described above, when the ALLREDUCE communication of M (bytes) is per-
formed between N nodes using the binary tree algorithm, the communication time
T is T = (M × log2 N)/Pt, assuming the communication performance to be Pt (bytes/s).
Because the communication time increases as the number of nodes N increases, it is
better to limit global communication among all nodes as much as possible. However,
calculation of inner products is inevitable in the iterative solution of simultaneous
linear equations and other problems, so it is impossible to eliminate all-node global
communication.
The fifth pattern is the occurrence of a large communication size and a large
number of communications in the adjacent communication. In terms of the commu-
nication time, adjacent communication tends to be faster than global communication.
However, wasteful adjacent communication, such as communicating the data of an
entire area when only the adjacent surface data for one mesh layer is needed, is
sometimes performed. Such code should be reviewed so that only the data on the
adjacent surface is communicated.
The sixth pattern is the occurrence of load imbalances. Differences in the amount
of calculation for each node may occur, causing some load imbalance among nodes.
When the load imbalance deteriorates as the number of nodes increases, or when the
load imbalance is extremely large over a small number of nodes, it is a problem.
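A simple way to quantify imbalance from the per-rank times measured in Sect. 2.3 is the max-to-mean ratio; this metric and the sample timings are illustrative, not from the original text:

```python
def load_imbalance(times_per_rank):
    """Max-to-mean ratio of per-rank elapsed times: 1.0 is perfectly
    balanced; the excess over 1.0 is time the other ranks spend waiting."""
    mean = sum(times_per_rank) / len(times_per_rank)
    return max(times_per_rank) / mean

balanced = [1.00, 1.01, 0.99, 1.00]   # illustrative per-rank timings (s)
skewed = [1.0, 1.0, 1.0, 2.0]
print(round(load_imbalance(balanced), 3))  # 1.01
print(round(load_imbalance(skewed), 3))    # 1.6
```

Tracking this ratio as the number of nodes grows makes it easy to see whether the imbalance is stable or deteriorating, which is the diagnostic criterion given above.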

2.6 Understanding the Problems: Evaluation Methods for Problems in Single-CPU Performance

2.6.1 Application Classification for Single-CPU Performance

As mentioned in Sect. 2.5, the developers of the HPCC benchmark [3] and the
Berkeley 13 dwarfs [4] classified applications. For the HPCC, applications were
classified using locality versus nonlocality of data in the temporal direction with
regard to the single-CPU performance. For the Berkeley 13 dwarfs, applications
were classified using the calculation pattern.
Similarly, in promoting the study of performance optimization, we also classify
applications and organize the application execution performance optimization tech-
niques based on the classification. In Sect. 1.2, from the viewpoint of the single-CPU
performance, we mentioned that applications can roughly be classified into two types,
one with a low required B/F value and one with a high required B/F value. This idea
is close to the classification used for the HPCC. In this section, we will develop this
view and show the classification of applications into six types as shown in Table 2.3.
The calculations for which the required B/F value is small are the first to the fourth
types. The performance greatly varies depending on whether the DGEMM library
or manual cache blocking can be used, even for calculations with small required
B/F values. When cache blocking can be used, the performance varies depending
on whether the data structure and loop structure are simple, or the data structure is
slightly complicated such as using list vector indexing by integer arrays. Applications
with more complex loop structures often fail to achieve high performance. These
considerations led to the four types of calculations with small required B/F values.
The first type includes applications that can be rewritten as matrix–matrix product
calculations. This type has small required B/F values because, in principle, it can
perform calculations proportional to the third power of n after loading the data for
an n × n matrix from memory. An example of this type of calculation is first-principles
quantum calculation based on density functional theory.
The second type includes applications that cannot be rewritten as matrix–matrix
products but still allow cache blocking and have small required B/F values. The
calculation of Coulomb interactions in molecular dynamics and of gravitational
interactions in many-body gravity problems are examples. In both cases, by loading
the data for n particles and applying cache blocking, a number of calculations
proportional to the square of n can be performed, so the required B/F value is small.
This type often uses list-vector indexing through integer arrays for particle access,
and the loop body2 is somewhat complicated.
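The blocking idea can be sketched as follows (a hypothetical pairwise kernel chosen for illustration, not one from the book): the particle data are processed in tiles small enough to stay in cache, so that n^2 interactions are computed from only n particles loaded from memory.

```c
/* Cache blocking over an O(n^2) pairwise interaction (sketch).
 * BLOCK is kept tiny here for illustration; a real code would size
 * the tile to fit the cache. */
#define BLOCK 4

double pair_energy(int n, const double *x)
{
    double e = 0.0;
    for (int jb = 0; jb < n; jb += BLOCK) {          /* tile stays cache-resident */
        int jend = jb + BLOCK < n ? jb + BLOCK : n;
        for (int i = 0; i < n; i++)                  /* sweep every i over the tile */
            for (int j = jb; j < jend; j++)
                if (i != j) {
                    double d = x[i] - x[j];
                    e += 1.0 / (1.0 + d * d);        /* hypothetical interaction kernel */
                }
    }
    return e;
}
```

The tile of j-particles is loaded once and reused by all n i-particles, which is exactly what keeps the required B/F value small for this type.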
The third type contains examples such as special high-precision stencil calculations,3
which make it possible to use the cache effectively, so the required B/F value

Table 2.3 Classification of applications from the standpoint of single-CPU performance

Classification | Application examples
1. Rewritable to matrix–matrix products | Density functional theory calculations
2. Cache blocking is possible | Molecular dynamics, many-body gravity problems
3. The required B/F value is small, and the loop bodies are simple | Special stencil calculations
4. The required B/F value is small, but the loop body is complex | Plasmas, physical processes of meteorology, quantum chemical calculations
5. The required B/F value is large | Mechanical processes of meteorology, fluids, earthquakes, nuclear fusion
6. The required B/F value is large and list accesses are used | Structural calculations using finite-element methods, fluid calculations

2 The code contained in the loop.

3 Calculations using difference subscripts such as i, i - 1, as appear in finite-difference calculations.
22 K. Minami

is small and the loop body is a simple calculation. Although this type of calculation
gives good performance, unfortunately there are few examples.
In the fourth type of calculation, the required B/F value is small, but the loop
body is complex. Some weather calculations contain mechanical processes, which
compute the motion of the fluid, and physical processes, which compute the
microphysics of clouds; these physical processes correspond to the fourth type.
Complex, in-cache calculations are performed on small amounts of data loaded
from memory, but the loop body tends to be long and complicated. The PIC method4
used for plasma calculations is also of this type. In this technique, although the
mesh data around each particle are cached, list-vector indexing through integer
arrays is commonly used to access the particle data, resulting in complex program
code. The body of the calculation loop also tends to be long. For this type we expect
high performance because the data are cached, but in many cases the complexity of
the program code prevents the expected performance from being obtained.
The fifth and sixth types of calculation have high required B/F values. Even among
program codes with the same high required B/F value, performance varies greatly
depending on whether discontinuous list accesses are required. This is the basis for
classifying calculations with high required B/F values into the fifth and sixth types.
The fifth type of calculation has a high required B/F value and does not use list
accesses. Many ordinary stencil calculations are of this type, and there are many
other examples, such as the mechanical processes in the weather calculations
described earlier, fluid calculations, and earthquake calculations. The sixth type of
calculation has a high required B/F value and uses list accesses. Such calculations
occur frequently in engineering; examples are structural analysis and fluid
calculations using finite-element methods. List accessing is a weak point of modern
scalar computer architectures because a random access is required for each element.
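The difference between the fifth and sixth types can be made concrete with a small sketch (both loops are illustrative assumptions, not code from the book): the stencil loop touches i - 1, i, i + 1 contiguously, while the list access forces a gather through an integer index array.

```c
/* Type 5: direct stencil access; the neighbours i-1, i, i+1 are
 * contiguous in memory, so hardware prefetch works well. */
void stencil_direct(int n, const double *y, double *z)
{
    for (int i = 1; i < n - 1; i++)
        z[i] = y[i - 1] - 2.0 * y[i] + y[i + 1];
}

/* Type 6: list (indirect) access; y[list[i]] may land anywhere in
 * the array, producing the random accesses described in the text. */
void gather_indirect(int n, const double *y, const int *list, double *z)
{
    for (int i = 0; i < n; i++)
        z[i] = y[list[i]];
}
```

Both loops move comparable amounts of data, but only the first gives the memory system a predictable access stream.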
In general, single-CPU performance decreases in the order from type 1 to type 6
calculations. However, there is usually little difference between types 2 and 3.

2.6.2 Evaluation by Cutting Out the Computation Kernel

First, we cut out the calculation kernel to form an independent test program that can
be executed in a single process. In cutting out the kernel, the following steps are
carried out.
(A) Dump the necessary data while executing the original program, to prepare the
arrays and other data needed to run the test program. The data used by conditional
statements are important, because they determine which branches the kernel takes.
For the data used only in calculations, when performance alone is of interest,
suitable values may be set without using dumped data.
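Step (A) can be supported by small dump/restore helpers such as the following sketch (the function names and the raw binary format are assumptions made here, not an interface from the book; a real kernel cut-out would dump every input array, including the data driving conditionals, in the same way):

```c
#include <stdio.h>

/* Write an array of doubles from the original run to a file. */
int dump_array(const char *path, const double *a, size_t n)
{
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    size_t w = fwrite(a, sizeof(double), n, f);
    fclose(f);
    return w == n ? 0 : -1;
}

/* Read a dumped array back inside the stand-alone test program. */
int load_array(const char *path, double *a, size_t n)
{
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    size_t r = fread(a, sizeof(double), n, f);
    fclose(f);
    return r == n ? 0 : -1;
}
```

The test program then calls load_array for each input, runs the cut-out kernel once, and can be timed or profiled in isolation from the full application.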

4 Particle-in-cell method. A method for arranging particles in the calculation lattice.


Was levelled to the ground,
And on its ruins, now a funeral pyre
Smouldered the ashes of her aged sire

and the foul monster had carried her off to his cave. The bridegroom
swore, in his despair, that “earth should no longer hold a thing so
vile,” and, marching off with his friends, killed the dragon and
rescued his bride; but the story ends on a classical note of tragedy,
for she died of horror in his house that very day.
Reference: E. Sidney Hartland, The Legend of Perseus, 3 vols.
(London, 1896).
IV
OF DRAGONS IN MODERN EUROPE

It will be well to begin this section with short accounts of the two
most satisfactory Renaissance dragons: the Dragon of Rhodes and
the Dragon of Bologna.
“The history of the ancient Order of the Knights of St. John (not
yet removed to Malta) records that about the year 1330 Dieudonné
de Gozon, afterwards third Grand Master of the Order, joined the
Knights in Rhodes, and was filled with pious zeal to kill a terrible
dragon which ravaged the Island; but the then Grand Master
considered such extravagant gaieties too dangerous for a knight
vowed to the defence of Christendom, and roundly forbade it. On
this de Gozon returned to the castle of his ancestors near Tarascon
in France, and, with the help of an ingenious dummy dragon (so
little does the art of war change), trained his horses and dogs to
face the monster, and, returning, killed it and removed its tongue as
evidence. A lying Greek (so little does the Greek nature change)
found the carcase and claimed the victory; but de Gozon showed
him up by producing the tongue—and was put in prison for
disobedience. The Pfalzgraf Ottheinrich made our first written record
of this feat when passing through on a pilgrimage in 1521, and the
corroborative evidence is indisputable: the feat is said to have been
recorded on the tombstone of the knight (we have the tombstone,
and it isn’t); there are said to be pictures of it in a wall-painting in a
house in Rhodes (which cannot be found); and the family are said to
have preserved the draconite taken by the hero from the monster’s
forehead (the family has disappeared); the head itself was seen by a
seventeenth century traveller still nailed to a gate in Rhodes, though
it disappeared during the last century. For countless years the simple
islanders had displayed it for the glory of God and without thought
of gain, and it would perhaps be uncharitable to connect its
disappearance with the recent development of transatlantic
transport, or with the discoveries of modern science, which have
shown that the skeleton of the dun-cow at Warwick is simply that of
a whale. And, finally, they will show you to this day in Rhodes the
cave where the dragon lived.”
The story of the Dragon of Bologna is tame by comparison. It is
recorded in great detail in The Natural History of Serpents and
Dragons by Professor Ulysses Aldrovandus, published at Bologna by
Mark Antony Bernia in the year 1640, at his own charges, with a
dedication to the Prince-Abbot Franciscus Perettus, and with the
approval of the Holy Roman Emperor Rudolph, the Rector to the
Cardinal-Archbishop of Bologna, and the legal adviser to the Most
Holy Office of the Inquisition in that city.
The story is as follows (p. 402): “In the early summer of the year
1572, to wit on the 13th of May, the dragon appeared in the
outskirts of Bologna, hissing horribly. It was caught the day after
Ascension Day by a cowherd called Baptista of Camaldulus, about 5
P.M. and about seven miles out from the City, on the high road. His
cows saw it and stopped dead, and Baptista, who was behind with
his cart, pricked them on with his goad; but they went down on their
knees and wouldn’t budge. Then he heard a great hissing and
beheld the astounding monster: but though frightened out of his
wits, he up with his stick and knocked it on the head so that it died.
The brave herd, fearing it might not be dead, cut off one of its feet
and brought it into Bologna as evidence. After three days the noble
Horatius Fontana gave orders for the carcase to be sent to the great
naturalist Aldrovandus, who declared it to be unique in all Italy and
all Europe, and had it stuffed and put in the museum (whence it has
unluckily disappeared). It was about this same time that the flying
dragon appeared by night in the sky, and no sane man will doubt
that these portents were sent in honour of Pope Gregory XIII, who
took office in that year and who sported a dragon on his coat-of-
arms.”
This same Aldrovandus is our chief source of information on the
modern dragon. He sets out, in due scholastico-scientific style, first
the alternative meanings of the word “dragon,” with a note that
Virgil is very haphazard in his use of “dragon” or “serpent” for
snakes in general; then the synonyms, as “syren,” “leviathan,” and
the Hebrew oach (whence perhaps our word “hoax”); then size—5
to 100 cubits (we may split the difference and safely say about 50);
habitation—Libya, India, Atlas, Æthiopia, Florida, etc. (with a caution
that the species born of a wolf and an eagle is probably fabulous
and nowhere to be found); colour—red, black, ashen, pea-green,
indeed the evidence is hopelessly conflicting; description—head of a
virgin or wild-boar, goose-feet or talons or hoofs (they probably
vary); St. Augustine confirms Herodotus’ opinion that they fly;
poison—more virulent in the hotter climes; jaw—some say very
large, some say very small, some say two rows of teeth, some say
three, and the number is in any case uncertain; manners and
customs—very vigilant and fond of gold (so we see why they are
normally set to guard treasures), not afraid of men, and able to
throw elephants with their tails: four or five, says Pliny, will twine
their tails together for a long flight and so cover the distance at an
incredible speed; very fierce, but Heracleides, the philosopher, had
one so tame that it followed him about like a dog; birth—the
evidence is conflicting as to whether from eggs or immediately.
Remedies against their poison—red mullet applied externally or
(better) internally, or (best of all) the head of a dragon skinned and
applied to the bite. Capture—men of the most magnificent courage
drug them with opium-seeds, so as to obtain the draconite; a scarlet
cloak and the appropriate incantation are effective, and an axe has
been tried with success: a useful trick is to catch them when they
are preoccupied with an elephant-fight (their customary recreation),
and another very good plan is to put down sulphur, which the
creature eagerly gulps down and then moving to the nearest river
drinks until it bursts. (This was the device of the great Cracus, who
gave his name to Cracow. It is an elaboration of the Prophet Daniel’s
method of dealing with Bel’s dragon—that holy man’s mixture, it will
be remembered, itself exploded the dragon; but the march of
science and the closer study of animal-habits no doubt made Cracus’
scheme more convincing.)
The eyes are precious stones and the teeth ivory; the fat is a
sovereign remedy against poison, fever, and blear-eyes; the spine is
a great cure for toothache; the gall-bladder and intestines mixed
with wine effect more than was ever claimed for Colman’s mustard
in the bath, removing warts. It is very lucky to bury a dragon’s head
under the front doorstep, and the eyes make a fine poison and send
away nightmares: and so on and so forth—all this less than three
centuries ago.
A little later, about 1660, the learned Jesuit Kircher visited the
Alps, and, though discounting many devils as due to the credulity of
the peasantry, could not resist the conclusion that so horrid and
inhospitable a country could only have been intended by God to
harbour dragons, especially when a public notice in the Church of St.
Leodegarius (our old friend St. Leger, the patron-saint of
bookmakers?) in Lucerne told how a man “paused some months in a
cave with two dragons, who were either naturally amiable or were
calmed by his energetic appeals to the Virgin, and finally escaped by
holding on to their tails when they flew away after their period of
hybernation” (History does not record whether they adopted Pliny’s
plan, or whether by a merciful dispensation of Providence they flew
so close together that he suffered no strain).
The anonymous author of The Golden Coast, or a Description of
Guinney (London, 1665) has little reliable information on this or any
other subject. The people, he tells us, are Nigritæ “from their colour,
which they are so much in love with that they use to paint the Devil
white”; and of the elephant, “which some call Oliphant,” that “they
have continual war against dragons which desire their blood because
it is very cold.” The book abounds in such old tales out of Pliny and
Bartholomew Anglicus, and has all the appearance of a literary puff
of the Company of Royal Adventures of England trading to Africa
(est. 1662); for what honest man could have waxed so enthusiastic
over that death-trap of a country, where (says he) “a man may gain
an estate by a handfull of beads, and his pocket full of gold for an
old hat; where a cat is a tenement and a few fox tails a Mannor;
where gold is sold for iron, and silver given for brasse and pewter?”
The Company failed shortly afterwards and was replaced by the
Royal African Company (1672), and this may well have been to over-
spending in the Advertising Department.
Doctor Thomas Browne, in his “Enquiries into very many received
tenents, and commonly presumed truths” (London, 1686)
(commonly called Browne’s Vulgar Errors) is more modern, but, he,
like a sensible man takes a middle path between scepticism and faith
—thinks we cannot safely deny that there is such an animal as the
basilisk; but we are not to confuse it with the cockatrice, a mere
hieroglyphical fancy, though even the cockatrice he will not declare
to be impossible (he does not see how such an oddity can be
hatched from “a cock’s egg” (sic: the phenomenon occurs only in a
cock’s eighth year, and causes it acute discomfort put under a toad
or serpent)); but many inventions, he says, are really “the courteous
revelations of spirits,” and we must not be too cocksure of our
merely human faculties.
Scheuchzer, the learned Botanist who toured the Alps in the first
ten years of the eighteenth century, frankly adopted the compromise
implicit in Aldrovandus—always to believe half of what he was told;
but he thought the dragon-stone in the museum at Lucerne entirely
convincing; for (says he) a dishonest man would not have invented
so simple a story as its falling from the sky—but rather some
fabulous tale about its coming from the farthest Indies; and the
stone not only cures simple hæmorrhages, which ordinary jasper or
marble might well do, but dysentery and fevers and all those ills of
which, to judge from the advertisements in the local press,
Glastonians may now rid themselves so much more simply. Item, a
respectable citizen returned home one evening lately “with a
swimming in the head and a marked uncertainty about the motions
of his legs, and how can we doubt his word when he attributes these
unprecedented phenomena to the influence of the dragon who
encountered him in the forest?” Scheuchzer’s scientific journals were
published at the expense of the Royal Society of London. Credible
witnesses of to-day maintain that “not the vestige of a dragon is to
be found, even in those wildest regions of the Alps which ... were
especially adapted for their generation.” Thus do beauty and
romance fade before the advance of Winter Sports and Grand
Babylon Hôtels.
References: Aldrovandus, op. cit.
Thomas Browne, op. cit.
Leslie Stephen, The Playground of Europe (London, 1871).
E. Ray Lankester, Science from an Easy Chair (London, 1910).
F. W. Hasluck, The Dragon of Rhodes (British School at Athens,
1914).
V
OF DRAGONS IN ANCIENT EGYPT

It is reported of Mr. Winston Churchill that, being challenged one day by a Frenchman as to the remarkable uniform he was wearing,
he replied in the same language that he was an Elder Brother of the
Trinity. “Ah!” said the Frenchman, “that is indeed a unique
distinction.”
It is not so unique as might be supposed. If we could betake
ourselves to the Egypt of 5,000 years ago, we should find them
worshipping a Trinity of their own: Isis the all-Mother; Osiris the
Son, and Horus. Isis, the forerunner of all the gods of all mankind
was the goddess of fertility—goddess, not god, for what could be
more evident than the female fact of birth, whereas male assistance
went long unrecognized. The savage mother, finding herself with
child, would attribute her condition not to a “commonplace event
which took place perhaps many months before,” but to a recent
thunderstorm or other striking phenomenon to which all could bear
witness.
So Isis ruled alone for a while, and then in her own inimitable
fashion gave birth to the water-god Osiris; and between them in due
course they produced the warrior Horus, who in the fullness of time
became the avenger of Osiris, when the powers of darkness slew
him.
This is the bald and essential outline of their faith. The details are
extremely confusing, partly because of variants, but principally
because the savage-mind is so confused. “Anne’s Mother’s daughter,
Mother’s Anne’s daughter,” reasons my baby; and the small boys
who deliver messages round the factory find a similar difficulty in
distinguishing between the Buying Department and the Sales
Department. In exactly the same way the gods of old Egypt became
inextricably mixed. The tale told of one is easily applied to another,
and God the doer easily becomes God the done-by; while the symbol
of the god will equally well pass for (say) the enemy of the god, or
the weapon with which he fought. Like the old lady in the story, they
“do not distinguish.” (Compare how our Arthur and the Saxon Cedric,
whom he fought at Langport, were both identified with the dragon.)
After this warning, the chief events of the Egyptian Old Testament
may not seem so absurd. They centre round “The Destruction of
Mankind,” the original of all our myths.
The story is that Isis became angry with mankind because of their
infidelity, and determined to slay them all. She set about it with a
will and the earth ran red with their blood; but when she was near
the end of her task, the other gods took pity on those who were left,
and determined to thwart her. This they did by giving her some
doctored beer, whereupon she became “genially inoffensive”—and so
the remnant escaped; and to this day their descendants generally
regard beer with an almost superstitious veneration. The Flood is an
obvious and world-wide variation of this theme.
The next stage is that Isis the slayer becomes Isis the slain, whose
sacrifice will atone for the sins of mankind. The grandmother
goddess then becomes a mere mortal, “a beautiful and attractive
maiden”—say a virgin: the virgin is then abandoned to her fate, and
rescued by the conquering hero, and we are hot on the trail of
Perseus and St. George.
But, you will say, what has all this to do with dragons? It must be
admitted that in Egypt, “the great breeding-place of monsters,” no
dragons survive in full-blown splendour; but these legends are the
germ of all, and from them springs the essential dragon-conflict, the
vendetta of Horus against the powers of darkness. The dragon has
also been identified with Osiris the good controller of water, with Set
the evil who killed him, with Isis in so far as she is confused with
Osiris, and with Horus as the successor of Osiris, but we shall only
become confused if we try to follow all its transformations.
We have come now to the end of all our tales, and I shall try in
the last part of this section to link up all the parts; to show you how
remarkably little essential change there has been in man’s thinking
for fifty centuries, and how the commonplace incidents of originally
prosaic stories became distorted and elaborated with corroborative
detail, quite regardless of the original and often forgotten meaning.
Reference: Elliot Smith, The Evolution of the Dragon (London,
1920).
VI
OF THE BIRTH AND DEATH OF THE DRAGON

The chief satisfaction which learned men appear to derive from
these tales is quarrelling about their common or separate origin. The
Separatists say that their resemblances merely show how very much
alike men are, the world over; the Communists that they are so very
intricate and so far from obvious that they must have sprung from a
common stock (cf. Mr. W. J. Perry’s and Professor Elliot Smith’s
theories as to the common—Egyptian—origin of militarism, mining,
and many other branches of megalithic and modern culture).
Personally I am a Communist; for it is a perfectly good principle,
common to science and theology, that miracles are not to be
multiplied, beyond necessity. The question is, in any case, of no
fundamental importance to us, but it will simplify what follows if I
make my standpoint plain.
When our first fathers found themselves at large in this already
ancient world, the first fact they noticed was that they were alive.
Like all their descendants after them, they wisely worshipped facts,
and they made a religion of fertility; like us too, and like all those
who will follow us, they knew nothing certain of the two infinities
from which we come and to which we go, before birth and after
death. The next fact they noticed was that other men died, though
their minds shrank in horror from the fact that they too must die,
and could not entertain it. They hankered after immortality, for their
dear ones and (later) for themselves, as we hanker after it, and as
our children will; for in course of time it became a commonplace of
all the world that all men must die, and this doom of the “sad-eyed
race of mortal men” is the theme of pathos throughout antiquity.
Their souls rebelled against the bitterness of death, and the search
for the elixir of life (to renew man’s youth and to give him
immortality) has been “the inspiration of most of the world’s great
literature in every age and clime, and not only of our literature but of
all our civilization.”
They worshipped life, and feared and hated death. And so they
worshipped women, and the womb from which they all sprang. For
good luck they carried amulets, shells especially; and from being
amulets these shells came to be worshipped as the actual source of
life, were personified and made symbols again of the Great Mother,
the giver of life. (So Aphrodite, the goddess of love, came floating in
a shell on the foam of the sea to gladden the hearts of men). They
noticed, too, that water was the first necessity of men and beasts
and plants, and that dead men and things stiffened and withered as
though the water was gone out of them; and so they worshipped
water as the principle of life, and the water-god was the second-
born. Then, turning their vision further a-field, they took note of the
regular motions of the moon, her monthly course, and her strange
connection with the tides of the sea; and so the Great Mother
became identified with the Moon. And then as they pondered they
felt the greater glory of the Sun, and set him up above his mother
the Moon; but the moon long remained the personification of order
and light and goodness, set over against chaos and darkness and
evil—though in time it was the sun, or his successor-sun, who came
to be regarded as the prince of light.
They hated death, and in the presence of it protested their belief
that somehow, somewhere, the dead continued to live, needing all
the gifts his family could bring—a primitive doctrine of immortality.
And then, in the presence of corruption, they made plans to
preserve the body: they burnt incense to restore the odour of life;
they poured libations to replace the vital juices. They tried to infuse
blood, the life-giver (for “blood,” as we say still, “is thicker than
water”) or to find some painted substitute. They hung the tomb with
magic shells, that the dead might be born again. And when, after all,
the body still decayed, they made statues instead for the soul to
inhabit, and tried their charms on them; and from the idea that
statues can come to life grows the contrary idea that men can be
turned to stone. The crowning triumph of their statuary was the eye,
making the statue (as we say) “a living image”; and from the idea
that the open eye means life, came the belief in the power of the
eye for good or evil. To this day the neglect of the poorest grave is
regarded as a more than callous crime, and there are not wanting
those amongst us who shudder at the desecration of the age-old
tombs of Egyptian kings.
Thus it was in the beginning. And when in process of time a wise
king discovered the arts of irrigation (it may be that this discovery
made him king; or perhaps kingship originated with the discovery of
the calendar, which conferred the gift of prophecy: “king” here is in
any case premature), and spread fertility throughout the land, they
worshipped him too and made him a living god, and cherished him
as the soul (as it were) of their land’s fertility. And when he grew
old, and his powers began to wane, terror fell on them lest their
fortunes should fail with him and they be all dead men. So they
transferred their worship for the king to his office, killed him, and
made his son divine. And when he too began to age, they killed him
in turn, and his sons after him, so that they always had a young and
vigorous king-god. Until in time an ageing king refused to submit,
and this was the origin of the story of the wrath of the gods and the
destruction of mankind. Time passed, and the monarch was replaced
by a maiden among his subjects, and we are at the stage of ordinary
human sacrifice, “human blood being thought of as the only elixir.”
But in time that, too, was ended, by a kind of religious reformation,
through the belief that any other blood would do as well; and this
was the origin of the story of the rescued maiden and her deliverer.
They worshipped water, and they worshipped shells, and so the
pearl within the oyster-shell; and diving for pearls, their natural
enemy was the shark, the guardian of the treasure and the only true
and original dragon. But in the course of ages all this was naturally
forgotten, and the dragon came to be adorned with all the terrors of
all the monsters of travellers’ tales, from the python to the octopus
and the lion that lives in the waste. Any terrible or impressive fact of
life or nature—the existence of evil, or of hoary mountains—gave
rise to a fresh dragon-tale; and the fact was then brought in as
evidence of the truth of the tale, very much as a politician to-day will
convince men of his general veracity and wisdom by stating some
obvious truth; and in the absence of facts, the vague terrors of
untutored minds became embodied in similar monsters; and so in a
sense they are still, though nowadays we call the result a “complex.”
EPILOGUE
I would not wish man rid of the dragon as death; partly, no doubt,
because I know it to be impossible (“This business of death is a plain
case and admits no controversy”); partly because death is such a
satisfactory thing: it is always something to look forward to. Death is
perhaps the oldest of the dragons, long since domesticated and
become the friend of man through familiarity.
But there remains that dragon of which we spoke in the
beginning, compounded of respectability and bigotry and cant; or
rather these things are the evidence that the dragon still exists, for
they are all the effects of terror: terror of truth and knowledge and
hard fact, the old terror of man “a stranger and afraid, in a world he
never made.” This monster dwells not in the desert places of the
earth, but in the hearth and home of every man. Its appetite is
enormous and its destructive powers are equalled only by its fertility.
Like all the other dragons, it is begotten by dogma out of ignorance.
It would be a mistake to suppose (as some have done) that
religion is altogether a bad thing because it has fostered many
errors, or altogether a fraud because it is profitable to priests. Every
science under the sun has fostered innumerable errors, and every
doctor on earth practises pious frauds daily, seldom solely for his
private ends. Mankind as a whole has had a hand in these
imaginings for half-a-hundred centuries; our certain knowledge of
our surroundings is to this day infinitesimal; and “it is part of our
human make-up to bridge the gaps in our experience with rumours,
with conjectures, and with soothing traditions.”
Not many months ago there came to these shores a Chinese
game, Mah Jongg, so perfected in the course of centuries that not
even a Chinaman can cheat at it. Is it too much to hope that, with
the general increase of knowledge and the general recognition of the
limits to which our knowledge can attain, this old world may yet
produce some saint or hero who will finally rescue Andromeda from
the dragon?

Transcriber’s Notes:
Printer’s, punctuation, and spelling inaccuracies were silently
corrected.
Archaic and variable spelling has been preserved.
*** END OF THE PROJECT GUTENBERG EBOOK PERSEUS; OR, OF
DRAGONS ***

Updated editions will replace the previous one—the old editions


will be renamed.

Creating the works from print editions not protected by U.S.


copyright law means that no one owns a United States
copyright in these works, so the Foundation (and you!) can copy
and distribute it in the United States without permission and
without paying copyright royalties. Special rules, set forth in the
General Terms of Use part of this license, apply to copying and
distributing Project Gutenberg™ electronic works to protect the
PROJECT GUTENBERG™ concept and trademark. Project
Gutenberg is a registered trademark, and may not be used if
you charge for an eBook, except by following the terms of the
trademark license, including paying royalties for use of the
Project Gutenberg trademark. If you do not charge anything for
copies of this eBook, complying with the trademark license is
very easy. You may use this eBook for nearly any purpose such
as creation of derivative works, reports, performances and
research. Project Gutenberg eBooks may be modified and
printed and given away—you may do practically ANYTHING in
the United States with eBooks not protected by U.S. copyright
law. Redistribution is subject to the trademark license, especially
commercial redistribution.

START: FULL LICENSE


THE FULL PROJECT GUTENBERG LICENSE
PLEASE READ THIS BEFORE YOU DISTRIBUTE OR USE THIS WORK

To protect the Project Gutenberg™ mission of promoting the


free distribution of electronic works, by using or distributing this
work (or any other work associated in any way with the phrase
“Project Gutenberg”), you agree to comply with all the terms of
the Full Project Gutenberg™ License available with this file or
online at www.gutenberg.org/license.

Section 1. General Terms of Use and


Redistributing Project Gutenberg™
electronic works
1.A. By reading or using any part of this Project Gutenberg™
electronic work, you indicate that you have read, understand,
agree to and accept all the terms of this license and intellectual
property (trademark/copyright) agreement. If you do not agree
to abide by all the terms of this agreement, you must cease
using and return or destroy all copies of Project Gutenberg™
electronic works in your possession. If you paid a fee for
obtaining a copy of or access to a Project Gutenberg™
electronic work and you do not agree to be bound by the terms
of this agreement, you may obtain a refund from the person or
entity to whom you paid the fee as set forth in paragraph 1.E.8.

1.B. “Project Gutenberg” is a registered trademark. It may only be used on or associated in any way with an electronic work by
people who agree to be bound by the terms of this agreement.
There are a few things that you can do with most Project
Gutenberg™ electronic works even without complying with the
full terms of this agreement. See paragraph 1.C below. There
are a lot of things you can do with Project Gutenberg™
electronic works if you follow the terms of this agreement and
help preserve free future access to Project Gutenberg™
electronic works. See paragraph 1.E below.
1.C. The Project Gutenberg Literary Archive Foundation (“the
Foundation” or PGLAF), owns a compilation copyright in the
collection of Project Gutenberg™ electronic works. Nearly all the
individual works in the collection are in the public domain in the
United States. If an individual work is unprotected by copyright
law in the United States and you are located in the United
States, we do not claim a right to prevent you from copying,
distributing, performing, displaying or creating derivative works
based on the work as long as all references to Project
Gutenberg are removed. Of course, we hope that you will
support the Project Gutenberg™ mission of promoting free
access to electronic works by freely sharing Project Gutenberg™
works in compliance with the terms of this agreement for
keeping the Project Gutenberg™ name associated with the
work. You can easily comply with the terms of this agreement
by keeping this work in the same format with its attached full
Project Gutenberg™ License when you share it without charge
with others.

1.D. The copyright laws of the place where you are located also
govern what you can do with this work. Copyright laws in most
countries are in a constant state of change. If you are outside
the United States, check the laws of your country in addition to
the terms of this agreement before downloading, copying,
displaying, performing, distributing or creating derivative works
based on this work or any other Project Gutenberg™ work. The
Foundation makes no representations concerning the copyright
status of any work in any country other than the United States.

1.E. Unless you have removed all references to Project Gutenberg:

1.E.1. The following sentence, with active links to, or other immediate access to, the full Project Gutenberg™ License must
appear prominently whenever any copy of a Project
Gutenberg™ work (any work on which the phrase “Project
Gutenberg” appears, or with which the phrase “Project
Gutenberg” is associated) is accessed, displayed, performed,
viewed, copied or distributed:

This eBook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and
with almost no restrictions whatsoever. You may copy it,
give it away or re-use it under the terms of the Project
Gutenberg License included with this eBook or online at
www.gutenberg.org. If you are not located in the United
States, you will have to check the laws of the country
where you are located before using this eBook.

1.E.2. If an individual Project Gutenberg™ electronic work is derived from texts not protected by U.S. copyright law (does not
contain a notice indicating that it is posted with permission of
the copyright holder), the work can be copied and distributed to
anyone in the United States without paying any fees or charges.
If you are redistributing or providing access to a work with the
phrase “Project Gutenberg” associated with or appearing on the
work, you must comply either with the requirements of
paragraphs 1.E.1 through 1.E.7 or obtain permission for the use
of the work and the Project Gutenberg™ trademark as set forth
in paragraphs 1.E.8 or 1.E.9.

1.E.3. If an individual Project Gutenberg™ electronic work is posted with the permission of the copyright holder, your use and
distribution must comply with both paragraphs 1.E.1 through
1.E.7 and any additional terms imposed by the copyright holder.
Additional terms will be linked to the Project Gutenberg™
License for all works posted with the permission of the copyright
holder found at the beginning of this work.

1.E.4. Do not unlink or detach or remove the full Project Gutenberg™ License terms from this work, or any files
containing a part of this work or any other work associated with
Project Gutenberg™.

1.E.5. Do not copy, display, perform, distribute or redistribute this electronic work, or any part of this electronic work, without
prominently displaying the sentence set forth in paragraph 1.E.1
with active links or immediate access to the full terms of the
Project Gutenberg™ License.

1.E.6. You may convert to and distribute this work in any binary,
compressed, marked up, nonproprietary or proprietary form,
including any word processing or hypertext form. However, if
you provide access to or distribute copies of a Project
Gutenberg™ work in a format other than “Plain Vanilla ASCII” or
other format used in the official version posted on the official
Project Gutenberg™ website (www.gutenberg.org), you must,
at no additional cost, fee or expense to the user, provide a copy,
a means of exporting a copy, or a means of obtaining a copy
upon request, of the work in its original “Plain Vanilla ASCII” or
other form. Any alternate format must include the full Project
Gutenberg™ License as specified in paragraph 1.E.1.

1.E.7. Do not charge a fee for access to, viewing, displaying, performing, copying or distributing any Project Gutenberg™
works unless you comply with paragraph 1.E.8 or 1.E.9.

1.E.8. You may charge a reasonable fee for copies of or providing access to or distributing Project Gutenberg™
electronic works provided that:

• You pay a royalty fee of 20% of the gross profits you derive
from the use of Project Gutenberg™ works calculated using the
method you already use to calculate your applicable taxes. The
fee is owed to the owner of the Project Gutenberg™ trademark,
but he has agreed to donate royalties under this paragraph to
the Project Gutenberg Literary Archive Foundation. Royalty
payments must be paid within 60 days following each date on
which you prepare (or are legally required to prepare) your
periodic tax returns. Royalty payments should be clearly marked
as such and sent to the Project Gutenberg Literary Archive
Foundation at the address specified in Section 4, “Information
about donations to the Project Gutenberg Literary Archive
Foundation.”

• You provide a full refund of any money paid by a user who notifies you in writing (or by e-mail) within 30 days of receipt
that s/he does not agree to the terms of the full Project
Gutenberg™ License. You must require such a user to return or
destroy all copies of the works possessed in a physical medium
and discontinue all use of and all access to other copies of
Project Gutenberg™ works.

• You provide, in accordance with paragraph 1.F.3, a full refund of any money paid for a work or a replacement copy, if a defect in
the electronic work is discovered and reported to you within 90
days of receipt of the work.

• You comply with all other terms of this agreement for free
distribution of Project Gutenberg™ works.

1.E.9. If you wish to charge a fee or distribute a Project Gutenberg™ electronic work or group of works on different
terms than are set forth in this agreement, you must obtain
permission in writing from the Project Gutenberg Literary
Archive Foundation, the manager of the Project Gutenberg™
trademark. Contact the Foundation as set forth in Section 3
below.

1.F.

1.F.1. Project Gutenberg volunteers and employees expend considerable effort to identify, do copyright research on,
transcribe and proofread works not protected by U.S. copyright
law in creating the Project Gutenberg™ collection. Despite these
efforts, Project Gutenberg™ electronic works, and the medium
on which they may be stored, may contain “Defects,” such as,
but not limited to, incomplete, inaccurate or corrupt data,
transcription errors, a copyright or other intellectual property
infringement, a defective or damaged disk or other medium, a
computer virus, or computer codes that damage or cannot be
read by your equipment.

1.F.2. LIMITED WARRANTY, DISCLAIMER OF DAMAGES - Except for the “Right of Replacement or Refund” described in
paragraph 1.F.3, the Project Gutenberg Literary Archive
Foundation, the owner of the Project Gutenberg™ trademark,
and any other party distributing a Project Gutenberg™ electronic
work under this agreement, disclaim all liability to you for
damages, costs and expenses, including legal fees. YOU AGREE
THAT YOU HAVE NO REMEDIES FOR NEGLIGENCE, STRICT
LIABILITY, BREACH OF WARRANTY OR BREACH OF CONTRACT
EXCEPT THOSE PROVIDED IN PARAGRAPH 1.F.3. YOU AGREE
THAT THE FOUNDATION, THE TRADEMARK OWNER, AND ANY
DISTRIBUTOR UNDER THIS AGREEMENT WILL NOT BE LIABLE
TO YOU FOR ACTUAL, DIRECT, INDIRECT, CONSEQUENTIAL,
PUNITIVE OR INCIDENTAL DAMAGES EVEN IF YOU GIVE
NOTICE OF THE POSSIBILITY OF SUCH DAMAGE.

1.F.3. LIMITED RIGHT OF REPLACEMENT OR REFUND - If you discover a defect in this electronic work within 90 days of
receiving it, you can receive a refund of the money (if any) you
paid for it by sending a written explanation to the person you
received the work from. If you received the work on a physical
medium, you must return the medium with your written
explanation. The person or entity that provided you with the
defective work may elect to provide a replacement copy in lieu
of a refund. If you received the work electronically, the person
or entity providing it to you may choose to give you a second
opportunity to receive the work electronically in lieu of a refund.
If the second copy is also defective, you may demand a refund
in writing without further opportunities to fix the problem.

1.F.4. Except for the limited right of replacement or refund set forth in paragraph 1.F.3, this work is provided to you ‘AS-IS’,
WITH NO OTHER WARRANTIES OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR ANY PURPOSE.

1.F.5. Some states do not allow disclaimers of certain implied warranties or the exclusion or limitation of certain types of
damages. If any disclaimer or limitation set forth in this
agreement violates the law of the state applicable to this
agreement, the agreement shall be interpreted to make the
maximum disclaimer or limitation permitted by the applicable
state law. The invalidity or unenforceability of any provision of
this agreement shall not void the remaining provisions.

1.F.6. INDEMNITY - You agree to indemnify and hold the Foundation, the trademark owner, any agent or employee of the
Foundation, anyone providing copies of Project Gutenberg™
electronic works in accordance with this agreement, and any
volunteers associated with the production, promotion and
distribution of Project Gutenberg™ electronic works, harmless
from all liability, costs and expenses, including legal fees, that
arise directly or indirectly from any of the following which you
do or cause to occur: (a) distribution of this or any Project
Gutenberg™ work, (b) alteration, modification, or additions or
deletions to any Project Gutenberg™ work, and (c) any Defect
you cause.

Section 2. Information about the Mission of Project Gutenberg™
Project Gutenberg™ is synonymous with the free distribution of
electronic works in formats readable by the widest variety of
computers including obsolete, old, middle-aged and new
computers. It exists because of the efforts of hundreds of
volunteers and donations from people in all walks of life.

Volunteers and financial support to provide volunteers with the assistance they need are critical to reaching Project
Gutenberg™’s goals and ensuring that the Project Gutenberg™
collection will remain freely available for generations to come. In
2001, the Project Gutenberg Literary Archive Foundation was
created to provide a secure and permanent future for Project
Gutenberg™ and future generations. To learn more about the
Project Gutenberg Literary Archive Foundation and how your
efforts and donations can help, see Sections 3 and 4 and the
Foundation information page at www.gutenberg.org.

Section 3. Information about the Project Gutenberg Literary Archive Foundation
The Project Gutenberg Literary Archive Foundation is a non-
profit 501(c)(3) educational corporation organized under the
laws of the state of Mississippi and granted tax exempt status
by the Internal Revenue Service. The Foundation’s EIN or
federal tax identification number is 64-6221541. Contributions
to the Project Gutenberg Literary Archive Foundation are tax
deductible to the full extent permitted by U.S. federal laws and
your state’s laws.

The Foundation’s business office is located at 809 North 1500 West, Salt Lake City, UT 84116, (801) 596-1887. Email contact
links and up to date contact information can be found at the
Foundation’s website and official page at
www.gutenberg.org/contact

Section 4. Information about Donations to the Project Gutenberg Literary Archive Foundation
Project Gutenberg™ depends upon and cannot survive without
widespread public support and donations to carry out its mission
of increasing the number of public domain and licensed works
that can be freely distributed in machine-readable form
accessible by the widest array of equipment including outdated
equipment. Many small donations ($1 to $5,000) are particularly
important to maintaining tax exempt status with the IRS.

The Foundation is committed to complying with the laws


regulating charities and charitable donations in all 50 states of
the United States. Compliance requirements are not uniform
and it takes a considerable effort, much paperwork and many
fees to meet and keep up with these requirements. We do not
solicit donations in locations where we have not received written
confirmation of compliance. To SEND DONATIONS or determine
the status of compliance for any particular state visit
www.gutenberg.org/donate.

While we cannot and do not solicit contributions from states where we have not met the solicitation requirements, we know
of no prohibition against accepting unsolicited donations from
donors in such states who approach us with offers to donate.

International donations are gratefully accepted, but we cannot make any statements concerning tax treatment of donations
received from outside the United States. U.S. laws alone swamp
our small staff.

Please check the Project Gutenberg web pages for current donation methods and addresses. Donations are accepted in a
number of other ways including checks, online payments and
credit card donations. To donate, please visit:
www.gutenberg.org/donate.

Section 5. General Information About Project Gutenberg™ electronic works
Professor Michael S. Hart was the originator of the Project
Gutenberg™ concept of a library of electronic works that could
be freely shared with anyone. For forty years, he produced and
distributed Project Gutenberg™ eBooks with only a loose
network of volunteer support.

Project Gutenberg™ eBooks are often created from several printed editions, all of which are confirmed as not protected by
copyright in the U.S. unless a copyright notice is included. Thus,
we do not necessarily keep eBooks in compliance with any
particular paper edition.

Most people start at our website which has the main PG search
facility: www.gutenberg.org.

This website includes information about Project Gutenberg™, including how to make donations to the Project Gutenberg
Literary Archive Foundation, how to help produce our new
eBooks, and how to subscribe to our email newsletter to hear
about new eBooks.