High Performance
Computing
on Vector Systems
2007
Michael Resch
Sabine Roller
Peter Lammers
Höchstleistungsrechenzentrum Stuttgart (HLRS)
Universität Stuttgart
Nobelstraße 19
70569 Stuttgart, Germany
[email protected]
[email protected]
[email protected]

Toshiyuki Furui
NEC Corporation
Nisshin-cho 1-10
183-8501 Tokyo, Japan
[email protected]

Wolfgang Bez
Martin Galle
NEC High Performance Computing Europe GmbH
Prinzenallee 11
40459 Düsseldorf, Germany
[email protected]
[email protected]
Front cover figure: Impression of the projected tidal current power plant to be built in the South Korean province of Wando. Picture courtesy of RENETEC, Jongseon Park, in cooperation with the Institute of Fluid Mechanics and Hydraulic Machinery, University of Stuttgart
DOI 10.1007/978-3-540-74384-2
Library of Congress Control Number: 2007936175
Mathematics Subject Classification (2000): 68Wxx, 68W10, 68U20, 76-XX, 86A05, 86A10, 70Fxx
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting,
reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9,
1965, in its current version, and permission for use must always be obtained from Springer. Violations
are liable to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not
imply, even in the absence of a specific statement, that such names are exempt from the relevant protective
laws and regulations and therefore free for general use.
springer.com
Preface
more than 10 applications so far. The best performing application is the hydrodynamics code BEST, which is based on the solution of the Lattice Boltzmann equations. This application achieves a performance of 5.7 TFLOP/s on the 72-node SX-8. Other hydrodynamics, oceanography, and climatology applications also run very efficiently on the SX-8 architecture. The enhancement of applications and their adaptation to the SX-8 vector architecture within the collaboration will continue.
On the other hand, the Teraflop Workbench project works on supporting future applications, looking at the requirements users ask for. In that context, we see an increasing interest in coupled applications, in which different codes interact to simulate complex systems spanning multiple physical regimes. Examples of such couplings are fluid-structure or ocean-atmosphere interactions. The codes in these coupled applications may have completely different requirements concerning the computer architecture, which often means that they run most efficiently on different platforms. The efficient execution of coupled applications requires a close integration of the different platforms. Platform integration and support for coupled applications will become another important part of the TERAFLOP Workbench collaboration.
Within the framework of the TERAFLOP Workbench collaboration, semi-
annual workshops are carried out in which researchers and computer experts
come together to exchange their experiences. The workshop series started in
2004 with the 1st TERAFLOP Workshop in Stuttgart. In autumn 2005, the
TERAFLOP Workshop Japan session was established with the 3rd TERAFLOP Workshop in Tokyo.
The following book presents contributions from the 6th TERAFLOP Workshop, which was hosted by Tohoku University in Sendai, Japan, in autumn 2006, and from the 7th Workshop, which was held in Stuttgart in spring 2007. The focus is on current applications and future requirements, as well as on developments of next-generation hardware architectures and installations.
Starting with a section on geophysics and climate simulations, the suitability and necessity of vector systems is justified by showing sustained teraflop performance. Earthquake simulations based on the Spectral-Element Method demonstrate that the synthetic seismic waves computed by this numerical technique match the observed seismic waves accurately. Further papers address cloud-resolving simulation of tropical cyclones, or the question: What is the impact of small-scale phenomena on large-scale ocean and climate modeling? Ensemble climate model simulations, described in the closing paper of this section, enable scientists to better distinguish the forced signal due to the increase of greenhouse gases from internal climate variability.
A section on computational fluid dynamics (CFD) starts with a paper discussing the current capability of CFD and its maturity for reducing wind tunnel testing. Further papers in this section show simulations in applied fields such as aeronautics and flows in gas and steam turbines, as well as basic research and detailed performance analysis.
The following section considers multi-scale and multi-physics simulations based on CFD. Current applications in aero-acoustics and the coupling of Large-Eddy Simulation (LES) with acoustic perturbation equations (APE) start the section, followed by fluid-structure interaction (FSI) in such different areas as tidal current turbines or the respiratory system. The section is closed by a paper addressing the algorithmic and implementation issues associated with FSI simulations on vector architectures. These examples show the trend towards coupled applications and the requirements that come with future simulation techniques.
The section on chemistry and astrophysics combines simulations of premixed swirling flames and supernova simulations. The common basis for both applications is the combination of a hydrodynamic module with processes such as chemical kinetics or multi-flavour, multi-frequency neutrino transport based on the Boltzmann transport equation, respectively.
A section on material science closes the applications part. Green chemistry from supercomputers considers Car-Parrinello simulations of ionic liquids. Micromagnetic simulations of magnetic recording media allow new head and media designs to be evaluated and optimized prior to fabrication.
These sections show the wide range of application areas covered on current vector systems. The closing section on Future High Performance Systems considers the potential of on-chip memory systems for future vector architectures. A technical note describing the TSUBAME installation at the Tokyo Institute of Technology (TiTech) closes the book.
The papers presented in this book lay out the wide range of fields in
which sustained performance can be achieved if engineering knowledge, nu-
merical mathematics and computer science skills are brought together. With
the advent of hybrid systems, the Teraflop Workbench project will continue to support leading-edge computations for future applications.
The editors would like to thank all authors and Springer for making this
publication possible and would like to express their hope that the entire high
performance computing community will benefit from it.
Simulation of Seismic Waves on the Earth Simulator
Seiji Tsuboi
Summary. Earthquakes are very large scale ruptures inside the Earth and generate
elastic waves, known as seismic waves, which propagate inside the Earth. We use a
Spectral-Element Method implemented on the Earth Simulator in Japan to calculate
seismic waves generated by recent large earthquakes. The spectral-element method
is based on a weak formulation of the equations of motion and has both the flexibility
of a finite-element method and the accuracy of a pseudospectral method. We perform
numerical simulation of seismic wave propagation for a fully three-dimensional Earth
model, which incorporates realistic 3D variations of Earth’s internal properties. The
simulations are performed on 4056 processors, which require 507 out of 640 nodes
of the Earth Simulator. We use a mesh with 206 million spectral-elements, for a
total of 13.8 billion global integration grid points (i.e., almost 37 billion degrees
of freedom). We show examples of simulations and demonstrate that the synthetic
seismic waves computed by this numerical technique match the observed seismic waves accurately.
1 Introduction
The Earth is an active planet, which exhibits thermal convection of the solid mantle and resultant plate dynamics at the surface. As a result of continuous plate motion, we have seismic activity and sometimes we experience huge earthquakes, which cause devastating damage to human society. In order to understand the rupture process during large earthquakes, we need accurate modeling of seismic wave propagation in fully three-dimensional (3-D) Earth models, which is of considerable interest in seismology. However, significant deviations of Earth's internal structure from spherical symmetry, such as the 3-D seismic-wave velocity structure inside the solid mantle and the laterally heterogeneous crust at the surface of the Earth, have made applications of analytical approaches to this problem a formidable task. The numerical
modeling of seismic-wave propagation in 3-D structures has been significantly
advanced in the last few years due to the introduction of the Spectral-Element
Method (SEM), which is a high-degree version of the finite-element method.
The 3-D SEM was first used in seismology for local and regional simulations ([Ko97]-[Fa97]), and more recently adapted to wave propagation at the scale of the full Earth ([Ch00]-[Ko02]).
In addition, a massively parallel supercomputer started operation in 2002 at the Japan Agency for Marine-Earth Science and Technology (JAMSTEC). The machine is called the Earth Simulator and is dedicated specifically to basic research in Earth sciences, such as climate modeling (http://www.es.jamstec.go.jp). The Earth Simulator consists of 640 processor nodes, each equipped with 8 vector processors. Each vector processor has a peak performance of 8 GFLOPS, and the main memory is 2 gigabytes per processor. In total, the peak performance of the Earth Simulator is about 40 TFLOPS and the maximum memory size is about 10 terabytes. In 2002, the Earth Simulator scored 36 TFLOPS in sustained performance and was ranked No. 1 in the TOP500.
Here we show that our implementation of the SEM on the Earth Simulator in Japan allows us to calculate theoretical seismic waves which are accurate at periods of 3.5 seconds and longer for fully 3-D Earth models. We include the full complexity of the 3-D Earth in our simulations, i.e., a 3-D seismic wave velocity [Ri99] and density structure, a 3-D crustal model [Ba00], ellipticity, as well as topography and bathymetry. Because the dominant frequency of body waves, seismic waves that travel inside the Earth, is 1 Hz, it is desirable to have synthetic seismograms with an accuracy of 1 Hz. Nevertheless, synthetic waveforms at this resolution (periods of 3.5 seconds and longer) already allow us to perform direct comparisons between observed and synthetic seismograms in various respects, which had never been accomplished before. Conventional seismological algorithms, such as normal-mode
summation techniques that calculate quasi-analytical synthetic seismograms
for one-dimensional (1-D) spherically symmetric Earth models [Dz81], are
typically accurate down to 8 seconds [Da98]. In other words, the SEM on
the Earth Simulator allows us to simulate global seismic wave propagation in
fully 3-D Earth models at periods shorter than current seismological practice
for simpler 1-D spherically symmetric models. The results of our simulation
show that the synthetic seismograms calculated for fully 3-D Earth models
by using the Earth Simulator and the SEM agree well with the observed seis-
mograms, which enables us to investigate the earthquake rupture history and
the Earth’s internal structure in much higher resolution than before.
2 Numerical Technique
We use the spectral-element method (SEM) developed by Komatitsch and
Tromp [Ko02a, Ko02b] to simulate global seismic wave propagation through-
out a 3-D Earth model, which includes a 3-D seismic velocity and den-
sity structure, a 3-D crustal model, ellipticity as well as topography and
bathymetry. The SEM first divides the Earth into six chunks. Each of the
six chunks is divided into slices. Each slice is allocated to one CPU of the
Earth Simulator. Communication between each CPU is done by MPI. Be-
fore the system can be marched forward in time, the contributions from all
the elements that share a common global grid point need to be summed.
Since the global mass matrix is diagonal, time discretization of the second-
order ordinary differential equation is achieved based upon a classical explicit
second-order finite-difference scheme.
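To make this time-marching step concrete, the following sketch (written in Python for illustration; it is not the actual SPECFEM3D code, and the array shapes, the assembled internal-force term, and the external-force term are placeholder assumptions) shows how a diagonal mass matrix turns the global update into a purely pointwise operation under the classical explicit second-order central-difference scheme:

    def central_difference_step(u, u_prev, mass_diag, internal_forces, external_forces, dt):
        """One explicit, second-order (central-difference) time step of the SEM system.

        u, u_prev       : global displacement vectors at steps n and n-1
        mass_diag       : diagonal of the assembled global mass matrix
        internal_forces : stiffness term K u^n, obtained by summing element
                          contributions at shared global grid points
        external_forces : source term at step n
        """
        # Because the mass matrix is diagonal, "solving" M a = f is an elementwise
        # division; no global linear system has to be factorized or iterated.
        accel = (external_forces - internal_forces) / mass_diag
        # u^{n+1} = 2 u^n - u^{n-1} + dt^2 a^n
        return 2.0 * u - u_prev + dt**2 * accel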
For this simulation we used 4056 processors, i.e., 507 of the 640 nodes of the Earth Simulator. This means that each chunk is
subdivided into 26 × 26 slices (6 × 26 × 26 = 4056). Each slice is allocated to
one processor of the Earth Simulator and subdivided with a mesh of 48 × 48
spectral-elements at the surface of each slice. Within each surface element we
use 5 × 5 = 25 Gauss-Lobatto-Legendre (GLL) grid points to interpolate the
wave field [Ko98, Ko99], which translates into an average grid spacing of 2.0
km (i.e., 0.018 degrees) at the surface. The total number of spectral elements
in this mesh is 206 million, which corresponds to a total of 13.8 billion global
grid points, since each spectral element contains 5 × 5 × 5 = 125 grid points,
but with points on its faces shared by neighboring elements. This in turn
corresponds to 36.6 billion degrees of freedom (the total number of degrees of
freedom is slightly less than 3 times the number of grid points because we solve
for the three components of displacement everywhere in the mesh, except in
the liquid outer core of the Earth where we solve for a scalar potential). Using
this mesh, we can calculate synthetic seismograms that are accurate down to
seismic periods of 3.5 seconds. Total performance of the code, measured using
the MPI Program Runtime Performance Information was 10 teraflops, which
is about one third of the expected peak performance for this number of nodes
(507 nodes × 64 gigaflops = 32 teraflops). Figure 1 shows a global view of the spectral-element mesh at the surface of the Earth. Before we could use 507 nodes of the Earth Simulator for this simulation, we had successfully used 243 nodes to calculate synthetic seismograms. Using 243 nodes (1944
CPUs), we can subdivide the six chunks into 1944 slices (1944 = 6 × 18 × 18).
Each slice is then subdivided into 48 elements in one direction. Because each
element has 5 Gauss-Lobatto Legendre integration points, the average grid
spacing at the surface of the Earth is about 2.9 km. The number of grid
points in total amounts to about 5.5 billion. Using this mesh, it is expected
that we can calculate synthetic seismograms accurate down to periods of 5 seconds all over the
globe. For the 243 nodes case, the total performance we achieved was about 5
teraflops, which also is about one third of the peak performance. The fact that
when we double the number of nodes from 243 to 507 the total performance
also doubles from 5 teraflops to 10 teraflops shows that our SEM code exhibits
an excellent scaling relation with respect to performance. The details of our
computation with 243 nodes of the Earth Simulator were described in Tsuboi
et al. (2003) [Ts03] and Komatitsch et al. (2003) [Ko03a], which was awarded the 2003 Gordon Bell Prize for peak performance at SC2003.
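The mesh bookkeeping quoted above can be checked with a few lines of arithmetic. The following sketch (Python; the shared-face correction factor and the depth-averaged element count are rough assumptions rather than the exact SPECFEM3D mesh counts) reproduces the order of magnitude of the quoted numbers:

    # Bookkeeping for the 507-node (4056-CPU) mesh, following the numbers in the text.
    n_chunks = 6
    n_slices = n_chunks * 26 * 26                  # 4056 slices, one per processor
    surface_elems = n_slices * 48 * 48             # ~9.3 million elements in the top layer

    n_elements = 206e6                             # total spectral elements (given)
    ngll = 5                                       # GLL points per element edge

    # Each element carries 5**3 = 125 GLL points, but points on shared faces and edges
    # are stored once, so the unique global points are close to (ngll - 1)**3 per element.
    points_no_sharing = n_elements * ngll**3       # ~25.8 billion if nothing were shared
    points_estimate = n_elements * (ngll - 1)**3   # ~13.2 billion, vs. the quoted 13.8 billion

    # Three displacement components outside the fluid outer core (one scalar inside it),
    # which is why the quoted 36.6 billion is slightly below 3 x 13.8 billion = 41.4 billion.
    print(n_slices, surface_elems, points_no_sharing, points_estimate)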
Fig. 1. The SEM uses a mesh of hexahedral finite elements on which the wave field
is interpolated by high-degree Lagrange polynomials on Gauss-Lobatto-Legendre
(GLL) integration points. This figure shows a global view of the mesh at the surface,
illustrating that each of the six sides of the so-called “cubed sphere” mesh is divided
into 26 × 26 slices, shown here with different colors, for a total of 4056 slices (i.e.,
one slice per processor).
Fig. 2. Snapshots of the propagation of seismic waves excited by the December 26,
2004 Sumatra earthquake. Total displacement at the surface of the Earth is plotted
at 10 min after the origin time of the event.
The Earth’s internal structure is another target that we can study by using our
synthetic seismograms calculated for a fully 3-D Earth model. We describe here the example of Tono et al. (2005) [To05]. They used records of 500 tiltmeters of
the Hi-net, in addition to 60 broadband seismometers of the F-net, operated
by the National Research Institute for Earth Science and Disaster Prevention
of Japan (NIED). They analyzed pairs of sScS waves (S-waves that travel upward from the hypocenter, are reflected at the surface, and are reflected again at the core-mantle boundary) and their reverberations from the 410- or 660-km reflectors (sScSSdS, where d = 410 or 660 km) for the deep earth-
quake of the Russia-N.E. China border (PDE; 2002:06:28; 17:19:30.30; 43.75N;
130.67E; 566 km depth; 6.7 Mb). The two horizontal components are rotated
to obtain the transverse component. They have found that these records show
clearly the near-vertical reflections from the 410- and 660-km seismic veloc-
ity discontinuities inside the Earth as post-cursors of sScS phase. By reading
the travel time difference between sScS and sScSSdS, they concluded that
this differential travel time anomaly can be attributed to the depth anomaly
of the reflection point, because it is little affected by the uncertainties asso-
ciated with the hypocentral determination, structural complexities near the
source and receiver and long-wavelength mantle heterogeneity. The differen-
tial travel time anomaly is obtained by measuring the arrival time anomaly of
sScS and that of sScSSdS separately and then by taking their difference. The
arrival time anomaly of sScS (or sScSSdS) is measured by cross-correlating
the observed sScS (or sScSSdS) with the corresponding synthetic waveform
computed by SEM on the Earth Simulator. They plot the measured values
of the two-way near-vertical travel time anomaly at the corresponding sur-
face bounce points located beneath the Japan Sea. The results show that the
660-km boundary is depressed at a constant level of 15 km along the bot-
tom of the horizontally extending aseismic slab under southwestern Japan.
The transition from the normal to the depressed level occurs sharply, where
the 660-km boundary intersects the bottom of the obliquely subducting slab.
This observation should have important implications for geodynamic activity inside the Earth.
Another topic is the structure of the Earth's innermost core. The Earth has a solid inner core, with a radius of about 1200 km, inside the fluid core. It has been proposed that the inner core has an anisotropic structure, which means that the seismic velocity is faster in one direction than in another, and this has been used to infer inner core differential rotation [Zh05]. Because the Earth's magnetic field originates from convective fluid motion inside the fluid core, the evolution of the inner core should have an important effect on the evolution of the Earth's magnetic field.
Figure 4 illustrates definitions of typical seismic waves which travel through
the Earth’s core. The seismic wave, labeled as PKIKP, penetrates inside the
inner core and its propagation time from the earthquake hypocenter to the
Fig. 4. Raypaths and naming conventions of seismic waves that travel inside the Earth's core.
seismic station (that is, the travel time) is used to infer the seismic velocity structure inside the inner core. In particular, the dependence of the PKIKP travel time on direction is useful for estimating the anisotropic structure of the inner core. We calculate synthetic seismograms for these PKIKP and PKP(AB) waves and
evaluate the effect of inner core anisotropy on these waves. When we construct the global mesh in the SEM computation, we put one small slice at the center of the
Earth. Because of this, we do not have any singularity at the center of the
Earth, which makes our synthetic seismograms very accurate and unique. We
calculate synthetic seismograms by using the Earth Simulator and the SEM for the deep earthquake of April 8, 1999, in the E. USSR-N.E. China border region (43.66N, 130.47E, depth 575.4 km, Mw 7.1). We calculate synthetic seismograms for both an isotropic inner core model and an anisotropic inner core model and
compare with the observed seismograms. Figure 5 summarizes comparisons of
Fig. 5. Great circle paths to the broadband seismograph stations from the earth-
quake. Open circles show crossing points of Pdiff paths along the core mantle bound-
ary (CMB). Red circles show crossing points at the CMB for PKP(AB). Blue circles show crossing points at the CMB for PKIKP. Blue squares show crossing points at the ICB for PKIKP. Travel time differences of (synthetic)-(observed) are overlaid along the great circle paths with the color scale shown at the right of the figures. Comparison for the isotropic inner core model (top) and the anisotropic inner core model (bottom).
These results illustrate that the current anisotropic inner core model does improve the fit to the observations, but it should be modified further to obtain much better agreement. They also demonstrate that there exists some anomalous structure along some portions of the core-mantle boundary. This kind of anomalous structure should be incorporated in the Earth model to explain the observed travel time anomalies of Pdiff waves.
5 Discussion
We have shown that we can now calculate synthetic seismograms for a realistic 3-D Earth model with an accuracy of 3.5 s by using the Earth Simulator and the SEM. A period of 3.5 seconds is sufficiently short to explain various characteristics of seismic waves that propagate inside the aspherical Earth. However, it is also true that we need 1 Hz accuracy to explain body-wave travel time anomalies. We therefore consider whether it will be possible to calculate 1 Hz seismograms in the near future using a next-generation Earth Simulator. We
could calculate seismograms of 3.5 sec accuracy with 507 nodes (4056 CPUs)
of the Earth Simulator. The key to increasing the accuracy is the size of the mesh used in the SEM. If we reduce the size of one slice by half, the required memory quadruples and the accuracy increases by a factor of √2. Thus, to reach 1 Hz accuracy, we should reduce the mesh size to at least one fourth of that of the 3.5 s (507-node) case. If we assume that the memory available per CPU is the same as on the current Earth Simulator, we need at least 16 × 507 = 8112 nodes (64,896 CPUs). If we can use 4 times more memory per CPU, the number of CPUs becomes 16,224, which is a realistic value. We
have examined whether we will be able to obtain the expected performance on a possible candidate for the next-generation machine. We have an NEC SX-8R at JAMSTEC, whose peak performance per CPU is about 4 times higher than that of the Earth Simulator. We compiled our SEM program on the SX-8R as is and measured the performance. The result shows that the performance is less than two times faster than on the Earth Simulator. Because we have not yet optimized our code for the architecture of the SX-8R, there is still a possibility that we may obtain better performance. However, we have found that the reason why we did not get good performance is the memory access speed. As the memory used in the SX-8R is not as fast as that of the Earth Simulator, bank conflict time becomes a bottleneck for the performance. This result illustrates that it may become feasible to calculate 1 Hz synthetic seismograms on the next-generation machine, but it is necessary to have a good balance between CPU speed and memory size to get excellent performance.
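The resource estimate above can be restated as a short arithmetic sketch (Python; it simply encodes the scaling rule given in the text, namely that halving the slice size quadruples the memory and improves the accuracy by a factor of √2, rather than deriving it):

    nodes_3p5s = 507                         # nodes used for the 3.5 s mesh
    memory_factor = 4 ** 2                   # slice size reduced to 1/4 (two halvings) -> memory x 16
    nodes_1hz = memory_factor * nodes_3p5s   # 8112 nodes at the current memory per CPU
    cpus_1hz = nodes_1hz * 8                 # 64,896 CPUs (8 vector CPUs per node)

    # With 4 times more memory per CPU, the same mesh fits on a quarter of the CPUs.
    cpus_1hz_larger_memory = cpus_1hz // 4   # 16,224 CPUs
    print(nodes_1hz, cpus_1hz, cpus_1hz_larger_memory)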
Acknowledgments
The author used the program package SPECFEM3D, developed by Jeroen Tromp and Dimitri Komatitsch at Caltech, to perform the spectral-element method computations. All of the computations shown in this paper were done by
using the Earth Simulator operated by the Earth Simulator Center of JAM-
STEC. The rupture model of the 2004 Sumatra earthquake was provided by Chen Ji of the University of California, Santa Barbara. Figures 3 through 5 were prepared by Dr. Yoko Tono of JAMSTEC. Implementation of the SEM program on the SX-8R
was done by Dr. Ken’ichi Itakura of JAMSTEC.
References
[Ko97] Komatitsch, D.: Spectral and spectral-element methods for the 2D and 3D
elasto-dynamics equations in heterogeneous media, PhD thesis, Institut
de Physique du Globe, Paris (1997)
[Fa97] Faccioli, E., F. Maggio, R. Paolucci, A. Quarteroni: 2D and 3D elastic
wave propagation by a pseudo-spectral domain decomposition method. J.
Seismol., 1, 237–251 (1997)
[Se98] Seriani, G.: 3-D large-scale wave propagation modeling by a spectral el-
ement method on a Cray T3E multiprocessor. Comput. Methods Appl.
Mech. Engrg., 164, 235–247 (1998)
[Ch00] Chaljub, E.: Numerical modelling of the propagation of seismic waves in
spherical geometry: applications to global seismology. PhD thesis, Uni-
versité Paris VII Denis Diderot, Paris (2000)
[Ko02a] Komatitsch, D., J. Tromp: Spectral-element simulations of global seismic
wave propagation-I. Validation. Geophys. J. Int. 149, 390–412 (2002)
[Ko02b] Komatitsch, D, J. Tromp: Spectral-element simulations of global seismic
wave propagation-II. 3-D models, oceans, rotation, and self-gravitation.
Geophys. J. Int. 150, 303–318 (2002)
[Ko02] Komatitsch, D., J. Ritsema, J. Tromp: The spectral-element method, Be-
owulf computing, and global seismology. Science, 298, 1737–1742 (2002)
[Ri99] Ritsema, J., H. J. Van Heijst, J. H. Woodhouse: Complex shear velocity
structure imaged beneath Africa and Iceland. Science 286, 1925–1928
(1999)
[Ba00] Bassin, C., G. Laske, G. Masters: The current limits of resolution for
surface wave tomography in North America. EOS Trans. AGU. 81: Fall
Meet. Suppl., Abstract S12A-03 (2000)
[Dz81] Dziewonski, A. M., D. L. Anderson: Preliminary reference Earth model.
Phys. Earth Planet. Inter. 25, 297–356 (1981)
[Da98] Dahlen, F. A., J. Tromp: Theoretical Global Seismology. Princeton Uni-
versity Press, Princeton (1998)
[Ko98] Komatitsch, D., J. P. Vilotte: The spectral-element method: an efficient
tool to simulate the seismic response of 2D and 3D geological structures.
Bull. Seismol. Soc. Am. 88, 368–392 (1998)
[Ko99] Komatitsch, D., J. Tromp: Introduction to the spectral-element method
for 3-D seismic wave propagation. Geophys. J. Int. 139, 806–822 (1999)
[Ts03] Tsuboi, S., D. Komatitsch, C. Ji, J. Tromp: Broadband modeling of the
2003 Denali fault earthquake on the Earth Simulator, Phys. Earth Planet.
Int., 139, 305–312 (2003)
Fig. 1. Cloud microphysical processes. Boxes indicate water condensates and arrows indicate transformation or fallout processes. After Murakami (1990) [3]
integrated for 120 hours with and without ice phase processes, and the results are compared to see their impacts. Hereafter, the integrations with and without ice phase processes are called the control run and the warm rain run, respectively.
3.2 Results
The development and structure of a tropical cyclone are very different between the control and warm rain runs. Figure 2 depicts time sequences of the central mean sea level pressure (MSLP), the maximum azimuthally averaged tangential wind, the area-averaged precipitation rate, and the area-averaged kinetic energy. The warm rain run exhibits greater precipitation and increases the maximum wind and the depth of the central pressure more rapidly than the control run. This indicates that ice phase processes interfere with the organization of tropical cyclones, a mechanism known as Conditional Instability of the Second Kind (CISK). As far as the maximum wind and pressure depth are concerned, the control run catches up with the warm rain run at about day 3 of integration and reaches the same levels. However, the kinetic energy of the warm rain run is still increasing at day 5 of integration and becomes much larger than that of the control run. The warm rain run continues to extend its strong wind area.
Figure 3 presents the horizontal distributions of vertically integrated total water condensates for the control and warm-rain experiments
after 5 days of integration. Although the two runs have circular eyewall clouds
of the total water condensate maximum, their radii differ significantly. The ra-
dius of the eyewall is about 30km for the control run, and about 60km for the
warm-rain run. Figure 4 presents radius-height distributions of azimuthally
averaged tangential wind and water condensates. In the control run, the total
water content has double peaks near the ground and above the melting level,
where the latter is due to the slow vertical velocity of snow. The low-level
tangential wind is maximal at around 30km in the control run and around
50km in the warm rain run. Thus, ice phase processes considerably shrink the tropical cyclone in the horizontal dimension, even at their equilibrium states. The impact on the radius is very important for actual typhoon track
predictions, because the typhoon tracks are very sensitive to their simulated
sizes (Iwasaki et al., 1987 [2]).
Figure 5 plots the vertical profile of 12-hourly and azimuthally averaged
diabatic heating for a mature tropical cyclone at day 5 of integration. In the
control run with ice phase processes, the total diabatic heating rates fall into three categories. Figures 5a to c present the sum of condensational heating and evaporative cooling rates (cnd+evp), the sum of freezing heating and melting cooling rates (frz+mlt), and the sum of deposition heating and sublimation cooling rates (dep+slm), respectively. Strong updrafts are induced near the eyewall by condensation heating below the melting layer and by depositional heating in the upper troposphere. In turn, the updrafts induce large condensation heating and depositional heating (see Figs. 5a, c). Also, the updrafts induce small freezing heating above the melting layer
Fig. 2. Impacts of ice phase processes on the time evolution of an idealized trop-
ical cyclone. The panels depict (a) minimum sea-level pressure, (b) maximum az-
imuthally averaged tangential wind, (c) area-averaged precipitation rate within a
radius of 200km from the center of the TC, and (d) area-averaged kinetic energy
within a radius of 300km from the center of the TC; the solid and dashed lines
indicate the control experiment including ice phase processes and the warm-rain
experiment, respectively. After Sawada and Iwasaki (2007) [6].
(Fig. 5b). Melting and sublimation cooling spread below and above the melt-
ing layer (4-7 km), respectively (Figs. 5b, c). Graupel cools the atmosphere about four times more strongly than snow near the melting layer. Figure 5d illustrates cross sections of the sum of all the above diabatic heating terms related to phase transitions of water. Significant cooling occurs outside the eyewall near the melting layer, which reduces the size of the tropical cyclone. Figure 5e shows the diabatic heating of the warm rain run, which consists only of condensation and evaporation. In the lower troposphere, there is evaporative cooling from rain drops. Comparing Figs. 5d and 5e, we see that ice phase processes produce a radial differential heating in the middle troposphere, which reduces the typhoon size. The detailed mechanisms are discussed in Sawada and Iwasaki
(2007) [6].
Fig. 4. Radial-height cross sections of tangential wind speed with a contour interval of 10 ms−1 and water condensates in colours (g/kg) at the mature stage (T=108-120 h) in the control (upper panel) and in the warm rain (lower panel) runs.
The cloud-resolving simulation indicates that the ice phase processes de-
lay the organization of a tropical cyclone and reduce its size. This can hardly be represented in coarse-resolution models with deep cumulus parameterization schemes.
Fig. 5. Radius-height cross sections of diabatic heating for the mature tropical cyclone (T=108-120 h) in the (a)-(d) control and (e) warm-rain experiments. (a), (b), (c), (d) and (e) show the sum of condensational heating and evaporative cooling rates (cnd+evp), the sum of freezing heating and melting cooling rates (frz+mlt), the sum of deposition heating and sublimation cooling rates (dep+slm), and the total diabatic heating rates due to phase change (total), respectively. Contour values are -10, -5, -2, -1, 5, 10, 20, 30, 40, 50, 60, 70, 80 K/h. Shaded areas indicate regions of less than -1 K/h. The dashed line denotes the melting layer (T=0°C). After Sawada and Iwasaki (2007) [6].
References
1. Emanuel, K. A.: The dependence of hurricane intensity on climate. Nature, 326,
483–485 (1987)
2. Iwasaki, T., Nakano, H., Sugi, H.: The performance of a typhoon track predic-
tion model with cumulus parameterization. J. Meteor. Soc. Japan, 65, 555–570
(1987)
1 Introduction
Japanese and French oceanographers have built close collaborations over many years, but the arrival of the Earth Simulator (a highly parallel vector supercomputer system with 5120 vector processors and 40 Teraflops of peak performance) reinforced and sped up this cooperation. The achievement of this exceptional computer motivated the creation of a 4-year (2001-2005) postdoc position for a French researcher in Japan, followed by two new postdoc positions since 2006. This active Franco-Japanese collaboration has already led to 16 publications. In addition, the signature of a Memorandum of Understanding (MoU) between the Earth Simulator Center, the French National Scientific Research Center (CNRS) and the French Research Institute for Exploitation of the Sea (IFREMER) formalizes this scientific collaboration and guarantees access to the ES until the end of 2008.
Within this framework, four French teams are currently using the ES to explore a common interest: What is the impact of small-scale phenomena on large-scale ocean and climate modeling?
Figure 1 illustrates the large variety of scales that are, for example, observed in the ocean. The left panel presents the Gulf Stream as a narrow band looking like a river within the Atlantic Ocean. The middle panel reveals that the Gulf Stream path is in fact characterized by numerous clockwise and anticlockwise eddies. When looking even closer, we observe that these eddies are made of filaments delimiting waters with different physical characteristics (right panel). Today, because of computational cost, the very large majority of climate simulations (for example most IPCC experiments) "see" the world as shown in the left panel. Smaller phenomena are then parameterized or sometimes even ignored. The computing power of the ES offers us the unique opportunity to explicitly take into account some of these small-scale phenomena and quantify their impact on the large-scale climate. This project represents
Fig. 1. Satellite observation of the sea surface temperature (left panel, Gulf Stream
visible in blue), the sea level anomalies (middle panel, revealing eddies along the
Gulf Stream path) and chlorophyll concentration (right panel, highlighting filaments within eddies).
a technical and scientific challenge. The size of the simulated problem is 100 to
1500 times bigger than existing work. New small-scale mechanisms with po-
tential impacts on large-scale circulation and climate dynamics are explored.
In the end, this work will help us to progress in our understanding of the climate and thus improve the parameterizations used, for example, in climate change simulations.
The next four sections give a brief overview of the technical and scientific
aspects of the four parts of the MoU project that started at the end of 2004.
In Europe the OPA9 application is being investigated and improved in the Teraflop Workbench project at the Höchstleistungsrechenzentrum Stuttgart (HLRS), University of Stuttgart, Germany. The last section describes the work that has been done and the performance improvement achieved for OPA9 using a dataset provided by the project partner, the Institut für Meereskunde (IFM-GEOMAR) of the University of Kiel, Germany.
The goal of this study is to explore the impact of very small-scale phenomena
(order of 1 km) on vertical and horizontal mixing of the upper part of the
ocean that plays a key role in air-sea exchange and thus climate regulation.
These phenomena are created by the interactions between mesoscale eddies
(50-100 km) that are, for example, observed in the Antarctic Circumpolar
Current known for its very high eddy activity. In this process study, we there-
fore selected this region and modeled this circular current by a simple periodic
canal. A set of 3-year-long experiments is performed with the horizontal and vertical resolution increased step by step up to 1 km × 1 km × 300 levels
(or 3000×2000×300 points). Major results show that, first, the increase of
resolution is associated with an explosion on the number of eddies. Second,
when reaching the highest resolutions, these eddies are encircled with very
energetic filaments where very high vertical velocities (upward or downward
according to the rotating direction of the eddy) are observed, see Fig. 2. Our
first results indicate that being able to explicitly represent these very fine
filaments warms the upper part of the ocean in areas such as the Antarctic
Circumpolar Current. This could therefore have a significant impact on the
earth climate that is at first driven by the heat redistribution from equatorial
regions toward the higher latitudes. The biggest simulations we perform in
this study use 512 vector processors (or about 10% of the ES). Future exper-
iments should reach 1024 processors. Performance in production mode is 1.6 Teraflops, corresponding to 40% of the peak performance, which is excellent.
The capacity of the ocean to pump or reject CO2 is a key point to quantify
the climatic response to the atmospheric CO2 increase. Through biochemistry processes involving the life and death of phyto- and zooplankton, the oceans reject CO2 to the atmosphere in the equatorial regions where upwelling is observed, whereas at mid and high latitudes the oceans pump and store CO2. It is thus essential to explore the processes that keep the balance between oceanic CO2 rejection and pumping and to understand how this equilibrium could be affected in the global warming context: will the ocean be able to store more CO2 or not? The efficiency of CO2 pumping at mid latitudes is linked to the ocean dynamics and particularly to the meso-scale eddies. This second study thus aims to explore the impacts of small-scale ocean dynamics on biochemistry activity with a focus on the carbon cycle and fluxes at the air-sea interface. Our area of interest is this time much larger and covers the western part of the North Atlantic (see Fig. 3), which is one of the most important regions for oceanic CO2 absorption. Our ocean model is also coupled to a biochemistry model
Fig. 3. Location of the model domain. Oceanic vertical velocity at 100 m for
different model resolutions. Blue/red denotes upward/downward currents.
Fig. 4. Schematic representation of the global conveyor belt (left), the piling up of
equatorial jets (middle) and their potential impact on the vertical mixing (acting
like radiator blades, right)
drives these jets and the differences in their characteristics between the Atlantic and the Pacific. Further studies are now ongoing to explore their impact on the global ocean. As for the first part, these simulations reach almost 40% of the ES peak performance.
will result in calculations and select those array limits to work on. As the ice changes, the limits are adjusted in each iteration, and to be sure to handle growth of the ice, the band where the ice is calculated is set larger than where ice has been detected. The scanning of the ice arrays increases the runtime, but for the tested domain decompositions the amount of ice on a CPU was always less than 50%, or even less than 10% for some. This reduction of data reduces the runtime so much that the small increase for scanning the data is negligible.
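The band-scanning idea can be sketched as follows (Python/NumPy pseudocode rather than the actual Fortran sea-ice rheology routine; the array name ice_mask and the two-row safety margin are illustrative assumptions):

    import numpy as np

    def active_ice_band(ice_mask, margin=2):
        """Return the first/last row indices that need to be computed.

        ice_mask : 2-D boolean array, True where sea ice is present on this CPU's
                   subdomain. The returned band is enlarged by 'margin' rows so that
                   ice growing outward in the next iterations is still covered.
        """
        rows_with_ice = np.flatnonzero(ice_mask.any(axis=1))
        if rows_with_ice.size == 0:
            return None                       # nothing to do on this CPU
        j0 = max(rows_with_ice[0] - margin, 0)
        j1 = min(rows_with_ice[-1] + margin, ice_mask.shape[0] - 1)
        return j0, j1

    # Inside the rheology iteration, only rows j0..j1 enter the relaxation loops,
    # so a CPU holding little ice does far less work than a full-domain sweep.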
The second step taken was to merge the many loops, especially inside the relaxation iteration part where most of the time was spent. At the end there was one major loop in the relaxation, after the scanning of the bands, followed by the convergence test. By merging the loops, the accesses to the input arrays u_ice and v_ice were limited, and several temporary arrays could be replaced by temporary variables. In the original version the different arrays were addressed multiple times in more complex structures; by using temporary variables for these kinds of calculations the compiler could do a better job of scheduling the calculations.
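The effect of the loop merging can be illustrated schematically (Python is used here only to show the structure; the real code is Fortran, and the rheology expressions are placeholders):

    # Before the tuning: several separate loops, each reading u_ice/v_ice again and
    # writing a full temporary array that the next loop reads back.
    # After the tuning: one merged loop over the active band with scalar temporaries,
    # so each point is loaded once and the compiler can schedule the arithmetic freely.
    def fused_relaxation_pass(u_ice, v_ice, j0, j1):
        """One fused sweep over the active ice band (rows j0..j1)."""
        residual = 0.0
        for j in range(j0, j1 + 1):
            for i in range(u_ice.shape[1]):
                u = u_ice[j, i]            # scalar temporaries replace several
                v = v_ice[j, i]            # intermediate full-size work arrays
                strain = 0.5 * (u + v)     # placeholder for the real rheology terms
                stress = 0.9 * strain      # placeholder
                residual = max(residual, abs(stress - strain))
        return residual                    # value fed into the convergence test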
The MPI border exchange went through several steps of improvement. The original code consisted of five steps and two branches, depending on whether the northern area of the globe was covered by one MPI thread or by several. The first step was to exchange the border nodes of the arrays with the east-west neighbors. The second was a north-south exchange. The third step treats the four northernmost lines of the dataset. If this array part is owned by a single MPI thread, the calculation is made in the subroutine; if it is distributed over several MPI threads, an MPI_Gather collects all the data on the rank-zero MPI thread of the northern communicator, where a subroutine is called to do the same kind of transformation that was made in-line in the single-thread case. After that the data is distributed with MPI_Scatter to the waiting MPI threads. The fourth step was another east-west communication to make sure that the top four lines were properly exchanged after the northern communication. All these exchanges were configurable to use MPI_Send, MPI_Isend and MPI_Bsend with different versions of MPI_Recv. The MPI_Recv calls were using MPI_ANY_SOURCE.
The first step taken was to combine the two different branches treating the northern part. They were basically the same and only had small differences in variable naming. By merging this part into a subroutine, it was possible to just arrange the input variables differently depending on whether one or more MPI threads were responsible for the northern part. The communication pattern was also changed (see Fig. 5) to use MPI_Allgather, so that each MPI thread in the northern communicator has access to the data and each calculates the region. By doing this, the MPI_Scatter with its second synchronization point
Fig. 5. The new communication scheme contains fewer synchronization points and less communication than the old version.
can be avoided, as well as the last east-west exchange, since this data is already available on all MPI threads.
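The reworked northern exchange can be sketched with mpi4py as follows (a schematic of the communication pattern only, not the OPA9 implementation; the communicator setup, the array shapes and the fold operation are illustrative assumptions):

    from mpi4py import MPI
    import numpy as np

    def north_fold_exchange(north_comm, my_top_rows, fold_rows):
        """Exchange the four northernmost rows among the northern ranks.

        north_comm  : communicator containing only the MPI tasks that own part
                      of the northern boundary
        my_top_rows : this task's portion of the top rows, shape (4, nx_local)
        fold_rows   : function applying the north-fold transformation locally
        """
        nprocs = north_comm.Get_size()
        gathered = np.empty((nprocs,) + my_top_rows.shape, dtype=my_top_rows.dtype)

        # One Allgather replaces the old Gather + compute-on-rank-0 + Scatter:
        # every rank receives all pieces and applies the fold itself, so the
        # second synchronization point and the extra east-west pass disappear.
        north_comm.Allgather(my_top_rows, gathered)

        full_rows = np.concatenate(gathered, axis=1)   # assemble the full top band
        return fold_rows(full_rows)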
This is an important optimization, as the border exchange is used by many routines, for example the solver of the transport divergence system, which uses a diagonally preconditioned conjugate gradient method. One border exchange per iteration is made in the solver.
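For orientation, a diagonally preconditioned conjugate gradient loop with one border exchange per iteration has roughly the following shape (a generic sketch, not the OPA9 solver; apply_operator and exchange_halo stand in for the transport-divergence operator and the border exchange described above, and in the MPI code the dot products additionally require a global reduction):

    import numpy as np

    def diag_pcg(apply_operator, exchange_halo, b, diag, x0, tol=1e-8, max_iter=500):
        """Solve A x = b with Jacobi (diagonal) preconditioning on a local subdomain."""
        x = x0.copy()
        exchange_halo(x)
        r = b - apply_operator(x)
        z = r / diag                         # diagonal preconditioner
        p = z.copy()
        rz = np.vdot(r, z)
        for _ in range(max_iter):
            exchange_halo(p)                 # the one border exchange per iteration
            Ap = apply_operator(p)
            alpha = rz / np.vdot(p, Ap)      # in MPI, vdot would need an Allreduce
            x += alpha * p
            r -= alpha * Ap
            if np.linalg.norm(r) < tol:
                break
            z = r / diag
            rz_new = np.vdot(r, z)
            p = z + (rz_new / rz) * p
            rz = rz_new
        return x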
The selection of exchange modes was previously made with CASE statements and CHARACTER constants; this was changed to IF statements with INTEGER constants to improve constant folding by the compiler and to facilitate inlining.
Several routines were reworked to improve memory access and loop layouts. Simple tunings include precalculating masks that remain constant during the simulation (like the land/water mask) and computing minima/maxima without storing intermediate results in temporary arrays, to reduce memory accesses.
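The min/max tuning amounts to keeping running extrema in a single pass instead of materializing intermediate arrays, as in the following illustrative sketch (Python; the field and the constant land/sea mask are synthetic stand-ins):

    import numpy as np

    field = np.random.rand(256, 256)              # stand-in for a 2-D model field
    sea_mask = np.random.rand(256, 256) > 0.3     # stand-in for the land/sea mask,
                                                  # precalculated once at initialization

    # Original style: build a full-size temporary array first, then reduce it.
    tmp = np.where(sea_mask, field, np.inf)
    vmin_old = tmp.min()

    # Tuned style: one pass keeping only running extrema, no temporary array.
    vmin_new, vmax_new = np.inf, -np.inf
    for value, wet in zip(field.ravel(), sea_mask.ravel()):
        if wet:
            vmin_new = min(vmin_new, value)
            vmax_new = max(vmax_new, value)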
6.4 Results
The original version that was used as a baseline was the 1/4 degree model
(1442×1021×46 points) that was delivered from IFM GEOMAR in Kiel in
March 2006. Some issues with reading the input deck of such high-resolution models were fixed and a baseline run was made with that version. It is referenced here as the original version. To relate it to the OPA production versions, it is named OPA 9.0, LOCEAN-IPSL (2005) in opa.F90, with the CVS tag $Header: /home/opalod/NEMOCVSROOT/NEMO/OPA SRC/opa.F90,v 1.19 2005/12/28 09:25:03 opalod Exp $. The model settings used are the production settings from IFM GEOMAR in Kiel.
The OPA9 version called 5th TFW is the state of the OPA9 tunings before the 5th Teraflop Workshop at Tohoku University in Sendai, Japan (November 20th and 21st, 2006). It contains the sea-ice rheology tunings and the MPI tunings for 2D arrays. The results named 6th TFW are the results measured before the 6th Teraflop Workshop at the University of Stuttgart, Germany (March 26th and 27th, 2007).
All tunings together brought an improvement of 17.2% in runtime using 16 SX-8 CPUs and 14.1% in runtime using 64 SX-8 CPUs compared to the original version.
7 Conclusion
At its fifth anniversary, the ES remains a unique supercomputer with exceptional performance when considering real applications in the field of climate research. The MoU between the ESC, CNRS and IFREMER allowed four French teams to benefit from the ES computational power to investigate challenging and unexplored scientific questions in climate modeling. The size of the studied problems is at least two orders of magnitude larger than the usual simulations done on French computer resources. Accessing the ES allowed us to remain at the top of world climate research. However, we deplore that such computational facilities do not exist in Europe. We are afraid that within a few
Fig. 6. Performance and time plot - scaling results made with the OPA initial version, before the 5th Teraflop Workbench and before the 6th Teraflop Workbench - 1200 simulation cycles without initialization (time [s] versus number of CPUs, 16-64).
years European climate research will decline in comparison with the work done in the US or Japan.
The work in the Teraflop Workbench aims to enable this kind of research on new, larger models, making it possible to test the limits of the models and to improve the application.
TERAFLOP Computing and Ensemble
Climate Model Simulations
Henk A. Dijkstra
1 Introduction
The atmospheric concentrations of CO2, CH4 and other so-called greenhouse gases (GHG) have increased rapidly since the beginning of the industrial revolution, leading to an increase of radiative forcing of 2.4 W/m2 up to the year 2000 compared to pre-industrial times [1]. Simultaneously, long-term climate trends are observed everywhere on Earth. Among others, the global mean surface temperature has increased by 0.6 ± 0.2 °C over the 20th century, there has been a widespread retreat of non-polar glaciers, and patterns of pressure and precipitation have changed [2].
Although one may be tempted to attribute the climate trends directly to
changes in the radiative forcing, the causal chain is unfortunately not so easy.
The climate system displays strong internal climate variability on a number
of time scales. Hence, even when one would be able to keep the total radiative
forcing constant, substantial natural variability would still be observed. In
many cases, this variability expresses itself in certain patterns with names such
as the North Atlantic Oscillation, El Niño/Southern Oscillation, the Pacific
Decadal Oscillation and the Atlantic Multidecadal Oscillation. The relevant
time scales of variability of these patterns are in many cases comparable to the
trends mentioned above and the observational record is too short to accurately
establish their amplitude and phase.
To determine the causal chain between the increase in radiative forcing and
climate change observed, climate models are essential. Over the last decade,
climate models have grown in complexity at a fast pace due to increased detail
of description of the climate system and increased spatial resolution. One of
the standard numerical simulations with a climate model is the response to a
doubling of CO2 over a period of about 100 years. Current climate models predict an increase in the globally averaged surface temperature within the range from 1 °C to 4 °C [2].
lations are performed for a given emission scenario due to the high computa-
tional demand of a single simulation. This allows an assessment of the mean
climate change but, because of the strong internal variability, it is difficult
to attribute certain trends in model response to increased radiative forcing.
To distinguish the forced signal from internal variability, a large ensemble of
climate simulations is necessary.
The Dutch climate community, grouped into the Center for Climate Research (CKO), has played a relatively minor role in running climate models compared to the large climate research centers elsewhere (the Hadley Centre in the UK, DKRZ in Germany and centers in the USA such as NCAR and
GFDL). Since 2003, however, there have been two relatively large projects on
Teraflop machines where the CKO group (Utrecht University and the Royal
Dutch Meteorological Institute) has been able to perform relatively large en-
semble simulations with state-of-the-art climate models.
due to the increase of greenhouse gases which has its origin in precipitation
changes in the Tropical Pacific. In [5], it was shown that the El Niño/Southern
Oscillation does not change under an increase of greenhouse gases. Further
analysis of the data showed, however, that deficiencies in the climate model,
i.e. a small meridional scale of the equatorial wind response, were the cause of
this negative result. In [6] it is shown that fluctuations in ocean surface tem-
peratures lead to an increase in Sahel rainfall in response to anthropogenic
warming. In [7, 8], it is found that anthropogenically forced changes in the
thermohaline ocean circulation and its internal variability are distinguishable.
Forced changes are found at intermediate levels over the entire Atlantic basin.
The internal variability is confined to the North Atlantic region. In [9], the
changes in extremes in European daily mean temperatures were analyzed in
more detail. The ensemble model data were also used for development of a
new detection technique of climate change by [10].
The data have likely been used in several other publications, but we had no policy that manuscripts would be passed by the project leaders, while the data were open to everyone.
provide the correct acknowledgment of project funding and management. For
people involved in arranging funding for the next generation of supercomput-
ing systems this has been quite frustrating as output from the older systems
cannot be fully justified and it is eventually the resulting science which will
convince funding agencies.
During the project, there have been many interactions with the press and
several articles have appeared in the national newspapers. In addition, results
have been presented on several national and international meetings. On Oc-
tober 15, 2004, many of the results were presented at a dedicated meeting at
SARA and in the evening some of these results were highlighted on the na-
tional news. With SARA assistance, an animation of a superstorm has been
made within the CAVE at SARA. During the meeting on October 15, many
of the participants took the opportunity to see this animation. Finally, several
articles have been written for popular magazines (in Dutch).
ular climate model, i.e., neglecting model uncertainties, the expected global
warming of 4 K in 2100 is very robust.
Again the advantage of the relatively large size of the ensemble is the
large signal-to-noise ratio. We were able to determine the year in which the
forced signal (i.e., the trend) in several variables emerges from the noise. The
enhanced signal-to-noise ratio that is achieved by averaging over all ensemble
members is reflected in a large number of effective degrees-of-freedom, even for
short time periods, that enter the significance test. This makes the detection
of significant trends over short periods (10-20 years) possible.
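The detection logic behind Figure 2 can be illustrated with a small synthetic example (Python; the trend size, noise level, ensemble size and the two-sigma criterion are made-up stand-ins for the actual analysis of [15]):

    import numpy as np

    rng = np.random.default_rng(0)
    n_members, n_years = 17, 121                 # ensemble of 17 runs, 1980-2100 (assumed)
    years = np.arange(1980, 1980 + n_years)
    trend = 0.03 * (years - 1980)                # 3 K/century forced warming (made up)
    noise = rng.normal(0.0, 0.5, (n_members, n_years))   # internal variability (made up)
    ensemble = trend + noise

    # Averaging over members reduces the noise by sqrt(n_members), which is what
    # makes trends detectable even over short (10-20 year) windows.
    ens_mean = ensemble.mean(axis=0)
    noise_of_mean = ensemble.std(axis=0, ddof=1) / np.sqrt(n_members)

    # First year in which the ensemble-mean warming exceeds twice its standard error
    # (a crude stand-in for the 95%-significance trend test used in the paper).
    detected = np.flatnonzero(ens_mean > 2.0 * noise_of_mean)
    print("detection year:", years[detected[0]] if detected.size else "not detected")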
The earliest detection times (Figure 2) for the surface temperature are
found off the equator in the western parts of the tropical oceans and the
Middle East, where the signal emerges as early as around 2005 from the noise.
In these regions the variability is extremely low, while the trend is only modest.
A second region with such an early detection time is the Arctic, where the
trend is very large due to the decrease of the sea-ice. Over most of Eurasia
and Africa detection is possible before 2020. The longest detection times are
found along the equatorial Pacific, where, due to El Niño, the variability is
very high, as well as in the Southern Ocean and the North Atlantic, where
the trend is very low.
Having learned from the Dutch Computing Challenge project, we decided to publish a first paper [15] with a summary of the main results, and we now have a strict publication policy. The website of the project can be found at http://www.knmi.nl/~sterl/Essence/.
Fig. 2. Year in which the trend (measured from 1980 onwards) of the annual-mean
surface temperature emerges from the weather noise at the 95%-significance level,
from [15].
Acknowledgments
The Dutch Computing Challenge project was funded through NCF (Nether-
lands National Computing Facilities foundation) project SG-122. We thank
the DEISA Consortium (co-funded by the EU, FP6 projects 508830/031513),
for support within the DEISA Extreme Computing Initiative (www.deisa.org).
NCF contributed to ESSENCE through NCF projects NRG-2006.06, CAVE-
06-023 and SG-06-267. We thank HLRS and SARA staff, especially Wim Rijks
and Thomas Bönisch, for their excellent technical support. The Max-Planck-
Institute for Meteorology in Hamburg (http://www.mpimet.mpg.de/) made
available their climate model ECHAM5/MPI-OM and provided valuable ad-
vice on implementation and use of the model. We are especially indebted to
Monika Esch and Helmuth Haak.
References
1. Houghton, J.T., Ding, Y., Griggs, D., Noguer, M., van der Linden, P.J., Xiaosu,
D., eds.: Climate Change 2001: The Scientific Basis. Contribution of Working
Group I to the Third Assessment Report of the Intergovernmental Panel on
Climate Change (IPCC), Cambridge University Press, UK (2001)
2. IPCC: Summary for Policymakers and Technical Summary. IPCC Fourth Assessment Report (2007)
3. Selten, F., Kliphuis, M., Dijkstra, H.A.: Transient coupled ensemble climate
simulations to study changes in the probability of extreme events. CLIVAR
Exchanges 28 (2003) 11–13
4. Selten, F.M., Branstator, G.W., Dijkstra, H.A., Kliphuis, M.: Tropical origins
for recent and future Northern Hemisphere climate change. Geophysical Re-
search Letters 31 (2004) L21205
5. Zelle, H., van Oldenborgh, G.J., Burgers, G., Dijkstra, H.A.: El Niño and Green-
house warming: Results from Ensemble Simulations with the NCAR CCSM. J.
Climate 18 (2005) 4669–4683
6. Haarsma, R.J., Selten, F.M., Weber, S.L., Kliphuis, M.: Sahel rainfall variability
and response to greenhouse warming. Geophysical Research Letters 32 (2005)
L17702
7. Drijfhout, S.S., Hazeleger, W.: Changes in MOC and gyre-induced Atlantic
Ocean heat transport. Geophysical Research Letters 33 (2006) L07707
8. Drijfhout, S.S., Hazeleger, W.: Detecting Atlantic MOC changes in an ensemble
of climate change simulations. J. Climate in press (2007)
9. Tank, A.M.G.K., Können, G.P., Selten, F.M.: Signals of anthropogenic influ-
ence on European warming as seen in the trend patterns of daily temperature
variance. International Journal of Climatology 25 (2005) 1–16
10. Stone, D.A., Allen, M., Selten, F., Kliphuis, M., Stott, P.: The detection and
attribution of climate change using an ensemble of opportunity. J. Climate 20
(2007) 504–516
11. Roeckner, E., Bäuml, G., Bonaventura, L., Brokopf, R., Esch, M., Giorgetta,
M., Hagemann, S., Kirchner, I., Kornblueh, L., Manzini, E., Rhodin, A., Schlese,
U., Schulzweida, U., Tompkins, A.: The atmospheric general circulation model ECHAM5. Part I: Model description. Technical Report No. 349, Max-
Planck-Institut für Meteorologie, Hamburg, Germany (2003)
12. Marsland, S.J., Haak, H., Jungclaus, J., Latif, M., Röske, F.: The Max-Planck-
Institute global ocean/sea ice model with orthogonal curvilinear coordinates.
Ocean Modelling 5 (2003) 91–127
13. Nakicenovic, N., Swart, R., eds.: Special Report on Emissions Scenarios: A
Special Report of Working Group III of the Intergovernmental Panel on Climate
Change, Cambridge University Press, Cambridge, U.K (2000)
14. Brohan, P., Kennedy, J., Harris, I., Tett, S., Jones, P.: Uncertainty estimates in
regional and global observed temperature changes: A new data set from 1850.
Journal of Geophysical Research 111 (2006) D12106
15. Sterl, A., Severijns, C., van Oldenborgh, G.J., Dijkstra, H.A., Hazeleger, W., van den Broeke, M., Burgers, G., van den Hurk, B., van Leeuwen, P., van Velthoven, P.: The ESSENCE project - signal to noise ratio in climate projections. Geophys. Res. Lett. (2007) submitted
Current Capability of Unstructured-Grid CFD
and a Consideration for the Next Step
Kazuhiro Nakahashi
1 Introduction
Impressive progress in computational fluid dynamics (CFD) has been made during the last three decades. CFD has become an indispensable tool for analyzing and designing aircraft. Wind tunnel testing, however, is still the central player in aircraft development, and CFD plays a subordinate part.
In this article, the current capability of CFD is discussed, and the demands on next-generation CFD are described with an eye on the PetaFlops computers expected in the near future. The Cartesian grid approach is then discussed as a promising candidate for next-generation CFD by comparing it with current unstructured-grid CFD. It is concluded that the simplicity of the algorithms of Cartesian mesh CFD, from grid generation to post-processing, will be a big advantage in the days of PetaFlops computers.
order schemes on unstructured grids is not easy. Post-processing of the huge data output may also become another bottleneck due to the irregularity of the data structure.
Recently, Cartesian grid methods have attracted renewed attention in the CFD community because of several advantages, such as rapid grid generation, easy extension to higher order, and a simple data structure that eases post-processing. This is another candidate for next-generation CFD.
Let us compare the computational cost of uniform Cartesian grid methods with that of tetrahedral unstructured grids. The most time-consuming part of compressible flow simulations is the computation of the numerical fluxes. In a cell-vertex finite-volume method, the number of flux computations is proportional to the number of edges in the grid. A tetrahedral grid contains at least twice as many edges as a Cartesian grid with the same number of node points. Therefore, the computational cost on unstructured grids is at least twice that on Cartesian grids. Moreover, the computation of linear reconstructions, limiter functions, and implicit time integrations on tetrahedral grids easily doubles the total computational cost.
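To make this edge-count argument concrete, the following sketch (in Python, not from the original article) estimates the number of flux evaluations per time step from the number of edges per node; the values of roughly 3 edges per node for a 3-D Cartesian grid and about 7 for a typical tetrahedral grid are assumptions of this sketch, consistent with the factor of at least two quoted above.

    def flux_evaluations_per_step(n_nodes, edges_per_node):
        # Edge-based, cell-vertex finite-volume method: one numerical flux
        # is evaluated per edge and per time step.
        return n_nodes * edges_per_node

    n = 1_000_000
    cartesian = flux_evaluations_per_step(n, 3)    # ~3 edges per node on a 3-D Cartesian grid
    tetrahedral = flux_evaluations_per_step(n, 7)  # ~7 edges per node assumed for a tetrahedral grid
    print(tetrahedral / cartesian)                 # roughly 2.3, i.e. at least twice the flux cost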
For higher-order spatial accuracy, the difference in computational cost between the two approaches grows rapidly. On Cartesian grids, the spatial accuracy can easily be increased up to fourth order without extra computational cost. In contrast, increasing the spatial accuracy from second to third order on unstructured grids can easily increase the computational cost tenfold. Consequently, for the same computational cost and the same spatial accuracy of third order or higher, we can use 100 to 1000 times more grid points on a Cartesian grid than on an unstructured grid. The increase in grid points improves the accuracy of the geometrical representation as well as the spatial solution accuracy.
Although the above estimate is very rough, it is apparent that Cartesian grid CFD offers a big advantage for the high-resolution computations required for DNS.
5 Building-Cube Method
A drawback of a uniform Cartesian grid is the difficulty of changing the mesh size locally. This is critical especially for airfoil/wing computations, where an extremely large difference in characteristic flow lengths exists between the boundary-layer regions and the far field. Accurate representation of curved boundaries by Cartesian meshes is another issue.
A variant of the Cartesian grid method is to use adaptive mesh refinement [7] in space and cut cells or the immersed boundary method [8] at the wall boundaries. However, the introduction of irregular subdivisions and cells into Cartesian grids complicates the algorithms for higher-order schemes. The advantages of the Cartesian mesh over the unstructured grid, such as simplicity and lower memory requirements, disappear.
Fig. 5. Cube frames around RAE2822 airfoil (left) and an enlarged view of Cartesian
grid near tripping wire (right).
6 Conclusion
References
1. Jameson, A. and Baker, T.J.: Improvements to the Aircraft Euler Method. AIAA Paper 1987-452 (1987).
2. Nakahashi, K., Ito, Y., and Togashi, F.: Some challenges of realistic flow simu-
lations by unstructured grid CFD. Int. J. for Numerical Methods in Fluids, 43,
769–783 (2003).
3. Pambagjo, T.E., Nakahashi, K., Matsushima, K.: An Alternate Configuration for
Regional Transport Airplane. Transactions of the Japan Society for Aeronautical
and Space Sciences, 45, 148, 94–101 (2002).
4. Yamazaki, W., Matsushima, K. and Nakahashi, K.: Drag Reduction of a Near-
Sonic Airplane by using Computational Fluid Dynamics. AIAA J., 43, 9, 1870–
1877 (2005).
5. Koc, S., Kim, H.J. and Nakahashi, K.: Aerodynamic Design of Wing-Body-
Nacelle-Pylon Configuration. AIAA-2005-4856, 17th AIAA CFD Conf. (2005).
6. Top500 Supercomputers Sites, http://www.top500.org/.
7. Berger, M. and Oliger, J.: Adaptive Mesh Refinement for Hyperbolic Partial
Differential Equations. J. Comp. Physics, 53, 561–568 (1984).
8. Mittal, R. and Iaccarino, G.: Immersed Boundary Methods. Annual Review of
Fluid Mechanics, 37, 239–261 (2005).
9. Nakahashi, K.: Building-Cube Method for Flow Problems with Broadband
Characteristic Length. Computational Fluid Dynamics 2002, edited by S. Arm-
field et al., Springer, 77–81 (2002).
10. Meakin, R.L. and Wissink, A.M.: Unsteady Aerodynamic Simulation of Static
and Moving Bodies Using Scalable Computers. AIAA-99-3302, Proc. AIAA 14th
CFD Conf. (1999).
11. Nakahashi, K.: High-Density Mesh Flow Computations with Pre-/Post-Data
Compressions. AIAA 2005-4876, Proc. AIAA 17th CFD Conf. (2005).
Smart Suction – an Advanced Concept for Laminar Flow Control of Three-Dimensional Boundary Layers
Ralf Messing and Markus Kloker
1 Introduction
The list of reasons for a sustained reduction of commercial-aircraft fuel consumption is getting longer every day: the significant environmental impact of the strongly growing world-wide air traffic, planned taxes on kerosene and on the emission of greenhouse gases, and the lasting rise in crude-oil prices. As fuel consumption during cruise is mainly determined by viscous drag, its reduction offers the greatest potential for fuel savings. One promising candidate for reducing the viscous drag of a commercial aircraft is laminar flow control (LFC) by boundary-layer suction on the wings, tailplanes, and nacelles, with a fuel-saving potential of 16%. (The other candidate is the management of turbulent flow, e.g., on the fuselage of the aircraft, by a kind of shark-skin surface structure, which however has a much lower saving potential.)
Suction has been known for decades to delay the onset of the drag-increasing turbulent state of the boundary layer by significantly enhancing its laminar stability and thus pushing laminar-turbulent transition downstream. However, in the case of swept aerodynamic surfaces, boundary-layer suction is not as straightforward and efficient as desired, due to a crosswise flow component inherent in the three-dimensional boundary layer. Suction here aims primarily at reducing this crossflow, and not, as on unswept wings, at
Recently, a new strategy for laminar flow control has been proposed and demonstrated experimentally [Sar98] and numerically [Was02]. At a single chordwise location, artificial roughness elements are placed laterally and ordered such that they excite relatively closely spaced, benign crossflow vortices that suppress the nocent ones by a nonlinear mechanism and do not trigger fast secondary instability. If the streamwise variation of the flow conditions and stability characteristics is weak, this approach has proven to delay transition impressively. A better understanding of the physical background of the effectiveness of this approach has been provided by direct numerical simulations [Was02], whose authors coined the term Upstream Flow Deformation (UFD).
[Figure: chordwise development of the base-flow quantities Reδ1, H12, uB,e, ϕe and ws,B (top), and the region of unstable spanwise wave numbers γ for β = 0 (bottom), plotted versus x.]
method handles this issue and the difficulties that occur are presented in the next section.
3 Smart Suction
The main idea of our approach is to combine two laminar-flow techniques that have already proven to delay transition in experiments and free-flight tests, namely boundary-layer suction and UFD. The suction orifices serve as the excitation source and are arranged such that benign, closely spaced UFD vortices are generated and maintained at a beneficial amplitude level. A streamwise variation of the flow conditions and stability characteristics can be taken into account by adapting the spacing of the suction orifices continuously or in discrete steps. In this way we overcome the shortcomings of the single excitation of UFD vortices. We note, however, that this is not at all a trivial task: it is not clear a priori which direction the vortices follow (the flow direction depends on the wall-normal distance), and improper excitation can lead to destructive nonlinear interaction with benign vortices from upstream, or with nocent vortices. For an illustration of a case where the adaptation to the chordwise varying flow properties has been done in an improper way, see Fig. 4. The left side of the figure shows a visualisation of the vortices in the flow field, and the right side the location of the suction orifices at the wall, in this case spanwise slots. At about one-quarter of the domain
Fig. 4. Visualisation of vortices (left) and suction orifices/slots at the wall (right)
in perspective view of the wing surface for smart suction with improper adaptation
of suction orifices. Flow from bottom to top, breakdown to turbulence at the end of
domain. In crosswise (spanwise) direction about 1.25λz is shown.
the spanwise spacing of the slots is increased in a discrete step from four slots per fundamental spanwise wavelength λz, corresponding to a spanwise wave number γ = 1.6 (refer to Fig. 3), to three slots per λz, corresponding to a spanwise wave number γ = 1.2, in order to adapt to the changing region of unstable wave numbers. In this case the adaptation fails, as transition to turbulent flow takes place at the end of the considered domain. A more detailed analysis reveals that nonlinear interactions between the two excited UFD vortices lead to conditions triggering secondary instability in the vicinity of the slot-spacing adaptation.
Fig. 5. As for Fig. 4 but with proper adaptation of suction orifices for a simple case.
conventional suction system have been raised. The reason is that the vortices generate, by nonlinear mechanisms, a mean-flow distortion not unlike suction, cf. [Was02], influencing the stability in a manner as favourable as suction itself. The new method is termed smart suction, as the instability of the laminar flow is exploited to enhance stability rather than increasing the suction rate.
References
[Sar98] Saric, W.S., Carrillo Jr., R.B. & Reibert, M.S. 1998 Leading-Edge Roughness as a Transition Control Mechanism. AIAA-Paper 98-0781.
[Mes05] Messing, R. & Kloker, M. 2005 DNS study of spatial discrete suction
for Laminar Flow Control. In: High Performance Computing in Science
and Engineering 2004 (ed. E. Krause, W. Jäger, M. Resch), 163–175,
Springer.
[Mar06] Marxen, O. & Rist, U. 2006 DNS of non-linear transitional stages in
an experimentally investigated laminar separation bubble. In: High Per-
formance Computing in Science and Engineering 2005 (ed. W.E. Nagel,
W. Jäger, M. Resch), 103–117, Springer.
[Schra05] Schrauf, G. 2005 Status and perspectives of laminar flow. The Aeronau-
tical Journal (RAeS), 109, no. 1102, 639–644.
[Was02] Wassermann, P. & Kloker, M. J. 2002 Mechanisms and passive control
of crossflow-vortex induced transition in a three-dimensional boundary
layer. J. Fluid Mech., 456, 49–84.
Supercomputing of Flows with Complex
Physics and the Future Progress
Satoru Yamamoto
1 Introduction
The current progress of the computational fluid dynamics (CFD) research conducted in our laboratory (Mathematical Modeling and Computation) is presented in this article. In our laboratory, three main projects are running. The first one is the Numerical Turbine (NT) project. A parallel computational code which can simulate two- and three-dimensional multistage stator-rotor cascade flows in gas and steam turbines is being developed in this project, together with pre- and post-processing software. This code can calculate not only air flows but also flows of moist air and of wet steam. The second one is the Supercritical-Fluids Simulator (SFS) project. A computational code for simulating flows of arbitrary substances in arbitrary conditions, such as gas, liquid, and supercritical states, is being developed. The third one is a project for building a custom computing machine optimized for CFD. A systolic computational-memory architecture for high-performance CFD solvers is designed on an FPGA board. In this article, the NT and SFS projects are briefly introduced and some typical calculated results are shown as visualized figures.
diminishing (TVD) limiter and robust implicit schemes are employed for accurate and robust calculations.
The computational code for the NT project is developed from the code for condensate flows. Condensation occurs in fan rotors or compressors if moist air streams through them. Wet-steam flows also occasionally condense in steam turbines. The phase change from water vapor to liquid water is governed by homogeneous nucleation and the nonequilibrium process of condensation. The latent heat of the water vapor is released to the surrounding non-condensed gas when the phase change occurs, increasing temperature and pressure. This non-adiabatic effect induces a nonlinear phenomenon, the so-called "condensation shock". Finally, the condensed water vapor affects the performance.
We developed two- and three-dimensional codes for transonic condensate flows assuming homogeneous and heterogeneous nucleation. For example, 3-D flows of moist air around the ONERA M6 wing [3] and condensate flows over a 3-D delta wing in atmospheric flight conditions [4] have already been studied numerically by our group. Figure 1 shows typical calculated contours of the condensate mass fraction over the delta wing at a uniform Mach number of 0.6. This figure indicates that condensation occurs in a streamwise vortex, that is, in the so-called "vapor trail". These codes were also applied to transonic flows of wet steam through steam-turbine cascade channels [5]. Figure 2 shows the calcu-
where
The computational cost is reduced considerably by employing the algebraic approximation. However, the calculation must proceed sequentially, because the calculation at a grid point depends on those at the neighboring points within the same time step. Therefore, the computational algorithm of the LU-SGS scheme is not well suited for parallel computation. The LU-SGS routine occupied 39.4% of the total CPU time per time iteration when a wet-steam flow through a single 3-D channel was calculated on a single CPU. In the NT code, flows through multistage stator-rotor cascade channels, as shown in Fig. 3, are calculated simultaneously. Since each channel can be calculated separately within each time iteration, parallel computation using MPI is preferable. Also, the 3-D LU-SGS algorithm may be parallelized on
Fig. 4. Left: A Hyper-plane divided to two threads. Right: Pipelined LU-SGS al-
gorithm
the so-called "hyper-plane" in each channel (Fig. 4). We applied the pipeline algorithm [10] to the hyper-plane, assisted by OpenMP directives. The hyper-plane can then be divided among multiple threads.
Here, the pipeline algorithm applied to two threads is considered (Fig. 5). The algorithm is explained simply using the 2-D case (Fig. 6). The calculation of the data on the hyper-plane then depends on the grid points to the left and below. In this example, the data are divided into two blocks. The lower block is calculated by CPU1 and the upper block by CPU2. CPU1 starts from the lower-left corner and calculates the data of the first grid column. Then CPU2 starts the calculation of the upper block from its lower-left corner, using the boundary data of the first column of the lower block. Thereafter, CPU1 and CPU2 synchronously advance their calculations column by column towards the right. The number of CPUs can be increased easily. As the number of threads is increased to 2, 4, 8, and 16, the speed-up ratio increases. However, the ratio does not improve further when the number of threads is increased to 32 and 64. Consequently, 4 CPUs may be the most effective and economical number for the pipelined LU-SGS calculation [11]. In the NT code, the calculation of each passage through the turbine blades is accelerated with OpenMP, and the set of passages is parallelized using MPI.
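The dependency structure of this pipelined sweep can be modelled with the following minimal Python sketch (not the actual NT code): the value at a grid point depends on the points to its left and below, the rows are split into blocks assigned to different threads, and a block may start a column only after the block below it has finished that column. The update formula is a stand-in for the LU-SGS operator, and the per-column events are only one possible way to express the synchronization.

    import threading
    import numpy as np

    def pipelined_sweep(phi, nthreads=2):
        # Forward sweep in which phi[i, j] depends on phi[i-1, j] (below) and
        # phi[i, j-1] (left), processed column by column in each row block.
        ni, nj = phi.shape
        bounds = np.linspace(0, ni, nthreads + 1, dtype=int)
        done = [[threading.Event() for _ in range(nj)] for _ in range(nthreads)]

        def worker(t):
            i0, i1 = bounds[t], bounds[t + 1]
            for j in range(nj):
                if t > 0:
                    done[t - 1][j].wait()          # boundary data of column j from the block below
                for i in range(i0, i1):
                    left = phi[i, j - 1] if j > 0 else 0.0
                    below = phi[i - 1, j] if i > 0 else 0.0
                    phi[i, j] = 0.5 * (left + below) + 1.0   # stand-in update, not LU-SGS
                done[t][j].set()

        threads = [threading.Thread(target=worker, args=(t,)) for t in range(nthreads)]
        for th in threads:
            th.start()
        for th in threads:
            th.join()
        return phi

    print(pipelined_sweep(np.zeros((8, 8)), nthreads=2))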
Figure 5 shows the typical calculated results for the 2D and 3D codes.
Contours of condensate mass fraction are visualized in both figures.
solvers for condensate flows. The preconditioning method enables the solvers to calculate both high-speed and very slow flows by using a preconditioning matrix, which automatically switches the Navier-Stokes equations from their compressible to their incompressible form when a very slow flow is calculated. A preconditioned flux-vector splitting scheme [14], which can even be applied to a static field (zero-velocity flow), has also been proposed. The SFS code employs this preconditioning to simulate very slow flows at Mach numbers far below 0.1.
Supercritical fluids appear if both the bulk pressure and the temperature rise beyond their critical values. It is known that some anomalous properties are observed, especially near the critical point. For example, the density, the thermal conductivity, and the viscosity change rapidly near the critical point. These values and their variations differ among substances. Therefore, all the thermophysical properties of each substance must be defined if supercritical fluids are to be calculated accurately. In the present SFS code, the thermophysical-property database PROPATH [15], developed by Kyushu University, has been employed and coupled with the preconditioning method. Flows of an arbitrary substance can then be calculated accurately, not only under supercritical conditions but also under atmospheric and cryogenic conditions.
As a typical comparison of different substances under supercritical conditions, only the calculated results of two-dimensional Rayleigh-Bénard (R-B) convection in supercritical CO2 and H2O are shown here [16]. The aspect ratio of the flow field is fixed at 9, and 217 × 25 grid points are generated for the computational grid. The Rayleigh number is fixed to Ra = 3 × 10^5 in both cases. It is known that the flow properties are fundamentally the same if R-B convection at the same Rayleigh number is calculated assuming an ideal gas. However, even though the flows of CO2 and H2O in near-critical conditions are calculated at the same Rayleigh number, the calculated instantaneous temperature contours show quite different flow patterns. In the H2O case, the flow field is dominated by fluid of relatively lower temperature than in the CO2 case.
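For reference, the Rayleigh number used to match the two cases is Ra = g β ΔT H³ / (ν α). The short sketch below (not part of the SFS code) evaluates this definition with placeholder property values chosen only to reproduce the order of magnitude 3 × 10^5; in the SFS code the near-critical properties would instead come from the PROPATH database.

    def rayleigh_number(g, beta, dT, height, nu, alpha):
        # Ra = g * beta * dT * H^3 / (nu * alpha)
        return g * beta * dT * height**3 / (nu * alpha)

    # Hypothetical property values, tuned only to give Ra of order 3e5.
    Ra = rayleigh_number(g=9.81, beta=1.0e-2, dT=1.0, height=1.0e-2,
                         nu=5.7e-7, alpha=5.7e-7)
    print(f"Ra = {Ra:.2e}")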
4 Other Projects
The SFS code is now being extended to a three-dimensional code. The SFS code is based on the finite-difference method, and the governing equations are solved in general curvilinear coordinates. One of the big problems with this approach is how flows around complex geometries should be handled. Recently we developed an immersed boundary (IB) method [17] which calculates flows using a rectangular grid for the flow field and surface meshes for the complex geometries. Even though a number of IB methods have been proposed in the last ten years, the present IB method developed by us may be the simplest one.
5 Concluding Remarks
Two projects conducted in our laboratory, the Numerical Turbine (NT) and the Supercritical-Fluids Simulator (SFS), and their future perspectives were briefly introduced. Both projects are strongly assisted by the supercomputer SX-7 of the Information Synergy Center (ISC) of Tohoku University. The NT project is also a collaboration with the private companies Mitsubishi Heavy Industries at Takasago and Toshiba Corporation, and with the Institute of Fluid Science and the ISC of Tohoku University. The SFS project is a collaboration with the private companies Idemitsu Kousan and JFE Engineering, and with the Institute of Multidisciplinary Research for Advanced Materials of Tohoku University. The SFS project will also be supported by a Grant-in-Aid for Scientific Research (B) of JSPS for the next three years. Although the third project, a custom computing machine for CFD, was not explained here, its research activity will be presented at the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM 2007), Napa, 2007 [18].
References
1. S. Yamamoto, N. Takasu, and H. Nagatomo, "Numerical Investigation of Shock/Vortex Interaction in Hypersonic Thermochemical Nonequilibrium Flow," J. Spacecraft and Rockets, 36-2 (1999), 240-246.
2. H. Takeda and S. Yamamoto, "Implicit Time-marching Solution of Partially Ionized Flows in Self-Field MPD Thruster," Trans. the Japan Society for Aeronautical and Space Sciences, 44-146 (2002), 223-228.
3. S. Yamamoto, H. Hagari and M. Murayama, "Numerical Simulation of Condensation around the 3-D Wing," Trans. the Japan Society for Aeronautical and Space Sciences, 42-138 (2000), 182-189.
4. S. Yamamoto, "Onset of Condensation in Vortical Flow over Sharp-edged Delta Wing," AIAA Journal, 41-9 (2003), 1832-1835.
5. Y. Sasao and S. Yamamoto, "Numerical Prediction of Unsteady Flows through Turbine Stator-rotor Channels with Condensation," Proc. ASME Fluids Engineering Summer Conference, FEDSM2005-77205 (2005).
6. F.R. Menter, "Two-equation Eddy-viscosity Turbulence Models for Engineering Applications," AIAA Journal, 32-8 (1994), 1598-1605.
7. P.L. Roe, "Approximate Riemann Solvers, Parameter Vectors, and Difference Schemes," J. Comp. Phys., 43 (1981), 357-372.
8. S. Yamamoto and H. Daiguji, "Higher-Order-Accurate Upwind Schemes for Solving the Compressible Euler and Navier-Stokes Equations," Computers and Fluids, 22-2/3 (1993), 259-270.
9. S. Yoon and A. Jameson, "Lower-upper Symmetric-Gauss-Seidel Method for the Euler and Navier-Stokes Equations," AIAA Journal, 26 (1988), 1025-1026.
10. M. Yarrow and R. Van der Wijngaart, "Communication Improvement for the NAS Parallel Benchmark: A Model for Efficient Parallel Relaxation Schemes," Tech. Report NAS RNR-97-032, NASA ARC (1997).
11. S. Yamamoto, Y. Sasao, S. Sato and K. Sano, "Parallel-Implicit Computation of Three-dimensional Multistage Stator-Rotor Cascade Flows with Condensation," Proc. 18th AIAA CFD Conf., Miami (2007).
12. Y.-H. Choi and C.L. Merkle, "The Application of Preconditioning in Viscous Flows," J. Comp. Phys., 105 (1993), 207-223.
13. J.M. Weiss and W.A. Smith, "Preconditioning Applied to Variable and Constant Density Flows," AIAA Journal, 33 (1995), 2050-2056.
14. S. Yamamoto, "Preconditioning Method for Condensate Fluid and Solid Coupling Problems in General Curvilinear Coordinates," J. Comp. Phys., 207-1 (2005), 240-260.
15. A Program Package for Thermophysical Properties of Fluids (PROPATH), Ver. 12.1, PROPATH GROUP.
16. S. Yamamoto and A. Ito, "Numerical Method for Near-critical Fluids of Arbitrary Material," Proc. 4th Int. Conf. on Computational Fluid Dynamics, Ghent (2006).
17. S. Yamamoto and K. Mizutani, "A Very Simple Immersed Boundary Method Applied to Three-dimensional Incompressible Navier-Stokes Solvers using Staggered Grid," Proc. 5th Joint ASME/JSME Fluids Engineering Conference, San Diego (2007).
18. K. Sano, T. Iizuka and S. Yamamoto, "Systolic Architecture for Computational Fluid Dynamics on FPGAs," Proc. IEEE Symp. on Field-Programmable Custom Computing Machines, Napa (2007).
Large-Scale Computations of Flow Around a Circular Cylinder
Jan G. Wissink and Wolfgang Rodi
1 Introduction
Two-dimensional flows around circular and square cylinders have always been
popular test cases for novel numerical methods, see for instance [Wis97]. With
increasing Reynolds number, Re, based on the inflow velocity and the diameter
of the cylinder, the topology of the flow around a circular cylinder changes.
For low Reynolds numbers, the flow is two-dimensional and consists of
a steady separation bubble behind the cylinder. As the Reynolds number
were employed. This data, however, may not contain all relevant length-scales
that are typical for a near wake flow. With the availability of new, high-quality
data from the near wake of a circular cylinder, we hope to be able to resolve
this issue.
The DNS were performed using a finite-volume code with a collocated variable arrangement, designed to be used on curvilinear grids. A second-order central discretization was employed in space and combined with a three-stage Runge-Kutta method for the time integration. To avoid a decoupling of the velocity field and the pressure field, the momentum-interpolation procedure of Rhie and Chow [Rhie83] was employed. The momentum interpolation effectively replaced the discretization of the pressure by another one with a more compact numerical stencil. The code was vectorizable to a high degree and was also parallelized. To obtain a near-optimal load balancing, the computational mesh was subdivided into a series of blocks which all contained an equal number of grid points. Each block was assigned to its own unique processor, and communication between blocks was performed using the standard Message Passing Interface (MPI) protocol.
The Poisson equation for the pressure from the original three-dimensional
problem was reduced to a set of equations for the pressure in two-dimensional
planes by employing a Fourier transform in the homogeneous, spanwise direc-
tion. This procedure was found to significantly speed up the iterative solution
of the pressure field on scalar computers. Because of the reduction of the orig-
inal three-dimensional problem into a series of two-dimensional problems, the
average vector-length was reduced, which might lead to a reduction in perfor-
mance on vector computers. For more information on the numerical method
see Breuer and Rodi [Breu96].
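The following minimal Python sketch illustrates this spanwise-Fourier decoupling for a model Poisson problem with second-order central differences, assuming homogeneous Dirichlet conditions in x and y and periodicity in z; the actual code works on curvilinear O-grids with different boundary conditions and its own iterative 2-D solver.

    import numpy as np
    from scipy.sparse import identity, kron, diags
    from scipy.sparse.linalg import spsolve

    def solve_poisson_fft_spanwise(f, dx, dy, dz):
        # Solve d2p/dx2 + d2p/dy2 + d2p/dz2 = f, Dirichlet in x and y,
        # periodic in z: one decoupled 2-D (x, y) problem per spanwise mode.
        nx, ny, nz = f.shape
        fhat = np.fft.rfft(f, axis=2)                  # transform the spanwise direction

        # modified wavenumber of the 2nd-order central difference in z
        k = np.arange(fhat.shape[2])
        kz2 = (2.0 - 2.0 * np.cos(2.0 * np.pi * k / nz)) / dz**2

        # 2-D Laplacian (5-point stencil) on the interior points
        Dxx = diags([1, -2, 1], [-1, 0, 1], shape=(nx, nx)) / dx**2
        Dyy = diags([1, -2, 1], [-1, 0, 1], shape=(ny, ny)) / dy**2
        L2d = kron(Dxx, identity(ny)) + kron(identity(nx), Dyy)

        phat = np.empty_like(fhat)
        for m in range(fhat.shape[2]):                 # one 2-D solve per spanwise mode
            A = (L2d - kz2[m] * identity(nx * ny)).tocsc().astype(np.complex128)
            phat[:, :, m] = spsolve(A, fhat[:, :, m].ravel()).reshape(nx, ny)
        return np.fft.irfft(phat, n=nz, axis=2)        # back to physical space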
Figure 1 shows a spanwise cross-section through the computational geom-
etry. Along the top and bottom surface, a free-slip boundary condition was
employed and along the surface of the cylinder a no-slip boundary condition
was used. At the inflow plane, a uniform flow-field was prescribed with u = U0
and v = w = 0, where u, v, w are the velocities in the x, y, z-directions, re-
spectively. At the outflow plane, a convective outflow boundary condition was
employed that allows the wake to leave the computational domain without
any reflections. In the spanwise direction, finally, periodic boundary condi-
tions were employed.
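As a sketch of such a convective outflow condition (an illustration under assumed discretization choices, not the paper's actual implementation), the outflow plane can be updated from ∂u/∂t + Uc ∂u/∂x = 0 with a first-order upwind difference and an explicit Euler step:

    import numpy as np

    def convective_outflow(u, u_conv, dt, dx):
        # u is a 3-D array ordered (x, y, z); the last x-plane is the outflow plane.
        # du/dt + Uc du/dx = 0, first-order upwind in x, explicit Euler in time.
        c = u_conv * dt / dx
        u[-1, :, :] -= c * (u[-1, :, :] - u[-2, :, :])
        return u

    # Example: convect the solution out of the domain with Uc = U0.
    u = np.ones((64, 32, 16))
    u = convective_outflow(u, u_conv=1.0, dt=0.01, dx=0.05)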
The computational mesh in the vicinity of the cylinder is shown in Fig. 2. Because of the dense grid, only every eighth grid line is displayed. As illustrated in the figure, O-type meshes were employed in all DNS of the flow around a circular cylinder. Near the cylinder, the mesh was stretched only in the radial direction, and not in the circumferential direction. Table 1 provides an overview of the performed simulations and shows some details of the performance of the code on the NEC, which will be analyzed in the next section.
[Fig. 1: spanwise cross-section through the computational geometry, showing the cylinder of diameter D, free-slip boundaries with v = 0 and w = 0 at top and bottom, and the outflow plane; the domain extends over -10 ≤ x/D ≤ 15.]
The maximum size of the grid cells adjacent to the cylinder's surface in Simulations B-F (in wall units) was smaller than or equal to ∆φ+ = 3.54 in the circumferential direction, ∆r+ = 0.68 in the radial direction, and ∆z+ = 5.3 in the spanwise direction.
Fig. 2. Spanwise cross-section through the computational mesh near the cylinder
as used in Simulations D and E (see Table 1), displaying every 8th grid line
[Figure: single-processor performance in MFlops/node versus average vector length for Simulations A-F (top), and MFlops/node versus the number of processors (bottom).]
for this is the increase in communication when the number of blocks (which
is the same as the number of processors) increases. Communication is needed
to exchange information between the outer parts of the blocks during every
timestep. Though the amount of data exchanged is not very large, the fre-
quency with which this exchange takes place is relatively high and, therefore,
slows down the calculations.
As an illustration of what happens when the number of points in the span-
wise direction is reduced by a factor of 4 (compared to Simulations B,C,D,F),
we consider a simulation of flow around a wind turbine blade. In this simula-
tion, the mesh consisted of 1510 × 1030 × 128 grid points, in the streamwise,
wall-normal and spanwise direction, respectively. The grid was subdivided
into 256 blocks, each containing 777650 computational points. The average
vector-length was 205 and the vector operation ratio was 99.4%. The mean performance of the code reached approximately 4 GFlops per processor, so that the combined performance of the 256 processors exceeded 1 TFlops for this single simulation.
From the above we can conclude that optimizing the average vector-length
of the code resulted in a significant increase in performance and that it is
helpful to try to increase the number of points per block. The latter will
reduce the amount of communication between blocks and may also increase
the average vector-length.
Acknowledgements
The authors would like to thank the German Research Council (DFG) for
sponsoring this research and the Steering Committee of the Supercomputing
Centre (HLRS) in Stuttgart for granting computing time on the NEC SX-8.
References
[Beau94] Beaudan, P., Moin, P.: Numerical experiments on the flow past a circular
cylinder at subcritical Reynolds number, In Report No. TF-62, Thermo-
sciences Division, Department of Mechanical Engineering, Stanford Uni-
versity, pp. 1–44 (1994)
[Breu96] Breuer, M., Rodi, W.: Large eddy simulation for complex turbulent flow of practical interest, In E.H. Hirschel, editor, Flow simulation with high-performance computers II, Notes in Numerical Fluid Mechanics, volume 52, Vieweg Verlag, Braunschweig, (1996)
[Breu98] Breuer, M.: Large eddy simulations of the subcritical flow past a circular
cylinder: numerical and modelling aspects, Int. J. Numer. Meth. Fluids,
28, 1281–1302 (1998)
[Dong06] Dong, S., Karniadakis, G.E., Ekmekci, A., Rockwell, D.: A combined direct numerical simulation-particle image velocimetry study of the turbulent near wake, J. Fluid Mech., 569, 185–207 (2006)
[Froe98] Fröhlich, J., Rodi, W., Kessler, Ph., Parpais, S., Bertoglio, J.P., Laurence, D.: Large eddy simulation of flow around circular cylinders on structured and unstructured grids, In E.H. Hirschel, editor, Flow simulation with high-performance computers II, Notes in Numerical Fluid Mechanics, volume 66, Vieweg Verlag, Braunschweig, (1998)
[Krav00] Kravchenko, A.G., Moin, P.: Numerical studies of flow around a circular
cylinder at ReD = 3900, Phys. Fluids, 12, 403–417 (2000)
[Lour93] Lourenco, L.M., Shih, C.: Characteristics of the plane turbulent near wake of a circular cylinder, a particle image velocimetry study, Published in [Beau94], data taken from Kravchenko and Moin [Krav00] (1993)
[Ma00] Ma, X., Karamanos, G.-S., Karniadakis, G.E.: Dynamics and low-dimensionality of a turbulent near wake, J. Fluid Mech., 410, 29–65 (2000)
[Mit97] Mittal, R., Moin, P.: Suitability of upwind-biased finite-difference schemes for large eddy simulations of turbulent flows, AIAA J., 35(8), 1415–1417 (1997)
[Nor94] Norberg, C.: An experimental investigation of flow around a circular cylin-
der: influence of aspect ratio, J. Fluid Mech., 258, 287–316 (1994)
[Nor03] Norberg, C.: Fluctuating lift on a circular cylinder: review and new mea-
surements, J. Fluids and Structures, 17, 57–96 (2003)
[Ong96] Ong, L., Wallace, J.: The velocity field of the turbulent very near wake of a circular cylinder, Exp. in Fluids, 20, 441–453 (1996)
[Rhie83] Rhie, C.M., Chow, W.L.: Numerical study of the turbulent flow past an
airfoil with trailing edge separation, AIAA J, 21(11), 1525–1532 (1983)
[Stone68] Stone, H.L.: Iterative solutions of implicit approximations of multidimen-
sional partial differential equations, SIAM J Numerical Analysis, 5, 87–
113 (1968)
[Thom96] Thompson, M., Hourigan, K., Sheridan, J.: Three-dimensional instabilities in the wake of a circular cylinder, Exp. Thermal Fluid Sci., 12(2), 190–196 (1996)
[Wis97] Wissink, J.G.: DNS of 2D turbulent flow around a square cylinder, Int.
J. for Numer. Methods in Fluids, 25, 51–62 (1997)
[Wis03] Wissink, J.G.: DNS of a separating low Reynolds number flow in a turbine
cascade with incoming wakes, Int. J. of Heat and Fluid Flow, 24, 626–635
(2003).
[Wis06] Wissink, J.G. and Rodi, W.: Direct Numerical Simulation of flow and
heat transfer in turbine cascade with incoming wakes, J. Fluid Mech.,
569, 209–247 (2006).
[Wil96a] Williamson, C.H.K.: Vortex dynamics in the cylinder wake, Ann. Rev.
Fluid Mech., 28, 477–539 (1996)
[Wil96b] Williamson, C.H.K.: Three-dimensional wake transition, J. Fluid Mech.,
328, 345–407 (1996)
Performance Assessment and Parallelisation
Issues of the CFD Code NSMB
of the simulation codes. As a part of the ETH domain, the Swiss National
Supercomputing Centre (CSCS) is the main provider of HPC infrastructure to
academic research in Switzerland. Due to our excellent experience with vector
machines, we developed and optimised our CFD codes primarily for this ar-
chitecture. Our main computational platform has been the NEC SX-5 vector
computer (and its predecessor machine, a NEC SX-4) at CSCS. However, the
NEC SX-5 was decommissioned recently, and two new scalar machines were
installed, a Cray XT3 and an IBM SP5.
This shift in the computational infrastructure at CSCS from vector to
scalar machines prompted us to assess and compare the code performance on
the different HPC systems available to us, as well as to investigate the poten-
tial for optimisation. The central question to be answered by this benchmarking study is whether, and how, the new Cray XT3 is capable of achieving a code performance superior to that of the NEC SX-5 (preferably in the range of the NEC SX-8). This report tries to answer this question by presenting results
of our benchmarking study for one particular, but important, flow case and
simulation code.
It thereby adheres to the following structure. In Sect. 2 we outline the key
properties of our simulation code. After the description of the most prominent
performance-related characteristics of the machines that were investigated in
the benchmarking study in Sect. 3, we will give some details about the test
configuration and benchmarking method in Sect. 4. The main part of this paper is Sect. 5, where we present the performance measurements. In the first subsection, Sect. 5.1, we show data that was obtained on the two vector systems. After that, we discuss the benchmarking results from two massively-parallel machines in Sect. 5.2. In Sect. 5.3, we discuss a way to alleviate the problem of uneven load balancing, a problem which is commonly observed in the investigated simulation scenario and severely inhibits performance in parallel simulations. We first demonstrate its favourable impact on the performance of the massively-parallel computer Cray XT3, and then study its effect on the performance and performance-related measures on the NEC SX-8 vector machine. Finally, in Sect. 5.4 we compare the results of all platforms and return to the initial question of how the code performance on the Cray XT3 compares to that of the two vector machines NEC SX-5 and SX-8.
ical (central and upwind) schemes with an accuracy up to fourth order (fifth
order for upwind schemes). The core routines are available in two versions, op-
timised for vector and scalar architectures, respectively. Furthermore, NSMB
offers convergence acceleration by means of multi-grid and residual smooth-
ing, as well as the possibility of moving grids and a large number of different
boundary conditions. Technical details about NSMB can be found in [3], and
examples of complex flows simulated with NSMB are published in [4, 5, 6, 7, 8].
Fig. 1. Schematic of the jet-in-crossflow configuration. (a) Side view and (b) top
view. The gray areas symbolise the jet fluid
composition (see Sect. 5.3). This was done by splitting the original topology of
34 blocks into a larger number of up to 680 blocks. Table 1 lists the properties
of the investigated block configurations. Further information about this flow
case and the simulation results will be available in [10].
Unlike in other benchmarking studies, we do not primarily focus on com-
mon measures such as parallel efficiency or speedup in the present paper.
Rather, we also consider quantities that are more geared towards the prag-
matic aspects of parallel computing. This study was motivated by the need to evaluate new simulation platforms in view of our practical requirements, such as a simulation throughput high enough to allow for satisfactory progress in our research projects.
To this end, we measured the elapsed wall-clock time consumed by the simulation to advance the solution by a fixed number of time steps. All computations were started from exactly the same initial conditions, and the fixed number of time steps was chosen so that even the fastest resulting simulation time would still be long enough to keep measuring errors small. For better
illustration, the elapsed wall-clock time is normalised with the number of
time steps. Therefore, in the following, our performance charts display the
performance in terms of elapsed time per time step (abbreviated “performance
measure”). Since in our work typically the simulation time interval and thus
the number of time steps is known a priori, this performance measure allows
for a quick estimate of the necessary computational time (and simulation
turnover time) for similar calculations.
As we wanted to obtain information about the real-world behaviour of our codes, we did not artificially improve the benchmarking environment by obtaining exclusive access to the machine or other privileges, such as performing the computations at times of decreased general system load.
To ensure statistically significant results, all benchmarking calculations were
repeated several times and only the best result was used. In this sense, our
data reflects the maximum performance that can be expected from the given
machines in everyday operation. Of course, the simulation turnover time is
typically significantly degraded by the time the simulation spends in batch
queues, but we did not include this aspect into our study as we felt that the
queueing time is a rather arbitrary and heavily installation-specific quantity
which would only dilute the more deterministic results of our study.
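A sketch of this measurement procedure (with a hypothetical run command standing in for the actual NSMB invocation, and the repeat-and-take-best policy described above) is:

    import subprocess, time

    def best_time_per_step(run_cmd, n_steps, n_repeats=5):
        # Repeat the identical fixed-length benchmark several times and keep the
        # best (smallest) elapsed wall-clock time per time step.
        best = float("inf")
        for _ in range(n_repeats):
            t0 = time.perf_counter()
            subprocess.run(run_cmd, shell=True, check=True)   # runs a fixed number of steps
            elapsed = time.perf_counter() - t0
            best = min(best, elapsed / n_steps)
        return best      # seconds per time step; its inverse is the performance measure

    # Hypothetical usage:
    # best_time_per_step("mpirun -np 32 ./nsmb bench.input", n_steps=100)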
5 Benchmarking Results
5.1 Performance Evaluation of the Vector Machines NEC SX-5
and NEC SX-8
[Fig. 2: (a) performance measure (time steps per wall-clock second) and (b) parallel speed-up versus the number of processors on the NEC vector machines.]
Figure 3 shows the benchmarking results for the scalar machine Cray XT3.
The Cray compiler supports an option which controls the size of the mem-
ory pages (“small pages” of 4 Kilobytes versus “large pages” comprising
2 Megabytes) that are used at run time. Small memory pages are favourable
if the data is heavily distributed in physical memory. In this case the available
memory is used more efficiently, resulting in generally fewer page faults than
with large pages, where only a small fraction of the needed data can be stored
in memory at a given time. Additionally, small memory pages can be loaded
faster than large ones. On the other hand, large memory pages are beneficial
if the data access pattern can make use of data locality, since fewer pages have
to be loaded. We investigated the influence of this option by performing all
benchmarking runs twice, once with small and once with large memory pages.
In all cases, the performance using small memory pages was almost twice as good, with a very similar general behaviour. Therefore, we will only consider benchmarking data obtained with this option in the following.
The performance development of the Cray XT3 with small memory pages
for an increasing number of processors also exhibits an almost-linear initial
scaling, just as observed for the NEC SX-5 and SX-8 vector machines. (With
large memory pages, the scaling properties are less ideal, and notable devia-
tions from linear scaling already occur for more than four processors, see Fig. 3(b).
Furthermore, the parallel speed-up stagnates earlier and at a lower level than
for small memory pages.) For 16 or more processors on the Cray XT3, the
performance stagnates and exhibits a similar saturation behaviour originat-
ing from bad load-balancing as already observed for the two vector machines.
This is not surprising, since load-balancing is not a problem specific to the
[Fig. 3: (a) performance measure (time steps per wall-clock second) and (b) parallel speed-up versus the number of processors on the Cray XT3; triangles denote the IBM SP4.]
memory) as on the NEC SX-8, when crossing nodes. Therefore, the general
scaling behaviour is expected to be better and more homogeneous on a scalar
machine. This is, however, not an explanation for the larger linear scaling
range of the Cray, as computations on the NEC are still performed within a
single node for eight processors, where they first deviate from linear scaling.
We rather conclude that the architecture and memory-access performance of the Cray XT3, and most importantly its network performance, do not notably inhibit the parallel scaling on the Cray XT3, while on the two vector machines there are architectural effects which decrease performance. Potentially the uneven load balancing exerts a higher influence on the NECs and thus degrades performance already at a lower number of processors than on the Cray XT3. While the exact reasons for the shorter initial linear parallel scaling range on the vector machines remain to be clarified, we will at least show in the following two sections that load-balancing effects show up quite differently on the Cray XT3 and the NEC SX-5/SX-8.
We also obtained some benchmarking results for the (now rather outdated)
IBM SP4 at CSCS in Manno, see the triangles in Fig. 3. Since its successor
machine IBM SP5 was already installed at CSCS and was in its acceptance
phase during our study, we were interested in performance data, especially
the general parallelisation behaviour, of a machine with similar architecture.
However, due to the heavy loading of the IBM SP4, which led to long queueing
times of our jobs, we could not gather benchmarking data for more than four
CPUs during the course of this study. For such a low number of CPUs, the
IBM SP4 approximately achieves the same performance as the Cray XT3.
This does not come as a surprise, since the floating-point peak performances of the processors in the IBM SP4 and of the Opteron CPUs in the Cray XT3 are nearly identical at 5.2 Gigaflops. However, when increasing the degree of
parallelism on the IBM SP4, we expect its parallel scaling to be significantly
worse than on the Cray XT3, due to its slower memory access, and slower
network performance when crossing nodes (i. e., for more than 32 processors).
It was already pointed out in the previous two sections that the low granularity of the domain decomposition in the benchmarking problem inhibits an efficient load balancing for even a moderate number of CPUs. Suboptimal load
balancing imposes a quite strict performance barrier which cannot be broken
with the brute-force approach of further increasing the level of parallelism:
the performance will only continue to stagnate.
For our typical applications, this quick saturation of parallel scaling is not
a problem on the NEC SX-5 and NEC SX-8 vector machines, since their fast
CPUs yield stagnation levels of performance which are high enough for satis-
factory turnover times already at a low number of CPUs. On the Cray XT3,
where the CPU performance is significantly lower, a high number of processors
[Figure: (a) performance measure and (b) parallel speed-up versus the number of processors on the Cray XT3.]
[Fig. 5: dependence of (a) the performance measure and (b) the parallel efficiency on the number of blocks on the Cray XT3 for a fixed number of 32 processors.]
this data transfer would not be necessary when utilising fewer blocks, so it can
be considered additional work to obtain the same result. Blocks that reside
on the same CPU can communicate faster directly in memory, but the cor-
responding communication routines still have to be processed. In these parts
of the code the data is copied from the source array to a work array, and
then from the work array to the target field. In both cases, the computation
spends extra time copying data and (in the case of blocks distributed to dif-
ferent CPUs) communicating it over the network. This time could otherwise
be used for the advancement of the solution. For very large numbers of pro-
cessors, this communication overhead outweighs the performance gain from the more balanced work distribution, and the total performance degrades
again. There are additional effects that come along with smaller block sizes,
which are mostly related to the locality of data in memory and cache-related
effects. However, as their impact is relatively difficult to assess and quantify,
we do not discuss them here. Additionally, on vector machines such as the
NEC SX-5 and SX-8, the average vector length, which is strongly dependent
on the block size, exerts a big influence on the floating-point performance. We
will further analyse this topic in the next section.
While the performance gain on the Cray XT3 due to block splitting for 32 processors is quite impressive, the results are still far from the levels that were obtained with the NEC SX-8 (see Fig. 2).
Furthermore, since the performance of the Cray XT3 does not depend
strongly on the number of blocks (as long as there are enough of them), only
a variation of the number of processors can yield further improvements. To
this end, we performed a new series of benchmarking computations at a fixed
number of blocks with an increasing number of CPUs. Since the optimum
block configuration with 340 blocks for 32 processors (see Fig. 5) would limit
the maximum number of CPUs to 340, we selected the block configuration
with 680 blocks for this investigation. We doubled the number of CPUs suc-
cessively up to a maximum of 512, starting from the computational setup
of 32 processors and 680 blocks that was utilised in the above initial block-
splitting investigation. The results are shown as rectangular symbols in Fig. 6.
Additionally, the previous performance data for the Cray XT3 is displayed,
without block splitting, and with 32 CPUs and block splitting.
Especially in the left plots with linear axes, the considerably extended
initial parallel scaling region is clearly visible. In the original configuration
without block splitting, it only reached up to eight processors, while with
block splitting, the first larger deviation occurs only for more than 32 pro-
cessors. Furthermore, the simulation performance rises continually with an
increasing number of CPUs, and flattens out only slowly. While a perfor-
[Fig. 6: performance measure (top) and parallel speed-up (bottom) versus the number of processors on the Cray XT3 with 680 blocks, shown with linear (left) and logarithmic (right) axes; annotations mark runs using 2% and 31% of the machine capacity.]
When considering the sizes of the individual blocks, it becomes clear that the big size difference between the largest and the smallest block inhibits a better load balancing. As the number of CPUs increases in Figs. 4 and 5, the situation gets even worse. There is less and less opportunity to fill the gaps with smaller blocks, as they are needed elsewhere. For the worst investigated case of 34 blocks on 32 CPUs, only two CPUs hold two blocks each, while the other 30 CPUs obtain only one block. It is clear that for such a configuration the ratio of the largest to the smallest block size plays a pivotal role for the load balancing. While the CPU containing the largest block is processing it, all the other CPUs idle for most of the time.
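The principle can be illustrated with a small sketch (not NSMB's actual distribution algorithm) that assigns blocks to CPUs greedily, largest block first onto the least-loaded CPU, and reports the resulting load imbalance, i.e. the ratio of the heaviest CPU load to the mean load:

    import heapq

    def assign_blocks(block_sizes, n_cpus):
        # Greedy largest-first assignment of blocks (in grid points) to CPUs.
        # The time per step is set by the most heavily loaded CPU; the others idle.
        heap = [(0, cpu, []) for cpu in range(n_cpus)]
        heapq.heapify(heap)
        for size in sorted(block_sizes, reverse=True):
            load, cpu, blocks = heapq.heappop(heap)
            heapq.heappush(heap, (load + size, cpu, blocks + [size]))
        loads = sorted(load for load, _, _ in heap)
        imbalance = loads[-1] / (sum(loads) / n_cpus)
        return loads, imbalance

    # Example with made-up block sizes: one dominant block sets the pace.
    loads, imb = assign_blocks([40, 10, 10, 10, 5, 5, 5, 5], n_cpus=4)
    print(loads, imb)    # [15, 15, 20, 40], imbalance ~1.8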
The impact of an uneven load balancing on important performance measures can be observed in Fig. 7(a), where the minimum, mean and maximum values of the average floating-point performance per CPU are plotted over the number of CPUs for the above domain decomposition of 34 blocks. Note that the minimum and maximum values refer to the performance reached on one CPU of the parallel simulation. To obtain the floating-point performance of the whole parallel simulation, the mean value needs to be multiplied by the total number of CPUs. This is also the reason for the steadily
decreasing floating-point performance as the number of CPUs increases. In a
serial simulation, there is a minimal communication overhead, and the simu-
lation code can most efficiently process work routines, where the majority of
the floating-point operations take place. For parallel computations, a steadily
increasing fraction of the simulation time has to be devoted to bookkeep-
ing and data-exchange tasks. Therefore, the share of floating-point operations
decreases, and with it the floating-point performance.
An interesting observation can be made for the graph of the maximum
floating-point performance. It declines continuously from one up to eight
CPUs, but experiences a jump to a higher level from 8 to 16 CPUs, and
continues to climb from 16 to 32 CPUs. When comparing Fig. 7(a) to the
corresponding block distributions 1–5 in Fig. 13, it is evident that the dra-
[Fig. 7: minimum, mean and maximum per-CPU values of (a) the floating-point performance (percentage of peak) and (b) the ratio of average to maximum vector length versus the number of processors for the 34-block configuration.]
(a) (b)
the load balancing degrades for a higher number of CPUs, there are processors
with very small blocks, which result in much smaller vector lengths, and thus
in considerably lower minimum vector lengths of this computation.
After this discussion of the CPU-number dependence (using the initial
unsplit block configuration with 34 blocks) of the two most prominent perfor-
mance measures on vector machines, the floating-point performance and the
average vector length, we will study the impact of block splitting. Similarly
to the investigation on the Cray XT3 in the previous section, we conducted
computations with a varying number of blocks (ranging from 34 to 340) at a
fixed number of processors. In Fig. 8 we display the variation of the perfor-
mance measure and the parallel efficiency with the number of blocks for three
sets of computations with 16, 32 and 64 processors, respectively.
The graphs can be directly compared to Fig. 5, where the results for the
Cray XT3 are plotted. As discussed in Sect. 5.3, its performance curve ex-
hibits a concave shape, i. e., there is an optimum block count yielding a global
performance maximum. In contrast, after the initial performance jump from
34 to 68 blocks due to the improved load balancing, all three curves for the
NEC SX-8 are continually falling with an increasing number of blocks. The de-
crease occurs in an approximately linear manner, with an increasing slope for
the higher number of CPUs. As expected, the performance for a given block
count is higher for the computations with the larger number of processors,
but the parallel efficiencies behave in the opposite way due to the increased
communication overhead, cf. Figs. 8(a) and (b). Furthermore, the parallel
efficiency does not surpass 32%-53% (depending on the number of proces-
sors) even after block splitting, which are typical values for a vector machine.
In contrast, the parallel efficiency on the Cray XT3 (with 32 processors, see
Fig. 5(b)) is improved from about the same value as on the NEC SX-8 (roughly
Fig. 8. Dependence of (a) performance measure and (b) parallel efficiency on num-
ber of blocks on NEC SX-8 for a fixed number of processors. ◦ 16 CPUs, × 32 CPUs
and + 64 CPUs, linear fit through falling parts of the curves
25% with 34 blocks and 32 CPUs) to about 75% with 68 blocks, whereas only
45% are reached on the NEC SX-8.
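For completeness, the speed-up and parallel efficiency quoted here follow directly from the performance measure; a small helper is sketched below, where the single-CPU reference rate of about 0.31 time steps per second is back-calculated from the efficiencies quoted above and is therefore an assumption of this sketch.

    def speedup_and_efficiency(steps_per_second, n_cpus, serial_rate):
        # speed-up: measured rate relative to the single-CPU rate;
        # parallel efficiency: speed-up divided by the number of CPUs.
        speedup = [r / serial_rate for r in steps_per_second]
        efficiency = [s / p for s, p in zip(speedup, n_cpus)]
        return speedup, efficiency

    # NEC SX-8, 32 CPUs: about 2.5 steps/s with 34 (or 680) blocks versus
    # 4.5 steps/s with 68 blocks (values from the text); assumed serial rate 0.31 steps/s.
    s, e = speedup_and_efficiency([2.5, 4.5], [32, 32], serial_rate=0.31)
    print(e)    # roughly [0.25, 0.45]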
The decreasing performance with an increasing block count on the NEC SX-8 can be explained by reconsidering the performance measures that were investigated in the previous section. On the positive side, a higher block count leads to an increasingly homogeneous distribution of the work, cf. the corresponding block-distribution plots in Fig. 13. However, at the same time the aver-
age block size decreases significantly, as evident from Table 2, and the over-
head associated with the inter-block data exchange strongly increases. In the
preceding section, it was observed that this results in a degraded mean of
the average vector length and floating-point performance. Both quantities are
crucial to the performance on vector machines, and thus the overall perfor-
mance decreases. On the other hand, a smaller dataset size does not yield
such unfavourable effects on the Cray XT3 by virtue of its scalar architec-
ture. Only for very high block counts does the resulting communication overhead outweigh the improved load balancing and thus slightly decrease the parallel performance.
The above observations suggest that while block splitting is a suitable
method to overcome load-balancing problems also on the NEC vector ma-
chines, it is generally advisable to keep the total number of blocks on them
as low as possible. This becomes even more evident when looking at the ag-
gregate performance chart in Fig. 11, where the same performance measure
is plotted above the number of CPUs for all benchmarking calculations. The
solid curve denoting the original configuration with 34 blocks runs into a satu-
ration region due to load-balancing issues for more than 8 CPUs. After refining
the domain decomposition by block splitting, the load-balancing problem can
be alleviated and delayed to a much higher number of CPUs. As detailed in
the discussion of Fig. 10, the envelope curve through all measurements with
maximum performance for each CPU count cuts through the points with the
lowest number of split blocks, 68 in this case. A further subdivision of the
blocks only reduces the performance. For instance, for 32 processors, using
the original number of 34 blocks or 680 blocks yields almost the same per-
formance of about 2.5 time-steps/second, while a maximum performance of
approximately 4.5 time-steps/second is reached for 68 blocks.
The dependence of the performance-related measures on the total block
count can be investigated in Fig. 9. Here the floating-point performance and
the average vector length are shown over the total number of blocks for a fixed
number of processors. As in Fig. 7, the minimum, mean, and maximum values
of a respective computation are CPU-related measures. The three symbols
denote three sets of computations, which have been conducted with a fixed
number of 16, 32 and 64 processors, respectively.
Both the vector length and the floating-point performance drop with an in-
creasing number of blocks. While this degrading effect of the block size on both
quantities is clearly visible, it is relatively weak, especially for the floating-
point performance. The number of CPUs exerts a considerably higher influence
[Fig. 9: (a) floating-point performance and (b) average vector length over the number of blocks for fixed processor counts of 16, 32 and 64 CPUs]
on performance, as also evident from Fig. 8. The reason for the degradation of
the floating-point performance lies in the increasing fraction of bookkeeping
and communication tasks, which are typically non-floating point operations,
when more blocks are employed. This means that in the same amount of time,
fewer floating-point operations can be performed. The average vector length
is primarily diminished by the increasingly smaller block sizes when they are
split further. However, while the floating-point performance in Fig. 9(a) is
largest for 16 CPUs and smallest for 64 CPUs for a given number of blocks,
the vector lengths behave oppositely. Here the largest vector lengths occur for
64 CPUs, while the vector lengths for 32 and 16 processors are significantly
smaller and approximately the same for all block counts. The reason for this
inverse behaviour can again be found in the block distribution, which is visu-
alised in Fig. 13. The mechanisms are most evident for the simulations with
the lowest number of blocks, nb = 68 blocks in this case. Whereas in all three
cases with 16, 32 and 64 processors the dimensions of the individual blocks are
the same, the properties of their distribution onto the CPUs vary significantly. For the highest number of 64 processors, each CPU holds only one block, with the exception of four CPUs containing two blocks each. This leads
to a very uneven load distribution. In contrast, for 16 processors, most of the
CPUs process four blocks, and some even five. It is clearly visible that this
leads to a much more homogeneous load balancing. As discussed above, the
load balancing has a direct influence on the floating-point rate. The maximum
floating-point performance for a given block count is reached on the CPU with
the highest amount of work. This processor conducts floating-point operations
during most of the simulation and only interrupts this task for short periods to exchange data with the other processors.
After the detailed investigation of the impact of the block and processor
count on the performance and related measures, we will consider the overall
effect of block splitting on the NEC SX-8 in an aggregate performance chart.
In Fig. 10, the performance of all computations on the NEC SX-8 is displayed
with the usual symbols with both linear and logarithmic axes. The perfor-
mance development with 34 blocks was already discussed in Sect. 5.1. After a
linear initial parallel scaling for up to 4 processors, the performance stagnates
at around 16 processors due to an unfavourable load balancing, and a higher
level of parallelism does not yield any notable performance improvements. An
alleviation of this problem was found in the refinement of the domain decom-
position by splitting the mesh blocks. The three sets of computations with
16, 32 and 64 processors each with a varied block number were studied above
in more detail. For 16 processors, the performance is at best only slightly
increased by block splitting, but can also be severely degraded by it: already
at a moderate number of blocks, the performance is actually lower than with
the un-split block configuration (34 blocks). At 32 processors, where the load
balancing is much worse, block splitting exhibits a generally more positive
effect. The performance almost doubles when going from 34 to 68 blocks, and then continually degrades for a higher number of blocks. At the highest investi-
gated block count, 340 blocks, the performance is approximately the same as
with 34 blocks. At 64 processors, a computation with 34 blocks is technically
not possible. However, it is remarkable that already with 68 blocks and thus
bad load balancing (cf. plot 16 in Fig. 13), the performance is optimum, and
[Fig. 10: performance of all NEC SX-8 computations over the number of processors, shown with (a) linear and (b) logarithmic axes]
When considering the hull curve through the maximum values for each
processor count, it is notable that the performance continues to scale rather
well for a higher number of CPUs, and the performance stagnation due to
inhomogeneous load balancing can be readily overcome by block splitting.
However, in contrast to the Cray XT3, the block count exhibits a relatively strong effect on performance, and excessive use of block splitting can even degrade the performance, especially at a low number of processors. We
therefore conclude that block splitting is also a viable approach to overcome
load balancing issues on the NEC SX-8. In contrast to scalar machines, it is
however advisable to keep the number of blocks as low as possible. A factor of
two to the unbalanced block count causing performance stagnation appears
to be a good choice. When following this recommendation, the performance
scales well on the NEC SX-8 to a relatively high number of processors.
In this section, we compare the benchmarking data obtained on the Cray XT3 to those gathered on the NEC SX-8, especially in view of the performance
improvements that are possible with block splitting. In Fig. 11(a) and (b), we
display this information in two plots with linear and logarithmic axes.
Using the original configuration with 34 blocks, all three machines (NEC
SX-5, NEC SX-8 and Cray XT3) quickly run into performance stagnation
after a short initial almost-linear scaling region. While the linear scaling re-
gion is slightly longer on the Cray XT3 than on the two vector machines, its
stagnating performance level (with 16 CPUs) barely reaches the result of the
NEC SX-5 on a single processor. While complete performance saturation can-
not be reached on the NEC SX-5 due to its too small queue and machine size,
its “extrapolated” stagnation performance is almost an order of magnitude
higher, and the NEC SX-8 is approximately twice as fast on top of that.
For our past simulations, the performance on the two vector machines,
especially on the NEC SX-8, has been sufficient for an adequate simulation
turnover time, while the Cray XT3 result is clearly unsatisfactory. By refining
the domain decomposition through block splitting, the load balancing is dras-
tically improved, and the simulation performance using 32 CPUs on the Cray
jumps up by a factor of about four. At this level, it is just about competitive
with the NEC SX-5 using 4 CPUs and no block splitting (which would not
yield notable performance improvements here anyway). Furthermore, at this
setting, the Cray XT3 output equals approximately the result with one or two
NEC SX-8 processors (also without block splitting).
Since the block splitting strategy has the general benefit of extending the
linear parallel scaling range, an increase of the number of CPUs comes along
with considerable performance improvements. On the Cray XT3, the perfor-
mance using 512 processors roughly equals the SX-8 output with 8 CPUs. The
512 CPUs on the Cray correspond to about one third of the machine installed
at CSCS, which can be considered the maximum allocation that is realistic
Fig. 11. Aggregate chart for varying number of processors. (a) Performance measure (linear axes), (b) performance measure (logarithmic axes), (c) parallel speed-up, (d) parallel efficiency. Annotations in (a) and (b) mark 11% and 31% of the machine capacity. NEC SX-8 with 34 blocks; ◦ 16 CPUs, × 32 CPUs and + 64 CPUs on NEC SX-8 with a varying number of blocks ranging from 68 to 340; NEC SX-5 with 34 blocks; Cray XT3 with 34 blocks; ∗ Cray XT3 with a varying number of blocks ranging from 68 to 680 and 32 CPUs; Cray XT3 with 680 blocks; ideal scaling in (c).
in everyday operation, and thus this result marks the maximum performance
of this test case on the Cray XT3. Since a similar performance is achievable
on the NEC SX-8 using only one out of 72 nodes (slightly more than 1% of
the machine capacity), we conclude that calculations on the NEC SX-8 are
much more efficient than on the Cray XT3 for our code and test case. Further
increases of the CPU count on the NEC SX-8, still without block splitting,
yield notable performance improvements to about 50% above the maximum
Cray XT3 performance. However, when block splitting is employed on the
NEC SX-8, the full potential of the machine is uncovered with the given sim-
ulation setup, and the performance approximately doubles for 32 processors,
when increasing from 34 to 68 blocks. The maximally observed performance,
Fig. 12. Aggregate chart for varying number of blocks. (a) Performance measure, (b) parallel efficiency. ◦ 16 CPUs, × 32 CPUs and + 64 CPUs on NEC SX-8; linear fit through falling parts of the curves; ∗ Cray XT3 with 32 CPUs.
6 Conclusions
We conducted a comparative performance assessment with the code NSMB on
different high-performance computing platforms at CSCS in Manno and HLRS
in Stuttgart using a typical computational fluid dynamics simulation case
involving time-resolved turbulent flow. The investigation was centred around
the question if and how it is possible to achieve a similar performance on the
new massively-parallel Cray XT3 at CSCS as obtained on the NEC SX-5 and
SX-8 vector machines at CSCS and HLRS, respectively.
While for the given test case the processor performance of the mentioned
vector computers is sufficient for low simulation turnover times even at a low
number of processors, the Cray CPUs are considerably slower. Therefore, cor-
respondingly more CPUs have to be employed on the Cray to compensate.
However, this is usually not easily feasible due to the block-parallel nature of
the simulation code in combination with the coarse-grained domain decompo-
sition of our typical flow cases. While the total block count is a strict upper
limit for the number of CPUs in a parallel simulation, a severe degradation of
the load-balancing renders parallel simulations with a block-to-CPU number
ratio of less than 4 very inefficient.
An alleviation of this problem can be found in the block-splitting tech-
nique, where the total number of blocks is artificially increased by splitting
them with an existing utility programme. The finer granularity of the domain
Acknowledgements
A part of this work was carried out under the HPC-EUROPA project (RII3-
CT-2003-506079), with the support of the European Community – Research
Infrastructure Action under the FP6 “Structuring the European Research
Area” Programme. The hospitality of Prof. U. Rist and his group at the In-
stitute of Aero and Gas Dynamics (University of Stuttgart) is greatly appre-
ciated. We thank Stefan Haberhauer (NEC HPC Europe) and Peter Kunszt
(CSCS) for fruitful discussions, as well as CSCS and HLRS staff for their
support regarding our technical inquiries.
High Performance Computing Towards Silent Flows
Summary. The flow field and the acoustic field of various jet flows and a high-lift
configuration consisting of a deployed slat and a main wing are numerically ana-
lyzed. The flow data, which are computed via large-eddy simulations (LES), provide the distributions that enter the source terms of the acoustic perturbation equations (APE), which are solved to compute the acoustic near field. The investigation shows that the core flow has a major impact on the radiated jet noise. In particular, heating the inner stream generates substantial noise towards the sideline of the jet, whereas the Lamb vector is the dominant noise source for the downstream noise. Furthermore, the analysis of the airframe noise shows that the interaction of the slat trailing-edge shear layer with the slat gap flow generates higher vorticity than the main airfoil trailing-edge shear layer. Thus, the slat gap is the more dominant noise region for an aircraft approaching an airport.
1 Introduction
In recent years the sound emitted by aircraft has become an increasingly important factor in the development process. This is due to the predicted growth of air traffic as well as stricter statutory provisions. The generated
sound can be assigned to engine and airframe noise, respectively. The present
paper deals with two specific noise sources, the jet noise and the slat noise.
Jet noise constitutes the major noise source for aircraft during take-off.
In the last decade various studies [12, 25, 6, 5] focused on the computation of
unheated and heated jets with emphasis on single jet configurations. Although
extremely useful theories, experiments, and numerical solutions exist in the
literature, the understanding of subsonic jet noise mechanisms is far from
perfect. It is widely accepted that there exist two distinct mechanisms, one
is associated with coherent structures radiating in the downstream direction
and the other one is related to small scale turbulence structures contributing
to the high frequency noise normal to the jet axis. Compared with single jets,
coaxial jets with round nozzles can develop flow structures of very different
topology, depending on environmental and initial conditions and, of course,
on the temperature gradient between the inner or core stream and the bypass
stream. Not much work has been done on such jet configurations and as
such there are still many open questions [3]. For instance, how is the mixing
process influenced by the development of the inner and outer shear layers? What is the impact of the temperature distribution on the mixing and on the noise generation mechanisms? The current investigation contrasts the flow
field and acoustic results of a high Reynolds number cold single jet to a more
realistic coaxial jet configuration including the nozzle geometry and a heated
inner stream.
During the landing approach, when the engines are near idle condition,
the airframe noise becomes important. The main contributors to airframe noise are high-lift devices, such as slats and flaps, and the landing gear. The
paper focuses here on the noise generated by a deployed slat.
The present study applies a hybrid method to predict the noise from
turbulent jets and a deployed slat. It is based on a two-step approach using
a large-eddy simulation (LES) for the flow field and approximate solutions
of the acoustic perturbation equations (APE) [10] for the acoustic field. The
LES comprises the neighborhood of the dominant noise sources such as the
potential cores and the spreading shear layers for the jet noise and the slat
cove region for the airframe noise. In a subsequent step, the sound field is
calculated for the near field, which covers a much larger area than the LES
source domain. Compared to direct methods the hybrid approach possesses
the potential to be more efficient in many aeroacoustical problems since it
exploits the different length scales of the flow field and the acoustic field. To
be more precise, in subsonic flows the characteristic acoustic length scale is
definitely larger than that of the flow field. Furthermore, the discretization
scheme of the acoustic solver is designed to mimic the physics of the wave
operator.
The paper is organized as follows. The governing equations and the nu-
merical procedure of the LES/APE method are described in Sect. 2. The
simulation parameters of the cold single jet and the heated coaxial jet are
given in the first part of Sect. 3 followed by the description of the high-lift
configuration. The results for the flow field and the acoustical field are dis-
cussed in detail in Sect. 4. In each section, the jet noise and the slat noise
problem are discussed subsequently. Finally, in Sect. 5, the findings of the
present study are summarized.
2 Numerical Methods
The computations of the flow fields are carried out by solving the unsteady
compressible three-dimensional Navier-Stokes equations with a monotone-integrated large-eddy simulation (MILES) [7]. The block-structured solver is optimized for vector computers and parallelized by using the Message Passing Interface (MPI). The numerical solution of the Navier-Stokes equations is based on a vertex-centered finite-volume scheme, in which the convective fluxes are computed by a modified AUSM method of second-order accuracy. For the viscous terms a central discretization, also of second-order accuracy, is applied. Meinke et al. showed in [21] that the obtained spatial precision is sufficient compared to a sixth-order method. The temporal integration from time level n to n + 1 is done by an explicit 5-stage Runge-Kutta technique, where the coefficients are optimized for maximum stability and lead to a second-order accurate time approximation. For low Mach number flows a pre-
conditioning method in conjunction with a dual-time stepping scheme can be
used [2]. Furthermore, a multi-grid technique is implemented to accelerate the
convergence of the dual-time stepping procedure.
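For illustration, the sketch below advances a semi-discrete system du/dt = R(u) with an explicit five-stage Runge-Kutta scheme of the low-storage form u_k = u_n + alpha_k * dt * R(u_{k-1}). The coefficients used here are generic Jameson-type values, not the stability-optimized coefficients of the actual solver, which are not reproduced in this paper.

import numpy as np

# Generic Jameson-type stage coefficients (placeholders, not the solver's
# stability-optimized values).
ALPHA = (0.25, 1.0 / 6.0, 0.375, 0.5, 1.0)

def rk5_step(u, dt, residual):
    """One explicit five-stage Runge-Kutta step for du/dt = residual(u).
    Every stage restarts from the old time level, so only two solution
    arrays are needed (low-storage scheme)."""
    u_old = u.copy()
    u_stage = u
    for alpha in ALPHA:
        u_stage = u_old + alpha * dt * residual(u_stage)
    return u_stage

# Model problem du/dt = -u integrated to t = 1:
u = np.ones(8)
for _ in range(10):
    u = rk5_step(u, dt=0.1, residual=lambda v: -v)
print(u[0])   # close to exp(-1) = 0.3679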
The set of acoustic perturbation equations (APE) used in the present simu-
lations corresponds to the APE-4 formulation proposed in [10]. It is derived
by rewriting the complete Navier-Stokes equations as
$$\frac{\partial p'}{\partial t} + \bar{c}^{2}\,\nabla\cdot\left(\bar{\rho}\,\mathbf{u}' + \bar{\mathbf{u}}\,\frac{p'}{\bar{c}^{2}}\right) = \bar{c}^{2} q_c \qquad (1)$$
$$\frac{\partial \mathbf{u}'}{\partial t} + \nabla\left(\bar{\mathbf{u}}\cdot\mathbf{u}'\right) + \nabla\frac{p'}{\bar{\rho}} = \mathbf{q}_m . \qquad (2)$$
The right-hand side terms constitute the acoustic sources
$$q_c = -\nabla\cdot\left(\rho'\mathbf{u}'\right)' + \frac{\bar{\rho}}{c_p}\,\frac{\bar{D}s'}{\bar{D}t} \qquad (3)$$
$$\mathbf{q}_m = -\left(\boldsymbol{\omega}\times\mathbf{u}\right)' + T'\,\nabla\bar{s} - s'\,\nabla\bar{T} - \nabla\left(\frac{(\mathbf{u}')^{2}}{2}\right)' + \left(\frac{\nabla\cdot\boldsymbol{\tau}}{\rho}\right)' . \qquad (4)$$
3 Computational Setup
3.1 Jet
The quantities uj and cj are the jet nozzle exit velocity and sound speed,
respectively, and Tj and T∞ the temperature at the nozzle exit and in the
ambient fluid. Unlike the single jet, the simulation parameters of the coax-
ial jet have additional indices ”p” and ”s” indicating the primary and sec-
ondary stream. An isothermal turbulent single jet at Mj = uj /c∞ = 0.9 and
Re = 400, 000 is simulated. These parameters match with previous investiga-
tions performed by a direct noise computation via an acoustic LES by Bogey
and Bailly [6] and a hybrid LES/Kirchhoff method by Uzun et al. [25]. The
chosen Reynolds number can be regarded as a first step towards the sim-
ulation of real jet configurations. Since the flow parameters match those of
various studies, a good database exists to validate our hybrid method for such
high Reynolds number flows. The inflow condition at the virtual nozzle exit
is given by a hyperbolic-tangent profile for the mean flow, which is seeded with random velocity fluctuations in the shear layers in the form of a vortex ring [6] to provide turbulent fluctuations. Instantaneous LES data are sampled over a period of T̄ = 3000 · ∆t · uj /R = 300.0, corresponding to approximately 6 times the time interval an acoustic wave needs to travel through the computational domain. Since the source data is cyclically fed into the acoustic simulation, a modified Hanning windowing [20] has been performed to avoid spurious noise generated by discontinuities in the source term distribution. More details on the computational setup can be found in Koh et al. [17].
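Because a finite sampling period is replayed periodically as source input of the acoustic simulation, the source amplitude has to fade in and out smoothly so that no artificial jump occurs between the last and the first sample. The sketch below applies a simple Hann-shaped taper to a sampled source time series; it is a generic illustration and does not reproduce the specific modified Hanning windowing of [20].

import numpy as np

def taper_cyclic_sources(q, ramp_fraction=0.1):
    """Taper a cyclically replayed source time series q[t, ...] so that it
    starts and ends at zero. Only the first and last `ramp_fraction` of the
    samples are modified with a Hann-shaped ramp (generic sketch, not the
    modified windowing of [20])."""
    n = q.shape[0]
    n_ramp = max(1, int(ramp_fraction * n))
    w = np.ones(n)
    ramp = 0.5 * (1.0 - np.cos(np.pi * np.arange(n_ramp) / n_ramp))
    w[:n_ramp] = ramp          # fade in
    w[-n_ramp:] = ramp[::-1]   # fade out
    return q * w.reshape((n,) + (1,) * (q.ndim - 1))

# Example: 3000 samples of a source term at 64 points.
q = np.random.default_rng(0).standard_normal((3000, 64))
q_win = taper_cyclic_sources(q)
print(abs(q_win[0]).max(), abs(q_win[-1]).max())   # both 0 at the seam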
The flow parameters of the coaxial jet comprise a velocity ratio of the sec-
ondary and primary jet exit velocity of λ = ujs /ujp = 0.9, a Mach number 0.9
for the secondary and 0.877 for the primary stream, and a temperature ratio
of Tjs /Tjp = 0.37. An overview of the main parameter specifications is given
in Table 1. To reduce the computational costs the inner part of the nozzle was
not included in the simulation, but a precursor RANS simulation was set up
to generate the inflow profiles for the LES. For the coaxial jet instantaneous
data are sampled over a period of T̄s = 2000 · ∆t · c∞ /rs = 83. This period
corresponds to roughly three times the time interval an acoustic wave needs
to propagate through the computational domain. As in the single jet compu-
tation, the source terms are cyclically inserted into the acoustic simulation.
The grid topology and in particular the shape of the short cowl nozzle are shown in Fig. 1. The computational grid has about 22 · 10⁶ grid points.
Large-Eddy Simulation
Fig. 1. The grid topology close to the nozzle tip is ”bowl” shaped, i.e., grid lines
from the primary nozzle exit end on the opposite side of the primary nozzle. Every
second grid point is shown.
Fig. 2. LES grid of the high-lift configuration. Every 2nd grid point is depicted.
Fig. 3. LES grid in the slat cove area of the high-lift configuration. Every 2nd grid point is depicted.
Acoustic Simulation
Fig. 4. APE grid of the high-lift configuration. Every 2nd grid point is depicted.
The results of the present study are divided into two parts. First, the flow
field and the acoustic field of the cold single jet and the heated coaxial jet
will be discussed concerning the mean flow properties, turbulent statistics and
acoustic signature in the near field. To relate the findings of the coaxial jet
to the single jet, the flow field and acoustic field of which has been validated
in current studies [17] against the experimental results by [27] and numerical
results by Bogey and Bailly [6], comparisons to the flow field and acoustic
near field properties of the single jet computation are drawn. This part also
comprises a discussion on the results of the acoustic near field concerning
the impact by the additional source terms of the APE system, which are
related to heating effects. The second part describes in detail the airframe
noise generated by the deployed slat and the main wing element. Acoustic
near field solutions are discussed on the basis of the LES solution alone and
the hybrid LES/APE results.
4.1 Jet
Large-Eddy Simulation
In the following the flow field of the single jet is briefly discussed to show that
the relevant properties of the high Reynolds number jet are well computed
when compared with jets at the same flow condition taken from the literature.
In Fig. 5 the half-width radius shows an excellent agreement with the LES by
Bogey and Bailly [6] and the experiments by Zaman [27] indicating a poten-
tial core length of approximately 10.2 radii. The jet evolves downstream of the
Fig. 5. Jet half-width radius in comparison with numerical [6] and experimental results [27].
Fig. 6. Reynolds stresses √(u′u′/u_j²) normalized by the nozzle exit velocity in comparison with numerical [6] and experimental [4, 19] results.
Fig. 7. Reynolds stresses √(v′v′/u_j²) normalized by the nozzle exit velocity in comparison with numerical [6] results.
Fig. 8. Reynolds stresses √(u′u′/u_j²) normalized by the nozzle exit velocity over the jet half-width radius at x/R = 22 in comparison with numerical [6] results.
Fig. 11. Mean flow development of the coaxial jet in parallel planes perpendicular to the jet axis in comparison with experimental results.
Fig. 12. Axial velocity profiles for cold single jet and heated coaxial jet.
between the inner and outer stream, the so-called primary mixing region, is
generally very high. This is especially noticeable in Fig. 10 with the growing
shear layer instability separating the two streams. Spatially growing vortical
structures generated in the outer shear layer seem to affect the inner shear
layer instabilities further downstream. This finally leads to the collapse and
break-up near the end of the inner core region.
Figure 11 shows mean flow velocity profiles based on the secondary jet exit
velocity of the coaxial jet at different axial cross sections ranging from
x/RS = 0.0596 to x/Rs = 14.5335 and comparisons to experimental results.
A good agreement is obtained, in particular in the near-nozzle region; however, the numerical jet breaks up earlier than in the experiments, resulting in a faster mean velocity decay on the center line downstream of the potential core.
The following three Figs. 12 to 14 compare mean velocity, mean density, and
Reynolds stress profiles of the coaxial jet to the single jet in planes normal to
the jet axis and equally distributed in the streamwise direction from x/Rs = 1
to x/Rs = 21. In the initial coaxial jet exit region the mixing of the primary
shear layer takes place. During the mixing process, the edges of the initially
sharp density profile are smoothed. Further downstream the secondary jet
shear layers start to break up causing a rapid exchange and mixing of the
fluid in the inner core. This can be seen by the fast decay of the mean density
profile in Fig. 13.
During this process, the two initially separated streams merge and show at
x/Rs = 5 a velocity profile with only one inflection point roughly at r/Rs =
0.5. Unlike the density profile, the mean axial velocity profile decreases only
slowly downstream of the primary potential core. In the self-similar region the
velocity decay and the spreading of the single and the coaxial jet is similar.
Fig. 13. Density profiles for cold single jet and heated coaxial jet.
Fig. 14. Reynolds stress profiles for cold single jet and heated coaxial jet.
The break-up process enhances the mixing process yielding higher levels of
turbulent kinetic energy on the center line. The axial velocity fluctuations of
the coaxial jet start to increase at x/Rs = 1 in the outer shear layer and reach high levels on the center line at x/Rs = 9, while the single jet axial fluctuations do not start to develop before x/rs = 5 and do so primarily in the shear layer but not on the center line. This difference is caused by the density and
entropy gradient, which is the driving force of this process. This is confirmed
by the mean density profiles. These profiles are redistributed beginning at
x/rs = 1 until they take on a uniform shape at approx. x/rs = 9. When this
process is almost finished the decay of the mean axial velocity profile sets in.
This redistribution evolves much slower over several radii in the downstream
direction.
Acoustic Simulation
The presentation of the jet noise results is organized as follows. First, the
main characteristics of the acoustic field of the single jet from previous noise computations [13, 17] are summarized, against which the present hybrid method has been successfully validated. Then, the acoustic fields for the single
and coaxial jet are discussed. Finally, the impact of different source terms on
the acoustic near field is presented.
Unlike the direct acoustic approach by an LES or a DNS, hybrid methods based on an acoustic analogy allow one to separate different contributions to
the noise field. These noise mechanisms are encoded in the source terms of
the acoustic analogy and can be simulated separately exploiting the linearity
of the wave operator. Previous investigations of the single jet noise demon-
strated the fluctuating Lamb vector to be the main source term for cold jet
noise problems. An acoustic simulation with the Lamb vector only was per-
formed and the sound field at the same points was computed and compared
with the solution containing the complete source term.
The overall acoustic field of the single and coaxial jet is shown in
Figs. 15 and 16 by instantaneous pressure contours in the near field, i.e., out-
side the source region, and contours of the Lamb vector in the acoustic source
region. The acoustic field is dominated by long pressure waves of low frequency
radiating in the downstream direction. The dashed line in Fig. 15 indicates
the measurement points at a distance of 15 radii from the jet axis based on
the outer jet radius at which the acoustic data have been sampled. Fig. 17
shows the acoustic near field signature generated by the Lamb vector only in comparison with a highly resolved LES (in terms of the number of grid points) and the direct noise computation by Bogey and Bailly. The downstream noise
is well captured by the LES/APE method and is consistent with the highly
resolved LES results. The increasing deviation of the overall sound pressure
level at obtuse angles with respect to the jet axis is due to missing contribu-
tions from nonlinear and entropy source terms. A detailed investigation can
be found in Koh et al.[17]. Note that the results by Bogey and Bailly are 2
to 3 dB too high compared to the present LES and LES/APE distributions.
Since different grids (Cartesian grids by Bogey and Bailly and boundary fit-
ted grids in the present simulation) and different numerical methods for the
compressible flow field have been used, resulting in varying boundary conditions, e.g., the resolution of the initial momentum thickness, differences
in the sensitive acoustic field are to be expected. The findings of the hybrid
LES/Kirchhoff approach by Uzun et al. [25] do also compare favorably with
the present solutions.
The comparison between the near field noise signature generated by the Lamb
vector only of the single and the coaxial jet at the same measurement line
shows almost the same characteristic slope and a similar peak value location
along the jet axis. This is surprising, since the flow field development of both
jets including mean flow and turbulent intensities differed strongly.
Fig. 15. Pressure contours of the single jet by LES/APE generated by the Lamb vector only. Dashed line indicates location of observer points to compute the acoustic near field.
Fig. 16. Pressure contours outside the source domain and the y-component of the Lamb vector inside the source domain of the coaxial jet.
Fig. 17. Overall sound pressure level (OASPL) in dB for r/R = 15. Comparison with data from Bogey and Bailly [6].
Fig. 18. Comparison of the acoustic field between the single jet and the coaxial jet generated by the Lamb vector only. Comparison with data from Bogey and Bailly [6].
Finally, Figs. 19 and 20 show the predicted far field directivity at 60 radii from
the jet axis by the Lamb vector only and by the Lamb vector and the entropy
source terms, respectively, in comparison with numerical and experimental
results at the same flow condition. To obtain the far field noise signature, the
near field results have been scaled to the far field by the 1/r-law assuming the
center of directivity at x/Rs = 4. The acoustic results generated by the Lamb
vector only match the experimental results very well at angles lower than 40 degrees. At larger angles from the jet axis the OASPL falls off more rapidly.
Fig. 19. Directivity at r/Rs = 60 generated by the Lamb vector only. Comparison with experimental and numerical results.
Fig. 20. Directivity at r/Rs = 60 generated by the Lamb vector and entropy sources. Comparison with experimental and numerical results.
This deviation is due to the missing contributions from the entropy source terms. When those source terms are included in the computation, the LES/APE results are in good agreement with the experimental results up to angles of 70 degrees.
That observation confirms previous studies [14] on the influence of different
source terms. To be more precise, the Lamb vector radiates dominantly in the
downstream direction, whereas the entropy sources radiate to obtuse angles
from the jet axis.
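The near-to-far-field extrapolation mentioned above exploits the spherical spreading of sound: pressure amplitudes decay like 1/r, so the OASPL drops by 20 log10(r2/r1) dB between two distances from the assumed centre of directivity. The following minimal sketch applies such a scaling; it assumes pure geometric spreading and is not the complete post-processing chain used for Figs. 19 and 20.

import numpy as np

def oaspl_far_field(oaspl_near_db, x_obs, r_near, r_far, x_center=4.0):
    """Scale OASPL values sampled on a near-field line at distance `r_near`
    from the jet axis to a far-field radius `r_far`, both measured from an
    assumed centre of directivity at x = x_center (all lengths in jet radii).
    Pure 1/r spreading: the level drops by 20*log10(r_far / r_1)."""
    r_1 = np.sqrt((np.asarray(x_obs) - x_center) ** 2 + r_near ** 2)
    return np.asarray(oaspl_near_db) - 20.0 * np.log10(r_far / r_1)

# Placeholder example: a flat 120 dB near-field level at r/Rs = 15 scaled
# to r/Rs = 60 for a few axial observer positions.
x_obs = np.linspace(0.0, 30.0, 7)
print(oaspl_far_field(np.full(7, 120.0), x_obs, r_near=15.0, r_far=60.0))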
Fig. 21. Grid resolution near the wall (Δx+, Δy+·100 and Δz+ over x/c): suction side of the main wing.
Fig. 22. Grid resolution near the wall (Δx+, Δy+·100 and Δz+ over x/c): pressure side of the main wing.
Fig. 23. Grid resolution near the wall (Δx+, Δy+·100 and Δz+ over the wall point index): suction side of the slat.
Fig. 24. Grid resolution near the wall (Δx+, Δy+·100 and Δz+ over the wall point index): slat cove.
The Mach number distribution and some selected streamlines of the time and spanwise averaged flow field are presented in Fig. 25. Apart from the two stagnation points, one can see the area with the highest velocity on the suction side shortly downstream of the slat gap. Also recognizable is a large recirculation domain which fills the whole slat cove area. It is bounded by a shear layer which develops from the slat cusp and reattaches close to the end of the slat trailing edge.
The pressure coefficient cp computed by the time averaged LES solution
is compared in Fig. 26 with RANS results [9] and experimental data. The
measurements were carried out at DLR Braunschweig in an anechoic wind
tunnel with an open test section within the national project FREQUENZ.
These experiments are compared to numerical solutions which mimic uniform
freestream conditions. Therefore, even with the correction of the geometric
angle of attack of 23◦ in the measurements to about 13◦ in the numerical
solution no perfect match between the experimental and numerical data can
be expected.
Fig. 25. Time and spanwise averaged Mach number distribution and some selected streamlines.
Fig. 26. Comparison of the pressure coefficient cp (plotted as −cp over x/c) between LES, RANS [9] and experimental data [18].
Fig. 27. λ2 contours in the slat region.
Fig. 28. λ2 contours in the slat region.
The distribution of the time and spanwise averaged turbulent kinetic energy $k = \frac{1}{2}\left(\overline{u'^2} + \overline{v'^2} + \overline{w'^2}\right)$ is depicted in Fig. 30. One can clearly identify the
shear layer and the slat trailing edge wake. The peak values occur, in agree-
ment with [8], in the reattachment area. This corresponds to the strong vor-
tical structures in this area evidenced in Fig. 28.
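A minimal sketch of this averaging, assuming the LES velocity components are available as arrays sampled over time (axis 0) and the homogeneous spanwise direction (axis 1), could read:

import numpy as np

def turbulent_kinetic_energy(u, v, w, axes=(0, 1)):
    """Time- and spanwise-averaged turbulent kinetic energy
    k = 0.5 * (<u'u'> + <v'v'> + <w'w'>), with <.> averaging over the
    sampling time (axis 0) and the spanwise direction (axis 1)."""
    k = np.zeros(u.shape[2:])
    for comp in (u, v, w):
        fluct = comp - comp.mean(axis=axes, keepdims=True)
        k += 0.5 * (fluct ** 2).mean(axis=axes)
    return k

# Synthetic example: 200 samples, 16 spanwise points, 32x32 plane.
rng = np.random.default_rng(1)
u, v, w = (rng.standard_normal((200, 16, 32, 32)) for _ in range(3))
print(turbulent_kinetic_energy(u, v, w).shape)   # (32, 32)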
Acoustic Simulation
Fig. 33. Pressure contours based on Fig. 34. Pressure contours based on
the LES/APE solution. the LES solution.
Fig. 35. Power spectral density for a Fig. 36. Directivities for a circle with
point at x=-1.02 and y=1.76. R = 1.5 based on the APE solution.
The slat source covers the part from the leading edge of the slat through 40%
chord of the main wing. The remaining part belongs to the main wing trailing
edge source. An embedded boundary formulation is used to ensure that no
artificial noise is generated [22]. It is evident that the sources located near
the slat cause a stronger contribution to the total sound field than the main
wing trailing edge sources. This behavior corresponds to the distribution of
the Lamb vector.
5 Conclusion
In the present paper we successfully computed the dominant aeroacoustic
noise sources of aircraft during take-off and landing, that is, the jet noise and
the slat noise by means of a hybrid LES/APE method. The flow parameters
were chosen to match current industrial requirements such as nozzle geome-
try, high Reynolds numbers, heating effects etc. The flow field and acoustic
field were computed in good agreement with experimental results, showing that the relevant noise generation mechanisms are correctly captured.
The dominant source term in the APE formulation for the cold single jet has
been shown to be the Lamb vector, while for the coaxial jets additional source
terms of the APE-4 system due to heating effects must be taken into account.
These source terms are generated by temperature and entropy fluctuations
and by heat release effects and radiate at obtuse angles to the far field. The
comparison between the single and coaxial jets revealed differences in the flow
field development; however, the characteristics of the acoustic near field signature were hardly changed. The present investigation shows that the noise
levels in the near field of the jet are not directly connected to the statistics of
the Reynolds stresses.
The analysis of the slat noise study shows that the interaction of the slat trailing-edge shear layer with the slat gap flow generates higher vorticity than the main airfoil trailing-edge shear layer. Thus, the slat gap is the dominant noise
source region. The results of the large-eddy simulation are in good agreement
with data from the literature. The acoustic analysis shows the correlation be-
tween the areas of high vorticity, especially somewhat downstream of the slat
trailing edge and the main wing trailing edge, and the emitted sound.
Acknowledgments
The jet noise investigation was funded by the Deutsche Forschungsgemein-
schaft and the Centre National de la Recherche Scientifique (DFG-CNRS)
in the framework of the subproject ”Noise Prediction for a Turbulent Jet”
of the research group 508 “Noise Generation in Turbulent flows”. The slat
noise study was funded by the national project FREQUENZ. The APE solu-
tions were computed with the DLR PIANO code the development of which
References
1. LESFOIL: Large Eddy Simulation of Flow Around a High Lift Airfoil, chapter
Contribution by ONERA. Springer, 2003.
2. N. Alkishriwi, W. Schröder, and M. Meinke. A large-eddy simulation method
for low mach number flows using preconditioning and multigrid. Computers and
Fluids, 35(10):1126–1136, 2006.
3. N. Andersson, L.-E. Eriksson, and L. Davidson. LES prediction of flow and
acoustical field of a coaxial jet. Paper 2005-2884, AIAA, 2005.
4. V. Arakeri, A. Krothapalli, V. Siddavaram, M. Alkislar, and L. Lourenco. On
the use of microjets to suppress turbulence in a Mach 0.9 axisymmetric jet. J.
Fluid Mech., 490:75–98, 2003.
5. D. J. Bodony and S. K. Lele. Jet noise prediction of cold and hot subsonic jets
using large-eddy simulation. CP 2004-3022, AIAA, 2004.
6. C. Bogey and C. Bailly. Computation of a high Reynolds number jet and its
radiated noise using large eddy simulation based on explicit filtering. Computers
and Fluids, 35:1344–1358, 2006.
7. J. P. Boris, F. F. Grinstein, E. S. Oran, and R. L. Kolbe. New insights into
large eddy simulation. Fluid Dynamics Research, 10:199–228, 1992.
8. M. M. Choudhari and M. R. Khorrami. Slat cove unsteadiness: Effect of 3d flow
structures. In 44st AIAA Aerospace Sciences Meeting and Exhibit. AIAA Paper
2006-0211, 2006.
9. M. Elmnefi. Private communication. Institute of Aerodynamics, RWTH Aachen
University, 2006.
10. R. Ewert and W. Schröder. Acoustic perturbation equations based on flow de-
composition via source filtering. J. Comput. Phys., 188:365–398, 2003.
11. R. Ewert, Q. Zhang, W. Schröder, and J. Delfs. Computation of trailing edge
noise of a 3d lifting airfoil in turbulent subsonic flow. AIAA Paper 2003-3114,
2003.
12. J. B. Freund. Noise sources in a low-reynolds-number turbulent jet at mach 0.9.
J. Fluid Mech., 438:277 – 305, 2001.
13. E. Gröschel, M. Meinke, and W. Schröder. Noise prediction for a turbulent jet
using an les/caa method. Paper 2005-3039, AIAA, 2005.
14. E. Gröschel, M. Meinke, and W. Schröder. Noise generation mechanisms in
single and coaxial jets. Paper 2006-2592, AIAA, 2006.
15. F. Q. Hu, M. Y. Hussaini, and J. L. Manthey. Low-dissipation and low-dispersion
runge-kutta schemes for computational acoustics. J. Comput. Phys., 124(1):177–
191, 1996.
16. M. Israeli and S. A. Orszag. Approximation of radiation boundary conditions.
Journal of Computational Physics, 41:115–135, 1981.
17. S. Koh, E. Gröschel, M. Meinke, and W. Schröder. Numerical analysis of sound
sources in high reynolds number single jets. Paper 2007-3591, AIAA, 2007.
18. A. Kolb. Private communication. FREQUENZ, 2006.
19. J. Lau, P. Morris, and M. Fisher. Measurements in subsonic and supersonic free
jets using a laser velocimeter. J. Fluid Mech., 193(1):1–27, 1979.
1 Basic equations
$$\frac{\partial U_i}{\partial x_i} = 0 \qquad (1)$$
$$\frac{\partial U_i}{\partial t} + (U_j - U_G)\,\frac{\partial U_i}{\partial x_j} = -\frac{1}{\rho}\frac{\partial p}{\partial x_i} + \frac{\partial}{\partial x_j}\Biggl[\,\nu\left(\frac{\partial U_i}{\partial x_j} + \frac{\partial U_j}{\partial x_i}\right) - \underbrace{\overline{u_i' u_j'}}_{\text{Reynolds stresses}}\,\Biggr] . \qquad (2)$$
M ü + Du̇ + Ku = f , (3)
The first mesh update method discussed here interpolates between the nodal distances to the moving and the fixed boundary to compute the new nodal position after a displacement step of the moving boundary. The simplest approach is to use a linear interpolation value 0 ≤ κ ≤ 1. Here we use a modification of the parameter κ = |s| / (|r| + |s|) proposed by Kjellgren and Hyvärinen [KH98].
$$\tilde{\kappa} = \begin{cases} 0, & \kappa < \delta \\ \tfrac{1}{2}\left(\cos\!\left(\left(1 - \tfrac{\kappa-\delta}{1-2\delta}\right)\pi\right) + 1\right), & \delta \le \kappa \le 1-\delta \\ 1, & 1-\delta < \kappa \le 1 \end{cases} \qquad (4)$$
This parameter is found from the nearest distance to the moving boundary and the distance to the fixed boundary in the opposite direction.
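A minimal sketch of this interpolation-based mesh update is given below. For each interior node, κ is formed from the distance s to the moving boundary and the distance r to the fixed boundary, smoothed to κ̃ according to (4), and the node then follows the fraction (1 − κ̃) of the boundary displacement. The sign convention (nodes at the moving boundary follow it fully, nodes at the fixed boundary stay put) is an assumption of this sketch.

import math

def kappa_tilde(kappa, delta=0.1):
    """Cosine blending of the interpolation parameter, cf. Eq. (4)."""
    if kappa < delta:
        return 0.0
    if kappa > 1.0 - delta:
        return 1.0
    return 0.5 * (math.cos((1.0 - (kappa - delta) / (1.0 - 2.0 * delta)) * math.pi) + 1.0)

def update_node(node_xyz, dist_moving, dist_fixed, boundary_disp, delta=0.1):
    """Move an interior node by a fraction of the moving-boundary
    displacement (sketch; sign convention as stated above)."""
    kappa = abs(dist_moving) / (abs(dist_fixed) + abs(dist_moving))
    factor = 1.0 - kappa_tilde(kappa, delta)
    return [x + factor * d for x, d in zip(node_xyz, boundary_disp)]

# A node halfway between the boundaries follows half the displacement:
print(update_node([0.0, 0.0, 0.0], 1.0, 1.0, [0.0, 0.2, 0.0]))   # y = 0.1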
To use this approach for parallel computations, the boundaries have to be available on all processors. Since a graph-based domain decomposition is used here, this is not implicitly given. Hence, the boundary exchange has to be implemented additionally. Usually the number of boundary nodes is relatively small, i.e. the additional communication time and overhead are negligible.
Another approach is a pseudo-structural one, based on the lineal springs introduced by Batina [BAT89]. In order to stabilise the element angles
this method is combined with the torsional springs approach by Farhat et al.
[FAR98-1, DEG02] for two- and three-dimensional problems. Since we use
quadrilateral and hexahedral elements the given formulation for the torsional
springs is enhanced for these element types.
Here, a complete matrix is built to solve the elasto-static problem of the
dynamic mesh. The stiffness matrix built for the grid smoothing has the same
graph as the matrix of the implicit CFD problem. Hence, the whole CFD-solver structure including the memory can be reused and the matrix graph has to be computed only once for both fields, the fluid and the structure. This means that there is almost no extra memory needed for the moving mesh. Furthermore, the parallelised and optimised, preconditioned BICGStab(2) performs well on cache and vector CPUs for the solution of the flow problem,
which brings a good performance for the solution of the dynamic mesh equa-
tions, as well. Nevertheless, the stiffness matrix has to be computed and solved
for every mesh update step, i.e. usually every time step. The overall computa-
tion time for a three-dimensional grid shows that the percentage of computing
time needed for the mesh update compared to the total time is independent of
the machine and number of CPUs used, see Lippold [LI06]. This means that
the parallel performance is of the same quality as the one of the flow solver.
Regarding the usage of the torsional springs, two issues have to be addressed. Computing the torsional part of the element matrices involves a number of nested loops and costs a considerable amount of computational time; the respective routine requires approximately twice the time of the lineal part. Moreover, since the lineal springs use the edge length to determine the stiffness, whereas the torsional springs use the element area or volume, respectively, the entries contributed to the matrix may differ considerably in magnitude. Hence, with the additional torsional springs the numbers shown above change for these two reasons. Furthermore, the torsional contribution yields higher values on the off-diagonal elements of the matrix, which deteriorates its condition. The smoother therefore needs more iterations to reduce the residual to a given value, which leads to additional computational time. Furthermore, the element quality might be unsatisfactory if the matrix entries coming from the torsional springs are dominant, meaning that the smoothed grid fulfills the angle criterion but not the nodal distance criterion.
In order to reduce this disadvantage, the matrix entries of the torsional springs
have to be scaled to the same size as the contribution of the lineal springs.
4 Results
First results are obtained with an inclined (10°) NACA0012 wing in 3D. The wing is clamped at one end and free at the other end. The fluid used for these simulations is water with a density of ρ = 1000 kg/m³ at a flow velocity of v∞ = 10.0 m/s. The interpolation method, presented above, shows a good
performance and yields a good mesh quality for this application.
The fluid grid contains about 100000 nodes. Hence, two processors are suf-
ficient to compute the flow within an acceptable time-frame. For the structural
model a grid consisting of one domain with 1500 nodes and linear deformation
elements is used.
Figure 3 shows the original and the deformed shape including the surface
pressure on the wing in static equilibrium of the fluid-structure system. High
pressure is marked with red and low pressure with blue shadings.
Furthermore, simulation results for the flow around the tidal current tur-
bine runner, see Fig. 4, show a good agreement with available measurements
of a reduced runner model. The computational grid used for these simulations
consists of 2 Million nodes. Constant flow velocities were used at the domain
boundaries.
Acknowledgements
Fig. 4. Pressure distribution (blue -25000 Pa, red 12000 Pa) and streamlines.
References
[FP02] Ferziger, J.H., Perić, M.: Computational Methods for Fluid Dynamics
(third Ed.). Springer (2002).
[HU81] Hughes, T.J.R., Liu, W.K., Zimmermann, T.K.: Lagrangian-Eulerian Fi-
nite Element Formulation for Viscous Flows. Computer Methods in Ap-
plied Mechanics and Engineering. 29, 329-349 (1981)
[ZI89] Zienkiewicz, O.C., Taylor, R.L.: The Finite Element Method (Vol. I).
McGraw-Hill (1989)
[GR99] Gresho, P.M., Sani, R.L.: Incompressible Flow and the Finite Element
Method (Vol. I). John Wiley & Sons (1999)
[MAI02] Maihoefer, M.: Effiziente Verfahren zur Berechnung dreidimensionaler
Stroemungen mit nichtpassenden Gittern (PhD-Thesis). University of
Stuttgart, (2002)
[RU89] Ruprecht, A.: Finite Elemente zur Berechnung dreidimensionaler turbu-
lenter Stroemungen in komplexen Geometrien (PhD-Thesis). University
of Stuttgart (1989)
[VVO92] van der Vorst, H.A.: BI-CGSTAB: A fast and smoothly converging variant
of BI-CG for the solution of nonsymmetric linear systems. SIAM Journal
of Scientific Stat. Computing, 13, 631-644, (1992)
[BAT89] Batina, J.T.: Unsteady Euler airfoil solutions using unstructured dynamic
meshes. AIAA Paper No. 89-0115, AIAA 27th Aerospace Sciences Meet-
ing, Reno, Nevada (1989)
[FAR98-1] Farhat, C., Degand, C., Koobus, B., Lesoinne, M.: Torsional springs for
two-dimensional dynamic unstructured fluid meshes. Computational Meth-
ods in Applied Mechanics and Engineering, 163, 231-245 (1998)
[DEG02] Degand, C., Farhat, C.: A three-dimensional torsional spring analogy
method for unstructured dynamic meshes. Computers and Structures,
80, 305-316 (2002)
[FAR98-2] Farhat, C., Lesoinne, M., le Tallec, P.: Load and motion transfer algo-
rithms for fluid-structure interaction problems with non-matching inter-
faces. Computer Methods in Applied Mechanics and Engineering, 157,
95-114 (1998)
[LI06] Lippold, F., Fluid-structure interaction in an axial fan. HPC-Europa re-
port (2006)
[KH98] Kjellgren, P., Hyvärinen, J.: An Arbitrary Lagrangian-Eulerian Finite
Element Method. Computational Mechanics, 21, 81-90 (1998)
Coupled Problems in Computational Modeling
of the Respiratory System
1 Introduction
Mechanical ventilation of the human lung plays a significant role in medicine,
especially in case of patients with acute lung diseases such as ALI (acute lung
injury) and ARDS (acute respiratory distress syndrome) where it is known
to be a vital supportive therapy. Improper methods of ventilation, however,
can cause mechanical overstraining of parenchymal tissue resulting in addi-
tional inflammatory injuries. This complication is commonly called ventilator-
induced lung injury (VILI) and is responsible for a significant increase in
mortality rate.
Since it is so far unclear how to improve ventilation strategies in order to prevent VILI and thereby minimize mortality, we want to shed more light on some of the involved phenomena.
The work described in this paper was done on the basis of our in-house
research finite element program BACI, which covers a wide range of applications.
Since currently no real geometries of pulmonary alveoli are available for sim-
ulations due to the low resolution of conventional imaging techniques, we are
interested in finding ways to artificially generate them. For that purpose the
labyrinthine algorithm presented in [2] is extended for the application to com-
plex geometries.
A labyrinthine algorithm enables the generation of random pathways
through an a priori defined assemblage of base cells. All cells have to be con-
nected to a given starting cell and should be passed only once except in case
of branching cells. If a cell is affiliated in the course of the labyrinth creation,
it is stored in a queue. In every step, the cell located in front of the queue is
set active and thus can create a path to a new cell by randomly choosing one
of its neighboring cells that are not already passed. Afterwards, the active cell
and the just affiliated new cell are moved to the end of the queue. If, however,
the active cell has no unpassed neighbors in the beginning or at the end of the
step, it is deleted from the queue. The procedure is repeated until the queue is empty.
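A compact sketch of this queue-based procedure is given below, assuming the cell assemblage is provided as an adjacency list of neighbouring cells. It implements the basic variant only; the optimal-pathlength check described in the following paragraph is not included.

import random
from collections import deque

def build_labyrinth(neighbors, start, seed=0):
    """Grow a random labyrinth (spanning tree of pathways) through an
    assemblage of cells. `neighbors[c]` lists the cells adjacent to cell c;
    returned are the created cell-to-cell connections."""
    rng = random.Random(seed)
    passed = {start}
    queue = deque([start])
    connections = []
    while queue:
        active = queue[0]
        candidates = [c for c in neighbors[active] if c not in passed]
        if not candidates:
            queue.popleft()               # no unpassed neighbours: delete cell
            continue
        new = rng.choice(candidates)      # path to a randomly chosen new cell
        connections.append((active, new))
        passed.add(new)
        queue.popleft()                   # move active and new cell to the
        queue.append(active)              # end of the queue
        queue.append(new)
    return connections

# Example: a 3x3 assemblage of cells with 4-neighbourhood.
cells = [(i, j) for i in range(3) for j in range(3)]
nbrs = {c: [(c[0] + di, c[1] + dj)
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if (c[0] + di, c[1] + dj) in cells]
        for c in cells}
print(len(build_labyrinth(nbrs, (0, 0))))   # 8 connections span all 9 cells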
Generated labyrinths should contain no detours for the sake of preserving
minimal overall pathlength, a feature of great importance regarding effective
gas transport in the lungs. In [2], this is accomplished by introducing the
concept of priority directions. Provided that cell geometries are simple, this
method works well. In case of tetrakaidecahedral cells, however, it can be
shown that the conventional labyrinthine algorithm fails in preserving optimal
mean pathlength due to possible diagonal connection of cells.
Hence a new approach was developed enforcing generation of optimal ran-
dom labyrinths while allowing the starting cell to be arbitrarily located within
the given alveolar ensemble. In this connection it has to be explicitly verified
that the pathway to the randomly chosen new cell via the currently active cell
is an optimal one. If no shorter pathlength via other cells in the queue ex-
ists, then the new cell can be connected to the active cell. Otherwise another
possible new cell has to be selected from the range of neighbors. In case that
either no other neighboring cell is at hand or no other adjacent cell can be
affiliated optimally, the subsequent cell in the queue is set active. For a more
detailed presentation we refer to [3].
Based on the presented generalized labyrinthine algorithm arbitrarily large
alveolar ensembles can be generated and structurally meshed with hexahedral
finite elements. An example of a created labyrinthine pathway and the corre-
sponding alveolar geometry is depicted in Fig. 1.
H = κI + (1 − 3κ) e3 ⊗ e3 . (4)
In this connection κ represents a parameter derived from the orientation den-
sity distribution function ρ(θ)
$$\kappa = \frac{1}{4}\int_{0}^{\pi} \rho(\theta)\,\sin^{3}(\theta)\,d\theta . \qquad (5)$$
Fiber orientation in alveolar tissue seems to be rather random, hence lung
parenchyma can be treated as a homogeneous, isotropic continuum following
K = tr (CH) . (6)
The strain-energy function of the non-linear collagen fiber network then reads
$$W_{\mathrm{fib}} = \begin{cases} \dfrac{k_1}{2 k_2}\left[\exp\!\left(k_2 (K-1)^{2}\right) - 1\right] & \text{for } K \ge 1 \\[4pt] 0 & \text{for } K < 1 \end{cases} \qquad (7)$$
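As a small worked example, the sketch below evaluates the structural tensor of Eq. (4), the pseudo-invariant K of Eq. (6) and the fibre strain energy of Eq. (7) for a given right Cauchy-Green tensor C. The parameter values are arbitrary placeholders and not the values fitted to the data of [12].

import numpy as np

def fiber_strain_energy(C, kappa=0.25, k1=1.0, k2=1.0):
    """Dispersed-fibre strain energy following Eqs. (4), (6) and (7):
    H = kappa*I + (1-3*kappa) e3 (x) e3,  K = tr(C H),
    W_fib = k1/(2 k2) (exp(k2 (K-1)^2) - 1) for K >= 1, else 0.
    kappa = 1/3 corresponds to an isotropic fibre distribution."""
    e3 = np.array([0.0, 0.0, 1.0])
    H = kappa * np.eye(3) + (1.0 - 3.0 * kappa) * np.outer(e3, e3)
    K = np.trace(C @ H)
    if K < 1.0:
        return 0.0                      # fibres carry no compression
    return k1 / (2.0 * k2) * (np.exp(k2 * (K - 1.0) ** 2) - 1.0)

# Placeholder uniaxial stretch of 1.2 along the preferred direction e3:
lam = 1.2
C = np.diag([1.0 / lam, 1.0 / lam, lam ** 2])
print(fiber_strain_energy(C))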
Unfortunately only very few experimental data are published regarding the
mechanical behavior of alveolar tissue. To the authors’ knowledge, no mate-
rial parameters for single alveolar walls are derivable since up to now only
parenchyma was tested (see for example [9], [10], [11]). Therefore, we fitted the material model to experimental data published in [12] for lung tissue sheets.
The variation of the overall work with respect to the nodal displacements d then takes the following form
$$\delta W_{\mathrm{surf}} = \left[\frac{\partial}{\partial \mathbf{d}}\int_{S_0}^{S}\gamma(S^{*})\,dS^{*}\right]^{T}\delta\mathbf{d} = \left[\frac{\partial}{\partial S}\int_{S_0}^{S}\gamma(S^{*})\,dS^{*}\;\frac{\partial S}{\partial \mathbf{d}}\right]^{T}\delta\mathbf{d}. \qquad (11)$$
Using
$$\frac{d}{dx}\int_{a}^{x} f(t)\,dt = f(x) \qquad (12)$$
yields
$$\delta W_{\mathrm{surf}} = \left[\gamma(S)\,\frac{\partial S}{\partial \mathbf{d}}\right]^{T}\delta\mathbf{d} = \mathbf{f}_{\mathrm{surf}}^{T}\,\delta\mathbf{d} \qquad (13)$$
with the internal force vector
$$\mathbf{f}_{\mathrm{surf}} = \gamma(S)\,\frac{\partial S}{\partial \mathbf{d}}. \qquad (14)$$
The consistent tangent stiffness matrix derived by linearization of (14) therefore reads
$$\mathbf{A}_{\mathrm{surf}} = \gamma(S)\,\frac{\partial}{\partial \mathbf{d}}\left(\frac{\partial S}{\partial \mathbf{d}}\right)^{T} + \frac{\partial \gamma(S)}{\partial \mathbf{d}}\left(\frac{\partial S}{\partial \mathbf{d}}\right)^{T}. \qquad (15)$$
For details refer also to [13] where, however, an additional surface stress ele-
ment was introduced in contrast to the above mentioned concept of enriching
the interfacial structural nodes.
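To make Eqs. (14) and (15) concrete, the sketch below evaluates the surface-stress force and its consistent tangent for a single interfacial element whose area S is a function of the nodal displacements d. The area derivatives are approximated by finite differences purely for illustration, and a constant γ stands in for the dynamic surfactant model of [14].

import numpy as np

def surface_force(area, gamma_of_S, d, eps=1e-6):
    """Internal force f_surf = gamma(S) * dS/dd, cf. Eq. (14); dS/dd is
    approximated by central finite differences (illustration only)."""
    dS_dd = np.zeros(d.size)
    for i in range(d.size):
        dp, dm = d.copy(), d.copy()
        dp[i] += eps
        dm[i] -= eps
        dS_dd[i] = (area(dp) - area(dm)) / (2.0 * eps)
    return gamma_of_S(area(d)) * dS_dd

def surface_tangent(area, gamma_of_S, d, eps=1e-5):
    """Consistent tangent A_surf = d f_surf / dd, cf. Eq. (15); obtained by
    differentiating the force numerically, which contains both the geometric
    term gamma * d2S/dd2 and the term (dgamma/dS)(dS/dd)(dS/dd)^T."""
    n = d.size
    A = np.zeros((n, n))
    for j in range(n):
        dp, dm = d.copy(), d.copy()
        dp[j] += eps
        dm[j] -= eps
        A[:, j] = (surface_force(area, gamma_of_S, dp)
                   - surface_force(area, gamma_of_S, dm)) / (2.0 * eps)
    return A

# Toy element: a patch whose area depends on two stretch degrees of freedom;
# gamma is constant (water-like placeholder instead of the model of [14]).
area = lambda d: (1.0 + d[0]) * (1.0 + d[1])
gamma_const = lambda S: 70.0          # dyn/cm, placeholder value
d = np.array([0.05, 0.02])
print(surface_force(area, gamma_const, d))
print(surface_tangent(area, gamma_const, d))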
Unlike e.g. water with its constant surface tension, surfactant exhibits a
dynamically varying surface stress γ depending on the interfacial concentra-
tion of surfactant molecules. We use the adsorption-limited surfactant model
developed in [14] to capture this dynamic behavior.
It is noteworthy that no scaling techniques as in [15], where a single explicit
function for γ is used, are necessary, since the employed surfactant model itself
delivers the corresponding surface stress depending on both input parameters
and dynamic data.
[Figure: dynamic surface stress γ (dyn/cm) over the relative interfacial area S/S0 for different amplitudes (0.50 S0, 0.75 S0, 1.00 S0) and cycling frequencies (0.2, 0.5, 2.0 Hz)]
Fig. 4. Absolute displacements for sinusoidal ventilation of single alveoli with differ-
ent interfacial configurations for two characteristic geometric sizes. Top: Character-
istic geometric size comparable to human alveoli. Bottom: Characteristic geometric
size comparable to small animals like e.g. hamsters. Left: No interfacial phenomena.
Middle: Dynamic surfactant model. Right: Water with constant surface tension
Since large alveolar ensembles are analyzed in our studies, an efficient solver
is sorely needed. In this section, we give an overview of multigrid as well as a
brief introduction to smoothed aggregation multigrid (SA), which we use in
parallel versions in order to solve the resulting systems of equations.
Multigrid Overview
Multigrid methods are among the most efficient iterative algorithms for solv-
ing the linear system, Ad = f , associated with elliptic partial differential
equations. The basic idea is to damp errors by utilizing multiple resolutions
in the iterative scheme. High-energy (or oscillatory) components are efficiently
reduced through a simple smoothing procedure, while the low-energy (or
smooth) components are tackled using an auxiliary lower resolution version of
the problem (coarse grid). The idea is applied recursively on the next coarser
level. An example multigrid iteration is given in Algorithm 1 to solve
A1 d1 = f1 . (16)
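Since Algorithm 1 itself is not reproduced here, the following Python sketch of a generic multigrid V-cycle illustrates the idea in the notation of the text (relaxation R_k, grid transfers P_k); it is a minimal, assumed illustration rather than the authors' actual algorithm.

```python
import numpy as np

def v_cycle(k, f, d, A, R, P, n_levels, nu1=1, nu2=1):
    """Generic V-cycle for A[k] d = f. A[k]: level matrices, P[k]: prolongation
    from level k+1 to level k, R[k]: a relaxation routine R[k](A, d, f) -> d."""
    for _ in range(nu1):                      # pre-smoothing
        d = R[k](A[k], d, f)
    if k < n_levels:                          # on the coarsest level: smoothing only
        r = f - A[k] @ d                      # residual on level k
        f_c = P[k].T @ r                      # restrict residual to level k+1
        d_c = np.zeros(P[k].shape[1])
        d_c = v_cycle(k + 1, f_c, d_c, A, R, P, n_levels, nu1, nu2)
        d = d + P[k] @ d_c                    # prolongate and add coarse-grid correction
        for _ in range(nu2):                  # post-smoothing
            d = R[k](A[k], d, f)
    return d
```

Here the coarse operators are assumed to be built via the Galerkin product, A[k+1] = P[k].T @ A[k] @ P[k].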
The two operators needed to fully specify the multigrid method are the
relaxation procedures, Rk , k = 1, . . . , Nlevels , and the grid transfers, Pk ,
k = 2, . . . , Nlevels . Note that Pk is an interpolation operator that transfers
grid information from level k + 1 to level k. The coarse grid discretization
operator A_{k+1} (k ≥ 1) can be specified by the Galerkin product A_{k+1} = P_k^T A_k P_k.
The key to fast convergence is the complementary nature of these two opera-
tors. That is, errors not reduced by Rk must be well interpolated by Pk .
Even though constructing multigrid methods via algebraic concepts
presents certain challenges, algebraic multigrid (AMG) can be used for several
problem classes without requiring a major effort for each application. Here,
we focus on a strategy to determine the Pk ’s based on algebraic principles. It
is assumed that A1 and f1 are given.
The basic idea of the tentative prolongator is that it must accurately inter-
polate certain near null space (kernel) components of the discrete operator
Ak . Once constructed, the tentative prolongator is then improved by the pro-
longator smoother in a way that reduces the energy or smoothes the basis
functions associated with the tentative prolongator. Constructing P̂k consists
of deriving its sparsity pattern and then specifying its nonzero values. The
sparsity pattern is determined by decomposing the set of degrees of freedom associated with Ak into a set of so-called aggregates Aki, such that
\bigcup_{i=1}^{N_{k+1}} A_i^k = \{1, \dots, N_k\}\,, \qquad A_i^k \cap A_j^k = \emptyset\,, \quad 1 \le i < j \le N_{k+1}   (19)
where, ideally, each aggregate Aki is the neighborhood of nodal blocks of Ak that share a nonzero off-diagonal block entry with node i. While ideal aggregates would only consist of a root
nodal block and its immediate neighboring blocks, it is usually not possible to
entirely decompose a problem into ideal aggregates. Instead, some aggregates
which are a little larger or smaller than an ideal aggregate must be created.
For this paper, each nodal block contains mk degrees of freedom where for
simplicity we assume that the nodal block size mk is constant throughout Ak .
With Nk denoting the number of nodal blocks in the system on level k this
results in nk = Nk mk being the dimension of Ak .
Aggregates Aki can be formed based on the connectivity and the strength
of the connections in Ak . For an overview of serial and parallel aggregation
techniques, we refer to [18].
Although we speak of ‘nodal blocks’ and ‘connectivity’ in an analogy to
finite element discretizations here, it shall be stressed that a node is a strictly
algebraic entity consisting of a list of degrees of freedom. In fact, this analogy
is only possible on the finest level; on coarser levels, k > 1, a node denotes
a set of degrees of freedom associated with the coarse basis functions whose
supports contain the same aggregate on level k − 1. Hence, each aggregate Aki
on level k gives rise to one node on level k + 1 and each degree of freedom
(DOF) associated with that node is a coefficient of a particular basis function
associated with Aki .
Populating the sparsity structure of P̂k derived from aggregation with
appropriate values is the second step. This is done using a matrix Bk which
represents the near null space of Ak . On the finest mesh, it is assumed that
Bk is given and that it satisfies Ãk Bk = 0, where Ãk differs from Ak in that
Dirichlet boundary conditions are replaced by natural boundary conditions.
Tentative prolongators and a coarse representation of the near null space are
constructed simultaneously and recursively to satisfy \hat{P}_k B_{k+1} = B_k.
This guarantees exact interpolation of the near null space by the tentative
prolongator. To do this, each nodal aggregate is assigned a set of columns of
P̂k with a sparsity structure that is disjoint from all other columns. We define
I_k^m(i,j) = \begin{cases} 1 & \text{if } i = j \text{ and } i \text{ is a DOF in the } l\text{-th nodal block with } l \in A_m^k \\ 0 & \text{otherwise} \end{cases}   (23)
B_k^m = I_k^m B_k\,, \qquad m = 1, \dots, N_{k+1}\,,   (24)
\bar{B}_k = \begin{pmatrix} B_k^1 \\ B_k^2 \\ \vdots \\ B_k^{N_{k+1}} \end{pmatrix},   (25)
Each B_k^m is factored by a QR decomposition,
B_k^m = Q_k^m R_k^m ,   (26)
so that the tentative prolongator takes the block-diagonal form
\hat{P}_k = \begin{pmatrix} Q_k^1 & & & \\ & Q_k^2 & & \\ & & \ddots & \\ & & & Q_k^{N_{k+1}} \end{pmatrix},   (27)
while the coefficients R_k^i define the coarse representation of the near null space
B_{k+1} = \begin{pmatrix} R_k^1 \\ R_k^2 \\ \vdots \\ R_k^{N_{k+1}} \end{pmatrix} .   (28)
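To illustrate Eqs. (23)-(28), the following Python/NumPy sketch builds the tentative prolongator and the coarse near null space by aggregate-wise QR factorization, the standard smoothed aggregation construction; the aggregate lists and the assumption that every aggregate contains at least as many DOFs as near null space vectors are illustrative simplifications.

```python
import numpy as np

def tentative_prolongator(B_k, aggregates):
    """B_k: (n_k x m) near null space of A_k; aggregates: list of DOF-index lists,
    one list per aggregate, each assumed to contain at least m DOFs."""
    n_k, m = B_k.shape
    n_agg = len(aggregates)
    P_hat = np.zeros((n_k, n_agg * m))
    B_coarse = np.zeros((n_agg * m, m))
    for a, dofs in enumerate(aggregates):
        Ba = B_k[dofs, :]                  # restriction of B_k to aggregate a, cf. Eq. (24)
        Q, Rfac = np.linalg.qr(Ba)         # local QR factorization, cf. Eq. (26)
        cols = np.arange(a * m, (a + 1) * m)
        P_hat[np.ix_(dofs, cols)] = Q      # disjoint column block of P_hat, cf. Eq. (27)
        B_coarse[cols, :] = Rfac           # coarse near null space block, cf. Eq. (28)
    return P_hat, B_coarse
```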
\frac{\|r\|}{\|r_0\|} < 10^{-6}   (31)
with \|r\| and \|r_0\| denoting the L_2-norms of the current and the initial residual, respectively.
The number of DOF as well as details concerning solver and setup times and
number of iterations per solve are summarized in Fig. 5. The time per solver
call for both simulations is given in Fig. 6. It is noteworthy that O(n) overall
scalability is achieved with the presented approach for these examples with
complex geometries.
Fig. 6. Solution time per solver call for the two discretizations (184,689 DOF and 960,222 DOF)
Currently appropriate boundary conditions for pulmonary alveoli are not yet
known. To bridge the gap between the respiratory zone where VILI occurs and
the ventilator where pressure and flow are known, it is essential to understand
the airflow in the respiratory system. In a first step we have studied flow in
a CT-based geometry of the first four generations of lower airways [20]. In a
second step we also included flexibility of airway walls and investigated fluid-
structure interaction effects [21]. The CT scans are obtained from in-vivo
experiments of patients under normal breathing and mechanical ventilation.
where u is the velocity vector, uG is the grid velocity vector, p is the pres-
sure and f F is the body force vector. A superimposed F refers to the fluid
domain and ∇ denotes the nabla operator. The parameter ν = μ/ρF is the
kinematic viscosity with viscosity μ and fluid density ρF . The kinematic pres-
sure is represented by p where p̄ = p ρF is the physical pressure within the
fluid field. The balance of linear momentum (32) refers to a deforming arbi-
trary Lagrangean Eulerian (ALE) frame of reference denoted by χ where the
geometrical location of a mesh point is obtained from the unique mapping
x = ϕ(χ, t).
The stress tensor is that of a Newtonian fluid. The initial and boundary conditions in the fluid domain are
u(t = 0) = u_0 \quad \text{in } \Omega^F ,
u = \hat{u} \quad \text{on } \Gamma_D^F ,
\sigma^F \cdot n = \hat{h}^F \quad \text{on } \Gamma_N^F ,   (36)
where Γ_D^F and Γ_N^F denote the Dirichlet and Neumann partitions of the fluid boundary, respectively, with normal n and Γ_D^F ∩ Γ_N^F = ∅. û and ĥ^F are the prescribed velocities and tractions.
The governing equation in the solid domain is the linear momentum equa-
tion given by
ρS d̈ = ∇0 · S + ρS f S in ΩS (37)
where d are the displacements, the superimposed dot denotes material time
derivatives and the superimposed S refers to the solid domain. ρS and f S
represent the density and body force, respectively.
The initial and boundary conditions are
d(t = 0) = d_0 \quad \text{in } \Omega^S ,
\dot{d}(t = 0) = \dot{d}_0 \quad \text{in } \Omega^S ,
d = \hat{d} \quad \text{on } \Gamma_D^S ,
S \cdot n = \hat{h}^S \quad \text{on } \Gamma_N^S ,   (38)
where Γ_D^S and Γ_N^S denote the Dirichlet and Neumann partitions of the structural boundary, respectively, with Γ_D^S ∩ Γ_N^S = ∅. d̂ and ĥ^S are the prescribed displacements and tractions.
Within this paper, we will account for geometrical nonlinearities but we
will assume the material to be linear elastic. Since we expect only small strains
and due to lack of experimental data, this assumption seems to be fair for first
studies.
where r are the displacements of the fluid mesh and n is the unit normal
on the interface. Satisfying the kinematic continuity leads to mass conserva-
tion at ΓFSI , satisfying the dynamic continuity yields conservation of linear
momentum, and energy conservation finally requires to simultaneously satisfy
both continuity equations.
The algorithmic framework of the partitioned FSI analysis is discussed in
detail elsewhere, cf. e.g. [22], [23], [24] and [25].
In the fluid domain, we used linear tetrahedral elements with GLS stabiliza-
tion. The airways are discretized with 7-parameter triangular shell elements
(cf. [26], [27], [28]). We refined the mesh from 110,000 up to 520,000 fluid ele-
ments and 50,000 to 295,000 shell elements, respectively, until the calculated
mass flow rate was within a tolerance of 1%.
Time integration was done with a one-step-theta method with fixed-point
iteration and θ = 2/3. For the fluid, we employed a generalized minimal
residual (GMRES) iterative solver with ILU-preconditioning.
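As a rough illustration of the time integration scheme mentioned above, the sketch below applies a one-step-theta update with θ = 2/3 to a generic semi-discrete system du/dt = rhs(u), resolving the implicit part by a simple fixed-point iteration; the interface, tolerance and iteration limit are assumptions and do not reflect the actual flow solver.

```python
import numpy as np

def one_step_theta(u_n, rhs, dt, theta=2.0/3.0, max_it=50, tol=1e-10):
    """u_{n+1} = u_n + dt*(theta*rhs(u_{n+1}) + (1-theta)*rhs(u_n)),
    solved by fixed-point iteration on u_{n+1}."""
    f_old = rhs(u_n)
    u_new = u_n.copy()
    for _ in range(max_it):
        u_next = u_n + dt * (theta * rhs(u_new) + (1.0 - theta) * f_old)
        if np.linalg.norm(u_next - u_new) < tol * (1.0 + np.linalg.norm(u_new)):
            return u_next
        u_new = u_next
    return u_new
```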
We study normal breathing under moderate activity conditions with a
tidal volume of 2 l and a breathing cycle of 4 s, i.e. 2 s inspiration and 2 s
expiration. Moreover, we consider mechanical ventilation where experimental
data from the respirator is available, see Fig. 7. A pressure-time history can be
applied at the outlets such that the desired tidal volume is obtained. For the
case of normal breathing, the pressure-time history at the outlets is sinusoidal,
negative at inspiration and positive at expiration as it occurs in “reality”. The
advantage is that inspiration and expiration can be handled quite naturally
within one computation. The difficulty is to calibrate the boundary conditions
such that the desired tidal volume is obtained which is an iterative procedure.
To investigate airflow in the diseased lung, non-uniform boundary con-
ditions are assumed. For that purpose, we set the pressure outlet boundary
conditions consistently two and three times higher on the left lobe of the lung
as compared to the right lobe. This should model a higher stiffness resulting
from collapsed or highly damaged parts of lower airway generations.
3.4 Results
At inspiration the flow in the right bronchus exhibits a skew pattern towards
the inner wall, whereas the left main bronchus shows an M-shape, see Fig. 8.
Fig. 7. Pressure-time and flow-time history of the respirator for the mechanically
ventilated lung
Fig. 8. Total flow structures at different cross sections for the healthy lung under
normal breathing (left) and the diseased lung under mechanical ventilation (right)
The overall flow and stress distribution at inspiratory peak flow are shown
in Fig. 9. The flow pattern is similar in the entire breathing cycle with more
or less uniform secondary flow intensities except at the transition from inspi-
ration to expiration. Stresses in the airway model are highest in the trachea
as well as at bifurcation points. Due to the imposed boundary conditions,
Fig. 9. Flow and principal tensile stress distribution in the airways of the healthy
lung at inspiratory peak flow rate under normal breathing and mechanical ventilation
Fig. 10. Normalized flow distribution at the outlets under normal breathing and
mechanical ventilation for the healthy and diseased lung
velocity and stress distributions as well as normalized mass flow through the
outlets are uniform as can be seen in Figs. 9 and 10.
Similar results are obtained for pure computational fluid dynamics (CFD)
simulations where the airway walls are assumed to be rigid and not moving (cf.
[20]). However, differences regarding secondary flow pattern can be observed
between FSI and pure CFD simulations. The largest deviations occur in the
fourth generation and range around 17% at selected cross sections.
Fig. 11. Flow and principal tensile stress distribution in the airways of the diseased
lung at inspiratory peak flow rate under mechanical ventilation
Airflow structures obtained for diseased lungs differ significantly from those
for healthy lungs in inspiration as well as in expiration. Flow and stress dis-
tributions are no longer uniform because of the different imposed pressure
outlet boundary conditions, see Fig. 11. Only 25% of the tidal air volume
enters the diseased part of the lung, i.e. the left lobe. The normalized mass
flow calculated at every outlet of the airway model is shown in Fig. 10.
In Fig. 8 the differences in airflow structures of the healthy and diseased
lung in terms of discrete velocity profiles during inspiratory flow are visualized.
The secondary flow structures are not only quite different from the healthy
lung but they also deviate from the results for diseased lungs obtained in [20]
where the airway walls were assumed to be rigid and non-moving. The FSI forces are significantly larger in simulations of the diseased lung, and the influence of airway wall flexibility on the flow should therefore not be neglected.
In general, airway wall stresses are larger in the diseased compared to the
healthy lung. Interestingly, stresses in the diseased lung are larger in the less
ventilated parts due to the higher secondary flow intensities (especially close
to the walls) found there. The highest stresses occur at the beginning of expi-
ration. We have modified the expiration curves of the respirator and decreased
the pressure less abruptly resulting in a significant reduction of airway wall
stresses. This finding is especially interesting with respect to our long-term
goal of proposing protective ventilation strategies allowing minimization of
VILI.
In the present paper, several aspects of coupled problems in the human res-
piratory system were addressed.
The introduced model for pulmonary alveoli comprises the generation of
three-dimensional artificial geometries based on tetrakaidecahedral cells. For
the sake of ensuring optimal mean pathlength – a feature of great importance
regarding effective gas transport in the lungs – a labyrinthine algorithm for
complex geometries is employed. A polyconvex hyperelastic material model
incorporating general histologic information is applied to describe the behav-
ior of parenchymal lung tissue. Surface stresses stemming from the alveolar
liquid lining are considered by enriching interfacial structural nodes of the
finite element model. For that purpose, a dynamic adsorption-limited surfac-
tant model is applied. It could be shown that interfacial phenomena influence
the overall mechanical behavior of alveoli significantly. Due to different sizes
and curvatures of mammalian alveoli, the intensity of this effect is species
dependent.
On the part of the structural solver, a smoothed aggregation algebraic
multigrid method was applied. Remarkably, an O(n) overall scalability could
be proven for the application to our alveolar simulations.
The investigation of airflow in the bronchial tree is based on a human CT-
scan airway model of the first four generations. For this purpose a partitioned
FSI method for incompressible Newtonian fluids under transient flow condi-
tions and geometrically nonlinear structures was applied. We studied airflow
structures under normal breathing and mechanical ventilation in healthy and
diseased lungs. Airflow under normal breathing conditions is steady except in
the transition from inspiration to expiration. By contrast, airflow under me-
chanical ventilation is unsteady during the whole breathing cycle due to the
given respirator settings. We found that results obtained with FSI and pure
CFD simulations are qualitatively similar in case of the healthy lung whereas
significant differences can be shown for the diseased lung. Apart from that,
stresses are larger in the diseased lung and can be influenced by the choice of
ventilation parameters.
The lungs are highly heterogeneous structures comprising multiple spatial
length scales. Since it is neither reasonable nor computationally feasible to
simulate the lung on the whole, investigations are restricted to certain in-
teresting parts of it. Modeling the interplay between the different scales is
essential in gaining insight into the lungs’ behavior on both the micro- and
the macroscale. In this context, coupling our bronchial and alveolar model
and thus deriving appropriate boundary conditions for both models plays an
essential role. Due to the limitations of mathematical homogenization and
sequential multi-scale methods particularly in the case of nonlinear behavior
of complex structures, an integrated scale coupling as depicted in Fig. 12 is
desired, see e.g. [29]. This will be subject of future investigations.
Despite the fact that we do not intend to compute an overall and fully resolved model of the lung, each part involved in our investigations is a challenging area in itself and demands the best that HPC can offer today.
Acknowledgement
References
1. J. DiRocco, D. Carney, and G. Nieman. The mechanism of ventilator-induced
lung injury: Role of dynamic alveolar mechanics. In Yearbook of Intensive Care
and Emergency Medicine. 2005.
2. H. Kitaoka, S. Tamura, and R. Takaki. A three-dimensional model of the human
pulmonary acinus. J. Appl. Physiol., 88(6):2260–2268, Jun 2000.
3. L. Wiechert and W.A. Wall. An artificial morphology for the mammalian pul-
monary acinus. in preparation, 2007.
22. U. Küttler, C. Förster, and W.A. Wall. A solution for the incompressibility
dilemma in partitioned fluid-structure interaction with pure dirichlet fluid do-
mains. Computational Mechanics, 38:417–429, 2006.
23. C. Förster, W.A. Wall, and E. Ramm. Artificial added mass instabilities in
sequential staggered coupling of nonlinear structures and incompressible flows.
Computer Methods in Applied Mechanics and Engineering, 2007.
24. W.A. Wall. Fluid-Struktur-Interaktion mit stabilisierten Finiten Elementen.
PhD thesis, Institut für Baustatik, Universität Stuttgart, 1999.
25. D.P. Mok. Partitionierte Lösungsansätze in der Strukturdynamik und der Fluid-
Struktur-Interaktion. PhD thesis, Institut für Baustatik, Universität Stuttgart,
2001.
26. M. Bischoff. Theorie und Numerik einer dreidimensionalen Schalenfor-
mulierung. PhD thesis, Institut für Baustatik, University Stuttgart, 1999.
27. M. Bischoff and E. Ramm. Shear deformable shell elements for large strains and
rotations. International Journal for Numerical Methods in Engineering, 1997.
28. M. Bischoff, W.A. Wall, K.-U. Bletzinger, and E. Ramm. Models and finite el-
ements for thin-walled structures. In E. Stein, R. de Borst, and T.J.R. Hughes,
editors, Encyclopedia of Computational Mechanics - Volume 2: Solids, Struc-
tures and Coupled Problems. John Wiley & Sons, 2004.
29. V. G. Kouznetsova. Computational homogenization for the multi-scale analysis
of multi-phase materials. PhD thesis, Technische Universiteit Eindhoven, 2002.
FSI Simulations on Vector Systems –
Development of a Linear Iterative Solver
(BLIS)
Summary. This paper addresses the algorithmic and implementation issues asso-
ciated with fluid structure interaction simulations, especially on vector architectures.
Firstly, the fluid structure coupling algorithm is presented and then a newly devel-
oped parallel sparse linear solver is introduced and its performance discussed.
1 Introduction
In this paper we focus on the performance improvement of the fluid structure
interaction simulations on vector systems. The work described here was done
on the basis of the research finite element program Computer Aided Research
Analysis Tool (CCARAT), that is developed and maintained at the Institute
of Structural Mechanics of the University of Stuttgart. The research code
CCARAT is a multipurpose finite element program covering a wide range of
applications in computational mechanics, like e.g. multi-field and multi-scale
problems, structural and fluid dynamics, shape and topology optimization,
material modeling and finite element technology. The code is parallelized using
MPI and runs on a variety of platforms.
The major time consuming portions of a finite element simulation are
calculating the local element contributions to the globally assembled matrix
and solving the assembled global system of equations. As much as 80% of the
time in a very large scale simulation can be spent in the linear solver, especially
if the problem to be solved is ill-conditioned. While the time taken in element
calculation scales linearly with the size of the problem, often the time in
the sparse solver does not. The major reason is the kind of preconditioning
needed for a successful solution. In Sect. 2 of this paper the fluid structure
coupling algorithm implemented in CCARAT is presented. Sect. 3 of this
paper briefly analyses the performance of public domain solvers on vector
architecture and then a newly developed parallel iterative solver (Block-based
Linear Iterative Solver – BLIS) is introduced. In Sect. 4, a large-scale fluid-
structure interaction example is presented. Sect. 5 discusses the performance
of a pure fluid example and fluid structure interaction example on scalar and
vector systems along with the scaling results of BLIS on the NEC SX-8.
A key requirement for the coupling schemes is to fulfill two coupling condi-
tions: the kinematic and the dynamic continuity across the interface. Kine-
matic continuity requires that the position of structure and fluid boundary
are equal at the interface, while dynamic continuity means that all tractions
at the interface are in equilibrium:
with n denoting the unit normal vector on the interface. Satisfying the kine-
matic continuity leads to mass conservation at Γ , satisfying the dynamic con-
tinuity leads to conservation of linear momentum, and energy conservation
finally requires to simultaneously satisfy both continuity equations. In this
paper (and in figure 1) only no-slip boundary conditions and sticking grids at
the interface are considered.
Average vector length is an important metric that has a huge effect on per-
formance. In sparse linear algebra, the matrix object is sparse whereas the
vectors are dense. So, any operations involving only the vectors, like the dot
product, run with high performance on any architecture as they exploit spatial
locality in memory. But, for any operations involving the sparse matrix object,
like the matrix vector product (MVP), sparse storage formats play a crucial
role in achieving good performance, especially on vector architectures. This
is extensively discussed in [5, 6].
The performance of sparse MVP on vector as well as on superscalar ar-
chitectures is not limited by memory bandwidth, but by latencies. Due to
sparse storage, the vector to be multiplied in a sparse MVP is accessed ran-
domly (non-strided access). This introduces indirect memory access which is a
memory latency bound operation. Blocking is employed on scalar as well as on
vector architecture to reduce the amount of indirect memory access needed for
the sparse MVP kernel using any storage format [7, 8]. The cost of accessing
In the sparse MVP kernel discussed so far, the major hurdle to performance
is not memory bandwidth but the latencies involved due to indirect memory
addressing. Block based computations exploit the fact that many FE problems
typically have more than one physical variable to be solved per grid point.
Thus, small blocks can be formed by grouping the equations at each grid point.
Operating on such dense blocks considerably reduces the amount of indirect
addressing required for sparse MVP [6]. This improves the performance of the
kernel dramatically on vector machines [9] and also remarkably on superscalar
architectures [10, 11]. A vectorized general parallel iterative solver (BLIS)
targeting performance on vector architecture is under development. Block-
based approach is adopted in BLIS primarily to reduce the penalty incurred
due to indirect memory access on most hardware architectures. Some solvers
already implement similar blocking approaches, but use BLAS routines when
processing each block. This method will not work on vector architecture as the
innermost loop is short when processing small blocks. So, explicitly unrolling
the kernels is the key to achieve high sustained performance. This approach
also has advantages on scalar architectures and is adopted in [7].
used to store the dense blocks. This assures sufficient average vector length
for operations done using the sparse matrix object (Preconditioning, Sparse
MVP). The single CPU performance of sparse MVP, Fig. 2, with a matrix
consisting of 4x4 dense blocks is around 7.2 GFlop/s (about 45% vector peak)
on the NEC SX-8. The sustained performance in the whole solver is about
30% of peak when the problem size is large enough to fill the vector pipelines.
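The following Python sketch illustrates the block-based sparse matrix-vector product with explicitly unrolled 4x4 blocks discussed above. The block compressed row data layout (values, col_idx, row_ptr) is an assumption for illustration; in BLIS the corresponding kernel is of course written in a compiled, vectorized language, and Python only serves to show the structure.

```python
import numpy as np

def bsr_matvec_4x4(values, col_idx, row_ptr, x):
    """values[b]: b-th 4x4 dense block, col_idx[b]: its block column,
    row_ptr: block-row pointer array of length (number of block rows + 1)."""
    n_brows = len(row_ptr) - 1
    y = np.zeros(4 * n_brows)
    for i in range(n_brows):                      # loop over block rows
        y0 = y1 = y2 = y3 = 0.0
        for b in range(row_ptr[i], row_ptr[i + 1]):
            j = 4 * col_idx[b]                    # one indirect access per block
            v = values[b]
            x0, x1, x2, x3 = x[j], x[j + 1], x[j + 2], x[j + 3]
            # explicitly unrolled 4x4 block multiply
            y0 += v[0, 0]*x0 + v[0, 1]*x1 + v[0, 2]*x2 + v[0, 3]*x3
            y1 += v[1, 0]*x0 + v[1, 1]*x1 + v[1, 2]*x2 + v[1, 3]*x3
            y2 += v[2, 0]*x0 + v[2, 1]*x1 + v[2, 2]*x2 + v[2, 3]*x3
            y3 += v[3, 0]*x0 + v[3, 1]*x1 + v[3, 2]*x2 + v[3, 3]*x3
        y[4*i], y[4*i+1], y[4*i+2], y[4*i+3] = y0, y1, y2, y3
    return y
```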
BLIS is based on MPI and includes well known Krylov subspace methods
such as CG, BiCGSTAB and GMRES. Block scaling, block Jacobi, colored
block symmetric Gauss-Seidel and block ILU(0) on subdomains are the avail-
able matrix preconditioners. Exchange of halos in sparse MVP can be done
using MPI blocking, non-blocking or persistent communication.
Future work:
The restriction of block sizes will be solved by extending the solver to han-
dle any number of unknowns. Blocking functionality will be provided in the solver to relieve users from having to prepare blocked matrices themselves in order to use the library. This makes adaptation of the library to an application
easier. Reducing global synchronization at different places in Krylov subspace
algorithms has to be extensively looked into for further improving scaling of
the solver [12]. We also plan to implement domain decomposition based and
multigrid preconditioning methods.
4 Numerical example
In this numerical example a simplified 2-dimensional representation of a cubic
building with a flat membrane roof is studied. The building is situated in a
horizontal flow with an initially exponential profile and a maximum velocity
of 26.6 m/s. The fluid is Newtonian with dynamic viscosity νF = 0.1 N s/m² and density ρF = 1.25 kg/m³.
In the following two different configurations are compared:
• a rigid roof, i.e. pure fluid simulation
• a flexible roof including fluid-structure interaction
For the second case the roof is assumed to be a very thin membrane (t/l =
1/1000) with Young’s modulus ES = 1.0 · 10⁹ N/m², Poisson’s ratio νS = 0.0 and density ρS = 1000.0 kg/m³.
The fluid domain is discretized by 25,650 GLS-stabilized Q1Q1 elements and the structure with 80 Q1 elements. The moving boundary of the fluid is considered
via an ALE-Formulation only for the fluid subdomain situated above the
membrane roof. Here 3,800 pseudo-structural elements are used to calculate
the new mesh positions [2]. This discretization results in ∼ 85, 000 degrees of
freedom for the complete system.
The calculation was run for approx. 2,000 time steps with ∆t = 0.01 s,
resulting in a simulation time of ∼ 20 s. For each timestep 4-6 iterations be-
tween fluid and structure field were needed to fulfill the coupling conditions. In
the single fields 3-5 Newton iterations for fluid and 2-3 iterations for structure
were necessary to solve the nonlinear problems.
The results for both the rigid and the flexible roof for t = 9.0 s are vi-
sualized in figure 4. For both simulations the pressure field clearly shows a
large vortex, which emerges in the wake of the building and then moves slowly
downstream. In addition, for the flexible roof smaller vortices are separating
at the upstream edge of the building and traveling over the membrane roof.
Fig. 4. Membrane Roof: Pressure field on deformed geometry (10-fold) for rigid
roof (left) and flexible roof (right)
These vortices, originating from the interaction between fluid and structure,
cause the nonsymmetric deformation of the roof.
5 Performance
This section provides the performance analysis of finite element simulations
on both scalar and vector architectures. Firstly, scaling of BLIS on NEC SX-8
is presented for a laminar flow problem with different discretizations. Then,
performance of a pure fluid example and a FSI example is compared between
two different hardware architectures. The machines tested are a cluster of
NEC SX-8 SMPs and a cluster of Intel 3.2 GHz Xeon EM64T processors.
The network interconnect available on NEC SX-8 is a proprietary multi-stage
crossbar called IXS and Infiniband on the Xeon cluster. Vendor tuned MPI
library is used on the SX-8 and Voltaire MPI library on the Xeon cluster.
Scaling of the solver on NEC SX-8 was tested for the above mentioned nu-
merical example using stabilized 3D hexahedral fluid elements implemented
in CCARAT. Table 1 lists all the six discretizations of the example used.
Figure 6 plots weak scaling of BLIS for different processor counts. Each
curve represents performance using particular number of CPUs with varying
problem size. All problems were run for 5 time steps where each non-linear
time step needs about 3-5 newton iterations for convergence. The number of
iterations needed for convergence in BLIS for each newton step varies largely
between 200-2000 depending on the problem size (number of equations). The
plots show the drop in sustained floating point performance of BLIS from over
6 GFlop/s to 3 GFlop/s depending on the number of processors used for each
problem size.
The right plot of Fig. 6 explains the reason for this drop in performance
in terms of the drop in the computation to communication ratio in BLIS. It has to be noted that the major part of the communication time at higher processor counts is spent in MPI global reduction calls, which require global synchronization.
As the processor count increases, the performance curves climb slowly till
the performance saturates. This behavior can be directly attributed to the
time spent in communication which is clear from the right plot. These plots
are hence important as they accentuate the problem with Krylov subspace
algorithms where large problem sizes are needed to sustain high performance
on large processor counts. This is a drawback for a certain class of applications
Fig. 6. Scaling of BLIS with respect to problem size on NEC SX-8 (left); computation to communication ratio in BLIS on NEC SX-8 (right)
where the demand for HPC (High Performance Computing) is due to the
largely transient nature of the problem. For instance, even though the problem
size is moderate in some Fluid-Structure interaction examples, thousands of
time steps are necessary to simulate the transient effects.
Table 2. Performance comparison in solver between SX-8 and Xeon for a pure fluid
example with 631504 equations on 8 CPUs

Machine  Solver  Precond.  BiCGSTAB iters.   MFlop/s   CPU time
                           per solver call   per CPU
SX-8     BLIS4   BJAC       65               4916       110
SX-8     BLIS4   BILU      125               1027       765
SX-8     AZTEC   ILU        48                144      3379
Xeon     BLIS4   BJAC       68                  -      1000
Xeon     BLIS4   BILU      101                  -       625
Xeon     AZTEC   ILU        59                  -      1000
Table 3. Performance comparison in solver between SX-8 and Xeon for a fluid struc-
ture interaction example with 25168 fluid equations and 26352 structural equations

Machine  Solver    Precond.  CG iters.         MFlop/s   CPU time
                             per solver call   per CPU
SX-8     BLIS3,4   BJAC      597               6005        66
SX-8     AZTEC     ILU       507                609       564
Xeon     BLIS3,4   BILU      652                  -       294
Xeon     AZTEC     ILU       518                  -       346
6 Summary
The fluid structure interaction framework has been presented. The reasons
behind the dismal performance of most of the public domain sparse iterative
solvers on vector machines were briefly stated. We then introduced the Block-
based Linear Iterative Solver (BLIS) which is currently under development
targeting performance on all architectures. Results show an order of mag-
nitude performance improvement over other public domain libraries on the
tested vector system. A moderate performance improvement is also measured
on the scalar machines.
References
1. Wall, W.A.: Fluid-Struktur-Interaktion mit stabilisierten Finiten Elementen.
PhD thesis, Institut für Baustatik, Universität Stuttgart (1999)
2. Wall, W., Ramm, E.: Fluid-Structure Interaction based upon a Stabilized (ALE)
Finite Element Method. In: E. Oñate and S. Idelsohn (Eds.), Computational
Mechanics, Proceedings of the Fourth World Congress on Computational Me-
chanics WCCM IV, Buenos Aires. (1998)
3. Tuminaro, R.S., Heroux, M., Hutchinson, S.A., Shadid, J.N.: Aztec user’s guide:
Version 2.1. Technical Report SAND99-8801J, Sandia National Laboratories
(1999)
4. Heroux, M.A., Willenbring, J.M.: Trilinos users guide. Technical Report
SAND2003-2952, Sandia National Laboratories (2003)
5. Saad, Y.: Iterative Methods for Sparse Linear Systems, Second Edition. SIAM,
Philadelphia, PA (2003)
6. Tiyyagura, S.R., Küster, U., Borowski, S.: Performance improvement of sparse
matrix vector product on vector machines. In Alexandrov, V., van Albada, D.,
Sloot, P., Dongarra, J., eds.: Proceedings of the Sixth International Conference
on Computational Science (ICCS 2006). LNCS 3991, May 28-31, Reading, UK,
Springer (2006)
7. Im, E.J., Yelick, K.A., Vuduc, R.: Sparsity: An optimization framework for
sparse matrix kernels. International Journal of High Performance Computing
Applications (1)18 (2004) 135–158
8. Tiyyagura, S.R., Küster, U.: Linear iterative solver for NEC parallel vector
systems. In Resch, M., Bönisch, T., Tiyyagura, S., Furui, T., Seo, Y., Bez, W.,
eds.: Proceedings of the Fourth Teraflop Workshop 2006, March 30-31, Stuttgart,
Germany, Springer (2006)
9. Nakajima, K.: Parallel iterative solvers of geofem with selective blocking pre-
conditioning for nonlinear contact problems on the earth simulator. GeoFEM
2003-005, RIST/Tokyo (2003)
10. Jones, M.T., Plassmann, P.E.: Blocksolve95 users manual: Scalable library soft-
ware for the parallel solution of sparse linear systems. Technical Report ANL-
95/48, Argonne National Laboratory (1995)
11. Tuminaro, R.S., Shadid, J.N., Hutchinson, S.A.: Parallel sparse matrix vector
multiply software for matrices with data locality. Concurrency: Practice and
Experience (3)10 (1998) 229–247
12. Demmel, J., Heath, M., van der Vorst, H.: Parallel numerical linear algebra.
Acta Numerica 2 (1993) 53–62
13. Schäfer, M., Turek, S.: Benchmark Computations of Laminar Flow Around a
Cylinder. Notes on Numerical Fluid Mechanics 52 (1996) 547–566
Simulations of Premixed Swirling Flames
Using a Hybrid Finite-Volume/Transported
PDF Approach
Abstract
1 Introduction
In many industrial applications there is a high demand for reliable predictive
models for turbulent swirling flows. While the calculation of non-reacting flows has become a standard task and can be handled using Reynolds-averaged Navier-Stokes (RANS) or Large Eddy Simulation (LES) methods, the modeling of reacting flows is still challenging due to the difficulties that arise from the strong non-linearity of the chemical source term, which cannot be modeled satisfactorily by oversimplified closure methods.
PDF methods (probability density function) show a high capability for
modeling turbulent reactive flows, because of the advantage of treating
convection and finite rate non-linear chemistry exactly [1, 2]. Only the effect
of molecular mixing has to be modeled [3]. In the literature different kinds
of PDF approaches can be found. Some use stand-alone PDF methods in
which all flow properties are computed by a joint probability density function
method [4, 5, 6, 7]. The transport equation for the joint probability density
function that can be derived from the Navier-Stokes equations still contains
unclosed terms that need to be modeled. These terms are the fluctuating
pressure gradient and the terms describing the molecular transport. In
contrast, the above-mentioned chemistry term, the body forces and the mean pressure gradient term already appear in closed form and need no further modeling assumptions.
Compared to RANS methods the structure of the equations appearing in the
PDF context is remarkably different. The moment closure models (RANS)
result in a set of partial differential equations, which can be solved numerically
using finite-difference or finite-volume methods [8]. In contrast the transport
equation for the PDF is a high-dimensional scalar transport equation. In
general it has 7 + nS dimensions which consist of three dimensions in space,
three dimensions in velocity space, the time and the number of species
nS used for the description of the thermokinetic state. Due to this high
dimensionality it is not feasible to solve the equation using finite-difference
or finite-volume methods. For that reason Monte Carlo methods have been
employed, which are widely used in computational physics to solve problems
of high dimensionality, because the numerical effort increases only linearly
with the number of dimensions.
Using the Monte Carlo method the PDF is represented by an ensemble of
stochastic particles [9]. The transport equation for the PDF is transformed
to a system of stochastic ordinary differential equations. This system is
constructed in such a way that the particle properties, e.g. velocity, scalars,
and turbulent frequency, represent the same PDF as in the turbulent flow.
In order to fulfill consistency of the modeled PDF, the mean velocity field
derived from an ensemble of particles needs to satisfy the mass conservation
equation [1]. This requires the pressure gradient to be calculated from a
Poisson equation. The available Monte Carlo methods introduce a strong bias when determining the convective and diffusive terms in the momentum conservation equations. This leads to stability problems when calculating the pressure gradient from the Poisson equation. To avoid these instabilities, different methods to calculate the mean pressure gradient were used. One possibility is to couple the particle method with an ordinary finite-volume or finite-difference solver to obtain the mean pressure field from the Navier-Stokes equations. These so-called hybrid PDF/CFD methods are widely used by different authors for
many types of flames [10, 11, 12, 13, 14, 15].
In the presented paper a hybrid scheme is used. The fields for mean pressure
gradient and a turbulence characteristic, e.g. the turbulent time scale, are
derived solving the Reynolds averaged conservation equations for momentum,
mass and energy for the flow field using a finite-volume method. The effect of
turbulent fluctuations is modeled using a k-τ model [16]. Chemical kinetics
are taken into account by using the ILDM method to get reduced chemical
mechanisms [17, 18]. In the presented case the reduced mechanism describes
the reaction with three parameters, which is on the one hand small enough to keep the simulation time acceptable and on the other hand sufficient to obtain a detailed description of the chemical reaction.
The test case for the developed model is a model combustion chamber
investigated by several authors [19, 20, 21, 22]. The results of the presented simulations are validated against their data.
2 Numerical Model
As mentioned above, a hybrid CFD/PDF method is used in this work. In Fig. 1 a complete sketch of the solution procedure can be found. Before explaining the details of the implemented equations and discussing consistency and numerical matters, the idea of the solution procedure shall be briefly outlined.
The calculation starts with a CFD step in which the Navier-Stokes equa-
tions for the flow field are solved by a finite-volume method. The resulting
mean pressure gradient together with the mean velocities and the turbulence
characteristics is handed over to the PDF part. Here the joint probability
density function of the scalars and the velocity is solved by a particle Monte
Carlo method. The reaction progress is taken into account by reading from
a lookup table based on a mechanism reduced with the ILDM method. As
a result of this step the mean molar mass, the composition vector and the
mean temperature field are returned to the CFD part. This internal iteration
is performed until convergence is achieved.
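A minimal sketch of this outer coupling loop is given below; the callables cfd_step and pdf_step, the state object and the convergence measure are placeholders and do not represent the actual Sparc3 or PDF code interfaces.

```python
def hybrid_cfd_pdf(cfd_step, pdf_step, state, max_iter=50, tol=1e-4):
    """cfd_step(state)  -> mean pressure gradient, mean velocities, k, tau
    pdf_step(fields) -> mean molar mass, composition vector, mean temperature
    state.update(..) -> feeds the scalar fields back, returns a change measure."""
    for _ in range(max_iter):
        fv_fields = cfd_step(state)          # finite-volume (RANS) step
        scalars = pdf_step(fv_fields)        # particle Monte Carlo step with ILDM lookup
        change = state.update(scalars)       # return results to the CFD part
        if change < tol:                     # internal iteration until convergence
            break
    return state
```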
The CFD code which is used to calculate the mean velocity and pressure
field along with the turbulent kinetic energy and the turbulent time scale is
called Sparc3 and was developed by the Department of Fluid Machinery at
Karlsruhe University. It solves the Favre-averaged compressible Navier-Stokes equations using a finite-volume method on block-structured non-uniform grids.
[Fig. 1: Sketch of the hybrid solution procedure: the CFD part passes ∂p̄/∂x_i, ũ, ṽ, w̃, k and τ to the PDF part, which returns the mean molar mass, the compositions ψ_i and the mean temperature T to the CFD part.]
\frac{\partial \bar{\rho}}{\partial t} + \frac{\partial (\bar{\rho}\tilde{u}_i)}{\partial x_i} = 0   (1)

\frac{\partial (\bar{\rho}\tilde{u}_i)}{\partial t} + \frac{\partial}{\partial x_j}\left( \bar{\rho}\tilde{u}_i\tilde{u}_j + \overline{\rho u_i'' u_j''} + \bar{p}\,\delta_{ij} - \bar{\tau}_{ij} \right) = 0   (2)

\frac{\partial (\bar{\rho}\tilde{e})}{\partial t} + \frac{\partial}{\partial x_j}\left( \bar{\rho}\tilde{u}_j\tilde{e} + \tilde{u}_j\bar{p} + \overline{u_j'' p'} + \overline{\rho u_j'' e''} + \bar{q}_j - \overline{u_i \tau_{ij}} \right) = 0   (3)
which are the conservation equations for mass, momentum and energy in
Favre average manner, respectively. Modeling of the unclosed terms in the
energy equation will not be described in detail any further but can be found
for example in [8]. The unclosed cross correlation term in the momentum
conservation equation is modeled using the Boussinesq approximation
\overline{\rho u_i'' u_j''} = \bar{\rho}\,\mu_T \left(\frac{\partial \tilde{u}_i}{\partial x_j} + \frac{\partial \tilde{u}_j}{\partial x_i}\right)   (4)
with
\mu_T = C_\mu f_\mu\, k\, \tau .   (5)
The parameter Cµ is an empirical constant with a value of Cµ = 0.09 and
fµ accounts for the influence of walls. The turbulent kinetic energy k and the
turbulent time scale τ are calculated from their transport equation which are
[16]
\bar{\rho}\,\frac{\partial k}{\partial t} + \bar{\rho}\tilde{u}_j\frac{\partial k}{\partial x_j} = \tau_{ij}\frac{\partial u_i}{\partial x_j} - \bar{\rho}\,\frac{k}{\tau} + \frac{\partial}{\partial x_j}\left[\left(\mu + \frac{\mu_T}{\sigma_k}\right)\frac{\partial k}{\partial x_j}\right]   (6)

\bar{\rho}\,\frac{\partial \tau}{\partial t} + \bar{\rho}\tilde{u}_j\frac{\partial \tau}{\partial x_j} = (1 - C_{\epsilon 1})\,\frac{\tau}{k}\,\tau_{ij}\frac{\partial u_i}{\partial x_j} + (C_{\epsilon 2} - 1)\,\bar{\rho} + \frac{\partial}{\partial x_j}\left[\left(\mu + \frac{\mu_T}{\sigma_{\tau 2}}\right)\frac{\partial \tau}{\partial x_j}\right] + \frac{2}{k}\left(\mu + \frac{\mu_T}{\sigma_{\tau 1}}\right)\frac{\partial k}{\partial x_k}\frac{\partial \tau}{\partial x_k} - \frac{2}{\tau}\left(\mu + \frac{\mu_T}{\sigma_{\tau 2}}\right)\frac{\partial \tau}{\partial x_k}\frac{\partial \tau}{\partial x_k} .   (7)
Here Cǫ1 = 1.44 and στ 1 = στ 2 = 1.36 are empirical model constants. The
parameter Cǫ2 is calculated from the turbulent Reynolds number Ret .
Re_t = \frac{k\,\tau}{\mu}   (8)

C_{\epsilon 2} = 1.82\left[ 1 - \frac{2}{9}\exp\left( -\left(\frac{Re_t}{6}\right)^{2} \right) \right]   (9)
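For illustration, the short Python function below evaluates Eqs. (5), (8) and (9); setting f_mu = 1 (no wall damping) is an assumption of this sketch.

```python
import math

def k_tau_coefficients(k, tau, mu, c_mu=0.09, f_mu=1.0):
    """Turbulent viscosity (Eq. (5)), turbulent Reynolds number (Eq. (8))
    and the damped constant C_eps2 (Eq. (9)) of the k-tau model."""
    mu_t = c_mu * f_mu * k * tau
    re_t = k * tau / mu
    c_eps2 = 1.82 * (1.0 - (2.0 / 9.0) * math.exp(-(re_t / 6.0) ** 2))
    return mu_t, re_t, c_eps2
```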
In the literature many different joint PDF models can be found, for example
models for the joint PDF of velocity and composition [23, 24] or for the joint
PDF of velocity, composition and turbulent frequency [25]. A good overview
of the different models can be found in [12].
In most joint PDF approaches a turbulent (reactive) flow field is described
by a one-time, one-point joint PDF of certain fluid properties. At this level
chemical reactions are treated exactly without any modeling assumptions [1].
However, the effect of molecular mixing has to be modeled.
The state of the fluid at a given point in space and time can be fully described by the velocity vector V = (V_1, V_2, V_3)^T and the composition vector Ψ = (Ψ_1, Ψ_2, ..., Ψ_{n_S−1}, h)^T, containing the mass fractions of n_S − 1 species and the enthalpy h. The probability density function is
and gives the probability that at one point in space and time one realization
of the flow is within the interval
V ≤ U ≤ V + dV (11)
Ψ ≤ Φ ≤ Ψ + dΨ (12)
Term I describes the instationary change of the PDF, Term II its change by
convection in physical space and Term III takes into account the influence of
gravity and the mean pressure gradient on the PDF. Term IV includes the
chemical source term which describes the change of the PDF in composition
space due to chemical reactions. All terms on the left hand side of the equation
appear in closed form, e.g. the chemical source term. In contrast the terms
on the right hand side are unclosed and need further modeling. Many closing
assumptions for these two terms exist. In the following only the ones that are
used in the present work shall be explained further.
Term V describes the influence of pressure fluctuations and viscous stresses on
the PDF. Commonly a Langevin approach [26, 27] is used to close this term.
In the presented case the SLM (Simplified Langevin Model) is used [1]. More
sophisticated approaches that take into account the effect of non-isotropic
turbulence or wall effects exist as well [26, 28]. But in the presented case of a
swirling non-premixed free stream flame the closure of the term by the SLM
is assumed to be adequate and was chosen because of its simplicity.
Term VI regards the effect of molecular diffusion within the fluid. This diffu-
sion flattens the steep composition gradients which are created by the strong
vortices in a turbulent flow. Several models have been proposed to close this
term. The simplest model is the interaction by exchange with the mean model
(IEM) [29, 30] which models the fact that fluctuations in the composition space
relax to the mean. A more detailed model has been proposed by Curl [31] and
modified by [32, 33] and is used in its modified form in the presented work.
More recently new models based on Euclidian minimum spanning trees have
been developed [34, 35] but are not yet implemented in this work.
As mentioned previously, it is numerically infeasible to solve the PDF trans-
port equation with finite-volume or finite-difference methods because of its
high dimensionality. Therefore a Monte Carlo method is used to solve the
transport equation making use of the fact that the PDF of a fluid flow can be
represented as a sum of δ-functions.
f^*_{U,\Phi}(V, \Psi; x, t) = \sum_{i=1}^{N(t)} \delta\left(V - u_i\right)\, \delta\left(\Psi - \Phi_i\right)\, \delta\left(x - x_i\right)   (14)
k and τ the turbulent kinetic energy and the turbulent time scale, respectively.
Finally the evolution of the composition vector can be calculated as
\frac{d\Psi}{dt} = S + M   (17)
in which S is the chemical source term (appearing in closed form) and M
denotes the effect of molecular mixing. As previously mentioned this term is
unclosed and needs further modeling assumptions. For this a modified Curl
model is used [32].
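As an illustration of such a particle mixing step, the sketch below implements a modified Curl-type model in Python: randomly selected particle pairs move towards their common mean by a random extent. The number of pairs per step, the uniform distribution of the mixing extent and the interface are assumptions of this sketch and not the exact model of [32].

```python
import numpy as np

def modified_curl_mixing(phi, n_pairs, rng=None):
    """phi: (N, n_scalars) array of particle compositions, modified in place."""
    rng = np.random.default_rng() if rng is None else rng
    n = phi.shape[0]
    for _ in range(n_pairs):
        p, q = rng.choice(n, size=2, replace=False)   # pick a random particle pair
        alpha = rng.uniform()                         # random mixing extent in [0, 1)
        mean = 0.5 * (phi[p] + phi[q])
        phi[p] += alpha * (mean - phi[p])
        phi[q] += alpha * (mean - phi[q])
    return phi
```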
[Figure: radial profiles r/(m) of (a) axial velocity u/(m/s) and (b) tangential velocity w/(m/s)]
As an example for the reacting case the calculated temperature field of the
flame is shown in Fig. 7 which can not be compared to quantitative experi-
ments due to the lack of data. But the qualitative behaviour of the flame is
predicted correctly. As one can see the tip of the flame is located at the start
of the inner recirculation zone. It shows a turbulent flame brush in which the reaction occurs, visible in the figure by the rise in temperature. It
can not be assessed whether the thickness of the reaction zone is predicted
well because no measurements of the temperature field are available.
4 Conclusion
Simulations of a premixed swirling methane-air flame are presented. To ac-
count for the strong turbulence-chemistry interaction occurring in these flames, a hybrid finite-volume/transported PDF model is used. This model consists of two parts: a finite-volume solver for the mean velocities and the mean pressure gradient and a Monte Carlo solver for the transport equation of the joint PDF of velocity and composition vector. Chemical kinetics are described by
automatically reduced mechanisms created with the ILDM method.
The presented results show the validity of the model. The simulated veloc-
ity profiles match well with the experimental results. The calculations of the
reacting case also show a qualitatively correct behaviour of the flame. A quan-
titative analysis is subject of future research work.
References
1. S.B. Pope. Pdf methods for turbulent reactive flows. Progress in Energy Com-
bustion Science, 11:119–192, 1985.
2. S.B Pope. Lagrangian pdf methods for turbulent flows. Annual Review of Fluid
Mechanics, 26:23–63, 1994.
3. Z. Ren and S.B. Pope. An investigation of the performance of turbulent mixing
models. Combustion and Flame, 136:208–216, 2004.
4. P.R. Van Slooten and S.B Pope. Application of pdf modeling to swirling and
nonswirling turbulent jets. Flow Turbulence and Combustion, 62(4):295–334,
1999.
5. V. Saxena and S.B Pope. Pdf simulations of turbulent combustion incorporating
detailed chemistry. Combustion and Flame, 117(1-2):340–350, 1999.
6. S. Repp, A. Sadiki, C. Schneider, A. Hinz, T. Landenfeld, and J. Janicka. Pre-
diction of swirling confined diffusion flame with a monte carlo and a presumed-
pdf-model. International Journal of Heat and Mass Transfer, 45:1271–1285,
2002.
7. K. Liu, S.B. Pope, and D.A. Caughey. Calculations of bluff-body stabilized
flames using a joint probability density function model with detailed chemistry.
Combustion and Flame, 141:89–117, 2005.
8. J.H. Ferziger and M. Peric. Computational Methods for Fluid Dynamics.
Springer Verlag, 2 edition, 1997.
9. S.B Pope. A monte carlo method for pdf equations of turbulent reactive flow.
Combustion, Science and Technology, 25:159–174, 1981.
10. P. Jenny, M. Muradoglu, K. Liu, S.B. Pope, and D.A. Caughey. Pdf simulations
of a bluff-body stabilized flow. Journal of Computational Physics, 169:1–23,
2000.
11. A.K. Tolpadi, I.Z. Hu, S.M. Correa, and D.L. Burrus. Coupled lagrangian monte
carlo pdf-cfd computation of gas turbine combustor flowfields with finite-rate
chemistry. Journal of Engineering for Gas Turbines and Power, 119:519–526,
1997.
12. M. Muradoglu, P. Jenny, S.B Pope, and D.A. Caughey. A consistent hybrid
finite-volume/particle method for the pdf equations of turbulent reactive flows.
Journal of Computational Physics, 154:342–370, 1999.
13. M. Muradoglu, S.B. Pope, and D.A. Caughey. The hybid method for the pdf
equations of turbulent reactive flows: Consistency conditions and correction al-
gorithms. Journal of Computational Physics, 172:841–878, 2001.
14. G. Li and M.F. Modest. An effective particle tracing scheme on struc-
tured/unstructured grids in hybrid finite volume/pdf monte carlo methods.
Journal of Computational Physics, 173:187–207, 2001.
15. V. Raman, R.O. Fox, and A.D. Harvey. Hybrid finite-volume/transported pdf
simulations of a partially premixed methane-air flame. Combustion and Flame,
136:327–350, 2004.
16. H.S. Zhang, R.M.C. So, C.G. Speziale, and Y.G. Lai. A near-wall two-equation
model for compressible turbulent flows. In Aerospace Sciences Meeting and Ex-
hibit, 30th, Reno, NV, page 23. AIAA, 1992.
17. U. Maas and S. B. Pope. Simplifying chemical kinetics: Intrinsic low-dimensional
manifolds in composition space. Combustion and Flame, 88:239–264, 1992.
18. U. Maas and S.B. Pope. Implementation of simplified chemical kinetics based
on intrinsic low-dimensional manifolds. In Twenty-Fourth Symposium (Interna-
tional) on Combustion, pages 103–112. The Combustion Institute, 1992.
19. F. Kiesewetter, C. Hirsch, J. Fritz, M. Kröner, and T. Sattelmayer. Two-
dimensional flashback simulation in strongly swirling flows. In Proceedings of
ASME Turbo Expo 2003.
20. M. Kröner. Einfluss lokaler Löschvorgänge auf den Flammenrückschlag durch
verbrennungsinduziertes Wirbelaufplatzen. PhD thesis, Technische Universität
München, Fakultät für Maschinenwesen, 2003.
21. J. Fritz. Flammenrückschlag durch verbrennungsinduziertes Wirbelaufplatzen.
PhD thesis, Technische Universität München, Fakultät für Maschinenwesen,
2003.
22. F. Kiesewetter. Modellierung des verbrennungsinduzierten Wirbelaufplatzens in
Vormischbrennern. PhD thesis, Technische Universität München, Fakultät für
Maschinenwesen, 2005.
23. D.C. Haworth and S.H. El Tahry. Probability density function approach for multidimensional turbulent flow calculations with application to in-cylinder flows in reciprocating engines. AIAA Journal, 29:208, 1991.
24. S.M. Correa and S.B. Pope. Comparison of a monte carlo pdf finite-volume
mean flow model with bluff-body raman data. In Twenty-Fourth Symposium
(International) on Combustion, page 279. The Combustion Institute, 1992.
25. W.C. Welton and S.B. Pope. Pdf model calculations of compressible turbu-
lent flows using smoothed particle hydrodynamics. Journal of Computational
Physics, 134:150, 1997.
26. D.C. Haworth and S.B. Pope. A generalized langevin model for turbulent flows.
Physics of Fluids, 29:387–405, 1986.
27. H.A. Wouters, T.W. Peeters, and D. Roekaerts. On the existence of a generalized
langevin model representation for second-moment closures. Physics of Fluids,
8, 1996.
28. T.D. Dreeben and S.B. Pope. Pdf/monte carlo simulation of near-wall turbulent
flows. Journal of Fluid Mechanics, 357:141–166, 1997.
29. C. Dopazo. Relaxation of initial probability density functions in the turbulent
convection of scalar fields. Physics of Fluids, 22:20–30, 1979.
30. P.A. Libby and F.A. Williams. Turbulent Reacting Flows. Academic Press,
1994.
31. R.L. Curl. Dispersed phase mixing: 1. theory and effects in simple reactors.
A.I.Ch.E. Journal, 9:175,181, 1963.
32. J. Janicka, W. Kolbe, and W. Kollmann. Closure of the transport equation
of the probability density function of turbulent scalar fields. Journal of Non-
Equilibrium Thermodynamics, 4:47–66, 1979.
33. S.B Pope. An improved turbulent mixing model. Combustion, Science and
Technology, 28:131–135, 1982.
34. S. Subramaniam and S.B Pope. A mixing model for turbulent reactive
flows based on euclidean minimum spanning trees. Combustion and Flame,
115(4):487–514, 1999.
35. S. Subramaniam and S.B Pope. Comparison of mixing model performance for
nonpremixed turbulent reactive flow. Combustion and Flame, 117(4):732–754,
1999.
36. J. Fritz, M. Kröner, and T. Sattelmayer. Flashback in a swirl burner with
cylindrical premixing zone. In Proceedings of ASME Turbo Expo 2001.
Supernova Simulations with the Radiation
Hydrodynamics Code
PROMETHEUS/VERTEX
Summary. We give an overview of the problems and the current status of our two-
dimensional (core collapse) supernova modelling, and discuss the system of equations
and the algorithm for its solution that are employed in our code. In particular we
report our recent progress, and focus on the ongoing calculations that are performed
on the NEC SX-8 at the HLRS Stuttgart. We also discuss recent optimizations
carried out within the framework of the Teraflop Workbench, and comment on the
parallel performance of the code, stressing the importance of developing a MPI
version of the employed hydrodynamics module.
1 Introduction
A star more massive than about 8 solar masses ends its life in a cataclysmic
explosion, a supernova. Its quiescent evolution comes to an end, when the
pressure in its inner layers is no longer able to balance the inward pull of
gravity. Throughout its life, the star sustained this balance by generating
energy through a sequence of nuclear fusion reactions, forming increasingly
heavier elements in its core. However, when the core consists mainly of iron-
group nuclei, central energy generation ceases. The fusion reactions producing
iron-group nuclei relocate to the core’s surface, and their “ashes” continuously
increase the core’s mass. Similar to a white dwarf, such a core is stabilised
against gravity by the pressure of its degenerate gas of electrons. However,
to remain stable, its mass must stay smaller than the Chandrasekhar limit.
When the core grows larger than this limit, it collapses to a neutron star, and
a huge amount (∼ 1053 erg) of gravitational binding energy is set free. Most
(∼ 99%) of this energy is radiated away in neutrinos, but a small fraction
is transferred to the outer stellar layers and drives the violent mass ejection
which disrupts the star in a supernova.
Despite 40 years of research, the details of how this energy transfer happens
and how the explosion is initiated are still not well understood. Observational
evidence about the physical processes deep inside the collapsing star is sparse
and almost exclusively indirect. The only direct observational access is via
measurements of neutrinos or gravitational waves. To obtain insight into the
events in the core, one must therefore heavily rely on sophisticated numeri-
cal simulations. The enormous amount of computer power required for this
purpose has led to the use of several, often questionable, approximations and
numerous ambiguous results in the past. Fortunately, however, the develop-
ment of numerical tools and computational resources has meanwhile advanced
to a point, where it is becoming possible to perform multi-dimensional simula-
tions with unprecedented accuracy. Therefore there is hope that the physical
processes which are essential for the explosion can finally be unravelled.
An understanding of the explosion mechanism is required to answer many
important questions of nuclear, gravitational, and astro-physics like the fol-
lowing:
• How do the explosion energy, the explosion timescale, and the mass of
the compact remnant depend on the progenitor’s mass? Is the explosion
mechanism the same for all progenitors? For which stars are black holes
left behind as compact remnants instead of neutron stars?
• What is the role of the – poorly known – equation of state (EoS) for the
proto neutron star? Do softer or stiffer EoSs favour the explosion of a core
collapse supernova?
• What is the role of rotation during the explosion? How rapidly do newly
formed neutron stars rotate?
• How do neutron stars receive their natal kicks? Are they accelerated by
asymmetric mass ejection and/or anisotropic neutrino emission?
• What are the generic properties of the neutrino emission and of the grav-
itational wave signal that are produced during stellar core collapse and
explosion? Up to which distances could these signals be measured with
operating or planned detectors on earth and in space? And what can one
learn about supernova dynamics from a future measurement of such signals
in case of a Galactic supernova?
2 Numerical models
2.1 History and constraints
The currently favoured theoretical paradigm exploits the fact
that a huge energy reservoir is present in the form of neutrinos, which are
abundantly emitted from the hot, nascent neutron star. The absorption of
electron neutrinos and antineutrinos by free nucleons in the post-shock layer
is thought to re-energize the shock and lead to the supernova explosion.
Detailed spherically symmetric hydrodynamic models, which recently in-
clude a very accurate treatment of the time-dependent, multi-flavour, multi-
frequency neutrino transport based on a numerical solution of the Boltzmann
transport equation [1, 2, 3], reveal that this “delayed, neutrino-driven mecha-
nism” does not work as simply as originally envisioned. Although in principle
able to trigger the explosion (e.g., [4], [5], [6]), neutrino energy transfer to the
postshock matter turned out to be too weak. For inverting the infall of the
stellar core and initiating powerful mass ejection, an increase of the efficiency
of neutrino energy deposition is needed.
A number of physical phenomena have been pointed out that can enhance
neutrino energy deposition behind the stalled supernova shock. They are all
linked to the fact that the real world is multi-dimensional instead of spherically
symmetric (or one-dimensional; 1D) as assumed in the work cited above:
(1) Convective instabilities in the neutrino-heated layer between the neutron
star and the supernova shock develop into violent convective overturn [7].
This convective overturn is helpful for the explosion, mainly because (a)
neutrino-heated matter rises and increases the pressure behind the shock,
thus pushing the shock further out, and (b) cool matter is able to pene-
trate closer to the neutron star where it can absorb neutrino energy more
efficiently. Both effects allow multi-dimensional models to explode more easily
than spherically symmetric ones [8, 9, 10].
(2) Recent work [11, 12, 13, 14] has demonstrated that the stalled supernova
shock is also subject to a second non-radial low-mode instability, called
SASI, which can grow to a dipolar, global deformation of the shock [14, 15].
(3) Convective energy transport inside the nascent neutron star [16, 17, 18, 19]
might enhance the energy transport to the neutrinosphere and could thus
boost the neutrino luminosities. This would in turn increase the neutrino-
heating behind the shock.
This list of multi-dimensional phenomena awaits more detailed exploration
in multi-dimensional simulations. Until recently, such simulations have been
performed with only a grossly simplified treatment of the involved micro-
physics, in particular of the neutrino transport and neutrino-matter interac-
tions. At best, grey (i.e., single energy) flux-limited diffusion schemes were
employed. All published successful simulations of supernova explosions by the
convectively aided neutrino-heating mechanism in two [8, 9, 20] and three di-
mensions [21, 22] used such a radical approximation of the neutrino transport.
Since, however, the role of the neutrinos is crucial for the problem, and
because previous experience shows that the outcome of simulations is indeed
very sensitive to the employed transport approximations, studies of the explo-
sion mechanism require the best available description of the neutrino physics.
This implies that one has to solve the Boltzmann transport equation for neu-
trinos.
Fig. 1. The shock position (solid white line) at the north pole (upper panel) and
south pole (lower panel) of the rotating 15 M⊙ model as a function of postbounce
time. Colour-coded is the entropy of the stellar fluid.
Fig. 2. The ratio of the advection timescale to the heating timescale for the rotating
model L&S-rot and the non-rotating model L&S-2D. Also shown is model L&S-
rot-90 which is identical to model L&S-rot except for the computational domain
that does not extend from pole to pole but from the north pole to the equator.
The advection timescale is the characteristic timescale that matter stays inside the
heating region before it is advected to the proto-neutron star. The heating timescale
is the typical timescale that matter needs to be exposed to neutrino heating in
order to absorb enough energy to become gravitationally unbound.
The crucial quantity required to determine the source terms for the energy,
momentum, and electron fraction of the fluid owing to its interaction with the
neutrinos is the neutrino distribution function in phase space,
$f(r,\vartheta,\phi,\epsilon,\Theta,\Phi,t)$. Equivalently, the neutrino intensity
$I = c/(2\pi\hbar c)^3 \cdot \epsilon^3 f$ may be used. Both are seven-dimensional
functions, as they describe, at every point in space $(r,\vartheta,\phi)$, the
distribution of neutrinos propagating with energy $\epsilon$ into the direction
$(\Theta,\Phi)$ at time $t$ (Fig. 3).
The evolution of I (or f ) in time is governed by the Boltzmann equation,
and solving this equation is, in general, a six-dimensional problem (as time
is usually not counted as a separate dimension). A solution of this equation
by direct discretisation (using an $S_N$ scheme) would require computational
resources in the Petaflop range. Although there are attempts by at least one
group in the United States to follow such an approach, we feel that, with the
currently available computational resources, it is mandatory to reduce the
dimensionality of the problem.
Actually this should be possible, since the source terms entering the hy-
drodynamic equations are integrals of I over momentum space (i.e. over ǫ, Θ,
and Φ), and thus only a fraction of the information contained in I is truly
required to compute the dynamics of the flow. It makes therefore sense to
consider angular moments of I, and to solve evolution equations for these
moments, instead of dealing with the Boltzmann equation directly. The 0th
to 3rd order moments are defined as
$$\{J, H, K, L, \dots\}(r,\vartheta,\phi,\epsilon,t) = \frac{1}{4\pi}\int I(r,\vartheta,\phi,\epsilon,\Theta,\Phi,t)\,\mathbf{n}^{0,1,2,3,\dots}\,\mathrm{d}\Omega \qquad (1)$$
where $\mathrm{d}\Omega = \sin\Theta\,\mathrm{d}\Theta\,\mathrm{d}\Phi$,
$\mathbf{n} = (\cos\Theta,\,\sin\Theta\cos\Phi,\,\sin\Theta\sin\Phi)$, and exponentiation
represents repeated application of the dyadic product. Note that the
moments are tensors of the required rank.
This leaves us with a four-dimensional problem. So far no approximations
have been made. In order to reduce the size of the problem even further,
Fig. 3. Illustration of the phase space coordinates (see the main text).
one needs to resort to assumptions on its symmetry. At this point, one usu-
ally employs azimuthal symmetry for the stellar matter distribution, i.e. any
dependence on the azimuth angle φ is ignored, which implies that the hydro-
dynamics of the problem can be treated in two dimensions. It also implies
I(r, ϑ, ǫ, Θ, Φ) = I(r, ϑ, ǫ, Θ, −Φ). If, in addition, it is assumed that I is even
independent of Φ, then each of the angular moments of I becomes a scalar,
which depends on two spatial dimensions, and one dimension in momentum
space: J, H, K, L = J, H, K, L(r, ϑ, ǫ, t). Thus we have reduced the problem to
three dimensions in total.
$$\left(\frac{1}{c}\frac{\partial}{\partial t} + \beta_r\frac{\partial}{\partial r} + \frac{\beta_\vartheta}{r}\frac{\partial}{\partial\vartheta}\right) H + H\left(\frac{1}{r^2}\frac{\partial(r^2\beta_r)}{\partial r} + \frac{1}{r\sin\vartheta}\frac{\partial(\sin\vartheta\,\beta_\vartheta)}{\partial\vartheta}\right)$$
$$+\ \frac{\partial K}{\partial r} + \frac{3K-J}{r} + H\frac{\partial\beta_r}{\partial r} + \frac{\beta_r}{c}\frac{\partial K}{\partial t} - \frac{\partial}{\partial\epsilon}\left(\frac{\epsilon}{c}\frac{\partial\beta_r}{\partial t}K\right)$$
$$-\ \frac{\partial}{\partial\epsilon}\left[\epsilon L\left(\frac{\partial\beta_r}{\partial r} - \frac{\beta_r}{r} - \frac{1}{2r\sin\vartheta}\frac{\partial(\sin\vartheta\,\beta_\vartheta)}{\partial\vartheta}\right)\right]$$
$$-\ \frac{\partial}{\partial\epsilon}\left[\epsilon H\left(\frac{\beta_r}{r} + \frac{1}{2r\sin\vartheta}\frac{\partial(\sin\vartheta\,\beta_\vartheta)}{\partial\vartheta}\right)\right] + \frac{1}{c}\frac{\partial\beta_r}{\partial t}\,(J+K) = C^{(1)}. \qquad (3)$$
These are evolution equations for the neutrino energy density, J, and the
neutrino flux, H, and follow from the zeroth and first moment equations
of the comoving frame (Boltzmann) transport equation in the Newtonian,
O(v/c) approximation. The quantities C (0) and C (1) are source terms that
result from the collision term of the Boltzmann equation, while βr = vr /c and
βϑ = vϑ /c, where vr and vϑ are the components of the hydrodynamic veloc-
ity, and c is the speed of light. The functional dependences βr = βr (r, ϑ, t),
J = J(r, ϑ, ǫ, t), etc. are suppressed in the notation. This system includes four
unknown moments (J, H, K, L) but only two equations, and thus needs to be
supplemented by two more relations. This is done by substituting K = fK · J
and L = fL · J, where fK and fL are the variable Eddington factors, which
for the moment may be regarded as being known, but in our case are indeed
determined from the formal solution of a simplified (“model”) Boltzmann
equation. For the adopted coordinates, this amounts to the solution of inde-
pendent one-dimensional PDEs (typically more than 200 for each ray), hence
very efficient vectorization is possible [23].
A finite volume discretisation of Eqs. (2–3) is sufficient to guarantee exact
conservation of the total neutrino energy. However, and as described in detail
in [23], it is not sufficient to guarantee also exact conservation of the neutrino
number. To achieve this, we discretise and solve a set of two additional equa-
tions. With $\mathcal{J} = J/\epsilon$, $\mathcal{H} = H/\epsilon$, $\mathcal{K} = K/\epsilon$, and $\mathcal{L} = L/\epsilon$, this set of equations reads
$$\left(\frac{1}{c}\frac{\partial}{\partial t} + \beta_r\frac{\partial}{\partial r} + \frac{\beta_\vartheta}{r}\frac{\partial}{\partial\vartheta}\right)\mathcal{J} + \mathcal{J}\left(\frac{1}{r^2}\frac{\partial(r^2\beta_r)}{\partial r} + \frac{1}{r\sin\vartheta}\frac{\partial(\sin\vartheta\,\beta_\vartheta)}{\partial\vartheta}\right)$$
$$+\ \frac{1}{r^2}\frac{\partial(r^2\mathcal{H})}{\partial r} + \frac{\beta_r}{c}\frac{\partial\mathcal{H}}{\partial t} - \frac{\partial}{\partial\epsilon}\left(\frac{\epsilon}{c}\frac{\partial\beta_r}{\partial t}\mathcal{H}\right) - \frac{\partial}{\partial\epsilon}\left[\epsilon\mathcal{J}\left(\frac{\beta_r}{r} + \frac{1}{2r\sin\vartheta}\frac{\partial(\sin\vartheta\,\beta_\vartheta)}{\partial\vartheta}\right)\right]$$
$$-\ \frac{\partial}{\partial\epsilon}\left[\epsilon\mathcal{K}\left(\frac{\partial\beta_r}{\partial r} - \frac{\beta_r}{r} - \frac{1}{2r\sin\vartheta}\frac{\partial(\sin\vartheta\,\beta_\vartheta)}{\partial\vartheta}\right)\right] + \frac{1}{c}\frac{\partial\beta_r}{\partial t}\mathcal{H} = C^{(0)}, \qquad (4)$$
$$\left(\frac{1}{c}\frac{\partial}{\partial t} + \beta_r\frac{\partial}{\partial r} + \frac{\beta_\vartheta}{r}\frac{\partial}{\partial\vartheta}\right)\mathcal{H} + \mathcal{H}\left(\frac{1}{r^2}\frac{\partial(r^2\beta_r)}{\partial r} + \frac{1}{r\sin\vartheta}\frac{\partial(\sin\vartheta\,\beta_\vartheta)}{\partial\vartheta}\right)$$
$$+\ \frac{\partial\mathcal{K}}{\partial r} + \frac{3\mathcal{K}-\mathcal{J}}{r} + \mathcal{H}\frac{\partial\beta_r}{\partial r} + \frac{\beta_r}{c}\frac{\partial\mathcal{K}}{\partial t} - \frac{\partial}{\partial\epsilon}\left(\frac{\epsilon}{c}\frac{\partial\beta_r}{\partial t}\mathcal{K}\right)$$
$$-\ \frac{\partial}{\partial\epsilon}\left[\epsilon\mathcal{L}\left(\frac{\partial\beta_r}{\partial r} - \frac{\beta_r}{r} - \frac{1}{2r\sin\vartheta}\frac{\partial(\sin\vartheta\,\beta_\vartheta)}{\partial\vartheta}\right)\right] - \frac{\partial}{\partial\epsilon}\left[\epsilon\mathcal{H}\left(\frac{\beta_r}{r} + \frac{1}{2r\sin\vartheta}\frac{\partial(\sin\vartheta\,\beta_\vartheta)}{\partial\vartheta}\right)\right]$$
$$-\ \mathcal{L}\left(\frac{\partial\beta_r}{\partial r} - \frac{\beta_r}{r} - \frac{1}{2r\sin\vartheta}\frac{\partial(\sin\vartheta\,\beta_\vartheta)}{\partial\vartheta}\right) - \mathcal{H}\left(\frac{\beta_r}{r} + \frac{1}{2r\sin\vartheta}\frac{\partial(\sin\vartheta\,\beta_\vartheta)}{\partial\vartheta}\right) + \frac{1}{c}\frac{\partial\beta_r}{\partial t}\mathcal{J} = C^{(1)}. \qquad (5)$$
The moment equations (2–5) are very similar to the O(v/c) equations in spher-
ical symmetry which were solved in the 1D simulations of [23] (see Eqs. 7,8,30,
and 31 of the latter work). This similarity has allowed us to reuse a good
fraction of the one-dimensional version of VERTEX, for coding the multi-
dimensional algorithm. The additional terms necessary for this purpose have
been set in boldface above.
Finally, the changes of the energy, e, and electron fraction, Ye , required
for the hydrodynamics are given by the following two equations
$$\frac{\mathrm{d}e}{\mathrm{d}t} = -\frac{4\pi}{\rho}\int_0^\infty \mathrm{d}\epsilon \sum_{\nu\in(\nu_e,\bar\nu_e,\dots)} C^{(0)}_\nu(\epsilon), \qquad (6)$$
$$\frac{\mathrm{d}Y_e}{\mathrm{d}t} = -\frac{4\pi\,m_{\mathrm{B}}}{\rho}\int_0^\infty \mathrm{d}\epsilon\,\left(C^{(0)}_{\nu_e}(\epsilon) - C^{(0)}_{\bar\nu_e}(\epsilon)\right) \qquad (7)$$
(for the momentum source terms due to neutrinos see [25]). Here mB is the
baryon mass, and the sum in Eq. (6) runs over all neutrino types. The full
system consisting of Eqs. (2–7) is stiff, and thus requires an appropriate dis-
cretisation scheme for its stable solution.
Method of solution
In order to discretise Eqs. (2–7), the spatial domain [0, rmax ] × [ϑmin , ϑmax ] is
covered by Nr radial, and Nϑ angular zones, where ϑmin = 0 and ϑmax = π
correspond to the north and south poles, respectively, of the spherical grid.
(In general, we allow for grids with different radial resolutions in the neutrino
transport and hydrodynamic parts of the code. The number of radial zones
for the hydrodynamics will be denoted by Nrhyd .) The number of bins used
in energy space is Nǫ and the number of neutrino types taken into account is
Nν .
The equations are solved in two operator-split steps corresponding to a
lateral and a radial sweep.
In the first step, we treat the boldface terms in the respectively first lines
of Eqs. (2–5), which describe the lateral advection of the neutrinos with the
stellar fluid, and thus couple the angular moments of the neutrino distribution
of neighbouring angular zones. For this purpose we consider the equation
$$\frac{1}{c}\frac{\partial\Xi}{\partial t} + \frac{1}{r\sin\vartheta}\frac{\partial(\sin\vartheta\,\beta_\vartheta\,\Xi)}{\partial\vartheta} = 0, \qquad (8)$$
where $\Xi$ represents one of the moments $J$, $H$, $\mathcal{J}$, or $\mathcal{H}$. Although it has been
suppressed in the above notation, an equation of this form has to be solved
for each radius, for each energy bin, and for each type of neutrino. An explicit
upwind scheme is used for this purpose.
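As an illustration, a minimal sketch of such an explicit first-order upwind update of Eq. (8) could look as follows (Python/NumPy; the restriction to a single radius, energy bin and neutrino type, the array layout, and the neglect of the polar boundary treatment are illustrative assumptions, not the actual VERTEX implementation):

```python
import numpy as np

def lateral_advection_step(xi, beta_theta, theta, r, c, dt):
    """One explicit first-order upwind update of Eq. (8) for a single radius,
    energy bin and neutrino type (sketch only).

    xi         : moment values on the angular grid (length N_theta)
    beta_theta : lateral velocity v_theta / c on the same grid
    theta      : angular zone centres in radians (uniform spacing, poles excluded)
    r, c, dt   : radius of the shell, speed of light, time step
    """
    dtheta = theta[1] - theta[0]
    # flux F = sin(theta) * beta_theta * Xi at the zone centres
    flux = np.sin(theta) * beta_theta * xi
    # upwind differencing: take the difference towards the side the flow comes from
    # (np.roll gives periodic neighbours; real boundaries would need special care)
    dflux = np.where(beta_theta > 0.0,
                     flux - np.roll(flux, 1),
                     np.roll(flux, -1) - flux)
    # Eq. (8): (1/c) dXi/dt + 1/(r sin(theta)) d(sin(theta) beta_theta Xi)/dtheta = 0
    return xi - c * dt * dflux / (r * np.sin(theta) * dtheta)
```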
In the second step, the radial sweep is performed. Several points need to
be noted here:
• terms in boldface not yet taken into account in the lateral sweep need to
be included in the discretisation scheme of the radial sweep. This can be
done in a straightforward way since these remaining terms do not include
derivatives of the transport variables (J, H) or (J , H). They only depend
on the hydrodynamic velocity vϑ , which is a constant scalar field for the
transport problem.
• the right hand sides (source terms) of the equations and the coupling in
energy space have to be accounted for. The coupling in energy is non-local,
since the source terms of Eqs. (2–5) stem from the Boltzmann equation,
which is an integro-differential equation and couples all the energy bins.
• the discretisation scheme for the radial sweep is implicit in time. Explicit
schemes would require very small time steps to cope with the stiffness of
the source terms in the optically thick regime, and the small CFL time step
dictated by neutrino propagation with the speed of light in the optically
thin regime. Still, even with an implicit scheme, of the order of $10^5$ time steps
are required per simulation. This makes the calculations expensive.
Once the equations for the radial sweep have been discretized in radius and
energy, the resulting solver is applied ray-by-ray for each angle ϑ and for each
If the sub-diagonal matrix blocks $A_i$ and $B_i$ are eliminated and the diagonal
matrix blocks $C_i$ are inverted, one would obtain a system of the form
$$x_i + Y_i x_{i+1} + Z_i x_{i+2} = r_i, \quad 1 \le i \le n-2,$$
$$x_{n-1} + Y_{n-1} x_n = r_{n-1}, \qquad (11)$$
$$x_n = r_n.$$
Applying the Thomas algorithm signifies that the variables $Y_i$, $Z_i$ and $r_i$
are calculated by substituting $x_{i-2}$ and $x_{i-1}$ in (10) using the appropriate
equations of (11) and comparing coefficients. This results in
$$Y_i = G_i^{-1}(D_i - K_i Z_{i-1}),$$
$$Z_i = G_i^{-1} E_i, \qquad (12)$$
$$r_i = G_i^{-1}(f_i - A_i r_{i-2} - K_i r_{i-1})$$
for $i = 1, \dots, n$, where
$$K_i = B_i - A_i Y_{i-2},$$
$$G_i = C_i - K_i Y_{i-1} - A_i Z_{i-2}, \qquad (13)$$
and $Y_{-1}$, $Z_{-1}$, $Y_0$, and $Z_0$ are set to zero. Backward substitution
$$x_n = r_n,$$
$$x_{n-1} = r_{n-1} - Y_{n-1} x_n, \qquad (14)$$
$$x_i = r_i - Y_i x_{i+1} - Z_i x_{i+2}, \quad i = n-2, \dots, 1$$
yields the solution $x$.
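A straightforward, non-vectorized re-implementation of the recursions (12)–(14) could look like the following sketch (Python/NumPy; the block layout, the requirement that unused boundary blocks be zero, and the use of np.linalg.solve in place of the optimized block inversion are assumptions of this sketch, not the solver described in the text):

```python
import numpy as np

def solve_block_pentadiagonal(A, B, C, D, E, f):
    """Block-Thomas solver following Eqs. (11)-(14), 0-based indexing.

    A, B, C, D, E : arrays of shape (n, k, k) with the second sub-, sub-,
                    main, super- and second super-diagonal blocks (blocks
                    that fall outside the matrix should be set to zero).
    f             : right-hand side, shape (n, k).
    """
    n, k, _ = C.shape
    Y = np.zeros((n, k, k)); Z = np.zeros((n, k, k)); r = np.zeros((n, k))
    zero_m, zero_v = np.zeros((k, k)), np.zeros(k)

    for i in range(n):                                   # forward elimination
        Ym2 = Y[i - 2] if i >= 2 else zero_m
        Zm2 = Z[i - 2] if i >= 2 else zero_m
        Ym1 = Y[i - 1] if i >= 1 else zero_m
        Zm1 = Z[i - 1] if i >= 1 else zero_m
        rm2 = r[i - 2] if i >= 2 else zero_v
        rm1 = r[i - 1] if i >= 1 else zero_v

        K = B[i] - A[i] @ Ym2                            # Eq. (13)
        G = C[i] - K @ Ym1 - A[i] @ Zm2                  # Eq. (13)
        Y[i] = np.linalg.solve(G, D[i] - K @ Zm1)        # Eq. (12)
        Z[i] = np.linalg.solve(G, E[i])                  # Eq. (12)
        r[i] = np.linalg.solve(G, f[i] - A[i] @ rm2 - K @ rm1)

    x = np.zeros((n, k))                                 # backward substitution
    x[n - 1] = r[n - 1]                                  # Eq. (14)
    if n >= 2:
        x[n - 2] = r[n - 2] - Y[n - 2] @ x[n - 1]
    for i in range(n - 3, -1, -1):
        x[i] = r[i] - Y[i] @ x[i + 1] - Z[i] @ x[i + 2]
    return x
```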
The compact multiplication scheme and the improved solver for (16) are integrated
into a new BPD solver. Its execution times are compared to those of a traditional
BLAS/LAPACK solver in Table 1 for 100 systems with block sizes k = 20, 55 and 85
and n = 500 block rows and columns. The diagonals of the BPD matrix are stored
as five vectors of matrix blocks.
5 Parallelization
The ray-by-ray approximation readily lends itself to parallelization over
the different angular zones. For the radial transport sweep, we presently
use an OpenMP/MPI hybrid approach, while the hydrodynamics module
PROMETHEUS can only exploit shared-memory parallelism as yet. For a
small number of MPI processes, this does not severely affect parallel scaling,
as the neutrino transport takes 90% to 99% (heavily model-dependent) of
the total serial time. This is a reasonable strategy for systems with a large
number of processors per shared-memory node on which our code has been
used in the past, such as the IBM Regatta at the Rechenzentrum Garching of
the Max-Planck-Gesellschaft (32 processors per node) or the Altix 3700 Bx2
(MPA, ccNUMA architecture with 112 processors). However, this approach
does not allow us to fully exploit the capabilities of the NEC SX-8 with its 8
CPUs per node. While the neutrino transport algorithm can be expected to
exhibit good scaling for up to Nϑ processors (128-256 for a typical setup), the
lack of MPI parallelism in PROMETHEUS prohibits the use of more than
Table 1. Execution times for the BPD solver for 100 systems with n = 500
k= 20 55 85
BLAS + LAPACK [s] 6.43 33.80 76.51
comp. mult. + new solver [s] 3.79 23.10 55.10
decrease in runtime [%] 54.5 42.6 35.4
four nodes. Full MPI functionality is clearly desirable, as it could reduce the
turnaround time by another factor of 3–4 on the SX-8. As the code already
profits from the vector capabilities of NEC machines, this amounts to a run-
time of several weeks as compared to more than a year required on the scalar
machines mentioned before, i.e. the overall reduction is even larger. For this
reason, an MPI version of the hydrodynamics part is currently being developed
within the framework of the Teraflop Workbench.
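The ray-by-ray decomposition itself is simple to express; as a hedged sketch of the idea (using mpi4py, with a hypothetical per-ray solver radial_sweep standing in for the transport kernel — this is not the code's actual hybrid OpenMP/MPI implementation):

```python
from mpi4py import MPI

def transport_step(hydro_state, n_theta, radial_sweep):
    """Distribute the N_theta angular rays over MPI ranks; each rank solves
    its independent radial transport sweeps (illustrative sketch only)."""
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    my_rays = range(rank, n_theta, size)          # simple cyclic distribution
    local = {j: radial_sweep(hydro_state, j) for j in my_rays}

    # gather the per-ray neutrino source terms on every rank so that the
    # (shared-memory) hydrodynamics module can use them afterwards
    sources = {}
    for part in comm.allgather(local):
        sources.update(part)
    return sources
```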
6 Conclusions
After reporting on recent developments in supernova modelling and briefly
describing the numerics of the ray-by-ray method employed in our code
PROMETHEUS/VERTEX, we addressed the issue of serial optimization. We
presented benchmarks for the improved implementation of the Block-Thomas
algorithm, finding reductions in runtime of about 1/3 or more for the relevant
block sizes. Finally, we discussed the limitations of the current parallelization
approach and emphasized the potential and importance of a fully MPI-capable
version of the code.
Acknowledgements
Support from the SFB 375 “Astroparticle Physics”, SFB/TR7 “Gravitation-
swellenastronomie”, and SFB/TR27 “Neutrinos and Beyond” of the Deutsche
Forschungsgemeinschaft, and computer time at the HLRS and the Rechenzen-
trum Garching are acknowledged.
References
1. Rampp, M., Janka, H.T.: Spherically Symmetric Simulation with Boltzmann
Neutrino Transport of Core Collapse and Postbounce Evolution of a 15 M⊙
Star. Astrophys. J. 539 (2000) L33–L36
2. Mezzacappa, A., Liebendörfer, M., Messer, O.E., Hix, W.R., Thielemann, F.,
Bruenn, S.W.: Simulation of the Spherically Symmetric Stellar Core Collapse,
Bounce, and Postbounce Evolution of a Star of 13 Solar Masses with Boltzmann
Neutrino Transport, and Its Implications for the Supernova Mechanism. Phys.
Rev. Letters 86 (2001) 1935–1938
3. Liebendörfer, M., Mezzacappa, A., Thielemann, F., Messer, O.E., Hix, W.R.,
Bruenn, S.W.: Probing the gravitational well: No supernova explosion in spher-
ical symmetry with general relativistic Boltzmann neutrino transport. Phys.
Rev. D 63 (2001) 103004–+
4. Bethe, H.A.: Supernova mechanisms. Reviews of Modern Physics 62 (1990)
801–866
Numerical method and results for a 15 M⊙ star. Astron. Astrophys. 447 (2006)
1049–1092
26. Buras, R., Janka, H.T., Rampp, M., Kifonidis, K.: Two–dimensional hydrody-
namic core–collapse supernova simulations with spectral neutrino transport. II.
Models for different progenitor stars. Astron. Astrophys. 457 (2006) 281–308
27. Müller, E., Rampp, M., Buras, R., Janka, H.T., Shoemaker, D.H.: Toward
Gravitational Wave Signals from Realistic Core-Collapse Supernova Models.
Astrophys. J. 603 (2004) 221–230
28. Kitaura, F.S., Janka, H.T., Hillebrandt, W.: Explosions of O–Ne–Mg cores,
the Crab supernova, and subluminous type II–P supernovae. Astron. Astro-
phys. 450 (2006) 345–350
29. Marek, A., Kifonidis, K., Janka, H.T., Müller, B.: The SuperN-project: Under-
standing core collapse supernovae. In Nagel, W.E., Jäger, W., Resch, M., eds.:
High Performance Computing in Science and Engineering 06, Berlin, Springer
(2006)
30. Colella, P., Woodward, P.R.: The piecewise parabolic method for gas-dynamical
simulations. Jour. of Comp. Phys. 54 (1984) 174
31. Fryxell, B.A., Müller, E., Arnett, W.D.: Hydrodynamics and nuclear burning.
Max-Planck-Institut für Astrophysik, Preprint 449 (1989)
32. Thomas, L.H.: Elliptic problems in linear difference equations over a network.
Watson Sci. Comput. Lab. Rept., Columbia University, New York (1949)
33. Bruce, G.H., Peaceman, D.W., Jr. Rachford, H.H., Rice, J.D.: Calculations of
unsteady-state gas flow through porous media. Petrol. Trans. AIME 198 (1953)
79–92
34. Benkert, K., Fischer, R.: An efficient implementation of the Thomas-algorithm
for block penta-diagonal systems on vector computers. In Shi, Y., van Albada,
G.D., Dongarra, J., Sloot, P.M., eds.: Computational Science – ICCS 2007.
Volume 4487 of LNCS., Springer (2007) 144–151
35. Anderson, E., Blackford, L.S., Sorensen, D., eds.: Lapack User’s Guide. Society
for Industrial & Applied Mathematics (2000)
Green Chemistry from Supercomputers:
Car–Parrinello Simulations of
Emim-Chloroaluminates Ionic Liquids
1 Introduction
Ionic liquids (IL) or room temperature molten salts are alternatives to “more
toxic” liquids. [1] Their solvent properties can be adjusted to the particular
problem by combining the right cation with the right anion, which makes
them designer liquids. Usually an ionic liquid is formed by an organic cation
combined with an inorganic anion. [2, 3] For a more detailed discussion on
the definition we refer to the following review articles. [4, 5, 6]
Despite this continuing interest in ionic liquids, their fundamental
properties and microscopic behavior are still only poorly understood, and
unresolved questions regarding these liquids are still controversially discussed.
A large contribution to the understanding of the microscopic aspects can
come from the investigation of these liquids by means of theoretical meth-
ods. [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]
In our project AIMD-IL at HLRS/NEC SX-8 we have investigated a
prototypical ionic liquid using ab initio molecular dynamics methods, where
the interaction between the ions is solved by explicitly treating the electronic
structure during the simulation. The huge investment in terms of computing
time is more than justified by the increased accuracy and reliability
compared to simulations employing parameterized, classical potentials.
In this summary we will describe the results obtained within our project of
a Car–Parrinello simulation of 1-ethyl-3-methylimidazolium ([C2 C1 im]+ , see
Fig. 1) chloroaluminates ionic liquids; for a snapshot of the liquid see Fig. 2.
Depending on the mole fraction of the AlCl3 to [C2 C1 im]Cl these liquids
can behave from acidic to basic. Welton describes the nomenclature of these
fascinating liquids in his review article as follows: [5] “Since Cl− is a Lewis base
and [Al2 Cl7 ]− and [Al3 Cl10 ]− are both Lewis acids, the Lewis acidity/basicity
of the ionic liquid may be manipulated by altering its composition. This leads
Fig. 2. A snapshot from the Car–Parrinello simulation of the “neutral” ionic liquid
[C2 C1 im]AlCl4 . Left panel: The system in atomistic resolution. Blue spheres: nitro-
gen; cyan: carbon; white: hydrogen; silver: aluminium; green: chlorine. Right panel:
Center of mass of [C2C1im]+ (white spheres) and AlCl4− (green spheres)
2 Method
Density functional theory (DFT) [26, 27] is nowadays the most-widely used
electronic-structure method. DFT combines reasonable accuracy in several
different chemical environments with minimal computational effort.
The most frequently applied form of DFT is the Kohn–Sham method.
There one solves the set of equations
$$\left[-\frac{1}{2}\nabla^2 + V_{\mathrm{KS}}[n](\mathbf{r})\right]\psi_i(\mathbf{r}) = \varepsilon_i\,\psi_i(\mathbf{r})$$
$$n(\mathbf{r}) = \sum_i |\psi_i(\mathbf{r})|^2$$
$$V_{\mathrm{KS}}[n](\mathbf{r}) = V_{\mathrm{ext}}(\{\mathbf{R}_I\}) + V_{\mathrm{H}}(\mathbf{r}) + V_{\mathrm{xc}}[n](\mathbf{r})$$
Here ψi (r) are the Kohn–Sham orbitals, or the wave functions of the elec-
trons; εi are the Kohn–Sham eigenvalues, n (r) the electron density (can be
interpreted also as the probability of finding an electron at position r) and
VKS [n] (r) is the Kohn–Sham potential, consisting of the attractive interac-
tion with the ions in Vext ({RI }), the electron-electron repulsion VH (r) and
the so-called exchange-correlation potential Vxc [n] (r).
The Kohn–Sham equations are in principle exact. However, whereas the
analytic expression for the exchange term is known, this is not the case for the
correlation, and even the exact expression for the exchange is too involved
to be evaluated in practical calculations for large systems. Thus one is forced
to rely on approximations. The most widely used one is the generalized gradient
approximation, GGA, where one at a given point includes not only the magnitude
of the density – as in the local density approximation, LDA – but also its first
gradient as an input variable for the approximate exchange-correlation functional.
In order to solve the Kohn–Sham equations with the aid of computers
they have to be discretiszed using a basis set. A straight-forward choice is
to sample the wave functions on a real-space grid at points {r}. Another
approach, widely used in condensed phase systems, is the expansion in the
plane wave basis set,
$$\psi_i(\mathbf{r}) = \sum_{\mathbf{G}} c_i(\mathbf{G})\, e^{i\mathbf{G}\cdot\mathbf{r}}$$
Here G are the wave vectors, whose possible values are given by the unit cell
of the simulation.
One of the advantages of the plane wave basis set is that there is only one
parameter controlling the quality of the basis set. This is the so-called cut-off
energy Ecut : All the plane waves within a given radius from the origin,
$$\frac{1}{2}|\mathbf{G}|^2 < E_{\mathrm{cut}},$$
are included in the basis set. The typical number of plane wave coefficients in
practice is of the order of $10^5$ per electronic orbital.
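To illustrate that the cut-off energy is the single convergence parameter, the following sketch estimates how many plane waves fall below a given cut-off for a cubic cell (Python/NumPy; it assumes Γ-point sampling, a simple cubic cell and Hartree atomic units, and it ignores the distinction between wave-function and density cut-offs made in real codes):

```python
import numpy as np

def count_plane_waves(a_bohr, ecut_ry):
    """Count reciprocal-lattice vectors G with |G|^2 / 2 < E_cut for a cubic
    cell of edge a_bohr (illustrative estimate only)."""
    ecut_ha = 0.5 * ecut_ry                 # 1 Ry = 0.5 Hartree
    b = 2.0 * np.pi / a_bohr                # reciprocal-lattice spacing
    nmax = int(np.ceil(np.sqrt(2.0 * ecut_ha) / b))
    n = np.arange(-nmax, nmax + 1)
    nx, ny, nz = np.meshgrid(n, n, n, indexing="ij")
    g2 = b**2 * (nx**2 + ny**2 + nz**2)
    return int(np.sum(0.5 * g2 < ecut_ha))

# e.g. for the 22.577 Å cell at 70 Ry: count_plane_waves(22.577 / 0.529177, 70.0)
```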
The use of plane waves necessitates a reconsideration of the spiked external
potential due to the ions, −Z/r. The standard solution is to use pseudo poten-
tials instead of these hard, very strongly changing functions around the nuclei
[28]. This is a well controlled approximation, and reliable pseudo potentials
are available for most of the elements in the periodic table.
When the plane wave expansion of the wave functions is inserted into the
Kohn–Sham equations it becomes obvious that some of the terms are most
efficiently evaluated in the reciprocal space, whereas other terms are better
executed in real space. Thus it is advantageous to use fast Fourier transforms
(FFT) to exchange between the two spaces. Because one usually wants to
study realistic, three-dimensional models, the FFT in the DFT codes is also
three dimensional. This can, however, be considered as three subsequent one-
dimensional FFT’s with two transpositions between the application of the
FFT in the different directions.
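The decomposition of the 3D FFT into batches of 1D FFTs with intermediate transpositions can be sketched as follows (Python/NumPy, single process; the MPI data redistribution that a distributed plane-wave code performs between the stages is not shown):

```python
import numpy as np

def fft3d_via_1d(data):
    """3D FFT built from three batches of 1D FFTs with transpositions in
    between; equivalent to np.fft.fftn(data) for a 3D array (sketch only)."""
    out = np.fft.fft(data, axis=2)          # 1D FFTs along z
    out = out.transpose(0, 2, 1)            # bring y into the last axis
    out = np.fft.fft(out, axis=2)           # 1D FFTs along y
    out = out.transpose(2, 1, 0)            # bring x into the last axis
    out = np.fft.fft(out, axis=2)           # 1D FFTs along x
    return out.transpose(2, 0, 1)           # restore the (x, y, z) ordering

# np.allclose(fft3d_via_1d(a), np.fft.fftn(a)) holds for a 3D array a
```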
The numerical effort of applying a DFT plane wave code mainly consists of
basic linear algebra subprograms (BLAS) and fast Fourier transform (FFT)
operations. The former generally require relatively little communication.
The latter, however, require more complicated communication patterns,
since in larger systems the data on which the FFT is performed needs to be
distributed over the processors. Yet the parallelisation is quite straightforward
and can yield an efficient implementation, as recently demonstrated on IBM
Blue Gene machines [29]; combined with a suitable grouping of the FFT's
one can achieve good scaling up to tens of thousands of processors with the
computer code CPMD. [30]
Car–Parrinello method
where RI is the coordinate of ion I, μ is the fictitious electron mass, the dots
denote time derivatives, EKS is the Kohn–Sham total energy of the system
and the holonomic constraints keep the Kohn–Sham orbitals orthonormal as
required by the Pauli exclusion principle. From the Lagrangean the equations
of motion can be derived via the Euler–Lagrange equations:
$$M_I\,\ddot{\mathbf{R}}_I(t) = -\frac{\partial E_{\mathrm{KS}}}{\partial \mathbf{R}_I}$$
$$\mu\,\ddot{\psi}_i = -\frac{\delta E_{\mathrm{KS}}}{\delta\langle\psi_i|} + \frac{\delta}{\delta\langle\psi_i|}\{\text{constraints}\} \qquad (2)$$
For the simulations we used density functional theory with the generalized
gradient approximation of Perdew, Burke and Ernzerhof, PBE [31] as the
exchange-correlation term in the Kohn–Sham equations, and we replaced the
action of the core electrons on the valence orbitals with norm-conserving
pseudo potentials of Troullier-Martins type [32]; they are the same ones as
in [33] for Al and Cl. We expanded the wave functions with plane waves up
to the cut-off energy of 70 Ry. We sampled the Brillouin zone at the Γ point,
employing periodic boundary conditions.
We performed the simulations in the NVT ensemble, employing a Nosé-
Hoover thermostat at a target temperature of 300 K and a characteristic fre-
quency of 595 cm−1 , a stretching mode of the AlCl3 molecules. We propagated
the velocity Verlet equations of motion with a time step of 5 a.t.u. = 0.121 fs,
and the fictitious mass in the Car–Parrinello dynamics for the electrons was
700 a.u. A cubic simulation cell with an edge length of 22.577 Å containing 32
cation–anion pairs was used, corresponding to the experimental density of
1.293 g/cm³. We ran our trajectory employing Car–Parrinello molecular
dynamics for 20 ps.
Fig. 3. The radial distribution function of the AlCl4− anion (bold lines) together
with the corresponding function from the pure AlCl3 simulations (dotted lines) of
Ref. [33]. Distances are in Å. Black: Al-Al; red: Al-Cl; blue: Cl-Cl
is because only monomer units ((AlCl3)nCl− with n = 1) exist and these
monomers are separated from each other by the cations. The more structured
functions of the ionic liquid can be attributed to the lower temperature
at which it was simulated. The first Al-Cl peak (black solid line) appears
at 222.4 pm while the corresponding peak in the pure AlCl3 simulations oc-
curs already at 214.0 pm. There is no shoulder in the Al-Cl function at the
first peak and the second peak occurs at larger distances. The Cl-Cl function
presents its first peak at 361.1 pm which is approximately 10 pm earlier than
what was observed for the pure AlCl3 liquid.
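For reference, a minimal sketch of how such radial distribution functions can be accumulated from a single configuration is given below (Python/NumPy; the cubic periodic boundary conditions, the function name and the normalisation for distinct species A and B are assumptions of this sketch, and a production analysis would of course average over the whole trajectory):

```python
import numpy as np

def radial_distribution(pos_a, pos_b, box, r_max, n_bins=200):
    """Pair-distance histogram g_AB(r) for one configuration in a cubic box
    with periodic boundary conditions (illustrative sketch only)."""
    # minimum-image pair vectors between all A and B sites
    d = pos_a[:, None, :] - pos_b[None, :, :]
    d -= box * np.round(d / box)
    r = np.sqrt((d**2).sum(axis=-1)).ravel()
    r = r[(r > 1e-8) & (r < r_max)]

    hist, edges = np.histogram(r, bins=n_bins, range=(0.0, r_max))
    shell = 4.0 / 3.0 * np.pi * (edges[1:]**3 - edges[:-1]**3)
    rho_b = len(pos_b) / box**3                   # ideal-gas density of B
    g = hist / (shell * rho_b * len(pos_a))       # normalise per A site
    return 0.5 * (edges[1:] + edges[:-1]), g
```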
In Fig. 4 we concentrate on the radial distribution functions of the imi-
dazolium protons to the chlorine atoms. For each of the three ring protons
we show an individual function in the left panel of Fig. 4. Because the H2-
Cl function shows the most pronounced peak and the first peak appears at
shorter distances than for the functions of H4-Cl and H5-Cl it is clear that
this proton is the most popular coordination site for the chlorine atoms. How-
ever, the other two ring protons also show peaks at slightly larger
distances, which indicates an involved network instead of individual pairs. It
should be noted that from this structural behavior it cannot be deduced
how long-lived the coordination partners are. Considering the protons of the
ethyl and the methyl group, it is striking that small but pronounced
peaks can be observed here as well. While the ethyl-group H–Cl functions,
like the H4–Cl function, are the least pronounced, the methyl-group H–Cl function
shows a maximum higher than that of the H4–Cl function. Obviously
this functional group is also in contact with the chlorine atoms of the anion.
Fig. 4. The radial distribution function of chlorine atoms from AlCl− 4 anion to the
protons from [C2 C1 im]. Distances are in Å. Left: H2-Cl (black), H4-Cl (red), H5-Cl
(blue); Right: Terminal ethyl-H-Cl (black), α-ethyl-H-Cl (red), methyl-H-Cl (blue)
Table 1. Geometrical parameters of the isolated and the average AlCl4− in the ionic
liquid. Distances r are in pm; rmin indicates the shortest and rmax the longest
distance, averaged over all configurations. The abbreviation “iso/dyn” indicates a
dynamic calculation of the isolated anion; “liq” denotes the average values from the
simulations of the neutral liquid

                    Al–Cl                          Cl–Cl
          rmin     rmax    rmax/rmin      rmin     rmax    rmax/rmin
iso/dyn   218.1    230.2     1.06         354.4    376.9     1.06
liq       216.6    229.1     1.06         344.2    381.8     1.11
For the AlCl4− anion we observe a perturbation from the ideal tetrahedral
symmetry both in the isolated system and in the anions in the liquid. Whereas
the shortest and longest Al-Cl distances vary only by 10 pm, the Cl-Cl dis-
tances show larger deviations of 30 pm (iso/dyn) to 40 pm (liq). This means
that the perturbation is already induced by temperature; in the liquid the
perturbation from the optimal geometry is somewhat more enhanced.
Fig. 5. The distribution of the shortest proton-Cl distance from a particular proton
(H2, H4, H5) to any Cl atom
Electrostatic potential
Wannier centers
We used the maximally localized Wannier functions and their geometric cen-
ters, also called Wannier centers, to characterize the distribution of electrons
Fig. 6. The electrostatic potential mapped onto the electron density with a surface
value of 0.067 e−/Å³. The colour scale ranges from −0.1 (blue) to +0.1 (red) atomic
units. Left: AlCl4−, right: [C2C1im]+
Fig. 7. The Wannier centers, denoted as red spheres. Top: [C2C1im]+, bottom: AlCl4−.
corresponding proton we can see that in the C2-H2 pair the electrons are
closer to the carbon than in the C4-H4 and C5-H5 pairs, pointing towards
a larger polarity of the C2-H2 pair. Thus the H2 is more positive than the
H4 and H5, and can electrostatically attract the negative Cl atoms from the
anion molecules towards itself, as was seen in Section 3.3.
5 Computational performance
For the simulation of our system, i.e. 32 [C2C1im]AlCl4 pairs, we have to treat
768 atoms and 1216 electronic states in each time step. The number of atoms
is far larger than in a usual single-molecule static calculation. Therefore
the use of GGA density functional theory is necessary in order to make the
simulation computationally tractable. It should be noted that 32 molecular pairs
is more or less the lower limit for a calculation employing periodic boundary
conditions, as smaller systems would suffer from artificial finite-size effects
due to interactions with the mirror images. Given these circumstances, our
simulation provides the first real ab initio molecular dynamics simulation of
an ionic liquid. Due to the computational constraints, previous simulations
treated a smaller number of molecular pairs or employed simplified models
of ionic liquids (for example [C1C1im]Cl). For obvious reasons our calculations
had to be carried out on a large number of efficient processors. Using 128
processors of the NEC SX-8 we were able to complete our simulations within
just two months. The size of our system leads to restart files of 14 GB.
Before starting the real production runs we measured the scaling of the
computing time and computational efficiency when changing the system size
and/or the number of processors incorporated in a job. The results of these
tests are shown in Table 2 and Fig. 8. The smallest system contains 32 (IL-32)
pairs. The next system contains 48 (IL-48) and the largest system 64 (IL-64)
pairs.
We still see very good scaling of the computing time when going from 64
to 128 processors. We did not go beyond 128 processors, but we estimate
Table 2. Scaling of the wall-clock time in seconds per iteration and performance
in GFLOPs versus number of processors. IL-x indicates the system size of the ionic
liquid
Fig. 8. The scaling of the wall clock time per iteration – left – and numerical per-
formance in GFLOPs – right – plotted against the number of processors. The green,
dashed lines denote the ideal scaling and theoretical peak performance, respectively
a decent or a good scaling in the IL-32 system, or very good scaling in the
IL-48 and IL-64 cases. We note that at very large processor counts a different
parallelisation using the OpenMP support built into the CPMD code could be tried
if the scaling is otherwise no longer satisfactory. A concrete limitation is met
if the number of processors is larger than the length of the FFT grid in the
first direction; however, further scaling can be achieved by applying task groups,
yet another efficient method inside CPMD, whereby the FFT's over different
electronic states are grouped to sets of processors, thus overcoming the
limitation on the maximum number of processors due to the length of the FFT
grid. The task groups can be used particularly well on the NEC SX-8 computer
at HLRS due to the large amount of memory at each node, because the task groups
increase the memory requirement per node somewhat. We always report the best
performance over the different numbers of task groups; typically the optimum
is at four or eight groups.
The numerical performance exceeds $10^{12}$ floating point operations per
second (one TFLOP/s) in all the systems studied at 128 processors.
Furthermore, the performance still scales very favourably when going from 64
to 128 processors. Thus, from the efficiency point of view, processor counts
exceeding 128 could also be used. However, due to the limited number of
processors available, and because we already hit the “magic target” of one
TFLOP/s, we restricted our production runs to 128 processors.
Overall we were more than satisfied with the performance and with the
prospect of performing the production calculations for the IL-48 or even IL-
64 systems. However, because the total time of the simulation is a multiple of
the number of molecular dynamics steps, we were forced to choose the IL-32
system for the production job, as otherwise we would not have been able to
simulate a trajectory of ≈ 20 ps as we managed to do now.
We also want to note that our calculations profit from the computer architecture
of the NEC SX-8 at HLRS not only due to the high degree of vectorization
and the very good single-processor computing power, as evidenced by the high
numerical efficiency (over 10 GFLOP/s per processor; this number also includes
the I/O), but also due to the large memory: we could store some large
intermediate results in memory, thus avoiding the need to recalculate part of
the results, which would be unavoidable on a machine with a smaller amount of
memory per processor. In this way almost one third of the FFT's, and thus of
the most demanding all-to-all parallel operations, can be avoided, improving
the parallel scaling somewhat over a normal calculation where this option
could not be used.
6 Conclusions
Acknowledgements
We thank the HLRS for the allocation of computing time; without this our
project would not have been feasible!
We are grateful to Prof. Jürg Hutter for several discussions, and to Stefan
Haberhauer (NEC) for executing the benchmarks on the NEC SX-8 and op-
timising CPMD on the vector machines. BK would like to thank T. Welton, A.
East, K. E. Johnson and J. S. Wilkes for helpful discussion. BK acknowledges
the financial support of the DFG priority program SPP 1191 “Ionic Liquids”,
the ERA program and the financial support from the collaborative research
center SFB 624 “Templates” at the University of Bonn.
References
1. P. Wasserscheid and T. Welton, eds. Ionic Liquids in Synthesis. VCH-Wiley,
Weinheim, 2003.
2. J. H. Davis. Task-specific ionic liquids. Chem. Lett., 33:1072–1077, 2004.
3. A. E. Visser, R. P. Swatloski, W. M. Reichert, R. Mayton, S. Sheff,
A. Wierzbicki, J. H. Davis, and R. D. Rogers. Task-specific ionic liquids for
the extraction of metal ions from aqueous solutions. Chem. Commun., 01:135–
136, 2001.
4. V. A. Cocalia, K. E. Gutowski, and R. D. Rogers. The coordination chemistry
of actinides in ionic liquids: A review of experiment and simulation. Coord.
Chem. Rev., 150:755–764, 2006.
5. T. Welton. Room-Temperature Ionic Liquids. Solvents for Synthesis and Catal-
ysis. Chem. Rev., 99:2071–2083, 1999.
6. T. Welton. Ionic Liquids in catalysis. Coord. Chem. Rev., 248:2459–2477, 2004.
7. P.A. Hunt and I. Gould. J. Phys. Chem. A, 110:2269, 2006.
8. S. Kossmann, J. Thar, B. Kirchner, P. A. Hunt, and T. Welton. Cooperativity
in ionic liquids. J. Chem. Phys., 124:174506, 2006.
9. Z. Liu, S. Haung, and W. Wang. A refined force field for molecular simulation
of imidazolium-based ionic liquids. J. Phys. Chem. B, 108:12978, 2004.
10. J.K. Shah and E.J. Maginn. Fluid Phase Equlib, 222-223:195, 2004.
11. T.I. Morrow and E.J. Maginn. Molecular dynamics study of the ionic liquid 1-n-
butyl-3-methylimidazolium hexafluorophosphate. J. Phys. Chem. B, 106:12807,
2002.
12. C.J. Margulis, H.A. Stern, and B.J. Berne. Computer simulation of a ”green
chemistry” room-temperature ionic solvent. J. Phys. Chem. B, 106:12017, 2002.
13. J. Lopes, J. Deschamps, and A. Padua. Modeling ionic liquids using a systematic
all-atom force field. J. Chem. Phys. B, 108:2038, 2004.
14. S. Urahata and M. Ribeiro. Structure of ionic liquids of 1-alkyl-3-
methylimidazolium cations: A systematic computer simulation study. J. Chem.
Phys., 120(4):1855, 2004.
15. T. Yan, C.J. Burnham, M.G. Del Popolo, and G.A. Voth. Molecular dynamics
simulation of ionic liquids: The effect of electronic polarizability. J. Phys. Chem.
B, 108:11877, 2004.
16. S. Takahashi, K. Suzuya, S. Kohara, N. Koura, L.A. Curtiss, and M. Saboungi.
Structure of 1-ethyl-3-methylimidazolium chloroaluminates: Neutron diffraction
measurements and ab initio calculations. Z. für Phys. Chem., 209:209, 1999.
17. Z. Meng, A. Dölle, and W.R. Carper. J. Mol. Struct., 585:119, 2002.
18. A. Chaumont and G. Wipff. Solvation of uranyl(ii) and europium(iii) cations
and their chloro complexes in a room-temperature ionic liquid. a theoretical
study of the effect of solvent ”humidity”. Inorg. Chem., 43:5891, 2004.
19. F.C. Gozzo, L.S. Santos, R. Augusti, C.S. Consorti, J. Dupont, and M.N. Eber-
lin. Chem. Eur. J., 10:6187, 2004.
20. E.R. Talaty, S. Raja, V.J. Storhaug, A. Dölle, and W.R. Carper. J. Phys. Chem.
B, 108:13177, 2004.
21. Y.U. Paulechka, G.J. Kabo, A.V. Blokhin, A.O. Vydrov, J.W. Magee, and
M. Frenkel. J. Chem. Eng. Data, 48:457, 2003.
22. J. de Andrade, E.S. Böes, and H. Stassen. Computational study of room temper-
ature molten salts composed by 1-alkyl-3-methylimidazolium cations-force-field
proposal and validation. J. Phys. Chem. B, 106:13344, 2002.
Micromagnetic Simulations of Magnetic Recording Media

Simon Greaves
In recent years the capacity of magnetic hard disk drives used for data storage
has increased at annual rates of between 30% and 60%. Behind this rapid
increase in areal density lies a constant process of innovation and technological
improvement.
Micromagnetic models can be used to simulate the recording process in
magnetic data storage media. Such models are useful because they allow new
head and media designs to be evaluated and optimised prior to the fabrication
of prototypes, speeding up the development cycle.
This paper describes the components of a micromagnetic model running
on the NEC SX-7 supercomputer installed at Information Synergy Centre of
Tohoku University. The model is mainly used for the simulation of magnetic
recording.
1 Fundamentals of micromagnetics
1.1 Discretisation
Depending on the problem under consideration, the cell volumes might lie in a range from
1 nm3 to 1000 nm3 .
The data storage layer of a typical magnetic recording medium consists
of polycrystalline grains with an average diameter of 6 nm - 8 nm and a
thickness of 10 nm - 20 nm. A transmission electron microscope (TEM) image
showing a plan view of a magnetic recording medium is shown in Fig. 1(a). The
grains are irregular shapes and are separated by non-magnetic material. Early
micromagnetic models were restricted to modelling grains of uniform size with
square or hexagonal cross sections. Increased memory and computing power
have enabled the modelling of irregular grains, often represented by Voronoi
cells. To create a set of Voronoi cells a set of seed points is first distributed
over a plane. A Voronoi cell is defined as the region of the plane which is
nearest to a particular seed point. The boundaries of the Voronoi cells are
the loci of points equidistant between seed points. The average size and size
distribution of the Voronoi cells can be controlled through the density and
location of the seed points. Non-magnetic boundary regions can be created
by moving the vertices of each Voronoi cell some distance towards the seed
point. An example of Voronoi cells used in the micromagnetic model is shown
in Fig. 1(b). The Voronoi cell microstructure is a good approximation of the
real medium. If the grains are sufficiently small each grain can be modelled
by a single cell. If more detail is required the Voronoi cells can be subdivided
into smaller cells both in the plane and along the axis normal to the plane.
An algorithm for the generation of Voronoi cells can be found in [SF01].
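A simple way to generate such a microstructure, shown here only as a sketch, is to use a standard Voronoi routine and shrink each cell towards its seed point (Python with scipy.spatial.Voronoi; the model described in this paper uses its own implementation of the Fortune algorithm [SF01], and the shrink factor and random seeding are illustrative parameters):

```python
import numpy as np
from scipy.spatial import Voronoi

def voronoi_grains(n_seeds, size, shrink=0.9, seed=0):
    """Generate a Voronoi-cell microstructure on a size x size plane and
    shrink each cell towards its seed point to create non-magnetic
    boundaries (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    seeds = rng.uniform(0.0, size, (n_seeds, 2))
    vor = Voronoi(seeds)

    grains = []
    for p, region_idx in enumerate(vor.point_region):
        region = vor.regions[region_idx]
        if len(region) == 0 or -1 in region:      # skip open cells at the border
            continue
        verts = vor.vertices[region]
        # move vertices towards the seed point -> non-magnetic grain boundary
        grains.append(seeds[p] + shrink * (verts - seeds[p]))
    return grains
```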
Each of the grains in a recording medium can be considered as a small,
permanent magnet. The important magnetic properties of the grains can be
obtained from a hysteresis loop, such as the one shown in Fig. 2. The mag-
netisation of the grain along an axis is measured as an external magnetic field
Fig. 1. TEM image of grains in a recording medium (plan view) (left) and a Voronoi
cell representation of grains (right). The side of each image is 110 nm and the average
grain size is about 8 nm.
Fig. 2. Hysteresis loop of a single grain: magnetisation Mz (emu/cm³) versus applied field Hz (kOe), indicating the saturation magnetisation Ms and the coercivity Hc.
is applied along the same axis. When the magnetic field is sufficiently large
the magnetisation of the grain will reverse direction. The field at which this
reversal occurs is called the coercive field, or coercivity Hc . In order to pro-
vide a stable storage medium Hc should be large enough to resist the effect
of stray fields and temperature fluctuations. Another important parameter is
the saturation magnetisation Ms which is a measure of the maximum strength
of the grain magnetisation. A ferromagnetic element such as iron has Ms of
1700 emu/cm3 at room temperature. The saturation magnetisation of grains
in a recording medium is typically in the range 400 - 750 emu/cm3 . The poles
of magnetic write heads are made of alloy materials with Ms as high as 2100
emu/cm3 , allowing them to produce a large magnetic field.
Having discretised the body into suitably sized cells, the time variation of the
magnetisation of each cell under the influence of internal and external mag-
netic fields is calculated using the Landau-Lifshitz-Gilbert (LLG) equation
[TG55].
$$\frac{d\mathbf{M}}{dt} = -\gamma\,\mathbf{M}\times\left(\mathbf{H} - \frac{\alpha}{\gamma M_s}\frac{d\mathbf{M}}{dt}\right) \qquad (1)$$
In Eq. 1 M is the magnetic moment in each cell and H is the magnetic
field acting on M. γ is the gyromagnetic constant (1.76×10⁷ s⁻¹ Oe⁻¹) and
α is the damping constant; typical values lie between 0.01 and 1. The time
variation of M is obtained by computing H and calculating dM using the
LLG equation.
Eq. 1 consists of two terms: a precession term and a damping term. To see
the effect of the precession term we set α = 0 and the LLG equation becomes
$$\frac{d\mathbf{M}}{dt} = -\gamma\,(\mathbf{M}\times\mathbf{H}). \qquad (2)$$
The magnetic field H exerts a torque on M which is perpendicular to
the M − H plane, causing M to precess about H. However, if α = 0, M
will precess about H indefinitely and will never align with H. The damping
term in the LLG equation containing the constant α damps the precessional
motion, causing M to align with H. The higher the value of α the sooner M
aligns with H. Fig. 3 shows an example of damped precessional motion for a
single magnetic vector M which initially lies in the x − y plane, as indicated
by the arrow. A magnetic field H of magnitude 2 kOe is then applied along
the z axis and M begins to precess, tracing the path indicated by the spiral.
The magnitude of M is constant, meaning that as the in-plane component of
M decreases the out of plane component increases, and eventually M ends up
pointing along the z axis, i.e. perpendicular to the page.
Magnetostatic field
Fig. 4. Segmentation of a hexagonal cell into cuboids for the purposes of calculating
the magnetostatic tensors.
Here, Aij is the exchange coupling constant between the two cells i and j.
For a regular microstructure with cell centres separated by a distance dij we
can expand the ∇2 m term in Eq. 6 and approximate Hex using
$$\mathbf{H}_{ex}(i) = \frac{2}{M_s(i)}\sum_j A_{ij}\,\frac{\mathbf{m}(j)-\mathbf{m}(i)}{d_{ij}^2}. \qquad (7)$$
However, if the cells are irregular, we must correct for variations in the
length of the common boundary Lij between each pair of cells. The exchange
coupling strength between cells depends upon the ratio of Lij to the total
boundary length of the cell Ltotal . This is shown schematically in Fig. 5(a)
by the thicknesses of the lines connecting cell i to neighbouring cells. A dis-
tribution of common boundary lengths gives rise to a distribution of Hex
characterised by σL . Eq. 8 takes account of the boundary length dependence
of Hex with the addition of an extra term 4Lij /Ltotal (i). For cuboid cells
Lij = Ltotal (i)/4, the last term becomes unity and Eq. 8 is identical to Eq. 7.
Differences in intergranular spacing will also give rise to a distribution of Hex
characterised by a parameter σA . Thus, Hex distributions have two origins:
edge length and intergranular spacing variations, as shown in Fig. 5(d).
$$\mathbf{H}_{ex}(i) = \frac{2}{M_s(i)}\sum_j A_{ij}\,\frac{\mathbf{m}(j)-\mathbf{m}(i)}{d_{ij}^2}\times\frac{4L_{ij}}{L_{total}(i)} \qquad (8)$$
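A direct transcription of Eq. (8) into code could look like the following sketch (Python/NumPy; the neighbour lists and the data layout for A_ij, d_ij and L_ij are illustrative assumptions about how the grain geometry might be stored, not the model's actual C data structures):

```python
import numpy as np

def exchange_field(i, m, Ms, A, d, L, Ltot, neighbours):
    """Exchange field acting on cell i according to Eq. (8) (sketch).

    m          : unit magnetisation vectors, shape (Ncells, 3)
    Ms         : saturation magnetisation of each cell
    A, d, L    : per-pair exchange constant, centre distance and common
                 boundary length, indexable as A[i][j] etc.
    Ltot       : total boundary length of each cell
    neighbours : list of neighbour indices for each cell
    """
    h = np.zeros(3)
    for j in neighbours[i]:
        h += A[i][j] * (m[j] - m[i]) / d[i][j]**2 * (4.0 * L[i][j] / Ltot[i])
    return 2.0 * h / Ms[i]
```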
Thermal field
The thermal field is a stochastic term, localised to each cell, which represents
the effect of heat on the magnetic moment M. In the same way that a small
particle suspended in a liquid undergoes random position fluctuations due to
Brownian motion, the magnetic moment of a cell also fluctuates randomly
under the influence of the thermal field. New values of the thermal field are
chosen each time the LLG equation is evaluated. Orthogonal components of
the thermal field form a Gaussian distribution with a standard deviation given
by
$$\sigma = \sqrt{\frac{2 k_b T \alpha}{V M_s \gamma\, dt}} \qquad (9)$$
where T is the temperature, V the cell volume, kb is Boltzmann’s constant and
dt the time step used in the LLG equation. There is no correlation between
successive values of the thermal field and the time-averaged value is zero.
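Drawing one realisation of the thermal field then amounts to sampling three Gaussian components with the standard deviation of Eq. (9), for example as in this sketch (Python/NumPy; CGS units with kb in erg/K are assumed, and the function is illustrative rather than the model's implementation):

```python
import numpy as np

def thermal_field(T, V, Ms, alpha, gamma, dt, kb=1.380649e-16, rng=None):
    """One realisation of the thermal field for a single cell, Eq. (9).
    T in K, V in cm^3, Ms in emu/cm^3, gamma in 1/(s Oe), dt in s."""
    if rng is None:
        rng = np.random.default_rng()
    sigma = np.sqrt(2.0 * kb * T * alpha / (V * Ms * gamma * dt))
    return rng.normal(0.0, sigma, 3)   # three independent Gaussian components
```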
There are several advantages to including the thermal field in a micromag-
netic model which are founded on the avoidance of anomalous magnetic states.
Fig. 7. Angle between M and H for an 8 nm cube. The distribution was obtained
from the data in Fig. 6.
Spin torque
$$\frac{d\mathbf{M}}{dt} = -(\mathbf{u}\cdot\nabla)\,\mathbf{m} \qquad (10)$$
where u is the direction of current flow with a magnitude proportional to the
current density and spin polarisation rate and m is a unit vector indicating
the direction in which the material is magnetised. The spin torque effect can
be important in magnetic sensors, such as the read heads used in hard disk
drives as the read current flows through magnetic layers used to sense the
stray field from the recording medium.
Applied field
2 A micromagnetic model
There are several freely available micromagnetic programs. Two of the most
commonly used are the object oriented micromagnetic framework (OOMMF)
program developed at NIST [OO01] and the finite element micromagnetics
package magpar, available at [WS01]. There are also many commercial, closed
source programs. A micromagnetic model has also been developed by the au-
thor, based on the description in Sect. 1, mainly for the purpose of carrying
out simulations of magnetic recording. The code is written in C and runs on
SX-7, TX-7 and Linux/UNIX clusters, usually on 8 - 32 processors depend-
ing on the problem size. On the SX-7 machine vectorisation exceeds 99% and
memory usage for a 19000 cell model without cell grouping for the magneto-
static calculation is about 10 GB. With cell grouping the memory requirement
is reduced to around 4 GB for a 54000 cell model. All arrays are single dimen-
sional to eliminate nested loops and improve vectorisation. Other than the
standard C libraries, no external libraries are required unless MPI routines
are used, making the program highly portable. A diagram depicting program
flow is shown in Fig. 8. First of all the model geometry is generated. Voronoi
cells are generated using an implementation of the Fortune algorithm [SF01].
Fig. 8. Program flow: create geometry → set initial magnetic state → calculate magnetic fields → apply the LLG equation and normalise the magnetisation vectors → repeat until done.
Next, the magnetostatic tensors are calculated and stored and the initial mag-
netic state is applied. The head field distribution used to write data tracks
can be generated internally or loaded from a pre-existing file. The program
then enters a loop in which the magnetic field in each cell is calculated and
the magnetisation vectors updated by applying the LLG equation. The LLG
equation does not preserve the magnitude of M, so the magnetisation vectors
must be re-normalised each time the LLG equation has been applied. The
write head moves along the recording medium as time elapses and the po-
larity and magnitude of the write field is varied to record data bits onto the
recording medium.
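A skeleton of this loop, with hypothetical helper routines standing in for the field calculations and the LLG update, might look as follows (Python sketch; the names cells, head and llg_step are illustrative and do not correspond to the actual C implementation):

```python
import numpy as np

def run_recording(cells, head, n_steps, dt, llg_step):
    """Skeleton of the main loop sketched in Fig. 8 (illustrative only)."""
    M = cells.initial_magnetisation()              # hypothetical helper
    for step in range(n_steps):
        t = step * dt
        # total field: magnetostatic + exchange + anisotropy + thermal + head
        H = (cells.magnetostatic_field(M) + cells.exchange_field(M)
             + cells.anisotropy_field(M) + cells.thermal_field(dt)
             + head.field(t))                      # head moves and switches polarity
        M = llg_step(M, H, dt)                     # update with the LLG equation
        M /= np.linalg.norm(M, axis=1, keepdims=True)   # re-normalise |M|
    return M
```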
The program information from a typical micromagnetic simulation run
on the SX-7 supercomputer is shown in Fig. 9. The simulation involved the
writing of data tracks on media with various parameters; in total, ten tracks
were written. Each medium consisted of eight layers and 28400 cells. The
vectorisation level exceeded 99.6% and the load was well distributed across
the eight processors used. Further optimisations to reduce the Lock Wait and
Bank times are required.
The easiest way to solve the LLG equation is to use the Euler method. How-
ever, a small time step dt is required in order to maintain stability of M
and this increases the overall computation time. The Heun (or improved Eu-
ler) method allows larger time steps to be used and also has the advantage
of being compatible with the expressions for the thermal field, given earlier.
Higher order solvers are redundant due to the stochastic nature of the model
which imposes a minimum noise base on solutions to the LLG equation. An
example of the effectiveness of the Heun method is shown in Fig. 10 which
depicts small oscillations of the magnetisation in a soft magnetic material.
When the time step dt is increased from 0.6×10−14s to 1.2×10−14s the solu-
tion obtained using the Euler method diverges from the correct solution, but
significant deviations occur only after the simulation has run for 6000 steps.
The Heun method allows a time step at least five times larger than the Eu-
ler method at the cost of a less than twofold increase in per-step execution
time (the thermal field only needs to be calculated once, the other fields are
calculated twice).
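A sketch of one Heun step is given below (Python/NumPy; the explicit Landau–Lifshitz form of the right-hand side is one common way to write an update equivalent to Eq. (1), and the field callback with its reuse_thermal flag is an illustrative stand-in for whatever mechanism the model uses to reuse the thermal field in both stages):

```python
import numpy as np

def llg_rhs(M, H, Ms, gamma, alpha):
    """dM/dt in the explicit Landau-Lifshitz form of the damped precession."""
    mxh = np.cross(M, H)
    return -gamma / (1.0 + alpha**2) * (mxh + alpha / Ms * np.cross(M, mxh))

def heun_step(M, field, t, dt, Ms, gamma, alpha):
    """One Heun (predictor-corrector) step; 'field(M, t, ...)' returns the
    total field, with the thermal contribution drawn once and reused in the
    corrector stage, as described in the text (illustrative sketch)."""
    H1 = field(M, t)                               # thermal field drawn here
    k1 = llg_rhs(M, H1, Ms, gamma, alpha)
    M_pred = M + dt * k1                           # Euler predictor
    H2 = field(M_pred, t + dt, reuse_thermal=True) # other fields recomputed
    k2 = llg_rhs(M_pred, H2, Ms, gamma, alpha)
    return M + 0.5 * dt * (k1 + k2)                # corrector
```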
Fig. 10. Effect of the LLG time step and integration method when calculating small
oscillations of the magnetisation (Mx/Ms versus time) in a soft magnetic material.
Curves: Euler with dt = 0.6×10⁻¹⁴ s and 1.2×10⁻¹⁴ s; Heun with dt = 0.6×10⁻¹⁴ s
and 6.0×10⁻¹⁴ s.
Fig. 11. A flowchart for a model containing multiple, interacting magnetic objects.
When the relative position of the objects changes during the simulation, the interaction
tensors must be recalculated and this can take a considerable amount of time.
Further optimisation of these routines is required; e.g. interpolating the in-
teraction tensors for intermediate head-medium positions, load balancing to
make the thread execution times similar for different sized objects etc.
Future data storage products may make use of heat assisted recording. The
motivation is to improve the thermal stability of recorded data bits by in-
creasing the strength of the uniaxial anisotropy. At room temperature the
recording medium will have a very high coercivity and a head would be un-
able to record on it, the head field being less than the coercivity. During the
recording process the medium is heated by a laser, reducing the coercivity and
allowing the head to write data bits. To enable the simulation of heat assisted
recording a heat flow model has been added to the micromagnetic model. The
heat flow between two cells i and j is written as
dQ_i/dt = K_j S_ij (T_j − T_i) / d_ij    (11)
where Q = heat, K = thermal conductivity, S = common surface area of the
two cells, T = temperature and d = centre-centre distance between the two
cells. The heat flow equation and LLG equation are synchronised and solved
in parallel. The constant parts of Eq. 11 are encapsulated in an array of values
Hij for pairs of neighbouring cells, i and j.
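The following is a minimal sketch of how Eq. 11 might be advanced in time using the precomputed H_ij values. The explicit update, the heat-capacity array C needed to convert heat into a temperature change, and the toy numbers (arbitrary units) are assumptions for illustration, not details of the actual code.

import numpy as np

def heat_flow_step(T, C, neighbours, H, dt):
    # Explicit update of cell temperatures from Eq. 11:
    #   dQ_i/dt = sum over neighbours j of H_ij * (T_j - T_i),
    # with H_ij = K_j * S_ij / d_ij precomputed.  C[i] is the heat capacity
    # of cell i, an assumed extra input needed to turn heat into temperature.
    dQ = np.zeros_like(T)
    for (i, j), h_ij in zip(neighbours, H):   # ordered pairs: j is a neighbour of i
        dQ[i] += h_ij * (T[j] - T[i]) * dt
    return T + dQ / C

# Toy example in arbitrary units: three cells in a row, the middle one hot.
T = np.array([300.0, 600.0, 300.0])
C = np.array([1.0, 1.0, 1.0])
neighbours = [(0, 1), (1, 0), (1, 2), (2, 1)]
H = [0.1, 0.1, 0.1, 0.1]
for _ in range(50):
    T = heat_flow_step(T, C, neighbours, H, dt=1.0)
print(T)   # temperatures converge towards a common value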
The Hij values can be varied for each pair of neighbouring cells. This is
useful for simulating patterned media in which the heat flow is much larger
within patterned elements than between patterned elements separated by air
or some other material. Fig. 12 shows the result of a heat flow calculation for a
5 nm thick CoCrPt recording medium on a 10 nm Ru seed layer on glass. The
mesh used for the heat flow calculation is independent of the geometry used
for the micromagnetic calculation and the cells of the two models do not need
to be the same size. The micromagnetic geometry can occupy all or part of
the space used for the heat flow calculation. The heat flow calculation usually
requires a much larger region than the micromagnetic model in order to avoid
boundary issues. An accurate heat flow calculation also requires the inclusion
of non-magnetic conduction layers, in addition to the recording layer itself.
Temperatures are calculated using the heat conduction model and transferred
to the equivalent location in the micromagnetic model. The temperature de-
pendence of magnetic properties such as Ms and Ku is described by a function
or data table.
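As an example of such a function, a common assumption is a power-law decrease of Ms towards the Curie temperature. The sketch below is purely illustrative; the functional form, exponent and parameter values are assumptions and are not taken from the actual model, which may instead use a measured data table.

import numpy as np

def ms_of_T(T, ms0=5.0e5, Tc=650.0, beta=0.5):
    # Illustrative Ms(T) = Ms0 * (1 - T/Tc)**beta below the Curie temperature
    # Tc and zero above it.  All parameter values are assumed.
    T = np.asarray(T, dtype=float)
    reduced = np.clip(1.0 - T / Tc, 0.0, None)
    return ms0 * reduced ** beta

print(ms_of_T([300.0, 500.0, 700.0]))   # saturation magnetisation in A/m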
3 Applications of micromagnetics
Magnetic recording simulations
[Figure: result of the heat flow calculation, shown as a temperature map over cross track (nm) and down track (nm) axes with a colour scale of approximately 300–500 K.]
Adding a heat source to the write head could allow an increase in the recording
density. Combined LLG and heat flow models were used to simulate thermally
assisted recording. Fig. 14 shows a snapshot of the recording layer during a
thermally assisted recording simulation. On the left is the temperature dis-
tribution in the recording layer; the maximum temperature is 623 K and the
minimum temperature is 300 K. The magnetic region is shown on the right;
it is smaller than the heat flow region to reduce computing time. The portion
of the magnetic region directly under the laser spot is above the Curie tem-
perature and the magnetisation is disordered. As the laser moves along the medium the recording layer cools below the Curie temperature.
Fig. 14. Simulation of thermally assisted recording, left: temperature, right: magnetisation. Image size: 900 nm × 1100 nm.
Ms and Ku
increase as the medium cools and the polarity of the written bits is determined
by a magnetic field generated by a write head. The rate of cooling, which can
be controlled by the thermal properties of the seed layers, influences the SNR,
written track width and transition noise.
4 Conclusions
References
[DL84] Lindholm, D.A.: Three-dimensional magnetostatic fields from point-
matched integral equations with linearly varying scalar sources. IEEE Trans.
Magn., 20, 2025–2032 (1984)
[EB05] Boerner, E.D., Chubykalo-Fesenko, O., Mryasov, O.N., Chantrell, R.W.,
Heinonen, O.: Moving toward an atomistic reader model. IEEE Trans. Magn.,
41, 936–940 (2005)
[HF98] Fukushima, H., Nakatani, Y., Hayashi, N.: Volume average demagnetising tensor of rectangular prisms. IEEE Trans. Magn., 34, 193–198 (1998)
[MT01] Matsumoto, M.: An algorithm for random number generation. http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt.html
[OO01] Donahue, M.: The object oriented micromagnetic framework (OOMMF)
project at ITL/NIST. http://math.nist.gov/oommf/
[SF01] Fortune, S.J.: An algorithm for Voronoi cell generation. http://netlib.bell-labs.com/cm/cs/who/sjf/index.html
[TG55] Gilbert, T.L.: A Lagrangian formulation of the gyromagnetic equation of the magnetization field. Phys. Rev., 100, 1243 (1955)
[WS01] Scholz, W.: magpar - Parallel finite element micromagnetics package.
http://magnet.atp.tuwien.ac.at/scholz/magpar/
[YN89] Nakatani, Y., Uesaka, Y., Hayashi, N.: Direct solution of the Landau–
Lifshitz–Gilbert equation for micromagnetics. Jap. J. Appl. Phys., 28, 2485–2507
(1989)
The Potential of On-Chip Memory Systems for
Future Vector Architectures
1 Introduction
The most advantageous feature of modern vector systems is their outstanding memory performance compared to scalar systems. This feature gives them high sustained system performance when executing real application codes, which are used extensively in the fields of advanced science and engineering [9],[10],[1]. However, recent trends in semiconductor technology generate a strong head wind for vector systems. Owing to the historical growth rate in the on-chip silicon budget known as Moore's law, processor flop/s performance increases remarkably, but memory performance cannot keep pace [2]. For vector systems, the bytes/flop ratio, which expresses the balance between flop/s performance and memory bandwidth, has dropped from 8 B/flop in 1998 to 4 B/flop in 2003 and 2 B/flop in 2007. We have pointed out that reducing the memory bandwidth seriously affects the sustained system performance even in the case of vector systems [3], although their absolute performance still increases to a certain degree. Memory performance is therefore one of the key points in the design of future highly efficient vector architectures if they are to survive in the era of multi-core processors.
To compensate for the lack of memory performance of future vector systems, this paper discusses the potential of on-chip memory systems such as caches and local (scratchpad) memories, in particular their effects on the sustained system performance. Some vector supercomputers have already employed caches [13]; however, the effects of on-chip caching for vector architectures have not yet been examined quantitatively using practical application codes. This paper presents early experimental results on vector on-chip caches in the execution of real application codes and tries to determine how much on-chip cache capacity is equivalent to a given B/flop rate.
The rest of the paper is organized as follows. Section 2 reviews the performance trends of modern vector systems and shows that memory performance seriously affects their sustained system performance in the execution of real application codes. Section 3 presents a baseline vector architecture with an on-chip memory subsystem to be discussed in this paper. The architecture is designed based on the NEC SX-7 architecture [6]. We also discuss the pros and cons of on-chip caches and on-chip local memory. In Section 4, we present experimental results on the vector cache system obtained with a simulator. For the performance evaluation, five real application codes, developed in leading fields of computational science and engineering, are used. We discuss the effects of on-chip caching, an assisting mechanism for the vector load/store units, on the sustained performance, and show that on-chip caching is promising even for vector architectures. Finally, Section 5 summarizes the paper.
Fig. 1. Efficiency of the four systems for the five simulation codes
around 10%. These results mean that the outstanding sustained performance of the vector system is strongly supported by its excellent memory performance; therefore, to keep the sustained performance high, sufficient memory bandwidth relative to the peak flop/s rate is essential for future vector systems to survive in the HPC community.
4 Performance Evaluation
4.1 Experimental Environment
To evaluate the effects of an on-chip cache for vector architectures, we are developing a simulator with a cache mechanism based on the NEC SX simulator. The simulator can vary its memory bandwidth from 1 B/flop to 8 B/flop. We discuss the performance of the SX systems with 0.5 MB and 2 MB vector caches, and assume that the memory latency is 1.5 to 2 times longer than that of the vector cache. The line size is 8 B, the same as the double-precision data size. A write-through policy is employed. The number of memory ports is 32, and a 2-way set-associative sub-cache is connected to each port.
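To make this configuration concrete, the following toy Python model mimics the described organisation: 8 B lines, 32 memory ports each backed by a 2-way set-associative sub-cache, and a 2 MB total capacity. The address-to-port interleaving and the LRU replacement policy are assumptions made for illustration and are not necessarily those of the actual simulator.

from collections import OrderedDict

LINE = 8                       # line size in bytes (one double-precision word)
PORTS = 32                     # memory ports, one sub-cache per port
WAYS = 2                       # associativity of each sub-cache
CACHE_BYTES = 2 * 1024 * 1024  # total on-chip capacity (the 2 MB case)
SETS = CACHE_BYTES // (LINE * PORTS * WAYS)

class SubCache:
    # 2-way set-associative sub-cache with LRU replacement (assumed policy)
    def __init__(self):
        self.sets = [OrderedDict() for _ in range(SETS)]
    def access(self, line_addr):
        s = self.sets[line_addr % SETS]
        if line_addr in s:            # hit: refresh the LRU position
            s.move_to_end(line_addr)
            return True
        if len(s) >= WAYS:            # miss with a full set: evict the LRU line
            s.popitem(last=False)
        s[line_addr] = None
        return False

def hit_rate(addresses):
    # Fraction of 8-byte accesses served by the cache.  Lines are interleaved
    # across the 32 ports by their low-order bits (an assumed mapping).
    caches = [SubCache() for _ in range(PORTS)]
    hits = 0
    for a in addresses:
        line = a // LINE
        hits += caches[line % PORTS].access(line // PORTS)
    return hits / len(addresses)

# Example: a unit-stride sweep over a 1 MB array, touched twice; the second
# sweep hits because the whole array fits in the 2 MB cache.
sweep = [8 * i for i in range(1 << 17)]
print(hit_rate(sweep + sweep))   # prints 0.5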
Table 2 shows the five application codes used for benchmarking. The application codes have been developed by researchers at Tohoku University and are actually used in each research area. Through the performance evaluation, we discuss how much vector-cache capacity is equivalent to the B/flop metric. Table 3 summarizes the features of the five simulation codes. In the table, Arithmetic Intensity means the ratio of floating-point operations to memory references and is a metric that indicates the computational or memory intensiveness of the application codes. VOR and AVL are the vectorization ratio and average vector length of the codes, respectively. In the following, we briefly review each code.
EM Scatter
Antenna
loop, and its length is 255. The arithmetic intensity of the innermost loop is 2.25. Therefore, this code is computationally intensive and cache-friendly.
Combustion
Heat Transfer
Earthquake
Fig. 7. Relationship between cache hit rates and recovered efficiency rates
EM Scatter
Fig. 8. EM Scatter
Vector cache hit rates of 13% and 17% are obtained in the EM Scatter case when using the 0.5 MB and 2 MB on-chip caches, respectively. As a result, the efficiency is improved by 5.4% with the 0.5 MB cache and 5.7% with the 2 MB cache in the 2 B/flop case, and by 2.8% with the 0.5 MB cache and 3.0% with the 2 MB cache in the 1 B/flop case. The experimental results suggest that the 2 MB cache covers 24% of the memory bandwidth shortage (2 B/flop) of the 2 B/flop system compared to the 4 B/flop system. It also covers 8% of the memory bandwidth shortage (3 B/flop) of the 1 B/flop system. To obtain further improvement from vector caching, more cache capacity is required. A discussion of the effects of larger caches remains as future work.
Antenna
of the colored arrays Phi is performed, vector loads for the underlined arrays are effectively served from the cache, and a 2.8 times higher hit rate is obtained even with the 1 MB cache compared to the case in which all arrays are cached, resulting in more than a 9% improvement in processing time. Of course, a more detailed discussion of the results for larger on-chip caches and different kernels is required, but these results suggest the potential of selective caching for efficient data management within the limited on-chip capacity.
Fig. 13. High cost kernel of Heat Transfer for selective caching
5 Summary
In this paper, we have discussed the potential of on-chip memory subsystems for future vector architectures. The performance evaluation based on these early experiments suggests that even a moderate-sized on-chip cache of 512 KB to 2 MB can compensate for the reduced memory bandwidth of vector load/store units at 2 B/flop or lower and boost the sustained system performance up to the level of the 4 B/flop performance. Selective caching, in which only the data with high locality of reference are cached, is also effective for the efficient use of limited on-chip caches.
Acknowledgments
This work has been done in collaboration between Tohoku University and NEC, and many colleagues have contributed to this project. We would also like to thank Professors Motoyuki Sato, Akira Hasegawa, Goro Masuya, Terukazu Ota, and Kunio Sawaya of Tohoku University for providing the codes used in the experiments.
References
1. M. Resch et al. (Eds.). High Performance Computing on Vector Systems 2006. Springer, 2006.
2. J. Hennessy and D. Patterson. Computer Architecture: A Quantitative Approach, 4th Edition. Morgan Kaufmann Publishers, 2006.
3. H.Kobayashi. Implication of Memory Performance in Vector-Parallel and Scalar-
Parallel HEC System. Proceedings of 4th Teraflop Workshop (High Performance
Computing on Vector Systems 2006), pages 21–51, 2006.
4. J.D.McCalpin. STREAM: Sustainable Memory Bandwidth in High Performance
Computers. http://www.cs.virginia.edu/stream, 2005.
5. K.Ariyoshi et al. Spatial variation in propagation speed of postseismic slip on
the subducting plate boundary. Proceedings of 2nd Water Dynamics, 35, 2004.
6. K.Kitagawa et al. A Hardware Overview of SX-6 and SX-7 Supercomputer.
NEC Research & Development, 44(1):2–7, 2003.
7. K.S.Kunz and R.J.Luebbers. The Finite Difference Time Domain Method for
Electromagnetics. CRC Press, 1993.
8. K. Tsuboi and G. Masuya. Direct Numerical Simulations for Instabilities of Premixed Planar Flames. Proceedings of The Fourth Asia-Pacific Conference on Combustion, 2003.
9. L.Oliker et al. Scientific Computations on Modern Parallel Vector Systems.
Proceedings of SC2004, 2004.
10. L.Oliker et al. Leading Computational Methods on Scalar and Vector HEC
Platforms. Proceedings of SC2005, 2005.
11. M. Nakajima et al. Numerical Simulation of Three-Dimensional Separated Flow and Heat Transfer around Staggered Surface-Mounted Rectangular Blocks in a Channel. Numerical Heat Transfer, 47(Part A):691–708, 2005.
12. NEC. SX-8R Press Release. http://www.hpce.nec.com/, 2006.
13. T.H.Dunigan Jr. et al. Performance Evaluation of The Cray X1 Distributed
Shared-Memory Architecture. IEEE MICRO, 25(1):30–40, 2005.
14. T.Kobayashi et al. FDTD simulation on array antenna SAR-GPR for land mine
detection. Proceedings of 1st International Symposium on Systems and Human
Science, pages 279–283, 2003.
15. Y.Takagi et al. Study of High Gain and Broadband Antipodal Fermi Antenna
with Corrugation. Proceedings of 2004 International Symposium on Antennas
and Propagation, 1:69–72, 2004.
The Road to TSUBAME and Beyond
Satoshi Matsuoka
racks, as well as 32 CRC units for cooling, laid out in a customized fashion to maximize cooling efficiency rather than the machine merely being placed as an afterthought. This allows for considerable density and much better cooling efficiency compared with other machines of similar performance. The total weight of TSUBAME exceeds 60 tons, requiring minor building reinforcements, as the current building was designed for systems of much smaller scale. TSUBAME occupies three rooms; room-to-room Infiniband connections are made via optical fiber, whereas CX4 copper cables are used within a room. The total power consumption of TSUBAME is less than a megawatt even at peak load, making it one of the most power-efficient supercomputers at the 100 Teraflops performance scale.
TSUBAME's lifetime was initially designed to be 4 years, until the spring of 2010. This could be extended by up to a year with interim upgrades, such as an upgrade to future quad-core processors or beyond. Eventually, however, the lifetime will expire, and we are already beginning to plan the design of "TSUBAME 2.0". One design consideration that is already clear is that the success of "Everybody's Supercomputer" should be continued; however, simply waiting for processor improvements from CPU vendors will not be sufficient to meet the growing demands, because the success of "Everybody's Supercomputer" has expanded the supercomputing community itself, not just individual needs.
Another requirement is not to increase the power or footprint requirements of the current machine, which poses a considerable supercomputer design challenge that we are researching at the moment. One such research investment is in the area of acceleration technologies, which provide a vastly improved Megaflops/Watt ratio. In fact, even currently, two-fifths of TSUBAME's peak computing power is provided by the ClearSpeed Advanced Accelerator PCI-X boards. However, acceleration technology is still narrowly scoped in terms of its applicability and user base; as such, we must generalize the use of acceleration via advances in algorithm and software technologies, as well as design a machine with the right mix of heterogeneous resources, including general-purpose processors and various types of accelerators. Another factor is storage, where multi-Petabyte storage with high bandwidth must be accommodated. Further challenges lie in devising more efficient cooling, better power control, and so on; meeting them will require advances in a multi-disciplinary fashion. This is not a mere pursuit of FLOPS but rather a "pursuit of FLOPS usable by everyone", a challenge worth taking on for those of us who are computer scientists. And the challenge will continue beyond TSUBAME 2.0 for many years to come.