AIAA JOURNAL

Vol. 36, No. 9, September 1998

Data-Parallel Line Relaxation Method for the Navier–Stokes Equations
Michael J. Wright,* Graham V. Candler,† and Deepak Bose‡
University of Minnesota, Minneapolis, Minnesota 55455

The Gauss–Seidel line relaxation method is modified for the simulation of viscous flows on massively parallel computers. The resulting data-parallel line relaxation method is shown to have good convergence properties for a series of test cases. The new method requires significantly more memory than the previously developed data-parallel relaxation methods, but it reaches a steady-state solution in much less time for all cases tested to date. In addition, the data-parallel line relaxation method shows good convergence properties even on the high-cell-aspect-ratio grids required to simulate high-Reynolds-number flows. The new method is implemented using message passing on the Cray T3E, and the parallel performance of the method on this machine is discussed. The data-parallel line relaxation method combines the fast convergence of the Gauss–Seidel line relaxation method with a high parallel efficiency and thus shows promise for large-scale simulation of viscous flows.
Downloaded by SUNY AT BUFFALO on January 23, 2015 | http://arc.aiaa.org | DOI: 10.2514/2.586

Received May 5, 1997; presented as Paper 97-2046 at the AIAA 13th Computational Fluid Dynamics Conference, Snowmass Village, CO, June 29–July 2, 1997; revision received May 4, 1998; accepted for publication May 14, 1998. Copyright © 1998 by the American Institute of Aeronautics and Astronautics, Inc. All rights reserved.
* Postdoctoral Research Associate, Department of Aerospace Engineering and Mechanics and Army High Performance Computing Research Center. Member AIAA.
† Associate Professor, Department of Aerospace Engineering and Mechanics and Army High Performance Computing Research Center. Senior Member AIAA.
‡ Graduate Research Assistant, Department of Aerospace Engineering and Mechanics and Army High Performance Computing Research Center; currently Research Scientist, Thermosciences Institute, Mail Stop 230-3, NASA Ames Research Center, Moffett Field, CA 94035. Member AIAA.

Introduction

The numerical simulation of large complex flowfields is a computationally intensive task. In addition, the large disparity between the different length scales encountered in high-Reynolds-number simulations can result in a stiff equation set that usually requires an implicit method to converge to a steady-state solution in a reasonable time. The cost associated with solving this equation set makes the use of a massively parallel supercomputer attractive because these machines have a very large peak performance. However, most implicit methods are inefficient when implemented on a parallel computer. A true implicit method requires the inversion of a large sparse matrix, which involves a great deal of interprocessor communication. This results in poor computational performance. The traditional solution to this problem has been to choose an effective serial algorithm and implement it in parallel using some sort of domain decomposition. This approach has been used with some success,¹ but many of the most effective serial algorithms contain inherent data dependencies and cannot be implemented effectively in parallel without modification.

Another approach for structured meshes is to design a new implicit algorithm that would take advantage of the structured communication pattern. Such an algorithm would be inherently well suited to a data-parallel environment without explicit domain decomposition. The algorithm would be efficient and easy to implement in either data-parallel or message-passing mode and would be portable to a wide variety of parallel architectures. For example, Candler et al.² and Wright et al.³ have shown that it is possible to make some modifications to the Yoon and Jameson lower-upper symmetric Gauss–Seidel (LU-SGS) method⁴ that make it almost perfectly data parallel. The resulting data-parallel lower-upper relaxation (DP-LUR) method replaces the diagonal Gauss–Seidel sweeps of the LU-SGS method with a series of pointwise relaxation steps. The DP-LUR method was shown to be attractive for the solution of a variety of viscous problems. However, the method shows a significant degradation of the convergence rate for high-Reynolds-number flows because of the high-cell-aspect-ratio grids needed to resolve the thin boundary layer. Modifications to the DP-LUR method that help to alleviate this problem were presented in Ref. 3, but it remains an issue for high-Reynolds-number flow simulations.

To fully address this convergence degradation on high-cell-aspect-ratio grids, it is necessary to solve the implicit equation in a more closely coupled manner. This is the approach taken in the Gauss–Seidel line relaxation (GSLR) method of MacCormack,⁵ which breaks a two-dimensional problem into a series of block tridiagonal matrix solutions in the body-normal direction. This method would be inefficient on a massively parallel machine because the Gauss–Seidel sweeps would require frequent and irregular interprocessor communication. However, it is possible to modify this method using an approach similar to that previously applied to the LU-SGS method. By replacing the Gauss–Seidel sweeps with a series of line relaxation steps, the algorithm can be parallelized effectively. In addition, the potential for a solution bias resulting from the use of the Gauss–Seidel sweeps, which can cause problems in three-dimensional simulations using the GSLR,⁶ is removed. This paper discusses the modifications required to create this data-parallel version of the GSLR algorithm. The resulting data-parallel line relaxation (DPLR) method is then compared with the DP-LUR method and the original GSLR method on a variety of viscous problems. Finally, implementation and performance issues on two different parallel computers, the Cray T3E and the Thinking Machines CM-5, are discussed.

Numerical Method

The fully implicit form of the two-dimensional Navier–Stokes equations in curvilinear coordinates is

$$\frac{U^{n+1} - U^n}{\Delta t} + \frac{\partial \tilde F^{n+1}}{\partial \xi} + \frac{\partial \tilde G^{n+1}}{\partial \eta} = 0$$

where $U$ is the vector of conserved quantities and $\tilde F$ and $\tilde G$ are the flux vectors in the $\xi$ (body-tangential) and $\eta$ (body-normal) directions. The flux vectors can be split into convective and viscous parts:

$$\tilde F = F + F_v, \qquad \tilde G = G + G_v$$

If we focus on the inviscid problem for now, we can linearize the flux vector using

$$F^{n+1} \simeq F^n + \left(\frac{\partial F}{\partial U}\right)^n (U^{n+1} - U^n) = F^n + A^n\, \delta U^n$$

$$G^{n+1} \simeq G^n + B^n\, \delta U^n$$
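The linearization above can be checked numerically. The following is a minimal sketch, assuming a one-dimensional perfect-gas Euler flux rather than the two-dimensional curvilinear form used in the paper; the value of `GAMMA`, the sample state, and all function names are illustrative assumptions. It evaluates the analytic Jacobian $A = \partial F / \partial U$, confirms that the perfect-gas flux satisfies $F = A U$, and shows that the linearization error shrinks quadratically with the size of $\delta U$.

```python
# Sketch: 1-D perfect-gas Euler flux F(U) and its Jacobian A = dF/dU.
# GAMMA, the sample state, and all names here are illustrative assumptions.
GAMMA = 1.4


def flux(U):
    """Inviscid flux for U = (rho, rho*u, E) with a perfect-gas pressure."""
    rho, m, E = U
    u = m / rho
    p = (GAMMA - 1.0) * (E - 0.5 * m * u)
    return [m, m * u + p, u * (E + p)]


def jacobian(U):
    """Analytic flux Jacobian A = dF/dU for the same 1-D system."""
    rho, m, E = U
    u = m / rho
    g1 = GAMMA - 1.0
    return [
        [0.0, 1.0, 0.0],
        [0.5 * (GAMMA - 3.0) * u * u, (3.0 - GAMMA) * u, g1],
        [g1 * u**3 - GAMMA * u * E / rho,
         GAMMA * E / rho - 1.5 * g1 * u * u,
         GAMMA * u],
    ]


def matvec(A, x):
    """Dense matrix-vector product for small lists of lists."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def linearization_error(U, dU):
    """Max-norm of F(U + dU) - (F(U) + A(U) dU)."""
    exact = flux([a + b for a, b in zip(U, dU)])
    approx = [f + c for f, c in zip(flux(U), matvec(jacobian(U), dU))]
    return max(abs(e - a) for e, a in zip(exact, approx))


U = [1.2, 0.6, 2.5]            # sample state (rho, rho*u, E), assumed
direction = [0.1, -0.2, 0.3]   # arbitrary perturbation direction, assumed

# Shrinking dU by 10x should shrink the error by about 100x (second order).
err_big = linearization_error(U, [1e-2 * d for d in direction])
err_small = linearization_error(U, [1e-3 * d for d in direction])

# Homogeneity of the perfect-gas flux: F(U) equals A(U) U.
homogeneity_gap = max(
    abs(f - a) for f, a in zip(flux(U), matvec(jacobian(U), U)))
```

The homogeneity property checked at the end is what allows the flux to be written as a Jacobian times the state vector and then split by the sign of the eigenvalues.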

We then split the fluxes according to the sign of the eigenvalues of the Jacobians

$$F = A_+ U + A_- U = F_+ + F_-$$

to obtain the standard upwind finite volume representation

$$
\begin{aligned}
\delta U^n_{i,j} + \frac{\Delta t}{V_{i,j}} \Big\{ & \big( A_{+\,i+\frac12,j} S_{i+\frac12,j}\, \delta U_{i,j} - A_{+\,i-\frac12,j} S_{i-\frac12,j}\, \delta U_{i-1,j} \big) \\
& - \big( A_{-\,i-\frac12,j} S_{i-\frac12,j}\, \delta U_{i,j} - A_{-\,i+\frac12,j} S_{i+\frac12,j}\, \delta U_{i+1,j} \big) \\
& + \big( B_{+\,i,j+\frac12} S_{i,j+\frac12}\, \delta U_{i,j} - B_{+\,i,j-\frac12} S_{i,j-\frac12}\, \delta U_{i,j-1} \big) \\
& - \big( B_{-\,i,j-\frac12} S_{i,j-\frac12}\, \delta U_{i,j} - B_{-\,i,j+\frac12} S_{i,j+\frac12}\, \delta U_{i,j+1} \big) \Big\}^n = \Delta t\, R^n_{i,j}
\end{aligned}
\tag{1}
$$

where $R^n_{i,j}$ is the change in the solution due to the fluxes at time level $n$, $S$ is the surface area of the cell face indicated by its indices, and $V_{i,j}$ is the cell volume.

For the solution of the Navier–Stokes equations the appropriate implicit viscous Jacobians must be included in Eq. (1). Following the method of Tysinger and Caughey,⁷ we can linearize the viscous flux vectors $F_v$ and $G_v$, assuming that the transport coefficients are locally constant, to obtain

$$F_v^{n+1} \simeq F_v^n + \frac{\partial}{\partial \xi} (L\, \delta U)^n, \qquad G_v^{n+1} \simeq G_v^n + \frac{\partial}{\partial \eta} (N\, \delta U)^n$$

where the viscous Jacobians $L$ and $N$ are evaluated in such a way that they are functions of the vector of conserved quantities $U$ and not the derivatives of $U$. With these definitions Eq. (1) will be unchanged if we simply replace the Euler Jacobians $A$ and $B$ with $\tilde A$ and $\tilde B$, where

$$\tilde A_+ = A_+ - L, \qquad \tilde A_- = A_- + L$$

$$\tilde B_+ = B_+ - N, \qquad \tilde B_- = B_- + N$$

Equation (1) is the basic starting equation for many implicit schemes, including both the DP-LUR and the DPLR methods. From this point, however, the derivation of these two methods is different. We first briefly review the derivation of the DP-LUR method here so that the similarities between the DP-LUR and the new DPLR method may be noted. The full derivation of the DP-LUR method can be found in Refs. 2 and 3.

DP-LUR Method

The first step in the derivation of the DP-LUR method is to move all of the off-diagonal terms in Eq. (1) to the right-hand side. The method of Yoon and Jameson⁴ is then used to approximate the implicit Jacobians with

$$A_+ = \tfrac12 (A + \rho_A I), \qquad A_- = \tfrac12 (A - \rho_A I)$$

where $\rho_A$ is the spectral radius of the Jacobian $A$, given by the magnitude of the largest eigenvalue $|u| + a$, where $a$ is the speed of sound. With this approximation the differences between the Jacobians on the diagonal become diagonal matrices, and the solution of the resulting equation is greatly simplified.

The LU-SGS algorithm employs a series of corner-to-corner sweeps through the flowfield using the latest available data for the off-diagonal terms to solve the resulting equation. Although this method has been shown to be efficient on a serial or vector machine, significant modifications are required to reduce or to eliminate the data dependencies and to make the method parallelize effectively. The DP-LUR approach solves this problem by replacing the Gauss–Seidel sweeps with a series of pointwise relaxation steps using the following scheme. First, the right-hand side $R_{i,j}$ is divided by the diagonal operator to obtain $\delta U^{(0)}$:

$$\delta U^{(0)}_{i,j} = \big\{ I + \lambda_A^n I + \lambda_B^n I \big\}^{-1}_{i,j}\, \Delta t\, R^n_{i,j}$$

Then a series of $k_{\max}$ relaxation steps are made using, for $k = 1, \ldots, k_{\max}$,

$$
\begin{aligned}
\delta U^{(k)}_{i,j} = \big\{ I + \lambda_A^n I + \lambda_B^n I \big\}^{-1}_{i,j} \Big\{ \Delta t\, R^n_{i,j} + \frac{\Delta t}{V_{i,j}} \Big[ & A^n_{+\,i-\frac12,j} S_{i-\frac12,j}\, \delta U^{(k-1)}_{i-1,j} - A^n_{-\,i+\frac12,j} S_{i+\frac12,j}\, \delta U^{(k-1)}_{i+1,j} \\
& + B^n_{+\,i,j-\frac12} S_{i,j-\frac12}\, \delta U^{(k-1)}_{i,j-1} - B^n_{-\,i,j+\frac12} S_{i,j+\frac12}\, \delta U^{(k-1)}_{i,j+1} \Big] \Big\}
\end{aligned}
$$

then

$$\delta U^n_{i,j} = \delta U^{(k_{\max})}_{i,j} \tag{2}$$

where $\lambda_A = \rho_A \Delta t S / V$. For the solution of viscous flows, Eq. (2) can be modified to include the contribution of the appropriate implicit viscous Jacobians by using a spectral radius approximation.³ With this approach, all data that are required for each relaxation step have already been computed during the previous step. Therefore, the entire relaxation step may be performed simultaneously in parallel without data dependencies. In addition, because the same pointwise calculation is performed on each computational cell, load balancing will be ensured as long as the data are evenly distributed across the available processors. Thus, the DP-LUR algorithm is almost perfectly data parallel, and aside from the required nearest-neighbor communication, a massively parallel computer should approach its peak operational performance.

A modified version of this method that improves the convergence rate on high-cell-aspect-ratio grids can be derived from Eq. (2) if the Yoon and Jameson⁴ approximation is relaxed and the approximate Jacobians are replaced with their exact counterparts. The resulting full matrix DP-LUR method improves performance by eliminating the overstabilization that is characteristic of all diagonalized methods.⁸ The full matrix method is slightly more memory and computationally intensive because it requires the storage and inversion of a single Jacobian matrix at each grid point, but it retains the excellent parallel efficiency of the original method.³

DPLR Method

Although both variations of the DP-LUR method are efficient for the simulation of many flows, they are both affected to some extent by the high-cell-aspect-ratio grids necessary to resolve the boundary layers of high-Reynolds-number flows. The dependence of the convergence rate on the cell aspect ratio was reduced by the introduction of the full matrix DP-LUR method, as discussed earlier, but some performance degradation is still apparent. The remaining dependence is primarily due to the fact that these methods place all of the off-diagonal terms on the right-hand side, and thus their effect is only weakly coupled to the diagonal. A more accurate approach would be to move all of the off-diagonal terms back to the left-hand side, as in Eq. (1), and to solve the fully coupled problem using a large block-banded matrix inversion. However, this approach would be extremely expensive and inefficient on a parallel machine. A better approach is possible for viscous external flows if we recognize that the viscous flow gradients will be much stronger in the body-normal ($\eta$) direction. Thus, the physical problem is much more strongly coupled in the body-normal direction, and it is possible to move just these body-normal terms back to the left-hand side, resulting in the following:

$$\hat B_{i,j}\, \delta U_{i,j+1} + \hat A_{i,j}\, \delta U_{i,j} - \hat C_{i,j}\, \delta U_{i,j-1} = -\hat D_{i,j}\, \delta U_{i+1,j} + \hat E_{i,j}\, \delta U_{i-1,j} + \Delta t\, R^n_{i,j} \tag{3}$$

where the matrices denoted by the carets are defined from the Jacobians as

$$\hat A_{i,j} = I + \frac{\Delta t}{V_{i,j}} \Big( \tilde A_{+\,i+\frac12,j} S_{i+\frac12,j} - \tilde A_{-\,i-\frac12,j} S_{i-\frac12,j} + \tilde B_{+\,i,j+\frac12} S_{i,j+\frac12} - \tilde B_{-\,i,j-\frac12} S_{i,j-\frac12} \Big)$$

$$\hat B_{i,j} = \frac{\Delta t}{V_{i,j}}\, \tilde B_{-\,i,j+\frac12} S_{i,j+\frac12}$$

$$\hat C_{i,j} = \frac{\Delta t}{V_{i,j}}\, \tilde B_{+\,i,j-\frac12} S_{i,j-\frac12}$$

$$\hat D_{i,j} = \frac{\Delta t}{V_{i,j}}\, \tilde A_{-\,i+\frac12,j} S_{i+\frac12,j}$$

$$\hat E_{i,j} = \frac{\Delta t}{V_{i,j}}\, \tilde A_{+\,i-\frac12,j} S_{i-\frac12,j}$$
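Written out for a single $i$ line, the structure of Eq. (3) becomes explicit. The following is an illustrative sketch only: the index range $j = 1, \ldots, J$ and the shorthand $r_{i,j}$ for the right-hand side are notation assumed here, and the boundary contributions that would be folded into the first and last block rows are omitted.

$$
\begin{bmatrix}
\hat A_{i,1} & \hat B_{i,1} & & \\
-\hat C_{i,2} & \hat A_{i,2} & \hat B_{i,2} & \\
 & \ddots & \ddots & \ddots \\
 & & -\hat C_{i,J} & \hat A_{i,J}
\end{bmatrix}
\begin{bmatrix}
\delta U_{i,1} \\ \delta U_{i,2} \\ \vdots \\ \delta U_{i,J}
\end{bmatrix}
=
\begin{bmatrix}
r_{i,1} \\ r_{i,2} \\ \vdots \\ r_{i,J}
\end{bmatrix},
\qquad
r_{i,j} = -\hat D_{i,j}\, \delta U_{i+1,j} + \hat E_{i,j}\, \delta U_{i-1,j} + \Delta t\, R^n_{i,j}
$$

Each entry of the matrix is itself a small square block, so the system for each $i$ line is block tridiagonal, and the coupling between neighboring $i$ lines enters only through $r_{i,j}$.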

In this way it is possible to solve Eq. (3) for all $j$ points at each $i$ location as a series of fully coupled block tridiagonal systems aligned in the body-normal direction. This was the approach used by MacCormack in the GSLR method.⁵ In this method the problem is solved via a series of alternating forward and backward sweeps through the flowfield in the $i$ direction, using the latest available data for the terms on the right-hand side. This method works well for two-dimensional flows on a serial or vector machine, but it is not straightforward to extend the method to three-dimensional flows.
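The body-normal line solves described above are tridiagonal solves. As a minimal sketch, assuming scalar entries in place of the Jacobian blocks and a diagonally dominant system so that no pivoting is needed, the solve can be split into a factorization that is done once and a substitution pass that can be repeated for new right-hand sides. All function names and the sample system are illustrative assumptions, not the authors' implementation.

```python
# Sketch: scalar tridiagonal solve along one body-normal line, split into a
# factorization (done once) and a substitution pass (repeatable for new
# right-hand sides). Names and the sample system are illustrative assumptions.


def factor_tridiagonal(lower, diag, upper):
    """LU-factor a tridiagonal matrix without pivoting (Thomas algorithm).

    lower[j] multiplies x[j-1], diag[j] multiplies x[j], and upper[j]
    multiplies x[j+1]; lower[0] and upper[-1] are ignored.
    Assumes the matrix is diagonally dominant.
    """
    n = len(diag)
    multipliers = [0.0] * n   # sub-diagonal of the unit lower factor
    pivots = [0.0] * n        # diagonal of the upper factor
    pivots[0] = diag[0]
    for j in range(1, n):
        multipliers[j] = lower[j] / pivots[j - 1]
        pivots[j] = diag[j] - multipliers[j] * upper[j - 1]
    return multipliers, pivots


def back_substitute(factors, upper, rhs):
    """Solve for one right-hand side using the stored factors."""
    multipliers, pivots = factors
    n = len(rhs)
    y = [0.0] * n
    y[0] = rhs[0]
    for j in range(1, n):                  # forward pass: L y = rhs
        y[j] = rhs[j] - multipliers[j] * y[j - 1]
    x = [0.0] * n
    x[-1] = y[-1] / pivots[-1]
    for j in range(n - 2, -1, -1):         # backward pass: U x = y
        x[j] = (y[j] - upper[j] * x[j + 1]) / pivots[j]
    return x


def apply_tridiagonal(lower, diag, upper, x):
    """Multiply the tridiagonal matrix by x (used to build a test rhs)."""
    n = len(x)
    out = []
    for j in range(n):
        value = diag[j] * x[j]
        if j > 0:
            value += lower[j] * x[j - 1]
        if j < n - 1:
            value += upper[j] * x[j + 1]
        out.append(value)
    return out


# Sample diagonally dominant line system with a known solution.
lower = [-1.0] * 5
diag = [4.0] * 5
upper = [-1.0] * 5
x_true = [1.0, 2.0, 3.0, 4.0, 5.0]
rhs = apply_tridiagonal(lower, diag, upper, x_true)

factors = factor_tridiagonal(lower, diag, upper)       # factor once
x_first = back_substitute(factors, upper, rhs)         # first rhs
x_second = back_substitute(factors, upper, [2.0 * r for r in rhs])  # reuse
```

Separating the two stages matters when the same matrix must be solved for several right-hand sides, since only the cheaper substitution pass has to be repeated.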
The obvious approach to the three-dimensional case is to continue to set up the problem as a series of block tridiagonal systems normal to the body and to sweep in both the axial and circumferential directions. However, this approach can lead to a nonphysical bias in the converged solution due to the biasing that is inherent in the Gauss–Seidel sweeping process.⁶ In addition, the data dependencies in the Gauss–Seidel sweeps would make the algorithm inefficient on a parallel machine. However, it is possible to eliminate both of these difficulties if we modify the GSLR method by replacing the Gauss–Seidel sweeps with a series of line relaxation steps, using a procedure similar to that outlined in the preceding section for the DP-LUR method. The resulting DPLR method is then described by the following scheme. First, the implicit terms on the right-hand side of Eq. (3) are neglected, and the resulting block tridiagonal system is factored and solved for $\delta U^{(0)}$:

$$\hat B_{i,j}\, \delta U^{(0)}_{i,j+1} + \hat A_{i,j}\, \delta U^{(0)}_{i,j} - \hat C_{i,j}\, \delta U^{(0)}_{i,j-1} = \Delta t\, R^n_{i,j}$$

Then a series of $k_{\max}$ relaxation steps are made using, for $k = 1, \ldots, k_{\max}$,

$$\hat B_{i,j}\, \delta U^{(k)}_{i,j+1} + \hat A_{i,j}\, \delta U^{(k)}_{i,j} - \hat C_{i,j}\, \delta U^{(k)}_{i,j-1} = -\hat D_{i,j}\, \delta U^{(k-1)}_{i+1,j} + \hat E_{i,j}\, \delta U^{(k-1)}_{i-1,j} + \Delta t\, R^n_{i,j}$$

then

$$\delta U^n_{i,j} = \delta U^{(k_{\max})}_{i,j} \tag{4}$$

Boundary conditions are implemented by folding the contribution of the boundary cells into the appropriate matrix in a manner identical to that used for GSLR.⁵ The DPLR method requires a single lower-upper (LU) factorization and $k_{\max} + 1$ backsubstitutions per iteration. By using this approach, all of the data required for each relaxation step have already been computed during the previous step. Therefore, as long as the data are distributed on the processors in such a way that the body-normal direction is entirely local, the entire relaxation step can be performed simultaneously in parallel with no data dependencies. In addition, because the $i$-direction off-diagonal terms are all equally lagged by one relaxation step, the biasing problem no longer exists, and implementation of a three-dimensional version of the algorithm becomes straightforward. The DPLR approach will use significantly more memory than the DP-LUR methods because five Jacobian matrices (seven for three-dimensional flows) must now be computed and stored at each grid point as compared with one for the full matrix DP-LUR method and none for the diagonal method. For perfect gas flows the DPLR method uses about twice as much memory as the DP-LUR method for the two-dimensional implementation and four times as much for the three-dimensional version. However, the DPLR method uses no more memory than the original GSLR method and should converge much faster than either DP-LUR approach due to the more exact formulation of the implicit operator.

Results

The perfect gas implementation of the DPLR method has been tested on two- and three-dimensional geometries with an emphasis on evaluating the convergence properties and parallel performance of the new method. The primary test case for this paper is the Mach 15 perfect gas flow over a cylinder-wedge blunt body, with Reynolds numbers $Re$ based on the freestream conditions and body length varying from $3 \times 10^4$ to $3 \times 10^8$. A sample 128 × 128 grid for this problem is shown in Fig. 1. The three-dimensional computations were performed on multiple planes of the same two-dimensional grids, which makes it easy to directly compare the convergence properties of the two- and three-dimensional implementations of the method.

Fig. 1 Sample 128 × 128 cylinder-wedge grid. Every fourth grid point is shown.

The boundary-layer resolution is measured with the wall variable

$$y^+ = \rho\, y\, u_* / \mu$$

where $u_*$ is the friction velocity, given in terms of the wall stress $\tau_w$ by $u_* = \sqrt{\tau_w / \rho}$. For a well-resolved boundary layer the mesh spacing is typically chosen so that $y^+ \le 1$ for the first cell above the body surface. The grid for each case is then exponentially stretched from the wall to the outer boundary.

Although the results presented here are based on modified first-order Steger–Warming flux vector splitting for the numerical flux evaluation,⁹ note that the derivation of the implicit algorithm is general and can be used with many flux evaluation methods. In fact, Liou and Van Leer¹⁰ have shown that the use of Steger–Warming splitting can be effective and robust for the implicit part of the problem, even when the fluxes are evaluated by a different method.

The DPLR method is in all instances more sensitive to the size of the implicit time step than the DP-LUR method, as expected, because the DPLR method involves a more exact representation of the problem, and therefore the size of the time step will have more physical meaning. In all cases presented here, the implicit time step was chosen to correspond to a Courant–Friedrichs–Lewy (CFL) number of 1 in the first iteration and was rapidly increased to its maximum stable value. The size of the maximum stable time step for each case was governed primarily by the freestream conditions, with little or no limitation due to the mesh spacing. Therefore the maximum CFL number varied considerably from case to case.

Figure 2a shows the effect of the number of relaxation steps ($k_{\max}$) on the convergence rate of the method on the two-dimensional cylinder-wedge geometry at a Reynolds number of $3 \times 10^4$ and $y^+ = 1$. In Fig. 2a the explicit solution was obtained using a standard first-order Euler method with the maximum stable time step (CFL = 0.1). The line marked $k_{\max} = 0$ corresponds to performing just the initial block tridiagonal solution along each $i$ line, with no implicit coupling in the $i$ direction. The $k_{\max} = 0$ approach is similar to that proposed by Wang and Widhopf¹¹ and later implemented in parallel by Taylor and Wang.¹² Although the $k_{\max} = 0$ approach offers a significant improvement over the explicit method, we see from Fig. 2a that the convergence rate of the method improves when the effect of the $i$-direction coupling terms is included ($k_{\max} > 0$). The DPLR method shows a dependence on $k_{\max}$ similar to that of the DP-LUR methods, with the convergence rate steadily improving as $k_{\max}$ increases, up to $k_{\max} = 6$. Figure 2b shows the effect of the number of relaxation steps on the cost of the method, evaluated as total CPU time on an eight-processor Cray T3E-900. Because the cost of evaluating the Jacobian matrices and setting up the LU factored block tridiagonal system is much greater than the cost of performing additional backsubstitutions, we see that increasing $k_{\max}$ also improves the cost effectiveness of the method, up to $k_{\max} = 4$. As shown in Fig. 2b, values of $k_{\max}$ larger than 4 give little improvement in convergence and are not cost effective. Therefore, all of the results presented in this paper were run at $k_{\max} = 4$.
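The difference between lagging all off-diagonal terms (as in DP-LUR) and lagging only the $i$-direction terms while solving the body-normal coupling implicitly (as in DPLR) can be illustrated on a scalar model problem. The sketch below is an assumption-laden toy, not the Navier–Stokes operator: the model matrix, the 8 × 8 grid, and the coupling strengths `CI` and `CJ` (chosen to mimic a high-cell-aspect-ratio grid with strong body-normal coupling) are all hypothetical. Both schemes take `KMAX = 4` relaxation steps in a single implicit solve.

```python
# Sketch: pointwise versus line relaxation on a scalar model problem.
# The model operator, grid size, and coupling strengths are assumptions chosen
# to mimic a high-aspect-ratio grid; this is not the Navier-Stokes operator.

NI, NJ = 8, 8          # cells in the i (tangential) and j (normal) directions
CI, CJ = 0.1, 100.0    # weak i coupling, strong j coupling
DIAG = 1.0 + 2.0 * CI + 2.0 * CJ
KMAX = 4               # number of relaxation steps


def neighbors_i(d, i, j):
    """Sum of the i-1 and i+1 neighbors (zero outside the grid)."""
    left = d[i - 1][j] if i > 0 else 0.0
    right = d[i + 1][j] if i < NI - 1 else 0.0
    return left + right


def neighbors_j(d, i, j):
    """Sum of the j-1 and j+1 neighbors (zero outside the grid)."""
    below = d[i][j - 1] if j > 0 else 0.0
    above = d[i][j + 1] if j < NJ - 1 else 0.0
    return below + above


def residual_norm(d, rhs):
    """Max-norm of rhs - M d, with M = DIAG*I - CI*(i nbrs) - CJ*(j nbrs)."""
    return max(
        abs(rhs[i][j] - DIAG * d[i][j]
            + CI * neighbors_i(d, i, j) + CJ * neighbors_j(d, i, j))
        for i in range(NI) for j in range(NJ))


def solve_line(rhs_line):
    """Thomas algorithm for DIAG*x[j] - CJ*(x[j-1] + x[j+1]) = rhs_line[j]."""
    n = len(rhs_line)
    c = [0.0] * n
    y = [0.0] * n
    c[0] = -CJ / DIAG
    y[0] = rhs_line[0] / DIAG
    for j in range(1, n):
        denom = DIAG + CJ * c[j - 1]
        c[j] = -CJ / denom
        y[j] = (rhs_line[j] + CJ * y[j - 1]) / denom
    x = [0.0] * n
    x[-1] = y[-1]
    for j in range(n - 2, -1, -1):
        x[j] = y[j] - c[j] * x[j + 1]
    return x


def point_relaxation(rhs):
    """DP-LUR-like: every off-diagonal term is lagged by one step."""
    d = [[rhs[i][j] / DIAG for j in range(NJ)] for i in range(NI)]
    for _ in range(KMAX):
        d = [[(rhs[i][j] + CI * neighbors_i(d, i, j)
               + CJ * neighbors_j(d, i, j)) / DIAG
              for j in range(NJ)] for i in range(NI)]
    return d


def line_relaxation(rhs):
    """DPLR-like: j coupling solved implicitly, only i neighbors lagged."""
    d = [solve_line(rhs[i]) for i in range(NI)]
    for _ in range(KMAX):
        d = [solve_line([rhs[i][j] + CI * neighbors_i(d, i, j)
                         for j in range(NJ)]) for i in range(NI)]
    return d


rhs = [[1.0] * NJ for _ in range(NI)]
point_res = residual_norm(point_relaxation(rhs), rhs)
line_res = residual_norm(line_relaxation(rhs), rhs)
```

In this toy setting the line-relaxed residual is many orders of magnitude below the pointwise one after the same number of steps, because the strongly coupled direction is solved exactly and only the weak coupling is lagged. Every line solve within a step depends only on data from the previous step, so all lines could be processed concurrently.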

Fig. 2 a) Convergence histories and b) CPU times on an eight-processor Cray T3E-900 for the DPLR method showing influence of $k_{\max}$: two-dimensional cylinder-wedge blunt body at $M_\infty = 15$ and $Re = 3 \times 10^4$; 128 × 128 grid with $y^+ = 1$ for the first cell above the body.

The convergence rate of the new DPLR method is compared with the original GSLR method in Fig. 3. Both methods are tested at the same conditions as in Fig. 2. The methods behave similarly, with the convergence rate increasing significantly after about 150 iterations. The slower convergence rate at the beginning of the solution is due to the motion of the bow shock through the computational grid. Because this is a highly nonlinear process, the shock will move at most one computational cell per iteration. However, once the bow shock has reached its final location, the block tridiagonal solutions rapidly drive the error norm toward machine zero. We see that, although both methods achieve a 10-order-of-magnitude reduction in the density error norm in fewer than 300 iterations, the DPLR method using $k_{\max} = 4$ actually performs a little better than the GSLR method with four sweeps through the flowfield. This is because the DPLR method allows larger time steps to be used for this case. If both methods are run with the same time step, their performance is nearly identical. This is surprising because the GSLR method always uses latest available data and thus should allow information to travel across the entire computational domain during each implicit iteration, whereas with the DPLR method information can travel only $k_{\max}$ cells per iteration in the axial ($i$) direction. However, both methods are identical in their treatment of the body-normal terms.

Fig. 3 Convergence histories for the DPLR method as compared with the original GSLR method: two-dimensional cylinder-wedge blunt body at $M_\infty = 15$ and $Re = 3 \times 10^4$; 128 × 128 grid with $y^+ = 1$.

Figure 4 compares the convergence rate of the DPLR method on two cylinder-wedge flows at Mach 15 and 2. Both cases are run at a Reynolds number of $3 \times 10^4$. We see that both cases reach a 10-order-of-magnitude reduction in the density error norm in fewer than 300 iterations. However, there are differences in the convergence histories. The low-Mach-number case requires fewer iterations for the bow shock to reach its final location because the low-Mach-number flow is less nonlinear. In addition, once the shock has reached its final location, the convergence rate of the Mach 2 flow is slower than that for the Mach 15 flow, due to the longer characteristic flow time. Similar results are obtained at other Mach numbers. This shows that the DPLR method can be an effective tool for the solution of both supersonic and hypersonic flows.

Fig. 4 Convergence histories for the DPLR method on a high-Mach-number and low-supersonic-Mach-number flow: two-dimensional cylinder-wedge blunt body at $Re = 3 \times 10^4$; 128 × 128 grid with $y^+ = 1$ for each case.

The performance of the DPLR method is also examined for three-dimensional flows, using multiple identical planes of the baseline 128 × 128 cylinder-wedge blunt body grid. In all of the cases, there is essentially no difference in the convergence histories between the two-dimensional and three-dimensional implementations. By performing the three-dimensional calculations on multiple planes of a two-dimensional grid, we can also easily check for any evidence of the solution bias that is exhibited by the three-dimensional GSLR method. This effect is always more noticeable in such cases because any solution bias will tend to create a nonphysical crossflow velocity in the direction in which the multiple planes are projected. This will be evident even before any other differences can be detected between the solutions on different planes. In all of the cases tested to date, the maximum crossflow velocity in the three-dimensional flowfield is more than 10 orders of magnitude smaller than the freestream velocity. We feel that this value is sufficiently small to be attributed to machine roundoff errors, and thus the DPLR method has eliminated the solution bias problem.

The new DPLR method is compared with the diagonal and full matrix DP-LUR methods in Fig. 5 for the cylinder-wedge blunt body at a Reynolds number of $3 \times 10^4$ and $y^+ = 1$. All three methods are run at $k_{\max} = 4$. We see in Fig. 5a that the DPLR method is very efficient, achieving a 10-order-of-magnitude reduction in the density error norm in fewer than 300 iterations, as compared with

2300 iterations for the full matrix method and 6200 iterations for the diagonal DP-LUR method. However, convergence in fewer iterations does not necessarily imply that the method will be more cost effective on a massively parallel machine. It is also necessary to know how much each iteration of the method will cost. Therefore, to examine the effectiveness of the new algorithm, Fig. 5b compares the cost of the methods, plotted as CPU time on an eight-processor Cray T3E-900. From Fig. 5b we see that the DPLR method also has a high parallel efficiency and is by far the most cost effective of the three methods. For this problem the line relaxation method reaches a 10-order-of-magnitude reduction in the density error norm in just 35 s, compared with 235 s for the full matrix method and 359 s for the diagonal method. This shows that the DPLR method can potentially be a powerful tool in the simulation of viscous flows.

Fig. 5 a) Convergence histories and b) CPU times on an eight-processor Cray T3E-900 for the DPLR method as compared with the two DP-LUR methods: two-dimensional cylinder-wedge blunt body at $M_\infty = 15$ and $Re = 3 \times 10^4$; 128 × 128 grid with $y^+ = 1$.

Figure 6 shows the convergence histories of the DPLR method for three viscous flows with Reynolds numbers ranging from $3 \times 10^4$ to $3 \times 10^8$. The 128 × 128 grid for each case was chosen so that $y^+ = 1$ for the first cell above the body, ensuring that the boundary layers for all cases are equally well resolved. Because the boundary-layer thickness decreases with increasing Reynolds number, the maximum cell aspect ratio (CAR) of the grid must increase as well to meet the $y^+ = 1$ requirement. For the test cases in Fig. 6, the maximum CAR ranges from 35 for $Re = 3 \times 10^4$ to about 125,000 for $Re = 3 \times 10^8$. We can see that, although each of the cases behaves differently during the early phases of the flow evolution, all converge with the same terminal slope after the bow shock reaches its final location. In addition, there is essentially no increase in the number of iterations required to reach steady state as the Reynolds number (and therefore the CAR) is increased.

Fig. 6 Convergence histories for the DPLR method showing influence of Reynolds number: two-dimensional cylinder-wedge blunt body at $M_\infty = 15$; 128 × 128 grid with $y^+ = 1$ for each case.

Figure 7 compares the convergence properties of the DPLR method with the DP-LUR methods on several viscous flows with Reynolds numbers ranging from $3 \times 10^4$ to $3 \times 10^8$. Once again, the 128 × 128 grid for each case was chosen so that $y^+ = 1$ for the first cell above the body. As the Reynolds number is increased by four orders of magnitude, the amount of computer time required to achieve a 10-order-of-magnitude reduction in the density error norm remains constant for the DPLR method, whereas the time increases by a factor of 6 for the full matrix method and 7 for the diagonal method. This shows that the more exact implicit operator used in the DPLR method eliminates the convergence degradation on high-cell-aspect-ratio grids.

Fig. 7 CPU times on an eight-processor Cray T3E-900 required to achieve 10 orders of density error norm convergence for the DPLR and DP-LUR methods as a function of the Reynolds number: two-dimensional cylinder-wedge blunt body at $M_\infty = 15$ and $Re = 3 \times 10^4$; 128 × 128 grid with $y^+ = 1$ for each case.

Figure 8 compares the viscous and inviscid implementations of the DPLR method for two of the cases in Fig. 6. The grid for each case was chosen to satisfy the $y^+ = 1$ condition for the viscous solution. The inviscid solutions were then obtained on the same computational grids. We see that the convergence rates for the viscous and inviscid versions of the method are almost identical. This is in direct contrast to the DP-LUR method, which always requires more iterations to converge the viscous solution. This again shows the benefit of moving the body-normal terms back to the left-hand side of the equation and coupling them directly to the diagonal.

Parallel Performance

The DPLR algorithm is inherently data parallel by design and requires no asynchronous communication or computation. This makes the code readily portable to a variety of parallel architectures with only minor modifications because it is relatively easy to modify a data-parallel code to run effectively on a message-passing machine, whereas the reverse is not necessarily possible. The DPLR method

Fig. 8 Convergence histories for the viscous and inviscid implementa- Fig. 9 Parallel efŽ ciency for the two-dimensional viscous DPLR
tion of the DPLR method: two-dimensional cylinder-wedge blunt body method on the T3E-900: 512 £ 512 computational grid used.
at M1 = 15; 128 £ 128 grid with y+ = 1 for each case.

was implemented and tested on two different massively parallel architectures. First, a message-passing version using the Message-Passing Interface (MPI) standard for interprocessor communication was implemented on the 272-processor Cray T3E-900 located at Network Computing Services, Inc. (formerly the Minnesota Supercomputer Center). Each processor of this machine has 128 Mbytes of memory and a theoretical peak performance of 900 Mflops. In addition, a data-parallel version was implemented on the 896-processor Thinking Machines CM-5 located at the University of Minnesota Army High Performance Computing Research Center. Each processor of this machine has 32 Mbytes of memory and four vector units, yielding a theoretical peak performance of 128 Mflops per processor.

The data-parallel implementation of the new DPLR method should retain the perfect scalability and high parallel efficiency that characterized the DP-LUR method² because the algorithm design is very similar. However, it is difficult to show this on the CM-5 because it is a vector parallel machine, and thus large vector lengths are required to ensure good performance. In addition, it has been shown that only the number of points in the dimensions that are spread across all of the nodes should be used when evaluating the vector length.² To solve the block tridiagonal system required by the DPLR method without interprocessor communication, it is necessary that all j points corresponding to a particular i location be entirely on processor, whereas with the DP-LUR method both the i and j directions can be spread across the available processors. Therefore, when the 128 × 128 test case presented in this paper is run on a 64-processor CM-5 with four vector units per processor, the vector length will be 64 for the DP-LUR method but will actually be less than 1 for the DPLR method. This means that the DPLR method is not using the vector hardware on the processors. Therefore we would expect the two-dimensional DPLR method to be very slow on the CM-5 due to the small vector length, even though it may have a high parallel efficiency. This is indeed the case for the implementation tested. For the 128 × 128 test case presented here, the DPLR method runs at only 1.39 Mflops per processor, as compared with 13.2 for the diagonal DP-LUR method and 20.7 for the full matrix method. This would not be a problem on a machine without vector hardware. Thus, although the DPLR method should be efficient on many data-parallel architectures, it is difficult to directly calculate the parallel efficiency of the method on the CM-5.

The performance of the message-passing implementation of the method on the T3E is easier to evaluate because there is no vector hardware on this machine. In this implementation, the data are distributed across the processors by breaking the problem in the i direction. Communication latency is masked by using nonblocking sends and receives in MPI. The parallel speedup curve for the two-dimensional DPLR algorithm on the T3E-900 is presented in Fig. 9. We see that the method has almost perfect speedup, up to the maximum number of processors tested. In fact, on 32 processors the speedup is 31.3, which corresponds to a parallel efficiency of 0.98.

The sustained performance for the two-dimensional and three-dimensional implementations of the DPLR method on the T3E-900 is about 75 Mflops per node, which is only 8.3% of the peak theoretical performance of the machine. This number seems quite low, but it is comparable to other published results. The NAS 2 parallel benchmark results offer the best comparison because these codes have been written to simulate actual computational fluid dynamics applications, and the individual machine vendors have not been allowed to perform assembler-level optimizations to the source code. Unfortunately, NAS 2 benchmark results have not yet been published for the T3E. However, results from the T3D for the block tridiagonal (BT) benchmark show a performance of about 10 Mflops per processor.¹³ In addition, results for the NAS 1 benchmarks, which have been published for both the T3D and T3E-600 (600 Mflops peak performance), show that the sustained speed on the T3E-600 is typically about 3.3 times that on the T3D.¹⁴ This would result in a performance of about 33 Mflops per processor on the T3E-600 for the BT benchmark or about 50 Mflops per processor on the T3E-900, assuming perfect scalability. Therefore, the 75 Mflops per processor obtained for the DPLR method seems reasonable. However, it is possible that further optimizations can be made to the source code that would increase the performance.

Conclusions

The GSLR method has been modified to make it amenable to the solution of the Navier–Stokes equations on massively parallel supercomputers. The resulting DPLR method replaces the Gauss–Seidel sweeps of the original with a series of line relaxation steps. In this manner all of the data dependencies in the original method are removed, and each relaxation step becomes almost perfectly data parallel. Because of its design, the new method can be easily implemented in either the data-parallel or message-passing programming styles. The method also retains the good convergence properties of the original GSLR method, and in fact with four relaxation steps it converges in fewer iterations for the test cases considered. In addition, the relaxation steps eliminate the solution bias problem exhibited by the three-dimensional GSLR method, and thus the DPLR method can easily be extended to the solution of three-dimensional flows.

The DPLR method uses considerably more memory than either of the previously developed DP-LUR methods but demonstrates a dramatic improvement in cost effectiveness, reaching steady state in about 15% of the time on a Cray T3E. In addition, both DP-LUR methods showed a degradation of the convergence rate when high-Reynolds-number flows were simulated. However, the DPLR method is more strongly coupled in the body-normal direction and thus has good convergence properties at all Reynolds numbers.

The new method has been implemented using message passing on the Cray T3E-900 and shows nearly perfect speedup, with a parallel efficiency of 98% even when 32 processors are used. The single-node performance of the method is about 75 Mflops per processor, which is only 8.3% of the peak theoretical performance of the machine.
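The performance figures quoted in the Parallel Performance discussion can be checked with a few lines of arithmetic. The Python sketch below is only an illustration: the function and variable names are hypothetical and do not come from the paper. The vector-length formula (points spread across the nodes divided by the total number of vector units) is an assumed reading of the text, adopted here because it reproduces the quoted values of 64 and "less than 1". The T3E-900 estimate scales the T3E-600 figure by the ratio of peak rates, which corresponds to the text's "assuming perfect scalability".

```python
# Worked check of the performance figures quoted in the text.
# All names here are illustrative assumptions, not taken from the paper.

def parallel_efficiency(speedup, processors):
    """Parallel efficiency = measured speedup / number of processors."""
    return speedup / processors

def fraction_of_peak(sustained_mflops, peak_mflops):
    """Sustained rate as a fraction of the theoretical peak rate."""
    return sustained_mflops / peak_mflops

def vector_length(points_spread, processors, vector_units_per_processor):
    """Assumed definition: points spread across the nodes per vector unit."""
    return points_spread / (processors * vector_units_per_processor)

# T3E-900: speedup of 31.3 on 32 processors.
efficiency = parallel_efficiency(31.3, 32)

# T3E-900: 75 Mflops sustained against a 900 Mflops peak.
t3e_fraction = fraction_of_peak(75.0, 900.0)

# NAS BT estimate: about 10 Mflops on the T3D, T3E-600 about 3.3x faster,
# then scaled by the peak-rate ratio 900/600 (perfect scalability assumed).
bt_t3d = 10.0
bt_t3e600 = bt_t3d * 3.3
bt_t3e900 = bt_t3e600 * 900.0 / 600.0

# CM-5 with 64 processors and 4 vector units each, 128 x 128 grid.
# DP-LUR spreads both i and j; DPLR spreads only i (all j stay on processor).
vl_dplur = vector_length(128 * 128, 64, 4)
vl_dplr = vector_length(128, 64, 4)

print(f"parallel efficiency on 32 processors: {efficiency:.3f}")
print(f"fraction of peak on T3E-900: {100 * t3e_fraction:.1f}%")
print(f"BT estimate: T3E-600 {bt_t3e600:.1f}, T3E-900 {bt_t3e900:.1f} Mflops")
print(f"CM-5 vector length: DP-LUR {vl_dplur:.1f}, DPLR {vl_dplr:.1f}")
```

Under these assumptions the numbers agree with the rounded values reported in the text: an efficiency of about 0.98, about 8.3% of peak, roughly 33 and 50 Mflops for the BT estimates, and vector lengths of 64 and 0.5.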

However, this value is comparable to data obtained for the NAS 2 parallel benchmarks and does not detract from the high parallel efficiency of the method.

In short, the high parallel efficiency and good convergence characteristics of the DPLR method make it attractive for the solution of very large compressible flow problems.

Acknowledgments

The authors were supported by the NASA Langley Research Center under Contract NAG-1-1498 and the Army Research Office under Grant DAAH04-93-G-0089. This work was also supported in part by the U.S. Army High Performance Computing Research Center under the auspices of the Department of the Army, Army Research Laboratory Cooperative Agreement DAAH04-95-2-0003/Contract DAAH04-95-C-0008, the content of which does not necessarily reflect the position or the policy of the government, and no official endorsement should be inferred. Computer time on the Cray T3E was provided by the Minnesota Supercomputer Institute.

References

¹Simon, H. D. (ed.), Parallel Computational Fluid Dynamics Implementations and Results, MIT Press, Cambridge, MA, 1992.

²Candler, G. V., Wright, M. J., and McDonald, J. D., “Data-Parallel Lower-Upper Relaxation Method for Reacting Flows,” AIAA Journal, Vol. 32, No. 12, 1994, pp. 2380–2386.

³Wright, M. J., Candler, G. V., and Prampolini, M., “Data-Parallel Lower-Upper Relaxation Method for the Navier–Stokes Equations,” AIAA Journal, Vol. 34, No. 7, 1996, pp. 1371–1377.

⁴Yoon, S., and Jameson, A., “A Lower-Upper Symmetric Gauss-Seidel Method for the Euler and Navier–Stokes Equations,” AIAA Journal, Vol. 26, No. 9, 1988, pp. 1025, 1026.

⁵MacCormack, R. W., “Current Status of the Numerical Solutions of the Navier–Stokes Equations,” AIAA Paper 85-0032, Jan. 1985.

⁶MacCormack, R. W., “Solution of the Navier–Stokes Equations in Three Dimensions,” AIAA Paper 90-1520, June 1990.

⁷Tysinger, T., and Caughey, D., “Implicit Multigrid Algorithm for the Navier–Stokes Equations,” AIAA Paper 91-0242, Jan. 1991.

⁸Yoon, S., and Kwak, D., “Multigrid Convergence of an Implicit Symmetric Relaxation Scheme,” AIAA Journal, Vol. 32, No. 5, 1994, pp. 950–955.

⁹MacCormack, R. W., and Candler, G. V., “The Solution of the Navier–Stokes Equations Using Gauss–Seidel Line Relaxation,” Computers and Fluids, Vol. 17, No. 1, 1989, pp. 135–150.

¹⁰Liou, M. S., and Van Leer, B., “Choice of Implicit and Explicit Operators for the Upwind Differencing Method,” AIAA Paper 88-0624, Jan. 1988.

¹¹Wang, J. C., and Widhopf, G. F., “An Efficient Finite Volume TVD Scheme for Steady State Solutions of the 3-D Compressible Euler/Navier–Stokes Equations,” AIAA Paper 90-1523, June 1990.

¹²Taylor, S., and Wang, J. C., “Launch Vehicle Simulations Using a Concurrent Implicit Navier–Stokes Solver,” AIAA Paper 95-0223, Jan. 1995.

¹³Saphir, W., Woo, A., and Yarrow, M., “The NAS Parallel Benchmarks 2.1 Results,” Numerical Aerospace Simulation Facility, NAS TR NAS-96-010, NASA Ames Research Center, Moffett Field, CA, Aug. 1996.

¹⁴Saini, S., and Bailey, D. H., “NAS Parallel Benchmark (Version 1.0) Results 11-96,” Numerical Aerospace Simulation Facility, NAS TR NAS-96-018, NASA Ames Research Center, Moffett Field, CA, Nov. 1996.

D. S. McRae
Associate Editor