Data-Parallel Line Relaxation Method for the Navier–Stokes Equations
The Gauss–Seidel line relaxation method is modified for the simulation of viscous flows on massively parallel computers. The resulting data-parallel line relaxation method is shown to have good convergence properties for a series of test cases. The new method requires significantly more memory than the previously developed data-parallel relaxation methods, but it reaches a steady-state solution in much less time for all cases tested to date. In addition, the data-parallel line relaxation method shows good convergence properties even on the high-cell-aspect-ratio grids required to simulate high-Reynolds-number flows. The new method is implemented using message passing on the Cray T3E, and the parallel performance of the method on this machine is discussed. The data-parallel line relaxation method combines the fast convergence of the Gauss–Seidel line relaxation method with a high parallel efficiency and thus shows promise for large-scale simulation of viscous flows.
Introduction

The numerical simulation of large complex flow fields is a computationally intensive task. In addition, the large disparity between the different length scales encountered in high-Reynolds-number simulations can result in a stiff equation set that usually requires an implicit method to converge to a steady-state solution in a reasonable time. The cost associated with solving this equation set makes the use of a massively parallel supercomputer attractive because these machines have a very large peak performance. However, most implicit methods are inefficient when implemented on a parallel computer. A true implicit method requires the inversion of a large sparse matrix, which involves a great deal of interprocessor communication. This results in poor computational performance. The traditional solution to this problem has been to choose an effective serial algorithm and implement it in parallel using some sort of domain decomposition. This approach has been used with some success,1 but many of the most effective serial algorithms contain inherent data dependencies and cannot be implemented effectively in parallel without modification.

Another approach for structured meshes is to design a new implicit algorithm that would take advantage of the structured communication pattern. Such an algorithm would be inherently well suited to a data-parallel environment without explicit domain decomposition. The algorithm would be efficient and easy to implement in either data-parallel or message-passing mode and would be portable to a wide variety of parallel architectures. For example, Candler et al.2 and Wright et al.3 have shown that it is possible to make some modifications to the Yoon and Jameson lower-upper symmetric Gauss–Seidel (LU-SGS) method4 that make it almost perfectly data parallel. The resulting data-parallel lower-upper relaxation (DP-LUR) method replaces the diagonal Gauss–Seidel sweeps of the LU-SGS method with a series of pointwise relaxation steps. The DP-LUR method was shown to be attractive for the solution of a variety of viscous problems. However, the method shows a significant degradation of the convergence rate for high-Reynolds-number flows because of the high-cell-aspect-ratio grids needed to resolve the thin boundary layer. Modifications to the DP-LUR method that help to alleviate this problem were presented in Ref. 3, but it remains an issue for high-Reynolds-number flow simulations.

To fully address this convergence degradation on high-cell-aspect-ratio grids, it is necessary to solve the implicit equation in a more closely coupled manner. This is the approach taken in the Gauss–Seidel line relaxation (GSLR) method of MacCormack,5 which breaks a two-dimensional problem into a series of block tridiagonal matrix solutions in the body-normal direction. This method would be inefficient on a massively parallel machine because the Gauss–Seidel sweeps would require frequent and irregular interprocessor communication. However, it is possible to modify this method using an approach similar to that previously applied to the LU-SGS method. By replacing the Gauss–Seidel sweeps with a series of line relaxation steps, the algorithm can be parallelized effectively. In addition, the potential for a solution bias resulting from the use of the Gauss–Seidel sweeps, which can cause problems in three-dimensional simulations using the GSLR,6 is removed. This paper discusses the modifications required to create this data-parallel version of the GSLR algorithm. The resulting data-parallel line relaxation (DPLR) method is then compared with the DP-LUR method and the original GSLR method on a variety of viscous problems. Finally, implementation and performance issues on two different parallel computers, the Cray T3E and the Thinking Machines CM-5, are discussed.

Numerical Method

The fully implicit form of the two-dimensional Navier–Stokes equations in curvilinear coordinates is

\frac{U^{n+1} - U^n}{\Delta t} + \frac{\partial \tilde F^{n+1}}{\partial \xi} + \frac{\partial \tilde G^{n+1}}{\partial \eta} = 0
where U is the vector of conserved quantities and F̃ and G̃ are the flux vectors in the ξ (body-tangential) and η (body-normal) directions. The flux vectors can be split into convective and viscous parts:

\tilde F = F + F_v, \qquad \tilde G = G + G_v

If we focus on the inviscid problem for now, we can linearize the flux vector using

F^{n+1} \simeq F^n + \left( \frac{\partial F}{\partial U} \right)^{n} (U^{n+1} - U^n) = F^n + A^n \, \delta U^n

G^{n+1} \simeq G^n + B^n \, \delta U^n
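This linearization can be checked numerically. The sketch below is an illustrative aid, not part of the original implementation: it forms the Jacobian A = ∂F/∂U of the one-dimensional Euler flux by central finite differences (the states, perturbation, and function names are hypothetical) and confirms that F(U + δU) agrees with F(U) + A δU to second order in the perturbation.

```python
import numpy as np

def euler_flux(U, gamma=1.4):
    """1-D Euler flux F(U) for the conserved state U = [rho, rho*u, E]."""
    rho, mom, E = U
    u = mom / rho
    p = (gamma - 1.0) * (E - 0.5 * rho * u**2)
    return np.array([mom, mom * u + p, (E + p) * u])

def flux_jacobian_fd(U, eps=1e-7):
    """Central finite-difference approximation of A = dF/dU."""
    n = U.size
    A = np.zeros((n, n))
    for j in range(n):
        dU = np.zeros(n)
        dU[j] = eps * max(1.0, abs(U[j]))
        A[:, j] = (euler_flux(U + dU) - euler_flux(U - dU)) / (2.0 * dU[j])
    return A

U = np.array([1.0, 0.5, 2.5])            # illustrative state: rho, rho*u, E
dU = 1e-3 * np.array([0.3, -0.2, 0.4])   # small state perturbation
A = flux_jacobian_fd(U)

exact = euler_flux(U + dU)
linear = euler_flux(U) + A @ dU          # F^n + A^n dU^n
print("linearization error:", np.linalg.norm(exact - linear))  # O(|dU|^2)
```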
We then split the fluxes according to the sign of the eigenvalues of the Jacobians

F = A^+ U + A^- U = F^+ + F^-

to obtain the standard upwind finite volume representation

\delta U^n_{i,j} + \frac{\Delta t}{V_{i,j}} \Big( A^+_{i+1/2,j} S_{i+1/2,j}\,\delta U_{i,j} - A^+_{i-1/2,j} S_{i-1/2,j}\,\delta U_{i-1,j}
  - A^-_{i-1/2,j} S_{i-1/2,j}\,\delta U_{i,j} + A^-_{i+1/2,j} S_{i+1/2,j}\,\delta U_{i+1,j}
  + B^+_{i,j+1/2} S_{i,j+1/2}\,\delta U_{i,j} - B^+_{i,j-1/2} S_{i,j-1/2}\,\delta U_{i,j-1}
  - B^-_{i,j-1/2} S_{i,j-1/2}\,\delta U_{i,j} + B^-_{i,j+1/2} S_{i,j+1/2}\,\delta U_{i,j+1} \Big) = \Delta t\, R^n_{i,j}   (1)

where R^n_{i,j} is the change in the solution due to the fluxes at time level n, S is the surface area of the cell face indicated by its indices, and V_{i,j} is the cell volume.

For the solution of the Navier–Stokes equations the appropriate implicit viscous Jacobians must be included in Eq. (1). Following Refs. 2 and 3, the DP-LUR method replaces the Gauss–Seidel sweeps of the LU-SGS method with a series of pointwise relaxation steps using the following scheme. First, the right-hand side R_{i,j} is divided by the diagonal operator to obtain

\delta U^{(0)}_{i,j} = \left( I + \lambda^n_A I + \lambda^n_B I \right)^{-1}_{i,j} \Delta t\, R^n_{i,j}

Then a series of k_max relaxation steps are made using, for k = 1, k_max,

\delta U^{(k)}_{i,j} = \left( I + \lambda^n_A I + \lambda^n_B I \right)^{-1}_{i,j} \Big[ \Delta t\, R^n_{i,j}
  + \frac{\Delta t}{V_{i,j}} \big( A^{+n}_{i-1/2,j} S_{i-1/2,j}\,\delta U^{(k-1)}_{i-1,j} - A^{-n}_{i+1/2,j} S_{i+1/2,j}\,\delta U^{(k-1)}_{i+1,j}
  + B^{+n}_{i,j-1/2} S_{i,j-1/2}\,\delta U^{(k-1)}_{i,j-1} - B^{-n}_{i,j+1/2} S_{i,j+1/2}\,\delta U^{(k-1)}_{i,j+1} \big) \Big]

then

\delta U^n_{i,j} = \delta U^{(k_{max})}_{i,j}   (2)

where λ_A = ρ_A Δt S/V, with ρ_A the spectral radius of the Jacobian A. For the solution of viscous flows, Eq. (2) can be modified to include the contribution of the appropriate implicit viscous Jacobians by using a spectral radius approximation.3 With this approach, all data that are required for each relaxation step have already been computed during the previous step. Therefore, the entire relaxation step may be performed simultaneously in parallel without data dependencies. In addition, because the same pointwise calculation is performed on each computational cell, load balancing will be ensured as long as the data are evenly distributed across the available processors. Thus, the DP-LUR algorithm is almost perfectly data parallel, and aside from the required nearest-neighbor communication it maps naturally onto a massively parallel machine.
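A minimal sketch of the relaxation structure of Eq. (2) follows. It is written for a scalar model problem so that the split Jacobians reduce to constants; the actual method operates on block matrices, and the function and variable names here are illustrative only. Note that every cell is updated from values of the previous step, so each sweep carries no data dependencies.

```python
import numpy as np

def dp_lur_sweep(R, lamA, lamB, Ap, Am, Bp, Bm, dt, dt_over_V, kmax=4):
    """Scalar-model sketch of the DP-LUR relaxation, Eq. (2).

    R          : (ni, nj) right-hand side at time level n
    lamA, lamB : spectral radii scaled by dt*S/V (diagonal operator)
    Ap, Am     : i-direction split "Jacobians" times face area (scalars here)
    Bp, Bm     : j-direction split "Jacobians" times face area
    """
    diag = 1.0 + lamA + lamB           # scalar stand-in for (I + lam_A I + lam_B I)
    dU = dt * R / diag                 # initial step: divide the RHS by the diagonal
    for _ in range(kmax):
        nbr = np.zeros_like(dU)        # off-diagonal terms, lagged by one sweep
        nbr[1:, :]  += Ap * dU[:-1, :]  # + A+ dU_{i-1,j}
        nbr[:-1, :] -= Am * dU[1:, :]   # - A- dU_{i+1,j}
        nbr[:, 1:]  += Bp * dU[:, :-1]  # + B+ dU_{i,j-1}
        nbr[:, :-1] -= Bm * dU[:, 1:]   # - B- dU_{i,j+1}
        dU = (dt * R + dt_over_V * nbr) / diag   # whole sweep is data parallel
    return dU
```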
To derive the DPLR method, the i-direction off-diagonal terms of Eq. (1) are kept on the right-hand side, while the body-normal (j-direction) implicit terms are retained on the left, giving the block tridiagonal form

\hat B_{i,j}\, \delta U_{i,j+1} + \hat A_{i,j}\, \delta U_{i,j} - \hat C_{i,j}\, \delta U_{i,j-1} = -\hat D_{i,j}\, \delta U_{i+1,j} + \hat E_{i,j}\, \delta U_{i-1,j} + \Delta t\, R^n_{i,j}   (3)

where, for example,

\hat C_{i,j} = \frac{\Delta t}{V_{i,j}} \tilde B^+_{i,j-1/2} S_{i,j-1/2}

and the tilde denotes a Jacobian that includes the corresponding implicit viscous contribution.

In this way it is possible to solve Eq. (3) for all j points at each i location as a series of fully coupled block tridiagonal systems aligned in the body-normal direction. This was the approach used by MacCormack in the GSLR method.5 In this method the problem is solved via a series of alternating forward and backward sweeps through the flow field in the i direction, using the latest available data for the terms on the right-hand side. This method works well for two-dimensional flows on a serial or vector machine, but it is not straightforward to extend the method to three-dimensional flows. The obvious approach to the three-dimensional case is to continue to set up the problem as a series of block tridiagonal systems normal to the body and to sweep in both the axial and circumferential directions. However, this approach can lead to a nonphysical bias in the converged solution due to the biasing that is inherent in the Gauss–Seidel sweeping process.6 In addition, the data dependencies in the Gauss–Seidel sweeps would make the algorithm inefficient on a parallel machine. However, it is possible to eliminate both of these difficulties if we modify the GSLR method by replacing the Gauss–Seidel sweeps with a series of line relaxation steps, using a procedure similar to that outlined in the preceding section for the DP-LUR method. The resulting DPLR method is then described by the following scheme. First, the implicit terms on the right-hand side of Eq. (3) are neglected, and the resulting block tridiagonal system is factored and solved for δU^(0):

\hat B_{i,j}\, \delta U^{(0)}_{i,j+1} + \hat A_{i,j}\, \delta U^{(0)}_{i,j} - \hat C_{i,j}\, \delta U^{(0)}_{i,j-1} = \Delta t\, R^n_{i,j}

Then a series of k_max relaxation steps are made using, for k = 1, k_max,

\hat B_{i,j}\, \delta U^{(k)}_{i,j+1} + \hat A_{i,j}\, \delta U^{(k)}_{i,j} - \hat C_{i,j}\, \delta U^{(k)}_{i,j-1} = -\hat D_{i,j}\, \delta U^{(k-1)}_{i+1,j} + \hat E_{i,j}\, \delta U^{(k-1)}_{i-1,j} + \Delta t\, R^n_{i,j}

then

\delta U^n_{i,j} = \delta U^{(k_{max})}_{i,j}   (4)

Boundary conditions are implemented by folding the contribution of the boundary cells into the appropriate matrix in a manner identical to that used for GSLR.5 The DPLR method requires a single lower-upper (LU) factorization and k_max + 1 backsubstitutions per iteration. By using this approach, all of the data required for each relaxation step have already been computed during the previous step. Therefore, as long as the data are distributed on the processors in such a way that the body-normal direction is entirely local, the entire relaxation step can be performed simultaneously in parallel with no data dependencies. In addition, because the i-direction off-diagonal terms are all equally lagged by one relaxation step, the biasing problem no longer exists, and implementation of a three-dimensional version of the algorithm becomes straightforward. The DPLR approach will use significantly more memory than the DP-LUR methods because five Jacobian matrices (seven for three-dimensional flows) must now be computed and stored at each grid point, as compared with one for the full matrix DP-LUR method and none for the diagonal method. For perfect gas flows the DPLR method uses about twice as much memory as the DP-LUR method for the two-dimensional implementation and four times as much for the three-dimensional version. However, the DPLR method uses no more memory than the original GSLR method and should converge much faster than either DP-LUR approach due to the more exact formulation of the implicit operator.
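The following is a minimal scalar-model sketch of one DPLR iteration, Eqs. (3) and (4): each i line is LU-factored once with the Thomas algorithm and then backsubstituted k_max + 1 times, with the i-direction couplings lagged by one relaxation step. The block Jacobians of the real method are replaced by scalar coefficients, boundary closures are simply truncated, and all names are illustrative.

```python
import numpy as np

def factor_tridiag(a, b, c):
    """Thomas-style LU factorization of a tridiagonal line.
    In the notation of Eq. (3): a = sub-diagonal (-C_hat), b = diagonal
    (A_hat), c = super-diagonal (B_hat)."""
    n = b.size
    l = np.zeros(n)
    d = b.astype(float)
    for j in range(1, n):
        l[j] = a[j] / d[j - 1]
        d[j] -= l[j] * c[j - 1]
    return l, d

def solve_tridiag(l, d, c, rhs):
    """One backsubstitution with the stored factorization."""
    n = d.size
    y = rhs.astype(float)
    for j in range(1, n):                 # forward sweep
        y[j] -= l[j] * y[j - 1]
    x = np.empty(n)
    x[-1] = y[-1] / d[-1]
    for j in range(n - 2, -1, -1):        # backward sweep
        x[j] = (y[j] - c[j] * x[j + 1]) / d[j]
    return x

def dplr_iteration(R, a, b, c, D, E, dt, kmax=4):
    """Scalar-model sketch of one DPLR iteration.
    Each i line carries a tridiagonal system in j; the i-direction
    couplings D and E are lagged by one step, so all lines can be
    solved simultaneously in parallel."""
    ni, nj = R.shape
    factors = [factor_tridiag(a[i], b[i], c[i]) for i in range(ni)]  # one LU per line
    dU = np.empty_like(R, dtype=float)
    for i in range(ni):                   # k = 0: neglect the i-direction coupling
        l, d = factors[i]
        dU[i] = solve_tridiag(l, d, c[i], dt * R[i])
    for _ in range(kmax):                 # kmax relaxation steps, equally lagged
        old = dU.copy()
        for i in range(ni):
            rhs = dt * R[i].copy()
            if i + 1 < ni:
                rhs -= D[i] * old[i + 1]  # - D_hat dU_{i+1,j}
            if i > 0:
                rhs += E[i] * old[i - 1]  # + E_hat dU_{i-1,j}
            l, d = factors[i]
            dU[i] = solve_tridiag(l, d, c[i], rhs)
    # total cost: one factorization and kmax + 1 backsubstitutions
    return dU
```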
Results

The perfect gas implementation of the DPLR method has been tested on two- and three-dimensional geometries with an emphasis on evaluating the convergence properties and parallel performance of the new method. The primary test case for this paper is the Mach 15 perfect gas flow over a cylinder-wedge blunt body, with Reynolds numbers Re based on the freestream conditions and body length varying from 3 × 10^4 to 3 × 10^8. A sample 128 × 128 grid for this problem is shown in Fig. 1. The three-dimensional computations were performed on multiple planes of the same two-dimensional grids, which makes it easy to directly compare the convergence properties of the two- and three-dimensional implementations of the method.

Fig. 1 Sample 128 × 128 cylinder-wedge grid. Every fourth grid point is shown.

The boundary-layer resolution is measured with the wall variable

y^+ = \rho y u^* / \mu

where u* is the friction velocity, given in terms of the wall stress τ_w by u^* = \sqrt{\tau_w / \rho}. For a well-resolved boundary layer the mesh spacing is typically chosen so that y⁺ ≤ 1 for the first cell above the body surface. The grid for each case is then exponentially stretched from the wall to the outer boundary.
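As a worked example of this grid-spacing criterion, the sketch below computes the first-cell height implied by y⁺ = 1 and an exponentially stretched distribution of wall-normal spacings. The wall values and stretching ratio are made-up illustrative numbers, not the paper's grid generator or conditions.

```python
import numpy as np

def wall_spacing(rho_w, tau_w, mu_w, y_plus=1.0):
    """First-cell height from the y+ criterion: y = y+ * mu / (rho * u*)."""
    u_star = np.sqrt(tau_w / rho_w)          # friction velocity u* = sqrt(tau_w/rho)
    return y_plus * mu_w / (rho_w * u_star)

def stretched_spacings(dy_wall, ratio, n_cells):
    """Exponentially stretched wall-normal cell heights dy_j = dy_wall * ratio**j."""
    return dy_wall * ratio ** np.arange(n_cells)

# Illustrative (hypothetical) wall values:
dy1 = wall_spacing(rho_w=0.05, tau_w=20.0, mu_w=1.8e-5)
dy = stretched_spacings(dy1, ratio=1.15, n_cells=128)
print(f"first cell: {dy1:.3e} m, outer boundary reached at {dy.sum():.3e} m")
```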
Although the results presented here are based on modified first-order Steger–Warming flux vector splitting for the numerical flux evaluation,9 note that the derivation of the implicit algorithm is general and can be used with many flux evaluation methods. In fact, Liou and Van Leer10 have shown that the use of Steger–Warming splitting can be effective and robust for the implicit part of the problem, even when the fluxes are evaluated by a different method.

The DPLR method is in all instances more sensitive to the size of the implicit time step than the DP-LUR method, as expected, because the DPLR method involves a more exact representation of the problem, and therefore the size of the time step will have more physical meaning. In all cases presented here, the implicit time step was chosen to correspond to a Courant–Friedrichs–Lewy (CFL) number of 1 in the first iteration and was rapidly increased to its maximum stable value. The size of the maximum stable time step for each case was governed primarily by the freestream conditions, with little or no limitation due to the mesh spacing. Therefore the maximum CFL number varied considerably from case to case.
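A sketch of such a time-step schedule is given below. The geometric ramp and its parameters are assumptions for illustration, since the paper does not specify the ramp law, only the CFL = 1 starting point and a rapid increase to the maximum stable value.

```python
def cfl_schedule(it, cfl_max, cfl0=1.0, ramp_iters=50):
    """Ramp the CFL number geometrically from cfl0 to cfl_max over the
    first ramp_iters iterations (illustrative schedule, hypothetical)."""
    if it >= ramp_iters:
        return cfl_max
    growth = (cfl_max / cfl0) ** (1.0 / ramp_iters)
    return cfl0 * growth ** it

# The local time step then follows from the CFL number, e.g.
# dt_{i,j} = CFL * V_{i,j} / (spectral radius of the flux Jacobians * S).
print([round(cfl_schedule(it, cfl_max=1000.0), 1) for it in (0, 10, 25, 50)])
```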
Figure 2a shows the effect of the number of relaxation steps (k_max) on the convergence rate of the method on the two-dimensional cylinder-wedge geometry at a Reynolds number of 3 × 10^4 and y⁺ = 1. In Fig. 2a the explicit solution was obtained using a standard first-order Euler method with the maximum stable time step (CFL = 0.1). The line marked k_max = 0 corresponds to performing just the initial block tridiagonal solution along each i line, with no implicit coupling in the i direction. The k_max = 0 approach is similar to that proposed by Wang and Widhopf11 and later implemented in parallel by Taylor and Wang.12 Although the k_max = 0 approach offers a significant improvement over the explicit method, we see from Fig. 2a that the convergence rate of the method improves when the effect of the i-direction coupling terms is included (k_max > 0). The DPLR method shows a dependence on k_max similar to that of the DP-LUR methods, with the convergence rate steadily improving as k_max increases, up to k_max = 6. Figure 2b shows the effect of the number of relaxation steps on the cost of the method, evaluated as total CPU time on an eight-processor Cray T3E-900. Because the cost of evaluating the Jacobian matrices and setting up the LU factored block tridiagonal system is much greater than the cost of performing additional backsubstitutions, we see that increasing k_max also improves the cost effectiveness of the method, up to k_max = 4. As shown in Fig. 2b, values of k_max larger than 4 give little improvement in convergence and are not cost effective. Therefore, all of the results presented in this paper were run at k_max = 4.
Fig. 4 Convergence histories for the DPLR method on a high-Mach-number and low-supersonic-Mach-number flow: two-dimensional cylinder-wedge blunt body at Re = 3 × 10^4; 128 × 128 grid with y⁺ = 1 for each case.

Fig. 5 a) Convergence histories and b) CPU times on an eight-processor Cray T3E-900 for the DPLR method as compared with the two DP-LUR methods: two-dimensional cylinder-wedge blunt body at M∞ = 15 and Re = 3 × 10^4; 128 × 128 grid with y⁺ = 1.

Fig. 6 Convergence histories for the DPLR method showing influence of Reynolds number: two-dimensional cylinder-wedge blunt body at M∞ = 15; 128 × 128 grid with y⁺ = 1 for each case.

Fig. 7 CPU times on an eight-processor Cray T3E-900 required to achieve 10 orders of density error norm convergence for the DPLR and DP-LUR methods as a function of the Reynolds number: two-dimensional cylinder-wedge blunt body at M∞ = 15; 128 × 128 grid with y⁺ = 1 for each case.

From Fig. 5a we see that the DPLR method converges in far fewer iterations than the other methods, which require 2300 iterations for the full matrix method and 6200 iterations for the diagonal DP-LUR method. However, convergence in fewer iterations does not necessarily imply that the method will be more cost effective on a massively parallel machine. It is also necessary to know how much each iteration of the method will cost. Therefore, to examine the effectiveness of the new algorithm, Fig. 5b compares the cost of the methods, plotted as CPU time on an eight-processor Cray T3E-900. From Fig. 5b we see that the DPLR method also has a high parallel efficiency and is by far the most cost effective of the three methods. For this problem the line relaxation method reaches a 10-order-of-magnitude reduction in the density error norm in just 35 s, compared with 235 s for the full matrix method and 359 s for the diagonal method. This shows that the DPLR method can potentially be a powerful tool in the simulation of viscous flows.

Figure 6 shows the convergence histories of the DPLR method for three viscous flows with Reynolds numbers ranging from 3 × 10^4 to 3 × 10^8. The 128 × 128 grid for each case was chosen so that y⁺ = 1 for the first cell above the body, ensuring that the boundary layers for all cases are equally well resolved. Because the boundary-layer thickness decreases with increasing Reynolds number, the maximum cell aspect ratio (CAR) of the grid must increase as well to meet the y⁺ = 1 requirement. For the test cases in Fig. 6, the maximum CAR ranges from 35 for Re = 3 × 10^4 to about 125,000 for Re = 3 × 10^8. We can see that, although each of the cases behaves differently during the early phases of the flow evolution, all converge with the same terminal slope after the bow shock reaches its final location. In addition, there is essentially no increase in the number of iterations required to reach steady state as the Reynolds number (and therefore the CAR) is increased.

Figure 7 compares the convergence properties of the DPLR method with the DP-LUR methods on several viscous flows with Reynolds numbers ranging from 3 × 10^4 to 3 × 10^8. Once again, the 128 × 128 grid for each case was chosen so that y⁺ = 1 for the first cell above the body. As the Reynolds number is increased by four orders of magnitude, the amount of computer time required to achieve a 10-order-of-magnitude reduction in the density error norm remains constant for the DPLR method, whereas the time increases by a factor of 6 for the full matrix method and 7 for the diagonal method. This shows that the more exact implicit operator used in the DPLR method eliminates the convergence degradation on high-cell-aspect-ratio grids.

Figure 8 compares the viscous and inviscid implementations of the DPLR method for two of the cases in Fig. 6. The grid for each case was chosen to satisfy the y⁺ = 1 condition for the viscous solution. The inviscid solutions were then obtained on the same computational grids. We see that the convergence rates for the viscous and inviscid versions of the method are almost identical. This is in direct contrast to the DP-LUR method, which always requires more iterations to converge the viscous solution. This again shows the benefit of moving the body-normal terms back to the left-hand side of the equation and coupling them directly to the diagonal.

Parallel Performance

The DPLR algorithm is inherently data parallel by design and requires no asynchronous communication or computation. This makes the code readily portable to a variety of parallel architectures with only minor modifications because it is relatively easy to modify a data-parallel code to run effectively on a message-passing machine, whereas the reverse is not necessarily possible.
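A sketch of the data distribution this design implies is shown below: the grid is split into contiguous blocks in the i direction only, so that every body-normal (j) line, and hence every block tridiagonal solve, stays on a single processor. The helper name is hypothetical.

```python
def i_partition(ni, nprocs, rank):
    """Contiguous block of i indices owned by `rank`, splitting only the
    i direction so every body-normal line stays on-processor."""
    base, rem = divmod(ni, nprocs)
    start = rank * base + min(rank, rem)
    size = base + (1 if rank < rem else 0)
    return start, start + size   # half-open range [start, stop)

# A 128 x 128 grid on 8 processors: each rank owns 16 complete i lines,
# so the line solves along j never require interprocessor communication.
print([i_partition(128, 8, r) for r in range(8)])
```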
The DPLR method was implemented and tested on two different massively parallel architectures. First, a message-passing version using the Message-Passing Interface (MPI) standard for interprocessor communication was implemented on the 272-processor Cray T3E-900 located at Network Computing Services, Inc. (formerly the Minnesota Supercomputer Center). Each processor of this machine has 128 Mbytes of memory and a theoretical peak performance of 900 Mflops. In addition, a data-parallel version was implemented on the 896-processor Thinking Machines CM-5 located at the University of Minnesota Army High Performance Computing Research Center. Each processor of this machine has 32 Mbytes of memory and four vector units, yielding a theoretical peak performance of 128 Mflops per processor.

Fig. 8 Convergence histories for the viscous and inviscid implementations of the DPLR method: two-dimensional cylinder-wedge blunt body at M∞ = 15; 128 × 128 grid with y⁺ = 1 for each case.

Fig. 9 Parallel efficiency for the two-dimensional viscous DPLR method on the T3E-900: 512 × 512 computational grid used.

The data-parallel implementation of the new DPLR method should retain the perfect scalability and high parallel efficiency that characterized the DP-LUR method2 because the algorithm design is very similar. However, it is difficult to show this on the CM-5 because it is a vector parallel machine, and thus large vector lengths are required to ensure good performance. In addition, it has been shown that only the number of points in the dimensions that are spread across all of the nodes should be used when evaluating the vector length.2 To solve the block tridiagonal system required by the DPLR method without interprocessor communication, it is necessary that all j points corresponding to a particular i location be entirely on processor, whereas with the DP-LUR method both the i and j directions can be spread across the available processors. Therefore, when the 128 × 128 test case presented in this paper is run on a 64-processor CM-5 with four vector units per processor, the vector length will be 64 for the DP-LUR method but will actually be less than 1 for the DPLR method. This means that the DPLR method is not using the vector hardware on the processors. Therefore we would expect the two-dimensional DPLR method to be very slow on the CM-5 due to the small vector length, even though it may have a high parallel efficiency. This is indeed the case for the implementation tested. For the 128 × 128 test case presented here, the DPLR method runs at only 1.39 Mflops per processor, as compared with 13.2 for the diagonal DP-LUR method and 20.7 for the full matrix method. This would not be a problem on a machine without vector hardware. Thus, although the DPLR method should be efficient on many data-parallel architectures, it is difficult to directly calculate the parallel efficiency of the method on the CM-5.

The performance of the message-passing implementation of the method on the T3E is easier to evaluate because there is no vector hardware on this machine. In this implementation, the data are distributed across the processors by breaking the problem in the i direction. Communication latency is masked by using nonblocking sends and receives in MPI. The parallel speedup curve for the two-dimensional DPLR algorithm on the T3E-900 is presented in Fig. 9. We see that the method has almost perfect speedup, up to the maximum number of processors tested. In fact, on 32 processors the speedup is 31.3, which corresponds to a parallel efficiency of 0.98.

The single-node performance of the two-dimensional implementation of the DPLR method on the T3E-900 is about 75 Mflops per node, which is only 8.3% of the peak theoretical performance of the machine. This number seems quite low, but it is comparable to other published results. The NAS 2 parallel benchmark results offer the best comparison because these codes have been written to simulate actual computational fluid dynamics applications, and the individual machine vendors have not been allowed to perform assembler-level optimizations to the source code. Unfortunately, NAS 2 benchmark results have not yet been published for the T3E. However, results from the T3D for the block tridiagonal (BT) benchmark show a performance of about 10 Mflops per processor.13 In addition, results for the NAS 1 benchmarks, which have been published for both the T3D and T3E-600 (600 Mflops peak performance), show that the sustained speed on the T3E-600 is typically about 3.3 times that on the T3D.14 This would result in a performance of about 33 Mflops per processor on the T3E-600 for the BT benchmark or about 50 Mflops per processor on the T3E-900, assuming perfect scalability. Therefore, the 75 Mflops per processor obtained for the DPLR method seems reasonable. However, it is possible that further optimizations can be made to the source code that would increase the performance.
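A minimal sketch of the nonblocking neighbor exchange described above is given below, written with the modern mpi4py binding purely for illustration (the original T3E code predates it, and all names here are hypothetical): each rank posts sends of its boundary i lines and receives from its neighbors, allowing interior work to proceed before waiting on the requests.

```python
import numpy as np
from mpi4py import MPI

def exchange_i_halo(dU_local, comm):
    """Nonblocking exchange of the lagged i-direction neighbor lines,
    sketching how communication latency can be masked with Isend/Irecv."""
    rank, size = comm.Get_rank(), comm.Get_size()
    left, right = rank - 1, rank + 1
    reqs = []
    recv_lo = np.empty_like(dU_local[0])    # dU line arriving from the i-1 neighbor
    recv_hi = np.empty_like(dU_local[-1])   # dU line arriving from the i+1 neighbor
    if left >= 0:
        reqs.append(comm.Isend(np.ascontiguousarray(dU_local[0]), dest=left))
        reqs.append(comm.Irecv(recv_lo, source=left))
    if right < size:
        reqs.append(comm.Isend(np.ascontiguousarray(dU_local[-1]), dest=right))
        reqs.append(comm.Irecv(recv_hi, source=right))
    # ... interior relaxation work for the next step can proceed here ...
    MPI.Request.Waitall(reqs)               # halo data needed before boundary lines
    return recv_lo, recv_hi
```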
Conclusions

The GSLR method has been modified to make it amenable to the solution of the Navier–Stokes equations on massively parallel supercomputers. The resulting DPLR method replaces the Gauss–Seidel sweeps of the original with a series of line relaxation steps. In this manner all of the data dependencies in the original method are removed, and each relaxation step becomes almost perfectly data parallel. Because of its design, the new method can be easily implemented in either the data-parallel or message-passing programming styles. The method also retains the good convergence properties of the original GSLR method, and in fact with four relaxation steps it converges in fewer iterations for the test cases considered. In addition, the relaxation steps eliminate the solution bias problem exhibited by the three-dimensional GSLR method, and thus the DPLR method can easily be extended to the solution of three-dimensional flows.

The DPLR method uses considerably more memory than either of the previously developed DP-LUR methods but demonstrates a dramatic improvement in cost effectiveness, reaching steady state in about 15% of the time on a Cray T3E. In addition, both DP-LUR methods showed a degradation of the convergence rate when high-Reynolds-number flows were simulated. However, the DPLR method is more strongly coupled in the body-normal direction and thus has good convergence properties at all Reynolds numbers.

The new method has been implemented using message passing on the Cray T3E-900 and shows nearly perfect speedup, with a parallel efficiency of 98% even when 32 processors are used. The single-node performance of the method is about 75 Mflops per processor, which is only 8.3% of the peak theoretical performance of the machine. However, this value is comparable to data obtained for the NAS 2 parallel benchmarks and does not detract from the high parallel efficiency of the method.

In short, the high parallel efficiency and good convergence characteristics of the DPLR method make it attractive for the solution of very large compressible flow problems.

Acknowledgments

The authors were supported by the NASA Langley Research Center under Contract NAG-1-1498 and the Army Research Office under Grant DAAH04-93-G-0089. This work was also supported in part by the U.S. Army High Performance Computing Research Center under the auspices of the Department of the Army, Army Research Laboratory Cooperative Agreement DAAH04-95-2-0003/Contract DAAH04-95-C-0008, the content of which does not necessarily reflect the position or the policy of the government, and no official endorsement should be inferred. Computer time on the Cray T3E was provided by the Minnesota Supercomputer Institute.

References

1 Simon, H. D. (ed.), Parallel Computational Fluid Dynamics: Implementations and Results, MIT Press, Cambridge, MA, 1992.

2 Candler, G. V., Wright, M. J., and McDonald, J. D., "Data-Parallel Lower-Upper Relaxation Method for Reacting Flows," AIAA Journal, Vol. 32, No. 12, 1994, pp. 2380-2386.

3 Wright, M. J., Candler, G. V., and Prampolini, M., "Data-Parallel Lower-Upper Relaxation Method for the Navier-Stokes Equations," AIAA Journal, Vol. 34, No. 7, 1996, pp. 1371-1377.

4 Yoon, S., and Jameson, A., "A Lower-Upper Symmetric Gauss-Seidel Method for the Euler and Navier-Stokes Equations," AIAA Journal, Vol. 26, No. 9, 1988, pp. 1025, 1026.

5 MacCormack, R. W., "Current Status of the Numerical Solutions of the Navier-Stokes Equations," AIAA Paper 85-0032, Jan. 1985.

6 MacCormack, R. W., "Solution of the Navier-Stokes Equations in Three Dimensions," AIAA Paper 90-1520, June 1990.

7 Tysinger, T., and Caughey, D., "Implicit Multigrid Algorithm for the Navier-Stokes Equations," AIAA Paper 91-0242, Jan. 1991.

8 Yoon, S., and Kwak, D., "Multigrid Convergence of an Implicit Symmetric Relaxation Scheme," AIAA Journal, Vol. 32, No. 5, 1994, pp. 950-955.

9 MacCormack, R. W., and Candler, G. V., "The Solution of the Navier-Stokes Equations Using Gauss-Seidel Line Relaxation," Computers and Fluids, Vol. 17, No. 1, 1989, pp. 135-150.

10 Liou, M. S., and Van Leer, B., "Choice of Implicit and Explicit Operators for the Upwind Differencing Method," AIAA Paper 88-0624, Jan. 1988.

11 Wang, J. C., and Widhopf, G. F., "An Efficient Finite Volume TVD Scheme for Steady State Solutions of the 3-D Compressible Euler/Navier-Stokes Equations," AIAA Paper 90-1523, June 1990.

12 Taylor, S., and Wang, J. C., "Launch Vehicle Simulations Using a Concurrent Implicit Navier-Stokes Solver," AIAA Paper 95-0223, Jan. 1995.

13 Saphir, W., Woo, A., and Yarrow, M., "The NAS Parallel Benchmarks 2.1 Results," Numerical Aerospace Simulation Facility, NAS TR NAS-96-010, NASA Ames Research Center, Moffett Field, CA, Aug. 1996.

14 Saini, S., and Bailey, D. H., "NAS Parallel Benchmark (Version 1.0) Results 11-96," Numerical Aerospace Simulation Facility, NAS TR NAS-96-018, NASA Ames Research Center, Moffett Field, CA, Nov. 1996.

D. S. McRae
Associate Editor