$\sum_{k=0}^{L} \frac{N_k}{n_p}$. For a regular grid in which the coarse grid has size 1, $L = \log N / 3$ and $N_k = 2^{-3k} N$. Thus the overall coarsening complexity is the same as that of a single-level coarsening. Using a similar argument, the cost for balancing and meshing is $O\!\left(\frac{N}{n_p}\log\frac{N}{n_p} + n_p \log n_p\right) L$. (Notice the $L$ factor in the communication cost.)
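To spell out the geometric-series step (with $k = 0$ denoting the finest level, as above):

$$\sum_{k=0}^{L}\frac{N_k}{n_p} \;=\; \frac{N}{n_p}\sum_{k=0}^{L}2^{-3k} \;\le\; \frac{N}{n_p}\sum_{k=0}^{\infty}8^{-k} \;=\; \frac{8}{7}\,\frac{N}{n_p} \;=\; O\!\left(\frac{N}{n_p}\right).$$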
This estimate does not include the communication costs related to the mapping between different partitionings of the same octree. This operation is carried out using an MPI_Alltoallv() call, which has an $O(mp)$ complexity, with $m$ being the message length [12]. We have not been able to derive the cost of this operation. Notice that in most cases the repartitioned octree has significant overlaps with the original partition, since the octree is always Morton-ordered and the Block Partitioning algorithm uses this ordering. Also, our empirical results show that the associated costs are not too high, even in the case in which we repartition at each level and a large number of CPUs have zero elements.
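To make the mapping step concrete, the following is a minimal sketch, not the Dendro implementation (the function and variable names are ours), of how values attached to a Morton-ordered sequence of octants can be moved from the current partition to a new one with a single MPI_Alltoallv() call. Because both partitions respect the same Morton ordering, each rank can compute its send and receive counts from the global offsets of the two partitions alone.

    #include <mpi.h>
    #include <algorithm>
    #include <numeric>
    #include <vector>

    // Move one value per octant from the current (old) partition to a new one.
    // newCounts[r] is the number of octants rank r owns after repartitioning;
    // the global Morton order of the octants is the same in both partitions.
    std::vector<double> repartition(const std::vector<double>& vals,
                                    const std::vector<int>& newCounts,
                                    MPI_Comm comm) {
      int rank, size;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &size);

      // Gather the current local counts to recover the old global offsets.
      std::vector<int> oldCounts(size);
      int myOld = static_cast<int>(vals.size());
      MPI_Allgather(&myOld, 1, MPI_INT, oldCounts.data(), 1, MPI_INT, comm);

      std::vector<long long> oldStart(size + 1, 0), newStart(size + 1, 0);
      for (int r = 0; r < size; ++r) {
        oldStart[r + 1] = oldStart[r] + oldCounts[r];
        newStart[r + 1] = newStart[r] + newCounts[r];
      }

      // The block I send to rank r is the overlap of my old global index range
      // with r's new range (and vice versa for what I receive).
      std::vector<int> sendCnt(size, 0), recvCnt(size, 0);
      for (int r = 0; r < size; ++r) {
        long long lo = std::max(oldStart[rank], newStart[r]);
        long long hi = std::min(oldStart[rank + 1], newStart[r + 1]);
        sendCnt[r] = static_cast<int>(std::max(0LL, hi - lo));
        lo = std::max(newStart[rank], oldStart[r]);
        hi = std::min(newStart[rank + 1], oldStart[r + 1]);
        recvCnt[r] = static_cast<int>(std::max(0LL, hi - lo));
      }
      std::vector<int> sendOff(size, 0), recvOff(size, 0);
      std::partial_sum(sendCnt.begin(), sendCnt.end() - 1, sendOff.begin() + 1);
      std::partial_sum(recvCnt.begin(), recvCnt.end() - 1, recvOff.begin() + 1);

      std::vector<double> out(newCounts[rank]);
      MPI_Alltoallv(vals.data(), sendCnt.data(), sendOff.data(), MPI_DOUBLE,
                    out.data(), recvCnt.data(), recvOff.data(), MPI_DOUBLE, comm);
      return out;
    }

In the common case noted above, where the old and new partitions overlap substantially, most of the data stays on the owning rank and the Alltoallv exchange is correspondingly light.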
3 Numerical Experiments
In this section, we report the results from isogranular (weak scaling) and fixed-size (strong scaling) scalability experiments on up to 4K CPUs on NCSA's Intel 64 cluster and PSC's Cray XT3. The NCSA machine has 8 CPUs/node and the PSC machine has 2 CPUs/node. Isogranular scalability analysis was performed by roughly keeping the problem size per CPU fixed while increasing the number of CPUs. Fixed-size scalability was performed by keeping the problem size constant and increasing the number of CPUs.
Implementation details. Here, we discuss features that we plan to incorporate into Dendro in the near future. Currently, the number of multigrid levels is user-specified. However, the problem size for the coarsest grid is not known a priori.
Figure 3: The LEFT FIGURE shows isogranular scalability with a grain size of approximately 0.25M elements per CPU ($n_p$) on the finest level. The difference between the minimum and maximum levels of the octants on the finest grid is 5. A V-cycle using 4 pre-smoothing steps and 4 post-smoothing steps per level was used as a preconditioner to CG. The damped Jacobi method with a damping factor of 0.857 was used as the smoother at all multigrid levels. A relative tolerance of $10^{-10}$ in the 2-norm of the residual was used. 6 CG iterations were required in each case to solve the problem to the specified tolerance. The finest-level octrees for the multiple-CPU cases were generated using regular refinements from the finest octree for the single-CPU case. SuperLU_DIST was used to solve the coarsest grid problem. This isogranular scalability experiment was performed on NCSA's Intel 64 cluster.
The RIGHT FIGURE gives the results for the variable-coefficient scalar Laplacian operator (left column) and for the constant-coefficient linear elastostatic problem (right column). For the elasticity problem, the actual timings are given in the table below and 1/10 of that value is plotted. The coefficient of the scalar Laplacian operator was chosen to be $(1 + 10^6(\cos^2(2x) + \cos^2(2y) + \cos^2(2z)))$. The elastostatic problem uses homogeneous Dirichlet boundary conditions and the scalar problem uses homogeneous Neumann boundary conditions. A Poisson's ratio of 0.4 was used for this problem. The solver options are the same as described for the left figure. Only the timings for the solve phase are reported. The timings for the setup phase are not reported since the results are similar to those of the left figure. This isogranular scalability experiment was performed on PSC's Cray XT3.
Hence, choosing an appropriate number of multigrid levels to better utilize all the available CPUs is not obvious. We are currently working on developing heuristics to dynamically adjust the number of multigrid levels. Imposing a partition computed on the coarse grid onto a fine grid may lead to load imbalance. However, not doing so results in duplicating meshes for the intermediate levels and also increases communication during intergrid transfer operations. Hence, a balance between the two needs to be struck, and we are currently working on heuristics to address this issue. In Figure 2, we depict the duplicate octrees that are used to map scalar and vector functions between different octree partitions. These partitions can be defined in three different ways: (1) using the same MPI communicator
and a uniform load across all CPUs; (2) using the same MPI communicator but partitioning the load across a smaller number of CPUs and letting the other CPUs idle; and (3) creating a new communicator (see the sketch below). In this set of experiments, we use the first way, i.e., we partition across the total number of CPUs at all levels. Also, we repartition at each level. In this way, the results reported in the isogranular and fixed-size scalability experiments represent the worst-case performance for our method.
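To illustrate option (3), the sketch below (generic MPI, not the interface used in our experiments) builds a smaller communicator for a coarse level that only needs activeNp CPUs; the remaining ranks receive MPI_COMM_NULL and stay idle on that level. As stated above, the experiments reported in this paper use option (1) instead.

    #include <mpi.h>

    // Create a communicator containing only the first activeNp ranks of fineComm.
    // Ranks outside that range pass MPI_UNDEFINED and get MPI_COMM_NULL back,
    // i.e. they simply idle on the coarse level.
    MPI_Comm coarseLevelComm(MPI_Comm fineComm, int activeNp) {
      int rank;
      MPI_Comm_rank(fineComm, &rank);
      int color = (rank < activeNp) ? 0 : MPI_UNDEFINED;
      MPI_Comm coarseComm = MPI_COMM_NULL;
      MPI_Comm_split(fineComm, color, rank, &coarseComm);
      return coarseComm;
    }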
Isogranular scalability. In Figure 3, we report results for the constant-coefficient (left subfigure) and variable-coefficient (right subfigure) scalar elliptic problems with homogeneous Neumann boundary conditions on meshes whose mesh size follows a Gaussian distribution. The coarsest level uses only 1 CPU in all of the experiments, unless otherwise specified. We use SuperLU_DIST [17] as an exact solver on the coarsest grid. Four multigrid levels were used on 1 CPU, and the number of multigrid levels was incremented by 1 every time the number of CPUs increased by a factor of 8. In the left subfigure, for each CPU count, the left column gives the time for the setup phase and the right column gives the time for the solve phase.
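For reference, the damped Jacobi smoother mentioned in the caption of Figure 3 performs sweeps of the form $x \leftarrow x + \omega D^{-1}(b - Ax)$ with $\omega = 0.857$. The following is a minimal matrix-free sketch (our illustrative code, not the Dendro smoother), with the operator supplied as a callback in the spirit of the matrix-free FE MatVecs used throughout:

    #include <cstddef>
    #include <functional>
    #include <vector>

    // One damped Jacobi sweep: x <- x + omega * D^{-1} (b - A x),
    // where diagA holds the diagonal of A and omega is the damping factor.
    void dampedJacobiSweep(
        const std::function<void(const std::vector<double>&, std::vector<double>&)>& applyA,
        const std::vector<double>& diagA,
        const std::vector<double>& b,
        std::vector<double>& x,
        double omega = 0.857) {
      std::vector<double> Ax(x.size());
      applyA(x, Ax);  // e.g. a matrix-free finite element MatVec
      for (std::size_t i = 0; i < x.size(); ++i) {
        x[i] += omega * (b[i] - Ax[i]) / diagA[i];
      }
    }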
Data shown in Figure 4 (all timings in seconds):

n_p   GMG Setup   GMG Solve   Falgout Setup   Falgout Solve   CLJP Setup   CLJP Solve   Elements
1     1.21        9.23        7.86            5.54            6.09         2.56         52K
2     2.42        16.6        18.97           11.77           14.12        6.11         138K
4     2.11        14.19       18.9            7.23            13.41        3.96         240K
8     2.05        14.76       30              9.25            20.08        4.77         0.5M
16    5.19        21.07       41.56           11.35           24.77        6.38         995K
32    5.67        22.49       59.29           13.91           34.54        6.92         2M
64    6.63        19.2        92.61           14.5            52.05        8.12         3.97M
128   6.84        20.33       146.23          18.56           80.09        9.4          8M
256   10.7        22.23       248.5           23.3            151.64       13.9         16M
Figure 4: A variable-coefficient (contrast of $10^7$) elliptic problem with homogeneous Neumann boundary conditions was solved on meshes constructed on Gaussian distributions. A 7-level multiplicative multigrid cycle was used as a preconditioner to CG for the GMG scheme. SuperLU_DIST was used to solve the coarsest grid problem in the GMG scheme.
The setup cost for the GMG scheme includes the time for constructing the mesh for all the levels (including the finest), constructing and balancing all the coarser levels, and setting up the intergrid transfer operators by performing one dummy Restriction MatVec at each level. Here, we repartition at each level. The time to create the work vectors for the MG scheme and the time to build the coarsest grid matrix are included in Extras-1. Scatter refers to the process of transferring the vectors between two different partitions of the same multigrid level during the intergrid transfer operations. This is required whenever the coarse and fine grids do not share the same partition. The time spent in applying the Jacobi preconditioner, computing the inner products within CG, and solving the coarsest grid problems using LU are all grouped together as Extras-2. In Figure 4, we report results from a comparison with the algebraic multigrid scheme. In order to minimize communication costs, the coarsest level was distributed on a maximum of 32 CPUs in all experiments.
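For context on what the setup phase prepares, the multiplicative V-cycle applied as the CG preconditioner has the following generic structure; this is a textbook sketch with our own types and callbacks, not the Dendro classes. The Restriction/Prolongation and Scatter costs reported in the figures are incurred inside the intergrid transfer callables.

    #include <cstddef>
    #include <functional>
    #include <vector>

    // One multigrid level: operator, smoother, and intergrid transfers as callables.
    struct Level {
      std::function<void(const std::vector<double>&, std::vector<double>&)> applyA;      // y = A x
      std::function<void(const std::vector<double>&, std::vector<double>&, int)> smooth;  // nu sweeps on x
      std::function<std::vector<double>(const std::vector<double>&)> restrictRes;  // fine residual -> coarse rhs
      std::function<std::vector<double>(const std::vector<double>&)> prolong;      // coarse correction -> fine
      std::size_t n;  // number of local unknowns on this level
    };

    // Recursive V-cycle; level 0 is the coarsest and is solved directly
    // (SuperLU_DIST in the experiments reported here).
    void vCycle(std::vector<Level>& levels, std::size_t k,
                const std::vector<double>& b, std::vector<double>& x,
                const std::function<void(const std::vector<double>&, std::vector<double>&)>& coarseSolve,
                int nuPre = 4, int nuPost = 4) {
      if (k == 0) {
        coarseSolve(b, x);
        return;
      }
      Level& lev = levels[k];
      lev.smooth(b, x, nuPre);                      // pre-smoothing (damped Jacobi)
      std::vector<double> Ax(lev.n), r(lev.n);
      lev.applyA(x, Ax);
      for (std::size_t i = 0; i < lev.n; ++i) r[i] = b[i] - Ax[i];
      std::vector<double> rc = lev.restrictRes(r);  // restriction (plus Scatter if partitions differ)
      std::vector<double> ec(rc.size(), 0.0);
      vCycle(levels, k - 1, rc, ec, coarseSolve, nuPre, nuPost);  // coarse-grid correction
      std::vector<double> ef = lev.prolong(ec);     // prolongation (plus Scatter)
      for (std::size_t i = 0; i < lev.n; ++i) x[i] += ef[i];
      lev.smooth(b, x, nuPost);                     // post-smoothing
    }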
For BoomerAMG, we experimented with two different coarsening schemes: Falgout coarsening and CLJP coarsening. The results from both experiments are reported. [14] reports that Falgout coarsening works best for structured grids and CLJP coarsening is better suited for unstructured grids. Since octree meshes lie in between structured and generic unstructured grids, we compare our results using both schemes. Both the GMG and AMG schemes used 4 pre-smoothing steps and 4 post-smoothing steps per level with the damped Jacobi smoother. A relative tolerance of $10^{-10}$ in the 2-norm of the residual was used in all the experiments.
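For reproducibility, BoomerAMG with either coarsening scheme can be selected through PETSc's HYPRE interface. The following is a hedged sketch of such a setup, written against the PETSc >= 3.5 calling conventions, and is not the exact driver used to produce Figure 4; the coarsening scheme and tolerance are picked up from run-time options.

    #include <petscksp.h>

    // Solve A x = b with CG preconditioned by BoomerAMG. Relevant run-time options:
    //   -pc_hypre_boomeramg_coarsen_type Falgout   (or CLJP)
    //   -ksp_rtol 1e-10
    PetscErrorCode solveWithBoomerAMG(Mat A, Vec b, Vec x) {
      PetscErrorCode ierr;
      KSP ksp;
      PC  pc;
      ierr = KSPCreate(PETSC_COMM_WORLD, &ksp); CHKERRQ(ierr);
      ierr = KSPSetOperators(ksp, A, A); CHKERRQ(ierr);  // PETSc >= 3.5 signature
      ierr = KSPSetType(ksp, KSPCG); CHKERRQ(ierr);
      ierr = KSPGetPC(ksp, &pc); CHKERRQ(ierr);
      ierr = PCSetType(pc, PCHYPRE); CHKERRQ(ierr);
      ierr = PCHYPRESetType(pc, "boomeramg"); CHKERRQ(ierr);
      ierr = KSPSetFromOptions(ksp); CHKERRQ(ierr);      // reads the options above
      ierr = KSPSolve(ksp, b, x); CHKERRQ(ierr);
      ierr = KSPDestroy(&ksp); CHKERRQ(ierr);
      return 0;
    }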
The GMG scheme took about 12 CG iterations, the Falgout scheme took about 7 iterations, and the CLJP scheme also took about 7 iterations. Each node of the cluster has 8 CPUs, which share 8GB of RAM. However, only 1 CPU per node was utilized in the above experiments; the AMG scheme required a lot of memory, and this arrangement allowed the entire memory on a node to be available to a single process.⁹
The setup time reported for the AMG schemes includes the time for meshing the finest grid and constructing the finest-grid FE matrix, both of which are quite small (about 1.35 seconds for meshing and 22.93 seconds for building the fine-grid matrix, even on 256 CPUs) compared to the time to set up the rest of the AMG scheme. Although GMG seems to be performing well, more difficult problems with multiple discontinuous coefficients can cause it to fail. Our method is not robust in the presence of discontinuous coefficients, in contrast to AMG.
Fixed-size scalability. Next, we report fixed-size scalability results for the Poisson and elasticity solvers.¹⁰ In Figure 5, we report fixed-size scalability for two different grain sizes for the Poisson problem. In Figure 6, we report fixed-size scalability results for the elasticity and variable-coefficient Poisson cases. Overall, we observe excessive costs for the coarsening, balancing, and meshing when the grain size is relatively small. As in the isogranular case, we repartition at each multigrid level and we use all available CPUs. However, notice that the MatVecs, and the solver overall, scale quite well.
AMR. Finally, in Table 2 and Figure 7, we report the performance of the balancing, meshing, and solution-transfer algorithms used to solve the linear parabolic problem described in Section 2.5.
⁹We did not attempt to optimize either our code or AMG, and we used the default options. It is possible that one can reduce the AMG cost by using appropriate options.
¹⁰The elastostatics equation is given by $\nabla\cdot(\mu(x)\nabla u(x)) + \nabla\left((\lambda(x) + \mu(x))\,\nabla\cdot u(x)\right) = b(x)$, where $\mu$ and $\lambda$ are scalar fields, and $u$ and $b$ are 3D vector functions.
Figure 5: LEFT FIGURE: Scalability for a fixed fine-grid problem size of 64.4M elements. The problem is the same as described in Figure 3. 8 multigrid levels were used. 64 CPUs were used for the coarsest grid in all cases in this experiment. The minimum and maximum levels of the octants on the finest grid are 3 and 18, respectively. 5 iterations were required to solve the problem to a relative tolerance of $10^{-10}$ in the 2-norm of the residual.
RIGHT FIGURE: Scalability for a fixed fine-grid problem size of 15.3M elements. The setup cost and the cost to solve the constant-coefficient Poisson problem are reported. 6 multigrid levels were used. These fixed-size scalability experiments were performed on NCSA's Intel 64 cluster.
Table 2: Isogranular scalability with a grain size of approximately 38K elements per CPU ($n_p$). A constant-coefficient linear parabolic problem with homogeneous Neumann boundary conditions was solved on octree meshes. We used the analytical solution of a traveling wave of the form $\exp(-10^4 (y - 0.5 - 0.1t - 0.1\sin(2x))^2)$ to construct the octrees. We used a time step of 0.05 and solved for 10 time steps. We used the second-order Crank-Nicolson scheme for time stepping and CG with a Jacobi preconditioner to solve the linear system of equations at every time step. We used a stopping criterion of $\|r\|/\|r_0\| < 10^{-6}$. We report the number of elements (Elements), the number of CG iterations (iter), the solve time (Solve), and the total time to balance (Balancing) and mesh (Meshing) the octrees generated at each time step. We also report the time to transfer the solution between the meshes (Transfer) at two different time steps. The total number of elements and the number of CG iterations are approximately constant over all time steps. This isogranular scalability experiment was performed on NCSA's Intel 64 cluster.
n_p   Elements   iter   Solve   Balancing   Meshing   Transfer
8 300K 110 78.1 6.24 6.0 0.76
64 2M 204 143.9 17.0 7.86 0.82
512 14M 384 356 132.9 72.3 3.32
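As a point of reference for the time discretization summarized in the caption of Table 2, with mass matrix $M$, stiffness matrix $K$, load vector $f^n$, and time step $\delta t$ (the notation here is ours), the Crank-Nicolson update solves

$$\left(M + \tfrac{\delta t}{2} K\right) u^{n+1} \;=\; \left(M - \tfrac{\delta t}{2} K\right) u^{n} + \tfrac{\delta t}{2}\left(f^{n} + f^{n+1}\right),$$

and it is this linear system that is solved with Jacobi-preconditioned CG at every time step.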
Data shown in Figure 6 (all timings in seconds):

Variable-coefficient scalar Laplacian:
n_p   Restriction/Prolongation   Scatter   FE MatVecs   Extras
64    3.52                       0.65      26.83        2.36
128   1.8                        0.38      13.42        3.24
256   1.29                       0.23      7.06         3.68
512   0.88                       0.21      3.95         4.98

Constant-coefficient linear elastostatics:
n_p   Restriction/Prolongation   Scatter   FE MatVecs   Extras
64    13.43                      4.97      348.9        4.29
128   7.62                       3.05      176.4        5.3
256   4.21                       1.72      89.07        6.5
512   2.81                       0.78      46.16        11.9
Figure 6: Scalability for a fixed fine-grid problem with 15.3M elements. The cost to solve the variable-coefficient Poisson problem and the constant-coefficient linear elasticity problem are reported. The left column for each CPU count gives the results for the variable-coefficient scalar Laplacian operator and the right column gives the results for the constant-coefficient linear elastostatic problem. For the elasticity problem, the actual timings are given in the table but are scaled by an order of magnitude (1/10) for the plot. The setup cost is the same as in Figure 5. 6 multigrid levels were used. This fixed-size scalability experiment was performed on NCSA's Intel 64 cluster.
Figure 7: An example of adaptive mesh refinement on a traveling wave. This is a synthetic solution which we use to illustrate the adaptive octrees. The coarsening and refinement are based on the approximation error between the discretized and the exact function.
4 Conclusions
We have described a parallel geometric multigrid method for solving partial differential equations using finite elements on octrees. We also described an AMR scheme that can be used to transfer vector and scalar functions between arbitrary octrees.
We automatically generate a sequence of coarse meshes from an arbitrary 2:1 balanced fine octree. We do not impose any restrictions on the number of meshes in this sequence or on the size of the coarsest mesh. We do not require the meshes to be aligned, and hence the different meshes can be partitioned independently. Although bottom-up tree construction and meshing are harder than top-down approaches, we believe they are more practical since the fine mesh can be defined naturally based on the physics of the problem.
In the final submission, we will include additional results with significant improvements to the meshing and balancing parts in the multigrid case, so that the load is determined by a minimum grain size.
We have demonstrated reasonable scalability of our implementation and can solve problems with billions of elements on thousands of CPUs. We tested the worst-case scenarios for our code, in which we repartition at each level and use all available CPUs at all levels. We have demonstrated that our implementation works well even on problems with largely varying material properties. We have compared our geometric multigrid implementation with a state-of-the-art algebraic multigrid implementation in a standard off-the-shelf package (HYPRE). Overall, we showed that the proposed algorithm is quite efficient, although significant work remains to improve the partitioning algorithm and the overall robustness of the scheme in the presence of discontinuous coefficients.
References
[1] M. F. Adams, H. Bayraktar, T. Keaveny, and P. Papadopoulos, Ultrascalable implicit finite element analyses in solid mechanics with over a half a billion degrees of freedom, in Proceedings of SC2004, The SCxy Conference series in high performance networking and computing, Pittsburgh, Pennsylvania, 2004, ACM/IEEE.
[2] S. Balay, K. Buschelman, W. D. Gropp, D. Kaushik, M. Knepley, L. C. McInnes, B. F. Smith, and H. Zhang, PETSc home page, 2001. http://www.mcs.anl.gov/petsc.
[3] G. T. Balls, S. B. Baden, and P. Colella, SCALLOP: A highly scalable parallel Poisson solver in three dimensions, in SC '03: Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, Washington, DC, USA, 2003, IEEE Computer Society, p. 23.
[4] R. Becker and M. Braack, Multigrid techniques for finite elements on locally refined meshes, Numerical Linear Algebra with Applications, 7 (2000), pp. 363-379.
[5] B. Bergen, F. Hülsemann, and U. Rüde, Is 1.7 × 10^10 Unknowns the Largest Finite Element System that Can Be Solved Today?, in SC '05: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, Washington, DC, USA, 2005, IEEE Computer Society, p. 5.
[6] M. W. Bern, D. Eppstein, and S.-H. Teng, Parallel construction of quadtrees and quality triangulations, International Journal of Computational Geometry and Applications, 9 (1999), pp. 517-532.
[7] M. Bittencourt and R. Feijoo, Non-nested multigrid methods in finite element linear structural analysis, in Virtual Proceedings of the 8th Copper Mountain Conference on Multigrid Methods (MGNET), 1997.
[8] P. M. Campbell, K. D. Devine, J. E. Flaherty, L. G. Gervasio, and J. D. Teresco, Dynamic octree load balancing using space-filling curves, Tech. Report CS-03-01, Williams College Department of Computer Science, 2003.
[9] E. Chow, R. D. Falgout, J. J. Hu, R. S. Tuminaro, and U. M. Yang, A survey of parallelization techniques for multigrid solvers, in Parallel Processing for Scientific Computing, M. A. Heroux, P. Raghavan, and H. D. Simon, eds., Cambridge University Press, 2006, pp. 179-195.
[10] R. Falgout, An introduction to algebraic multigrid, Computing in Science and Engineering, Special Issue on Multigrid Computing, 8 (2006), pp. 24-33.
[11] R. Falgout, J. Jones, and U. Yang, The design and implementation of hypre, a library of parallel high performance preconditioners, in Numerical Solution of Partial Differential Equations on Parallel Computers, A. Bruaset and A. Tveito, eds., vol. 51, Springer-Verlag, 2006, pp. 267-294.
[12] A. Grama, A. Gupta, G. Karypis, and V. Kumar, An Introduction to Parallel Computing: Design and Analysis of Algorithms, Addison Wesley, second ed., 2003.
[13] M. Griebel and G. Zumbusch, Parallel multigrid in an adaptive PDE solver based on hashing, in Parallel Computing: Fundamentals, Applications and New Directions, Proceedings of the Conference ParCo97, 19-22 September 1997, Bonn, Germany, E. H. D'Hollander, G. R. Joubert, F. J. Peters, and U. Trottenberg, eds., vol. 12, Amsterdam, 1998, Elsevier, North-Holland, pp. 589-600.
[14] V. E. Henson and U. M. Yang, BoomerAMG: a parallel algebraic multigrid solver and preconditioner, Appl. Numer. Math., 41 (2002), pp. 155-177.
[15] F. Hülsemann, M. Kowarschik, M. Mohr, and U. Rüde, Parallel geometric multigrid, in Numerical Solution of Partial Differential Equations on Parallel Computers, A. M. Bruaset and A. Tveito, eds., Birkhäuser, 2006, pp. 165-208.
[16] A. Jones and P. Jimack, An adaptive multigrid tool for elliptic and parabolic systems, International Journal for Numerical Methods in Fluids, 47 (2005), pp. 1123-1128.
[17] X. S. Li and J. W. Demmel, SuperLU_DIST: A Scalable Distributed-Memory Sparse Direct Solver for Unsymmetric Linear Systems, ACM Transactions on Mathematical Software, 29 (2003), pp. 110-140.
[18] D. J. Mavriplis, M. J. Aftosmis, and M. Berger, High resolution aerospace applications using the NASA Columbia Supercomputer, in SC '05: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, Washington, DC, USA, 2005, IEEE Computer Society, p. 61.
[19] R. Sampath and G. Biros, A parallel geometric multigrid method for finite elements on octree meshes, tech. report, University of Pennsylvania, 2008. Submitted for publication.
[20] R. Sampath, H. Sundar, S. S. Adavani, I. Lashuk, and G. Biros, Dendro home page, 2008. http://www.seas.upenn.edu/csela/dendro.
[21] H. Sundar, R. Sampath, C. Davatzikos, and G. Biros, Low-constant parallel algorithms for finite element simulations using linear octrees, in Proceedings of SC2007, The SCxy Conference series in high performance networking and computing, Reno, Nevada, 2007, ACM/IEEE.
[22] H. Sundar, R. S. Sampath, and G. Biros, Bottom-up construction and 2:1 balance refinement of linear octrees in parallel, SIAM Journal on Scientific Computing, (2008). To appear.
[23] U. Trottenberg, C. W. Oosterlee, and A. Schüller, Multigrid, Academic Press Inc., San Diego, CA, 2001.
[24] T. Tu, D. R. O'Hallaron, and O. Ghattas, Scalable parallel octree meshing for terascale applications, in SC '05: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, Washington, DC, USA, 2005, IEEE Computer Society, p. 4.
[25] T. Tu, H. Yu, L. Ramirez-Guzman, J. Bielak, O. Ghattas, K.-L. Ma, and D. R. O'Hallaron, From mesh generation to scientific visualization: an end-to-end approach to parallel supercomputing, in SC '06: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, New York, NY, USA, 2006, ACM Press, p. 91.