Dendro: Parallel algorithms for multigrid and AMR methods on 2:1 balanced octrees

Rahul S. Sampath, Santi S. Adavani, Hari Sundar, Ilya Lashuk, and George Biros
University of Pennsylvania
Abstract
In this article, we present Dendro, a suite of parallel algorithms for the discretization and solution of partial differential equations that require the discretization of second-order elliptic operators. Dendro uses trilinear finite element discretizations constructed using octrees. Dendro, which is built on top of PETSc (Argonne National Laboratory), comprises four main modules: a bottom-up octree generation and 2:1 balancing module, a meshing module, a geometric multiplicative multigrid module, and a module for adaptive mesh refinement (AMR). The first two components constitute prior work that has been published elsewhere. Here, we focus on the multigrid and AMR modules. The key features of Dendro are coarsening/refinement, inter-octree transfers of scalar and vector fields, and parallel partitioning of multilevel octree forests. We describe an algorithm for constructing the coarser multigrid levels starting with an arbitrary 2:1 balanced fine-grid octree discretization. We also describe matrix-free implementations of the discretized finite element operators and the intergrid transfer operations. The current implementation of Dendro is most appropriate for problems with smooth variable coefficients.
We present scalability results for a Poisson problem, a linear elastostatics problem, and a time-dependent heat equation. We use the first two equations to illustrate the effectiveness of the multigrid solver, and the third to illustrate the performance of the AMR components. We present results on up to 4096 CPUs on the Cray XT3 (BigBen) at the Pittsburgh Supercomputing Center (PSC) and the Intel 64 system (Abe) at the National Center for Supercomputing Applications (NCSA).
1 Introduction
Second-order elliptic operators, like the Laplacian operator, are ubiquitous in computational science and engineering. They effectively model diffusive transport, viscous dissipation in fluids, and stress-strain relationships in solid mechanics. The need for high-resolution simulation of such PDEs requires scalable discretization and solution schemes.
Octree meshes strike a balance between the simplicity of structured meshes and the adaptivity of generic unstructured meshes [24, 25]. On multithousand-CPU platforms, the resulting octree-based discretized operators must be inverted using iterative solvers. Iterative solvers for finite element based discretizations must address the ill-conditioning of second-order operators, which deteriorates with increasing problem size. Multigrid algorithms (geometric and algebraic) provide a powerful mathematical framework that allows the construction of solvers with optimal algorithmic complexity [23]. Numerous sequential and parallel implementations for both geometric and algebraic multigrid methods are available.
Related work on parallel multigrid. Excellent surveys on parallel algorithms for multigrid can be found in [9] and [15]. Here, we give a brief (and incomplete!) overview of distributed-memory, message-passing based parallel multigrid algorithms. (We do not attempt to review the extensive literature on methods for adaptive mesh refinement; our work is restricted to simple intergrid transfers between different octrees.) The two basic approaches are algebraic multigrid and geometric multigrid. The advantage of algebraic multigrid is that it can be used in a black-box fashion with unstructured grids and a great variety of operators (e.g., operators with discontinuous coefficients). The disadvantage of existing implementations is their relatively high setup cost. The main advantage of geometric multigrid approaches is that it is easier to devise coarsening and intergrid transfers with low overhead. Their disadvantage is that they cannot be used in a black-box fashion, since they depend on both the PDE and the scheme used for its discretization.
Our interest is in developing multigrid methods for highly non-uniform meshes. Currently, the most scalable methods are based on graph-based methods for coarsening, namely maximal matchings. Examples of scalable codes that use such graph-based coarsening include [1] and [18]. Another powerful code for algebraic multigrid is Hypre [11, 10]. Hypre has been used extensively to solve problems with a variety of operators on structured and unstructured grids. The associated constants for constructing the mesh and performing the calculations, however, are quite large. The high costs related to partitioning, setting up, and accessing generic unstructured grids have motivated us to design geometric multigrid for octree-based data structures.
Many geometric multigrid algorithms for nonuniform discretizations have been proposed in the past. In [7], a sequential geometric multigrid algorithm was proposed for finite elements on non-nested unstructured meshes, but with intergrid transfer operations that are difficult to parallelize. A sequential multigrid scheme for finite element simulations on quadtree meshes was described in [16]. In addition to the 2:1 balance constraint, a number of safety layers of octants were added at each level to support the intergrid transfer operations. Projections were required at each level to preserve the continuity of the solution, which is otherwise not guaranteed using their non-conforming discretizations. In [3], the authors describe a Poisson solver with excellent scalability. Strictly speaking, it is not a bona fide multigrid solver, as there is no iteration between the multiple levels. Instead, it is based on local independent solves on nested grids followed by global corrections to restore smoothness. It is similar to block-structured grids and its extension to arbitrarily graded grids is not immediately obvious. One of the largest calculations was reported in [5], using conforming discretizations and geometric multigrid solvers on semi-structured meshes. This approach is highly scalable for nearly structured meshes and for constant-coefficient PDEs. However, the scheme is not efficient when deviating from that structure.
Multigrid methods on octrees have been proposed and developed for sequential and modestly parallel adaptive finite element implementations [4, 13]. A characteristic of octree meshes is that they contain hanging nodes. In [21], we presented a strategy to tackle these hanging nodes and build conforming, trilinear finite element discretizations on these meshes. This algorithm scaled up to four billion octants on 4096 CPUs on a Cray XT3 at the Pittsburgh Supercomputing Center. We also showed that the cost of applying the Laplacian using this framework is comparable to that of applying it using a direct-indexing regular grid discretization with the same number of elements. To our knowledge, there is
no work on parallel, octree-based, geometric multigrid solvers for finite element discretizations. Here, we build on our previous work and present a novel parallel geometric multigrid scheme.
Contributions. Our goal is to minimize storage requirements, obtain low setup costs, and achieve end-to-end parallel scalability. (By "end-to-end" we collectively refer to the construction of octree-based meshes for all multigrid levels, restriction/prolongation, smoothing, the coarse solve, and the CG drivers.) Our contributions are the following:
- We propose a parallel global coarsening algorithm to construct a hierarchy of nested 2:1 balanced octrees and their corresponding meshes starting with an arbitrary 2:1 balanced fine octree. We do not impose any restrictions on the number of meshes in this sequence or on the size of the coarsest mesh. Constructing coarser meshes from a fine mesh is harder than iterative global refinement of a coarse mesh because we must ensure that the coarser grids satisfy the 2:1 balance constraint as well; this is automatically satisfied in the case of global refinement, but not in the case of global coarsening. However, this bottom-up approach is more natural for certain applications (e.g., evolution PDEs), in which only the fine mesh is available.
- Transferring information between successive multigrid levels is a challenging task on unstructured meshes, and especially so in parallel, since the coarse and fine grids need not be aligned and near-neighbor searches are required. Here, we describe a matrix-free algorithm for the intergrid transfer operators that uses the special properties of an octree data structure.
- We have integrated the above components in a parallel matrix-free implementation of a geometric multiplicative multigrid method for finite elements on octree meshes. Our MPI-based implementation, Dendro, has scaled to billions of elements on thousands of CPUs, even for problems with large contrasts in the material properties. Dendro is an open source code that can be downloaded from [20]. Dendro is tightly integrated with PETSc [2].
In the following, we will use the term MatVec to denote a matrix-vector multiplication; we will use octants or nodes to refer to octree nodes and vertices to refer to element vertices. We will use CPUs to refer to message-passing processes. We will use hanging nodes/vertices to refer to octree nodes with special properties, which we explain later. There is a one-to-one correspondence between nodes, elements, and vertices.
Limitations of Dendro. Here, we briefly summarize the limitations of the proposed methodology. More details can be found in [19]. (1) The restriction and prolongation operators and the coarse grid approximations are not efficient for problems with discontinuous coefficients. (2) The method does not work for strongly indefinite elliptic operators (e.g., high-frequency Helmholtz problems). (3) Currently Dendro supports only Dirichlet and Neumann conditions. (4) Dendro is limited to second-order accurate discretizations of elliptic operators on the unit cube. Problems with complex geometries are not directly supported, although Dendro can be used with penalty approaches to allow the solution of such problems. (5) Only the intergrid transfers have been implemented in the AMR module (no error estimators). (6) Load balancing is not addressed fully; there are issues due to the 2:1 balance constraint and coarsening that can prevent perfect speedups. (7) The interlevel partitioning scheme does not use load information across multiple levels. We are currently working to address items (5) and (6), and we are optimistic about obtaining improved results on up to 9K CPUs on NCSA's Intel cluster.

Figure 1: Quadtree meshes for three successive multigrid levels: (a) level k-2, (b) level k-1, and (c) level k.
Organization of the paper. In Section 2, we describe a matrix-free implementation of the multigrid method. We describe our framework for handling hanging nodes and how we use it to implement the MatVecs for the finite element matrices and the restriction/prolongation matrices. In Section 3, we present the results from fixed-size and isogranular scalability experiments. We also compare our implementation with BoomerAMG [14], an algebraic multigrid implementation available in Hypre. In Section 4, we present the conclusions from this study and discuss ongoing and future work.
2 Parallel Geometric Multigrid
The prototypical equation for second-order elliptic operators is the Poisson problem

-div(ε(x) ∇u(x)) = f(x) in Ω,   u(x) = 0 on ∂Ω.   (1)

Here, Ω is the unit cube in three dimensions, ∂Ω is the boundary of the cube, x is a point in 3D space, u(x) is the unknown scalar function, ε(x) is a given scalar function (commonly referred to as the coefficient of (1)), and f(x) is a known function. If ε(x) is smooth, (1) can be efficiently solved with Dendro. If ε is discontinuous, we can still use Dendro but the convergence rates may be suboptimal. (The current Dendro implementation supports isotropic problems only; we are working on extending Dendro to anisotropic operators in which ε(x) is a tensor.) We write A_k u_k = f_k to denote the discretized finite-dimensional system of linear equations; k denotes the multigrid level.
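For reference, the weak formulation underlying the trilinear finite element discretization of (1), and the resulting linear system on level k, can be written as follows (standard textbook form, stated here for completeness):

\begin{aligned}
&\text{Find } u \in H^1_0(\Omega) \text{ such that } a(u,v) = (f,v) \;\; \forall\, v \in H^1_0(\Omega),
\quad a(u,v) = \int_\Omega \varepsilon(x)\,\nabla u \cdot \nabla v \, dx, \quad (f,v) = \int_\Omega f\, v \, dx;\\
&u_k = \sum_j u_{k,j}\,\phi_{k,j} \;\;\Longrightarrow\;\; A_k u_k = f_k,
\qquad (A_k)_{ij} = a(\phi_{k,j}, \phi_{k,i}), \qquad (f_k)_i = (f, \phi_{k,i}),
\end{aligned}

where {φ_{k,j}} denotes the trilinear nodal basis on level k.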
The Dendro interface. Let us describe the Dendro interface for (1). (The current implementation of Dendro supports more general scalar equations, vector equations, and time-dependent parabolic and hyperbolic equations.) The input is ε, f, the boundary conditions, and a list of target points at which we want to evaluate the solution. The output is the solution u and, optionally, its gradient at the target points. Here, for simplicity, we describe only the representation of ε: it is approximated as the sum of a constant plus a function that is defined by specifying its value on a set of points. f is represented in a similar manner. We assume that the lists of points that are used to specify ε and f are arbitrarily distributed across CPUs. Dendro uses these points to define the fine-level octree and subsequently the overall multigrid hierarchy. Next, we give details about the main algorithmic components of this procedure.
Outline of the Dendro algorithms. The main algorithm in Dendro is the geometric multigrid solver. In Dendro, we use a hierarchy of octrees (see Figure 1), which we construct as part of the geometric multigrid V-cycle algorithm. The V-cycle algorithm consists of six main steps: (1) pre-smoothing: u_k = S_k(u_k, f_k, A_k); (2) residual computation: r_k = f_k - A_k u_k; (3) restriction: r_{k-1} = R_k r_k; (4) recursion: e_{k-1} = Multigrid(A_{k-1}, r_{k-1}); (5) prolongation: e_k = P_k e_{k-1}; and (6) post-smoothing: u_k = S_k(u_k, f_k, A_k). When k reaches a minimum level, which we term the exact solve level, we solve for e_k exactly using a single-level solver (e.g., a parallel sparse direct factorization method).
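The recursive structure of the V-cycle can be summarized by the following sketch. It is a minimal serial outline under the assumption that the per-level operators are available as callable objects; the Level struct and the function names are hypothetical stand-ins for Dendro's matrix-free PETSc objects, and, as is standard, the prolonged coarse-grid correction is added to the current iterate before post-smoothing.

#include <functional>
#include <vector>

struct Level {
    std::function<std::vector<double>(const std::vector<double>&)> applyA;        // v -> A_k v
    std::function<void(std::vector<double>&, const std::vector<double>&)> smooth;  // u_k <- S_k(u_k, f_k, A_k)
    std::function<std::vector<double>(const std::vector<double>&)> restrictTo;     // r_k -> r_{k-1}
    std::function<std::vector<double>(const std::vector<double>&)> prolongFrom;    // e_{k-1} -> e_k
};

// One multiplicative V-cycle over levels[0..k]; level 0 is the exact solve level.
void vCycle(std::vector<Level>& levels, int k, std::vector<double>& u,
            const std::vector<double>& f,
            const std::function<std::vector<double>(const std::vector<double>&)>& exactSolve)
{
    if (k == 0) { u = exactSolve(f); return; }               // exact solve (e.g., a sparse direct solver)
    Level& L = levels[k];
    L.smooth(u, f);                                           // (1) pre-smoothing
    std::vector<double> Au = L.applyA(u);
    std::vector<double> r(f.size());
    for (std::size_t i = 0; i < f.size(); ++i) r[i] = f[i] - Au[i];   // (2) residual
    std::vector<double> rc = L.restrictTo(r);                 // (3) restriction
    std::vector<double> ec(rc.size(), 0.0);
    vCycle(levels, k - 1, ec, rc, exactSolve);                // (4) recursion
    std::vector<double> e = L.prolongFrom(ec);                // (5) prolongation
    for (std::size_t i = 0; i < u.size(); ++i) u[i] += e[i];  // add the coarse-grid correction
    L.smooth(u, f);                                           // (6) post-smoothing
}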
Octree forest construction:
1. Bottom-up construct the fine octree. 2. Partition the fine octree. 3. Balance the fine octree. 4. Bottom-up coarsen the octree. 5. Top-down meshing.

Bottom-up coarsening:
1. Replace leaves with their parent. 2. Balance and Morton-order partition. 3. Repeat.

Top-down meshing:
1. Mesh and partition the coarse grid. 2. Prolong the partitioning to the next grid. 3. Check load balancing and repartition. 4. Construct the prolongation operator.

Table 1: Summary of the main algorithmic components of the multigrid construction used in Dendro [19].
We use standard Jacobi smoothing. Due to space limitations, we do not discuss more advanced smoothers; let us just say that any read-only-ghost type smoother can be implemented with no additional communication. For the exact solve we use SuperLU [17]. The operators A_k are defined using a standard FEM discretization (so, in the general case, the coarse grid operators do not satisfy the so-called Galerkin condition). We say more about this and the coarse-grid representation in the next section.
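For concreteness, one damped Jacobi sweep in matrix-free form is sketched below. The inverse diagonal of A_k is assumed to be stored as a vector (typical for matrix-free FEM codes), and the damping factor 0.857 used in the experiments of Section 3 is shown as a default; the function names are illustrative.

#include <vector>

// One damped Jacobi sweep: u <- u + omega * D^{-1} (f - A u).
// applyA(u, Au) must perform the matrix-free MatVec Au = A*u (ghost exchange happens inside it);
// invDiag holds the reciprocals of the diagonal entries of A.
template <class MatVec>
void dampedJacobiSweep(std::vector<double>& u, const std::vector<double>& f,
                       const std::vector<double>& invDiag, MatVec applyA,
                       double omega = 0.857)
{
    std::vector<double> Au(u.size(), 0.0);
    applyA(u, Au);
    for (std::size_t i = 0; i < u.size(); ++i)
        u[i] += omega * invDiag[i] * (f[i] - Au[i]);   // pointwise ("read-only-ghost") update
}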
To construct the solver we start with the user input. Given the points for ε and f, we merge them and use them to construct the fine-level octree bottom-up using the methods we described in [21]. The main challenge is the construction of the hierarchy of meshes and the corresponding prolongation and restriction operators. The basic steps in constructing the multigrid hierarchy are the following:
1. Given the input points, construct and partition the octree at the fine level.
2. Given the total number of levels, construct and partition the coarser octrees.
3. Mesh the octrees.
4. Construct the restriction and prolongation operators.
The basic algorithms that we use to implement these steps are outlined in Table 1. In the following, we briefly summarize the octree construction and then explain the rest of the algorithm.
Smoothing and coarse grid operator. One of the problems with geometric multigrid methods is that their performance deteriorates with increasing contrast in the material properties. In matrix-free geometric multigrid implementations, the coarse grid operator is not constructed using the Galerkin condition; instead, the discretization of the problem on the coarser grid is used as the coarse grid operator. We can easily show [19] that these two formulations are equivalent provided the same bilinear form, a(u, v), is used on both the coarse and fine levels. This poses no difficulty for constant-coefficient problems or problems in which the material property is described in closed form. However, sometimes the material property is only defined on each fine grid element. In that case, the bilinear form actually depends on the discretization used. If the coarser grids must use the same bilinear form, the coarser grid MatVecs must be performed by looping over the underlying finest grid elements, using the material property defined on each fine grid element. This would make the coarse grid MatVecs quite expensive. A cheaper alternative is to define the material properties of the coarser grid elements as the average of those of the underlying fine grid elements. This amounts to using a different bilinear form for each multigrid level and hence is a clear deviation from the theory. Consequently, the convergence of the stand-alone multigrid solver deteriorates with increasing contrast in the material properties. However, the multigrid scheme is known to be a good preconditioner for the conjugate gradient method [23]. We have conducted numerical experiments that demonstrate this for the Poisson problem.
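A minimal sketch of the coefficient averaging just described: the material property of a coarse element is the average of the values on the fine elements it covers (either itself or its eight children). The data layout is an illustrative assumption.

#include <vector>

// fineEps: per-element coefficients on the fine level.
// coarseToFine[c]: indices of the fine elements covered by coarse element c
// (a single index if the octant is unchanged between levels, or its eight children).
std::vector<double> averageCoefficients(const std::vector<double>& fineEps,
                                        const std::vector<std::vector<int> >& coarseToFine)
{
    std::vector<double> coarseEps(coarseToFine.size(), 0.0);
    for (std::size_t c = 0; c < coarseToFine.size(); ++c) {
        double sum = 0.0;
        for (int fe : coarseToFine[c]) sum += fineEps[fe];
        coarseEps[c] = sum / coarseToFine[c].size();   // average over the underlying fine elements
    }
    return coarseEps;
}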
2.1 Octree construction
Each octree node has a maximum of eight children. A node with no children is called a leaf and a node with one or more
children is called an interior node. Complete octrees are trees in which every interior node has exactly eight children. The
only node with no parent is the root and all other nodes have exactly one parent. Nodes that have the same parent are
called siblings.
At each level, we use a linear octree representation; each octant is identified by a Morton-encoding based locational code that encodes the position and level of the octant in the tree [8]. In order to construct the Morton encoding, the maximum permissible depth, D_max, of the tree is specified a priori. Next, we assume that the domain is represented by an imaginary uniform grid of 2^{D_max} indivisible cells in each dimension. Each cell is identified by an integer triplet representing its x, y and z coordinates, respectively. Any octant in the domain can be uniquely identified by specifying one of its vertices, also known as its anchor, and its level in the tree. By convention, the anchor of an octant is its lower left corner facing the reader. The Morton encoding for any octant is then derived by interleaving the binary representations (D_max bits each) of the three coordinates of the octant's anchor, and then appending the binary representation (((log_2 D_max) + 1) bits) of the octant's level to this sequence of bits [6].
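A sketch of the Morton key construction just described: the D_max anchor bits of x, y and z are interleaved and the level bits are appended. Packing the key into a single 64-bit word is an illustrative choice (it requires 3 D_max plus the level bits to fit in 64 bits), not Dendro's actual storage format.

#include <cstdint>

// Build a Morton key for an octant from its anchor (x, y, z) and its level.
// Coordinates are integers in [0, 2^Dmax); assumes 3*Dmax plus the level bits fit in 64 bits.
uint64_t mortonKey(uint32_t x, uint32_t y, uint32_t z, uint32_t level, uint32_t Dmax)
{
    uint64_t key = 0;
    for (int b = static_cast<int>(Dmax) - 1; b >= 0; --b) {   // interleave, most significant bit first
        key = (key << 1) | ((x >> b) & 1u);
        key = (key << 1) | ((y >> b) & 1u);
        key = (key << 1) | ((z >> b) & 1u);
    }
    uint32_t levelBits = 1;                                   // floor(log2(Dmax)) + 1 bits for the level
    while ((1u << levelBits) <= Dmax) ++levelBits;
    return (key << levelBits) | level;                        // append the level
}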
In many applications involving octrees, it is desirable to impose a restriction on the relative sizes of adjacent octants, known as the 2:1 balance constraint (not to be confused with load balancing): no leaf at level l may share a corner, edge, or face with another leaf at a level greater than l + 1.
Nodes that exist at the center of a face of an octant are called face-hanging nodes and those that are located at the midpoint of an edge are called edge-hanging nodes. The 2:1 balance constraint ensures that there is at most one hanging node on any edge or face. In [22], we presented a parallel algorithm for linear octree construction and 2:1 balancing, which is a key component of Dendro. This balancing algorithm has an O(N log N) work complexity and an O((N/n_p) log(N/n_p) + n_p log n_p) parallel time complexity, with N being the problem size and n_p the number of CPUs.
To partition the tree for balancing and meshing, we introduced Block Partition, a heuristic algorithm for a single-level octree. We first partition the octree using Morton ordering and then build a smaller complete linear octree whose leaves we term blocks. This auxiliary octree encapsulates and compresses the local spatial distribution of the input octants. Moreover, every octant in the input is either a block in itself or a descendant of some block. We then compute the number of original octants that lie within each block and assign this number as the block's weight. We use a weighted Morton partitioning algorithm to partition these blocks. Finally, the computed partition is projected onto the input octants. As a result, the octants in the input and the blocks containing them are sent to the same CPU.
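The weighted Morton partitioning step amounts to cutting the Morton-sorted list of blocks into n_p contiguous chunks of roughly equal total weight. A simplified serial sketch of that splitting rule is given below; the distributed version operates on global prefix sums of the block weights.

#include <vector>

// Assign each Morton-ordered block to a CPU so that each CPU receives roughly total/np weight.
// Returns, for each block, the owning CPU rank; blocks stay contiguous in Morton order.
std::vector<int> weightedMortonPartition(const std::vector<long>& blockWeights, int np)
{
    long total = 0;
    for (long w : blockWeights) total += w;
    std::vector<int> owner(blockWeights.size(), 0);
    long prefix = 0;
    for (std::size_t i = 0; i < blockWeights.size(); ++i) {
        int rank = static_cast<int>((prefix * np) / (total > 0 ? total : 1));
        owner[i] = (rank < np) ? rank : np - 1;
        prefix += blockWeights[i];
    }
    return owner;
}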
2.2 Global coarsening
Starting with the finest octree, we iteratively construct a hierarchy of complete, balanced, linear octrees such that every octant in the k-th octree (coarse) is either present in the (k+1)-th octree (fine) as well, or all of its eight children are present instead (Figure 1).
We construct the k-th octree from the (k+1)-th octree by replacing every set of eight siblings by their parent. This is an operation with O(N) work complexity, where N is the number of leaves in the (k+1)-th octree. It is easy to parallelize and has an O(N/n_p) parallel time complexity, where n_p is the number of CPUs. (When we discuss communication costs, we assume a hypercube network topology with O(n_p) bandwidth.) The main parallel operations are two circular shifts: one clockwise and the other counterclockwise.
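Because a linear octree is stored as a Morton-sorted list of leaves, the eight siblings of a complete family are contiguous in that list, so the coarsening pass can detect them by comparing parents. A serial sketch under these assumptions is shown below (the Octant type and MAX_DEPTH are illustrative; the circular shifts mentioned above handle sibling groups that straddle CPU boundaries):

#include <cstdint>
#include <vector>

struct Octant {
    uint32_t x, y, z, level;                     // anchor coordinates and level
    static const uint32_t MAX_DEPTH = 30;        // hypothetical maximum tree depth
    Octant parent() const {                      // zero out the anchor bits below the parent's level
        uint32_t mask = ~0u << (MAX_DEPTH - (level - 1));
        return Octant{x & mask, y & mask, z & mask, level - 1};
    }
    bool operator==(const Octant& o) const {
        return x == o.x && y == o.y && z == o.z && level == o.level;
    }
};

// Replace every complete set of eight siblings in a Morton-sorted octree by their parent;
// all other octants are kept as they are (the result may be only 4:1 balanced).
std::vector<Octant> coarsenByOneLevel(const std::vector<Octant>& fine)
{
    std::vector<Octant> coarse;
    std::size_t i = 0;
    while (i < fine.size()) {
        if (i + 7 < fine.size() && fine[i].level > 0) {
            Octant p = fine[i].parent();
            bool allSiblings = true;
            for (std::size_t j = i; j < i + 8; ++j)
                if (!(fine[j].level == fine[i].level && fine[j].parent() == p)) { allSiblings = false; break; }
            if (allSiblings) { coarse.push_back(p); i += 8; continue; }
        }
        coarse.push_back(fine[i]);               // octant carried over unchanged to the coarser level
        ++i;
    }
    return coarse;
}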
However, the operation described above may produce 4:1 balanced octrees instead of 2:1 balanced octrees. (The input is 2:1 balanced and we coarsen by at most one level in this operation; hence, the operation introduces at most one additional level of imbalance, resulting in 4:1 balanced octrees.) Although there is only one level of imbalance that we need to correct, the imbalance can still affect octants that are not in its immediate vicinity. This is known as the ripple effect. Even with just one level of imbalance, a ripple can still propagate across many CPUs.
The sequence of octrees constructed as described above has the property that non-hanging nodes in any octree remain non-hanging in all finer octrees as well. Hanging nodes in any octree either become non-hanging on a finer octree or remain hanging on the finer octrees too. In addition, an octree can have new hanging as well as non-hanging nodes that are not present in any of the coarser octrees.
2.3 Meshing
By meshing, we refer to the construction of the octree-node to mesh-vertex and element mappings, the partitioning and construction of the ghost and local vertex lists, and the construction of the scatter/gather operators for the near-neighbor communications that are required during the MatVecs.
In [21], we developed algorithms and efficient data structures that support trilinear conforming finite element calculations on linear octrees. We use these data structures in the present work too. The key features of the algorithm are that (1) we use a hanging-node management scheme that allows MatVecs in a single tree traversal, as opposed to the multiple tree traversals required by the scheme in [24], and (2) we reduce the memory overhead by storing the octree in a compressed form that requires only one byte per octant (used to store the level of the octant). The element-to-mesh-vertex mappings can also be compressed, at the modest expense of uncompressing them on the fly while looping over the elements to perform the finite element MatVecs. The resulting shape functions on the octree meshes are trilinear, are equal to 1 at the vertex at which they are rooted, vanish at all other non-hanging vertices in the octree, and their support can spread over more than 8 elements.
Here, we extend the meshing algorithm to the case of a forest of octrees. The main difference from the single-level algorithm is that the partitionings of the different levels are coupled. We give details below.
2.3.1 Multilevel meshing and the prolongation operators
To implement the intergrid transfer operations, we need to find all the non-hanging fine grid nodes that lie within the support of each coarse grid shape function. Given the hierarchy of octree meshes constructed as described above, these operations can be implemented quite efficiently. The restriction matrix is the transpose of the prolongation matrix. We do not construct these matrices explicitly; instead, we implement a matrix-free scheme using MatVecs as described below. The MatVecs for the restriction and prolongation operators are very similar. In both MatVecs, we loop over the coarse and fine grid octants simultaneously. For each coarse grid octant, the underlying fine grid octant can either be the same as itself or one of its eight children. We identify these cases and handle them separately. The main operation within the loop is selecting the coarse grid shape functions that do not vanish within the current coarse grid octant and evaluating them at the non-hanging fine grid nodes that lie within this coarse grid octant. These form the entries of the restriction and prolongation matrices.
To be able to do this operation efficiently in parallel, we need the coarse and fine grid partitions to be aligned. This means that the following two conditions must be satisfied. (1) If an octant exists in both the coarse and fine grids, then the same CPU must own this octant on both meshes. (2) If an octant's children exist in the fine grid, then the same CPU must own this octant on the coarse mesh and all of its 8 children on the fine mesh.
To satisfy these conditions, we first compute the partition on the coarse grid and then impose it on the finer grid. In general, it might not be possible or desirable to use the same partition for all of the levels. For example, the coarser levels might be too sparse to be distributed across all the CPUs, or using the same partition for all the levels could lead to a large load imbalance across the CPUs. Hence, we allow some levels to be partitioned differently than others. (It is also possible that some CPUs are idle on the coarse grids, while no CPU is idle on the finer grids.) When a transition in the partitions is required, we duplicate the octree in question and let one of the duplicates share the same partition as that of its immediate finer level and the other one share the same partition as that of its immediate coarser level. We refer to one of these duplicates as the pseudo mesh (Figure 2). The pseudo mesh is only used to support intergrid transfer operations and smoothing is not performed on it. On these levels, the intergrid transfer operations include an additional step, referred to as Scatter, which involves re-distributing the values from one partition to another.

Figure 2: (a) A V-cycle where the meshes at all levels share the same partition and (b) a V-cycle where not all meshes share the same partition. Some meshes do share the same partition, and whenever the partition changes a pseudo mesh is added. The pseudo mesh is only used to support intergrid transfer operations and smoothing is not performed on this mesh.
One of the challenges with the MatVecs for the intergrid transfer operations is that, as we loop over the octants, we must also keep track of the pairs of coarse and fine grid nodes that were visited already. In order to implement these MatVecs efficiently, we make use of the following observations. (1) Every non-hanging fine grid node is shared by at most eight fine grid elements, excluding the elements whose hanging vertices are mapped to this node. (2) Each of these eight fine grid elements will be visited only once within the restriction and prolongation MatVecs. (3) Since we loop over the coarse and fine elements simultaneously, there is a coarse octant associated with each of these eight fine octants. These coarse octants (a maximum of eight) overlap with the respective fine octants. (4) The only coarse grid shape functions that do not vanish at the non-hanging fine grid node under consideration are those whose indices are stored in the vertices of each of these coarse octants. Some of these vertices may be hanging, but they will be mapped to the corresponding non-hanging vertex. Thus, the correct index is always stored irrespective of the hanging state of the vertex.
We pre-compute and store a mask for each fine grid node. Each of these masks is a set of eight bytes, one for each of the eight fine grid elements that surround this fine grid node. When we visit a fine grid octant and the corresponding coarse grid octant within the loop, we read the eight bits corresponding to this fine grid octant. Each of these bits is a flag that determines whether or not the respective coarse grid shape function contributes to this fine grid node. The overhead of using this mask within the actual MatVecs is just the cost of a few bitwise operations for each fine grid octant. Algorithm 1 lists the sequence of operations performed by a CPU for the restriction MatVec. This MatVec is an operation with O(N) work complexity and has an O(N/n_p) parallel time complexity. For simplicity, we do not overlap communication with computation in the pseudocode; in the actual implementation, we do. The following section describes how we compute these masks for any given pair of coarse and fine octrees.
Algorithm 1: Operations performed by CPU P for the restriction MatVec.
Input: fine vector (F), masks (M), pre-computed stencils (R_1) and (R_2), fine octree (O_f), coarse octree (O_c).
Output: coarse vector (C).
1.  Exchange ghost values for F and M with other CPUs.
2.  C <- 0.
3.  for each o_c in O_c
4.    Let c_c be the child number of o_c.
5.    Let h_c be the hanging type of o_c.
6.    Step through O_f until o_f in O_f is found s.t. Anchor(o_f) = Anchor(o_c).
7.    if Level(o_c) = Level(o_f)
8.      for each vertex, V_f, of o_f
9.        Let V_f be the i-th vertex of o_f.
10.       if V_f is not hanging
11.         for each vertex, V_c, of o_c
12.           Let V_c be the j-th vertex of o_c.
13.           If V_c is hanging, use the corresponding non-hanging node instead.
14.           if the j-th bit of M(V_f, i) = 1
15.             C(V_c) = C(V_c) + R_1(c_c, h_c, i, j) F(V_f)
16.           end if
17.         end for
18.       end if
19.     end for
20.   else
21.     for each of the 8 children of o_c
22.       Let c_f be the child number of o_f, the child of o_c that is processed in the current iteration.
23.       Perform steps 8 to 19, replacing R_1(c_c, h_c, i, j) with R_2(c_f, c_c, h_c, i, j) in step 15.
24.     end for
25.   end if
26. end for
27. Exchange ghost values for C with other CPUs.
28. Add the contributions received from other CPUs to the local copy of C.
Computing the masks for restriction and prolongation. Each non-hanging fine grid vertex has a maximum of 1758 unique locations at which a coarse grid shape function that contributes to this vertex could be rooted (this is a weak upper bound). Each of the vertices of the coarse grid octants that overlap with the fine grid octants surrounding this fine grid vertex can be mapped to one of these 1758 possibilities. It is also possible that some of these vertices are mapped to the same location. When we pre-compute the masks described earlier, we want to identify these many-to-one mappings so that only one of them is selected to contribute to the fine grid node under consideration. Details on identifying these cases are given in [19].
We cannot pre-compute the masks offline since they depend on the coarse and fine octrees under consideration. To do this computation efficiently, we employ a dummy MatVec before we actually begin solving the problem. In this dummy MatVec, we use a set of 16 bytes per fine grid node: 2 bytes for each of the eight fine grid octants surrounding the node. In these 16 bits, we store flags that determine the following: (1) whether or not the coarse and fine grid octants are the same (1 bit); (2) the child number of the current fine grid octant (3 bits); (3) the child number of the corresponding coarse grid octant (3 bits); (4) the hanging configuration of the corresponding coarse grid octant (5 bits); and (5) the relative size of the current fine grid octant with respect to the reference element (2 bits).
Using this information and some simple bitwise operations, we can compute and store the masks for each fine grid node. The dummy MatVec is an operation with O(N) work complexity and has an O(N/n_p) parallel time complexity.
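The 16 bits stored per (fine node, surrounding fine element) pair can be packed and unpacked with a few shifts and masks. The sketch below follows the list above (1 + 3 + 3 + 5 + 2 = 14 bits used, two bits spare); the exact field ordering is an illustrative assumption, not Dendro's actual layout.

#include <cstdint>

// Pack the flags computed in the dummy MatVec into one 16-bit word:
// bit 0: coarse and fine octants coincide; bits 1-3: fine child number; bits 4-6: coarse child number;
// bits 7-11: hanging configuration of the coarse octant; bits 12-13: relative size of the fine octant.
inline uint16_t packFlags(bool sameOctant, unsigned fineChild, unsigned coarseChild,
                          unsigned hangingConfig, unsigned relSize)
{
    return static_cast<uint16_t>((sameOctant ? 1u : 0u)
                                 | ((fineChild     & 0x7u)  << 1)
                                 | ((coarseChild   & 0x7u)  << 4)
                                 | ((hangingConfig & 0x1Fu) << 7)
                                 | ((relSize       & 0x3u)  << 12));
}

inline bool     isSameOctant(uint16_t w)    { return (w & 1u) != 0; }
inline unsigned fineChildOf(uint16_t w)     { return (w >> 1)  & 0x7u; }
inline unsigned coarseChildOf(uint16_t w)   { return (w >> 4)  & 0x7u; }
inline unsigned hangingConfigOf(uint16_t w) { return (w >> 7)  & 0x1Fu; }
inline unsigned relSizeOf(uint16_t w)       { return (w >> 12) & 0x3u; }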
Intergrid transfers. Unlike the finite element MatVec, the loop is split into three parts in this case. We cannot loop over ghost octants here, because the ghost octants on the coarse and fine grids need not be aligned. Hence, each CPU only loops over the coarse octants and the underlying fine octants that it owns. As a result, we need to both read and write ghost values within the MatVec. We first loop over some of the elements in the interior of the CPU domains, since these elements do not share any nodes with the neighboring CPUs. Simultaneously, each CPU reads the ghost values from the other CPUs in the background. At the end of the first loop, we use the ghost values that were received from the other CPUs and loop over those elements that contribute to ghost values. At the end of the second loop, we exchange the updated ghost values and simultaneously loop over the remaining elements in the interior of the CPU domains.
2.4 Matrix-vector multiplication
Every octant is owned by a single CPU. However, the values of unknowns associated with octants on inter-CPU boundaries need to be shared among several CPUs. We keep multiple copies of the information related to these octants and term them ghost octants. In our implementation of the finite element MatVec, each CPU iterates over all the octants it owns and also loops over a layer of ghost octants that contribute to the nodes it owns. Within the loop, each octant is mapped to one of the hanging configurations described above. This is used to select the appropriate element stencil from a list of pre-computed stencils. We then use the selected stencil in a standard element-based assembly technique. Although the CPUs need to read ghost values from other CPUs, they only need to write data back to the nodes they own and do not need to write to ghost nodes. Thus, there is only one communication step within each MatVec, and we overlap this communication with useful computation. We first loop over the elements in the interior of the CPU domain, since these elements do not share any nodes with the neighboring CPUs. While we perform this computation, we communicate the ghost values in the background. At the end of the first loop, we use the ghost values that were received from the other CPUs and loop over the remaining elements.
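The overlap of the single ghost-value exchange with the loop over interior elements follows the usual nonblocking MPI pattern. The sketch below is schematic: buffer preparation and the two element loops are placeholders for Dendro's actual data structures.

#include <mpi.h>
#include <vector>

// Schematic finite element MatVec with communication/computation overlap.
// sendBufs[n]/recvBufs[n] hold the ghost values exchanged with neighborRanks[n].
void matVecWithOverlap(std::vector<std::vector<double> >& sendBufs,
                       std::vector<std::vector<double> >& recvBufs,
                       const std::vector<int>& neighborRanks, MPI_Comm comm)
{
    std::vector<MPI_Request> requests;
    requests.reserve(2 * neighborRanks.size());
    for (std::size_t n = 0; n < neighborRanks.size(); ++n) {   // post the nonblocking ghost exchange
        requests.push_back(MPI_Request());
        MPI_Irecv(recvBufs[n].data(), static_cast<int>(recvBufs[n].size()), MPI_DOUBLE,
                  neighborRanks[n], 0, comm, &requests.back());
        requests.push_back(MPI_Request());
        MPI_Isend(sendBufs[n].data(), static_cast<int>(sendBufs[n].size()), MPI_DOUBLE,
                  neighborRanks[n], 0, comm, &requests.back());
    }
    // Loop over interior elements: they touch no ghost nodes, so this work overlaps
    // with the messages that are in flight.
    //   elementLoop(interiorElements);
    MPI_Waitall(static_cast<int>(requests.size()), requests.data(), MPI_STATUSES_IGNORE);
    // Ghost values have arrived: loop over the remaining (boundary-layer) elements.
    //   elementLoop(boundaryElements);
}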
Algorithm 2: Adaptive mesh refinement.
1. Coarsen or refine the octree using the exact analytical solution at the current time step.
2. Balance the octree to enforce the 2:1 balance constraint.
3. Mesh the octree to get the element-node connectivity.
4. Transfer the solution at the previous time step to the new mesh.
5. Solve the linear system of equations using CG with the Jacobi preconditioner.
Algorithm 3: Solution transfer algorithm.
1. Get the list of coordinates of the nodes in the new mesh. O(N/n_p)
2. Sort the list of coordinates using parallel sample sort. O((N/n_p) log(N/n_p) + n_p log n_p)
3. Impose the partition of the old mesh on the sorted list. O(N/n_p)
4. Evaluate the function values at the nodes of the new mesh using the nodal values and shape functions of the old mesh. O(N/n_p)
5. Re-distribute the evaluated function values to the partition of the new mesh. O(N/n_p)
2.5 Adaptive mesh refinement
We solve a linear parabolic problem, ∂u/∂t = Δu + f(x, t), with homogeneous Neumann boundary conditions using adaptive mesh refinement. We compute f(x, t) using an analytical solution of a traveling wave of the form u(x, t) = exp(-10^4 (y - 0.5 - 0.1t - 0.1 sin(2πx))^2) (see Figure 7). A second-order implicit Crank-Nicolson scheme is used to solve the problem for 10 time steps with a step size of Δt = 0.05. We used CG with the Jacobi preconditioner to solve the linear system of discretized equations at every time step. We build the octree using the exact analytical solution; we do not use any error estimator. Our focus is on demonstrating the performance of our tree-construction, balancing, meshing and solution transfer schemes. In Algorithm 2, we describe the adaptive mesh refinement scheme, and in Algorithm 3 we present a parallel algorithm to map the solution between the meshes. The overall computational complexity of the solution transfer algorithm, assuming that the octrees have O(N) nodes and are similar, is O(N/n_p + n_p log n_p). We do not require the meshes at two different time steps to be aligned or to share the same partition. In fact, the two meshes can be entirely different. Of course, the greater their differences, the higher the intergrid transfer communication costs.
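A small sketch of how the exact traveling-wave solution can be evaluated to drive the coarsening/refinement decision in step 1 of Algorithm 2. The element-center sampling and the threshold are illustrative choices, not the exact criterion used in the experiments.

#include <cmath>

// Exact traveling-wave solution used to construct the octrees (it does not depend on z).
double exactSolution(double x, double y, double /*z*/, double t)
{
    const double kPi = 3.14159265358979323846;
    const double arg = y - 0.5 - 0.1 * t - 0.1 * std::sin(2.0 * kPi * x);
    return std::exp(-1.0e4 * arg * arg);
}

// Refine an element centered at (cx, cy, cz) with edge length h if the solution varies
// rapidly across it (i.e., near the wave front); otherwise it is a candidate for coarsening.
bool shouldRefine(double cx, double cy, double cz, double h, double t, double tol = 1.0e-3)
{
    const double u0 = exactSolution(cx, cy, cz, t);
    const double u1 = exactSolution(cx, cy + 0.5 * h, cz, t);   // sample across the element in y
    return std::fabs(u1 - u0) > tol;
}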
2.6 Summary of the overall multigrid algorithm
1. A sufficiently fine 2:1 balanced complete linear octree is constructed using the algorithms described in [22]. (Here, "sufficiently fine" means that the discretization error introduced is acceptable.)
2. Starting with the finest octree, a sequence of 2:1 balanced coarse linear octrees is constructed using the global coarsening algorithm.
3. Starting with the coarsest octree, the octree at each level is meshed using the algorithm described in [21]. As long as the load imbalance across the CPUs is acceptable and as long as the coarser grid could be partitioned without leaving any CPU idle, the partition of the coarser grid is imposed on the finer grid during meshing. If either of these two conditions is violated, then the octree for the finer grid is duplicated: one copy is meshed using the partition of the coarser grid and the other is meshed using a fresh partition computed using the Block Partition algorithm. The process is repeated until the finest octree has been meshed.
4. A dummy restriction MatVec (Section 2.3.1) is performed at each level (except the coarsest) and the masks that will be used in the actual restriction and prolongation MatVecs are computed and stored.
5. For the case of variable-coefficient operators, vectors that store the material properties at each level are created.
6. The discrete system of equations is then solved using the conjugate gradient algorithm preconditioned with the multigrid scheme.
Complexity. Let N be the total number of octants at the fine level and n_p be the number of CPUs. In [21] and [22], we showed that the parallel complexity of the single-level construction, 2:1 balancing, partition, and meshing is O((N/n_p) log(N/n_p) + n_p log n_p) for trees that are nearly uniform. The cost of a MatVec is O(N/n_p).
For the multigrid case, we report the complexity (again for trees that are nearly uniform) for the case in which all CPUs are used at all levels. Given that the complexity of a single-level coarsening is O(N/n_p), the overall complexity of the coarsening is equal to Σ_{k=0..L} N_k/n_p. For a regular grid with a coarse grid of size 1, L = (log_2 N)/3 and N_k = 2^{-3k} N. Thus, the overall coarsening complexity is the same as that of a single-level coarsening. Using a similar argument, the cost for balancing and meshing is O((N/n_p) log(N/n_p) + n_p log n_p) L. (Notice the L factor in the communication cost.)
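The geometric-sum argument behind the last statement, written out (with k = 0 denoting the finest level in this count):

\sum_{k=0}^{L} \frac{N_k}{n_p}
  = \frac{N}{n_p} \sum_{k=0}^{L} 2^{-3k}
  \;\le\; \frac{N}{n_p} \cdot \frac{1}{1 - 1/8}
  = \frac{8}{7}\,\frac{N}{n_p}
  = O\!\left(\frac{N}{n_p}\right).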
This estimate does not include the communication costs related to the mapping between different partitionings of the same octree. This operation is carried out using an MPI_Alltoallv() call, which has an O(m n_p) complexity, with m being the message length [12]. We have not been able to derive the cost of this operation precisely. Notice that in most cases the repartitioned octree has significant overlaps with the original partition, since it is always Morton-ordered and the Block Partition algorithm uses this ordering. Also, our empirical results show that the associated costs are not too high, even in the case in which we repartition at each level and a large number of CPUs have zero elements.
3 Numerical Experiments
In this section, we report the results from isogranular (weak scaling) and fixed-size (strong scaling) scalability experiments on up to 4K CPUs on NCSA's Intel 64 cluster and PSC's Cray XT3. The NCSA machine has 8 CPUs per node and the PSC machine has 2 CPUs per node. The isogranular scalability analysis was performed by roughly keeping the problem size per CPU fixed while increasing the number of CPUs. The fixed-size scalability analysis was performed by keeping the problem size constant and increasing the number of CPUs.
Implementation details. Here, we discuss features that we plan to incorporate into Dendro in the near future. Currently, the number of multigrid levels is user-specified. However, the problem size for the coarsest grid is not known a priori. Hence, choosing an appropriate number of multigrid levels to better utilize all the available CPUs is not obvious. We are currently working on developing heuristics to dynamically adjust the number of multigrid levels. Imposing a partition computed on the coarse grid onto a fine grid may lead to load imbalance. However, not doing so results in duplicating meshes for the intermediate levels and also increases communication during the intergrid transfer operations. Hence, a balance between the two needs to be struck, and we are currently working on developing heuristics to address this issue. In Figure 2, we depict the use of duplicate octrees that are used to map scalar and vector functions between different octree partitions. These partitions can be defined in three different ways: (1) using the same MPI communicator and a uniform load across all CPUs; (2) again using the same MPI communicator but partitioning the load across a smaller number of CPUs and letting the other CPUs idle; and (3) creating a new communicator. In this set of experiments, we use the first way, i.e., we partition across the total number of CPUs for all levels. Also, we repartition at each level. In this way, the results reported in the isogranular and fixed-size scalability experiments represent the worst-case performance for our method.

Figure 3: The LEFT FIGURE shows isogranular scalability with a grain size of approximately 0.25M elements per CPU (n_p) on the finest level. The difference between the minimum and maximum levels of the octants on the finest grid is 5. A V-cycle using 4 pre-smoothing steps and 4 post-smoothing steps per level was used as a preconditioner to CG. The damped Jacobi method with a damping factor of 0.857 was used as the smoother at all multigrid levels. A relative tolerance of 10^{-10} in the 2-norm of the residual was used. 6 CG iterations were required in each case to solve the problem to the specified tolerance. The finest-level octrees for the multiple-CPU cases were generated using regular refinements of the finest octree for the single-CPU case. SuperLU was used to solve the coarsest grid problem. This isogranular scalability experiment was performed on NCSA's Intel 64 cluster.
The RIGHT FIGURE gives the results for the variable-coefficient scalar Laplacian operator (left column) and for the constant-coefficient linear elastostatic problem (right column). For the elasticity problem, the actual timings are given in the table below and 1/10 of that value is plotted. The coefficient of the scalar Laplacian operator was chosen to be (1 + 10^6 (cos^2(2πx) + cos^2(2πy) + cos^2(2πz))). The elastostatic problem uses homogeneous Dirichlet boundary conditions and the scalar problem uses homogeneous Neumann boundary conditions. A Poisson's ratio of 0.4 was used for the elasticity problem. The solver options are the same as described for the left figure. Only the timings for the solve phase are reported; the timings for the setup phase are not reported since the results are similar to those of the left figure. This isogranular scalability experiment was performed on PSC's Cray XT3.
Isogranular scalability. In Figure 3, we report results for the constant-coefficient (left subfigure) and variable-coefficient (right subfigure) scalar elliptic problems with homogeneous Neumann boundary conditions on meshes whose mesh size follows a Gaussian distribution. The coarsest level uses only 1 CPU in all of the experiments unless otherwise specified. We use SuperLU_DIST [17] as an exact solver at the coarsest grid. 4 multigrid levels were used on 1 CPU, and the number of multigrid levels was incremented by 1 every time the number of CPUs increased by a factor of 8. In the left subfigure, for each CPU count, the left column gives the time for the setup phase and the right column gives the time for the solve phase.
[Figure 4 shows bar charts of the setup and solve times (in seconds, 0-280) for the GMG, BoomerAMG-Falgout, and BoomerAMG-CLJP schemes on n_p = 1 to 256 CPUs, with problem sizes ranging from 52K to 16M elements.]

Figure 4: A variable-coefficient (contrast of 10^7) elliptic problem with homogeneous Neumann boundary conditions was solved on meshes constructed on Gaussian distributions. A 7-level multiplicative multigrid cycle was used as a preconditioner to CG for the GMG scheme. SuperLU_DIST was used to solve the coarsest grid problem in the GMG scheme.
The setup cost for the GMG scheme includes the time for constructing the mesh for all the levels (including the finest), constructing and balancing all the coarser levels, and setting up the intergrid transfer operators by performing one dummy restriction MatVec at each level. Here we repartition at each level. The time to create the work vectors for the MG scheme and the time to build the coarsest grid matrix are included in "Extras-1". "Scatter" refers to the process of transferring the vectors between two different partitions of the same multigrid level during the intergrid transfer operations; this is required whenever the coarse and fine grids do not share the same partition. The time spent in applying the Jacobi preconditioner, computing the inner products within CG, and solving the coarsest grid problems using LU are all grouped together as "Extras-2". In Figure 4, we report results from a comparison with the algebraic multigrid scheme. In order to minimize communication costs, the coarsest level was distributed on a maximum of 32 CPUs in all experiments.
For BoomerAMG, we experimented with two different coarsening schemes: Falgout coarsening and CLJP coarsening. The results from both experiments are reported. [14] reports that Falgout coarsening works best for structured grids and that CLJP coarsening is better suited for unstructured grids. Since octree meshes lie between structured and generic unstructured grids, we compare our results using both schemes. Both the GMG and AMG schemes used 4 pre-smoothing steps and 4 post-smoothing steps per level with the damped Jacobi smoother. A relative tolerance of 10^{-10} in the 2-norm of the residual was used in all the experiments.
The GMG scheme took about 12 CG iterations, the Falgout scheme took about 7 iterations, and the CLJP scheme also took about 7 iterations. Each node of the cluster has 8 CPUs which share 8GB of RAM. However, only 1 CPU per node was utilized in the above experiments, because the AMG scheme required a lot of memory and this allowed the entire memory on a node to be available for a single process. (We did not attempt to optimize either our code or AMG, and we used the default options; it is possible that one can reduce the AMG cost by using appropriate options.) The setup time reported for the AMG schemes includes the time for meshing the finest grid and constructing the finest grid FE matrix, both of which are quite small (about 1.35 seconds for meshing and 22.93 seconds for building the fine grid matrix even on 256 CPUs) compared to the time to set up the rest of the AMG scheme. Although GMG seems to be performing well, more difficult problems with multiple discontinuous coefficients can cause it to fail. Our method is not robust in the presence of discontinuous coefficients, in contrast to AMG.
Fixed-size scalability. Next, we report fixed-size scalability results for the Poisson and elasticity solvers. (The elastostatics equation is given by div(μ(x)∇v(x)) + ∇((λ(x) + μ(x)) div v(x)) = b(x), where λ and μ are scalar fields, and v and b are 3D vector functions.) In Figure 5, we report fixed-size scalability for two different grain sizes for the Poisson problem. In Figure 6, we report fixed-size scalability results for the elasticity and variable-coefficient Poisson cases. Overall, we observe excessive costs for the coarsening, balancing, and meshing when the grain size is relatively small. As in the isogranular case, we repartition at each multigrid level and we use all available CPUs. However, notice that the MatVecs and, overall, the solver scale quite well.
AMR. Finally, in Table 2 and Figure 7, we report the performance of the balancing, meshing and solution transfer algorithms used to solve the linear parabolic problem described in Section 2.5.
Figure 5: LEFT FIGURE: Scalability for a fixed fine-grid problem size of 64.4M elements. The problem is the same as described in Figure 3. 8 multigrid levels were used. 64 CPUs were used for the coarsest grid in all cases in this experiment. The minimum and maximum levels of the octants on the finest grid are 3 and 18, respectively. 5 iterations were required to solve the problem to a relative tolerance of 10^{-10} in the 2-norm of the residual.
RIGHT FIGURE: Scalability for a fixed fine-grid problem size of 15.3M elements. The setup cost and the cost to solve the constant-coefficient Poisson problem are reported. 6 multigrid levels were used. Both fixed-size scalability experiments were performed on NCSA's Intel 64 cluster.
Table 2: Isogranular scalability with a grain size of approximately 38K elements per CPU (n_p). A constant-coefficient linear parabolic problem with homogeneous Neumann boundary conditions was solved on octree meshes. We used the analytical solution of a traveling wave of the form exp(-10^4 (y - 0.5 - 0.1t - 0.1 sin(2πx))^2) to construct the octrees. We used a time step of Δt = 0.05 and solved for 10 time steps. We used a second-order Crank-Nicolson scheme for time-stepping and CG with the Jacobi preconditioner to solve the linear system of equations at every time step. We used a stopping criterion of ||r||/||r_0|| < 10^{-6}. We report the number of elements (Elements), the number of CG iterations (iter), the solve time (Solve), the total time to balance (Balancing) and mesh (Meshing) the octrees generated at each time step, and the time to transfer the solution between the meshes (Transfer) at two different time steps. The total number of elements and the number of CG iterations are approximately constant over all time steps. This isogranular scalability experiment was performed on NCSA's Intel 64 cluster.

n_p   Elements   iter   Solve   Balancing   Meshing   Transfer
8     300K       110    78.1    6.24        6.0       0.76
64    2M         204    143.9   17.0        7.86      0.82
512   14M        384    356     132.9       72.3      3.32
[Figure 6 shows bar charts of the time (in seconds, 0-40) spent in Restriction/Prolongation, Scatter, FE MatVecs, and Extras on n_p = 64, 128, 256, and 512 CPUs, for the variable-coefficient Poisson problem and the constant-coefficient linear elasticity problem.]

Figure 6: Scalability for a fixed fine-grid problem with 15.3M elements. The cost to solve the variable-coefficient Poisson problem and the constant-coefficient linear elasticity problem are reported. For each CPU count, the left column gives the results for the variable-coefficient scalar Laplacian operator and the right column gives the results for the constant-coefficient linear elastostatic problem. For the elasticity problem, the actual timings are given in the table but are scaled by an order of magnitude (1/10) for the plot. The setup cost is the same as in Fig. 5. 6 multigrid levels were used. This fixed-size scalability experiment was performed on NCSA's Intel 64 cluster.
Figure 7: An example of adaptive mesh refinement on a traveling wave. This is a synthetic solution which we use to illustrate the adaptive octrees. The coarsening and refinement are based on the approximation error between the discretized and the exact function.
4 Conclusions
We have described a parallel geometric multigrid method for solving partial differential equations using finite elements on octrees. We have also described an AMR scheme that can be used to transfer vector and scalar functions between arbitrary octrees.
We automatically generate a sequence of coarse meshes from an arbitrary 2:1 balanced fine octree. We do not impose any restrictions on the number of meshes in this sequence or on the size of the coarsest mesh. We do not require the meshes to be aligned, and hence the different meshes can be partitioned independently. Although bottom-up tree construction and meshing are harder than top-down approaches, we believe that they are more practical, since the fine mesh can be defined naturally based on the physics of the problem.
In the final submission, we will include additional results with significant improvements for the meshing and balancing parts in the multigrid case, so that the load is determined by a minimum grain size.
We have demonstrated reasonable scalability of our implementation and can solve problems with billions of elements on thousands of CPUs. We tested the worst-case scenarios for our code, in which we repartition at each level and use all available CPUs at all levels. We have demonstrated that our implementation works well even on problems with largely varying material properties. We have compared our geometric multigrid implementation with a state-of-the-art algebraic multigrid implementation in a standard off-the-shelf package (Hypre). Overall, we showed that the proposed algorithm is quite efficient, although significant work remains to improve the partitioning algorithm and the overall robustness of the scheme in the presence of discontinuous coefficients.
References
[1] M. F. Adams, H. Bayraktar, T. Keaveny, and P. Papadopoulos, Ultrascalable implicit finite element analyses in solid mechanics with over a half a billion degrees of freedom, in Proceedings of SC2004, The SCxy Conference series in high performance networking and computing, Pittsburgh, Pennsylvania, 2004, ACM/IEEE.
[2] S. Balay, K. Buschelman, W. D. Gropp, D. Kaushik, M. Knepley, L. C. McInnes, B. F. Smith, and H. Zhang, PETSc home page, 2001. http://www.mcs.anl.gov/petsc.
[3] G. T. Balls, S. B. Baden, and P. Colella, SCALLOP: A highly scalable parallel Poisson solver in three dimensions, in SC '03: Proceedings of the 2003 ACM/IEEE conference on Supercomputing, Washington, DC, USA, 2003, IEEE Computer Society, p. 23.
[4] R. Becker and M. Braack, Multigrid techniques for finite elements on locally refined meshes, Numerical Linear Algebra with Applications, 7 (2000), pp. 363-379.
[5] B. Bergen, F. Hülsemann, and U. Rüde, Is 1.7 x 10^10 Unknowns the Largest Finite Element System that Can Be Solved Today?, in SC '05: Proceedings of the 2005 ACM/IEEE conference on Supercomputing, Washington, DC, USA, 2005, IEEE Computer Society, p. 5.
[6] M. W. Bern, D. Eppstein, and S.-H. Teng, Parallel construction of quadtrees and quality triangulations, International Journal of Computational Geometry and Applications, 9 (1999), pp. 517-532.
[7] M. Bittencourt and R. Feijoo, Non-nested multigrid methods in finite element linear structural analysis, in Virtual Proceedings of the 8th Copper Mountain Conference on Multigrid Methods (MGNET), 1997.
[8] P. M. Campbell, K. D. Devine, J. E. Flaherty, L. G. Gervasio, and J. D. Teresco, Dynamic octree load balancing using space-filling curves, Tech. Report CS-03-01, Williams College Department of Computer Science, 2003.
[9] E. Chow, R. D. Falgout, J. J. Hu, R. S. Tuminaro, and U. M. Yang, A survey of parallelization techniques for multigrid solvers, in Parallel Processing for Scientific Computing, M. A. Heroux, P. Raghavan, and H. D. Simon, eds., Cambridge University Press, 2006, pp. 179-195.
[10] R. Falgout, An introduction to algebraic multigrid, Computing in Science and Engineering, Special issue on Multigrid Computing, 8 (2006), pp. 24-33.
[11] R. Falgout, J. Jones, and U. Yang, The design and implementation of hypre, a library of parallel high performance preconditioners, in Numerical Solution of Partial Differential Equations on Parallel Computers, A. Bruaset and A. Tveito, eds., vol. 51, Springer-Verlag, 2006, pp. 267-294.
[12] A. Grama, A. Gupta, G. Karypis, and V. Kumar, An Introduction to Parallel Computing: Design and Analysis of Algorithms, Addison Wesley, second ed., 2003.
[13] M. Griebel and G. Zumbusch, Parallel multigrid in an adaptive PDE solver based on hashing, in Parallel Computing: Fundamentals, Applications and New Directions, Proceedings of the Conference ParCo'97, 19-22 September 1997, Bonn, Germany, E. H. D'Hollander, G. R. Joubert, F. J. Peters, and U. Trottenberg, eds., vol. 12, Amsterdam, 1998, Elsevier, North-Holland, pp. 589-600.
[14] V. E. Henson and U. M. Yang, BoomerAMG: a parallel algebraic multigrid solver and preconditioner, Applied Numerical Mathematics, 41 (2002), pp. 155-177.
[15] F. Hülsemann, M. Kowarschik, M. Mohr, and U. Rüde, Parallel geometric multigrid, in Numerical Solution of Partial Differential Equations on Parallel Computers, A. M. Bruaset and A. Tveito, eds., Birkhäuser, 2006, pp. 165-208.
[16] A. Jones and P. Jimack, An adaptive multigrid tool for elliptic and parabolic systems, International Journal for Numerical Methods in Fluids, 47 (2005), pp. 1123-1128.
[17] X. S. Li and J. W. Demmel, SuperLU_DIST: A scalable distributed-memory sparse direct solver for unsymmetric linear systems, ACM Transactions on Mathematical Software, 29 (2003), pp. 110-140.
[18] D. J. Mavriplis, M. J. Aftosmis, and M. Berger, High resolution aerospace applications using the NASA Columbia Supercomputer, in SC '05: Proceedings of the 2005 ACM/IEEE conference on Supercomputing, Washington, DC, USA, 2005, IEEE Computer Society, p. 61.
[19] R. Sampath and G. Biros, A parallel geometric multigrid method for finite elements on octree meshes, tech. report, University of Pennsylvania, 2008. Submitted for publication.
[20] R. Sampath, H. Sundar, S. S. Adavani, I. Lashuk, and G. Biros, Dendro home page, 2008. http://www.seas.upenn.edu/csela/dendro.
[21] H. Sundar, R. Sampath, C. Davatzikos, and G. Biros, Low-constant parallel algorithms for finite element simulations using linear octrees, in Proceedings of SC2007, The SCxy Conference series in high performance networking and computing, Reno, Nevada, 2007, ACM/IEEE.
[22] H. Sundar, R. S. Sampath, and G. Biros, Bottom-up construction and 2:1 balance refinement of linear octrees in parallel, SIAM Journal on Scientific Computing, (2008). To appear.
[23] U. Trottenberg, C. W. Oosterlee, and A. Schuller, Multigrid, Academic Press Inc., San Diego, CA, 2001.
[24] T. Tu, D. R. O'Hallaron, and O. Ghattas, Scalable parallel octree meshing for terascale applications, in SC '05: Proceedings of the 2005 ACM/IEEE conference on Supercomputing, Washington, DC, USA, 2005, IEEE Computer Society, p. 4.
[25] T. Tu, H. Yu, L. Ramirez-Guzman, J. Bielak, O. Ghattas, K.-L. Ma, and D. R. O'Hallaron, From mesh generation to scientific visualization: an end-to-end approach to parallel supercomputing, in SC '06: Proceedings of the 2006 ACM/IEEE conference on Supercomputing, New York, NY, USA, 2006, ACM Press, p. 91.