PT-Scotch and libScotch 5.1
User's Guide
(version 5.1.11)
François Pellegrini
Bacchus team, INRIA Bordeaux Sud-Ouest
ENSEIRB & LaBRI, UMR CNRS 5800
Université Bordeaux I
351 cours de la Libération, 33405 TALENCE, FRANCE
[email protected]
November 17, 2010
Abstract
This document describes the capabilities and operations of PT-Scotch
and libScotch, a software package and a software library which compute
parallel static mappings and parallel sparse matrix block orderings of graphs.
It gives brief descriptions of the algorithms, details the input/output formats,
instructions for use, and installation procedures, and provides a number of
examples.
PT-Scotch is distributed as free/libre software, and has been designed
such that new partitioning or ordering methods can be added in a straight-
forward manner. It can therefore be used as a testbed for the easy and quick
coding and testing of such new methods, and may also be redistributed, as
a library, along with third-party software that makes use of it, either in its
original or in updated forms.
Contents
1 Introduction
  1.1 Static mapping
  1.2 Sparse matrix ordering
  1.3 Contents of this document
3 Algorithms
  3.1 Parallel static mapping by Dual Recursive Bipartitioning
    3.1.1 Static mapping
    3.1.2 Cost function and performance criteria
    3.1.3 The Dual Recursive Bipartitioning algorithm
    3.1.4 Partial cost function
    3.1.5 Parallel graph bipartitioning methods
    3.1.6 Mapping onto variable-sized architectures
  3.2 Parallel sparse matrix ordering by hybrid incomplete nested dissection
    3.2.1 Hybrid incomplete nested dissection
    3.2.2 Parallel ordering
    3.2.3 Performance criteria
  3.3 Changes from version 5.0
5 Programs
  5.1 Invocation
  5.2 File names
    5.2.1 Sequential and parallel file opening
    5.2.2 Using compressed files
  5.3 Description
    5.3.1 dgmap / dgpart
    5.3.2 dgord
    5.3.3 dgpart
    5.3.4 dgscat
    5.3.5 dgtst
6 Library
  6.1 Running at proper thread level
  6.2 Calling the routines of libScotch
    6.2.1 Calling from C
    6.2.2 Calling from Fortran
    6.2.3 Compiling and linking
    6.2.4 Machine word size issues
  6.3 Data formats
    6.3.1 Distributed graph format
    6.3.2 Block ordering format
  6.4 Strategy strings
    6.4.1 Using default strategy strings
    6.4.2 Parallel mapping strategy strings
    6.4.3 Parallel graph bipartitioning strategy strings
    6.4.4 Parallel ordering strategy strings
    6.4.5 Parallel node separation strategy strings
  6.5 Distributed graph handling routines
    6.5.1 SCOTCH_dgraphAlloc
    6.5.2 SCOTCH_dgraphInit
    6.5.3 SCOTCH_dgraphExit
    6.5.4 SCOTCH_dgraphFree
    6.5.5 SCOTCH_dgraphLoad
    6.5.6 SCOTCH_dgraphSave
    6.5.7 SCOTCH_dgraphBuild
    6.5.8 SCOTCH_dgraphGather
    6.5.9 SCOTCH_dgraphScatter
    6.5.10 SCOTCH_dgraphCheck
    6.5.11 SCOTCH_dgraphSize
    6.5.12 SCOTCH_dgraphData
    6.5.13 SCOTCH_dgraphGhst
    6.5.14 SCOTCH_dgraphHalo
    6.5.15 SCOTCH_dgraphHaloAsync
    6.5.16 SCOTCH_dgraphHaloWait
  6.6 Distributed graph mapping and partitioning routines
    6.6.1 SCOTCH_dgraphPart
    6.6.2 SCOTCH_dgraphMap
    6.6.3 SCOTCH_dgraphMapInit
    6.6.4 SCOTCH_dgraphMapExit
    6.6.5 SCOTCH_dgraphMapSave
    6.6.6 SCOTCH_dgraphMapCompute
  6.7 Distributed graph ordering routines
    6.7.1 SCOTCH_dgraphOrderInit
    6.7.2 SCOTCH_dgraphOrderExit
    6.7.3 SCOTCH_dgraphOrderSave
    6.7.4 SCOTCH_dgraphOrderSaveMap
    6.7.5 SCOTCH_dgraphOrderSaveTree
    6.7.6 SCOTCH_dgraphOrderCompute
    6.7.7 SCOTCH_dgraphOrderPerm
    6.7.8 SCOTCH_dgraphOrderCblkDist
    6.7.9 SCOTCH_dgraphOrderTreeDist
  6.8 Centralized ordering handling routines
    6.8.1 SCOTCH_dgraphCorderInit
    6.8.2 SCOTCH_dgraphCorderExit
    6.8.3 SCOTCH_dgraphOrderGather
  6.9 Strategy handling routines
    6.9.1 SCOTCH_stratInit
    6.9.2 SCOTCH_stratExit
    6.9.3 SCOTCH_stratSave
    6.9.4 SCOTCH_stratDgraphMap
    6.9.5 SCOTCH_stratDgraphMapBuild
    6.9.6 SCOTCH_stratDgraphOrder
    6.9.7 SCOTCH_stratDgraphOrderBuild
  6.10 Other data structure routines
    6.10.1 SCOTCH_dmapAlloc
    6.10.2 SCOTCH_dorderAlloc
  6.11 Error handling routines
    6.11.1 SCOTCH_errorPrint
    6.11.2 SCOTCH_errorPrintW
    6.11.3 SCOTCH_errorProg
  6.12 Miscellaneous routines
    6.12.1 SCOTCH_randomReset
  6.13 ParMeTiS compatibility library
    6.13.1 ParMETIS_V3_NodeND
    6.13.2 ParMETIS_V3_PartGeomKway
    6.13.3 ParMETIS_V3_PartKway
7 Installation
  7.1 Thread issues
  7.2 File compression issues
  7.3 Machine word size issues
8 Examples
1 Introduction
1.1 Static mapping
The efficient execution of a parallel program on a parallel machine requires that
the communicating processes of the program be assigned to the processors of the
machine so as to minimize its overall running time. When processes have a limited
duration and their logical dependencies are accounted for, this optimization problem
is referred to as scheduling. When processes are assumed to coexist simultaneously
for the entire duration of the program, it is referred to as mapping. It amounts to
balancing the computational weight of the processes among the processors of the
machine, while reducing the cost of communication by keeping intensively inter-
communicating processes on nearby processors.
In most cases, the underlying computational structure of the parallel programs
to map can be conveniently modeled as a graph in which vertices correspond to
processes that handle distributed pieces of data, and edges reflect data dependencies.
The mapping problem can then be addressed by assigning processor labels to the
vertices of the graph, so that all processes assigned to some processor are loaded
and run on it. In a SPMD context, this is equivalent to the distribution across
processors of the data structures of parallel programs; in this case, all pieces of data
assigned to some processor are handled by a single process located on this processor.
A mapping is called static if it is computed prior to the execution of the program.
Static mapping is NP-complete in the general case [10]. Therefore, many studies
have been carried out in order to find sub-optimal solutions in reasonable time,
including the development of specific algorithms for common topologies such as the
hypercube [8, 16]. When the target machine is assumed to have a communication
network in the shape of a complete graph, the static mapping problem turns into the
partitioning problem, which has also been intensely studied [3, 17, 25, 26, 40]. How-
ever, when mapping onto parallel machines the communication network of which is
not a bus, not accounting for the topology of the target machine usually leads to
worse running times, because simple cut minimization can induce more expensive
long-distance communication [16, 43]; the static mapping problem is gaining pop-
ularity as most of the newer massively parallel machines have a strongly NUMA
architecture.
Because there always exist large problem graphs which cannot fit in the memory
of sequential computers and cost too much to partition or order, it is necessary
to resort to parallel graph partitioning and ordering tools. PT-Scotch provides
such features.
extended to native mesh structures, thanks to hypergraph partitioning algorithms.
New graph partitioning methods have also been recently added [6, 34]. Version
5.0 of Scotch was the first one to comprise parallel graph ordering routines [7],
and version 5.1 now offers parallel graph partitioning features, while parallel static
mapping will be available in the next release.
2.2 Availability
Starting from version 4.0, which has been developed at INRIA within the ScAlAp-
plix project, Scotch is available under a dual licensing basis. On the one hand, it
is downloadable from the Scotch web page as free/libre software, to all interested
parties willing to use it as a library or to contribute to it as a testbed for new
partitioning and ordering methods. On the other hand, it can also be distributed,
under other types of licenses and conditions, to parties willing to embed it tightly
into closed, proprietary software.
3 Algorithms
3.1 Parallel static mapping by Dual Recursive Bipartitioning
For a detailed description of the sequential implementation of this mapping algo-
rithm and an extensive analysis of its performance, please refer to [33, 36]. In the
next sections, we will only outline the most important aspects of the algorithm.
3.1.2 Cost function and performance criteria

The mapping of a source graph S onto a target graph T is described by two
functions: τS,T(vS) = vT if process vS of S is mapped onto processor vT of T,
and ρS,T(eS) = {e¹T, e²T, . . . , eⁿT} if communication channel eS of S is routed
through communication links e¹T, e²T, . . . , eⁿT of T. |ρS,T(eS)| denotes the
dilation of edge eS, that is, the number of edges of E(T) used to route eS. The
communication cost function fC is defined as

$$f_C(\tau_{S,T}, \rho_{S,T}) \;\stackrel{\mathrm{def}}{=}\; \sum_{e_S \in E(S)} w_S(e_S)\, \left|\rho_{S,T}(e_S)\right| ,$$

where wS(eS) is the weight of edge eS.
This function, which has already been considered by several authors for hypercube
target topologies [8, 16, 20], has several interesting properties: it is easy to compute,
allows incremental updates performed by iterative algorithms, and its minimization
favors the mapping of intensively intercommunicating processes onto nearby pro-
cessors; regardless of the type of routing implemented on the target machine (store-
and-forward or cut-through), it models the traffic on the interconnection network
and thus the risk of congestion.
The strong positive correlation between values of this function and effective
execution times has been experimentally verified by Hammond [16] on the CM-2,
and by Hendrickson and Leland [21] on the nCUBE 2.
The quality of mappings is evaluated with respect to the criteria for quality that
we have chosen: the balance of the computation load across processors, and the
minimization of the interprocessor communication cost modeled by function fC .
These criteria lead to the definition of several parameters, which are described
below.
For load balance, one can define µmap , the average load per computational
power unit (which does not depend on the mapping), and δmap , the load imbalance
ratio, as
$$\mu_{\mathrm{map}} \;\stackrel{\mathrm{def}}{=}\; \frac{\displaystyle\sum_{v_S \in V(S)} w_S(v_S)}{\displaystyle\sum_{v_T \in V(T)} w_T(v_T)}
\qquad\text{and}\qquad
\delta_{\mathrm{map}} \;\stackrel{\mathrm{def}}{=}\; \frac{\displaystyle\sum_{v_T \in V(T)} \left|\, \frac{1}{w_T(v_T)} \sum_{\substack{v_S \in V(S)\\ \tau_{S,T}(v_S) = v_T}} w_S(v_S) \;-\; \mu_{\mathrm{map}} \right|}{\displaystyle\sum_{v_S \in V(S)} w_S(v_S)}\, .$$
However, since the maximum load imbalance ratio is provided by the user as an
input of the mapping process, the information given by these parameters is of
little interest, since
what matters is the minimization of the communication cost function under this
load balance constraint.
For communication, the straightforward parameter to consider is fC . It can be
normalized as µexp , the average edge expansion, which can be compared to µdil ,
the average edge dilation; these are defined as

$$\mu_{\mathrm{exp}} \;\stackrel{\mathrm{def}}{=}\; \frac{f_C}{\displaystyle\sum_{e_S \in E(S)} w_S(e_S)}
\qquad\text{and}\qquad
\mu_{\mathrm{dil}} \;\stackrel{\mathrm{def}}{=}\; \frac{\displaystyle\sum_{e_S \in E(S)} \left|\rho_{S,T}(e_S)\right|}{|E(S)|}\, .$$

Their ratio, δexp def= µexp/µdil, is smaller than 1 when the mapper succeeds in
putting heavily intercommunicating processes closer to each other than it does
for lightly communicating processes; µexp and µdil are equal if all edges have
the same weight.
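For instance, consider a source graph with two edges e1 and e2 such that
wS(e1) = 4 and wS(e2) = 1, mapped so that |ρS,T(e1)| = 1 and |ρS,T(e2)| = 3.
Then fC = 4 × 1 + 1 × 3 = 7, µexp = 7/(4 + 1) = 1.4, µdil = (1 + 3)/2 = 2, and
δexp = 1.4/2 = 0.7 < 1, reflecting the fact that the heavier edge has been routed
over the shorter path.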
3.1.3 The Dual Recursive Bipartitioning algorithm

The above algorithm relies on the ability to define five main objects:
• a domain distance function, which gives, in the target graph, a measure of the
distance between two disjoint domains. Since domains may not be convex nor
connected, this distance may be estimated. However, it must respect certain
homogeneity properties, such as giving more accurate results as domain sizes
decrease. The domain distance function is used by the graph bipartitioning
algorithms to compute the communication function to minimize, since it allows
the mapper to estimate the dilation of the edges that link vertices which belong
to different domains. Using such a distance function amounts to considering
that all routings will use shortest paths on the target architecture, which
is what most parallel machines actually do. We have thus chosen that our
program would not provide routings for the communication channels, leaving
their handling to the communication system of the target machine;
All these routines are seen as black boxes by the mapping program, which can thus
accept any kind of target architecture and process bipartitioning functions.
3.1.4 Partial cost function

The partial communication cost function, computed when bipartitioning the
subgraph S′ of S associated with some domain, is defined as

$$f'_C(\tau_{S,T}, \rho_{S,T}) \;\stackrel{\mathrm{def}}{=}\; \sum_{\substack{v \in V(S')\\ \{v, v'\} \in E(S)}} w_S(\{v, v'\})\, \left|\rho_{S,T}(\{v, v'\})\right| ,$$
which accounts for the dilation of edges internal to subgraph S ′ as well as for the
one of edges which belong to the cocycle of S ′ , as shown in Figure 1. Taking into
account the partial mapping results issued by previous bipartitionings makes it pos-
sible to avoid local choices that might prove globally bad, as explained below. This
amounts to incorporating additional constraints to the standard graph bipartition-
ing problem, turning it into a more general optimization problem termed as skewed
graph partitioning by some authors [23].
Figure 1: Edges accounted for in the partial communication cost function when
bipartitioning the subgraph associated with domain D between the two subdomains
D0 and D1 of D. Dotted edges are of dilation zero, their two ends being mapped
onto the same subdomain. Thin edges are cocycle edges.
3.1.5 Parallel graph bipartitioning methods
The core of our parallel recursive mapping algorithm uses process graph parallel
bipartitioning methods as black boxes. It allows the mapper to run any type of
graph bipartitioning method compatible with our criteria for quality. Bipartitioning
jobs maintain an internal image of the current bipartition, indicating for every vertex
of the job whether it is currently assigned to the first or to the second subdomain. It
is therefore possible to apply several different methods in sequence, each one starting
from the result of the previous one, and to select the methods with respect to the
job characteristics, thus enabling us to define mapping strategies. The currently
implemented graph bipartitioning methods are listed below.
Band
Like the multi-level method which will be described below, the band method
is a meta-algorithm, in the sense that it does not itself compute partitions, but
rather helps other partitioning algorithms perform better. It is a refinement
algorithm which, from a given initial partition, extracts a band graph of given
width (which only contains graph vertices that are at most at this distance
from the separator), calls a partitioning strategy on this band graph, and
prolongs[1] back the refined partition to the original graph. This method was
designed to be able to use expensive partitioning heuristics, such as genetic
algorithms, on large graphs, as it dramatically reduces the problem space by
several orders of magnitude. However, it was found that, in a multi-level
context, it also improves partition quality, by coercing partitions in a problem
space that derives from the one which was globally defined at the coarsest
level, thus preventing local optimization refinement algorithms from being trapped
in local optima of the finer graphs [6].
Diffusion
This global optimization method, the sequential formulation of which is pre-
sented in [34], flows two kinds of antagonistic liquids, scotch and anti-scotch,
from two source vertices, and sets the new frontier as the limit between ver-
tices which contain scotch and the ones which contain anti-scotch. In order to
add load-balancing constraints to the algorithm, a constant amount of liquid
disappears from every vertex per unit of time, so that no domain can spread
across more than half of the vertices. Because selecting the source vertices is
essential to obtaining useful results, this method has been hard-coded
so that the two source vertices are the two vertices of highest indices, since
in the band method these are the anchor vertices which represent all of the
removed vertices of each part. Therefore, this method must be used on band
graphs only, or on specifically crafted graphs.
Multi-level
This algorithm, which has been studied by several authors [3, 18, 25] and
should be considered as a strategy rather than as a method since it uses other
methods as parameters, repeatedly reduces the size of the graph to bipartition
by finding matchings that collapse vertices and edges, computes a partition for
the coarsest graph obtained, and prolongs the result back to the original graph,
as shown in Figure 2. The multi-level method, when used in conjunction with
the banded diffusion method to refine the prolonged partitions at every level,
[1] While a projection is a mapping onto a space of lower dimension, a prolongation refers to
a mapping onto a space of higher dimension. Yet, the term projection is also commonly used to
refer to such a propagation, most often in the context of a multi-level framework.
Figure 2: The multi-level partitioning process. In the uncoarsening phase, the light
and bold lines represent for each level the prolonged partition obtained from the
coarser graph, and the partition obtained after refinement, respectively.
3.2 Parallel sparse matrix ordering by hybrid incomplete nested dissection

3.2.1 Hybrid incomplete nested dissection
The minimum degree algorithm [42] is a local heuristic that performs its pivot
selection by iteratively selecting from the graph a node of minimum degree. It is
known to be a very fast and general purpose algorithm, and has received much
attention over the last three decades (see for example [1, 13, 31]). However, the
algorithm is intrinsically sequential, and very little can be theoretically proved
about its efficiency.
3.2.2 Parallel ordering

Nested dissection As said above, the first level of concurrency relates to the
parallelization of the nested dissection method itself, which is straightforward thanks
to the intrinsically concurrent nature of the algorithm. Starting from the initial
graph, arbitrarily distributed across p processors but preferably balanced in terms
of vertices, the algorithm proceeds as illustrated in Figure 3: once a separator
has been computed in parallel, by means of a method described below, each of
the p processors participates in the building of the distributed induced subgraph
corresponding to the first separated part (even if some processors do not have any
vertex of it). This induced subgraph is then folded onto the first ⌈p/2⌉ processors,
such that the average number of vertices per processor, which guarantees efficiency
as it allows the overlapping of communications by a sufficient amount of computation,
remains constant. During the folding process, vertices and adjacency lists owned
by the ⌊p/2⌋ sender processors are redistributed to the ⌈p/2⌉ receiver processors so as
to evenly balance their loads.
The same procedure is used to build, on the ⌊p/2⌋ remaining processors, the
folded induced subgraph corresponding to the second part. These two constructions
being completely independent, the computations of the two induced subgraphs and
their folding can be performed in parallel, thanks to the temporary creation of an
extra thread per processor. When the vertices of the separated graph are evenly
distributed across the processors, this feature favors load balancing in the subgraph
building phase, because processors which do not have many vertices of one part
will have the rest of their vertices in the other part, thus yielding the same overall
workload to create both graphs in the same time. This feature can be disabled
when the communication system of the target machine is not thread-safe.
At the end of the folding process, every processor has a folded subgraph fragment
of one of the two folded subgraphs, and the nested dissection process can recursively
proceed independently on each subgroup of p/2 (then p/4, p/8, etc.) processors, until
each subgroup is reduced to a single processor. From then on, the nested dissection
process will go on sequentially on every processor, using the nested dissection rou-
tines of the Scotch library, eventually ending in a coupling with minimum degree
methods [39], as described in the previous section.
matching phase is complete, the coarsened subgraph building phase takes place.
It can be parametrized so as to allow one to choose between two options. Either
all coarsened vertices are kept on their local processors (that is, processors that
hold at least one of the ends of the coarsened edges), as shown in the first steps
of Figure 4, which decreases the number of vertices owned by every processor and
speeds up future computations, or else coarsened graphs are folded and duplicated,
as shown in the next steps of Figure 4, which increases the number of working copies
of the graph and can thus reduce communication and increase the final quality of
the separators.
As a matter of fact, separator computation algorithms, which are local heuristics,
heavily depend on the quality of the coarsened graphs, and we have observed with
the sequential version of Scotch that taking the best of two partitions, obtained
from two fully independent multi-level runs, usually improved
overall ordering quality. By enabling the folding-with-duplication routine (which
will be referred to as “fold-dup” in the following) in the first coarsening levels, one
can implement this approach in parallel, every subgroup of processors that hold a
working copy of the graph being able to perform an almost-complete independent
multi-level computation, save for the very first level which is shared by all subgroups,
for the second one which is shared by half of the subgroups, and so on.
The problem with the fold-dup approach is that it consumes a lot of memory.
Consequently, a good strategy can be to resort to folding only when the number
of vertices of the graph to be considered reaches some minimum threshold. This
threshold allows one to strike a trade-off between the level of completeness of the
independent multi-level runs which result from the early stages of the fold-dup
process, which impact partitioning quality, and the amount of memory to be used
in the process.
Once all working copies of the coarsened graphs are folded on individual pro-
cessors, the algorithm enters a multi-sequential phase, illustrated at the bottom of
Figure 4: the routines of the sequential Scotch library are used on every processor
to complete the coarsening process, compute an initial partition, and prolong it
back up to the largest centralized coarsened graph stored on the processor. Then,
the partitions are prolonged back in parallel to the finer distributed graphs, select-
ing the best partition between the two available when prolonging to a level where
fold-dup had been performed. This distributed prolongation process is repeated
until we obtain a partition of the original graph.
Band refinement The third level of concurrency concerns the refinement heuris-
tics which are used to improve the prolonged separators. At the coarsest levels of
the multi-level algorithm, when computations are restricted to individual proces-
sors, the sequential FM algorithm of Scotch is used, but this class of algorithms
does not parallelize well.
This problem can be solved in two ways: either by developing scalable and
efficient local optimization algorithms, or by being able to use the existing sequential
FM algorithm on very large graphs. A solution which enables both approaches
has been proposed in [6]; it is based on the following reasoning. Since every
refinement is performed by means of a local algorithm, which perturbs only in a
limited way the position of the prolonged separator, local refinement algorithms
need only be passed a subgraph that contains the vertices that are very close to
the prolonged separator.
The computation and use of distributed band graphs is outlined in Figure 5.
Given a distributed graph and an initial separator, which can be spread across
Figure 5: Creation of a distributed band graph. Only vertices closest to the sep-
arator are kept. Other vertices are replaced by anchor vertices of equivalent total
weight, linked to band vertices of the last layer. There are two anchor vertices per
processor, to reduce communication. Once the separator has been refined on the
band graph using some local optimization algorithm, the new separator is prolonged
back to the original distributed graph.
several processors, vertices that are closer to separator vertices than some small
user-defined distance are selected by spreading distance information from all of
the separator vertices, using our halo exchange routine. Then, the distributed
band graph is created, by adding on every processor two anchor vertices, which are
connected to the last layers of vertices of each of the parts. The vertex weight of
the anchor vertices is equal to the sum of the vertex weights of all of the vertices
they replace, to preserve the balance of the two band parts. Once the separator of
the band graph has been refined using some local optimization algorithm, the new
separator is prolonged back to the original distributed graph.
Building on these band graphs, we have implemented a multi-sequential refine-
ment algorithm, outlined in Figure 6. At every distributed uncoarsening step, a
distributed band graph is created. Centralized copies of this band graph are then
gathered on every participating processor, and serve to run fully independent in-
stances of our sequential FM algorithm. The perturbation of the initial state of the
sequential FM algorithm on every processor allows us to explore slightly different
solution spaces, and thus to improve refinement quality. Finally, the best refined
band separator is prolonged back to the distributed graph, and the uncoarsening
process goes on.
Distributed graph files, which usually end in “.dgr”, describe fragments of val-
uated graphs, which can be valuated process graphs to be mapped onto target
architectures, or graphs representing the adjacency structures of matrices to order.
In Scotch, graphs are represented by means of adjacency lists: the definition
of each vertex is accompanied by the list of all of its neighbors, i.e. all of its
adjacent arcs. Therefore, the overall number of edge data is twice the number of
edges. Distributed graphs are stored as a set of files which contain each a subset
of graph vertices and their adjacencies. The purpose of this format is to speed up
the loading and saving of large graphs when working for some time with the same
number of processors: the distributed graph loading routine will allow each of the
processors to read in parallel from a different file. Consequently, the number of
files must be equal to the number of processors involved in the parallel loading phase.
The first line of a distributed graph file holds the distributed graph file version
number, which is currently 2. The second line holds the number of files across
which the graph data is distributed (referred to as procglbnbr in libScotch; see
for instance Figure 8, page 32, for a detailed example), followed by the number of
this file in the sequence (ranging from 0 to (procglbnbr − 1), and analogous to
proclocnum in Figure 8). The third line holds the global number of graph vertices
(referred to as vertglbnbr), followed by the global number of arcs (inappropriately
called edgeglbnbr, as it is in fact equal to twice the actual number of edges). The
fourth line holds the number of vertices contained in this graph fragment (analogous
to vertlocnbr), followed by its local number of arcs (analogous to edgelocnbr).
The fifth line holds two figures: the graph base index value (baseval) and a numeric
flag.
The graph base index value records the value of the starting index used to
describe the graph; it is usually 0 when the graph has been output by C programs,
and 1 for Fortran programs. Its purpose is to ease the manipulation of graphs within
each of these two environments, while providing compatibility between them.
The numeric flag, similar to the one used by the Chaco graph format [19], is
made of three decimal digits. A non-zero units digit indicates that vertex
weights are provided. A non-zero tens digit indicates that edge weights
are provided. A non-zero hundreds digit indicates that vertex labels are
provided; if it is the case, vertices can be stored in any order in the file; else, natural
order is assumed, starting from the starting global index of each fragment.
This header data is then followed by as many lines as there are vertices in the
graph fragment, that is, vertlocnbr lines. Each of these lines begins with the vertex
label, if necessary, the vertex load, if necessary, and the vertex degree, followed by
the description of the arcs. An arc is defined by the load of the edge, if necessary,
and by the label of its other end vertex. The arcs of a given vertex can be provided
in any order in its neighbor list. If vertex labels are provided, vertices can also be
stored in any order in the file.
Figure 7 shows the contents of two complementary distributed graph files mod-
eling a cube with unity vertex and edge weights and base 0, distributed across two
processors.
First file (fragment 0):
2
2 0
8 24
4 12
0 000
3 4 2 1
3 5 3 0
3 6 0 3
3 7 1 2

Second file (fragment 1):
2
2 1
8 24
4 12
0 000
3 0 6 5
3 1 7 4
3 2 4 7
3 3 5 6
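The following C fragment, written in the style of the library examples of
Section 6.2.1, sketches how such a header could be parsed. It is only an
illustration: error checking is omitted, the file name brol2-0.dgr is
hypothetical, and real code would rather rely on the SCOTCH_dgraphLoad routine
of Section 6.5.5.

#include <stdio.h>
...
FILE * fileptr;
long   versval;                     /* graph file version number           */
long   procglbnbr, proclocnum;      /* number of fragments, fragment rank  */
long   vertglbnbr, edgeglbnbr;      /* global numbers of vertices and arcs */
long   vertlocnbr, edgelocnbr;      /* local numbers of vertices and arcs  */
long   baseval;                     /* graph base index value              */
char   flagtab[4];                  /* three-digit numeric flag            */

fileptr = fopen ("brol2-0.dgr", "r");
fscanf (fileptr, "%ld", &versval);                    /* line 1: version number (2)   */
fscanf (fileptr, "%ld%ld", &procglbnbr, &proclocnum); /* line 2: file count and index */
fscanf (fileptr, "%ld%ld", &vertglbnbr, &edgeglbnbr); /* line 3: global sizes         */
fscanf (fileptr, "%ld%ld", &vertlocnbr, &edgelocnbr); /* line 4: local sizes          */
fscanf (fileptr, "%ld%3s", &baseval, flagtab);        /* line 5: base value and flag  */
/* flagtab[0], flagtab[1] and flagtab[2] are the hundreds, tens and units
   digits of the flag, telling whether vertex labels, edge weights and
   vertex weights are present, respectively                                */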
5 Programs
5.1 Invocation
All of the programs comprised in the Scotch and PT-Scotch distributions have
been designed to run in command-line mode without any interactive prompting,
so that they can be called easily from other programs by means of “system ()”
or “popen ()” system calls, or be piped together on a single shell command line.
In order to facilitate this, whenever a stream name is asked for (either on input
or output), the user may put a single “-” to indicate standard input or output.
Moreover, programs read their input in the same order as stream names are given
in the command line. This allows them to read all their data from a single stream
(usually the standard input), provided that these data are ordered properly.
A brief on-line help is provided with all the programs. To get this help, use the
“-h” option after the program name. The case of option letters is not significant,
except when both the lower and upper cases of a letter have different meanings.
When passing parameters to the programs, only the order of file names is significant;
options can be put anywhere in the command line, in any order. Examples of use
of the different programs of the PT-Scotch project are provided in section 8.
Error messages are standardized, but may not be fully explanatory. However,
most of the errors you may run into should be related to file formats, and located in
“...Load” routines. In this case, compare your data formats with the definitions
given in section 4, and use the dgtst program of the PT-Scotch distribution to
check the consistency of your distributed source graphs.
Depending on your MPI environment, you may either run the programs directly,
or else have to invoke them by means of a command such as mpirun. Check your
local MPI documentation to see how to specify the number of processors on which
to run them.
5.2 File names

5.2.1 Sequential and parallel file opening

File names can embed the following escape sequences, which determine whether
the named files are opened on the root process only or in parallel on every
process.

%p Replaced by the overall number of processes in the global communicator.
Leads to parallel opening.

%r Replaced on each process running the program by the rank of this process in
the global communicator. Leads to parallel opening.

%- Discarded, but leads to parallel opening. This sequence allows files of
identical names to be opened in parallel on every process.

%% Replaced by a single “%” character. File names using this escape sequence are
not considered for parallel opening, unless one or several of the three other
escape sequences are also present.
For instance, filename “brol” will lead to the opening of file “brol” on the root
processor only, filename “%-brol” (or even “br%-ol”) will lead to the parallel open-
ing of files called “brol” on every processor, and filename “brol%p-%r” will lead
to the opening of files “brol2-0” and “brol2-1”, respectively, on each of the two
processors on which a program of the PT-Scotch distribution would run.
5.2.2 Using compressed files
Starting from version 5.0.6, Scotch allows users to provide and retrieve data in
compressed form. Since this feature requires that the compression and decompres-
sion tasks run at the same time as data is read or written, it can only be done
on systems which support multi-threading (Posix threads) or multi-processing (by
means of fork system calls).
To determine if a stream has to be handled in compressed form, Scotch checks
its extension. If it is “.gz” (gzip format), “.bz2” (bzip2 format) or “.lzma” (lzma
format), the stream is assumed to be compressed according to the corresponding
format. A filter task will then be used to process it accordingly if the format is
implemented in Scotch and enabled on your system.
To date, data can be read and written in bzip2 and gzip formats, and can
also be read in the lzma format. Since the compression ratio of lzma on Scotch
graphs is 30% better than the one of gzip and bzip2 (which are almost equivalent
in this case), the lzma format is a very good choice for handling very large graphs.
To see how to enable compressed data handling in Scotch, please refer to Section 7.
When the compressed format allows it, several files can be provided on
the same stream, and be uncompressed on the fly. For instance, the
command “cat brol.grf.gz brol.xyz.gz | gout -.gz -.gz -Mn - brol.iv”
concatenates the topology and geometry data of some graph brol and feeds them
as a single compressed stream to the standard input of program gout, hence the
“-.gz” to indicate a compressed standard stream.
5.3 Description
5.3.1 dgmap / dgpart
Synopsis
dgmap [input graph file [input target file [output mapping file [output log
file]]]] options
dgpart number of parts [input graph file [output mapping file [output
log file]]] options
Description
The dgmap program is the parallel static mapper. It uses a static mapping
strategy to compute a mapping of the given source graph to the given target
architecture. The implemented algorithms aim at assigning source graph ver-
tices to target vertices such that every target vertex receives a set of source
vertices of summed weight proportional to the relative weight of the target
vertex in the target architecture, and such that the communication cost func-
tion fC is minimized (see Section 3.1.2 for the definition and rationale of this
cost function).
Since its main purpose is to provide mappings that exhibit high concurrency
for communication minimization in the mapped application, it comprises a
parallel implementation of the dual recursive bipartitioning algorithm [33], as
well as all of the sequential static mapping methods used by its sequential
counterpart gmap, to be used on subgraphs located on single processors.
dgpart is a simplified interface to dgmap, which performs graph partitioning
instead of static mapping. Consequently, the desired number of parts has to
be provided, in lieu of the target architecture.
The -b and -c options allow the user to set preferences on the behavior of the
mapping strategy which is used by default. The -m option allows the user to
define a custom mapping strategy.
The input graph file filename can refer either to a centralized or to a dis-
tributed graph, according to the semantics defined in Section 5.2. The map-
ping file must be a centralized file.
Options
Since the program is devoted to experimental studies, it has many optional
parameters, used to test various execution modes. Values set by default will
give best results in most cases.
-b rat
Set the maximum load imbalance ratio to rat, which should be a value
between 0 and 1. This option can be used in conjunction with
option -c, but is incompatible with option -m.
-c flags
Tune the default mapping strategy according to the given preference
flags. Some of these flags are antagonistic, while others can be combined.
See Section 6.4.1 for more information. The currently available flags are
the following.
b Enforce load balance as much as possible.
q Privilege quality over speed. This is the default behavior.
s Privilege speed over quality.
t Use only safe methods in the strategy.
x Favor scalability.
This option can be used in conjunction with option -b, but is incompat-
ible with option -m. The resulting strategy string can be displayed by
means of the -vs option.
-h Display the program synopsis.
-m strat
Apply parallel static mapping strategy strat. The format of parallel
mapping strategies is defined in section 6.4.2. This option is incompatible
with options -b and -c.
-r num
Set the number of the root process which will be used for centralized file
accesses. Set to 0 by default.
-s obj
Mask source edge and vertex weights. This option allows the user to “un-
weight” weighted source graphs by removing weights from edges and ver-
tices at loading time. obj may contain several of the following switches.
e Remove edge weights, if any.
v Remove vertex weights, if any.
-V Print the program version and copyright.
-v verb
Set verbose mode to verb, which may contain several of the following
switches.
a Memory allocation information.
m Mapping information, similar to the one displayed by the gmtst
program of the sequential Scotch distribution.
s Strategy information. This parameter displays the default mapping
strategy used by dgmap.
t Timing information.
5.3.2 dgord
Synopsis
dgord [input graph file [output ordering file [output log file]]] options
Description
The dgord program is the parallel sparse matrix block orderer. It uses an
ordering strategy to compute block orderings of sparse matrices represented
as source graphs, whose vertex weights indicate the number of DOFs per node
(if this number is non homogeneous) and whose edges are unweighted, in order
to minimize fill-in and operation count.
Since its main purpose is to provide orderings that exhibit high concur-
rency for parallel block factorization, it comprises a parallel nested dissection
method [14], but sequential classical [31] and state-of-the-art [39] minimum
degree algorithms are implemented as well, to be used on subgraphs located
on single processors.
Ordering methods can be combined by means of selection, grouping, and
condition operators, so as to define ordering strategies, which can be passed
to the program by means of the -o option. The -c option allows the user
to set preferences on the behavior of the ordering strategy which is used by
default.
The input graph file filename can refer either to a centralized or to a dis-
tributed graph, according to the semantics defined in Section 5.2. The order-
ing file must be a centralized file.
Options
Since the program is devoted to experimental studies, it has many optional
parameters, used to test various execution modes. Values set by default will
give best results in most cases.
-c flags
Tune the default ordering strategy according to the given preference flags.
Some of these flags are antagonistic, while others can be combined. See
Section 6.4.1 for more information. The resulting strategy string can be
displayed by means of the -vs option.
b Enforce load balance as much as possible.
q Privilege quality over speed. This is the default behavior.
s Privilege speed over quality.
t Use only safe methods in the strategy.
x Favor scalability.
-h Display the program synopsis.
-m output mapping file
Write to output mapping file the mapping of graph vertices to column
blocks. All of the separators and leaves produced by the nested dissection
method are considered as distinct column blocks, which may be in turn
split by the ordering methods that are applied to them. Distinct integer
numbers are associated with each of the column blocks, such that the
number of a block is always greater than the ones of its predecessors
in the elimination process, that is, its descendants in the elimination
tree. The structure of mapping files is described in detail in the relevant
section of the Scotch User’s Guide [35].
When the geometry of the graph is available, this mapping file may be
processed by program gout to display the vertex separators and super-
variable amalgamations that have been computed.
-o strat
Apply parallel ordering strategy strat. The format of parallel ordering
strategies is defined in section 6.4.4.
-r num
Set the number of the root process which will be used for centralized file
accesses. Set to 0 by default.
-t output tree file
Write to output tree file the structure of the separator tree. The data
that is written closely resembles that of a mapping file: after a first
line that contains the number of lines to follow, there are that many lines
of mapping pairs, which associate an integer number with every graph
vertex index. This integer number is the number of the column block
which is the parent of the column block to which the vertex belongs,
or −1 if the column block to which the vertex belongs is a root of the
separator tree (there can be several roots, if the graph is disconnected).
Combined with the column block mapping data produced by option -m, the
tree structure allows one to rebuild the separator tree, as illustrated by
the sketch after this option list.
-V Print the program version and copyright.
-v verb
Set verbose mode to verb, which may contain several of the following
switches.
a Memory allocation information.
s Strategy information. This parameter displays the default parallel
ordering strategy used by dgord.
t Timing information.
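As an illustration, here is a minimal C sketch, not part of the distribution,
which loads such a tree file into an array indexed by vertex label; it assumes
labels based at 0, omits all error checking, and the file name brol.tre is
hypothetical.

#include <stdio.h>
#include <stdlib.h>
...
FILE * fileptr;
long   vertnbr, vertnum, cblknum;
long * treetab;

fileptr = fopen ("brol.tre", "r");
fscanf (fileptr, "%ld", &vertnbr);              /* number of mapping pairs        */
treetab = malloc (vertnbr * sizeof (long));
for (long vertidx = 0; vertidx < vertnbr; vertidx ++) {
  fscanf (fileptr, "%ld%ld", &vertnum, &cblknum);
  treetab[vertnum] = cblknum;                   /* parent block, or -1 for a root */
}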
5.3.3 dgpart
Synopsis
dgpart [number of parts [input graph file [output mapping file [output log
file]]]] options
Description
The dgpart program is a simplified interface to dgmap which performs graph
partitioning instead of static mapping; it is described jointly with dgmap in
Section 5.3.1.
5.3.4 dgscat
Synopsis
dgscat [input graph file [output graph file]] options
Description
The dgscat program creates a distributed source graph, in the Scotch dis-
tributed graph format, from the given centralized source graph file.
The input graph file filename should therefore refer to a centralized graph,
while output graph file must refer to a distributed graph, according to the
semantics defined in Section 5.2.
Options
-c Check the consistency of the distributed graph at the end of the graph
loading phase.
-h Display the program synopsis.
-r num
Set the number of the root process which will be used for centralized file
accesses. Set to 0 by default.
-V Print the program version and copyright.
5.3.5 dgtst
Synopsis
dgtst [input graph file [output data file]] options
Description
The program dgtst is the source graph tester. It checks the consistency of
the input source graph structure (matching of arcs, number of vertices and
edges, etc.), and gives some statistics regarding edge weights, vertex weights,
and vertex degrees.
It produces the same results as the gtst program of the Scotch sequential
distribution.
Options
6 Library
All of the features provided by the programs of the PT-Scotch distribution may
be directly accessed by calling the appropriate functions of the libScotch library,
archived in files libptscotch.a and libptscotcherr.a. All of the existing parallel
routines belong to four distinct classes:
• distributed source graph handling routines, which serve to declare, build, load,
save, and check the consistency of distributed source graphs;
• strategy handling routines, which allow the user to declare and build parallel
mapping and ordering strategies;
• parallel graph partitioning and static mapping routines, which allow the user
to declare, compute, and save distributed static mappings of distributed source
graphs;
• parallel ordering routines, which allow the user to declare, compute, and save
distributed orderings of distributed source graphs.
Error handling is performed using the existing sequential routines of the Scotch
distribution, which are described in the Scotch User’s Guide [35]. Their use is
recalled in Section 6.11.
A ParMeTiS compatibility library, called libptscotchparmetis.a, is also
available. It allows users who were previously using ParMeTiS in their software to
take advantage of the efficiency of PT-Scotch without having to modify their code.
The services provided by this library are described in Section 6.13.
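As a preview of how these classes of routines combine, the following sketch
loads a distributed graph and partitions it into four parts with the default
strategy. It is only an outline: error handling and MPI initialization are
elided, the file name is hypothetical, and the authoritative prototypes are
those of the subsections referenced in the comments.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include "ptscotch.h"
...
SCOTCH_Dgraph grafdat;                          /* distributed graph          */
SCOTCH_Strat  stradat;                          /* partitioning strategy      */
SCOTCH_Num    vertglbnbr, vertlocnbr;
SCOTCH_Num    edgeglbnbr, edgelocnbr;
SCOTCH_Num *  partloctab;                       /* part of every local vertex */
FILE *        fileptr;

SCOTCH_dgraphInit (&grafdat, MPI_COMM_WORLD);   /* Section 6.5.2              */
fileptr = fopen ("brol2-0.dgr", "r");           /* one fragment per process   */
SCOTCH_dgraphLoad (&grafdat, fileptr, -1, 0);   /* Section 6.5.5              */
SCOTCH_dgraphSize (&grafdat, &vertglbnbr, &vertlocnbr,
                   &edgeglbnbr, &edgelocnbr);   /* Section 6.5.11             */
SCOTCH_stratInit  (&stradat);                   /* empty strategy: defaults   */
partloctab = malloc (vertlocnbr * sizeof (SCOTCH_Num));
SCOTCH_dgraphPart (&grafdat, 4, &stradat, partloctab); /* Section 6.6.1       */
...
SCOTCH_stratExit  (&stradat);
SCOTCH_dgraphExit (&grafdat);                   /* Section 6.5.3              */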
6.2 Calling the routines of libScotch
6.2.1 Calling from C
All of the C routines of the libScotch library are prefixed with “SCOTCH_”. The
remainder of the function names is made of the name of the type of object to which
the functions apply (e.g. “dgraph”, “dorder”, etc.), followed by the type of action
performed on this object: “Init” for the initialization of the object, “Exit” for the
freeing of its internal structures, “Load” for loading the object from one or several
streams, and so on.
Typically, functions that return an error code return zero if the function suc-
ceeds, and a non-zero value in case of error.
For instance, the SCOTCH_dgraphInit and SCOTCH_dgraphLoad routines, de-
scribed in section 6.5, can be called from C by using the following code.
#include <stdio.h>
#include <mpi.h>
#include "ptscotch.h"
...
SCOTCH_Dgraph grafdat;
FILE * fileptr;
if (SCOTCH_dgraphInit (&grafdat, MPI_COMM_WORLD) != 0) {
... /* Error handling */
}
if ((fileptr = fopen ("brol.grf", "r")) == NULL) {
... /* Error handling */
}
if (SCOTCH_dgraphLoad (&grafdat, fileptr, -1, 0) != 0) {
... /* Error handling */
}
...
Since “ptscotch.h” uses several system and communication objects which are
declared in “stdio.h” and “mpi.h”, respectively, these latter files must be included
beforehand in your application code.
Although the “scotch.h” and “ptscotch.h” files may look very similar on
your system, never mistake them, and always use the “ptscotch.h” file as the
right include file for compiling a program which uses the parallel routines of the
libScotch library, whether it also calls sequential routines or not.
6.2.2 Calling from Fortran

Since all the data structures used in libScotch are opaque, equivalent declara-
tions for these structures must be provided in Fortran. These structures must there-
fore be defined as arrays of DOUBLEPRECISIONs, of sizes given in file ptscotchf.h,
which must be included whenever necessary.
For routines that read or write data using a FILE * stream in C, the Fortran
counterpart uses an INTEGER parameter which is the number of the Unix file descrip-
tor corresponding to the logical unit from which to read or write. In most Unix
implementations of Fortran, standard descriptors 0 for standard input (logical unit
5), 1 for standard output (logical unit 6) and 2 for standard error are opened by
default. However, for files that are opened using OPEN statements, an additional
function must be used to obtain the number of the Unix file descriptor from the
number of the logical unit. This function is called PXFFILENO in the standardized
POSIX Fortran API, and files which use it should include the USE IFPOSIX direc-
tive whenever necessary. An alternative, non-standard, function also exists in most
Unix implementations of Fortran, and is called FNUM.
For instance, the SCOTCH_dgraphInit and SCOTCH_dgraphLoad routines, de-
scribed in sections 6.5.2 and 6.5.5, respectively, can be called from Fortran by using
the following code.
INCLUDE "mpif.h"
INCLUDE "ptscotchf.h"
DOUBLEPRECISION GRAFDAT(SCOTCH_DGRAPHDIM)
INTEGER RETVAL
...
CALL SCOTCHFDGRAPHINIT (GRAFDAT (1), MPI_COMM_WORLD, RETVAL)
IF (RETVAL .NE. 0) THEN
...
OPEN (10, FILE=’brol.grf’)
CALL SCOTCHFDGRAPHLOAD (GRAFDAT (1), FNUM (10), 1, 0, RETVAL)
CLOSE (10)
IF (RETVAL .NE. 0) THEN
...
Although the “scotchf.h” and “ptscotchf.h” files may look very similar on
your system, never mistake them, and always use the “ptscotchf.h” file as the
include file for compiling a Fortran program that uses the parallel routines of the
libScotch library, whether it also calls sequential routines or not.
All of the Fortran routines of the libScotch library are stubs which call their C
counterpart. While this poses no problem for the usual integer and double precision
data types, some conflicts may occur at compile or run time if your MPI implemen-
tation does not represent the MPI_Comm type in the same way in C and in Fortran.
Please check, in the mpi.h include file of your platform, whether the MPI_Comm data
type is represented as an int. If it is the case, there should be no problem in using
the Fortran routines of the PT-Scotch library.
error routines that print an error message and exit are provided in the classical
Scotch library file libptscotcherr.a.
Therefore, the linking of applications that make use of the libScotch li-
brary with standard error handling is carried out by using the following options:
“-lptscotch -lptscotcherr -lmpi -lm”. The “-lmpi” option is most often not
necessary, as the MPI library is automatically considered when compiling with com-
mands such as mpicc.
If you want to handle errors by yourself, you should not link with library file
libptscotcherr.a, but rather provide a SCOTCH_errorPrint() routine. Please
refer to Section 6.11 for more information on error handling.
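For instance, minimal replacement handlers could look as follows; this is an
illustrative sketch, the message prefixes are arbitrary, and the exact
prototypes of the error routines are recalled in Section 6.11.

#include <stdarg.h>
#include <stdio.h>

/* User-provided handlers replacing those of libptscotcherr.a.     */
/* Both behave like printf; the first one is called on errors, the */
/* second one on warnings.                                         */

void
SCOTCH_errorPrint (const char * const errstr, ...)
{
  va_list arglist;

  fprintf  (stderr, "My application, libScotch ERROR: ");
  va_start (arglist, errstr);
  vfprintf (stderr, errstr, arglist);
  va_end   (arglist);
  fprintf  (stderr, "\n");
}

void
SCOTCH_errorPrintW (const char * const errstr, ...)
{
  va_list arglist;

  fprintf  (stderr, "My application, libScotch WARNING: ");
  va_start (arglist, errstr);
  vfprintf (stderr, errstr, arglist);
  va_end   (arglist);
  fprintf  (stderr, "\n");
}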
Also, the MeTiS compatibility library provided by Scotch will not work when
SCOTCH_Nums are not ints, since the interface of MeTiS uses regular ints to represent
graph indices. In addition to compile-time warnings, an error message will be issued
when one of these routines is called.
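As a simple safeguard, a program can check this assumption at run time before
calling any of the compatibility routines; the following fragment is an
illustrative sketch.

#include <stdio.h>
#include <mpi.h>
#include "ptscotch.h"
...
/* The MeTiS compatibility routines are only usable when SCOTCH_Num
   and int have the same size                                       */
if (sizeof (SCOTCH_Num) != sizeof (int)) {
  fprintf (stderr, "SCOTCH_Num is not an int: do not call the MeTiS compatibility routines\n");
  MPI_Abort (MPI_COMM_WORLD, 1);
}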
In all of the following, whenever arrays are defined, passed, and accessed, it is
assumed that the first element of these arrays is always labeled as baseval, whether
baseval is set to 0 (for C-style arrays) or 1 (for Fortran-style arrays). PT-Scotch
internally manages with base values and array pointers so as to process these arrays
accordingly.
baseval
Base value for all array indexings.
vertglbnbr
Overall number of vertices in the distributed graph.
edgeglbnbr
Overall number of arcs in the distributed graph. Since edges are represented
by both of their ends, the number of edge data in the graph is twice the
number of edges.
procglbnbr
Overall number of processes that share distributed graph data.
proccnttab
Array holding the current number of local vertices borne by every process.
procvrttab
Array holding the global indices from which the vertices of every process are
numbered. For optimization purposes, this array has an extra slot which
stores a number which must be greater than all of the assigned global in-
dices. For each process p, it must be ensured that procvrttab[p + 1] ≥
(procvrttab[p] + proccnttab[p]), that is, that no process can have more
local vertices than allowed by its range of global indices. When the global
numbering of vertices is continuous, for each process p, procvrttab[p + 1] =
(procvrttab[p] + proccnttab[p]).
vertgstnbr
Number of both local and ghost vertices borne by the given process. Ghost
vertices are local images of neighboring vertices located on distant processes.
vertloctab
Array of start indices in edgeloctab and edgegsttab of vertex adjacency
sub-arrays.
vendloctab
Array of after-last indices in edgeloctab and edgegsttab of vertex adja-
cency sub-arrays. For any local vertex i, with baseval ≤ i < (baseval +
vertlocnbr), vendloctab[i] − vertloctab[i] is the degree of vertex i.
When all vertex adjacency lists are stored in order in edgeloctab with-
out any empty space between them, it is possible to save memory by
not allocating the physical memory for vendloctab. In this case, illus-
trated in Figure 8, vertloctab is of size vertlocnbr + 1 and vendloctab
points to vertloctab + 1. For these graphs, called “compact edge array
graphs”, or “compact graphs” for short, vertloctab is sorted in ascend-
ing order, vertloctab[baseval] = baseval and vertloctab[baseval +
vertlocnbr] = (baseval + edgelocnbr).
Since vertloctab and vendloctab only account for local vertices and not for
ghost vertices, the sum across all processes of the sizes of these arrays does
not depend on the number of ghost vertices; it is equal to (v + p) for compact
graphs, and to 2v otherwise.
veloloctab
Optional array, of size vertlocnbr, holding the integer load associated with
every vertex.
edgeloctab
Array, of a size equal at least to (max_i (vendloctab[i]) − baseval), hold-
ing the adjacency array of every local vertex. For any local vertex i, with
baseval ≤ i < (baseval + vertlocnbr), the global indices of the neigh-
bors of i are stored in edgeloctab from edgeloctab[vertloctab[i]] to
edgeloctab[vendloctab[i] − 1], inclusive.
Since ghost vertices do not have adjacency arrays, because only arcs from
local vertices to ghost vertices are recorded and not the opposite, the overall
sum of the sizes of all edgeloctab arrays is e.
edgegsttab
Optional array holding the local and ghost indices of neighbors of local ver-
tices. For any local vertex i, with baseval ≤ i < (baseval + vertlocnbr),
the local and ghost indices of the neighbors of i are stored in edgegsttab from
edgegsttab[vertloctab[i]] to edgegsttab[vendloctab[i]−1], inclusive.
Local vertices are numbered in global vertex order, starting from baseval to
(baseval + vertlocnbr − 1), inclusive. Ghost vertices are also numbered in
global vertex order, from (baseval+vertlocnbr) to (baseval+vertgstnbr−
1), inclusive.
Only edgeloctab has to be provided by the user. edgegsttab is internally
computed by PT-Scotch whenever needed, or can be explicitly asked for
by the user by calling function SCOTCH_dgraphGhst. This array can serve
to index user-defined arrays of quantities borne by graph vertices, which can
be exchanged between neighboring processes thanks to the SCOTCH_dgraphHalo
routine documented in Section 6.5.14 (see the sketch after this list).
edloloctab
Optional array, of a size equal at least to (maxi (vendloctab[i]) − baseval),
holding the integer load associated with every arc. Matching arcs should
always have identical loads.
Dynamic graphs can be handled elegantly by using the vendloctab and
procvrttab arrays. In order to dynamically manage distributed graphs, one just
has to reserve index ranges large enough to create new vertices on each process,
and to allocate vertloctab, vendloctab and edgeloctab arrays that are large
enough to contain all of the expected new vertex and edge data. This can be done
by passing SCOTCH dgraphBuild a maximum number of local vertices, vertlocmax,
greater than the current number of local vertices, vertlocnbr.
On every process p, vertices are globally labeled starting from procvrttab[p],
and locally labeled from baseval, leaving free space at the end of the local arrays.
To remove some vertex of local index i, one just has to replace vertloctab[i] and
vendloctab[i] with the values of vertloctab[vertlocnbr − 1] and vendloctab
[vertlocnbr − 1], respectively, and browse the adjacencies of all neighbors of
former vertex (vertlocnbr − 1) so that all occurrences of index (vertlocnbr −
1) are turned into i. Then, vertlocnbr must be decremented, and SCOTCH
dgraphBuild() must be called to account for the change of topology. If a graph
building routine such as SCOTCH dgraphLoad() or SCOTCH dgraphBuild() had
already been called on the SCOTCH Dgraph structure, SCOTCH dgraphFree() has
to be called first in order to free the internal structures associated with the older
version of the graph; otherwise, these data would be lost, which would result in a
memory leak.
To add a new vertex, one has to fill vertloctab[vertlocnbr − 1] and
vendloctab[vertlocnbr − 1] with the starting and end indices of the adjacency
sub-array of the new vertex. Then, the adjacencies of its neighbor vertices must
also be updated to account for it. If free space had been reserved at the end of
each of the neighbors, one just has to increment the vendloctab[i] values of
every neighbor i, and add the index of the new vertex at the end of the adjacency
sub-array. If the sub-array cannot be extended, then it has to be copied elsewhere
in the edge array, and both vertloctab[i] and vendloctab[i] must be updated
accordingly. With simple housekeeping of free areas of the edge array, dynamic
arrays can be updated with as little data movement as possible. A code sketch of
the removal procedure described above is given after Figure 9.
[Graph drawings of Figure 8 not reproduced; the data displayed in the figure are
the following, with the values of processes 0, 1 and 2 separated by “|”:]
Duplicated data: baseval 1; vertglbnbr 8; edgeglbnbr 26; procglbnbr 3;
proccnttab 3 2 3; procvrttab 1 4 6 9.
Local data:
vertlocnbr  3 | 2 | 3
vertgstnbr  5 | 6 | 5
edgelocnbr  9 | 8 | 9
vertloctab  1 3 7 10 | 1 6 9 | 1 4 6 10   (compact: vendloctab = vertloctab + 1)
edgeloctab  3 2 3 5 4 1 4 2 1 | 3 6 2 8 5 8 2 4 | 4 7 8 6 8 4 6 7 5
edgegsttab  3 2 3 5 4 1 4 2 1 | 4 5 3 6 2 6 3 1 | 4 2 3 1 3 4 1 2 5
Figure 8: Sample distributed graph and its description by libScotch arrays using
a continuous numbering and compact edge arrays. Numbers within vertices are
vertex indices. Top graph is a global view of the distributed graph, labeled with
global, continuous, indices. Bottom graphs are local views labeled with local and
ghost indices, where ghost vertices are drawn in black. Since the edge array is
compact, all vertloctab arrays are of size vertlocnbr+ 1, and vendloctab points
to vertloctab + 1. edgeloctab edge arrays hold global indices of end vertices,
while optional edgegsttab edge arrays hold local and ghost indices. edgelocnbr
is the local number of arcs (that is, twice the number of edges), including arcs
to local vertices as well as to ghost vertices. veloloctab and edloloctab are not
represented.
[Graph drawings of Figure 9 not reproduced; the data displayed in the figure are
the following, with the values of processes 0, 1 and 2 separated by “|”:]
Duplicated data: baseval 1; vertglbnbr 8; edgeglbnbr 26; procglbnbr 3;
proccnttab 3 2 3; procvrttab 1 11 17 99.
Local data:
vertlocnbr  3 | 2 | 3
vertgstnbr  5 | 6 | 5
edgelocnbr  9 | 8 | 9
vertloctab  9 1 6 | 6 2 | 1 4 7
vendloctab  11 5 9 | 11 5 | 4 6 11
edgeloctab  3 12 11 1 11 2 1 3 2 | 19 2 11 3 17 2 19 12 | 11 18 19 17 19 11 17 18 12
edgegsttab  3 5 4 1 4 2 1 3 2 | 6 3 1 4 5 3 6 2 | 4 2 3 1 3 4 1 2 5
Figure 9: Adjacency structure of the sample graph of Figure 8 with a disjoint edge
array and a discontinuous ordering. Both vertloctab and vendloctab are of size
vertlocnbr. This allows for the handling of dynamic graphs, the structure of which
can evolve with time.
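To illustrate the removal procedure described above, here is a minimal C sketch,
assuming a 0-based graph with disjoint vertloctab and vendloctab arrays, as in
Figure 9. The removeVertex helper and its prototype are hypothetical; the renam-
ing of occurrences of the moved vertex in the adjacencies of remote neighbors, as
well as the removal of the arcs that pointed to the deleted vertex, are deliberately
not shown.

/* Hypothetical helper: remove local vertex i from a 0-based dynamic graph
   with disjoint vertloctab/vendloctab arrays, by moving the last local
   vertex (local index vertlocnbr - 1) into slot i, as described in the
   text above. procvrtval is the global index from which the local vertices
   of this process are numbered (procvrttab[p]). The graph must afterwards
   be re-built with SCOTCH_dgraphBuild(), after a call to
   SCOTCH_dgraphFree(). Remote neighbors and arc removal are not handled. */
static int
removeVertex (int * vertloctab, int * vendloctab, int * edgeloctab,
              int * vertlocptr, int procvrtval, int i)
{
  int vertlocnbr = *vertlocptr;
  int lastnum    = vertlocnbr - 1;          /* Local index of last vertex */
  int j, k;

  if ((i < 0) || (i >= vertlocnbr))
    return (1);

  vertloctab[i] = vertloctab[lastnum];      /* Move adjacency bounds      */
  vendloctab[i] = vendloctab[lastnum];

  for (j = vertloctab[i]; j < vendloctab[i]; j ++) { /* For all neighbors */
    int nghblocnum = edgeloctab[j] - procvrtval;     /* Local index, if local */
    if ((nghblocnum < 0) || (nghblocnum >= vertlocnbr))
      continue;                             /* Remote neighbors not shown */
    for (k = vertloctab[nghblocnum]; k < vendloctab[nghblocnum]; k ++)
      if (edgeloctab[k] == (procvrtval + lastnum)) /* Rename moved vertex */
        edgeloctab[k] = procvrtval + i;
  }

  *vertlocptr = vertlocnbr - 1;             /* One local vertex fewer     */
  return (0);
}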
permtab
Array holding the permutation of the reordered matrix. Thus, if k =
permtab[i], then row i of the original matrix is now row k of the reordered
matrix, that is, row i is the kth pivot.
peritab
Inverse permutation of the reordered matrix. Thus, if i = peritab[k], then
row k of the reordered matrix was row i of the original matrix.
cblknbr
Number of column blocks (that is, supervariables) in the block ordering.
rangtab
Array of ranges for the column blocks. Column block c, with baseval ≤
c < (cblknbr + baseval), contains columns with indices ranging from
rangtab[c] to rangtab[c + 1], exclusive, in the reordered matrix. There-
fore, rangtab[baseval] is always equal to baseval, and rangtab[cblknbr
+ baseval] is always equal to vertglbnbr+baseval. In order to avoid mem-
ory errors when column blocks are all single columns, the size of rangtab must
always be one more than the number of columns, that is, vertglbnbr + 1.
treetab
Array of ascendants of permuted column blocks in the separators tree.
treetab[i] is the index of the father of column block i in the separators
tree, or −1 if column block i is the root of the separators tree. Whenever sep-
arators or leaves of the separators tree are split into subblocks, as the block
splitting, minimum fill or minimum degree methods do, all subblocks of the
same level are linked to the column block of higher index belonging to the
closest separator ancestor. Indices in treetab are based, in the same way as
for the other blocking structures. See Figure 10 for a complete example; a
usage sketch follows the figure.
[Grid drawings of Figure 10 not reproduced; the arrays displayed in the figure
are the following:]
permtab   2 3 10 6 4 11 8 7 1 12 5 9
peritab   9 1 2 5 11 4 8 7 12 3 6 10
cblknbr   7
rangtab   1 2 4 5 6 8 10 13
treetab   3 3 7 6 6 7 −1
Figure 10: Arrays resulting from the ordering by complete nested dissection of a 4
by 3 grid based from 1. Leftmost grid is the original grid, and rightmost grid is the
reordered grid, with separators shown and column block indices written in bold.
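As a usage sketch of these arrays, the following C fragment applies the data of
Figure 10 (stored in 0-indexed C arrays holding 1-based values): it uses permtab
to permute a vector of unknowns, and rangtab and treetab to walk the column
blocks. The program itself is illustrative only.

#include <stdio.h>

int
main (void)
{
  /* Ordering arrays of Figure 10, with base value 1 */
  int    permtab[] = { 2, 3, 10, 6, 4, 11, 8, 7, 1, 12, 5, 9 };
  int    rangtab[] = { 1, 2, 4, 5, 6, 8, 10, 13 };
  int    treetab[] = { 3, 3, 7, 6, 6, 7, -1 };
  int    cblknbr   = 7;
  int    vertnbr   = 12;
  double xold[13], xnew[13];                /* 1-based; slot 0 unused */
  int    c, i;

  for (i = 1; i <= vertnbr; i ++)           /* Some original unknowns */
    xold[i] = (double) i;

  for (i = 1; i <= vertnbr; i ++)           /* Row i becomes row permtab[i] */
    xnew[permtab[i - 1]] = xold[i];

  for (c = 0; c < cblknbr; c ++)            /* Walk the column blocks */
    printf ("Column block %d: columns %d to %d, father %d\n",
            c + 1, rangtab[c], rangtab[c + 1] - 1, treetab[c]);

  return (0);
}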
SCOTCH STRATQUALITY
Favor quality over speed. This is the default behavior of strategy strings
when they are used just after being initialized.
SCOTCH STRATSPEED
Favor speed over quality.
SCOTCH STRATBALANCE
Enforce load balance as much as possible.
SCOTCH STRATSAFETY
Do not use methods that can lead to the occurrence of problematic events,
such as floating point exceptions, which could not be properly handled by the
calling software.
SCOTCH STRATSCALABILITY
Favor scalability as much as possible.
(strat )
Grouping operator. The strategy enclosed within the parentheses is treated
as a single mapping method.
cond1 |cond2
Logical or operator. The result of the condition is true if cond1 or cond2
are true, or both.
cond1 &cond2
Logical and operator. The result of the condition is true only if both
cond1 and cond2 are true.
!cond
Logical not operator. The result of the condition is true only if cond is
false.
var relop val
Relational operator, where var is a node variable, val is either a node
variable or a constant of the type of variable var, and relop is one of ’<’,
’=’, and ’>’. The node variables are listed below, along with their types.
edge
The global number of arcs of the current subgraph. Integer.
levl
The level of the subgraph in the recursion tree, starting from zero
for the initial graph at the root of the tree. Integer.
load
The overall sum of the vertex loads of the subgraph. It is equal to
vert if the graph has no vertex loads. Integer.
mdeg
The maximum degree of the subgraph. Integer.
proc
The number of processes on which the current subgraph is dis-
tributed at this level of the separators tree. Integer.
rank
The rank of the current process among the group of processes on
which the current subgraph is distributed at this level of the sepa-
rators tree. Integer.
vert
The global number of vertices of the current subgraph. Integer.
method [{parameters}]
Parallel graph mapping method. Available parallel mapping methods are
listed below.
seq=strat
Set the sequential mapping strategy that is used on every centralized
subgraph of the recursion tree, once the dual recursive bipartitioning
process has gone far enough such that the number of processes handling
some subgraph is restricted to one.
sep=strat
Set the parallel graph bipartitioning strategy that is used on every cur-
rent job of the recursion tree. Parallel graph bipartitioning strategies are
described below, in section 6.4.3.
strat1 |strat2
Selection operator. The result of the selection is the best bipartition of the
two that are obtained by the distinct application of strat1 and strat2 to the
current bipartition.
strat1 strat2
Combination operator. Strategy strat2 is applied to the bipartition resulting
from the application of strategy strat1 to the current bipartition. Typically,
the first method used should compute an initial bipartition from scratch, and
every following method should use the result of the previous one as its starting
point.
(strat )
Grouping operator. The strategy enclosed within the parentheses is treated
as a single bipartitioning method.
cond1 |cond2
Logical or operator. The result of the condition is true if cond1 or cond2
are true, or both.
cond1 &cond2
Logical and operator. The result of the condition is true only if both
cond1 and cond2 are true.
!cond
Logical not operator. The result of the condition is true only if cond is
false.
var relop val
Relational operator, where var is a graph or node variable, val is either
a graph or node variable or a constant of the type of variable var , and
relop is one of ’<’, ’=’, and ’>’. The graph and node variables are listed
below, along with their types.
edge
The global number of edges of the current subgraph. Integer.
levl
The level of the subgraph in the bipartition or multi-level tree, start-
ing from zero for the initial graph at the root of the tree. Integer.
load
The overall sum of the vertex loads of the subgraph. It is equal to
vert if the graph has no vertex loads. Integer.
load0
The vertex load of the first subset of the current bipartition of the
current graph. Integer.
proc
The number of processes on which the current subgraph is dis-
tributed at this level of the nested dissection process. Integer.
rank
The rank of the current process among the group of processes on
which the current subgraph is distributed at this level of the nested
dissection process. Integer.
vert
The number of vertices of the current subgraph. Integer.
The currently available parallel graph bipartitioning methods are the following;
an illustrative strategy string combining several of them is given after this list.
b Band method. Starting from the current distributed graph and its parti-
tion, this method creates a new distributed graph reduced to the vertices
which are at most at a given distance from the current frontier, runs a
parallel graph bipartitioning strategy on this graph, and prolongs back
the new bipartition to the current graph. This method is primarily used
to run bipartition refinement methods during the uncoarsening phase of
the multi-level parallel graph bipartitioning method. The parameters of
the band method are listed below.
bnd=strat
Set the parallel graph bipartitioning strategy to be applied to the
band graph.
org=strat
Set the parallel graph bipartitioning strategy to be applied to the
full distributed graph if the band graph could not be extracted.
width=val
Set the maximum distance from the current frontier of vertices to be
kept in the band graph. 0 means that only frontier vertices them-
selves are kept, 1 that immediate neighboring vertices are kept too,
and so on.
d Parallel diffusion method. This method, presented in its sequential for-
mulation in [34], flows two kinds of antagonistic liquids, scotch and anti-
scotch, from two source vertices, and sets the new frontier as the limit
between vertices which contain scotch and the ones which contain anti-
scotch. Because selecting the source vertices is essential to obtaining
useful results, this method has been hard-coded so that the two
source vertices are the two vertices of highest indices, since in the band
method these are the anchor vertices which represent all of the removed
vertices of each part. Therefore, this method must be used on band
graphs only, or on specifically crafted graphs. Applying it to any other
graphs is very likely to lead to extremely poor results. The parameters
of the diffusion bipartitioning method are listed below.
dif=rat
Fraction of liquid which is diffused to neighbor vertices at each pass.
To achieve convergence, the sum of the dif and rem parameters must
be equal to 1, but in order to speed up the diffusion process, other
combinations of higher sum can be tried. In this case, the number of
passes must be kept low, to avoid numerical overflows which would
make the results useless.
pass=nbr
Set the number of diffusion sweeps performed by the algorithm. This
number depends on the width of the band graph to which the diffu-
sion method is applied. Useful values range from 30 to 500 according
to chosen dif and rem coefficients.
rem=rat
Fraction of liquid which remains on vertices at each pass. See above.
m Parallel multi-level method. The parameters of the multi-level method
are listed below.
asc=strat
Set the strategy that is used to refine the distributed bipartition ob-
tained at ascending levels of the uncoarsening phase by prolongation
of the bipartition computed for coarser graphs. This strategy is not
applied to the coarsest graph, for which only the low strategy is
used.
dlevl=nbr
Set the minimum level after which duplication is allowed in the fold-
ing process. A value of −1 results in duplication being always per-
formed when folding.
dvert=nbr
Set the average number of vertices per process under which the fold-
ing process is performed during the coarsening phase.
low=strat
Set the strategy that is used to compute the bipartition of the coars-
est distributed graph, at the lowest level of the coarsening process.
rat=rat
Set the threshold maximum coarsening ratio over which graphs are
no longer coarsened. The ratio of any given coarsening cannot be
less than 0.5 (the case of a perfect matching), and cannot be greater
than 1.0. Coarsening stops when either the coarsening ratio is above
the maximum coarsening ratio, or the graph has fewer node vertices
than the minimum number of vertices allowed.
vert=nbr
Set the threshold minimum size under which graphs are no longer
coarsened. Coarsening stops when either the coarsening ratio is
above the maximum coarsening ratio, or the graph has fewer node
vertices than the minimum number of vertices allowed.
q Multi-sequential method. The current distributed graph and its sep-
arator are centralized on every process that holds a part of it, and a
sequential graph bipartitioning method is applied independently to each
of them. Then, the best bipartition found is prolonged back to the dis-
tributed graph. This method is primarily designed to operate on band
graphs, which are orders of magnitude smaller than their parent graph.
Else, memory bottlenecks are very likely to occur. The parameters of
the multi-sequential method are listed below.
strat=strat
Set the sequential edge separation strategy that is used to refine
the bipartition of the centralized graph. For a description of all of
the available sequential bipartitioning methods, please refer to the
Scotch User’s Guide [35].
x Load balance enforcement method. This method moves vertices from
the heaviest part to the lightest one, so as to reduce load imbalance
as much as possible without degrading communication load too much.
The only parameter of this method is listed below.
sbbt=nbr
Number of sub-buckets to sort communication gains. 5 is a common
value.
z Zero method. This method moves all of the vertices to the first part,
resulting in an empty frontier. Its main use is to stop the bipartitioning
process whenever some condition is true.
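As announced above, here is an illustrative parallel graph bipartitioning strategy
string built according to this grammar. It is not the library default, which can be
obtained by saving a freshly initialized strategy by means of the SCOTCH stratSave
routine; the sequential strategy letter f passed to the multi-sequential methods is
assumed to be a valid sequential bipartitioning method of the Scotch User's
Guide [35]. This string coarsens the graph down to 100 vertices, bipartitions the
coarsest graph with the multi-sequential method, and refines the prolonged bipar-
tition at every uncoarsening level on a band graph of width 3:

  m{vert=100,rat=0.7,low=q{strat=f},asc=b{width=3,org=z,bnd=q{strat=f}}}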
(strat )
Grouping operator. The strategy enclosed within the parentheses is treated
as a single ordering method.
cond1 |cond2
Logical or operator. The result of the condition is true if cond1 or cond2
are true, or both.
cond1 &cond2
Logical and operator. The result of the condition is true only if both
cond1 and cond2 are true.
!cond
Logical not operator. The result of the condition is true only if cond is
false.
var relop val
Relational operator, where var is a node variable, val is either a node
variable or a constant of the type of variable var, and relop is one of ’<’,
’=’, and ’>’. The node variables are listed below, along with their types.
edge
The global number of arcs of the current subgraph. Integer.
levl
The level of the subgraph in the separators tree, starting from zero
for the initial graph at the root of the tree. Integer.
load
The overall sum of the vertex loads of the subgraph. It is equal to
vert if the graph has no vertex loads. Integer.
mdeg
The maximum degree of the subgraph. Integer.
proc
The number of processes on which the current subgraph is dis-
tributed at this level of the separators tree. Integer.
rank
The rank of the current process among the group of processes on
which the current subgraph is distributed at this level of the sepa-
rators tree. Integer.
vert
The global number of vertices of the current subgraph. Integer.
method [{parameters}]
Parallel graph ordering method. Available parallel ordering methods are listed
below.
ole=strat
Set the parallel ordering strategy that is used on every distributed leaf of
the parallel separators tree if the node separation strategy sep has failed
to separate it further.
ose=strat
Set the parallel ordering strategy that is used on every distributed sep-
arator of the separators tree.
osq=strat
Set the sequential ordering strategy that is used on every centralized sub-
graph of the separators tree, once the nested dissection process has gone
far enough such that the number of processes handling some subgraph is
restricted to one.
sep=strat
Set the parallel node separation strategy that is used on every current
leaf of the separators tree to make it grow. Parallel node separation
strategies are described below, in section 6.4.5.
q Sequential ordering method. The distributed graph is gathered onto a single
process which runs a sequential ordering strategy. The only parameter of the
sequential method is given below.
strat=strat
Set the sequential ordering strategy that is applied to the centralized
graph. For a description of all of the available sequential ordering meth-
ods, please refer to the Scotch User’s Guide [35].
s Simple method. Vertices are ordered in their natural order. This method is
fast, and should be used to order separators if the number of extra-diagonal
blocks is not relevant.
strat1 |strat2
Selection operator. The result of the selection is the best vertex separator of
the two that are obtained by the distinct application of strat1 and strat2 to
the current separator.
strat1 strat2
Combination operator. Strategy strat2 is applied to the vertex separator
resulting from the application of strategy strat1 to the current separator.
Typically, the first method used should compute an initial separation from
scratch, and every following method should use the result of the previous one
as a starting point.
(strat )
Grouping operator. The strategy enclosed within the parentheses is treated
as a single separation method.
/cond ?strat1 [:strat2];
Condition operator. According to the result of the evaluation of condition
cond, either strat1 or strat2 (if it is present) is applied. The condition applies
to the characteristics of the current subgraph, and can be built from logical
and relational operators. Conditional operators are listed below, by increasing
precedence.
cond1 |cond2
Logical or operator. The result of the condition is true if cond1 or cond2
are true, or both.
cond1 &cond2
Logical and operator. The result of the condition is true only if both
cond1 and cond2 are true.
!cond
Logical not operator. The result of the condition is true only if cond is
false.
var relop val
Relational operator, where var is a graph or node variable, val is either
a graph or node variable or a constant of the type of variable var , and
relop is one of ’<’, ’=’, and ’>’. The graph and node variables are listed
below, along with their types.
edge
The global number of edges of the current subgraph. Integer.
levl
The level of the subgraph in the separators tree, starting from zero
for the initial graph at the root of the tree. Integer.
load
The overall sum of the vertex loads of the subgraph. It is equal to
vert if the graph has no vertex loads. Integer.
proc
The number of processes on which the current subgraph is dis-
tributed at this level of the nested dissection process. Integer.
rank
The rank of the current process among the group of processes on
which the current subgraph is distributed at this level of the nested
dissection process. Integer.
vert
The number of vertices of the current subgraph. Integer.
The currently available parallel vertex separation methods are the following.
b Band method. Starting from the current distributed graph and its parti-
tion, this method creates a new distributed graph reduced to the vertices
which are at most at a given distance from the current separator, runs
a parallel vertex separation strategy on this graph, and prolongs back
the new separator to the current graph. This method is primarily used
to run separator refinement methods during the uncoarsening phase of
the multi-level parallel graph separation method. The parameters of the
band method are listed below.
strat=strat
Set the parallel vertex separation strategy to be applied to the band
graph.
width=val
Set the maximum distance from the current separator of vertices to
be kept in the band graph. 0 means that only separator vertices
themselves are kept, 1 that immediate neighboring vertices are kept
too, and so on.
m Parallel vertex multi-level method. The parameters of the vertex multi-
level method are listed below.
asc=strat
Set the strategy that is used to refine the distributed vertex sep-
arators obtained at ascending levels of the uncoarsening phase by
prolongation of the separators computed for coarser graphs. This
strategy is not applied to the coarsest graph, for which only the low
strategy is used.
dlevl=nbr
Set the minimum level after which duplication is allowed in the fold-
ing process. A value of −1 results in duplication being always per-
formed when folding.
dvert=nbr
Set the average number of vertices per process under which the fold-
ing process is performed during the coarsening phase.
low=strat
Set the strategy that is used to compute the vertex separator of
the coarsest distributed graph, at the lowest level of the coarsening
process.
rat=rat
Set the threshold maximum coarsening ratio over which graphs are
no longer coarsened. The ratio of any given coarsening cannot be
less than 0.5 (the case of a perfect matching), and cannot be greater
than 1.0. Coarsening stops when either the coarsening ratio is above
the maximum coarsening ratio, or the graph has fewer node vertices
than the minimum number of vertices allowed.
vert=nbr
Set the threshold minimum size under which graphs are no longer
coarsened. Coarsening stops when either the coarsening ratio is
above the maximum coarsening ratio, or the graph has fewer node
vertices than the minimum number of vertices allowed.
q Multi-sequential method. The current distributed graph and its sep-
arator are centralized on every process that holds a part of it, and a
sequential vertex separation method is applied independently to each
of them. Then, the best separator found is prolonged back to the dis-
tributed graph. This method is primarily designed to operate on band
graphs, which are orders of magnitude smaller than their parent graph.
Else, memory bottlenecks are very likely to occur. The parameters of
the multi-sequential method are listed below.
strat=strat
Set the sequential vertex separation strategy that is used to refine
the separator of the centralized graph. For a description of all of
the available sequential methods, please refer to the Scotch User’s
Guide [35].
z Zero method. This method moves all of the node vertices to the first part,
resulting in an empty separator. Its main use is to stop the separation
process whenever some condition is true.
6.5 Distributed graph handling routines
6.5.1 SCOTCH dgraphAlloc
Synopsis
Description
Return values
SCOTCH dgraphAlloc returns the pointer to the memory area if it has been
successfully allocated, and NULL else.
Return values
SCOTCH dgraphInit returns 0 if the graph structure has been successfully
initialized, and 1 else.
Description
The SCOTCH dgraphExit function frees the contents of a SCOTCH Dgraph struc-
ture previously initialized by SCOTCH dgraphInit. All subsequent calls to
SCOTCH dgraph routines other than SCOTCH dgraphInit, using this structure
as parameter, may yield unpredictable results.
Description
The SCOTCH dgraphFree function frees the graph data of a SCOTCH Dgraph
structure previously initialized by SCOTCH dgraphInit, but preserves its in-
ternal communication data structures. This call is equivalent to a call to
SCOTCH dgraphExit immediately followed by a call to SCOTCH dgraphInit
with the same communicator as in the previous SCOTCH dgraphInit call. Con-
sequently, the given SCOTCH Dgraph structure remains ready for subsequent
calls to any distributed graph handling routine of the libScotch library.
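The following minimal C sketch (illustrative only; error checking omitted) shows
the typical lifecycle of a SCOTCH Dgraph structure, with SCOTCH dgraphFree al-
lowing the structure to be re-used for another graph before the final SCOTCH
dgraphExit:

#include <stdio.h>
#include <mpi.h>
#include "ptscotch.h"

int
main (int argc, char * argv[])
{
  SCOTCH_Dgraph grafdat;
  int           thrdval;

  MPI_Init_thread (&argc, &argv, MPI_THREAD_MULTIPLE, &thrdval);
  SCOTCH_dgraphInit (&grafdat, MPI_COMM_WORLD); /* Bind communicator once */

  /* ... build or load a first distributed graph, and use it ...          */

  SCOTCH_dgraphFree (&grafdat);             /* Release graph data only    */

  /* ... the structure can now receive a new graph ...                    */

  SCOTCH_dgraphExit (&grafdat);             /* Release everything         */
  MPI_Finalize ();
  return (0);
}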
scotchfdgraphload (doubleprecision (*) grafdat,
integer fildes,
integer*num baseval,
integer*num flagval,
integer ierr)
Description
The SCOTCH dgraphLoad routine fills the SCOTCH Dgraph structure pointed
to by grafptr with the centralized or distributed source graph description
available from one or several streams stream in the Scotch graph formats
(please refer to section 4.1 for a description of the distributed graph format,
and to the Scotch User’s Guide [35] for the centralized graph format).
When only one stream pointer is not null, the associated source graph file
must be a centralized one, the contents of which are spread across all of the
processes. When all stream pointers are non null, they can either refer to
multiple instances of the same centralized graph, or to the distinct fragments
of a distributed graph. In the first case, all processes read all of the contents
of the centralized graph files but keep only the relevant part. In the second
case, every process reads its fragment in parallel.
To ease the handling of source graph files by programs written in C as well as
in Fortran, the base value of the graph to read can be set to 0 or 1, by setting
the baseval parameter to the proper value. A value of -1 indicates that the
graph base should be the same as the one provided in the graph description
that is read from stream.
The flagval value is a combination of the following integer values, that may
be added or bitwise-ored:
0 Keep vertex and edge weights if they are present in the stream data.
1 Remove vertex weights. The graph read will have all of its vertex weights
set to one, regardless of what is specified in the stream data.
2 Remove edge weights. The graph read will have all of its edge weights
set to one, regardless of what is specified in the stream data.
Fortran users must use the PXFFILENO or FNUM functions to obtain the num-
ber of the Unix file descriptor fildes associated with the logical unit of the
graph file. Processes which would pass a NULL stream pointer in C must pass
descriptor number -1 in Fortran.
Return values
SCOTCH dgraphLoad returns 0 if the distributed graph structure has been suc-
cessfully allocated and filled with the data read, and 1 else.
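For instance, a centralized graph file can be read from a single process, all other
processes passing a NULL stream, as in the following minimal C sketch (the helper
function, file name and error handling are illustrative):

#include <stdio.h>
#include <mpi.h>
#include "ptscotch.h"

/* Minimal sketch: load a centralized graph file from process 0 only.
   The other processes pass a NULL stream; the graph contents are spread
   across all of the processes of the communicator. */
int
loadCentralized (SCOTCH_Dgraph * grafptr, const char * filename, MPI_Comm comm)
{
  FILE * stream = NULL;
  int    proclocnum;
  int    o;

  MPI_Comm_rank (comm, &proclocnum);
  if (proclocnum == 0)
    stream = fopen (filename, "r");

  o = SCOTCH_dgraphLoad (grafptr, stream, -1, 0); /* Keep base and loads  */

  if (stream != NULL)
    fclose (stream);
  return (o);
}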
scotchfdgraphsave (doubleprecision (*) grafdat,
integer fildes,
integer ierr)
Description
The SCOTCH dgraphSave routine saves the contents of the SCOTCH Dgraph
structure pointed to by grafptr to streams stream, in the Scotch distributed
graph format (see section 4.1).
Fortran users must use the PXFFILENO or FNUM functions to obtain the number
of the Unix file descriptor fildes associated with the logical unit of the graph
file.
Return values
SCOTCH dgraphSave returns 0 if the graph structure has been successfully
written to stream, and 1 else.
Description
The SCOTCH dgraphBuild routine fills the distributed source graph structure
pointed to by grafptr with all of the data that are passed to it.
baseval is the graph base value for index arrays (typically 0 for structures
built from C and 1 for structures built from Fortran). vertlocnbr is the
number of local vertices on the calling process, used to create the proccnttab
array. vertlocmax is the maximum number of local vertices to be created on
the calling process, used to create the procvrttab array of global indices, and
which must be set to vertlocnbr for graphs without holes in their global num-
bering. vertloctab is the local adjacency index array, of size (vertlocnbr+1)
if the edge array is compact (that is, if vendloctab equals vertloctab + 1
or NULL), or of size vertlocnbr else. vendloctab is the adjacency end index
array, of size vertlocnbr if it is disjoint from vertloctab. veloloctab is
the local vertex load array, of size vertlocnbr if it exists. vlblloctab is the
local vertex label array, of size vertlocnbr if it exists. edgelocnbr is the
local number of arcs (that is, twice the number of edges), including arcs to
local vertices as well as to ghost vertices. edgelocsiz is lower-bounded by
the minimum size of the edge array required to encompass all used adjacency
values; it is therefore at least equal to the maximum of the vendloctab en-
tries, over all local vertices, minus baseval; it can be set to edgelocnbr if
the edge array is compact. edgeloctab is the local adjacency array, of size at
least edgelocsiz, which stores the global indices of end vertices. edgegsttab
is the adjacency array, of size at least edgelocsiz, if it exists; if edgegsttab
is given, it is assumed to be a pointer to an empty array to be filled with ghost
vertex data computed by SCOTCH dgraphGhst whenever needed by commu-
nication routines such as SCOTCH dgraphHalo. edloloctab is the arc load
array, of size edgelocsiz if it exists.
The vendloctab, veloloctab, vlblloctab, edloloctab and edgegsttab ar-
rays are optional, and a null pointer can be passed as argument whenever
they are not defined. Since, in Fortran, there is no null reference, passing
the scotchfdgraphbuild routine a reference equal to vertloctab in the
veloloctab or vlblloctab fields causes them to be considered as missing
arrays. The same holds for edloloctab and edgegsttab when they are passed
vertloctab yields the same result, as it is the exact semantics of a compact
vertex array.
To limit memory consumption, SCOTCH dgraphBuild does not copy array
data, but instead references them in the SCOTCH Dgraph structure. Therefore,
great care should be taken not to modify the contents of the arrays passed to
SCOTCH dgraphBuild as long as the graph structure is in use. Every update
of the arrays should be preceded by a call to SCOTCH dgraphFree, to free
internal graph structures, and eventually followed by a new call to SCOTCH
dgraphBuild to re-build these internal structures so as to be able to use the
new distributed graph.
To ensure that inconsistencies in user data do not result in an erroneous behav-
ior of the libScotch routines, it is recommended, at least in the development
stage of your application code, to call the SCOTCH dgraphCheck routine on the
newly created SCOTCH Dgraph structure before calling any other libScotch
routine.
Return values
SCOTCH dgraphBuild returns 0 if the graph structure has been successfully
set with all of the input data, and 1 else.
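As an illustration, here is a minimal C sketch (assuming at least two processes;
error handling reduced to the bare minimum) that builds a distributed ring graph
with two local vertices per process, a continuous 0-based global numbering, and
a compact edge array:

#include <stdio.h>
#include <mpi.h>
#include "ptscotch.h"

int
main (int argc, char * argv[])
{
  SCOTCH_Dgraph grafdat;
  SCOTCH_Num    vertloctab[3];              /* Compact: vertlocnbr + 1    */
  SCOTCH_Num    edgeloctab[4];
  SCOTCH_Num    vertglbnbr;
  int           procglbnbr, proclocnum, thrdval;

  MPI_Init_thread (&argc, &argv, MPI_THREAD_MULTIPLE, &thrdval);
  MPI_Comm_size (MPI_COMM_WORLD, &procglbnbr);
  MPI_Comm_rank (MPI_COMM_WORLD, &proclocnum);
  vertglbnbr = 2 * procglbnbr;              /* Two vertices per process   */

  vertloctab[0] = 0;                        /* Compact adjacency indices  */
  vertloctab[1] = 2;
  vertloctab[2] = 4;
  edgeloctab[0] = (2 * proclocnum + vertglbnbr - 1) % vertglbnbr; /* Ring */
  edgeloctab[1] =  2 * proclocnum + 1;
  edgeloctab[2] =  2 * proclocnum;
  edgeloctab[3] = (2 * proclocnum + 2) % vertglbnbr;

  SCOTCH_dgraphInit (&grafdat, MPI_COMM_WORLD);
  if (SCOTCH_dgraphBuild (&grafdat, 0, 2, 2, /* baseval, vertlocnbr, vertlocmax */
                          vertloctab, vertloctab + 1, /* Compact vendloctab */
                          NULL, NULL,       /* No vertex loads nor labels  */
                          4, 4,             /* edgelocnbr, edgelocsiz      */
                          edgeloctab, NULL, NULL) != 0)
    return (1);
  if (SCOTCH_dgraphCheck (&grafdat) != 0)   /* Recommended in development  */
    return (1);

  SCOTCH_dgraphExit (&grafdat);
  MPI_Finalize ();
  return (0);
}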
Return values
SCOTCH dgraphGather returns 0 if the graph structure has been successfully
gathered, and 1 else.
indicate it by passing a pointer to the distributed graph structure equal to
the pointer to their centralized graph structure.
The scattering is performed such that graph vertices are evenly spread across
the processes of the communicator associated with the distributed graph,
in ascending order. Every process receives either ⌈vertglbnbr/procglbnbr⌉
or ⌊vertglbnbr/procglbnbr⌋ vertices, according to its rank: processes of
lower ranks are filled first, possibly with one more vertex than processes of
higher ranks.
Return values
SCOTCH dgraphScatter returns 0 if the graph structure has been successfully
scattered, and 1 else.
The SCOTCH dgraphCheck routine checks the consistency of the given SCOTCH
Dgraph structure. It can be used in client applications to determine if a graph
which has been created from user-generated data by means of the SCOTCH
dgraphBuild routine is consistent, prior to calling any other routines of the
libScotch library which would otherwise return internal error messages or
crash the program.
Return values
SCOTCH dgraphCheck returns 0 if graph data are consistent, and 1 else.
Description
The SCOTCH dgraphSize routine fills the four areas of type SCOTCH Num
pointed to by vertglbptr, vertlocptr, edgeglbptr and edgelocptr with
the number of global vertices and arcs (that is, twice the number of edges) of
the given graph pointed to by grafptr, as well as with the number of local
vertices and arcs borne by each of the calling processes.
Any of these pointers can be set to NULL on input if the corresponding infor-
mation is not needed. Else, the reference to a dummy area can be provided,
where all unwanted data will be written.
This routine is useful to get the size of a graph read by means of the SCOTCH
dgraphLoad routine, in order to allocate auxiliary arrays of proper sizes. If the
whole structure of the graph is wanted, function SCOTCH dgraphData should
be preferred.
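For instance, a per-vertex work array can be allocated as in the following hedged
sketch (the helper function is illustrative):

#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>
#include "ptscotch.h"

/* Minimal sketch: allocate an array of one SCOTCH_Num per local vertex,
   sized from the graph itself. */
SCOTCH_Num *
allocVertexArray (SCOTCH_Dgraph * grafptr)
{
  SCOTCH_Num vertglbnbr, vertlocnbr, edgeglbnbr, edgelocnbr;

  SCOTCH_dgraphSize (grafptr, &vertglbnbr, &vertlocnbr,
                     &edgeglbnbr, &edgelocnbr);
  return (malloc (vertlocnbr * sizeof (SCOTCH_Num)));
}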
scotchfdgraphdata (doubleprecision (*) grafdat,
integer*num (*) indxtab,
integer*num baseval,
integer*num vertglbnbr,
integer*num vertlocnbr,
integer*num vertlocmax,
integer*num vertgstnbr,
integer*idx vertlocidx,
integer*idx vendlocidx,
integer*idx velolocidx,
integer*idx vlbllocidx,
integer*num edgeglbnbr,
integer*num edgelocnbr,
integer*num edgelocsiz,
integer*idx edgelocidx,
integer*idx edgegstidx,
integer*idx edlolocidx,
integer comm)
Description
The SCOTCH dgraphData routine is the dual of the SCOTCH dgraphBuild rou-
tine. It is a multiple accessor that returns scalar values and array references.
baseptr is the pointer to a location that will hold the graph base value for
index arrays (typically 0 for structures built from C and 1 for structures built
from Fortran). vertglbptr is the pointer to a location that will hold the global
number of vertices. vertlocptr is the pointer to a location that will hold the
number of local vertices. vertlocptz is the pointer to a location that will
hold the maximum allowed number of local vertices, that is, (procvrttab[p +
1] − procvrttab[p]), where p is the rank of the local process. vertgstptr is
the pointer to a location that will hold the number of local and ghost vertices
if it has already been computed by a prior call to SCOTCH dgraphGhst, and
−1 else. vertloctab is the pointer to a location that will hold the reference
to the adjacency index array, of size *vertlocptr+ 1 if the adjacency array is
compact, or of size *vertlocptr else. vendloctab is the pointer to a location
that will hold the reference to the adjacency end index array, and is equal
to vertloctab + 1 if the adjacency array is compact. veloloctab is the
pointer to a location that will hold the reference to the vertex load array, of
size *vertlocptr. vlblloctab is the pointer to a location that will hold the
reference to the vertex label array, of size vertlocnbr. edgeglbptr is the
pointer to a location that will hold the global number of arcs (that is, twice
the number of global edges). edgelocptr is the pointer to a location that
will hold the number of local arcs (that is, twice the number of local edges).
edgelocptz is the pointer to a location that will hold the declared size of
the local edge array, which must encompass all used adjacency values; it is
at least equal to *edgelocptr. edgeloctab is the pointer to a location that
will hold the reference to the local adjacency array of global indices, of size
at least *edgelocptz. edgegsttab is the pointer to a location that will hold
the reference to the ghost adjacency array, of size at least *edgelocptz; if it
is non null, its data are valid if vertgstnbr is non-negative. edloloctab is
the pointer to a location that will hold the reference to the arc load array, of
size *edgelocptz. comm is the pointer to a location that will hold the MPI
communicator of the distributed graph.
Any of these pointers can be set to NULL on input if the corresponding infor-
mation is not needed. Else, the reference to a dummy area can be provided,
where all unwanted data will be written.
Since there are no pointers in Fortran, a specific mechanism is used to allow
users to access graph arrays. The scotchfdgraphdata routine is passed an
integer array, the first element of which is used as a base address from which all
other array indices are computed. Therefore, instead of returning references,
the routine returns integers, which represent the starting index of each of the
relevant arrays with respect to the base input array, or vertlocidx, the index
of vertloctab, if they do not exist. For instance, if some base array
myarray(1) is passed as parameter indxtab, then the first cell of array
vertloctab will be accessible as myarray(vertlocidx). In order for this
feature to behave properly, the indxtab array must be word-aligned with the
graph arrays. This is automatically enforced on most systems, but some care
should be taken on systems that allow access to data that is not word-aligned.
On such systems, declaring the array after a dummy doubleprecision array
can coerce the compiler into enforcing the proper alignment. The integer value
returned in comm is the communicator itself, not its index with respect to
indxtab. Also, on 32/64-bit architectures, such indices can be larger than a
regular INTEGER. This is why the indices to be returned are defined by means
of a specific integer type. See Section 6.2.4 for more information on this issue.
The SCOTCH dgraphGhst routine fills the edgegsttab arrays of the distributed
graph structure pointed to by grafptr with the local and ghost vertex indices
corresponding to the global vertex indices contained in its edgeloctab arrays,
according to the semantics described in Section 6.3.1.
If memory areas had not been previously reserved by the user for the edge
gsttab arrays and linked to the distributed graph structure through a call to
SCOTCH dgraphBuild, they are allocated. Their references can be retrieved
on every process by means of a call to SCOTCH dgraphData, which will also
return the number of local and ghost vertices, suitable for allocating vertex
data arrays for SCOTCH dgraphHalo.
Return values
SCOTCH dgraphGhst returns 0 if ghost vertex data has been successfully com-
puted, and 1 else.
The SCOTCH dgraphHalo routine propagates the data borne by local vertices
to all of the corresponding halo vertices located on neighboring processes,
in a synchronous way. On every process, datatab should point to a data
array of a size sufficient to hold vertgstnbr elements of the data type to
be exchanged, the first vertlocnbr slots of which must already be filled
with the information associated with the local vertices. On completion, the
(vertgstnbr − vertlocnbr) remaining slots are filled with copies of the cor-
responding remote data obtained from the local parts of the data arrays of
neighboring processes.
When the MPI data type to be used is not a collection of contiguous en-
tries, great care should be taken in the definition of the upper bound of the
type (by using the MPI UB pseudo-datatype), such that when asking MPI to
send a certain number of elements of the said type located at some address,
contiguous areas in memory will be considered. Please refer to the MPI docu-
mentation regarding the creation of derived datatypes [32, Section 3.12.3] for
more information.
To perform its data exchanges, the SCOTCH dgraphHalo routine requires ghost
vertex management data provided by the SCOTCH dgraphGhst routine. There-
fore, the edgegsttab array returned by the SCOTCH dgraphData routine will
always be valid after a call to SCOTCH dgraphHalo, if it was not already.
In case useful computation can be carried out during the halo exchange, an
asynchronous version of this routine is available, called SCOTCH dgraphHalo
Async.
Return values
SCOTCH dgraphHalo returns 0 if halo data has been successfully exchanged,
and 1 else.
int SCOTCH dgraphHaloAsync (SCOTCH Dgraph * const grafptr,
void * datatab,
MPI Datatype typeval,
SCOTCH DgraphHaloReq * const requptr)
scotchfdgraphhaloasync (doubleprecision (*) grafdat,
doubleprecision (*) datatab,
integer typeval,
doubleprecision (*) requptr,
integer ierr)
Description
Return values
SCOTCH dgraphHaloAsync returns 0 if the halo data exchange has been suc-
cessfully started, and 1 else.
int SCOTCH dgraphHaloWait (SCOTCH DgraphHaloReq * const requptr)
scotchfdgraphhalowait (doubleprecision (*) requptr,
integer ierr)
Description
Return values
SCOTCH dgraphHaloWait returns 0 if halo data has been successfully ex-
changed, and 1 else.
Return values
SCOTCH dgraphPart returns 0 if the partition of the graph has been success-
fully computed, and 1 else. In this latter case, the partloctab array may
however have been partially or completely filled, but its content is not signif-
icant.
Return values
SCOTCH dgraphMap returns 0 if the partition of the graph has been successfully
computed, and 1 else. In this last case, the partloctab arrays may however
have been partially or completely filled, but their contents is not significant.
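As an illustration of these routines, here is a minimal sketch computing a partition
into partnbr parts with the default strategy (the helper function is illustrative;
partloctab must hold one SCOTCH Num per local vertex):

#include <stdio.h>
#include <mpi.h>
#include "ptscotch.h"

/* Minimal sketch: partition a distributed graph into partnbr parts,
   using the default strategy of an empty, freshly initialized strategy. */
int
partitionGraph (SCOTCH_Dgraph * grafptr, SCOTCH_Num partnbr,
                SCOTCH_Num * partloctab)
{
  SCOTCH_Strat stradat;
  int          o;

  if (SCOTCH_stratInit (&stradat) != 0)     /* Empty strategy: use default */
    return (1);
  o = SCOTCH_dgraphPart (grafptr, partnbr, &stradat, partloctab);
  SCOTCH_stratExit (&stradat);
  return (o);
}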
int SCOTCH dgraphMapInit (const SCOTCH Dgraph * grafptr,
SCOTCH Dmapping * mappptr,
const SCOTCH Arch * archptr,
SCOTCH Num * partloctab)
scotchfdgraphmapinit (doubleprecision (*) grafdat,
doubleprecision (*) mappdat,
doubleprecision (*) archdat,
integer*num (*) partloctab,
integer ierr)
Description
Return values
SCOTCH dgraphMapInit returns 0 if the distributed mapping structure has
been successfully initialized, and 1 else.
6.6.5 SCOTCH dgraphMapSave
Synopsis
Return values
SCOTCH dgraphMapSave returns 0 if the mapping structure has been success-
fully written to stream, and 1 else.
mapped. The numbering of target values is not based: target vertices are
numbered from 0 to the number of target vertices, minus 1.
Attention: version 5.1 of Scotch does not yet allow distributed graphs to
be mapped onto target architectures which are not complete graphs. This
restriction will be removed in the next release.
Return values
SCOTCH dgraphMapCompute returns 0 if the mapping has been successfully
computed, and 1 else. In this latter case, the local mapping arrays may
however have been partially or completely filled, but their contents is not
significant.
Return values
SCOTCH dgraphOrderInit returns 0 if the distributed ordering structure has
been successfully initialized, and 1 else.
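The following minimal C sketch shows a typical distributed ordering sequence. It
assumes the SCOTCH dgraphOrderCompute and SCOTCH dgraphOrderExit routines
of the library, with the prototypes sketched below; permloctab must hold one
SCOTCH Num per local vertex, and error checking is reduced for brevity:

#include <stdio.h>
#include <mpi.h>
#include "ptscotch.h"

/* Minimal sketch: compute a distributed ordering with the default
   strategy, then retrieve the local part of the direct permutation. */
int
orderGraph (SCOTCH_Dgraph * grafptr, SCOTCH_Num * permloctab)
{
  SCOTCH_Dordering orderdat;
  SCOTCH_Strat     stradat;
  int              o;

  SCOTCH_stratInit (&stradat);              /* Empty strategy: use default */
  if (SCOTCH_dgraphOrderInit (grafptr, &orderdat) != 0)
    return (1);
  o = SCOTCH_dgraphOrderCompute (grafptr, &orderdat, &stradat);
  if (o == 0)                               /* Get direct permutation      */
    o = SCOTCH_dgraphOrderPerm (grafptr, &orderdat, permloctab);
  SCOTCH_dgraphOrderExit (grafptr, &orderdat);
  SCOTCH_stratExit (&stradat);
  return (o);
}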
Description
Return values
SCOTCH dgraphOrderSave returns 0 if the ordering structure has been suc-
cessfully written to stream, and 1 else.
Description
Return values
SCOTCH dgraphOrderSaveMap returns 0 if the ordering structure has been suc-
cessfully written to stream, and 1 else.
ordering file. Processes which would pass a NULL stream pointer in C must
pass descriptor number -1 in Fortran.
Return values
SCOTCH dgraphOrderSaveTree returns 0 if the ordering structure has been
successfully written to stream, and 1 else.
Return values
SCOTCH dgraphOrderPerm returns 0 if the distributed permutation has been
successfully computed, and 1 else.
Return values
SCOTCH dgraphOrderCblkDist returns a positive number if the number of
distributed elimination tree nodes has been successfully computed, and a neg-
ative value else.
with the given distributed ordering. This structure describes the sizes and
relations between all distributed elimination tree (super-)nodes. These nodes
are mainly the result of parallel nested dissection, before the ordering process
goes sequential. Sequential nodes generated locally on individual processes
are not represented in this structure.
A node can either be a leaf column block, which has no descendants, or a
nested dissection node, which has most often three sons: its two separated
sub-parts and the separator. A nested dissection node may have two sons
only if the separator is empty; it cannot have only one son. Sons are indexed
such that the separator of a block, if any, is always the son of highest index.
Hence, the order of the indices of the two sub-parts matches the one of the
direct permutation of the unknowns.
For any column block i, treeglbtab[i] holds the index of the father of node i
in the elimination tree, or −1 if i is the root of the tree. All node indices start
from baseval. sizeglbtab[i] holds the number of graph vertices possessed
by node i, plus the ones of all of its descendants if it is not a leaf of the tree.
Therefore, the sizeglbtab value of the root vertex is always equal to the
number of vertices in the distributed graph.
Each of the treeglbtab and sizeglbtab arrays must be large enough to
hold a number of SCOTCH Nums equal to the number of distributed elimination
tree nodes and column blocks, as returned by the SCOTCH dgraphOrderCblk
Dist routine.
Return values
SCOTCH dgraphOrderTreeDist returns 0 if the arrays describing the dis-
tributed part of the distributed tree structure have been successfully filled,
and 1 else.
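As an illustration, here is a minimal sketch retrieving and printing the distributed
part of the separators tree of a computed distributed ordering (the helper function
is illustrative; error checking is reduced for brevity):

#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>
#include "ptscotch.h"

/* Minimal sketch: size the tree arrays from SCOTCH_dgraphOrderCblkDist,
   then fill them with SCOTCH_dgraphOrderTreeDist. */
int
getTree (SCOTCH_Dgraph * grafptr, SCOTCH_Dordering * ordeptr)
{
  SCOTCH_Num   cblkglbnbr;
  SCOTCH_Num * treeglbtab;
  SCOTCH_Num * sizeglbtab;
  SCOTCH_Num   i;

  cblkglbnbr = SCOTCH_dgraphOrderCblkDist (grafptr, ordeptr);
  if (cblkglbnbr < 0)                       /* Negative value means error  */
    return (1);

  treeglbtab = malloc (cblkglbnbr * sizeof (SCOTCH_Num));
  sizeglbtab = malloc (cblkglbnbr * sizeof (SCOTCH_Num));
  if (SCOTCH_dgraphOrderTreeDist (grafptr, ordeptr,
                                  treeglbtab, sizeglbtab) != 0)
    return (1);

  for (i = 0; i < cblkglbnbr; i ++)
    printf ("Node %d: father %d, size %d\n",
            (int) i, (int) treeglbtab[i], (int) sizeglbtab[i]);
  free (treeglbtab);
  free (sizeglbtab);
  return (0);
}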
scotchfdgraphcorderinit (doubleprecision (*) grafdat,
doubleprecision (*) corddat,
integer*num (*) permtab,
integer*num (*) peritab,
integer*num cblknbr,
integer*num (*) rangtab,
integer*num (*) treetab,
integer ierr)
Description
Return values
SCOTCH dgraphCorderInit returns 0 if the ordering structure has been suc-
cessfully initialized, and 1 else.
The SCOTCH dgraphCorderExit function frees the contents of a centralized
SCOTCH Ordering structure previously initialized by SCOTCH dgraphCorder
Init.
Return values
SCOTCH dgraphOrderGather returns 0 if the centralized ordering structure
has been successfully updated, and 1 else.
Return values
SCOTCH stratInit returns 0 if the strategy structure has been successfully
initialized, and 1 else.
Description
The SCOTCH stratExit function frees the contents of a SCOTCH Strat struc-
ture previously initialized by SCOTCH stratInit. All subsequent calls to
SCOTCH strat routines other than SCOTCH stratInit, using this structure
as parameter, may yield unpredictable results.
The SCOTCH stratSave routine saves the contents of the SCOTCH Strat struc-
ture pointed to by straptr to stream stream, in the form of a text string.
The methods and parameters of the strategy string depend on the type of the
strategy, that is, whether it is a bipartitioning, mapping, or ordering strategy,
and to which structure it applies, that is, graphs or meshes.
Fortran users must use the PXFFILENO or FNUM functions to obtain the number
of the Unix file descriptor fildes associated with the logical unit of the output
file.
Return values
SCOTCH stratSave returns 0 if the strategy string has been successfully writ-
ten to stream, and 1 else.
int SCOTCH stratDgraphMap (SCOTCH Strat * straptr,
const char * string)
scotchfstratdgraphmap (doubleprecision (*) stradat,
character (*) string,
integer ierr)
Description
Return values
SCOTCH stratDgraphMap returns 0 if the strategy string has been successfully
set, and 1 else.
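For instance, a parallel mapping strategy can be set from a user-supplied string, as
in the following minimal sketch (the helper function is illustrative; when the string
is empty, the default strategy of the freshly initialized structure is kept):

#include <stdio.h>
#include "ptscotch.h"

/* Minimal sketch: set a parallel mapping strategy from a string, for
   instance one taken from a command-line argument. */
int
setMappingStrategy (SCOTCH_Strat * straptr, const char * string)
{
  if (SCOTCH_stratInit (straptr) != 0)
    return (1);
  if ((string == NULL) || (string[0] == '\0'))
    return (0);                             /* Keep default strategy       */
  return (SCOTCH_stratDgraphMap (straptr, string));
}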
Return values
SCOTCH stratDgraphMapBuild returns 0 if the strategy string has been suc-
cessfully set, and 1 else.
Return values
SCOTCH stratDgraphOrder returns 0 if the strategy string has been success-
fully set, and 1 else.
and imbalance ratio balrat, to be used on procnbr processes. From this
point, the strategy structure can only be used as a parallel ordering strategy,
to be used by function SCOTCH dgraphOrder, for instance. See Section 6.4.1
for a description of the available flags.
Return values
SCOTCH stratDgraphOrderBuild returns 0 if the strategy string has been
successfully set, and 1 else.
Description
Return values
SCOTCH dmapAlloc returns the pointer to the memory area if it has been
successfully allocated, and NULL else.
Description
Return values
SCOTCH dorderAlloc returns the pointer to the memory area if it has been
successfully allocated, and NULL else.
To match these two requirements, all the error and warning messages pro-
duced by the routines of the libScotch library are issued using the user-definable
variable-length argument routines SCOTCH errorPrint and SCOTCH errorPrintW.
Thus, one can redirect these error messages to one's own error handling routines,
and choose whether the program should terminate on error or resume execution
after the erroneous function has returned.
In order to free the user from the burden of writing a basic error handler from
scratch, the libptscotcherr.a library provides error routines that print error
messages on the standard error stream stderr and return control to the appli-
cation. Application programmers who want to take advantage of them have to
add -lptscotcherr to the list of arguments of the linker, after the -lptscotch
argument.
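As a hedged sketch, an application wishing to handle errors itself could define the
two routines as follows, instead of linking with -lptscotcherr. The prototypes
are assumed to match those expected by the libScotch library, namely a format
string followed by variable arguments:

#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Minimal sketch of user-defined error handlers, linked in place of
   the routines of libptscotcherr. */
void
SCOTCH_errorPrint (const char * const errstr, ...)
{
  va_list arglist;

  fprintf (stderr, "Application error: ");
  va_start (arglist, errstr);
  vfprintf (stderr, errstr, arglist);       /* Forward the message         */
  va_end   (arglist);
  fprintf (stderr, "\n");
  exit (1);                                 /* Terminate on error          */
}

void
SCOTCH_errorPrintW (const char * const errstr, ...)
{
  va_list arglist;

  fprintf (stderr, "Application warning: ");
  va_start (arglist, errstr);
  vfprintf (stderr, errstr, arglist);
  va_end   (arglist);
  fprintf (stderr, "\n");                   /* Resume execution            */
}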
6.12 Miscellaneous routines
6.12.1 SCOTCH randomReset
Synopsis
Description
The SCOTCH randomReset routine resets the seed of the pseudo-random gen-
erator used by the graph partitioning routines of the libScotch library. Two
consecutive calls to the same libScotch partitioning routines, and separated
by a call to SCOTCH randomReset, will always yield the same results, as if the
equivalent standalone Scotch programs were used twice, independently, to
compute the results.
parmetis v3 nodend (integer (*) vtxdist,
integer (*) xadj,
integer (*) adjncy,
integer numflag,
integer (*) options,
integer (*) order,
integer (*) sizes,
integer comm)
Description
parmetis v3 partgeomkway (integer (*) vtxdist,
integer (*) xadj,
integer (*) adjncy,
integer (*) vwgt,
integer (*) adjwgt,
integer wgtflag,
integer numflag,
integer ndims,
float (*) xyz,
integer ncon,
integer nparts,
float (*) tpwgts,
float (*) ubvec,
integer (*) options,
integer edgecut,
integer (*) part,
integer comm)
Description
parmetis v3 partkway (integer (*) vtxdist,
integer (*) xadj,
integer (*) adjncy,
integer (*) vwgt,
integer (*) adjwgt,
integer wgtflag,
integer numflag,
integer ncon,
integer nparts,
float (*) tpwgts,
float (*) ubvec,
integer (*) options,
integer edgecut,
integer (*) part,
integer comm)
Description
7 Installation
Version 5.1 of the Scotch software package, which contains the PT-Scotch
routines, is distributed as free/libre software under the CeCILL-C free/libre
software license [4], which is very similar to the GNU LGPL license. There-
fore, it is not distributed as a set of binaries, but instead in the form of a
source distribution, which can be downloaded from the Scotch web page at
http://www.labri.fr/~pelegrin/scotch/ .
All Scotch users are welcome to send an e-mail to the author so that they
can be added to the Scotch mailing list, and be automatically informed of new
releases and publications.
The extraction process will create a scotch_5.1.11 directory, containing several
subdirectories and files. Please refer to the files called LICENSE_EN.txt or
LICENCE_FR.txt, as well as file INSTALL_EN.txt, to see under which conditions
your distribution of Scotch is licensed and how to install it.
not defined at compile time. If the flag is defined, make sure to use the MPI Init
thread MPI routine to initialize the communication subsystem, at the MPI THREAD
MULTIPLE level (see Section 6.1).
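As an illustration of this initialization requirement, here is a minimal sketch (the
flag check and surrounding code are illustrative):

#include <stdio.h>
#include <mpi.h>

int
main (int argc, char * argv[])
{
  int thrdval;

  MPI_Init_thread (&argc, &argv, MPI_THREAD_MULTIPLE, &thrdval);
  if (thrdval < MPI_THREAD_MULTIPLE)        /* Level actually provided     */
    fprintf (stderr, "MPI_THREAD_MULTIPLE not supported\n");

  /* ... calls to PT-Scotch routines ...                                   */

  MPI_Finalize ();
  return (0);
}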
8 Examples
This section contains selected examples that show how the programs of the
PT-Scotch project interoperate and can be combined. It is assumed that parallel
programs are launched by means of the mpirun command, which comprises a -np
option to set the number of processes on which to run them. Character “%” in
bold represents the shell prompt.
Credits
I wish to thank all of the following people:
• Cédric Chevalier, during his PhD at LaBRI, did research on efficient paral-
lel matching algorithms and coded the parallel multi-level algorithm of PT-
Scotch. He also studied parallel genetic refinement algorithms. Many thanks
to him for the great job!
• Yves Secretan contributed to the MinGW32 port.
References
[1] P. Amestoy, T. Davis, and I. Duff. An approximate minimum degree ordering
algorithm. SIAM J. Matrix Anal. and Appl., 17:886–905, 1996.
[2] C. Ashcraft, S. Eisenstat, J. W.-H. Liu, and A. Sherman. A comparison of
three column based distributed sparse factorization schemes. In Proc. Fifth
SIAM Conf. on Parallel Processing for Scientific Computing, 1991.
[3] S. T. Barnard and H. D. Simon. A fast multilevel implementation of recur-
sive spectral bisection for partitioning unstructured problems. Concurrency:
Practice and Experience, 6(2):101–117, 1994.
[4] CeCILL: “CEA-CNRS-INRIA Logiciel Libre” free/libre software license. Avail-
able from http://www.cecill.info/licenses.en.html.
[5] P. Charrier and J. Roman. Algorithmique et calculs de complexité pour un
solveur de type dissections emboîtées. Numerische Mathematik, 55:463–476,
1989.
[6] C. Chevalier and F. Pellegrini. Improvement of the efficiency of genetic algo-
rithms for scalable parallel graph partitioning in a multi-level framework. In
Proc. EuroPar, Dresden, LNCS 4128, pages 243–252, September 2006.
[7] C. Chevalier and F. Pellegrini. PT-Scotch: A tool for efficient parallel graph
ordering. Parallel Computing, January 2008. http://www.labri.fr/~pelegrin/
papers/scotch_parallelordering_parcomp.pdf.
[8] F. Ercal, J. Ramanujam, and P. Sadayappan. Task allocation onto a hyper-
cube by recursive mincut bipartitioning. Journal of Parallel and Distributed
Computing, 10:35–44, 1990.
[9] C. M. Fiduccia and R. M. Mattheyses. A linear-time heuristic for improving
network partitions. In Proceedings of the 19th Design Automation Conference,
pages 175–181. IEEE, 1982.
[10] M. R. Garey and D. S. Johnson. Computers and Intractablility: A Guide to
the Theory of NP-completeness. W. H. Freeman, San Francisco, 1979.
[11] G. A. Geist and E. G.-Y. Ng. Task scheduling for parallel sparse Cholesky
factorization. International Journal of Parallel Programming, 18(4):291–314,
1989.
[12] A. George, M. T. Heath, J. W.-H. Liu, and E. G.-Y. Ng. Sparse Cholesky
factorization on a local memory multiprocessor. SIAM Journal on Scientific
and Statistical Computing, 9:327–340, 1988.
[13] A. George and J. W.-H. Liu. The evolution of the minimum degree ordering
algorithm. SIAM Review, 31:1–19, 1989.
[14] J. A. George and J. W.-H. Liu. Computer solution of large sparse positive
definite systems. Prentice Hall, 1981.
[15] A. Gupta, G. Karypis, and V. Kumar. Scalable parallel algorithms for sparse
linear systems. In Proc. Stratagem’96, Sophia-Antipolis, pages 97–110. INRIA,
July 1996.
[19] B. Hendrickson and R. Leland. The Chaco user’s guide. Technical Report
SAND93–2339, Sandia National Laboratories, November 1993.
[20] B. Hendrickson and R. Leland. The Chaco user's guide – version 2.0. Technical
Report SAND94–2692, Sandia National Laboratories, 1994.
[24] B. Hendrickson and E. Rothberg. Improving the runtime and quality of nested
dissection ordering. SIAM J. Sci. Comput., 20(2):468–489, 1998.
[25] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for par-
titioning irregular graphs. Technical Report 95-035, University of Minnesota,
June 1995.
[26] G. Karypis and V. Kumar. Multilevel k-way partitioning scheme for irregular
graphs. Technical Report 95-064, University of Minnesota, August 1995.
[30] R. J. Lipton, D. J. Rose, and R. E. Tarjan. Generalized nested dissection.
SIAM Journal of Numerical Analysis, 16(2):346–358, April 1979.
[32] MPI: A Message Passing Interface Standard, version 1.1, June 1995. Available
from http://www.mpi-forum.org/docs/mpi-11-html/mpi-report.html.
[35] F. Pellegrini. Scotch 5.1 User’s Guide. Technical report, LaBRI, Université
Bordeaux I, August 2008. Available from http://www.labri.fr/~pelegrin/
scotch/.
[36] F. Pellegrini and J. Roman. Experimental analysis of the dual recursive bipar-
titioning algorithm for static mapping. Research Report, LaBRI, Université
Bordeaux I, August 1996. Available from http://www.labri.fr/~pelegrin/
papers/scotch_expanalysis.ps.gz.
[37] F. Pellegrini and J. Roman. Scotch: A software package for static mapping
by dual recursive bipartitioning of process and architecture graphs. In Proc.
HPCN’96, Brussels, LNCS 1067, pages 493–498, April 1996.
[38] F. Pellegrini and J. Roman. Sparse matrix ordering with Scotch. In Proc.
HPCN'97, Vienna, LNCS 1225, pages 370–378, April 1997.
[40] A. Pothen, H. D. Simon, and K.-P. Liou. Partitioning sparse matrices with
eigenvectors of graphs. SIAM Journal of Matrix Analysis, 11(3):430–452, July
1990.