
1 Introduction
MUMPS (“MUltifrontal Massively Parallel Solver”) is a package for solving systems of linear equations of the form Ax = b, where A is a square sparse matrix that can be either unsymmetric, symmetric positive definite, or general symmetric. MUMPS is a direct method based on a multifrontal approach which performs a direct factorization A = LU or A = LDL^T, depending on the symmetry of the matrix. We refer the reader to the papers [3, 4, 7, 18, 19, 22, 21, 9] for full details of the techniques used. MUMPS exploits both the parallelism arising from sparsity in the matrix A and that of dense factorization kernels.
The main features of the MUMPS package include the solution of the transposed system, input of
the matrix in assembled format (distributed or centralized) or elemental format, error analysis, iterative
refinement, scaling of the original matrix, out-of-core capability, detection of null pivots, basic estimate
of rank deficiency and null space basis, and computation of a Schur complement matrix. MUMPS offers
several built-in ordering algorithms, a tight interface to some external ordering packages such as PORD [27], SCOTCH [25], or METIS [23] (strongly recommended), and the possibility for the user to input a given ordering. Finally, MUMPS is available in various arithmetics (real or complex, single or double precision).
The software is written in Fortran 90 although a C interface is available (see Section 8). The parallel
version of MUMPS requires MPI [28] for message passing and makes use of the BLAS [13, 14], BLACS, and ScaLAPACK [11] libraries. The sequential version relies only on BLAS.
MUMPS is downloaded from the web site almost four times a day on average and has been run on a wide range of machines, compilers, and operating systems, although our experience is mainly with UNIX-based systems. We have tested it extensively on parallel computers from SGI, Cray, and IBM and on clusters of workstations.
MUMPS distributes the work tasks among the processors, but an identified processor (the host) is
required to perform most of the analysis phase, to distribute the incoming matrix to the other processors
(slaves) in the case where the matrix is centralized, and to collect the solution. The system Ax = b is
solved in three main steps:
1. Analysis. The host performs an ordering (see Section 2.2) based on the symmetrized pattern A + A^T, and carries out the symbolic factorization. A mapping of the multifrontal computational graph is then computed, and symbolic information is transferred from the host to the other processors. Using this information, the processors estimate the memory necessary for factorization and solution.
2. Factorization. The original matrix is first distributed to the processors that will participate in the numerical factorization. Based on the so-called elimination tree [24], the numerical factorization is then a sequence of dense factorizations of so-called frontal matrices. The elimination tree also expresses independence between tasks and enables multiple fronts to be processed simultaneously; hence the name multifrontal approach. After the factorization, the factor matrices are kept distributed (in core memory or on disk); they will be used in the solution phase.
3. Solution. The right-hand side b is broadcast from the host to the working processors, which compute the solution x using the (distributed) factors computed during factorization. The solution is then either assembled on the host or kept distributed on the working processors.
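The separation into these three phases can be illustrated by a much-simplified sequential sketch. The function names and the dense, pivot-free toy factorization below are purely illustrative assumptions and not MUMPS's API; the real package operates on sparse, possibly distributed matrices with numerical pivoting:

```python
# Toy illustration of the analysis / factorization / solution split.
# Dense, sequential, no pivoting -- conceptual only.

def analyze(A):
    """'Analysis': build the symmetrized pattern of A + A^T.
    (MUMPS would also compute an ordering and a symbolic factorization.)"""
    n = len(A)
    return [[(A[i][j] != 0) or (A[j][i] != 0) for j in range(n)]
            for i in range(n)]

def factorize(A):
    """'Factorization': A = LU by Gaussian elimination (no pivoting)."""
    n = len(A)
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    U = [row[:] for row in A]
    for k in range(n):
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]
            for j in range(k, n):
                U[i][j] -= L[i][k] * U[k][j]
    return L, U

def solve(L, U, b):
    """'Solution': forward substitution Ly = b, then back substitution Ux = y."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = b[i] - sum(L[i][j] * y[j] for j in range(i))
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x

A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
pattern = analyze(A)   # phase 1: symbolic work on the pattern of A + A^T
L, U = factorize(A)    # phase 2: numerical factorization
x = solve(L, U, b)     # phase 3: triangular solves with the stored factors
```

Note that, as in MUMPS, the factors produced by the second phase can be reused for any number of subsequent right-hand sides without repeating the analysis or factorization.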
Each of these phases can be called separately, and several instances of MUMPS can be handled simultaneously. MUMPS allows the host processor to participate in the factorization and solve phases, just like any other processor (see Section 2.7).
For both the symmetric and the unsymmetric algorithms used in the code, we have chosen a
fully asynchronous approach with dynamic scheduling of the computational tasks. Asynchronous
communication is used to enable overlapping between communication and computation. Dynamic
scheduling was initially chosen to accommodate numerical pivoting in the factorization. The other
important reason for this choice was that, with dynamic scheduling, the algorithm can adapt itself at
execution time to remap work and data to more appropriate processors. In fact, we combine the main features of static and dynamic approaches: we use the estimates obtained during the analysis to map some of the main computational tasks, while the other tasks are dynamically scheduled at execution time. The main data structures (the original matrix and the factors) are similarly partially mapped during the analysis phase.
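The task-dependency side of this design can be caricatured as follows: a front in the elimination tree becomes ready only once all of its child fronts are done, and any idle processor may then pick it up. The tree below and the first-come-first-served policy are invented for illustration; MUMPS's actual dynamic scheduler is considerably more elaborate:

```python
# Caricature of dynamic scheduling over an elimination tree.
# A node (front) is ready once all of its children have been processed;
# ready fronts may be taken by any idle processor, in any order.

from collections import deque

# parent[i] is the parent of front i in the (hypothetical) elimination
# tree, with -1 marking the root: fronts 0 and 1 feed front 2, fronts
# 2 and 3 feed the root front 4.
parent = [2, 2, 4, 4, -1]

def schedule(parent):
    n = len(parent)
    pending = [0] * n               # unfinished children per front
    for p in parent:
        if p >= 0:
            pending[p] += 1
    ready = deque(i for i in range(n) if pending[i] == 0)  # the leaves
    order = []
    while ready:
        node = ready.popleft()      # an idle processor grabs a ready front
        order.append(node)          # ... and "factorizes" it
        p = parent[node]
        if p >= 0:
            pending[p] -= 1
            if pending[p] == 0:
                ready.append(p)     # last child done: parent becomes ready
    return order

order = schedule(parent)
```

The leaves (fronts 0, 1, and 3 here) are mutually independent and could be processed simultaneously on different processors; the only hard constraint the scheduler must respect is that every front is processed after all of its children.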