Finite element methods in scientific computing
Parallelization on a cluster of distributed memory machines
Wolfgang Bangerth, http://www.dealii.org/
Shared memory (previous lectures):
Advantage:
● Makes parallelization simpler
Disadvantages:
● Problem size limited by
  – number of cores on your machine
  – amount of memory on your machine
  – memory bandwidth
● Need synchronization via locks
● Makes it too easy to avoid hard decisions
Example:
● Only one Triangulation, DoFHandler, matrix, rhs vector
● Multiple threads work in parallel to
  – assemble linear system
  – perform matrix-vector products
  – estimate the error per cell
  – generate graphical output for each cell
● All threads access the same global objects
This lecture:
● Multiple machines with their own address spaces
● No direct access to remote data
● Data has to be transported explicitly between machines
Advantages:
● (Almost) unlimited number of cores and memory
● Often scales better in practice
Disadvantages:
● Much more complicated programming model
● Requires an entirely different way of thinking
● Practical difficulties in debugging, profiling, ...
Distributed memory
Example:
● One Triangulation, DoFHandler, matrix, rhs vector object
  per processor
● The union of these objects represents the global object
● Multiple programs work in parallel to
  – assemble their part of the linear system
  – perform their part of the matrix-vector products
  – estimate the error on their cells
  – generate graphical output for each of their cells
● Each program only accesses its part of the global objects
See step-40/32/42 and the “Parallel computing with multiple
processors using distributed memory” module
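A minimal sketch (not part of the original slides) of how such per-process objects are set up in deal.II, loosely following step-40; the hyper_cube mesh, the refinement level, and the Q1 element below are arbitrary choices for illustration:

  #include <deal.II/base/index_set.h>
  #include <deal.II/base/mpi.h>
  #include <deal.II/distributed/tria.h>
  #include <deal.II/dofs/dof_handler.h>
  #include <deal.II/fe/fe_q.h>
  #include <deal.II/grid/grid_generator.h>

  using namespace dealii;

  int main (int argc, char *argv[])
  {
    // Start MPI (and shut it down automatically at the end of main)
    Utilities::MPI::MPI_InitFinalize mpi_initialization (argc, argv, 1);

    const unsigned int dim = 2;

    // Every process stores only its own cells of the globally refined mesh
    parallel::distributed::Triangulation<dim> triangulation (MPI_COMM_WORLD);
    GridGenerator::hyper_cube (triangulation);
    triangulation.refine_global (5);

    // Every process enumerates degrees of freedom only for the cells it owns
    FE_Q<dim>       fe (1);
    DoFHandler<dim> dof_handler (triangulation);
    dof_handler.distribute_dofs (fe);

    // The locally owned part of the global set of unknowns
    const IndexSet locally_owned_dofs = dof_handler.locally_owned_dofs ();
  }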
Distributed memory
How to program machines with distributed memory:
● Remote procedure calls (RPC)
● Partitioned global address space (PGAS) languages:
  – Unified Parallel C (UPC – an extension to C)
  – Coarray Fortran (part of Fortran 2008)
  – Chapel, X10, Titanium
● Message passing (MPI), the approach used in the rest of this lecture
● Processes can send “messages” to other processes…
● …but nothing happens if the other side is not listening
● Instead, option 1:
  – you need to send a request message
  – other side has to pick up message
  – other side has to know what to do
  – other side has to send a message with the data
  – you have to pick up message
● Option 2:
  – depending on phase of program, I know when someone else needs my data → send it
  – I will know who sent me data → go get it
Message Passing Interface (MPI)
MPI implementations:
● MPI is defined as a set of
  – functions
  – data types
  – constants
  with bindings to C and Fortran
● MPI is not a language of its own
● MPI programs can be compiled by a standard C/Fortran compiler
● They are typically compiled using a specific compiler wrapper:
    mpicc  -c myprog.c   -o myprog.o
    mpiCC  -c myprog.cc  -o myprog.o
    mpif90 -c myprog.f90 -o myprog.o
● Bindings to many other languages exist
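To make the wrapper commands concrete, here is a minimal MPI program (a sketch, not taken from the original slides) that any of them could compile; every process determines its rank and the total number of processes:

  #include <mpi.h>
  #include <stdio.h>

  int main (int argc, char *argv[])
  {
    MPI_Init (&argc, &argv);                 // start the MPI runtime

    int rank, size;
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);   // which process am I?
    MPI_Comm_size (MPI_COMM_WORLD, &size);   // how many processes are there in total?

    printf ("Hello from process %d of %d\n", rank, size);

    MPI_Finalize ();                         // shut the MPI runtime down again
    return 0;
  }

Running the resulting executable with, e.g., mpirun -np 4 ./myprog starts four copies of the same program that differ only in the rank reported to them.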
Message Passing Interface (MPI)
double d = foo();
MPI_Send (/*data=*/&d, /*count=*/1, /*type=*/MPI_DOUBLE,
/*dest=*/13, /*tag=*/42,
/*universe=*/MPI_COMM_WORLD);
Notes:
● MPI_Send blocks the program: function only returns
  when the data is out the door
● MPI_Recv blocks the program: function only returns when
  – a message has come in
  – the data is in the final location
● There are also non-blocking start/end versions
  (MPI_Isend, MPI_Irecv, MPI_Wait)
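For illustration (not from the original slides), the receiving side matching the MPI_Send call above could look like this on rank 13; using MPI_ANY_SOURCE is an assumption, the sender's rank could just as well be given explicitly:

  double d;
  MPI_Status status;
  MPI_Recv (/*data=*/&d, /*count=*/1, /*type=*/MPI_DOUBLE,
            /*source=*/MPI_ANY_SOURCE, /*tag=*/42,
            /*universe=*/MPI_COMM_WORLD, &status);

The non-blocking variant starts the receive, lets the program do other work, and only waits when the data is actually needed:

  MPI_Request request;
  MPI_Irecv (&d, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 42, MPI_COMM_WORLD, &request);
  // ... do something useful while the message is in flight ...
  MPI_Wait (&request, MPI_STATUS_IGNORE);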
Example: Timing
  MPI_Barrier (MPI_COMM_WORLD);                                // wait until all processes get here
  const auto start_global = std::chrono::steady_clock::now();  // get current time

  … do something …

  MPI_Barrier (MPI_COMM_WORLD);                                // wait until all processes are done
  const auto end_global = std::chrono::steady_clock::now();    // get current time
Example: Reduction
parallel::distributed::Triangulation<dim> triangulation;
… create triangulation …
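The code belonging to this example did not survive extraction; a plausible sketch (my reconstruction) that sums the number of locally owned cells onto process 0 with MPI_Reduce:

  // number of cells this process owns
  unsigned int n_local_cells = triangulation.n_locally_owned_active_cells ();

  // sum the per-process counts; only process 0 receives the result
  unsigned int n_global_cells = 0;
  MPI_Reduce (&n_local_cells, &n_global_cells, 1, MPI_UNSIGNED,
              MPI_SUM, /*root=*/0, MPI_COMM_WORLD);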
Example: AllReduce
parallel::distributed::Triangulation<dim> triangulation;
… create triangulation …
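Again only these setup lines survived; a sketch of the corresponding all-reduce, after which every process (not just a root) knows the global sum:

  unsigned int n_local_cells = triangulation.n_locally_owned_active_cells ();

  unsigned int n_global_cells = 0;
  MPI_Allreduce (&n_local_cells, &n_global_cells, 1, MPI_UNSIGNED,
                 MPI_SUM, MPI_COMM_WORLD);

deal.II also wraps this pattern, e.g. as Utilities::MPI::sum(n_local_cells, MPI_COMM_WORLD).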
Communicators:
● One can form subsets of a communicator
● These form the basis for collective operations among a subset
  of processes
● Useful if subsets of processors do different tasks
● MPI provides ways to make this more efficient
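A sketch of forming such a subset with MPI_Comm_split; splitting the processes into even and odd ranks is an arbitrary choice for illustration:

  int rank;
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);

  // processes with the same 'color' end up in the same sub-communicator
  const int color = rank % 2;
  MPI_Comm subset_comm;
  MPI_Comm_split (MPI_COMM_WORLD, color, /*key=*/rank, &subset_comm);

  // collective operations on subset_comm only involve that half of the processes
  double local = 1.0, sum = 0.0;
  MPI_Allreduce (&local, &sum, 1, MPI_DOUBLE, MPI_SUM, subset_comm);

  MPI_Comm_free (&subset_comm);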
Also in MPI:
● “One-sided communication”: directly writing into and
  reading from another process's memory space
● Topologies: mapping network characteristics to MPI
● Starting additional MPI processes
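One-sided communication is not used in the remainder of this lecture; purely as a sketch (assuming at least two processes), a local array is exposed as a “window” and process 0 then writes into process 1's memory directly:

  double buffer[10] = {0};
  MPI_Win win;
  MPI_Win_create (buffer, 10 * sizeof(double), sizeof(double),
                  MPI_INFO_NULL, MPI_COMM_WORLD, &win);

  MPI_Win_fence (0, win);                      // open an access epoch on all processes
  int rank;
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);
  if (rank == 0)
    {
      double d = 3.14;
      // write d into element 5 of the window exposed by process 1
      MPI_Put (&d, 1, MPI_DOUBLE, /*target_rank=*/1,
               /*target_disp=*/5, 1, MPI_DOUBLE, win);
    }
  MPI_Win_fence (0, win);                      // close the epoch: the data is now visible

  MPI_Win_free (&win);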
An MPI example: MatVec
Situation:
● Multiply a large N×N matrix by a vector of size N
● The matrix is assumed to be dense
● Every one of the P processors stores N/P rows of the matrix
● Every processor stores N/P elements of each vector
● For simplicity: N is a multiple of P
struct ParallelVector {
  unsigned int size;                 // global number of vector elements
  unsigned int my_elements_begin;    // first element stored on this process
  unsigned int my_elements_end;      // one past the last element stored here
  double      *elements;             // the locally stored elements
};

struct ParallelSquareMatrix {
  unsigned int size;                 // global number of rows and columns
  unsigned int my_rows_begin;        // first row stored on this process
  unsigned int my_rows_end;          // one past the last row stored here
  double      *elements;             // the locally stored rows
};
[Figure: matrix-vector product y = A·x with the rows of A and the elements of x and y partitioned across processors]
● To compute the locally owned elements of y, each processor
  needs all elements of x
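The implementation that followed on the original slides is missing from the extracted text. A straightforward, if memory-hungry, sketch using the structs above: every process first gathers all of x into a temporary buffer with MPI_Allgather and then multiplies its own rows (this assumes N is a multiple of P and that A.elements stores the locally owned rows contiguously in row-major order):

  void mat_vec (const struct ParallelSquareMatrix *A,
                const struct ParallelVector       *x,
                struct ParallelVector             *y)
  {
    const unsigned int n_local = x->my_elements_end - x->my_elements_begin;

    // gather the local pieces of x from all processes into one full vector
    double *x_all = (double*) malloc (x->size * sizeof(double));
    MPI_Allgather (x->elements, n_local, MPI_DOUBLE,
                   x_all,       n_local, MPI_DOUBLE, MPI_COMM_WORLD);

    // multiply the locally stored rows of A by the full vector x
    for (unsigned int i = A->my_rows_begin; i < A->my_rows_end; ++i)
      {
        double sum = 0;
        for (unsigned int j = 0; j < A->size; ++j)
          sum += A->elements[(i - A->my_rows_begin) * A->size + j] * x_all[j];
        y->elements[i - y->my_elements_begin] = sum;
      }

    free (x_all);    // allocated and freed again on every call (see the next bullet)
  }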
An MPI example: MatVec
● We repeatedly allocate/deallocate memory – should set up buffer only once
An MPI example: MatVec
col_block = my_rank;    // start with the block of columns whose x elements we store locally
for (i=A.my_rows_begin; i<A.my_rows_end; ++i)
  for (j=A.size/comm_size*col_block; ...)
    y.elements[i-y.my_elements_begin] += A[...i,j...] * x[...j...];
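A fleshed-out version of what this fragment might be doing (my assumption about its intent, not necessarily what the original slide showed): a ring exchange in which each process only ever stores one block of x of size N/P, multiplies against it, and then passes it on with MPI_Sendrecv_replace. It reuses my_rank, comm_size, A, x, y from the surrounding code and assumes the usual <stdlib.h>/<string.h> headers:

  const unsigned int block = A.size / comm_size;
  double *x_block = (double*) malloc (block * sizeof(double));
  memcpy (x_block, x.elements, block * sizeof(double));    // start with our own block of x

  for (unsigned int i = 0; i < block; ++i)                 // each process owns N/P elements of y
    y.elements[i] = 0;

  unsigned int col_block = my_rank;
  for (int step = 0; step < comm_size; ++step)
    {
      // multiply the locally owned rows of A against the current block of x
      for (unsigned int i = A.my_rows_begin; i < A.my_rows_end; ++i)
        for (unsigned int j = col_block * block; j < (col_block + 1) * block; ++j)
          y.elements[i - y.my_elements_begin] +=
            A.elements[(i - A.my_rows_begin) * A.size + j] * x_block[j - col_block * block];

      // pass our block to the left neighbor, receive the next block from the right
      MPI_Sendrecv_replace (x_block, block, MPI_DOUBLE,
                            /*dest=*/  (my_rank + comm_size - 1) % comm_size, /*sendtag=*/0,
                            /*source=*/(my_rank + 1) % comm_size,             /*recvtag=*/0,
                            MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      col_block = (col_block + 1) % comm_size;
    }

  free (x_block);

In contrast to the all-gather version above, this variant never stores more than N/P elements of x per process, at the price of P communication steps.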
● Distributed computing lives in the conflict zone between
  – trying to keep as much data available locally to avoid
    communication
  – not creating a memory/CPU bottleneck
● MPI makes the flow of information explicit
● Forces programmer to design data structures/algorithms
  for communication
● Typical programs have relatively few MPI calls
Message Passing Interface (MPI)
Alternatives to MPI:
● boost::mpi is nice, but doesn't buy much in practice
● Partitioned Global Address Space (PGAS) languages like
  Co-Array Fortran, UPC, Chapel, X10, …:
  Pros:
  – offer nicer syntax
  – communication is part of the language
  Cons:
  – typically no concept of “communicators”
  – communication is implicit
  – encourages poor data structure/algorithm design