
MPI: Collective Communications & Data Distributions

Dr. Mian M. Hamayun
[email protected]
http://seecs.nust.edu.pk/faculty/mianhamayun.html
Some material re-used from Mohamed Zahran (NYU)
Today’s Lecture
 Dealing with I/O
 The Trapezoidal Rule in MPI
 Reductions in MPI
 Collective vs. Point-to-Point Communication
 Data Distributions

Copyright © 2010, Elsevier Inc. All rights Reserved 1


SPMD
 Single-Program, Multiple-Data
 We compile one program.
 Process 0 does something different.
 Receives messages and prints them while the
other processes do the work.
 The if-else construct makes our program SPMD.

Copyright © 2010, Elsevier Inc. All rights Reserved 2


Dealing with I/O

In all MPI implementations, all processes in
MPI_COMM_WORLD have access to stdout
and stderr.

BUT... in most implementations there is no
scheduling of access to output devices!

Copyright © 2010, Elsevier Inc. All rights Reserved 3


Running with 6 Processes

unpredictable output!!
• Processes are competing for stdout
• Result: non-determinism!

Copyright © 2010, Elsevier Inc. All rights Reserved 4


How about Input?
 Most MPI implementations only allow
process 0 in MPI_COMM_WORLD
access to stdin.
 Process 0 must read the data and send it to
the other processes.

Copyright © 2010, Elsevier Inc. All rights Reserved 5


Function for reading user input

Copyright © 2010, Elsevier Inc. All rights Reserved 6
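
The code for this slide is not reproduced here. Below is a minimal sketch of such a function, assuming the trapezoidal-rule setting of the following slides (the function and parameter names are illustrative): process 0 reads a, b, and n and forwards them to the other processes with point-to-point messages.

#include <stdio.h>
#include <mpi.h>

/* Sketch: process 0 reads a, b, and n, then sends them to every other
 * process; the other processes receive them. */
void Get_input(int my_rank, int comm_sz, double* a_p, double* b_p, int* n_p) {
    if (my_rank == 0) {
        printf("Enter a, b, and n\n");
        scanf("%lf %lf %d", a_p, b_p, n_p);
        for (int dest = 1; dest < comm_sz; dest++) {
            MPI_Send(a_p, 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
            MPI_Send(b_p, 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
            MPI_Send(n_p, 1, MPI_INT,    dest, 0, MPI_COMM_WORLD);
        }
    } else {
        MPI_Recv(a_p, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(b_p, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(n_p, 1, MPI_INT,    0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}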


TRAPEZOIDAL RULE IN MPI

Copyright © 2010, Elsevier Inc. All rights Reserved 7


The Trapezoidal Rule

Copyright © 2010, Elsevier Inc. All rights Reserved 8


One trapezoid

Copyright © 2010, Elsevier Inc. All rights Reserved 9


The Trapezoidal Rule

Copyright © 2010, Elsevier Inc. All rights Reserved 10
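
The figures on these slides are not reproduced here. For reference, the standard trapezoidal rule they illustrate is (with h the width of each trapezoid):

h = (b - a) / n,   x_i = a + i*h   for i = 0, 1, ..., n

Area of one trapezoid  =  (h / 2) * [ f(x_i) + f(x_{i+1}) ]

Integral estimate  =  h * [ f(x_0)/2 + f(x_1) + f(x_2) + ... + f(x_{n-1}) + f(x_n)/2 ]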


Pseudo-code for a serial program

Copyright © 2010, Elsevier Inc. All rights Reserved 11
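
The pseudo-code itself is not reproduced here; a minimal serial sketch in C, assuming an example integrand f(x) = x*x and the formulas above, is:

#include <stdio.h>

double f(double x) { return x * x; }   /* example integrand (assumption) */

/* Serial trapezoidal rule: estimate the integral of f over [a, b] with n trapezoids. */
double Trap(double a, double b, int n) {
    double h = (b - a) / n;
    double approx = (f(a) + f(b)) / 2.0;   /* endpoints count half */
    for (int i = 1; i <= n - 1; i++)
        approx += f(a + i * h);            /* interior points count fully */
    return h * approx;
}

int main(void) {
    printf("Integral of x^2 over [0, 3] ~= %f\n", Trap(0.0, 3.0, 1024));
    return 0;
}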


Parallelizing the Trapezoidal Rule
1. Partition problem solution into tasks.
2. Identify communication channels between
tasks.
3. Aggregate tasks into composite tasks.
4. Map composite tasks to cores.

Copyright © 2010, Elsevier Inc. All rights Reserved 12


Tasks and communications for
Trapezoidal Rule

Copyright © 2010, Elsevier Inc. All rights Reserved 13


Parallel pseudo-code

Copyright © 2010, Elsevier Inc. All rights Reserved 14


First version (1)

Copyright © 2010, Elsevier Inc. All rights Reserved 15


First version (2)

Copyright © 2010, Elsevier Inc. All rights Reserved 16


First version (3)

Copyright © 2010, Elsevier Inc. All rights Reserved 17
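
The three code slides are not reproduced here. A self-contained sketch along the same lines (variable names such as local_a, local_b, and local_n are illustrative, and comm_sz is assumed to evenly divide n): each process integrates its own subinterval, and process 0 collects the partial results with point-to-point messages.

#include <stdio.h>
#include <mpi.h>

double f(double x) { return x * x; }   /* example integrand (assumption) */

/* Local trapezoidal rule over [left, right] with `count` trapezoids of width h. */
double Trap(double left, double right, int count, double h) {
    double approx = (f(left) + f(right)) / 2.0;
    for (int i = 1; i <= count - 1; i++)
        approx += f(left + i * h);
    return h * approx;
}

int main(void) {
    int my_rank, comm_sz, n = 1024, local_n;
    double a = 0.0, b = 3.0, h, local_a, local_b, local_int, total_int;

    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);

    h = (b - a) / n;                       /* width of every trapezoid      */
    local_n = n / comm_sz;                 /* trapezoids per process        */
    local_a = a + my_rank * local_n * h;   /* this process's subinterval    */
    local_b = local_a + local_n * h;
    local_int = Trap(local_a, local_b, local_n, h);

    if (my_rank != 0) {                    /* workers send partial results  */
        MPI_Send(&local_int, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    } else {                               /* process 0 collects and prints */
        total_int = local_int;
        for (int source = 1; source < comm_sz; source++) {
            MPI_Recv(&local_int, 1, MPI_DOUBLE, source, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            total_int += local_int;
        }
        printf("With n = %d trapezoids, the estimate of the integral from "
               "%f to %f is %.15e\n", n, a, b, total_int);
    }

    MPI_Finalize();
    return 0;
}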


COLLECTIVE
COMMUNICATION

Copyright © 2010, Elsevier Inc. All rights Reserved 18


The Global Sum ... Again!!
1. In the first phase:
(a) Process 1 sends to 0, 3 sends to 2, 5 sends to 4, and 7 sends to 6.
(b) Processes 0, 2, 4, and 6 add in the received values.
(c) Processes 2 and 6 send their new values to processes 0 and 4, respectively.
(d) Processes 0 and 4 add the received values into their new values.

2. (a) Process 4 sends its newest value to process 0.
(b) Process 0 adds the received value to its newest value.

Copyright © 2010, Elsevier Inc. All rights Reserved 19
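
As a concrete illustration of the scheme just described, here is a sketch (not the slides' own code; names such as Tree_sum and local_val are illustrative) that implements the tree-structured sum with point-to-point calls.

#include <mpi.h>

/* Tree-structured global sum of one double per process, as described above.
 * The result is meaningful only on process 0. */
double Tree_sum(double local_val, int my_rank, int comm_sz, MPI_Comm comm) {
    double my_sum = local_val, recv_val;

    for (int gap = 1; gap < comm_sz; gap *= 2) {
        if (my_rank % (2 * gap) == 0) {          /* receiver in this phase  */
            int partner = my_rank + gap;
            if (partner < comm_sz) {
                MPI_Recv(&recv_val, 1, MPI_DOUBLE, partner, 0, comm,
                         MPI_STATUS_IGNORE);
                my_sum += recv_val;
            }
        } else {                                 /* sender: pass result on  */
            MPI_Send(&my_sum, 1, MPI_DOUBLE, my_rank - gap, 0, comm);
            break;                               /* a sender is then done   */
        }
    }
    return my_sum;
}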


A tree-structured global sum

Copyright © 2010, Elsevier Inc. All rights Reserved 20


An alternative tree-structured global sum

Is this better, or is the previous one?

A: It depends on the underlying system!

Copyright © 2010, Elsevier Inc. All rights Reserved 21


Reduction
 Reducing a set of numbers into a smaller
set of numbers via a function.
 Example: reducing the group [1, 2, 3, 4, 5]
with the sum function → 15
 MPI provides a handy function that handles
almost all of the common reductions that a
programmer needs to do in a parallel
application.

Copyright © 2010, Elsevier Inc. All rights Reserved 22


Reduction Examples

Every process has an element

Every process has an array of elements


Copyright © 2010, Elsevier Inc. All rights Reserved 23
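
For instance (a sketch; the variable names and the buffer size are assumptions), the same call handles both cases: with count = 1 each process contributes a single value, and with count = n the reduction is applied element-wise.

#include <mpi.h>

void Reduce_examples(double x, double a[], int n, MPI_Comm comm) {
    double sum;           /* single result, significant only on process 0 */
    double totals[128];   /* element-wise results; assumes n <= 128 here  */

    /* Every process has one element: sum them onto process 0. */
    MPI_Reduce(&x, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, comm);

    /* Every process has an array: totals[i] on process 0 becomes the sum
     * of a[i] over all processes. */
    MPI_Reduce(a, totals, n, MPI_DOUBLE, MPI_SUM, 0, comm);
}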
MPI_Reduce
 The input and output buffers each have size sizeof(datatype) * count.
 The output buffer (output_data_p) is only relevant on dest_process.
 MPI_Reduce is called by all processes involved.


Copyright © 2010, Elsevier Inc. All rights Reserved 24
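
For reference, the prototype (written with the descriptive parameter names used on these slides; the MPI standard itself names them sendbuf, recvbuf, op, and root):

int MPI_Reduce(
    void*        input_data_p,   /* in:  each process's local data          */
    void*        output_data_p,  /* out: result, significant only on        */
                                 /*      dest_process                       */
    int          count,          /* number of elements in the buffers       */
    MPI_Datatype datatype,       /* type of the elements                    */
    MPI_Op       operator,       /* e.g. MPI_SUM, MPI_MAX, ...              */
    int          dest_process,   /* rank that receives the result           */
    MPI_Comm     comm);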
Predefined reduction operators in MPI

For MPI_MAXLOC / MPI_MINLOC, the "location" returned is the rank of the
process that owns the maximum / minimum value.


Copyright © 2010, Elsevier Inc. All rights Reserved 25
Collective vs. Point-to-Point Communications
 All the processes in the communicator must call
the same collective function.
 For example, a program that attempts to match a call
to MPI_Reduce on one process with a call to
MPI_Recv on another process is erroneous, and, in all
likelihood, the program will hang or crash.
 The arguments passed by each process to an
MPI collective communication must be
“compatible.”
 For example, if one process passes in 0 as the
dest_process and another passes in 1, then the
outcome of a call to MPI_Reduce is erroneous, and,
once again, the program is likely to hang or crash.

Copyright © 2010, Elsevier Inc. All rights Reserved 26


Collective vs. Point-to-Point Communications

 The output_data_p argument is only used
on dest_process.
 However, all of the processes still need to
pass in an actual argument corresponding
to output_data_p, even if it’s just NULL.
 All collective communication calls are
blocking.

Copyright © 2010, Elsevier Inc. All rights Reserved 27


Collective vs. Point-to-Point Communications

 Point-to-point communications are
matched on the basis of tags and
communicators.
 Collective communications don’t use tags.
 They’re matched solely on the basis of the
communicator and the order in which
they’re called.

Copyright © 2010, Elsevier Inc. All rights Reserved 28


Example (1)

Assume:
 all processes use the operator MPI_SUM

 and the destination is process 0

What will be the final values of b and d?

Copyright © 2010, Elsevier Inc. All rights Reserved 29
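
The code from the slide is not reproduced here. A reconstruction consistent with the discussion on the next slide (the initial values a = 1 and c = 2 on every process are assumptions) is:

#include <mpi.h>

void Example(int my_rank, MPI_Comm comm) {
    int a = 1, c = 2;    /* assumed initial values on every process     */
    int b = 0, d = 0;

    if (my_rank == 1) {  /* process 1 issues its reductions in the      */
                         /* opposite order from processes 0 and 2       */
        MPI_Reduce(&c, &d, 1, MPI_INT, MPI_SUM, 0, comm);
        MPI_Reduce(&a, &b, 1, MPI_INT, MPI_SUM, 0, comm);
    } else {
        MPI_Reduce(&a, &b, 1, MPI_INT, MPI_SUM, 0, comm);
        MPI_Reduce(&c, &d, 1, MPI_INT, MPI_SUM, 0, comm);
    }
    /* The resulting values of b and d on process 0 are worked out on
     * the next slide. */
}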


Example (2)
 At first glance, it might seem that after the two
calls to MPI_Reduce, the value of b will be 3,
and the value of d will be 6.
 However, the names of the memory locations
are irrelevant to the matching of the calls to
MPI_Reduce.
 The order of the calls determines the
matching, so the value stored in b will be
1 + 2 + 1 = 4, and the value stored in d will be
2 + 1 + 2 = 5.

Copyright © 2010, Elsevier Inc. All rights Reserved 30


Another Example
MPI_Reduce(&x, &x, 1, MPI_DOUBLE, MPI_SUM, 0, comm);

Passing the same buffer (&x) as both the input and the output (aliasing) is
illegal in MPI, and the result is unpredictable!

Copyright © 2010, Elsevier Inc. All rights Reserved 31
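
If the in-place effect is what is wanted, MPI (since MPI-2) provides MPI_IN_PLACE for this purpose; a sketch (function name illustrative):

#include <mpi.h>

double In_place_sum(double x, int my_rank, MPI_Comm comm) {
    if (my_rank == 0)   /* root: take its own contribution from x itself   */
        MPI_Reduce(MPI_IN_PLACE, &x, 1, MPI_DOUBLE, MPI_SUM, 0, comm);
    else                /* non-root: the output buffer is not significant  */
        MPI_Reduce(&x, NULL, 1, MPI_DOUBLE, MPI_SUM, 0, comm);
    return x;           /* the global sum on process 0; unchanged elsewhere */
}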


Global Sum: Update All

A global sum followed by distribution of the
result.

Copyright © 2010, Elsevier Inc. All rights Reserved 32


Global Sum: Exchange Partial Results

A butterfly-structured global sum.

Copyright © 2010, Elsevier Inc. All rights Reserved 33


MPI_Allreduce

 Useful in a situation in which all of the
processes need the result of a global sum
in order to complete some larger
computation.

Note: No destination argument is required!


Copyright © 2010, Elsevier Inc. All rights Reserved 34
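
For reference, the prototype (parameter names follow the slides' convention; the standard names them sendbuf, recvbuf, and op):

int MPI_Allreduce(
    void*        input_data_p,
    void*        output_data_p,  /* out: the result ends up on EVERY process */
    int          count,
    MPI_Datatype datatype,
    MPI_Op       operator,
    MPI_Comm     comm);          /* note: no dest_process argument           */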
Broadcast
 Data belonging to a single process is sent
to all of the processes in the
communicator.

All processes in the communicator must call MPI_Bcast()

Copyright © 2010, Elsevier Inc. All rights Reserved 35
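
For reference, the prototype (parameter names follow the slides' style; the standard names them buffer, count, datatype, root, and comm):

int MPI_Bcast(
    void*        data_p,       /* in/out: sent by source_proc, received by the rest */
    int          count,
    MPI_Datatype datatype,
    int          source_proc,  /* the process that owns the data                    */
    MPI_Comm     comm);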


A tree-structured broadcast.

Copyright © 2010, Elsevier Inc. All rights Reserved 36


A version of Get_input that uses MPI_Bcast

Copyright © 2010, Elsevier Inc. All rights Reserved 37
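
The code is not reproduced here; a minimal sketch of such a version (names as in the earlier Get_input sketch) is:

#include <stdio.h>
#include <mpi.h>

/* Process 0 reads a, b, and n; the three broadcasts deliver them everywhere. */
void Get_input(int my_rank, double* a_p, double* b_p, int* n_p) {
    if (my_rank == 0) {
        printf("Enter a, b, and n\n");
        scanf("%lf %lf %d", a_p, b_p, n_p);
    }
    MPI_Bcast(a_p, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(b_p, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(n_p, 1, MPI_INT, 0, MPI_COMM_WORLD);
}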


Collective vs. Point-to-Point – Summary

Collective: involves all processes in the communicator; matched by the
communicator and the order of the calls; no tags.
Point-to-Point: involves a pair of processes; matched on the basis of tags
and communicators.

Copyright © 2010, Elsevier Inc. All rights Reserved 38


Data distributions

Compute a vector sum – Serial Version

Copyright © 2010, Elsevier Inc. All rights Reserved 39
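
The serial code is not reproduced here; a minimal sketch is:

/* Serial vector sum: z[i] = x[i] + y[i] for an n-component vector. */
void Vector_sum(double x[], double y[], double z[], int n) {
    for (int i = 0; i < n; i++)
        z[i] = x[i] + y[i];
}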


Partitioning options
 Block partitioning
 Assign blocks of consecutive components to
each process.
 Cyclic partitioning
 Assign components in a round-robin fashion.
 Block-cyclic partitioning
 Use a cyclic distribution of blocks of
components.

Copyright © 2010, Elsevier Inc. All rights Reserved 40


Different partitions of a 12-component
vector among 3 processes

Copyright © 2010, Elsevier Inc. All rights Reserved 41
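
The figure is not reproduced here. As an illustration (a block size of 2 is assumed for the block-cyclic case), the 12 components would be assigned as follows, together with the corresponding owner computations:

/* n = 12 components, p = 3 processes:
 *   block:        P0: 0 1 2 3    P1: 4 5 6 7     P2: 8 9 10 11
 *   cyclic:       P0: 0 3 6 9    P1: 1 4 7 10    P2: 2 5 8 11
 *   block-cyclic: P0: 0 1 6 7    P1: 2 3 8 9     P2: 4 5 10 11   (block size 2)
 *
 * Owner of component i (assumes p divides n, and b is the block size): */
int block_owner(int i, int n, int p)        { return i / (n / p); }
int cyclic_owner(int i, int p)              { return i % p; }
int block_cyclic_owner(int i, int b, int p) { return (i / b) % p; }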


Parallel implementation of
vector addition

How will you distribute parts of x[] and y[] to the
processes?

Copyright © 2010, Elsevier Inc. All rights Reserved 42
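
One common answer, used on the following slides, is a block distribution. A sketch of the local computation, assuming each process already holds local_n = n / comm_sz components of each vector (names illustrative):

/* Each process adds its own blocks: local_z[i] = local_x[i] + local_y[i]. */
void Parallel_vector_sum(double local_x[], double local_y[],
                         double local_z[], int local_n) {
    for (int local_i = 0; local_i < local_n; local_i++)
        local_z[local_i] = local_x[local_i] + local_y[local_i];
}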


Scatter
 MPI_Scatter can be used in a function that
reads in an entire vector on process 0 but
only sends the needed components to
each of the other processes.

 send_count is the amount of data going to each process.
 All arguments are important for the source process (process 0 in our case).
 For all other processes, only recv_buf_p, recv_count, recv_type, src_proc,
and comm are important.
Copyright © 2010, Elsevier Inc. All rights Reserved 43
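
For reference, the prototype (argument names match those used on the slide; the standard names them sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm):

int MPI_Scatter(
    void*        send_buf_p,   /* in: all the data; significant only on src_proc */
    int          send_count,   /* amount of data going to EACH process           */
    MPI_Datatype send_type,
    void*        recv_buf_p,   /* out: this process's block                      */
    int          recv_count,
    MPI_Datatype recv_type,
    int          src_proc,     /* the process that owns the full vector          */
    MPI_Comm     comm);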
Reading and distributing a vector

Copyright © 2010, Elsevier Inc. All rights Reserved 44
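
The code is not reproduced here. A sketch of such a function (names illustrative; comm_sz is assumed to evenly divide n): process 0 reads the whole n-element vector and scatters local_n elements to every process.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

void Read_vector(double local_a[], int local_n, int n,
                 int my_rank, MPI_Comm comm) {
    double* a = NULL;
    if (my_rank == 0) {
        a = malloc(n * sizeof(double));
        printf("Enter the %d elements of the vector\n", n);
        for (int i = 0; i < n; i++) scanf("%lf", &a[i]);
        MPI_Scatter(a, local_n, MPI_DOUBLE, local_a, local_n, MPI_DOUBLE,
                    0, comm);
        free(a);
    } else {
        /* send_buf_p (a) is NULL here: it is ignored on non-source processes */
        MPI_Scatter(a, local_n, MPI_DOUBLE, local_a, local_n, MPI_DOUBLE,
                    0, comm);
    }
}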


Scatter

 send_buf_p
 is not used except by the sender.
 However, it must be defined or NULL on the other processes for the
code to be correct.
 Must have at least communicator size * send_count elements.
 All processes must call MPI_Scatter, not only the sender.
 send_count is the amount of data sent to each process.
 recv_buf_p must have at least send_count elements.
 MPI_Scatter uses a block distribution.
Copyright © 2010, Elsevier Inc. All rights Reserved 45
Scatter

Copyright © 2010, Elsevier Inc. All rights Reserved 46


Gather
 Collect all of the components of the vector
onto process 0, and then process 0 can
process all of the components.

All arguments are important for the destination process.


For all other processes, only send_buf_p, send_count, send_type,
dest_proc, and comm are important
Copyright © 2010, Elsevier Inc. All rights Reserved 47
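
For reference, the prototype (argument names match those used on the slide; the standard names them sendbuf, sendcount, ..., root):

int MPI_Gather(
    void*        send_buf_p,   /* in: this process's block                  */
    int          send_count,
    MPI_Datatype send_type,
    void*        recv_buf_p,   /* out: concatenated data; only on dest_proc */
    int          recv_count,   /* amount of data received from EACH process */
    MPI_Datatype recv_type,
    int          dest_proc,
    MPI_Comm     comm);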
Print a distributed vector (1)

Copyright © 2010, Elsevier Inc. All rights Reserved 48


Print a distributed vector (2)

Copyright © 2010, Elsevier Inc. All rights Reserved 49
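
The two code slides are not reproduced here. A sketch of such a function (names illustrative; comm_sz is assumed to evenly divide n): the local blocks are gathered onto process 0, which prints the whole vector.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

void Print_vector(double local_b[], int local_n, int n, const char title[],
                  int my_rank, MPI_Comm comm) {
    double* b = NULL;
    if (my_rank == 0) {
        b = malloc(n * sizeof(double));
        MPI_Gather(local_b, local_n, MPI_DOUBLE, b, local_n, MPI_DOUBLE,
                   0, comm);
        printf("%s\n", title);
        for (int i = 0; i < n; i++) printf("%f ", b[i]);
        printf("\n");
        free(b);
    } else {
        /* recv_buf_p (b) is NULL here: it is ignored on non-destination processes */
        MPI_Gather(local_b, local_n, MPI_DOUBLE, b, local_n, MPI_DOUBLE,
                   0, comm);
    }
}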


Allgather
 Concatenates the contents of each
process’ send_buf_p and stores this in
each process’ recv_buf_p.
 As usual, recv_count is the amount of data
being received from each process.

Copyright © 2010, Elsevier Inc. All rights Reserved 50
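
For reference, the prototype (argument names follow the slides' convention; the standard names them sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm):

int MPI_Allgather(
    void*        send_buf_p,   /* in: this process's block                  */
    int          send_count,
    MPI_Datatype send_type,
    void*        recv_buf_p,   /* out: concatenated result on EVERY process */
    int          recv_count,   /* amount of data received from each process */
    MPI_Datatype recv_type,
    MPI_Comm     comm);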


Matrix-vector multiplication

The i-th component of y is the dot product of the i-th row of A with x:

y_i = a_i0 * x_0 + a_i1 * x_1 + ... + a_i,n-1 * x_n-1

Copyright © 2010, Elsevier Inc. All rights Reserved 51


Matrix-vector multiplication

Serial pseudo-code

Copyright © 2010, Elsevier Inc. All rights Reserved 52


C style arrays

A two-dimensional C array is stored as a single one-dimensional array in
row-major order: for an array with n columns, element A[i][j] is stored at
offset i*n + j.

Copyright © 2010, Elsevier Inc. All rights Reserved 53


Serial matrix-vector multiplication

What if x[] is distributed among the different processes?


Copyright © 2010, Elsevier Inc. All rights Reserved 54
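
The serial code is not reproduced here; a minimal sketch, with the m-by-n matrix A stored as a one-dimensional row-major array as described above:

/* Serial y = A*x for an m-by-n matrix A stored as a 1-D row-major array. */
void Mat_vect_mult(double A[], double x[], double y[], int m, int n) {
    for (int i = 0; i < m; i++) {
        y[i] = 0.0;
        for (int j = 0; j < n; j++)
            y[i] += A[i * n + j] * x[j];   /* A[i][j] lives at offset i*n + j */
    }
}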
An MPI matrix-vector multiplication
function (1)

Copyright © 2010, Elsevier Inc. All rights Reserved 55


An MPI matrix-vector multiplication
function (2)

Copyright © 2010, Elsevier Inc. All rights Reserved 56
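
The two code slides are not reproduced here. A sketch in this style (names illustrative; comm_sz is assumed to divide both m and n): each process owns local_m rows of A and local_n components of x and y, the distributed x is first assembled on every process with MPI_Allgather, and each process then computes its own block of y.

#include <stdlib.h>
#include <mpi.h>

void Mat_vect_mult(double local_A[], double local_x[], double local_y[],
                   int local_m, int n, int local_n, MPI_Comm comm) {
    double* x = malloc(n * sizeof(double));

    /* Every process gets a full copy of the distributed vector x. */
    MPI_Allgather(local_x, local_n, MPI_DOUBLE, x, local_n, MPI_DOUBLE, comm);

    /* Each process multiplies its own rows of A by the full x. */
    for (int local_i = 0; local_i < local_m; local_i++) {
        local_y[local_i] = 0.0;
        for (int j = 0; j < n; j++)
            local_y[local_i] += local_A[local_i * n + j] * x[j];
    }
    free(x);
}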


Concluding Remarks (1)
 Most serial programs are deterministic: if
we run the same program with the same
input we’ll get the same output.
 Parallel programs often don’t possess this
property.
 Many parallel programs use the single-
program, multiple-data (SPMD) approach.
 A communicator is a collection of
processes that can send messages to
each other.
Copyright © 2010, Elsevier Inc. All rights Reserved 57
Concluding Remarks (2)
 Collective communications involve all the
processes in a communicator.
 When studying MPI, be careful of the
caveats (i.e., usage that leads to crashes,
nondeterministic behavior, etc.).
 In distributed memory systems,
communication is more expensive than
computation.

Copyright © 2010, Elsevier Inc. All rights Reserved 58


Concluding Remarks (3)
 Reducing the number of messages is a good
performance strategy!
 Collective vs point-to-point
 Distributing a fixed amount of data among
several messages is more expensive than
sending a single big message.

Copyright © 2010, Elsevier Inc. All rights Reserved 59
