
MPI Example

Matrix-Vector Multiplication
Dong Dai ([email protected])
Key Topics

• Understand what a real-world MPI application looks like
• Learn some MPI APIs
• Learn how to create and manage communicators
• Understand how data partitioning affects performance
What is Matrix-Vector Multiplication?
• It is simply a series of inner-product (dot-product) computations
• The sequential algorithm:

Input:  a[0…m-1, 0…n-1], an m×n matrix
        b[0…n-1], an n×1 vector
Output: c[0…m-1], an m×1 vector
Algorithm:
for i <- 0 to m-1
    c[i] <- 0
    for j <- 0 to n-1
        c[i] <- c[i] + a[i,j] × b[j]
    end for
end for
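For concreteness, here is a minimal sequential C version of that loop nest (the row-major array layout and the function name are illustrative assumptions, not taken from the lecture):

/* Sequential matrix-vector multiplication: c = A * b.
 * A is stored row-major as a flat array of m*n doubles. */
void matvec(const double *a, const double *b, double *c, int m, int n)
{
    for (int i = 0; i < m; i++) {
        c[i] = 0.0;
        for (int j = 0; j < n; j++)
            c[i] += a[i * n + j] * b[j];
    }
}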
Matrix-Vector Multiplication
• It is simple but important
• Often embedded in algorithms solving a wide variety of problems:
• Recommendation systems
• Conjugate gradient method
• Neural networks
• So its performance is critical for many applications
• In this lecture, we discuss how to parallelize it to solve really big matrix-vector multiplication problems
Data Partition Options
• There are three straightforward ways to decompose an m×n matrix A:
• Rowwise block striping (Horizontal Data Partitioning)
• Columnwise block striping (Vertical Data Partitioning)
• Checkerboard block decomposition (Block Partitioning)
Data Partition Options
• There are two natural ways to distribute the vectors b and c:
• The vector elements may be replicated, meaning all the vector elements are copied on all of the tasks
• Why is this acceptable?
• The vector elements may be divided among some or all of the tasks
• For instance, in the vertical partition, each process only needs a portion of b to calculate its partial results
A1. Horizontal Partition + Vector Replicated
• We first try to associate a primitive task with each row of the matrix A; vectors b and c are replicated among the primitive tasks

• In this case:
• Each task needs a set of rows (N/P of them) and the whole column vector b
• After each task finishes its inner-product computations, task i holds N/P elements of vector c
• The vector then has to be replicated again: an all-gather step communicates each task's elements of c to all other tasks
• The algorithm terminates or is ready for the next iteration
Implementation: using MPI_Allgatherv

• An all-gather communication concatenates blocks of a vector distributed among a group of processes and copies the resulting whole vector to all the processes
• Use the MPI function MPI_Allgatherv
• If the same number of items is gathered from each process, the simpler function MPI_Allgather is more appropriate
• But we cannot ensure, in the general case, that all processes handle the same number of rows
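A minimal sketch of this rowwise version in C (the block-size arithmetic, buffer layout, and function name are illustrative assumptions, not taken from the lecture):

#include <mpi.h>
#include <stdlib.h>

/* Rowwise block-striped matvec: each rank owns a contiguous block of rows
 * of A plus a full copy of b, computes its block of c, then an all-gather
 * replicates the whole c on every rank. Block sizes may differ when
 * m % p != 0, which is why MPI_Allgatherv is used instead of MPI_Allgather. */
void rowwise_matvec(const double *a_local, const double *b, double *c,
                    int m, int n, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    /* recvcounts[i] = number of rows (= elements of c) owned by rank i */
    int *recvcounts = malloc(p * sizeof(int));
    int *displs     = malloc(p * sizeof(int));
    for (int i = 0; i < p; i++) {
        recvcounts[i] = m / p + (i < m % p ? 1 : 0);
        displs[i]     = (i == 0) ? 0 : displs[i - 1] + recvcounts[i - 1];
    }

    /* Local inner products for the rows this rank owns. */
    int my_rows = recvcounts[rank];
    double *c_local = malloc(my_rows * sizeof(double));
    for (int i = 0; i < my_rows; i++) {
        c_local[i] = 0.0;
        for (int j = 0; j < n; j++)
            c_local[i] += a_local[i * n + j] * b[j];
    }

    /* Concatenate every rank's block of c and replicate the full vector. */
    MPI_Allgatherv(c_local, my_rows, MPI_DOUBLE,
                   c, recvcounts, displs, MPI_DOUBLE, comm);

    free(c_local); free(recvcounts); free(displs);
}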
A2. Vertical Partition + Vector Divided
• We then try to associate a primitive task with columns of the matrix A; vectors b and c are divided among the primitive tasks

• In this case:
• Each task i multiplies its columns of A by its block b_i
• This creates a vector of partial results
• At the end of the computation, task i needs only its part c_i of the result vector
• We need an all-to-all communication
• Each partial result block j on task i must be transferred to task j
Implementation: MPI_Scatterv

• The all-to-all communication moves the appropriate partial results to the tasks that will add them up
• The MPI function MPI_Scatterv enables a single root process to distribute a contiguous group of elements to all of the processes in a communicator, including itself
• If the same number of data items is distributed to every process, the simpler function MPI_Scatter is appropriate
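A minimal sketch of MPI_Scatterv as described above, with the root handing out unevenly sized blocks of b (block sizes and names are illustrative assumptions; the all-to-all exchange of partial results itself would typically use MPI_Alltoallv, which is not shown here):

#include <mpi.h>
#include <stdlib.h>

/* Root distributes unevenly sized, contiguous blocks of vector b:
 * rank i receives sendcounts[i] elements starting at offset displs[i]. */
void distribute_b(const double *b_full, double *b_local,
                  int n, int root, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    int *sendcounts = malloc(p * sizeof(int));
    int *displs     = malloc(p * sizeof(int));
    for (int i = 0; i < p; i++) {
        sendcounts[i] = n / p + (i < n % p ? 1 : 0);
        displs[i]     = (i == 0) ? 0 : displs[i - 1] + sendcounts[i - 1];
    }

    /* Only the root's send buffer is significant; every rank (including
     * the root) receives its own block into b_local. */
    MPI_Scatterv(b_full, sendcounts, displs, MPI_DOUBLE,
                 b_local, sendcounts[rank], MPI_DOUBLE, root, comm);

    free(sendcounts); free(displs);
}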
A3. Grid-based Partition + Vector Divided
• In the last case, we associate a primitive task with a small block (sub-grid) of elements of the matrix and a portion of the vector
• Key steps:
• Redistribute vector b so that each task has the correct portion of b
• Each task performs a matrix-vector multiplication with its portions of A and b
• Tasks in each row of the task grid perform a sum-reduction on their portions of c
• After this, c is redistributed to the first column of the task grid for the next iteration
Redistribute vector b
• After calculating c, we need to redistribute its values as b for the next iteration

• This can be done by a point-to-point communication + a broadcast communication (see the sketch below)
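One possible shape of this step, assuming a square q×q process grid with rank = row * q + col, blocks of b initially held by the first column, and a per-column communicator created as shown on the following slides (these layout details and names are my assumptions, not stated in the slides):

#include <mpi.h>

/* Redistribute the block of b on a square q x q process grid with
 * rank = row * q + col, assuming the blocks initially sit in the first
 * column of the grid:
 *   Step 1: process (i,0) sends its block to process (0,i)   [point-to-point]
 *   Step 2: each first-row process broadcasts its block down its column
 *           via a per-column communicator (created with MPI_Comm_split,
 *           see the following slides; key = row, so the first-row process
 *           is rank 0 in col_comm). */
void redistribute_b(double *b_block, int block_len, int row, int col, int q,
                    MPI_Comm grid_comm, MPI_Comm col_comm)
{
    if (col == 0 && row != 0)            /* (row,0) -> (0,row) */
        MPI_Send(b_block, block_len, MPI_DOUBLE,
                 row /* rank of (0,row) */, 0, grid_comm);
    if (row == 0 && col != 0)            /* (0,col) <- (col,0) */
        MPI_Recv(b_block, block_len, MPI_DOUBLE,
                 col * q /* rank of (col,0) */, 0, grid_comm,
                 MPI_STATUS_IGNORE);

    /* Broadcast within each column; the first-row process is the root. */
    MPI_Bcast(b_block, block_len, MPI_DOUBLE, 0, col_comm);
}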
Creating Communicators
• In our grid-based matrix-vector multiplication, two collective communication operations are involved:
• Each row of processes in the grid performs an independent sum-reduction, yielding vector c in the first column of processes
• Each first-row process broadcasts its block of b to the other processes in the same column of the virtual process grid

• You could implement this by writing the communication code with point-to-point APIs, but it is more efficient to use collective APIs
• The problem is that each time we are doing a group communication over a different subset of the processes
• Solution: create new communicators, and assign processes to these communicators
Implementation: MPI_Comm_split
• The collective function MPI_Comm_split partitions the processes in an existing communicator into one or more subgroups and constructs a communicator for each of these new subgroups
• What is needed in our case:
• Create a per-row communicator for conducting the sum-reduction
• Create a per-column communicator for broadcasting subvector b_j
• We can use calls similar to the example below
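A sketch of this communicator setup, again assuming a square q×q grid laid out row-major (the grid layout, variable names, and the commented reduce/broadcast calls are illustrative assumptions):

#include <math.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* Assume p is a perfect square and ranks are laid out row-major,
     * so rank = row * q + col on a q x q virtual grid. */
    int q   = (int)(sqrt((double)p) + 0.5);
    int row = rank / q;
    int col = rank % q;

    MPI_Comm row_comm, col_comm;

    /* Per-row communicator: the same 'row' color groups a row together;
     * 'col' as the key makes the first-column process rank 0, so it can
     * be the root that receives the sum-reduced block of c. */
    MPI_Comm_split(MPI_COMM_WORLD, row, col, &row_comm);

    /* Per-column communicator: same 'col' color; 'row' as the key makes
     * the first-row process rank 0, the root of the broadcast of b. */
    MPI_Comm_split(MPI_COMM_WORLD, col, row, &col_comm);

    /* Typical use in the grid-based algorithm (buffers omitted here):
     *   MPI_Reduce(c_partial, c_block, len, MPI_DOUBLE, MPI_SUM, 0, row_comm);
     *   MPI_Bcast(b_block, len, MPI_DOUBLE, 0, col_comm);
     */

    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
    MPI_Finalize();
    return 0;
}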
Question
• Implement Case b using pseudocode
