SAVITRIBAI PHULE PUNE UNIVERSITY

Faculty Orientation Program


On
High Performance Computing
(2019 Course)
Unit III : Parallel Communication
Rushali Patil
Assistant Professor
Army Institute of Technology
Course Objective and Outcome
Course Objective
To illustrate the various techniques used to parallelize algorithms
Course Outcome
Illustrate data communication operations on various parallel architectures
Reference Book
“Introduction to Parallel Computing” by Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar
Syllabus
Basic Communication:
One-to-All Broadcast
All-to-One Reduction
All-to-All Broadcast and Reduction
All-Reduce and Prefix-Sum Operations
All-to-All Personalized Communication
Improving the speed of some communication operations
Circular Shift
Principles of Message Passing Programming
Blocking and Non-Blocking MPI
Collective Communication using MPI:
Barrier
Broadcast
Reduce
Scatter
Gather


Basic Communication Operations:
Introduction
Many interactions in practical parallel programs occur in well-defined patterns involving groups of processors
Efficient implementations of these operations can improve performance, reduce development effort and cost, and improve software quality
Efficient implementations must leverage the underlying architecture; for this reason, we refer to specific architectures here
The time required to communicate a message of size m over an uncongested network is ts + tw·m (startup time plus per-word transfer time). This time is used as the basis for the analyses that follow
One-to-All Broadcast and
All-to-One Reduction

One-to-All Broadcast: a single process has a piece of data (of size m) that it needs to send to all other processes
All-to-One Reduction: each participating process has data of size m; these data must be combined through an associative operator and accumulated at a single target process
Applications:
matrix-vector multiplication
Gaussian elimination
shortest paths
vector inner product
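
In MPI these two patterns map directly onto MPI_Bcast and MPI_Reduce. A minimal sketch, assuming a single integer payload and rank 0 as the source/target (values are illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, value = 0, sum = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* one-to-all broadcast: rank 0 sends 'value' to every process */
    if (rank == 0) value = 42;
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* all-to-one reduction: every process contributes its rank and
       the sum is accumulated at rank 0 */
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("value = %d, sum of ranks = %d\n", value, sum);

    MPI_Finalize();
    return 0;
}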
One-to-All Broadcast and
All-to-One Reduction

(Figure: one-to-all broadcast sends message M from a single source to processes 0 … p-1; all-to-one reduction is the reverse, accumulating a combined message at a single process.)


One-to-All Broadcast and
All-to-One Reduction on Rings
The simplest way is to send p-1 messages from the source to the other p-1 processes
This is inefficient: the source process becomes a bottleneck
Recursive doubling:
The source process first sends the message to one other process
Now both of these processes can send the message to two other processes, and so on
The message can be broadcast in log p steps
Reduction can be performed in an identical fashion by inverting the process (see the sketch below)
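
A minimal sketch of this recursive-doubling pattern written with MPI point-to-point calls, assuming rank 0 is the source and a one-integer message (in practice one would simply call MPI_Bcast):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, p, step, data = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    if (rank == 0) data = 99;            /* the source holds the message */

    /* in step k, every process with rank < 2^k that already has the data
       forwards it to the process 2^k positions away, doubling the number
       of processes that hold the message */
    for (step = 1; step < p; step *= 2) {
        if (rank < step && rank + step < p)
            MPI_Send(&data, 1, MPI_INT, rank + step, 0, MPI_COMM_WORLD);
        else if (rank >= step && rank < 2 * step)
            MPI_Recv(&data, 1, MPI_INT, rank - step, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }

    printf("process %d has %d\n", rank, data);
    MPI_Finalize();
    return 0;
}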
One-to-All Broadcast on Rings
Node 0 is the source of the broadcast. Each message
transfer step is shown by a numbered, dotted arrow from
the source of the message to its destination
(Figure: eight-node ring, nodes 0-7; the dotted arrows labeled 1, 2, 3 show the three message-transfer steps starting at node 0.)
All-to-One Reduction on Rings
All-to-one reduction on an eight-node ring with node 0 as the destination of the reduction
(Figure: the arrows labeled 1, 2, 3 show the three reduction steps on nodes 0-7, accumulating the result at node 0.)
Broadcast and Reduction: Example
Consider the problem of multiplying a matrix with a vector:
The n × n matrix is assigned to an n × n (virtual) processor grid
The vector is assumed to be on the first row of processors
The first step of the product requires a one-to-all broadcast of the
vector element along the corresponding column of processors
Each processor computes the local product of the vector element and its local matrix entry
In the final step, the results of these products are accumulated to
the first row using n concurrent all-to-one reduction operations
along the columns (using the sum operation)
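
A sketch of this scheme using column communicators, assuming p = n × n processes in a row-major grid, one matrix element per process, and illustrative local values (names such as a and x are placeholders):

#include <mpi.h>
#include <stdio.h>
#include <math.h>

int main(int argc, char *argv[])
{
    int rank, p, n, row, col;
    double a, x = 0.0, prod, result;
    MPI_Comm col_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    n = (int)(sqrt((double)p) + 0.5);     /* assume p is a perfect square */
    row = rank / n;
    col = rank % n;

    a = rank + 1.0;                       /* local matrix entry (dummy data) */
    if (row == 0) x = col + 1.0;          /* vector element held by the first row */

    /* one communicator per column, with processes ordered by row */
    MPI_Comm_split(MPI_COMM_WORLD, col, row, &col_comm);

    /* one-to-all broadcast of the vector element down each column */
    MPI_Bcast(&x, 1, MPI_DOUBLE, 0, col_comm);

    prod = a * x;                         /* local product */

    /* n concurrent all-to-one reductions back to the first row */
    MPI_Reduce(&prod, &result, 1, MPI_DOUBLE, MPI_SUM, 0, col_comm);

    if (row == 0) printf("column %d accumulated %f\n", col, result);

    MPI_Comm_free(&col_comm);
    MPI_Finalize();
    return 0;
}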



Broadcast and Reduction: Matrix-Vector
Multiplication Example
(Figure: One-to-all broadcast and all-to-one reduction in the multiplication of a 4 × 4 matrix with a 4 × 1 vector, shown on a 4 × 4 grid of processors P0-P15.)
Broadcast and Reduction on a Mesh
We can view each row and column of a square mesh of
p nodes as a linear array of √p nodes
Broadcast and reduction operations can be performed in two steps:
the first step does the operation along a row, and
the second step does the same operation along each column concurrently
This process generalizes to higher dimensions as well



Broadcast and Reduction on a Mesh:
Example
One-to-all broadcast on a 16-node mesh

(Figure: nodes 0-15; steps 1 and 2 spread the message along the first row, steps 3 and 4 spread it along all columns concurrently.)
Broadcast and Reduction on a Hypercube
A hypercube with 2^d nodes can be regarded as a d-dimensional mesh with two nodes in each dimension
The mesh algorithm can be generalized to a hypercube, and the operation is carried out in d (= log p) steps
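
A sketch of the hypercube broadcast with node 0 as the source, assuming p = 2^d processes and a one-integer message; in step i the partner is found by flipping bit i of the node label:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, p, d, i, mask, partner, data = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    for (d = 0; (1 << d) < p; d++) ;      /* d = log2(p), assuming p is a power of two */
    if (rank == 0) data = 7;              /* node 0 is the source */

    mask = p - 1;                         /* all d bits set */
    for (i = d - 1; i >= 0; i--) {
        mask ^= (1 << i);                 /* clear bit i of the mask */
        if ((rank & mask) == 0) {         /* only nodes whose lower i bits are 0 take part */
            partner = rank ^ (1 << i);
            if ((rank & (1 << i)) == 0)
                MPI_Send(&data, 1, MPI_INT, partner, 0, MPI_COMM_WORLD);
            else
                MPI_Recv(&data, 1, MPI_INT, partner, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
        }
    }

    printf("node %d received %d\n", rank, data);
    MPI_Finalize();
    return 0;
}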



Broadcast and Reduction on a Hypercube:
Example
(Figure: one-to-all broadcast on a three-dimensional hypercube; the message spreads from node 0 in 3 steps. The binary representations of node labels, (000) through (111), are shown in parentheses.)
Broadcast and Reduction on a Balanced
Binary Tree
Consider a binary tree in which processors are
(logically) at the leaves and internal nodes are routing
nodes
Assume that the source processor is the root of this tree.
In the first step, the source sends the data to the right
child (assuming the source is also the left child). The
problem has now been decomposed into two problems
with half the number of processors.



Broadcast and Reduction on a Balanced
Binary Tree
(Figure: the message travels from the root through the routing nodes in steps 1, 2, 3 to reach leaf processors 0-7.)
One-to-all broadcast on an eight-node tree


Cost Analysis
Assume that p processes participate in the operation and the data to be broadcast or reduced contains m words
The broadcast or reduction procedure involves log p point-to-point simple message transfers, each at a time cost of ts + tw·m
Therefore, the total time taken by the procedure is
T = (ts + tw·m) log p



All-to-All Broadcast and Reduction
Generalization of one-to-all broadcast in which each
process is the source as well as destination
A process sends the same m-word message to every
other process, but different processes may broadcast
different messages
All-to-all broadcast is used in matrix operations
All-to-all Reduction: It is the dual of all-to-all
broadcast



All-to-All Broadcast and Reduction
(Figure: in all-to-all broadcast, each process i starts with its own message Mi and finishes with all messages M0 … Mp-1; all-to-all reduction is the reverse operation.)



All-to-All Broadcast on Rings
1st communication step

(Figure: in step 1 each node i of the eight-node ring sends its own message (i) to its neighbor.)



All-to-All Broadcast on Rings
2nd communication step

(Figure: in step 2 each node forwards the message it received in step 1; after this step every node holds three of the eight messages.)



All-to-All Broadcast on Rings
7th communication step

(Figure: in step 7, the last of the p-1 = 7 steps, each node receives its final missing message; every node then holds all eight messages (0, 1, …, 7).)



All-to-All Broadcast on Mesh
(Figure: all-to-all broadcast on a 3 × 3 mesh, nodes 0-8. (a) Initial data distribution: node i holds message (i). (b) Data distribution after the rowwise broadcast phase: every node in a row holds all the messages of its row, e.g. (0,1,2), (3,4,5), (6,7,8).)



Cost Analysis
On a ring, the time is given by:
T = (ts + tw·m)(p-1)
On a mesh, the time is given by:
T = 2 ts (√p - 1) + tw·m (p-1)
On a hypercube, we have:
T = ts log p + tw·m (p-1)
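
A small helper that evaluates these cost models; ts and tw stand for the startup and per-word transfer times, and the numeric values below are purely illustrative:

#include <stdio.h>
#include <math.h>

/* all-to-all broadcast cost models from the analysis above */
double ring_cost(double ts, double tw, double m, double p)      { return (ts + tw * m) * (p - 1); }
double mesh_cost(double ts, double tw, double m, double p)      { return 2 * ts * (sqrt(p) - 1) + tw * m * (p - 1); }
double hypercube_cost(double ts, double tw, double m, double p) { return ts * log2(p) + tw * m * (p - 1); }

int main(void)
{
    double ts = 100.0, tw = 1.0, m = 64.0, p = 64.0;   /* illustrative units */
    printf("ring: %.0f  mesh: %.0f  hypercube: %.0f\n",
           ring_cost(ts, tw, m, p), mesh_cost(ts, tw, m, p), hypercube_cost(ts, tw, m, p));
    return 0;
}

Note that the tw·m(p-1) term is the same on all three topologies; the richer topologies only reduce the startup (ts) term.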



All-Reduce
In all-reduce, each node starts with a buffer of size m and the
final results of the operation are identical buffers of size m on
each node that are formed by combining the original p buffers
using an associative operator.
It is semantically identical to an all-to-one reduction followed by a one-to-all broadcast, but that formulation is not the most efficient. A better implementation uses the pattern of all-to-all broadcast instead; the only difference is that the message size does not grow along the way. The time for this operation is (ts + tw·m) log p
Different from all-to-all reduction, in which p simultaneous all-
to-one reductions take place, each with a different destination
for the result.
The Prefix-Sum Operation
Given p numbers n0, n1, …, np-1 (one on each node), the problem is to compute the sums sk = n0 + n1 + … + nk (that is, sk = Σ i=0..k ni) for all k between 0 and p-1
Initially, nk resides on the node labeled k, and at the end of the procedure the same node holds sk
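
MPI provides this operation directly as MPI_Scan; a minimal sketch with one integer per process:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, nk, sk;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    nk = rank + 1;                        /* the number initially held by node k */

    /* inclusive prefix sum: node k receives n0 + n1 + ... + nk */
    MPI_Scan(&nk, &sk, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("node %d: prefix sum = %d\n", rank, sk);
    MPI_Finalize();
    return 0;
}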


Scatter and Gather
In the scatter operation, a single node sends a unique message of
size m to every other node (also called a one-to-all personalized
communication).
In the gather operation, a single node collects a unique message
from each node.
While the scatter operation is fundamentally different from
broadcast, the algorithmic structure is similar, except for
differences in message sizes (messages get smaller in scatter and
stay constant in broadcast).
The gather operation is exactly the inverse of the scatter operation



Scatter and Gather
(Figure: in scatter, node 0 starts with messages M0, M1, …, Mp-1 and delivers Mi to node i; gather is the reverse, collecting one message from each node at node 0.)



All-to-All Personalized Communication
Each node has a distinct message of size m for every other node
This is unlike all-to-all broadcast, in which each node sends the same message to all other nodes
All-to-all personalized communication is also known as total exchange
Applications:
Fast Fourier transform
Matrix transpose
Sample sort
Improving Speed of Some Communication
Operations
1. Splitting and Routing Messages in Parts
   One-to-All Broadcast: scatter followed by all-to-all broadcast
   All-to-One Reduction: all-to-all reduction followed by gather
   All-Reduce: all-to-one reduction followed by one-to-all broadcast
2. All-Port Communication
MPI: Message Passing Interface
MPI: the Message Passing Interface
MPI defines a standard library for message-passing that
can be used to develop portable message-passing
programs using either C or Fortran.
The MPI standard defines both the syntax as well as the
semantics of a core set of library routines.
Vendor implementations of MPI are available on almost all commercial parallel computers.
It is possible to write fully-functional message-passing programs using only six routines.
Starting and Terminating the MPI Library
MPI_Init is called prior to any calls to other MPI routines. Its purpose is to initialize the MPI environment.
MPI_Finalize is called at the end of the computation, and it performs various clean-up tasks to terminate the MPI environment.
The prototypes of these two functions are:
int MPI_Init(int *argc, char ***argv)
int MPI_Finalize()
All MPI routines, data types, and constants are prefixed by "MPI_". The return code for successful completion is MPI_SUCCESS.
Skeleton of MPI Program
#include <mpi.h>
int main( int argc, char** argv )
{
    MPI_Init( &argc, &argv );

    /* main part of the program:
       use MPI function calls depending on your data
       partitioning and the parallelization architecture */

    MPI_Finalize();
    return 0;
}



A minimal MPI program
#include "mpi.h"
#include <stdio.h>
int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    printf("Hello, world!\n");
    MPI_Finalize();
    return 0;
}



Communicator
A communicator defines a communication domain - a set of
processes that are allowed to communicate with each other.
Information about communication domains is stored in
variables of type MPI_Comm.
Communicators are used as arguments to all message transfer MPI routines.
A process can belong to many different (possibly overlapping) communication domains.
MPI defines a default communicator called MPI_COMM_WORLD which includes all the processes.
Querying Information
The MPI_Comm_size and MPI_Comm_rank functions are used to determine the number of processes and the label of the calling process, respectively
The calling sequences of these routines are as follows:
int MPI_Comm_size(MPI_Comm comm, int *size)
int MPI_Comm_rank(MPI_Comm comm, int *rank)
The rank of a process is an integer that ranges from zero up to the size of the communicator minus one

Sample Program
#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[])
{
int rank, size;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
printf("I am %d of %d\n", rank, size);
MPI_Finalize();
return 0;
}



Sending and Receiving Messages (Blocking)
The basic functions for sending and receiving messages in MPI are MPI_Send and MPI_Recv, respectively
The calling sequences of these routines are as follows:
int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
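
A minimal blocking send/receive sketch, assuming at least two processes; rank 0 sends one integer to rank 1:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, value;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 123;
        /* blocking send: returns once the send buffer can be reused */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* blocking receive: returns once the message has been received */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("rank 1 received %d from rank %d\n", value, status.MPI_SOURCE);
    }

    MPI_Finalize();
    return 0;
}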



Basic MPI Datatypes
MPI datatype            C datatype
MPI_CHAR                signed char
MPI_UNSIGNED_CHAR       unsigned char
MPI_SHORT               signed short
MPI_UNSIGNED_SHORT      unsigned short
MPI_INT                 signed int
MPI_UNSIGNED            unsigned int
MPI_LONG                signed long
MPI_UNSIGNED_LONG       unsigned long
MPI_FLOAT               float
MPI_DOUBLE              double
MPI_LONG_DOUBLE         long double



MPI_Status
MPI Structure:
typedef struct MPI_Status {
    int MPI_SOURCE;
    int MPI_TAG;
    int MPI_ERROR;
} MPI_Status;
int MPI_Get_count(MPI_Status *status, MPI_Datatype datatype, int *count)
It takes as arguments:
the status returned by MPI_Recv and
the type of the received data in datatype, and
returns the number of entries that were actually received in the count variable
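
A short sketch of using the status and MPI_Get_count after a receive, assuming two or more processes; rank 0 sends only 3 of a possible 10 integers:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, count, buf[10];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        buf[0] = 1; buf[1] = 2; buf[2] = 3;
        MPI_Send(buf, 3, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* receive up to 10 ints, then ask how many actually arrived */
        MPI_Recv(buf, 10, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        MPI_Get_count(&status, MPI_INT, &count);
        printf("received %d ints from rank %d with tag %d\n",
               count, status.MPI_SOURCE, status.MPI_TAG);
    }

    MPI_Finalize();
    return 0;
}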
Non-Blocking Communication
Nonblocking communications are useful for overlapping
communication with computation
int MPI_Isend(const void *buf, int count, MPI_Datatype
datatype, int dest, int tag, MPI_Comm comm, MPI_Request
*request)
int MPI_Irecv(void *buf, int count, MPI_Datatype datatype,
int source, int tag, MPI_Comm comm, MPI_Request *
request)
To check the completion of non-blocking send and
receive operations, MPI provides MPI_Test and MPI_Wait
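
A minimal non-blocking sketch between rank 0 and rank 1; the comment marks where independent computation could overlap with the transfer:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, value = 0;
    MPI_Request request;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 77;
        MPI_Isend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &request);
    } else if (rank == 1) {
        MPI_Irecv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &request);
    }

    /* ... computation that does not touch 'value' can be done here ... */

    if (rank == 0 || rank == 1) {
        MPI_Wait(&request, MPI_STATUS_IGNORE);   /* complete the pending operation */
        if (rank == 1) printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}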



MPI Collective Communication Operations
Barrier
Broadcast
Reduction
Prefix
Gather
Scatter



MPI_Barrier
int MPI_Barrier( MPI_Comm comm );
Return values:
MPI_SUCCESS      No error; MPI routine completed successfully
MPI_ERR_COMM     Invalid communicator



MPI_Barrier Example
#include "mpi.h" 
#include <stdio.h> 
int main(int argc, char *argv[]) 

    int rank, nprocs;

    MPI_Init(&argc,&argv); 
    MPI_Comm_size(MPI_COMM_WORLD,&nprocs); 
    MPI_Comm_rank(MPI_COMM_WORLD,&rank); 
    MPI_Brrier(MPI_COMM_WORLD);
    printf("Hello, world.  I am %d of %d\n", rank, nprocs);
fflush(stdout); 
    MPI_Finalize(); 
    return 0; 



MPI_Bcast
int MPI_Bcast( void *buffer, int count,
MPI_Datatype datatype, int root, MPI_Comm
comm );
Parameters
buffer[in/out] starting address of buffer (choice)
count[in] number of entries in buffer (integer)
datatype[in] data type of buffer (handle)
root[in] rank of broadcast root (integer)
comm[in] communicator (handle)
MPI_Bcast Example
void my_bcast(void* data, int count, MPI_Datatype datatype, int root,
              MPI_Comm communicator)
{
    int world_rank, world_size, i;
    MPI_Comm_rank(communicator, &world_rank);
    MPI_Comm_size(communicator, &world_size);

    if (world_rank == root) {
        /* If we are the root process, send our data to everyone */
        for (i = 0; i < world_size; i++) {
            if (i != world_rank)
                MPI_Send(data, count, datatype, i, 0, communicator);
        }
    } else {
        /* If we are a receiver process, receive the data from the root */
        MPI_Recv(data, count, datatype, root, 0, communicator, MPI_STATUS_IGNORE);
    }
}
MPI_Reduce
int MPI_Reduce( void *sendbuf, void *recvbuf, int
count, MPI_Datatype datatype, MPI_Op op, int
root, MPI_Comm comm );
Parameters
sendbuf[in] address of send buffer (choice)
recvbuf[out] address of receive buffer (choice, significant
only at root)
count[in] number of elements in send buffer (integer)
datatype[in] data type of elements of send buffer
(handle)
op[in] reduce operation (handle)
root[in] rank of root process (integer)
comm[in] communicator (handle)
MPI Reduction Operations
MPI_MAX - Returns the maximum element.
MPI_MIN - Returns the minimum element.
MPI_SUM - Sums the elements.
MPI_PROD - Multiplies all elements.
MPI_LAND - Performs a logical and across the elements.
MPI_LOR - Performs a logical or across the elements.
MPI_BAND - Performs a bitwise and across the bits of the
elements.
MPI_BOR - Performs a bitwise or across the bits of the
elements.
MPI_MAXLOC - Returns the maximum value and the rank of
the process that owns it.
MPI_MINLOC - Returns the minimum value and the rank of
the process that owns it.
MPI_Reduce Example
float *rand_nums = NULL;
rand_nums = create_rand_nums(num_elements_per_proc);

// Sum the numbers locally
float local_sum = 0;
int i;
for (i = 0; i < num_elements_per_proc; i++) {
    local_sum += rand_nums[i];
}

// Print the local sum on each process
printf("Local sum for process %d - %f, avg = %f\n", world_rank, local_sum,
       local_sum / num_elements_per_proc);

// Reduce all of the local sums into the global sum at rank 0
float global_sum;
MPI_Reduce(&local_sum, &global_sum, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

// Print the result
if (world_rank == 0) {
    printf("Total sum = %f, avg = %f\n", global_sum,
           global_sum / (world_size * num_elements_per_proc));
}
MPI_Allreduce
int MPI_Allreduce( void *sendbuf, void *recvbuf,
int count, MPI_Datatype datatype, MPI_Op op,
MPI_Comm comm );
Parameters
sendbuf[in] starting address of send buffer (choice)
recvbuf[out] starting address of receive buffer (choice)
count[in] number of elements in send buffer (integer)
datatype[in] data type of elements of send buffer
(handle)
op[in] operation (handle)
comm[in] communicator (handle)
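A minimal usage sketch: every process contributes its rank and every process receives the global sum:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, global_sum;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* combine the ranks with MPI_SUM; the result is available on all processes */
    MPI_Allreduce(&rank, &global_sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("process %d sees global sum %d\n", rank, global_sum);
    MPI_Finalize();
    return 0;
}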
MPI_Gather
int MPI_Gather( void *sendbuf, int sendcnt, MPI_Datatype
sendtype, void *recvbuf, int recvcnt, MPI_Datatype recvtype,
int root, MPI_Comm comm );
Parameters
sendbuf[in] starting address of send buffer (choice)
sendcount[in] number of elements in send buffer (integer)
sendtype[in] data type of send buffer elements (handle)
recvbuf[out] address of receive buffer (choice, significant only
at root)
recvcount[in] number of elements for any single receive (integer,
significant only at root)
recvtype[in] data type of recv buffer elements (significant only at
root) (handle)
root[in] rank of receiving process (integer)
comm[in] communicator (handle)
MPI_Gather Example
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
int main(int argc, char **argv) {
    int myrank, size, i;
    int *recvbuffer = NULL;
    int sendbuffer[2];
    int recvbufflen = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    /* Initialize send buffer */
    for (i = 0; i < 2; i++) sendbuffer[i] = i * 2;
    /* Only process 0 allocates memory for recvbuffer */
    if (myrank == 0) {
        recvbufflen = 2 * size;
        recvbuffer = (int*)malloc(recvbufflen * sizeof(int));
    }
    MPI_Gather(sendbuffer, 2, MPI_INT, recvbuffer, 2, MPI_INT, 0, MPI_COMM_WORLD);
    if (myrank == 0) {
        for (i = 0; i < recvbufflen; i++) {
            printf("recvbuffer[%d]=%d\n", i, recvbuffer[i]);
        }
    }
    MPI_Finalize();
    return 0;
}
MPI_Allgather
int MPI_Allgather( void *sendbuf, int sendcount,
MPI_Datatype sendtype, void *recvbuf, int recvcount,
MPI_Datatype recvtype, MPI_Comm comm );
Parameters
sendbuf[in] starting address of send buffer (choice)
sendcount[in] number of elements in send buffer (integer)
sendtype[in] data type of send buffer elements (handle)
recvbuf[out] address of receive buffer (choice)
recvcount[in] number of elements received from any
process (integer)
recvtype[in] data type of receive buffer elements (handle)
comm[in] communicator (handle)
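A minimal usage sketch: each process contributes its rank and all processes receive the full array of ranks:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int rank, size, i, *all_ranks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    all_ranks = (int*)malloc(size * sizeof(int));

    /* every process sends its rank and receives the ranks of all processes */
    MPI_Allgather(&rank, 1, MPI_INT, all_ranks, 1, MPI_INT, MPI_COMM_WORLD);

    if (rank == 0)
        for (i = 0; i < size; i++)
            printf("all_ranks[%d] = %d\n", i, all_ranks[i]);

    free(all_ranks);
    MPI_Finalize();
    return 0;
}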
MPI_Scatter
int MPI_Scatter( void *sendbuf, int sendcnt, MPI_Datatype
sendtype, void *recvbuf, int recvcnt, MPI_Datatype recvtype,
int root, MPI_Comm comm );
Parameters
sendbuf[in] address of send buffer (choice, significant only at root)
sendcount[in] number of elements sent to each process (integer, significant only at root)
sendtype[in] data type of send buffer elements (significant only at root) (handle)
recvbuf[out] address of receive buffer (choice)
recvcount[in] number of elements in receive buffer (integer)
recvtype[in] data type of receive buffer elements (handle)
root[in] rank of sending process (integer)
comm[in] communicator (handle)
MPI_Scatter Example
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#define SIZE 4
int main (int argc, char *argv[]) {
int numtasks, rank, sendcount, recvcount, source;
float sendbuf[SIZE][SIZE] =
{ {1.0, 2.0, 3.0, 4.0}, {5.0, 6.0, 7.0, 8.0}, {9.0, 10.0, 11.0, 12.0}, {13.0, 14.0, 15.0, 16.0} };
float recvbuf[SIZE];
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
if (numtasks == SIZE)
{
source = 1;
sendcount = SIZE;
recvcount = SIZE;
MPI_Scatter(sendbuf,sendcount,MPI_FLOAT,recvbuf,recvcount, MPI_FLOAT,source,MPI_COMM_WORLD);
printf("rank= %d Results: %f %f %f %f\n",rank,recvbuf[0], recvbuf[1],recvbuf[2],recvbuf[3]);
}
Else
printf("Must specify %d processors. Terminating.\n",SIZE);
MPI_Finalize();
}
MPI_Alltoall
int MPI_Alltoall( void *sendbuf, int sendcount,
MPI_Datatype sendtype, void *recvbuf, int recvcount,
MPI_Datatype recvtype, MPI_Comm comm );
Parameters
sendbuf[in] starting address of send buffer (choice)
sendcount[in] number of elements to send to each process
(integer)
sendtype[in] data type of send buffer elements (handle)
recvbuf[out] address of receive buffer (choice)
recvcount[in] number of elements received from any process
(integer)
recvtype[in] data type of receive buffer elements (handle)
comm[in] communicator (handle)
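A minimal sketch of total exchange: process i prepares a distinct value for every process j, and after the call recvbuf[j] holds what process j prepared for this process:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int rank, size, j, *sendbuf, *recvbuf;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sendbuf = (int*)malloc(size * sizeof(int));
    recvbuf = (int*)malloc(size * sizeof(int));

    /* a distinct message for every destination: element j goes to process j */
    for (j = 0; j < size; j++)
        sendbuf[j] = rank * 10 + j;

    MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

    for (j = 0; j < size; j++)
        printf("process %d got %d from process %d\n", rank, recvbuf[j], j);

    free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}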
All-to-All Using MPI (Examples)
One-dimensional Matrix-Vector Multiplication
Single-Source Shortest Path
Sample Sort



Thank You

