PC_course_notes_May17
Hairong Wang
Contents

1 Introduction
  1.1 Parallel computers
    1.1.1 Parallel Computing — What is it?
    1.1.2 Why parallelism?
    1.1.3 A brief history of processor performance
    1.1.4 A brief history of parallel computing
  1.2 Supercomputers
  1.3 Classification of Parallel Computers
    1.3.1 Control Structure Based Classification - Flynn's Taxonomy
  1.4 A Quantitative Look at Parallel Computation
    1.4.1 A simple performance modelling
    1.4.2 Amdahl's Law
    1.4.3 Gustafson's Law
    1.4.4 Efficiency
    1.4.5 Scalability
  1.5 Limitations of memory system performance
  1.6 Exercises
  Bibliography

2 Overview of Parallel Systems
  2.1 Modelling Parallel Computation
  2.2 Interconnection Networks
  2.3 Exercises
  Bibliography
3 Parallel Algorithm Design
  3.1 Shared Memory vs Distributed memory
    3.1.1 Shared Memory Programming
    3.1.2 Distributed Memory Programming
  3.2 Parallel Algorithm Design
    3.2.1 Parallel Algorithm Design — Simple Examples
  3.3 Summary
  3.4 Exercises
  Bibliography

4 Shared Memory Programming using OpenMP
  Bibliography
    5.2.4 Benchmarking the Performance
  5.3 Examples
  5.4 Collective Communication and Computation Operations
  5.5 Running MPI Program With MPICH
    5.5.1 What is MPICH
    5.5.2 Summary
  5.6 Point to Point Communication
    5.6.1 Blocking vs. Non-blocking
    5.6.2 MPI Message Passing Function Arguments
    5.6.3 Avoiding Deadlocks
    5.6.4 Sending and Receiving Messages Simultaneously
    5.6.5 Overlapping Communication with Computation
  5.7 Collective Communications
    5.7.1 MPI Collective Communication Illustrations
  5.8 Examples
    5.8.1 Parallel Quicksort Algorithm
    5.8.2 Hyperquicksort
    5.8.3 The Odd-Even Transposition Sort
    5.8.4 Parallel Bitonic Sort
  5.9 MPI Derived Datatypes
    5.9.1 Typemap
    5.9.2 Creating and Using a New Datatype
    5.9.3 Contiguous Type
    5.9.4 Vector Type
    5.9.5 Indexed Type
    5.9.6 Struct Type
  5.10 Summary
  5.11 Exercises
  Bibliography
List of Figures
1.8  Illustration of the die area for integer unit and cache logic for Intel Itanium II
1.9  Intel VP Patrick Gelsinger (ISSCC 2001)
1.10 Growth of transistors, frequency, cores, and power consumption
1.11 Fugaku, a supercomputer from RIKEN and Fujitsu Limited. 158,976 nodes fit into 432 racks, that means 368 nodes in each rack, 48 cores each node
1.12 The SISD architecture
1.13 The SIMD architecture
1.14 The MIMD architecture
1.15 The SMP architecture
1.16 The NUMA architecture
1.17 The distributed memory architecture
1.18 A cluster
3.7  Tasks and communications
3.8  Each row element depends on the element to its left; each column depends on the previous column
3.9  Update all the red cells first, then update all the black cells; the process is repeated until convergence
5.1  The trapezoidal rule: (a) area to be estimated; (b) estimating the area using trapezoids
5.2  MPI_Bcast
5.3  MPI_Reduce
5.4  MPI_Allreduce
5.5  (a) A global sum followed by a broadcast; (b) a butterfly-structured global sum
5.6  MPI_Scatter
5.7  MPI_Gather
5.8  MPI_Allgather
5.9  MPI_Alltoall
5.10 MPI_Scan
5.11 MPI_Scatter (top) and MPI_Scatterv (bottom) example
5.12 MPI_Alltoallv example
5.13 MPI_Alltoallv example
5.14 Parallel quicksort
5.15 Illustration of the Hyperquicksort algorithm
5.16 A schematic representation of a network that converts an input sequence into a bitonic sequence. In this example, ⊕BM[k] and ⊖BM[k] denote bitonic merging networks of input size k that use ⊕ and ⊖ comparators, respectively. The last merging network (⊕BM[16]) sorts the input. In this example, n = 16
5.17 A bitonic merging network for n = 16. The input wires are numbered 0, 1, ..., n − 1, and the binary representation of these numbers is shown. Each column of comparators is drawn separately; the entire figure represents a ⊕BM[16] bitonic merging network. The network takes a bitonic sequence and outputs it in sorted order
5.18 The comparator network that transforms an input sequence of 16 unordered numbers into a bitonic sequence
5.19 MPI_Type_vector
5.20 MPI_Type_indexed
Chapter 1
Introduction
Some contents in this chapter are based on [Grama et al. 2003] and [Trobec et al. 2018].
1.1 Parallel computers

1.1.1 Parallel Computing — What is it?

Parallel computing:
• A problem is broken into discrete parts that can be solved concurrently;
An example of parallelizing the addition of two vectors is shown in Figure 1.3. The computation in this example is the element-wise summation of two vectors. Note that the additions of two corresponding vector elements are independent of each other, so they can be executed in parallel. We often call this kind of easily parallelizable problem an embarrassingly parallel problem. One of the main goals of parallel computing is speedup — the time on 1 processor divided by the time on p processors. Others include high efficiency (cost, area, power) and working on bigger problems that cannot fit on a single machine.
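To make this concrete, here is a minimal C sketch (not from the original notes) of the element-wise vector addition of Figure 1.3. The OpenMP directive used to split the independent additions among threads is only one possible mechanism and is covered properly in Chapter 4; the array size N is chosen arbitrarily.

#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N], b[N], c[N];

    for (int i = 0; i < N; i++) {   /* initialise the input vectors */
        a[i] = i;
        b[i] = 2.0 * i;
    }

    /* Each element-wise addition is independent of the others, so the
       iterations can safely be divided among the available threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[N-1] = %f\n", c[N - 1]);
    return 0;
}

Compiled with OpenMP support (e.g., gcc -fopenmp), the loop runs in parallel; compiled without it, the pragma is ignored and the same code runs serially.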
Question: How do you understand concurrency and parallelism?
Intel launched the first commercial microprocessor chip, the Intel 4004, with 2300 transistors, each the size of a red blood cell. Since then, chips have improved in line with the prediction of Gordon Moore, Intel's co-founder (see Figure 1.6).
Question: How fast is this growth? How do we relate it to the physical world?
An issue that comes with this is that, while processor performance has increased exponentially, memory speed has not increased at the same rate; it has grown very slowly compared to CPU performance (see Figure 1.7). This has caused a growing performance gap between the processor and memory: the memory is simply not fast enough to meet the demand of the processor, so the actual performance of solving a problem can be limited by memory performance rather than by the faster processor. One way to address this performance gap is to add faster memory on the processor chip so that the data and the processor are physically close, which means faster data access; such on-chip memory is the cache. The problem with this is that the cache size is usually very small, because the chip area is limited.
• For example, Intel Itanium II (see Figure 1.8)
– 6-way integer unit < 2% die area;
– Cache logic > 50% die area.
• Most of the chip is there to keep these 6 integer units running at 'peak' rate.
• The main issue is that the ratio of external DRAM latency (50 ns) to the internal clock period (0.25 ns) is 200:1.
Figure 1.8: Illustration of the die area for integer unit and cache logic for Intel Itanium II.
• What can we do? We can still pack more and more transistors onto a die, but we cannot make the individual transistor faster.
• We have to make better use of this growing number of transistors to increase performance, instead of relying only on a faster clock speed. One solution is to use parallelism.
• Add multiple cores to increase performance, while keeping the clock frequency the same or even reducing it.
Why parallelism – summary
• Serial machines have inherent limitations:
– Processor speed, memory bottlenecks, . . .
• Parallelism has become the future of computing.
• Two primary benefits of parallel computing:
– Solve fixed size problem in shorter time
– Solve larger problems in a fixed time
• Other factors motivate parallel processing:
– Effective use of machine resources;
– Cost efficiencies;
1.2 Supercomputers
See Table 1.1 for Frontier, the No. 1 supercomputer in 11/2022 and 11/2023. Table 1.2 and Figure 1.11 show Fugaku, the No. 1 supercomputer in 11/2020 and 11/2021. Supercomputers are used to run applications in areas such as drug discovery; personalized and preventive medicine; simulations of natural disasters; weather and climate forecasting; energy creation, storage, and use; development of clean energy; new material development; new design and production processes; and, as a purely scientific endeavour, elucidation of the fundamental laws and evolution of the universe. Some performance measures used:
Rmax: Maximal LINPACK performance achieved. The LINPACK Benchmark solves a dense system of linear equations; it is a measure of a computer's floating-point rate of execution.

Rpeak: Theoretical peak performance. The theoretical peak is based not on actual performance from a benchmark run, but on a paper-and-pen computation of the theoretical peak rate of execution of floating point operations for the machine. For example, an Intel Itanium 2 at 1.5 GHz can complete 4 double-precision floating point operations per cycle (or per clock tick), so its theoretical peak performance is 4 × 1.5 × 10^9 = 6 × 10^9 FLOPS, i.e., 6 GFLOPS.

Nmax: Problem size for achieving Rmax.

N_1/2: Problem size for achieving half of Rmax.
Figure 1.11: Fugaku, a supercomputer from RIKEN and Fujitsu Limited. 158,976 nodes fit into 432 racks,
that means 368 nodes in each rack, 48 cores each node.
As of 2023, the clusters (or partitions) at the Wits Mathematical Sciences Lab (MSL) are the following:
• stampede: For general purpose use or jobs that can leverage InfiniBand (if enabled). It has
40 nodes, each with two Xeon E5-2680 CPUs, two GTX1060 GPUs (6GB per GPU, 12GB
per node), and 32GB of system RAM. MaxTime = 4320 and MaxNodes = 16.
• batch: For bigger jobs that can leverage a bigger GPU and additional system RAM. It has 48
nodes each with a single Intel Core i9-10940X CPU (14 cores), NVIDIA RTX3090 GPU
(24GB), and 128GB of system RAM. MaxTime = 4320 and MaxNodes = 16. Partition
batch has the potential for 10Gb networking (This has not yet been implemented).
• biggpu: For mature code that can meaningfully leverage very large amounts of GPU and
system RAM. It has 38 nodes, each node has two Intel Xeon Platinum 8280L CPUs (28
cores per CPU, 56 cores per node) with two NVIDIA Quadro RTX8000 GPUs (48GB per
GPU, 96GB per node), and 1TB of system RAM. MaxTime = 4320 and MaxNodes = 3.
biggpu has the potential for 10Gb networking (This has not yet been implemented). When-
ever possible, limit jobs on biggpu to using only a single node so others can use the partition
simultaneously. Please use biggpu responsibly!
1.3 Classification of Parallel Computers

1.3.1 Control Structure Based Classification - Flynn's Taxonomy

Flynn's taxonomy classifies computers by the number of concurrent instruction streams and data streams:

• SISD: Single instruction stream single data stream. A conventional serial computer: a single processing unit executes a single instruction stream on a single data stream (see Figure 1.12).

• SIMD: Single instruction stream multiple data stream. In this computer type there can be multiple processors, each operating on its own data item, but they are all executing the same instruction on that data item. SIMD computers excel at operations on arrays, such as element-wise vector operations (see Figure 1.13).

• MISD: Multiple instruction stream single data stream. Each processing unit executes a different instruction stream on a single data stream. Very few computers are of this type.

• MIMD: Multiple instruction stream multiple data stream. Multiple processors operate on multiple data items, each executing independent, possibly different instructions. Most current parallel computers are of this type (see Figure 1.14).
Most MIMD machines operate in single program multiple data (SPMD) mode, where the programmer starts up the same executable on all of the parallel processors. The MIMD category is typically further decomposed according to memory organization: shared memory and distributed memory.

Shared memory: In a shared memory system, all processes share a single address space and communicate with each other by writing and reading shared variables.

• One class of shared-memory systems is called SMPs (symmetric multiprocessors) (see Figure 1.15). All processors share a connection to a common memory and access all memory locations at equal speeds. SMP systems are arguably the easiest parallel systems to program because programmers do not need to distribute data structures among processors.
• The other main class of shared-memory systems is called non-uniform memory access (NUMA) (Figure 1.16). The memory is shared and addressable from all processors, but some blocks of memory may be physically more closely associated with some processors than others, so access times differ.

• To mitigate the effects of non-uniform access, each processor has a cache, along with a protocol to keep cache entries coherent; such systems are cache-coherent non-uniform memory access systems (ccNUMA). Programming a ccNUMA system is logically the same as programming an SMP, but to obtain the best performance the programmer needs to be more careful about locality issues and cache effects.
Distributed memory systems: Each process has its own address space and communicates
with other processes by message passing (sending and receiving messages). Figure 1.17 shows
an example. As off-the-shelf networking technology improves, systems of this type are becoming
more common and much more powerful.
Clusters Clusters are distributed memory systems composed of off-the-shelf computers con-
nected by an off-the-shelf network (Figure 1.18).
In the following, let’s have a look at some analyses on the limit of parallel processing.
• Then we obtain the well-known Amdahl’s law:
Ttotal (1) Ttotal (1) 1
S(P ) = = 1−γ = (1.7)
Ttotal (P ) (γ + P )Ttotal (1) γ + 1−γ
P
1
S(P ) = (1.8)
γ
Example 1.4.1. Suppose we are able to parallelize 90% of a serial program, and suppose the speedup of this part is P, the number of processes we use (a perfect linear speedup). If the serial time is T_serial = 20 s, then the runtime of the parallelized part is (0.9 × T_serial)/P = 18/P. The runtime of the unparallelized part is 0.1 × 20 = 2 s. The overall parallel runtime is therefore T_parallel = 18/P + 2, and the speedup is S = 20/(18/P + 2). As P grows, S approaches 20/2 = 10 = 1/0.1, the upper bound given by Amdahl's Law.
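As a quick numerical check (not part of the original notes), the short C program below tabulates T_parallel = 18/P + 2 and the corresponding speedup for increasing P, using the values of Example 1.4.1 (T_serial = 20 s, 90% parallelizable); the printed speedups level off near the Amdahl bound of 10.

#include <stdio.h>

int main(void) {
    const double t_serial = 20.0;
    const double parallel_fraction = 0.9;   /* the 90% of Example 1.4.1 */

    for (int p = 1; p <= 1024; p *= 2) {
        double t_parallel = parallel_fraction * t_serial / p
                          + (1.0 - parallel_fraction) * t_serial;
        printf("P = %4d   T_parallel = %7.3f s   speedup = %5.2f\n",
               p, t_parallel, t_serial / t_parallel);
    }
    return 0;   /* the speedup approaches 1/0.1 = 10, never exceeding it */
}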
So, what are we doing here? Should we give up? Of course not. There are a number of reasons we should not worry about Amdahl's Law too much. First, it does not take into account the problem size: as we increase the problem size, the serial fraction of the work often decreases. Second, many scientific and engineering applications routinely achieve huge speedups.
1.4.3 Gustafson's Law

Define the scaled serial fraction as the fraction of the P-process execution time spent in the serial parts (e.g., setup and finalization):
\[ \gamma_{scaled} = \frac{T_{setup} + T_{finalization}}{T_{total}(P)}, \tag{1.10} \]
and then
\[ T_{total}(1) = \gamma_{scaled} T_{total}(P) + P (1 - \gamma_{scaled}) T_{total}(P). \tag{1.11} \]
The exercise underlying Amdahl's law, namely running exactly the same problem with varying numbers of processors, is artificial in some circumstances. In Eq. 1.2, we obtained T_total(P) from the execution time of the serial part and the execution time of the parallel part when executed on one PE.
• Using Eq. 1.11, we obtain the scaled speedup, sometimes known as Gustafson's law:
\[ S(P) = \frac{T_{total}(1)}{T_{total}(P)} = \frac{\gamma_{scaled} T_{total}(P) + P(1 - \gamma_{scaled}) T_{total}(P)}{T_{total}(P)} = P(1 - \gamma_{scaled}) + \gamma_{scaled} = P + (1 - P)\gamma_{scaled}. \tag{1.12} \]
• Suppose we take the limit in P while holding Tcompute and γscaled constant. That is, we
are increasing the size of the problem so that the total running time remains constant when
more processors are added. In this case, the speedup is linear in P.
The assumption above implies that the execution time of the serial parts does not change as the
problem size grows.
Example 1.4.2. Now, for the previous example, let’s assume the scaled serial fraction of the same
problem is 0.1, that is γscaled = 0.1, and p = 16. Then the scaled speedup is S = 16(1 − 0.1) +
0.1 = 14.5, which is greater than the upper bound determined by Amdahl’s Law (10).
1.4.4 Efficiency
• There are two more performance metrics that we will frequently use in this course: efficiency and scalability.
• Efficiency:
\[ E = \frac{S}{P}, \tag{1.13} \]
where S is the speedup, and P is the number of processes used to achieve the speedup. Note 0 < E ≤ 1. Efficiency is better when E is closer to 1, which simply means the P processors are utilized efficiently.
1.4.5 Scalability
• Scalability: In general, a technology is scalable if it can handle ever-increasing problem
size.
• In parallel program performance, scalability refers to the following measure. Suppose we run a parallel program using a certain number of processors and a certain problem size, and obtain an efficiency E. Suppose further that we now increase the number of processors. If we can find a rate at which to increase the problem size so that the efficiency E is maintained, we say the parallel program is scalable.
Example 1.4.3. Suppose T_serial = n, where n is also the problem size, and suppose T_parallel = n/p + 1. Then
\[ E = \frac{n}{p(n/p + 1)} = \frac{n}{n+p}. \]
To see whether the problem is scalable, suppose we increase p by a factor of k; we then ask whether there is a factor x by which we can increase the problem size n so that the efficiency is unchanged:
\[ \frac{n}{n+p} = \frac{xn}{xn + kp}. \]
If x = k, then we have the same efficiency. That is, if we increase the problem size at the same rate as we increase the number of processors, the efficiency remains constant, and hence the program is scalable.
Strong scalability: If, when we increase the number of processes, we can keep the efficiency fixed without increasing the problem size, the program is strongly scalable.
Weak scalability: If we can keep the efficiency fixed by increasing the problem size at the
same rate as we increase the number of processes, then the program is said to be weakly scalable.
1.5 Limitations of memory system performance

1. Latency: A memory system, possibly with multiple levels of cache, takes in a request for a memory word and returns a block of data of size b containing the requested word after l ns. Here, l is referred to as the latency of the memory.
2. Effect of memory latency on performance
Example 1.5.1. Consider a processor operating at 1GHz (1ns clock) connected to a DRAM
with a latency of 100ns (no caches). Assume that the processor has two multiply-add units
and is capable of executing four instructions in each cycle of 1ns. The peak performance
rating is therefore 4GFLOPS. However, since the memory latency is 100ns, and if the block
size is only one word (4 bytes), then every time a memory request is made, the processor
must wait 100ns (latency) before it can process a word. In the case of each floating point
operation requires one data fetch, the peak speed of this computation is limited to one
floating point operation every 100ns, or a speed of 10MFLOPS, which is only a small
fraction of the peak processor performance.
3. Improving effective memory latency using caches: Caches are used to address the speed
mismatch between the processor and memory. Caches are a smaller and faster memory that
is placed between the processor and the DRAM. The data needed by the processor is first
fetched into cache. All subsequent accesses to data items residing in the cache are serviced
by the cache. Thus, in principle, if a piece of data is repeatedly used, the effective latency
of this memory system can be reduced by the cache.
4. Cache hit ratio: The fraction of data references satisfied by the cache is called the cache
hit ratio of the computation on the system.
5. Memory bound computation: The effective computation rate of many applications is
bounded not by the processing rate of the CPU, but by the rate at which data can be pumped
into the CPU. Such computations are referred to as being memory bound. The performance
of memory bound program is critically impacted by the cache hit ratio.
Example 1.5.2. We again consider the 1 GHz processor with a 100 ns latency DRAM of Example 1.5.1. In this case, let us add a cache of size 32 KB (1K = 2^10 ≈ 10^3) with a latency of 1 ns. We use this setup to multiply two matrices A and B of dimension 32 × 32 (so the input matrices and the output matrix all fit in the cache). Fetching the two input matrices (2K words) into the cache takes approximately 200 µs. The total number of floating point operations in multiplying the two matrices is 64K (multiplying two n × n matrices takes 2n^3 multiply and add operations), which takes approximately 16K cycles, i.e., 16K ns (or 16 µs) at 4 instructions per cycle. So, the total time for the computation is 200 µs + 16 µs = 216 µs. This corresponds to a peak computation rate of 64K/216 µs ≈ 303 MFLOPS. Compared to Example 1.5.1, this is a much better rate (due to the cache), but it is still a small fraction of the peak processor performance (4 GFLOPS).
6. Temporal locality: The notion of repeated reference to a data item in a small time window
is called temporal locality of reference. Data reuse is critical for cache performance because
if each data item is used only once, it would still have to be fetched once per use from the
DRAM, and therefore the DRAM latency would be paid for each operation.
7. Bandwidth: The rate at which data can be moved between the processor and the memory.
It is determined by the bandwidth of the memory bus as well as the memory units.
Example 1.5.3. We continue with Examples 1.5.1 and 1.5.2. Now consider increasing the memory block size to 4 words, so that each memory request returns a contiguous block of 4 words (such a 4-word unit is also referred to as a cache line); this effectively quadruples the memory bandwidth. Fetching the two input matrices into the cache now takes 200/4 = 50 µs. Consequently the total time for the computation is 50 µs + 16 µs = 66 µs, which corresponds to a peak computation rate of 64K/66 µs ≈ 993 MFLOPS.
8. Increasing the memory bandwidth by building a wide data bus connected to multiple memory banks is expensive. In a more practical system, consecutive words are sent on the memory bus on subsequent bus cycles after the first word is retrieved. For example, with a 32-bit data bus, the first word is put on the bus after 100 ns (the latency) and one word is put on the bus in each subsequent bus cycle. The 4 words then become available after 100 + 3 × (memory bus cycle) ns. Assuming a data bus operating at 200 MHz, this adds 15 ns to the cache line access time.
9. Spatial locality: Successive computations require contiguous data. If the computation (or
data access pattern) in a program does not have spatial locality, then effective bandwidth
can be much smaller than the peak bandwidth.
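As an illustration of spatial locality (a sketch, not from the original notes), the code below sums the entries of a matrix stored, as C arrays are, in row-major order. The first loop nest walks along rows, so consecutive accesses fall within the same cache line; the second walks down columns, so successive accesses are far apart in memory, most of each fetched cache line is wasted, and the loop is typically noticeably slower.

#include <stdio.h>

#define N 2048
static double a[N][N];   /* C stores this array row by row (row-major order) */

int main(void) {
    double sum = 0.0;

    /* Good spatial locality: the inner loop visits consecutive addresses. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    /* Poor spatial locality: successive accesses are N*sizeof(double) bytes apart. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    printf("%f\n", sum);
    return 0;
}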
1.6 Exercises
Objectives
The objectives are to understand and be able to apply the concepts and discussions on
• Classification of parallel computers
• Understand the simple metrics used to evaluate the performance of a parallel program
• Understand the similarities and differences between Amdahl’s Law and Gustafson’s Law
• Apply the basic performance metrics, Amdahl’s Law, and Gustafson’s Law for simple prob-
lems.
• Limitations of memory system performance
Problems
1. In the discussion of parallel hardware, we used Flynn's taxonomy to identify parallel systems: SISD, SIMD, MIMD, SPMD, and MISD. Explain how each of these parallel computer types works, and give some examples of such systems.
2. Assume p processing elements (PEs) share a bus to the memory, each PE accesses k data items, and each data access takes time t_cycle.
(a) What is the maximum execution time for each PE to access the k data items? (Answer: k p t_cycle.)
(b) Assume each PE now has its own cache, and on average 50% of the data accesses are made to the local memory (or cache in this case). Further, let us assume the access time to local memory is the same as to the global memory, i.e., t_cycle. In such a case, what is the maximum execution time for each PE to access k data items? (Answer: 0.5 k t_cycle + 0.5 k p t_cycle.)
3. Suppose that 70% of a program execution can be sped up if the program is parallelized
and run on 16 processing units instead of one. By Amdahl’s Law, what is the maximum
speedup we can achieve for this problem if we increase the number of processing units,
respectively, to 32, then to 64, and then to 128? (Answer: Use Amdahl’s Law.)
4. For a problem size of interest, 6% of the operations of a parallel program are inside I/O
functions that are executed on a single processor. What is the minimum number of proces-
sors needed in order for the parallel program to exhibit a speedup of 10? (Answer: Use
Amdahl’s Law.)
5. An application executing on 64 processors requires 220 seconds to run. Benchmarking
reveals that 5% of the time is spent executing serial portions of the computation on a single
processor. What is the scaled speedup of the application? (Answer: 60.85.)
6. A company recently purchased a powerful machine with 128 processors. You are tasked to
demonstrate the performance of this machine by solving a problem in parallel. In order to
do that, you aim at achieving a speedup of 100 in this application. What is the maximum
fraction of the parallel execution time that can be devoted to inherently sequential operations
if your application is to achieve this goal? (Answer: Use Gustafson’s Law.)
7. Both Amdahl’s Law and Gustafson’s Law are derived from the same general speedup for-
mula. However, when increasing the number of processors p, the maximum speedup pre-
dicted by Amdahl’s Law converges to a certain limit, while the speedup predicted by Gustafson’s
Law increases without bound. Explain why this is so.
8. Consider a memory system with a DRAM of 512 MB and an L1 cache of 32 KB, with the CPU operating at 1 GHz. The latencies are l_DRAM = 100 ns and l_L1 = 1 ns. In each memory cycle, the processor fetches 4 words. What is the peak achievable performance of a dot product of two vectors? (Answer: Assuming 2 multiply-add units as in Example 1.5.1, the computation performs 8 floating point operations per 2 memory fetches, i.e., 8 operations in 200 ns if the cache read time and 1 CPU cycle are ignored, giving 40 MFLOPS. If the cache read time as well as 1 CPU cycle are also counted, it becomes 8 operations in 207 ns, a rate of about 38.64 MFLOPS.)
Bibliography
Grama, A., Gupta, A., Karypis, G., and Kumar, V. (2003). Introduction to Parallel Computing.
Addison Wesley.
Trobec, R., Slivnik, B., Bulić, P., and Robič, B. (2018). Introduction to Parallel Computing:
From Algorithms to Programming on State-of-the-Art Platforms. Springer.
Chapter 2
Overview of Parallel Systems
Some contents of this chapter are based on [Trobec et al. 2018] and [Grama et al. 2003].
2.1 Modelling Parallel Computation

Figure 2.1: The RAM model of computation: the memory M contains program instructions and data; the processing unit P executes instructions on the data.
A natural extension of the RAM to parallel computing simply consists of multiple processing elements, in contrast to the single PE in serial computing, together with a global memory of unlimited size that is uniformly accessible to all processing elements. The generalization of the RAM to parallel computing can be done in three different ways:
Figure 2.2: (a) PRAM model for parallel computation; (b) multiple PEs try to access the same memory
location simultaneously.
First of all, the PRAM (parallel random access machine): multiple PEs connected to a global memory, which we can also call shared memory. This is similar to the RAM in that each PE has random access to any memory unit; all PEs have uniform, or equal, access to the global memory; and we again assume the global memory has unlimited size. Clearly there are some issues with the PRAM: what happens if multiple PEs need to access the same memory location (see the example in Figure 2.2 (b), where two PEs are trying to access the same memory unit L)? Two or more PEs may try to read from the same memory location, multiple PEs may try to write to the same memory location, or one or more PEs may try to read while some others try to write to the same memory location. An obvious solution is to serialize such contending accesses, meaning that at any one time only one PE is allowed to access a particular memory unit. With such a solution we have another issue: which PE should go first, which one next, and so on. These issues lead to uncertainties. In summary, simultaneous access to the same memory location can lead to unpredictable data in the PEs, as well as in the memory locations accessed. Depending on how multiple PEs are allowed to access memory simultaneously, a number of variants of the PRAM have been proposed. These variants differ in the kinds of simultaneous access they permit and in how they avoid the unpredictability.
• Exclusive read exclusive write PRAM (EREW-PRAM): It does not support simultaneous access to the same memory location - any access to a memory location must be exclusive, meaning that simultaneous accesses to the same location are serialized; hence it affords the minimum concurrency in memory access. Note that we are talking about simultaneous accesses to the same memory location; if multiple PEs each access different memory locations, those accesses can happen in parallel.
• Concurrent read exclusive write PRAM (CREW-PRAM): Allows simultaneous reads from the same memory location, but writes to a memory location must be exclusive, that is, concurrent writes will be serialized.

• Concurrent read concurrent write PRAM (CRCW-PRAM): Supports simultaneous reads from the same memory location, simultaneous writes to the same memory location, and simultaneous reads and writes to the same memory location. The unpredictability is handled in different ways, so within this type we can further define a few different varieties:
– Consistent CRCW-PRAM: PEs may simultaneously write to the same memory location only if they all attempt to write the same value.

Figure 2.3: (a) The local memory machine (LMM) model; (b) the modular memory machine (MMM) model.
Figure 2.4: (a) Fully connected network; (b) A fully connected crossbar switch. It is non-blocking in the
sense that a connection of a PE to a memory does not block another similar connection. It is scalable in
terms of performance, but not scalable in terms of cost.
• Indirect networks (dynamic): Connect nodes and memory modules via switches and communication links. A cross point is a switch that can be opened or closed. Switches are used to establish paths among nodes.
  – Fully connected crossbar switch (Figure 2.5 (b)): On one side are the nodes, and on the other side the memory modules. The fully connected crossbar is too complex to be used for connecting large numbers of input and output ports. For example, if we have 1000 nodes and 1000 memory modules, then we need one million switches to build the fully connected crossbar.
Figure 2.5: (a) Fully connected network; (b) A fully connected crossbar network.
• Linear array. Used in LMMs. Every node (except the two nodes at the ends) is connected to two neighbours, see Figure 2.7. Simple. If a node's index is i, its two neighbours have indices i + 1 and i − 1; with a wraparound link these become (i + 1) mod n and (i − 1) mod n, where n is the total number of nodes.

Figure 2.7: (a) Linear array without a wraparound link; (b) linear array with a wraparound link (also called a ring, described next).
• Ring. Used in LMMs. Every node is connected to two neighbours, see Figure 2.8. Simple.
If a node index is i, its two neighbours can be found using indices (i + 1) mod n and
(i − 1) mod n, where n is the total number of nodes in the ring.
Figure 2.8: Ring topology. Each node represents a processing element with local memory.
• 2D mesh (Figure 2.9 (a)). Can be used in LMMs. Each node is connected to a switch. The
number of switches can be determined by the lengths of the two sides. Every switch, except
those along the 4 borders, has 4 neighbours.
• 2D torus (Figure 2.9 (b)). Similar to 2D mesh, however, each pair of corresponding border
switches is connected. Every switch has 4 neighbours.
Figure 2.9: (a) 2D mesh topology; (b) 2D torus topology. Each node represents a processing element with
local memory.
• 3D mesh. Similar to a 2D mesh, but in three dimensions. Every switch, except the border ones, has 6 neighbours (see Figure 2.10 (a)).

• Hypercube (Figure 2.10 (b)). A d-dimensional hypercube connects 2^d nodes, each labelled by a d-bit binary string; two nodes are connected exactly when their labels differ in a single bit.

Figure 2.10: (a) 3D mesh; (b) hypercube topology. Each node represents a processing element with local memory.
[Figure: construction of hypercubes of dimension 1 to 4, with nodes labelled by binary strings; nodes whose labels differ in exactly one bit are connected.]
• Multistage network: used in MMM, where input switches are connected to PEs, and output
switches are connected to memory modules (see Figure 2.12).
• Fat tree: used in constructing LMM, where PEs with their local memories are attached to
the leaves (see Figure 2.13).
Figure 2.12: Multistage network topology. A 4-stage interconnection network capable of connecting 16
PEs to 16 memory modules. Each switch can establish a connection between a pair of input and output
channels.
Figure 2.13: Fat tree topology. A fat-tree interconnection network of 16 processing nodes. Each switch can establish a connection between an arbitrary pair of incident links. Edges closer to the root are thicker: the idea is to increase the number of communication links and switching nodes closer to the root.
Interconnection network topologies are commonly compared using the following measures:
• Diameter: the maximum distance between any pair of nodes.
• Bisection width: the minimum number of links that must be removed to partition the network into two equal halves.
• Cost: typically measured as the total number of links (or switches) in the network.
See Table 2.1 for quantitative measures of various interconnection networks.
2.3 Exercises
Objectives
The objectives are to understand and be able to apply the concepts and discussions on
• Theoretical modelling of parallel computation
• The basic properties and topologies of common interconnection networks
Problems
1. Of the three PRAM models, EREW-PRAM, CREW-PRAM, and CRCW-PRAM, which
one has the minimum concurrency, and which one has the maximum power? Why?
2. How do you label the nodes in a d-dimensional hypercube? Construct a 4-dimensional
hypercube with labels, where there is a communication link between two nodes when their
labels differ in only one binary bit.
3. (a) Partition a d-dimensional hypercube into two equal partitions such that
i. each partition is still a hypercube;
ii. neither partition is a hypercube.
(b) Derive the bisection width for each of the two cases in Item 3a.
4. Derive the diameter, bisection width, and cost for each of the following interconnection
networks. Show your derivation. Assume there are p number of nodes where applicable.
• Fully connected network
• Bus
• Ring
• 2D mesh
• 2D torus
• 3D mesh
• Hypercube
• Fully connected crossbar network
5. Using the basic interconnection network topologies we studied in class, design as many network topologies as you can that connect p PEs to m memory modules, such that each topology provides a path between any pair consisting of a PE and a memory module.
Bibliography
Grama, A., Gupta, A., Karypis, G., and Kumar, V. (2003). Introduction to Parallel Computing.
Addison Wesley.
Trobec, R., Slivnik, B., Bulić, P., and Robič, B. (2018). Introduction to Parallel Computing:
From Algorithms to Programming on State-of-the-Art Platforms. Springer.
Chapter 3
Parallel Algorithm Design
The contents of this chapter are mainly based on [Culler et al. 1998] and [Grama et al. 2003].
3.1 Shared Memory vs Distributed memory

• From a programming point of view, there are two models: shared memory programming and distributed memory programming.
  – Shared memory programming: All PEs have equal access to a shared memory space. Data are shared among PEs via shared variables.
  – Distributed memory programming: A PE has access only to its local memory. Data (messages) are shared among PEs via the communication channel, i.e., the interconnection network.
• In terms of programming, programming shared memory parallel computers is easier than
programming distributed memory systems.
• In terms of scalability, shared memory systems and distributed memory systems display
different performance and cost characteristics. For example,
– Bus (Figure 2.6): scalable in terms of cost, but not performance.
– Fully connected crossbar switch (Figure 3.2 (a)): scalable in terms of performance, but not cost.
– Hypercube: scalable both in performance and cost.
– 2D mesh (Figure 3.2 (b)): scalable both in performance and cost.
• In terms of problem size, distributed memory systems are more suitable for problems with vast amounts of data and computation.
Figure 3.2: (a) Fully connected crossbar switch based system; (b) 2D mesh.
...
printf("Thread %d > my_x = %d\n", my_ID, my_x);
...
Example 3.1.2. Suppose we have a shared variable min_x (with initial value ∞) and two threads in our program. Each thread has a private variable my_x; thread 0 stores the value 5 in its my_x, and thread 1 the value 9 in its my_x. What will happen if both threads execute the update min_x = my_x simultaneously?

• This is a situation in which two threads more or less simultaneously try to update a shared variable (min_x).

• Race condition: When threads attempt to access a resource simultaneously, and the accesses can result in an error, we say the program has a race condition.

• In such a case, we can serialize the contending activities by defining a critical section, a block of code that can be executed by only one thread at a time. A sketch of this example is given below.
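A minimal sketch of Example 3.1.2 (not one of the course listings) is given here, written with OpenMP, which the notes introduce in Chapter 4. The critical section serializes the two contending updates of the shared variable min_x, so the final value is always 5.

#include <float.h>
#include <omp.h>
#include <stdio.h>

int main(void) {
    double min_x = DBL_MAX;   /* shared variable, initially "infinity" */

    #pragma omp parallel num_threads(2)
    {
        /* thread 0 holds 5 in its private my_x, thread 1 holds 9 */
        double my_x = (omp_get_thread_num() == 0) ? 5.0 : 9.0;

        /* Without protection the two updates could interleave (a race condition). */
        #pragma omp critical
        {
            if (my_x < min_x)
                min_x = my_x;   /* only one thread executes this at a time */
        }
    }
    printf("min_x = %f\n", min_x);
    return 0;
}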
3.2 Parallel Algorithm Design

Figure 3.3: Steps in parallelization: the relationship between PEs, tasks, and processors.
• The major goal of decomposition is to expose enough concurrency to keep the processes busy all the time, yet not so much that the overhead of managing the tasks becomes substantial.

• Assignment involves assigning the fine-grained tasks to the available processes such that each process has approximately the same workload. Load balancing is a challenging issue in parallel programming.

• The primary performance goals of assignment are to achieve a balanced workload and to reduce the runtime overhead of managing the assignment.

• The third step, orchestration, involves the necessary coordination, which may include communication among processes and synchronization.

• The major performance goals in orchestration are:
  – reducing the cost of communication and synchronization
  – preserving locality of data reference
  – scheduling tasks
  – reducing the overhead of parallelism management

• Finally, mapping assigns processes to processors. The number of processes and the number of processors do not necessarily need to match; for example, you can have 8 processes on a 4-processor machine. In such a case, a processor may handle more than one process through techniques such as space sharing and time sharing.

• The mapping can be taken care of by the OS in order to optimize resource allocation and utilization; the program may, of course, also control the mapping.
• Issue: the second phase is serial and takes n^2 time regardless of the number of processes. The total time is n^2/p + n^2. Then the speedup, compared to the sequential time 2n^2, is
\[ s = \frac{2n^2}{n^2/p + n^2} = \frac{2}{1/p + 1}, \]
which is bounded above by 2. If the parallel time can instead be reduced to 2n^2/p + p, the speedup becomes
\[ s = \frac{2n^2}{2n^2/p + p} = p \times \frac{2n^2}{2n^2 + p^2}. \]
This speedup is almost linear in p, the number of processors used, when n is large compared to p. Figure 3.5 shows the impact of concurrency on the performance for Example 3.2.1.
Example 3.2.2. Suppose we have an array with a large quantity of floating point data stored in it. To get a good sense of the distribution of the data, we can compute a histogram of the data. To find the histogram of a set of data, we simply divide the range of the data into equal-sized subintervals, or bins, determine the number of values in each bin, and plot a bar graph showing the sizes of the bins.
As a very small example, suppose our data are
A = [1.3, 2.9, 0.4, 0.3, 1.3, 4.4, 1.7, 0.4, 3.2, 0.3,
4.9, 2.4, 3.1, 4.4, 3.9, 0.4, 4.2, 4.5, 4.9, 0.9]
• Let’s set the number of bins to be 5, and the bins are [0, 1.0), [1.0, 2.0), [2.0, 3.0), [3.0, 4.0), [4.0, 5.0),
and binwidth = (5.0 − 0)/5 = 1.0.
• Then bincount = [6, 3, 2, 3, 6] is the histogram — the output is an array containing the number of elements of data that lie in each bin.
for (i = 0; i < data_count; i++) {
    bin = Find_bin(data[i], ...);
    bin_count[bin]++;
}
The Find_bin function returns the bin to which data[i] belongs.
• Now, if we want to parallelize this problem, we first decompose the dataset into subsets. Given the small dataset, we divide it into 4 subsets, so that each subset has 5 elements.
  – Identify tasks (decompose): (i) finding the bin to which an element of data belongs; (ii) incrementing the corresponding entry in bin_count.
  – The second task can only be done once the first task has been completed.
  – If two processes or threads are assigned elements from the same bin, then both of them will try to update the count of the same bin (Figure 3.6 shows an example). Assuming bin_count is shared, this will cause a race condition.
• A solution to the race condition in this example is to create local copies of bin_count, have each process update its local copy, and at the end add these local copies into the global bin_count. Figure 3.7 shows an example of this idea, and a sketch follows below.
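The sketch below (not one of the course listings) implements the local-copy idea with OpenMP; the function name histogram_parallel and the simplified one-argument Find_bin signature are assumptions made for illustration.

#define NBINS 5

int Find_bin(double x);   /* the bin-lookup function from the text (signature simplified) */

/* Each thread counts into its own private local_count array; the local copies are
   then added into the shared bin_count one thread at a time, avoiding the race. */
void histogram_parallel(const double data[], int data_count, int bin_count[NBINS]) {
    #pragma omp parallel
    {
        int local_count[NBINS] = {0};

        #pragma omp for
        for (int i = 0; i < data_count; i++)
            local_count[Find_bin(data[i])]++;

        #pragma omp critical
        for (int b = 0; b < NBINS; b++)
            bin_count[b] += local_count[b];
    }
}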
3.3 Summary
• Aspects of parallel program design
– Decomposition to create concurrent tasks
– Assignment of work to workers
– Orchestration to coordinate processing of tasks by processes/threads
– Mapping to hardware
We will look more into making good decisions in these aspects in the coming lectures.
In this chapter, we gave a general introduction to shared memory programming, distributed
memory programming, an approach in parallel algorithm design, and a few examples to demon-
strate some of the concepts we discussed.
3.4 Exercises
Objectives
Students should be able to
• Design parallel algorithms for problems with common computation patterns such as reduc-
tion and broadcasting.
• Identify the concurrency, task dependency, communication, and synchronization in a par-
allel solution.
Problems
Figure 3.8: Each row elements depends on the element to the left; each column depends on the previous
column.
Figure 3.9: Update all the red cells first, then update all the black cells. The process is repeated until
convergence.
Bibliography
Culler, D. E., Singh, J. P., and Gupta, A. (1998). Parallel Computer Architecture: A Hardware/-
Software Approach. Morgan Kaufmann Publishers, Inc.
Grama, A., Gupta, A., Karypis, G., and Kumar, V. (2003). Introduction to Parallel Computing.
Addison Wesley.
Chapter 4
Shared Memory Programming using OpenMP
The contents in this chapter are based on sources [Chapman et al. 2007] and [Trobec et al.
2018].
Example 4.2.2. Create a 4-thread parallel region using runtime library routine (see Listing 4.3).
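Since Listing 4.3 is not reproduced here, the following is a sketch of what such a program might look like: the runtime library routine omp_set_num_threads() requests a team of 4 threads for the parallel region that follows.

#include <omp.h>
#include <stdio.h>

int main(void) {
    omp_set_num_threads(4);   /* request 4 threads for subsequent parallel regions */

    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        printf("Hello from thread %d of %d\n", id, omp_get_num_threads());
    }
    return 0;
}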
False Sharing

If independent data elements happen to sit on the same cache line, each update will cause the cache line to "slosh back and forth" between threads; this is called false sharing (Figure 4.5 shows an illustration).
Cache
• Processors usually execute operations faster than they access data in memory.
• To address this problem, a block of relatively fast memory is added to a processor — cache.
• The design of cache considers the temporal and spatial locality.
• For example, suppose x is a shared variable with x = 5, and my_y, my_z are private variables. After the following code runs, what will be the value of my_z?

if (myId == 0) {
    my_y = x;
    x++;
} else if (myId == 1) {
    my_z = x;   /* may observe either the old or the updated value of x */
}
Cache Coherence
Example 4.2.4. Suppose x = 2 is a shared variable, y0, y1, z1 are private variables. If the
statements in Table 4.1 are executed at the indicated times, then x =?, y0 =?, y1 =?, z1 =? (See
Figure 4.6.)
Cache coherence problem - When the caches of multiple processors store the same variable, an update by one processor to the cached variable must be 'seen' by the other processors; that is, the cached value stored by the other processors must also be updated (or invalidated). Ensuring this is the cache coherence problem.
False Sharing
• Suppose two threads with separate caches access different variables that belong to the same
cache line. Further suppose at least one of the threads updates its variable.
• Even though neither thread has written to a shared variable, the cache controller invalidates
the entire cache line and forces the other threads to get the values of the variables from main
memory.
• The threads aren’t sharing anything (except a cache line), but the behaviour of the threads
with respect to memory access is the same as if they were sharing a variable. Hence the
name false sharing.
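One common way to avoid false sharing, sketched below (this is not one of the course listings), is to pad per-thread data so that each thread's element sits on its own cache line; the padding factor assumes 64-byte cache lines and 8-byte doubles.

#include <omp.h>

#define NTHREADS 4
#define PAD 8   /* 8 doubles = 64 bytes, the assumed cache line size */

/* Each thread accumulates into its own padded slot, so no two threads
   repeatedly write to the same cache line while the loop runs. */
double sum_padded(const double *a, int n) {
    static double partial[NTHREADS][PAD];

    #pragma omp parallel num_threads(NTHREADS)
    {
        int id = omp_get_thread_num();
        partial[id][0] = 0.0;
        #pragma omp for
        for (int i = 0; i < n; i++)
            partial[id][0] += a[i];
    }

    double total = 0.0;
    for (int t = 0; t < NTHREADS; t++)
        total += partial[t][0];
    return total;
}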
[Listing (π computation, Section 4.2.2): the loop accumulates sum and the program finally computes pi += step * sum;]
Race Condition

• When multiple threads update a shared variable, the computation can exhibit non-deterministic behaviour — a race condition. That is, two or more threads attempt to access the same resource at the same time.

• If a block of code updates a shared resource, the update must be performed by only one thread at a time.
4.2.3 Synchronization
• Synchronization: Bringing one or more threads to a well-defined and known point in their execution. Barriers and mutual exclusion are the two most often used forms of synchronization.
  – Barrier: Each thread waits at the barrier until all threads arrive.
  – Mutual exclusion: Defines a block of code that only one thread at a time can execute.
• Synchronization is used to impose order constraints and to protect access to shared data.
• In Method II of computing π,
sum = sum + 4.0/(1.0+x*x);
is called a critical section.
• Critical section - a block of code executed by multiple threads that updates a shared variable,
and the shared variable can only be updated by one thread at a time.
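Putting these pieces together, here is a sketch in the spirit of Method II (numerical integration of 4/(1+x^2) over [0,1]); the exact course listing differs, and the variable names num_steps, step, and sum follow the fragments quoted in the notes.

#include <omp.h>
#include <stdio.h>

static const long num_steps = 100000000;

int main(void) {
    double step = 1.0 / (double)num_steps;
    double sum = 0.0;

    #pragma omp parallel for
    for (long i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;
        #pragma omp critical       /* protect the update of the shared variable sum */
        sum = sum + 4.0 / (1.0 + x * x);
    }

    double pi = step * sum;
    printf("pi = %.15f\n", pi);
    return 0;
}

Because the critical section sits inside the loop, the accumulation is effectively serialized; the reduction clause discussed later in this chapter expresses the same computation far more efficiently.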
High level synchronizations in OpenMP
• critical construct: Mutual exclusion. Only one thread at a time can enter a critical section,
e.g., Listing 4.7 shows a simple usage of this construct.
#pragma omp critical
float result;
......
#pragma omp parallel
{
    float B; int i, id, nthrds;
    id = omp_get_thread_num();
    nthrds = omp_get_num_threads();
    for (i = id; i < niters; i += nthrds) {   /* cyclic distribution of the niters jobs; niters is declared in the elided part */
        B = big_job(i);
    }
    #pragma omp critical
    result += calc(B);
    ......
}
• atomic construct: Basic form. Provides mutual exclusion but only applies to the update of
a memory location.
• #pragma omp atomic. See an example in Listing 4.8.
The statement inside the atomic construct must have one of the following forms:
• x op= expr, where op is one of +, *, -, /, &, ^, |, <<, >>
• x++
• ++x
• x--
• --x
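Since Listing 4.8 is not reproduced here, the sketch below shows the kind of use the atomic construct is intended for, namely protecting a single update of one memory location; the function and variable names are illustrative.

int count_even(const int *a, int n) {
    int count = 0;   /* shared counter */

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        if (a[i] % 2 == 0) {
            #pragma omp atomic
            count++;   /* the increment of the shared counter is performed atomically */
        }
    }
    return count;
}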
• barrier construct: Each thread waits until all threads arrive. See Listing 4.9.
#pragma omp barrier
• Other constructs discussed later in this chapter:
  – sections/section constructs
  – task construct
  – single construct
• The following construct is also discussed as a synchronization construct:
  – master construct
Example 4.2.5. Given arrays A[N ] and B[N ]. Find A[i] = A[i] + B[i], 0 ≤ i ≤ N − 1.
Listing 4.10 shows the for loop that does the element-wise vector addition. Listing 4.12 paral-
lelizes the for loop manually, while Listing 4.11 shows how using for construct can conveniently
parallelize the for loop.
Clauses that can be used with the for construct include:
• private
• firstprivate
• lastprivate
• reduction
• schedule
• ordered
• nowait
• collapse
Schedule Clause
• The schedule clause specifies how the iterations of the loop are assigned to the threads in
a team.
• The syntax is: #pragma omp for schedule(kind [, chunk]).
• Schedule kinds:
– schedule(static [,chunk]): Deals out blocks of iterations of size chunk to each thread
in a round robin fashion.
∗ The iterations can be assigned to the threads before the loop is executed.
– schedule(dynamic [,chunk]): Each thread grabs chunk size of iterations off a queue
until all iterations have been handled. The default is 1.
∗ The iterations are assigned while the loop is executing
– schedule(guided [,chunk]): Threads dynamically grab blocks of iterations. The size
of the block starts large and shrinks down to size chunk as the calculation proceeds.
The default is 1.
– schedule(runtime): Schedule and chunk size taken from the OMP_SCHEDULE en-
vironment variable.
∗ For example, export OMP_SCHEDULE="static,1"
– schedule(auto): The selection of the schedule is determined by the implementation.
• By default, most OpenMP implementations use a roughly block partition.
• There is some overhead associated with scheduling.
• The overhead for dynamic is greater than for static, and the overhead for guided is the greatest.
• If each iteration of a loop requires roughly the same amount of computation, then it is likely
that the default distribution will give the best performance.
• If the cost of the iterations decreases linearly as the loop executes, then a static schedule
with small chunk size will probably give the best performance.
• If the cost of each iteration can not be determined in advance, then schedule(runtime)
can be used.
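As an illustration (not one of the course listings), the sketch below applies a dynamic schedule with chunk size 4 to a loop whose iteration cost grows with i, which is exactly the situation where the default block partition would leave the threads unevenly loaded; work() is a hypothetical routine used only to create uneven work.

#include <math.h>

double work(int i) {   /* hypothetical task whose cost grows with i */
    double s = 0.0;
    for (int k = 0; k < i * 1000; k++)
        s += sin(k * 1e-3);
    return s;
}

double run(int n) {
    double total = 0.0;
    #pragma omp parallel for schedule(dynamic, 4) reduction(+: total)
    for (int i = 0; i < n; i++)
        total += work(i);   /* threads grab chunks of 4 iterations from a queue */
    return total;
}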
Ordered Construct/Clause
Example 4.2.6. Removing a loop carried dependency (e.g., see Listing 4.15).
// Loop with a loop-carried dependency
int i, j, A[MAX];
j = 5;
for (i = 0; i < MAX; i++) {
    j += 2;
    A[i] = big(j);
}

// Removing the loop-carried dependency
int i, A[MAX];
#pragma omp parallel
#pragma omp for
for (i = 0; i < MAX; i++) {
    int j = 5 + 2 * (i + 1);
    A[i] = big(j);
}
It helps to unroll the loop to see the dependencies (see Listing 4.16).
i = 0:   B[0] = tmp;   A[1] = B[1];   tmp = A[0];
i = 1:   B[1] = tmp;   A[2] = B[2];   tmp = A[1];
i = 2:   B[2] = tmp;   A[3] = B[3];   tmp = A[2];
......
Reduction
......
double ave = 0.0, A[MAX];
int i;
for (i = 0; i < MAX; i++) {
    ave += A[i];
}
ave = ave / MAX;
......
In Listing 4.17, we are aggregating multiple values into a single value; this is a reduction. Reduction operations are supported in most parallel programming environments.
......
double ave = 0.0, A[MAX];
int i;
#pragma omp parallel for reduction(+: ave)
for (i = 0; i < MAX; i++) {
    ave += A[i];
}
ave = ave / MAX;
......
• Table 4.2 shows the associative operands that can be used with reduction (for C/C++) and
their common initial values.
Table 4.2: Associative operands that can be used with reduction (for C/C++) and their common initial
values
Op     Initial value               Op     Initial value
+      0                           &      ~0 (all bits set)
*      1                           |      0
-      0                           ^      0
min    Largest positive number     &&     1
max    Most negative number        ||     0
Collapse Clause
Listing 4.19 gives a simple example of nested loop.
• In Listing 4.20, the iterations of the k and j loops are (manually) collapsed into one loop,
and that loop is then divided among the threads in the current team.
Listing 4.21 uses the collapse clause to turn the nested for loops into a single loop in order to increase the total number of iterations available for efficient parallelism.
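A minimal sketch of the collapse clause (not Listing 4.21 itself): the two perfectly nested loops below are fused into a single iteration space of N × M iterations, which is then divided among the threads; the array and its dimensions are illustrative.

#define N 4
#define M 1000

void scale(double x[N][M], double alpha) {
    #pragma omp parallel for collapse(2)
    for (int k = 0; k < N; k++)
        for (int j = 0; j < M; j++)
            x[k][j] *= alpha;   /* N*M independent updates shared among the threads */
}

Without collapse(2), only the N = 4 outer iterations would be distributed, which cannot keep more than 4 threads busy.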
nowait Clause
Listing 4.24 shows an example of using nowait clause.
• single construct: The associated structured block is executed by exactly one of the threads in the team; the other threads wait at an implicit barrier at the end of the construct unless a nowait clause is specified. For example:
void work1() {}
void work2() {}

void single_example() {
    #pragma omp parallel
    {
        #pragma omp single
        printf("Beginning work1.\n");
        work1();

        #pragma omp single
        printf("Finishing work1.\n");

        #pragma omp single nowait
        printf("Finished work1 and beginning work2.\n");
        work2();
    }
}
int TID;
float rate = 1.2;
omp_set_num_threads(4);
#pragma omp parallel private(rate, TID)
{
    TID = omp_get_thread_num();
    #pragma omp single copyprivate(rate)
    {
        rate = rand() * 1.0 / RAND_MAX;
    }
    printf("Value for variable rate: %f by thread %d\n", rate, TID);
}
• The master construct specifies a structured block that is executed by the master thread of
the team.
• There is no implied barrier either on entry to, or exit from, the master construct.
Example 4.2.9. The following code allows the OpenMP implementation to choose any number
of threads between 1 and 8.
omp_set_dynamic(1);
#pragma omp parallel num_threads(8)
Example 4.2.10. The following code only allows the OpenMP implementation to choose 8 threads.
The action in this case is implementation dependent.
omp_set_dynamic(0);
#pragma omp parallel num_threads(8)
• omp_get_dynamic() – You can determine the default setting by calling this function (returns
TRUE if dynamic setting is enabled, and FALSE if disabled).
Examples — firstprivate
See the example in Listing 4.31.
– Some thread in the team executes the task at some time later or immediately.
• Task barrier: The taskwait directive:
......
#pragma omp parallel
{
    #pragma omp single private(p)
    {
        p = list_head;
        while (p) {
            #pragma omp task
            processwork(p);
            p = p->next;
        }
    }
}
In Listing 4.34, for the first task construct, multiple tasks are created, one for each thread. All
foo() tasks are guaranteed to be completed at the omp barrier. For the second task construct, only
one bar() task is created. It is guaranteed to be completed at the implicit barrier, or the end of
structured block.
Example 4.2.14. Understanding the task construct. What will be the output of the program in Listing 4.35?
Example 4.2.15. Understanding the task construct. What will be the output of the program in Listing 4.36?
Example 4.2.16. Understanding the task construct. What will be the output of the program in Listing 4.37?
Example 4.2.17. Understanding the task construct. What will be the output of the program in Listing 4.38?
Example 4.2.18. Understanding the task construct. What will be the output of the program in Listing 4.39?
Example 4.2.21. Write an OpenMP parallel program for computing the nth Fibonacci number.
Compare the performance of the parallel implementation to the sequential one.
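One possible shape of an answer to Example 4.2.21 is sketched below (this is not the course's reference solution). Calls below a cutoff are computed serially, while larger calls spawn two child tasks and wait for them with taskwait; the cutoff value 30 is an arbitrary choice to keep task-creation overhead in check.

#include <omp.h>
#include <stdio.h>

long fib(int n) {
    if (n < 2) return n;
    if (n < 30) return fib(n - 1) + fib(n - 2);   /* serial below the cutoff */

    long x, y;
    #pragma omp task shared(x)
    x = fib(n - 1);
    #pragma omp task shared(y)
    y = fib(n - 2);
    #pragma omp taskwait   /* wait for both child tasks before combining */
    return x + y;
}

int main(void) {
    int n = 40;
    long result;
    #pragma omp parallel
    #pragma omp single     /* one thread spawns the root tasks; the others execute them */
    result = fib(n);
    printf("fib(%d) = %ld\n", n, result);
    return 0;
}

Timing this against the plain recursive version for the same n, as the example asks, makes the effect of the cutoff visible: without it, the overhead of creating millions of tiny tasks dominates.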
Task Switching
4.3 Summary
• OpenMP execution model
• Create a parallel region in OpenMP programs
• Worksharing constructs including loop construct, sections/section construct, and task con-
struct are discussed.
• We have studied the following constructs:
– parallel
– critical
– atomic
– barrier
– master
– for
– sections/section
– task
– single
• Clauses that can be used with some of the constructs
• Usage of OpenMP library functions and environment variables
• False sharing and race conditions.
4.4 Exercises
Objectives
• Compile and run an OpenMP program written in C
• Use OpenMP parallel, for, sections/section and task constructs to parallelize se-
quential programs.
• Use synchronization constructs including critical, atomic, and barrier, in OpenMP
programs where applicable.
• Understand the problems of false sharing and race condition in shared memory program-
ming, and apply proper techniques to eliminate such problems in OpenMP programs.
• Understand the various constructs in OpenMP.
Instructions
1. Baseline codes, as well as Makefile (for compilation of the programs) and run script run.sh
(for running of the programs) are provided for some questions. You may use any IDE for
C/C++, such as CodeLite or Code::Blocks, or you may use commands from a terminal to
compile and run OpenMP programs.
2. Note: If you submit code for any type of assessment, your code must be able to compile and
run from the Linux command line (terminal), as that is the only way your code will be
marked.
Programming Problems
1. Compile and run parallel_region.c. Change the number of threads in the parallel re-
gion by setting parallel construct clause, library function, or environment variable, respec-
tively.
2. Write a sequential program for computing the number π using the method discussed in the
class (see Section 4.2.2). Parallelize the serial program using OpenMP parallel construct.
Implement the following techniques discussed in the class:
Evaluate the performances of the sequential and parallel codes respectively by measuring
their elapsed time, and show the speedups of your parallel implementations.
Note: A base code is given for this problem in Lec4_codes folder. There is a Makefile
provided in the same folder. Your code must be able to compile using this makefile. For
example, open a Linux terminal with the correct path to the program files. From the ter-
minal, type make clean to remove all existing executables; then type make to compile
all .c files. Upon successful compilation, run the code using its respective executable file
name like ./pi 5 for the number π example, where the argument ‘5’ gives the number of
repetitive runs (to get an average runtime).
3. Implement an OpenMP program that computes the histogram of a set of M² integers (ranging
from 0 to 255) that are stored in an M × M two-dimensional array. Answer the following
questions.
4. Given the pseudo code of square matrix multiplication in Listing 4.43, implement matrix
multiplication, C = AB, where C ∈ R^{N×N}, A ∈ R^{N×M}, and B ∈ R^{M×N}, in parallel
using OpenMP for construct. Consider the following questions for your implementation.
(a) How many dot products are performed in your serial matrix multiplication?
(b) Use the OMP_NUM_THREADS environment variable to control the number of threads
and plot the performance with varying number of threads.
(c) Experiment with two cases in which (i) only the outermost loop is parallelized; (ii)
only the second inner loop is parallelized. What are the observed results from these
two cases?
(d) How does collapse clause work with for construct? Does it improve the perfor-
mance of your parallel matrix multiplication?
(e) Experiment with differing matrix sizes, such as N ≫ M or M ≫ N . How does it
impact the performance of your parallel implementation?
(f) Experiment with static and dynamic scheduling clauses to see the effects on the
performance of your parallel program. For each schedule clause, determine how many
dot products each thread performs in your program.
......
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++) {
        C[i][j] = 0.0;
        for (int k = 0; k < M; k++)
            C[i][j] = C[i][j] + A[i][k] * B[k][j];
    }
......
5. Count sort is a simple serial algorithm that can be implemented as follows (Listing 4.44).
The basic idea is that for each element a[i] in the list a, we count the number of elements
in the list that are less than a[i]. Then we insert a[i] into a temporary list using the index
determined by the count. The algorithm also deals with the issue where there are elements
with the same value.
• Write a C program that includes both serial and parallel implementations of count-
sort. In your parallel implementation, specify explicitly the data scope of each variable
in the parallel region.
• Compare the performance between the serial and parallel versions.
6. In what circumstances do you use the following environment variables? What are the re-
spective equivalent library functions to these environment variables?
(a) OMP_NUM_THREADS
(b) OMP_DYNAMIC
(c) OMP_NESTED
7. What is the difference between the two code snippets shown in Listing 4.45 in terms of
memory access?
// Code snippet 1
for ( int i =0; i < n ; i ++)
for ( int j =0; j < n ; j ++)
sum += a [ i ][ j ];
// Code snippet 2
for ( int j =0; j < n ; j ++)
for ( int i =0; i < n ; i ++)
sum += a [ i ][ j ];
8. Given the code snippet in Listing 4.46, collapse the nested for loops into a single for loop
manually.
int m = 4;
int n = 1000;
......
for ( int i =0; i < m ; i ++)
for ( int j =0; j < n ; j ++)
a [ i ][ j ] = i + j +1;
......
9. Design simple examples to illustrate the functions of various OpenMP synchronization con-
structs including barrier, critical, atomic, and ordered.
10. Suppose OpenMP does not have the reduction clause. Show how to implement an efficient
parallel reduction for finding the minimum or maximum value in an array.
11. Implement the problem of computing the Fibonacci numbers both in serial and parallel.
For your parallel version, use OpenMP task and sections/section constructs respec-
tively. Compare the performances of these 3 implementations, i.e., serial, parallel using
sections/section construct, parallel using task construct, by running the programs for
different problem sizes and thread numbers.
12. Implement quicksort algorithm both in serial and parallel. For your parallel version, use
OpenMP task and sections/section constructs respectively. Compare the performances
of these 3 implementations, i.e., serial, parallel using sections/section construct, par-
allel using task construct, by running the programs for different problem sizes and thread
numbers. A base code (qsort_omp.c) is given in Lec6_codes folder. Put your results in
a table like the one shown in Table 4.4. Answer (or consider) the following question:
(b) Understand the usage of various clauses of task and sections/section constructs
by applying them in your implementations where applicable, and observe the impacts
on the performance of your programs.
1 Put in the value of the threshold data size (to exit the parallel sorting) if used. Not valid for the sequential sort.
13. The scan operation, or all-prefix-sums operation, is one of the simplest and most useful building
blocks for parallel algorithms ([Blelloch 1990]). Given a set of elements [a0 , a1 , · · · , an−1 ],
the scan operation associated with the addition operator for this input is the output set [a0 , (a0 +
a1 ), · · · , (a0 + a1 + · · · + an−1 )]. For example, if the input set is [2, 1, 4, 0, 3, 7, 6, 3], then
the scan with the addition operator of this input is [2, 3, 7, 7, 10, 17, 23, 26].
It is simple to compute scan operation in serial, see Listing 4.47.
scan ( out [ N ] , in [ N ]) {
i =0;
out [0]= in [0];
for ( i =1; i < N ; i ++) {
out [ i ]= in [ i ]+ out [i -1];
}
}
Listing 4.47: Sequential algorithm for computing scan operation with ‘+’ operator
Sometimes it is useful for each element of the output vector to contain the sum of all the
previous elements, but not the element itself. Such an operation is called prescan. That is,
given the input [a0 , a1 , · · · , an−1 ], the output of prescan operation with addition operator
is [0, a0 , (a0 + a1 ), · · · , (a0 + a1 + · · · + an−2 )].
The algorithm for scan operation in Listing 4.47 is inherently sequential, as there is a loop
carried dependence in the for loop. However, Blelloch [1990] gives an algorithm for calcu-
lating the scan operation in parallel (see [Blelloch 1990, Pg. 42]). Based on this algorithm,
(i) implement the parallel algorithm for prescan using OpenMP; and (ii) implement an
OpenMP parallel program for scan operation based on the prescan parallel algorithm.
Bibliography
Blelloch, G. E. (1990). Prefix sums and their applications. Technical Report CMU-CS-90-190,
School of Computer Science, Carnegie Mellon University.
Chapman, B., Jost, G., and van der Pas, R. (2007). Using OpenMP: Portable Shared Memory
Parallel Programming. The MIT Press, Cambridge.
Trobec, R., Slivnik, B., Bulić, P., and Robič, B. (2018). Introduction to Parallel Computing:
From Algorithms to Programming on State-of-the-Art Platforms. Springer.
Chapter 5
Distributed Memory Programming using MPI
The contents in this chapter are based on sources [Grama et al. 2003] and [Trobec et al. 2018].
5.1.2 MPI
• The MPI (Message Passing Interface) standard is the most popular message-passing specification
that supports parallel programming.
• MPI is a message passing library specification.
• MPI is for communication among processes, which have separate address spaces.
• Inter-process communication consists of
– synchronization
– movement of data from one process’ address space to another’s.
• MPI is not
– a language or compiler specification
– a specific implementation or product
• MPI is for parallel computers, clusters, and heterogeneous networks.
• MPI versions:
– MPI-1 supports the classical message-passing programming model: basic point-to-
point communication, collectives, datatypes, etc. MPI-1 was defined (1994) by a
broadly based group of parallel computer vendors, computer scientists, and applica-
tions developers.
– MPI-2 was released in 1997
– MPI-2.1 (2008) and MPI-2.2 (2009) with some corrections to the standard and small
features
– MPI-3 (2012) added several new features to MPI.
– MPI-4 (2021) added several new features to MPI.
– The Standard itself: at http://www.mpi-forum.org. All MPI official releases, in both
postscript and HTML.
MPI_SUCCESS
MPI_ERROR
Communicators
• MPI_Comm: communicator – communication domain
– Group of processes that can communicate with one another
– Supplied as an argument to all MPI message passing functions
– A process can belong to multiple communication domains
• MPI_COMM_WORLD: root communicator – includes all the processes
MPI_Finalize () ;
return 0;
}
int id, p;
MPI_Comm_rank(MPI_COMM_WORLD, &id);
MPI_Comm_size(MPI_COMM_WORLD, &p);
• After a process has completed all of its MPI library calls, it calls the function MPI_Finalize,
allowing the system to free up resources (such as memory) that have been allocated to MPI,
and shutting down the MPI environment.
• for OpenMPI,
• Waits until a matching (on source, datatype, tag, comm) message is received from the communica-
tion system, and the buffer can be used.
• Source is the rank of sender in communicator comm, or MPI_ANY_SOURCE.
• status contains further information: The MPI type MPI_Status is a struct with at least three
members MPI_SOURCE, MPI_TAG, and MPI_ERROR .
• MPI_STATUS_IGNORE can be used if we don’t need any additional information.
MPI_Finalize () ;
return 0;
}
• The MPI_Get_count function returns the number of elements of the given datatype that were received; the source and tag of the message are available in the MPI_SOURCE and MPI_TAG fields of the status object.
5.3 Examples
Example 5.3.1. Rotating a token around processes connected via a ring interconnect network. Consider a
set of n processes arranged in a ring. Process 0 sends a token message, say "Hello!", to process 1; process
1 passes it to process 2; process 2 to process 3, and so on. Process n − 1 sends back the message to process
0. Write an MPI program that performs this simple token ring.
if (myrank == 0)
    prev = nproces - 1;
else
    prev = myrank - 1;
if (myrank == (nproces - 1))
    next = 0;
else
    next = myrank + 1;

if (myrank == 0)
    strcpy(token, "Hello!");
MPI_Send(token, MSG_SZ, MPI_CHAR, next, tag, MPI_COMM_WORLD);
MPI_Recv(token, MSG_SZ, MPI_CHAR, prev, tag, MPI_COMM_WORLD,
         MPI_STATUS_IGNORE);
printf("Process %d received token %s from process %d.\n", myrank,
       token, prev);
Running the implementation in Listing 5.1 gives the results shown in Listing ??. Why is the result not as
expected?
Listing 5.3 gives an alternative implementation of communication, and Listing ?? shows the result of
a sample run of this implementation.
if (myrank == 0) {
    strcpy(token, "Hello World!");
    MPI_Send(token, MSG_SZ, MPI_CHAR, next, tag, MPI_COMM_WORLD);
    MPI_Recv(token, MSG_SZ, MPI_CHAR, prev, tag, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    printf("Process %d received token %s from process %d.\n", myrank,
           token, prev);
}
else {
    MPI_Recv(token, MSG_SZ, MPI_CHAR, prev, tag, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    MPI_Send(token, MSG_SZ, MPI_CHAR, next, tag, MPI_COMM_WORLD);
    printf("Process %d received token %s from process %d.\n", myrank,
           token, prev);
}
Function Prototypes
The list below shows the prototypes of several MPI functions we discussed so far.
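For reference, the standard C prototypes of these basic functions are:

int MPI_Init(int *argc, char ***argv);
int MPI_Finalize(void);
int MPI_Comm_size(MPI_Comm comm, int *size);
int MPI_Comm_rank(MPI_Comm comm, int *rank);
int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest,
             int tag, MPI_Comm comm);
int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source,
             int tag, MPI_Comm comm, MPI_Status *status);
int MPI_Get_count(MPI_Status *status, MPI_Datatype datatype, int *count);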
5.5.2 Summary
• Basic set of MPI functions to write simple MPI programs
– Initialize and finalize MPI
– Inquire the basic MPI execution environment
– Basic point to point communication: send and receive
– Basic collective communication: broadcast and reduction
Example 5.6.1. Process 0 sends two messages with different tags to process 1, and process 1 receives them
in reverse order (see Listing 5.6).
Example 5.6.2. Consider the following piece of code, in which process i sends a message to process i + 1
(modulo the number of processes) and receives a message from process i − 1 (modulo the number of
processes) (Listing 5.7). (Note that this example is different from the token ring example in Lec7.)
Example 5.6.3. Example 5.6.2 cont. We can break the circular wait to avoid deadlocks as shown in
Listing 5.8.
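Another common way to break the circular wait (not the approach of Listing 5.8, which reorders the sends and receives) is to let MPI pair the send and receive with MPI_Sendrecv. A minimal sketch for the ring exchange of Example 5.6.2, assuming each process sends sendbuf and receives into recvbuf (N ints each; myrank and numprocs as usual):

MPI_Sendrecv(sendbuf, N, MPI_INT, (myrank + 1) % numprocs, 0,
             recvbuf, N, MPI_INT, (myrank - 1 + numprocs) % numprocs, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);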
Example 5.6.5.
int main ( int argc , char * argv []) {
int myid , numprocs , left , right , flag =0;
int buffer1 [10] , buffer2 [10];
MPI_Request request ; MPI_Status status ;
MPI_Init (& argc ,& argv ) ;
MPI_Comm_size ( MPI_COMM_WORLD , & numprocs ) ;
MPI_Comm_rank ( MPI_COMM_WORLD , & myid ) ;
/* initialize buffer2 */
......
right = ( myid + 1) % numprocs ;
left = myid - 1;
if ( left < 0)
left = numprocs - 1;
MPI_Irecv ( buffer1 , 10 , MPI_INT , left , 123 , MPI_COMM_WORLD , & request ) ;
MPI_Send ( buffer2 , 10 , MPI_INT , right , 123 , MPI_COMM_WORLD ) ;
MPI_Test (& request , & flag , & status ) ;
while (! flag ) {
    /* Do some work ... */
    MPI_Test (& request , & flag , & status ) ;   /* poll again */
}
/* buffer1 now contains the message received from the left neighbour */
MPI_Finalize () ;
return 0;
}
Figure 5.1: The trapezoidal rule: (a) area to be estimated, (b) estimate area using trapezoids
Example 5.6.6.
int main ( int argc , char * argv []) {
int numtasks , rank , next , prev , buf [2] , tag1 =1 , tag2 =2;
MPI_Request reqs [4]; MPI_Status stats [4];
MPI_Init (& argc ,& argv ) ;
MPI_Comm_size ( MPI_COMM_WORLD , & numtasks ) ;
MPI_Comm_rank ( MPI_COMM_WORLD , & rank ) ;
/* ring neighbours and the non-blocking receives that match the sends below */
prev = ( rank - 1 + numtasks ) % numtasks ;
next = ( rank + 1) % numtasks ;
MPI_Irecv (& buf [0] , 1 , MPI_INT , prev , tag1 , MPI_COMM_WORLD , & reqs [0]) ;
MPI_Irecv (& buf [1] , 1 , MPI_INT , next , tag2 , MPI_COMM_WORLD , & reqs [1]) ;
MPI_Isend (& rank , 1 , MPI_INT , prev , tag2 , MPI_COMM_WORLD , & reqs [2]) ;
MPI_Isend (& rank , 1 , MPI_INT , next , tag1 , MPI_COMM_WORLD , & reqs [3]) ;
MPI_Waitall (4 , reqs , stats ) ;
MPI_Finalize () ;
}
Example 5.6.7. The Trapezoidal Rule We can use the trapezoidal rule to approximate the area between
the graph of a function, y = f (x), two vertical lines, and the x-axis.
• If the endpoints of the subinterval are xi and xi+1 , then the length of the subinterval is h = xi+1 −xi .
Also, if the lengths of the two vertical segments are f (xi ) and f (xi+1 ), then the area of the trapezoid
is
Area of one trapezoid = (h/2) (f (xi ) + f (xi+1 )).
• Since the region is bounded by x = a and x = b and we divide it into N subintervals of equal length,
h = (b − a)/N.
• The pseudo code for a serial program:
h = (b-a)/N;
approx = (f(a) + f(b))/2.0;
for (i = 1; i <= N-1; i++)
    approx += f(a + i*h);
approx = h*approx;
Get a , b , n ;
h = (b - a)/n;
local_n = n / comm_sz ;
local_a = a + my_rank * local_n * h ;
local_b = local_a + local_n * h ;
local_integral = Trap ( local_a , local_b , local_n , h ) ;
if ( my_rank != 0)
Send local integral to process 0;
else { /* my_rank == 0 */
total_integral = local_integral ;
for ( proc = 1; proc < comm_sz ; proc ++) {
Receive local_integral from proc ;
total_integral += local_integral ;
}
}
if ( my_rank == 0)
print result ;
• The process with rank source sends the contents of the memory referenced by msg to all the pro-
cesses in the communicator MPI_COMM_WORLD.
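This is the broadcast operation; the standard C prototype of MPI_Bcast is shown below (the msg and source mentioned above correspond to the buffer and root arguments):

int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype,
              int root, MPI_Comm comm);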
Example 5.6.7 Continued
1. In our example mpi_trapezoid_1.c, we are using point-to-point communication function as shown
in Listing 5.13.
if ( my_rank == 0) {
for ( dest = 1; dest < comm_sz ; dest ++) {
MPI_Send ( a_p , 1 , MPI_DOUBLE , dest , 0 , MPI_COMM_WORLD ) ;
MPI_Send ( b_p , 1 , MPI_DOUBLE , dest , 0 , MPI_COMM_WORLD ) ;
MPI_Send ( n_p , 1 , MPI_INT , dest , 0 , MPI_COMM_WORLD ) ;
}
} else { /* my rank != 0 */
MPI_Recv ( a_p , 1 , MPI_DOUBLE , 0 , 0 , MPI_COMM_WORLD ,
MPI_STATUS_IGNORE );
MPI_Recv ( b_p , 1 , MPI_DOUBLE , 0 , 0 , MPI_COMM_WORLD ,
MPI_STATUS_IGNORE );
MPI_Recv ( n_p , 1 , MPI_INT , 0 , 0 , MPI_COMM_WORLD ,
MPI_STATUS_IGNORE );
}
2. Instead of using point-to-point communications, you can use collective communications here. Write
another function to implement this part using MPI_Bcast(), as sketched below.
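A possible sketch of this replacement, using the same variables a_p, b_p, and n_p as in Listing 5.13: every process, including the root, makes the same MPI_Bcast calls.

/* broadcast a, b and n from process 0 to all processes in the communicator */
MPI_Bcast(a_p, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(b_p, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(n_p, 1, MPI_INT, 0, MPI_COMM_WORLD);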
MPI_Reduce
• MPI_Reduce combines data from all processes in the communicator and returns it to one process
(see an illustration in Figure 5.3).
• In many numerical algorithms, Send/Receive can be replaced by Bcast/Reduce, improving both
simplicity and efficiency.
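For reference, the standard C prototype of MPI_Reduce is:

int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype,
               MPI_Op op, int root, MPI_Comm comm);

In the trapezoidal-rule program, for example, the receive loop on process 0 can be replaced (assuming the integrals are doubles) by a single call such as MPI_Reduce(&local_integral, &total_integral, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD).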
2. Instead of using point-to-point communications, you can also use collective communications here.
Rewrite this part using appropriate collective communication.
MPI_Reduce
• Suppose that each process calls MPI_Reduce with operator MPI_SUM, and destination process 0.
What happens with the multiple calls of MPI_Reduce in Table 5.10? What are the values for b and
d after executing the second MPI_Reduce?
MPI_Scatter
• The scatter operation is to distribute distinct messages from a single source task (or process) to each
task in the group.
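Its standard C prototype is:

int MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype sendtype,
                void *recvbuf, int recvcount, MPI_Datatype recvtype,
                int root, MPI_Comm comm);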
Figure 5.5: (a) A global sum followed by a broadcast; (b) a butterfly-structured global sum.
MPI_Allgather
• MPI also provides the MPI_Allgather function in which the data are gathered at all the processes.
• See Figure 5.8 for an illustration.
MPI_Alltoall
• The all-to-all communication operation is performed by:
int MPI_Alltoall(void *sendbuf, int sendcount,
MPI_Datatype senddatatype, void *recvbuf,
int recvcount, MPI_Datatype recvdatatype,
MPI_Comm comm)
• Each task in a group performs a scatter operation, sending a distinct message to all the tasks in the
group in order by index. See the illustration in Figure 5.9.
Example 5.7.1 (Matrix vector multiplication). If A = (aij ) is an m × n matrix and x is a vector with n
components, then y = Ax is a vector with m components. Furthermore, yi = ai0 x0 + ai1 x1 + · · · + ai,n−1 xn−1 , for i = 0, 1, . . . , m − 1.
Process 0 reads in the matrix and distributes row blocks to all the processes in communicator comm.
Listing 5.15 shows the code snippet for this.
if (my_rank == 0) {
    A = malloc(m * n * sizeof(double));
    if (A == NULL) local_ok = 0;
    Check_for_error(local_ok, "Random_matrix",
                    "Can't allocate temporary matrix", comm);
    srand(2021);
    for (i = 0; i < m; i++)
        for (j = 0; j < n; j++)
            A[i * n + j] = (double) rand() / RAND_MAX;
    MPI_Scatter(A, local_m * n, MPI_DOUBLE, local_A, local_m * n,
                MPI_DOUBLE, 0, comm);
    free(A);
} else {
    Check_for_error(local_ok, "Random_matrix",
                    "Can't allocate temporary matrix", comm);
    MPI_Scatter(A, local_m * n, MPI_DOUBLE, local_A, local_m * n,
                MPI_DOUBLE, 0, comm);
}
Each process gathers the entire vector, then proceeds to compute its share of sub-matrix and vector
multiplication (see Listing 5.16).
MPI_Scan
• To compute prefix-sums, MPI provides:
int MPI_Scan(void *sendbuf, void *recvbuf, int count,
MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
Figure 5.10 illustrates a simple scan operation among a group of MPI processes.
• Using this core set of collective operations, MPI communications can be greatly simplified.
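As a small illustration of MPI_Scan (a sketch; variable names are illustrative), each process contributes one integer and receives the sum over ranks 0 up to and including its own:

int my_rank, x, prefix;
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
x = my_rank + 1;                                  /* this process's contribution */
MPI_Scan(&x, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
/* on process r, prefix now holds 1 + 2 + ... + (r + 1) */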
MPI_Scatterv Scatters a buffer in parts to all processes in a communicator, which allows different
amounts of data to be sent to different processes.
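Its standard C prototype is:

int MPI_Scatterv(void *sendbuf, int *sendcounts, int *displs,
                 MPI_Datatype sendtype, void *recvbuf, int recvcount,
                 MPI_Datatype recvtype, int root, MPI_Comm comm);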
For Example 5.7.2, assuming the matrix is only 4 × 4, and we are running the MPI code using 4
processes, then some of the arguments of calling MPI_Scatterv:
• sendcounts[4] = {4, 3, 2, 1};
• displs[4] = {0, 4, 8, 12} which is with reference to sendbuf; these values can be expressed as
N * rank, where rank is the rank of a process.
• note also that recvcount in MPI_Scatterv is a scalar; for process 0, recvcount = 4 (=4-0); for pro-
cess 1, recvcount = 3 (=4-1); for process 2, recvcount = 2 (=4-2); and for process 3, recvcount
= 1 (=4-3); so this value can be obtained as N - rank where N is the number of rows in the matrix,
and rank is the rank of a process.
scatterv_1.c gives an example code for Example 5.7.2.
MPI_Alltoallv Sends data from all processes to all processes; unlike MPI_Alltoall, each process may send a different amount of data and provide displacements for the input and output data.
int MPI_Alltoallv(void *sendbuf, int *sendcounts, int *sdispls,
                  MPI_Datatype sendtype, void *recvbuf, int *recvcounts,
                  int *rdispls, MPI_Datatype recvtype, MPI_Comm comm)
• sendbuf: starting address of send buffer
• sendcounts: integer array equal to the group size specifying the number of elements to send to each
processor
• sdispls: integer array (of length group size). Entry j specifies the displacement (relative to sendbuf)
from which to take the outgoing data destined for process j
• sendtype: data type of send buffer elements
• recvcounts: integer array equal to the group size specifying the maximum number of elements that
can be received from each processor
• rdispls: integer array (of length group size). Entry i specifies the displacement (relative to recvbuf)
at which to place the incoming data from process i
• recvtype: data type of receive buffer elements
Example 5.7.3. Given the MPI_Alltoallv argument settings shown in Figure 5.12 (the number of pro-
cesses is 3), what is the content of recvbuf for each process (see the results in Figure 5.13)?
The following function allows a different number of data elements to be sent by each process by re-
placing recvcount in MPI_Gather with an array recvcounts.
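This function is MPI_Gatherv, with the prototype:

int MPI_Gatherv(void *sendbuf, int sendcount, MPI_Datatype sendtype,
                void *recvbuf, int *recvcounts, int *displs,
                MPI_Datatype recvtype, int root, MPI_Comm comm);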
• sendbuf: pointer, starting address of send buffer (or the data to be sent)
• sendcount: the number of elements in the send buffer
• sendtype: datatype of send buffer elements
• recvbuf: pointer, starting address of receive buffer (significant only at root)
• recvcounts: integer array (of length group size) containing the number of elements to be received
from each process (significant only at root)
• displs: integer array (of length group size). Entry i specifies the displacement relative to recvbuf
at which to place the incoming data from process i (significant only at root)
• recvtype: the datatype of data to be received (significant only at root)
• root: rank of the receiving process (integer)
Gather data from all processes and deliver the combined data to all processes
• sendbuf: pointer, starting address of send buffer (or the data to be sent)
• sendcount: the number of elements in the send buffer
• sendtype: datatype of send buffer elements
• recvbuf: pointer, starting address of receive buffer (significant only at root)
• recvcounts: integer array (of length group size) containing the number of elements to be received
from each process (significant only at root)
• displs: integer array (of length group size). Entry i specifies the displacement relative to recvbuf
at which to place the incoming data from process i (significant only at root)
• recvtype: the datatype of data to be received (significant only at root)
5.8 Examples
5.8.1 Parallel Quicksort Algorithm
• one process broadcasts an initial pivot to all processes;
• each process partitions its local elements into those smaller and those larger than the pivot;
• each process in the upper half swaps with a partner in the lower half (sending its smaller elements and receiving the partner's larger ones);
• recurse on each half;
• swap among partners within each half;
• finally, each process uses quicksort on its local elements.
Figure 5.14 illustrates the parallel quicksort using four MPI processes.
5.8.2 Hyperquicksort
Limitation of parallel quicksort: poor balancing of list sizes.
Hyperquicksort: sort elements before broadcasting pivot.
• sort elements in each process
• select median as pivot element and broadcast it
• each process in the upper half swaps with a partner in the lower half
• recurse on each half
if (rank == 0) {
    dest = 1;
    source = 1;
    rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
    rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
}
else if (rank == 1) {
    dest = 0;
    source = 0;
    rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
    rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
}
rc = MPI_Get_count(&Stat, MPI_CHAR, &count);
printf("Task %d: Received %d char(s) from task %d with tag %d\n",
       rank, count, Stat.MPI_SOURCE, Stat.MPI_TAG);
MPI_Finalize();
}
Listing 5.17: MPI code for ping-pong point-to-point communication between two MPI processes
Example 5.8.2. Perform a scatter operation on the rows of an array. See Listing 5.18.
for i = 0 to n-1 do
    if i is even then
        for j = 0 to n/2 - 1 do
            compare-exchange(a(2j), a(2j+1));
    if i is odd then
        for j = 0 to n/2 - 2 do
            compare-exchange(a(2j+1), a(2j+2));
Odd-even transposition sort – The parallel algorithm. Listing 5.20 gives the parallel version of odd-even
transposition sort.
• A sequence (a0 , a1 , . . . , an−1 ) is bitonic if 1. there is an index m such that a0 ≤ a1 ≤ · · · ≤ am ≥ am+1 ≥ · · · ≥ an−1 ,
2. or there exists a cyclic shift σ of (0, 1, . . . , n−1) such that the sequence (aσ(0) , aσ(1) , . . . , aσ(n−1) )
satisfies condition 1. A cyclic shift sends each index i to (i + s) mod n, for some integer s.
• A bitonic sequence has two tones – increasing and decreasing, or vice versa. Any cyclic rotation of
such a sequence is also considered bitonic.
• (1, 2, 4, 7, 6, 0) is a bitonic sequence, because it first increases and then decreases. (8, 9, 2, 1, 0, 4)
is another bitonic sequence, because it is a cyclic shift of (0, 4, 8, 9, 2, 1). Similarly, the sequence
(1, 5, 6, 9, 8, 7, 3, 0) is bitonic, as is the sequence (6, 9, 8, 7, 3, 0, 1, 5), since it can be obtained from
the first by a cyclic shift.
• If sequence A = (a0 , a1 , . . . , an−1 ) is bitonic, then we can form two bitonic sequences from A as
Amin = (min(a0 , an/2 ), min(a1 , an/2+1 ), . . . , min(an/2−1 , an−1 ))
and
Amax = (max(a0 , an/2 ), max(a1 , an/2+1 ), . . . , max(an/2−1 , an−1 )).
Amin and Amax are bitonic sequences, and each element of Amin is less than every element in
Amax .
• We can apply the procedure recursively on Amin and Amax to get the sorted sequence.
• For example, A = (6, 9, 8, 7, 3, 0, 1, 5) is a bitonic sequence. We can split it into two bitonic
sequences by finding Amin = (min(6, 3), min(9, 0), min(8, 1), min(7, 5)), which is Amin =
(3, 0, 1, 5) (first decrease, then increase), and Amax = (max(6, 3), max(9, 0), max(8, 1), max(7, 5)),
which is Amax = (6, 9, 8, 7) (first increase, then decrease).
• The kernel of the network is the rearrangement of a bitonic sequence into a sorted sequence.
Original sequence: 3 5 8 9 10 12 14 20 95 90 60 40 35 23 18 0
1st Split: 3 5 8 9 10 12 14 0 95 90 60 40 35 23 18 20
2nd Split: 3 5 8 0 10 12 14 9 35 23 18 20 95 90 60 40
3rd Split: 3 0 8 5 10 9 14 12 18 20 35 23 60 40 95 90
4th Split: 0 3 5 8 9 10 12 14 18 20 23 35 40 60 90 95
Table 5.11: Merging a 16-element bitonic sequence through a series of log 16 bitonic splits.
• We can easily build a sorting network to implement this bitonic merge algorithm.
• Such a network is called a bitonic merging network. See Table 5.11 for an example.
• The network contains log n columns (see Figure 5.18). Each column contains n/2 comparators and
performs one step of the bitonic merge.
• We denote a bitonic merging network with n inputs by ⊕BM[n].
• Replacing the ⊕ comparators by ⊖ comparators results in a decreasing output sequence; such a
network is denoted by ⊖BM[n]. (Here, a comparator refers to a device with two inputs x and y
and two outputs x′ and y ′ . For an increasing comparator, denoted by ⊕, x′ = min(x, y) and
y ′ = max(x, y), and vice versa for decreasing comparator, denoted by ⊖.)
• The depth of the full bitonic sorting network is Θ(log² n), and each stage of the network contains n/2 comparators. A
serial implementation of the network would therefore have complexity Θ(n log² n). On the other hand, the
parallel bitonic sorting network sorts n elements in Θ(log² n) time, since the comparators within each
stage are independent of one another and can operate in parallel.
• How do we sort an unsorted sequence using a bitonic merge?
– We must first build a single bitonic sequence from the given sequence. See Figure 5.16 for an
illustration of building a bitonic sequence from an input sequence.
∗ A sequence of length 2 is a bitonic sequence.
∗ A bitonic sequence of length 4 can be built by sorting the first two elements using
⊕BM[2] and next two, using ⊖BM[2].
∗ This process can be repeated to generate larger bitonic sequences.
– Once we have turned our input into a bitonic sequence, we can apply a bitonic merge process
to obtain a sorted list. Figure 5.17 shows an example.
To implement bitonic sorting in MPI, the basic idea is that you have many more elements than the
number of PEs: for example, sorting a million or more data elements using a small number of
MPI processes, say 4, 8, or 16. You therefore need to fit the bitonic sorting idea into this
kind of framework, rather than sorting only 16 elements with 16 PEs. In your implementation, the
main task from the perspective of the distributed programming model is to figure out how to pair up a PE
Figure 5.16: A schematic representation of a network that converts an input sequence into a bitonic se-
quence. In this example, ⊕BM[k] and ⊖BM[k] denote bitonic merging networks of input size k that use
⊕ and ⊖ comparators, respectively. The last merging network (⊕BM[16]) sorts the input. In this example,
n = 16.
Figure 5.17: A bitonic merging network for n = 16. The input wires are numbered 0, 1 . . . , n − 1, and
the binary representation of these numbers is shown. Each column of comparators is drawn separately; the
entire figure represents a ⊕BM[16] bitonic merging network. The network takes a bitonic sequence and
outputs it in sorted order.
with its correct partner at each step. The overall idea is to begin by dividing the data elements evenly among the
PEs; each PE then sorts its share of elements in increasing or decreasing order depending on
its MPI process rank. In this way, the goal is to generate a global bitonic sequence jointly owned by all the
MPI processes. For example, with 4 MPI processes, processes 0 and 1 can jointly
own a monotonic (say increasing) sequence, and processes 2 and 3 can jointly own another monotonic (say
decreasing) sequence; processes 0, 1, 2, and 3 then jointly own a bitonic sequence. The remaining work
is then to carry out the bitonic split steps; a sketch of the pairing and compare-split logic is given below. Lastly, note that you need to assume both the
number of elements to be sorted and the number of PEs to be powers of two.
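The sketch below shows one way to organize this pairing, following the hypercube formulation in [Grama et al. 2003], in which every process keeps its local block sorted in increasing order and a compare-split step exchanges blocks with a partner. All names are illustrative; p = 2^d processes and n local elements per process are assumed, and each process is assumed to have sorted its local block (e.g. with qsort) before the call.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Exchange the local block with `partner`; keep the n smallest (keep_large == 0)
   or the n largest (keep_large == 1) of the 2n elements. Both blocks are
   assumed sorted in increasing order on entry and remain so on exit. */
static void compare_split(int *local, int n, int partner, int keep_large,
                          MPI_Comm comm) {
    int *recv = malloc(n * sizeof(int));
    int *merged = malloc(2 * n * sizeof(int));
    MPI_Sendrecv(local, n, MPI_INT, partner, 0,
                 recv, n, MPI_INT, partner, 0, comm, MPI_STATUS_IGNORE);
    for (int i = 0, a = 0, b = 0; i < 2 * n; i++)   /* merge two sorted blocks */
        merged[i] = (b >= n || (a < n && local[a] <= recv[b]))
                        ? local[a++] : recv[b++];
    memcpy(local, keep_large ? merged + n : merged, n * sizeof(int));
    free(recv);
    free(merged);
}

/* Pairing logic: in stage i, step j, the partner differs in bit j of the rank;
   keep the larger half when bit (i + 1) and bit j of the rank differ. */
void bitonic_sort_blocks(int *local, int n, int rank, int d, MPI_Comm comm) {
    for (int i = 0; i < d; i++)
        for (int j = i; j >= 0; j--) {
            int partner = rank ^ (1 << j);
            int keep_large = ((rank >> (i + 1)) & 1) != ((rank >> j) & 1);
            compare_split(local, n, partner, keep_large, comm);
        }
}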
Figure 5.18: The comparator network that transforms an input sequence of 16 unordered numbers into a
bitonic sequence.
double x [1000];
....
for ( i =0; i <1000; i ++) {
if ( my_rank == 0)
MPI_Send (& x [ i ] , 1 , MPI_DOUBLE , 1 , 0 , comm ) ;
else
MPI_Recv (& x [ i ] , 1 , MPI_DOUBLE , 0 , 0 , comm ,& status ) ;
}
/* the following is more efficient than using the for loop */
if ( my_rank == 0)
MPI_Send (& x [0] , 1000 , MPI_DOUBLE , 1 , 0 , comm ) ;
else
MPI_Recv (& x [0] , 1000 , MPI_DOUBLE , 0 , 0 , comm , & status ) ;
In distributed-memory systems, communication can be much more expensive than local computation.
Thus, if we can reduce the number of communications, we are likely to improve the performance of our
programs.
MPI Built-in Datatypes
• The MPI standard defines many built in datatypes, mostly mirroring standard C/C++ or FORTRAN
datatypes
• These are sufficient when sending single instances of each type
• They are also usually sufficient when sending contiguous blocks of a single type
• Sometimes, however, we want to send non-contiguous data or data that is comprised of multiple
types
• MPI provides a mechanism to create derived datatypes that are built from simple datatypes
• In MPI, a derived datatype can be used to represent any collection of data items in memory by
storing both the types of the items and their relative locations in memory.
• Why use derived datatypes?
– Primitive datatypes are contiguous;
– Derived datatypes allow you to specify non-contiguous data in a convenient manner and treat
it as though it is contiguous;
– Useful to
∗ Make code more readable
∗ Reduce number of messages and increase their size (faster since less latency);
∗ Make code more efficient if messages of the same size are repeatedly used.
5.9.1 Typemap
Formally, a derived datatype in MPI is described by a typemap, which consists of a sequence of basic MPI datatypes
together with a displacement for each of the datatypes. That is,
• a sequence of basic datatypes: {type0 , ..., typen−1 }
• a sequence of integer displacements: {displ0 , ..., displn−1 }.
• Typemap = {(type0 , disp0 ), · · · , (typen−1 , dispn−1 )}
For example, a typemap might consist of (double,0),(char,8) indicating the type has two elements:
• a double precision floating point value starting at displacement 0,
• and a single character starting at displacement 8.
• Types also have extent, which indicates how much space is required for the type
• The extent of a type may be more than the sum of the bytes required for each component
• For example, on a machine that requires double-precision numbers to start on an 8-byte boundary,
the type (double,0),(char,8) will have an extent of 16 even though it only requires 9 bytes
int MPI_Type_contiguous(
    int count,                 // replication count: number of elements
    MPI_Datatype oldtype,      // old datatype
    MPI_Datatype *newtype);    // new (derived) datatype
To define the new datatype in this example, and to release it after we have finished using it:
MPI_Datatype rowtype;
MPI_Type_contiguous(4, MPI_DOUBLE, &rowtype);
MPI_Type_commit(&rowtype);
......
MPI_Type_free(&rowtype);
To define a new datatype:
• Declare the new datatype as MPI_Datatype.
• Construct the new datatype.
• Before we can use a derived datatype in a communication function, we must first commit it with a
call to
int MPI_Type_commit(MPI_Datatype* datatype);
Commits new datatype to the system. Required for all derived datatypes.
• When we finish using the new datatype, we can free any additional storage used with a call to
int MPI_Type_free(MPI_Datatype* datatype)
The new datatype is essentially an array of count elements having type oldtype. For example, the
following two code fragments are equivalent:
MPI_Send (a,n,MPI_DOUBLE,dest,tag,MPI_COMM_WORLD);
and
MPI_Datatype rowtype;
MPI_Type_contiguous(n, MPI_DOUBLE, &rowtype);
MPI_Type_commit(&rowtype);
MPI_Send(a, 1, rowtype, dest, tag, MPI_COMM_WORLD);
Example 5.9.2.
# define SIZE 4
float a [ SIZE ][ SIZE ] =
{1.0 , 2.0 , 3.0 , 4.0 , 5.0 , 6.0 , 7.0 , 8.0 ,
9.0 , 10.0 , 11.0 , 12.0 , 13.0 , 14.0 , 15.0 , 16.0};
float b [ SIZE ];
MPI_Status stat ;
MPI_Datatype rowtype ;
MPI_Init (& argc ,& argv ) ;
MPI_Comm_rank ( MPI_COMM_WORLD , & rank ) ;
MPI_Comm_size ( MPI_COMM_WORLD , & numtasks ) ;
MPI_Type_contiguous ( SIZE , MPI_FLOAT , & rowtype ) ;
MPI_Type_commit (& rowtype ) ;
if ( numtasks == SIZE ) {
if ( rank == 0)
for (i =0; i < numtasks ; i ++)
MPI_Send (& a [ i ][0] , 1 , rowtype , i , tag , MPI_COMM_WORLD ) ;
/* the datatype rowtype can also be used in the following function */
MPI_Recv (b , SIZE , MPI_FLOAT , source , tag , MPI_COMM_WORLD ,& stat ) ;
// MPI_Recv (b ,1 , rowtype , source , tag , MPI_COMM_WORLD ,& stat ) ;
printf ( " rank = % d b = %3.1 f %3.1 f %3.1 f %3.1 f \ n " , rank , b [0] , b [1] , b
[2] , b [3]) ;
}
MPI_Type_free (& rowtype ) ;
MPI_Finalize () ;
int MPI_Type_vector (
int count ,
int blocklength ,
int stride ,
MPI_Datatype oldtype ,
MPI_Datatype * newtype )
• Input parameters
– count: number of blocks (nonnegative integer)
– blocklength: number of elements in each block (integer)
– stride: number of elements between each block (integer)
For example, the following two types can be used to communicate a single row and a single column of
a matrix (ny × nx):
MPI_Datatype rowType, colType;
MPI_Type_vector(nx, 1, 1, MPI_DOUBLE, &rowType);
MPI_Type_vector(ny, 1, nx, MPI_DOUBLE, &colType);
MPI_Type_commit(&rowType);
MPI_Type_commit(&colType);
Example 5.9.3.
# define SIZE 4
float a [ SIZE ][ SIZE ] =
{1.0 , 2.0 , 3.0 , 4.0 , 5.0 , 6.0 , 7.0 , 8.0 ,
9.0 , 10.0 , 11.0 , 12.0 , 13.0 , 14.0 , 15.0 , 16.0};
float b [ SIZE ];
MPI_Status stat ;
MPI_Datatype coltype ;
MPI_Init (& argc ,& argv ) ;
MPI_Comm_rank ( MPI_COMM_WORLD , & rank ) ;
MPI_Comm_size ( MPI_COMM_WORLD , & numtasks ) ;
MPI_Type_vector ( SIZE , 1 , SIZE , MPI_FLOAT , & coltype ) ;
MPI_Type_commit (& coltype ) ;
if ( numtasks == SIZE ) {
if ( rank == 0) {
for ( i =0; i < numtasks ; i ++)
MPI_Send (& a [0][ i ] , 1 , coltype , i , tag , MPI_COMM_WORLD ) ; /* send column i */
}
MPI_Recv (b , SIZE , MPI_FLOAT , source , tag , MPI_COMM_WORLD ,& stat ) ;
}
MPI_Type_free (& coltype ) ;
MPI_Finalize () ;
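The constructor described by the parameter list below is MPI_Type_indexed; its prototype (using the parameter names from the list) is:

int MPI_Type_indexed(int count, int *blocklens, int *indices,
                     MPI_Datatype oldtype, MPI_Datatype *newtype);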
• Input parameters
– count: number of blocks — also number of entries in indices and blocklens
– blocklens: number of elements in each block (array of nonnegative integers)
– indices: displacement of each block in multiples of oldtype (array of integers)
– oldtype: old datatype
• Output parameters
– newtype: new datatype
• See Figure 5.20 for an example of MPI_Type_indexed.
The indexed type generalizes the vector type; instead of a constant stride, blocks can have varying lengths
and displacements.
Example 5.9.4.
int blocklen [] = {4 , 2 , 2 , 6 , 6};
int disp [] = {0 , 8 , 12 , 16 , 23};
MPI_Datatype mytype ;
MPI_Type_indexed (5 , blocklen , disp , MPI_DOUBLE , & mytype ) ;
MPI_Type_commit (& mytype ) ;
......
MPI_Type_free (& mytype ) ;
int MPI_Type_create_struct (
int count , // number of elements in the datatype
int array_of_blocklengths [] , // length of each element
MPI_Aint array_of_displacements [] , // displacements in bytes
MPI_Datatype array_of_types [] ,
MPI_Datatype * new_type_p )
Example 5.9.5 (Moving particles between processes). In N-body problems, the force between particles
becomes weaker with increasing distance. At a great enough distance, the influence of a particle on others is
negligible. A number of algorithms for N-body simulation take advantage of this fact. These algorithms
organize the particles in groups based on their locations using tree structures such as quad-tree. One
important step in the implementation of these algorithms is that of transferring particles from one process
to another as they move. Here, we only discuss a way in which movement of particles can be done in MPI.
typedef struct {
int x,y,z;
double mass;
}Particle;
where x, y, z are the spatial coordinates, and mass is the physical mass of the particle.
• To send a particle from one process to another, or broadcast the particle, it makes sense in MPI to
create a datatype that encapsulate all the components (of different datatypes) in the struct, instead
of sending the elements in the struct individually.
void Build_mpi_type ( int * x_p , int * y_p , int * z_p , double * mass_p ,
MPI_Datatype * particletype_p ) {
int array_of_blocklengths [4] = {1 , 1 , 1 , 1};
MPI_Datatype array_of_types [4] = { MPI_INT , MPI_INT , MPI_INT ,
MPI_DOUBLE };
MPI_Aint array_of_displacements [4] = {0};
MPI_Get_address ( x_p , & array_of_displacements [0]) ;
MPI_Get_address ( y_p , & array_of_displacements [1]) ;
MPI_Get_address ( z_p , & array_of_displacements [2]) ;
MPI_Get_address ( mass_p , & array_of_displacements [3]) ;
for ( int i = 3; i >= 0; i -- ) /* make displacements relative to the first member */
array_of_displacements [ i ] -= array_of_displacements [0];
MPI_Type_create_struct (4 , array_of_blocklengths ,
array_of_displacements , \
array_of_types , particletype_p ) ;
MPI_Type_commit ( particletype_p ) ;
} /* Build_mpi_type */
Particle my_particle ;
MPI_Datatype particletype ;
/* call the function to create the new MPI datatype */
Build_mpi_type (& my_particle .x , & my_particle .y , & my_particle .z , &
my_particle . mass , & particletype ) ;
/* process 0 does some computation with my_particle */
......
/* process 0 performs a broadcast */
5.10 Summary
• Point-to-point communication
– Blocking vs non-blocking
– Safety in MPI programs
• Collective communication
– Collective communications involve all the processes in a communicator.
– All the processes in the communicator must call the same collective function.
– Collective communications do not use tags; messages are matched by the order in which
the calls are made within the communicator.
– Some important MPI collective communications we learned: MPI_Reduce, MPI_Allreduce,
MPI_Bcast, MPI_Gather, MPI_Scatter, MPI_Allgather, MPI_Alltoall, MPI_Scan etc.
• Beyond the basic MPI datatypes that correspond to integers, characters, and floating point numbers,
MPI provides library functions for creating derived datatypes. Derived datatypes allow packing
into a single message data items that may be located at contiguous or non-contiguous memory
locations and may be of the same or of different MPI basic datatypes. Used efficiently, MPI
derived datatypes can improve the performance for certain problems.
5.11 Exercises
Objectives
• Apply the basic MPI functions to write simple MPI programs
• Compile and run MPI programs locally; compile and run MPI programs using MSL cluster
• Apply MPI point-to-point and collective communication functions to write MPI programs.
• Apply MPI derived datatypes in MPI programs.
Problems
1. How do you launch multiple processes to run an MPI program? Compile and run the example codes
in Lecture 7. Are the processes running the same code or the same tasks in these example programs?
2. Suppose the size of communicator comm_sz = 4, and x is a vector with n = 26 elements.
(a) How would the elements of x be distributed among the processes in a program using a block
distribution?
(b) How would the elements of x be distributed among the processes in a program using a block
cyclic distribution with block size b = 4?
3. Using the logical interconnect structure we set in token ring example, and using point-to-point com-
munications only, implement an efficient broadcast operation that has a time complexity O(log p),
instead of O(p), where p is the number of processes in the communicator.
and the number of trapezoids. With this, if you run your code with arguments a = 0.0, b = 1.0 and
a sufficiently large number of trapezoids, say 10^6, the estimated area will be an approximation
of the number π. For example, with a simple MPI implementation, running the code with the above
arguments gives the output shown below.
mpiexec -n 4 ./mpi_trapezoid_1 0.0 1.0 1000000
With n = 1000000 trapezoids, our estimate of the integral
from 0.000000 to 1.000000 = 3.141592653589597e+00 in 0.009453s
Your task is to complete the code using the code segments given in the example first, and then
rewrite the code by using collective communication functions to replace point-to-point communi-
cations where it is possible. Submit two versions of the code. Name the point-to-point communi-
cation version mpi_trape_p2p_<student number>.c and the collective communication version
mpi_trape_collective_<student number>.c. Submit these two code files on Ulwazi.
6. Implement the matrix-vector multiplication problem given in Example 5.7.1 (or Example 6, Lec8
slides).
7. Write a simple program that performs a scatter operation that distributes the individual elements of
an array of integers among the processes; each process modifies the value it receives, and a
gather operation then collects the modified data back into the original array.
8. Write a simple program that performs an all-to-all operation that distributes k(k ≥ 1) individual
elements of an array from each process to all processes (using MPI_Alltoall function).
9. Implement the odd-even transposition sort using MPI according to the parallel algorithm given in
Example ?? (or Example 9, Lec8 slides).
10. How would you implement quicksort using MPI (Lec8 slides)?
11. Suppose comm_sz = 8 and the vector x = (0, 1, 2, . . . , 15) has been distributed among the pro-
cesses using a block distribution. Implement an allgather operation using a butterfly structured
communication (Pg.37 (b), Lec8 slides) and point-to-point communication functions.
12. Write an MPI program that computes a tree-structured global sum without using MPI_Reduce (Pg.
37 (a), Lec8 slides, without the broadcasting step). Write your program for the case in which
comm_sz is a power of 2.
13. Write an MPI program that sends the upper triangular portion of a square matrix stored on process
0 to process 1. Explore using following different methods for this problem: (i) using an appropriate
MPI collective communication; (ii) using MPI derived datatype to define a datatype that can pack
the upper triangle of a matrix, and then use this new datatype for the communication.
14. Write a dense matrix transpose function: Suppose a dense n × n matrix A is stored on process 0.
Create a derived datatype representing a single column of A. Send each column of A to process
1, but have process 1 receive each column into a row. When the function returns, process 1 should
have a copy of AT .
15. Suppose a matrix A ∈ Rn×n is distributed over p number of MPI processes in row-wise manner,
and a vector x ∈ Rn is also distributed over the same number of processes. That is, each process
holds a k × n (k rows) sub-matrix, Asub of A, and k components, xsub , of x, where kp = n. Using
such a distributed setting of matrix A and vector x among p MPI processes, write an MPI program
that computes y = Ax, where each process holds only k components of y that corresponds to the
k rows of A. (Note: For this problem, you should consider randomly generating corresponding
Asub and xsub on each process directly to simulate the given data distribution, instead of generating
the entire matrix or vector on a single process and then scattering the respective sub-matrices or
sub-vectors.)
16. Continued from problem 15, given the same distributed setting of A and x among p number of
processes, how would you compute z = AT x in a communication efficient manner? Implement
your method. (Note: Given the row-wise distribution of A, one can treat each row as a column of
AT on each process. That means we don’t need to do any communication in order to obtain matrix
AT in the first place to compute z.)
17. Write an MPI program that completes Example 6 in Lec9 slides, where each process sends a particle
to all the other processes using the derived datatype given in this example (only the data movement
part).
Bibliography
Chapman, B., Jost, G., and van der Pas, R. (2007). Using OpenMP: Portable Shared Memory Parallel
Programming. The MIT Press, Cambridge.
Grama, A., Gupta, A., Karypis, G., and Kumar, V. (2003). Introduction to Parallel Computing. Addison
Wesley.
Trobec, R., Slivnik, B., Bulić, P., and Robič, B. (2018). Introduction to Parallel Computing: From Algo-
rithms to Programming on State-of-the-Art Platforms. Springer.