MULTICORE ARCHITECTURE
INTRODUCTION
The processor is the main component of a computer system. It is the logic circuitry that processes instructions and is
also called the CPU (Central Processing Unit), the brain of the computer system. The processor is mainly responsible for
performing computational calculations, making logical decisions, and controlling the different activities of the system.
The CPU is a very complicated chip consisting of billions of electronic components and is fitted on the motherboard with
other electronic parts. The main work of the processor is to execute the low-level instructions loaded into memory.
Processors are manufactured in two broad forms: single-core processors and multicore processors. According to [1],
processors can be divided into three types: multiprocessors, multithreaded processors, and multicore processors.
New trends in the CPU manufacturing industry are based on the recognition that clock speeds can be increased only up
to a limit and that there is a limit to the number of electronic components that can be placed in a core. Other technologies
therefore exist to speed things up and open the way to better and more powerful central processing units [3].
When the performance of a CPU can no longer be increased by raising its operating frequency, the multicore
architecture helps: more than one core is placed on a single silicon die. This approach to enhancing speed comes with
additional benefits such as better performance, better power management, and better cooling, since the individual cores
of a multicore processor run at a lower clock speed and dissipate less heat. It also has disadvantages: existing programs
need to be rewritten for the new architecture, and if programs are not written specifically to run on parallel cores, the
advantage of multiple cores is lost. In this paper, Section II discusses the single-core processor, Section III discusses
multicore processors in detail, Section IV gives a detailed comparison of the two types of processor, and Section V
concludes the topic.
SINGLE-CORE PROCESSORS
Single-core processors have only one processing core on the die to process instructions. Most processors developed by
the different manufacturers before 2005 were single core. Today's computers use multicore processors, but single-core
processors also perform very well. Because single-core processors have been discontinued in new computers, they are
available at very low prices.
One architecture uses a single core, while the other uses two or more cores on the same die to process
instructions. Today multicore processors are the norm, but single-core processors remain important as
far as further speed-up is concerned: it is single-core processors that are put together to make a multicore
processor.
SIMD AND MIMD SYSTEMS
INTRODUCTION
Computer architecture has been defined as "the structure of a computer that a machine language programmer must
understand to write a correct program for a machine" [1]. Computer architecture can be classified into four main
categories, defined under Flynn's taxonomy, according to the number of instruction streams running in parallel and
how the data is managed. The four categories are:
1. SISD: Single Instruction, Single Data
2. SIMD: Single Instruction, Multiple Data
3. MISD: Multiple Instruction, Single Data
4. MIMD: Multiple Instruction, Multiple Data
SIMD ARCHITECTURE
In Single Instruction stream, Multiple Data stream (SIMD) processors, one instruction works on several data items
simultaneously by using several processing elements, all of which carry out the same operation, as illustrated in Fig. 2 [4].
SIMD systems comprise one of the three most commercially successful classes of parallel computers (the others being
vector supercomputers and MIMD systems). A number of factors have contributed to this success, including:
Simplicity of concept and programming
Regularity of structure
Easy scalability of size and performance
Straightforward applicability in a number of fields which demand parallelism to achieve the necessary performance.
A. Basic Principles:
There is a two-dimensional array of processing elements, each connected to its four nearest neighbors.
All processors execute the same instruction simultaneously.
Each processor incorporates local memory.
The processors are programmable, that is, they can perform a variety of functions.
Data can propagate quickly through the array. [1]
True SIMD architecture: True SIMD architectures are distinguished by their use of distributed memory or
shared memory. Both true SIMD architectures have similar implementations, as seen in Fig. 4, but differ in the
placement of the processor and memory modules. [2]
Pipelined SIMD architecture: This architecture implements the logic behind pipelining an instruction, as observed
in Fig. 7. Each processing element receives an instruction from the control unit, using a shared memory,
and performs the computation in multiple stages. The control unit provides the parallel processing elements
with instructions, while a sequential processing element handles the other instructions. [2]
TABLE 1
Comparison of SIMD and MIMD systems

Size and performance: SIMD - scalability in size and performance; MIMD - complex size and good performance.
Conditional statements: SIMD - conditional statements depend upon data local to processors; all instructions of the
then block must be broadcast, followed by all of the else block; MIMD - the multiple instruction streams allow more
efficient execution of conditional statements (e.g., if-then-else) because each processor can independently follow
either decision path.
Low synchronization overheads: SIMD - implicit in the program; MIMD - explicit data structures and operations
needed.
Low PE-to-PE communication overheads: SIMD - automatic synchronization of all "send" and "receive" operations;
MIMD - explicit synchronization and identification protocols needed.
Efficient execution of variable-time instructions: SIMD - total execution time equals the sum of the maximal
execution times through all processors; MIMD - total execution time equals the maximum execution time on a
given processor.
The purpose here is to provide an overview of recent architectural approaches to parallel systems and a comparison
between them, as described by Flynn's taxonomy: SIMD and MIMD. SIMD allows faster, multiple computations in
fields where no sacrifice can be made on time delay. An example of SIMD processing is a graphics processor performing
translation, rotation, or other operations on multiple data items. Examples of MIMD processing are supercomputers and
distributed computing systems with distributed or single shared memory.
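The data-parallel style of SIMD execution can be illustrated with a short, hedged sketch in C. The example below is
our own illustration (not taken from the text); it uses the x86 SSE intrinsics, assumes a processor with SSE support and
an array length that is a multiple of 4, and the function name add_arrays_simd is arbitrary.

#include <xmmintrin.h>

void add_arrays_simd(const float *a, const float *b, float *c, int n)
{
    /* One _mm_add_ps instruction adds four float elements at a time:
       a single instruction stream applied to multiple data items. */
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 floats from a */
        __m128 vb = _mm_loadu_ps(&b[i]);   /* load 4 floats from b */
        __m128 vc = _mm_add_ps(va, vb);    /* 4 additions in one instruction */
        _mm_storeu_ps(&c[i], vc);          /* store 4 results to c */
    }
}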
INTERCONNECTION NETWORKS
INTRODUCTION
Networking strategy was originally employed in the 1950s by the telephone industry as a means of
reducing the time required for a call to go through. Similarly, the computer industry employs networking
strategy to provide fast communication between computer subparts, particularly with regard to parallel
machines.
The performance requirements of many applications, such as weather prediction, signal processing, radar
tracking, and image processing, far exceed the capabilities of single-processor architectures. Parallel
machines break a single problem down into parallel tasks that are performed concurrently, reducing
significantly the application processing time.
Any parallel system that employs more than one processor per application program must be designed to
allow its processors to communicate efficiently; otherwise, the advantages of parallel processing may be
negated by inefficient communication. This fact emphasizes the importance of interconnection networks to
overall parallel system performance. In many proposed or existing parallel processing architectures, an
interconnection network is used to realize transportation of data between processors or between processors
and memory modules.
This chapter deals with several aspects of the networks used in modern (and theoretical) computers. After
classifying various network structures, some of the most well known networks are discussed, along with a
list of advantages and disadvantages associated with their use. Some of the elements of network design are
also explored to give the reader an understanding of the complexity of such designs.
NETWORK TOPOLOGY
Network topology refers to the layouts of links and switch boxes that establish interconnections. The links
are essentially physical wires (or channels); the switch boxes are devices that connect a set of input links to
a set of output links. There are two groups of network topologies: static and dynamic. Static networks
provide fixed connections between nodes. (A node can be a processing unit, a memory module, an I/O
module, or any combination thereof.) With a static network, links between nodes are unchangeable and
cannot be easily reconfigured. Dynamic networks provide reconfigurable connections between nodes. The
switch box is the basic component of the dynamic network. With a dynamic network the connections
between nodes are established by the setting of a set of interconnected switch boxes.
In the following sections, examples of static and dynamic networks are discussed in detail.
Static Networks
There are various types of static networks, all of which are characterized by their node degree; node degree
is the number of links (edges) connected to the node. Some well-known static networks are the shared bus, the
linear array, the ring, the binary tree, the fat tree, the shuffle-exchange network, the two-dimensional mesh (and its
wraparound and torus variants), the n-cube (hypercube), the n-dimensional mesh, and the k-ary n-cube.
In the following sections, the listed static networks are discussed in detail.
Shared bus. The shared bus, also called common bus, is the simplest type of static network. The shared
bus has a degree of 1. In a shared bus architecture, all the nodes share a common communication link, as
shown in Figure 5.1. The shared bus is the least expensive network to implement. Also, nodes (units) can
be easily added or deleted from this network. However, it requires a mechanism for handling conflict when
several nodes request the bus simultaneously. This mechanism can be achieved through a bus controller,
which gives access to the bus either on a first-come, first-served basis or through a priority scheme. (The
structure of a bus controller is explained in Chapter 6.) The shared bus has a diameter of 1 since each
node can access the other nodes through the shared bus.
Linear array. The linear array (degree of 2) has each node connected with two neighbors (except the far-end
nodes). The linear quality of this structure comes from the fact that the first and last nodes are not
connected, as illustrated in Figure 5.2. Although the linear array has a simple structure, its design can mean
long communication delays, especially between far-end nodes. This is because any data entering the
network from one end must pass through a number of nodes in order to reach the other end of the network.
A linear array, with N nodes, has a diameter of N-1.
Ring. Another networking configuration with a simple design is the ring structure. A ring network has a
degree of 2. Similar to the linear array, each node is connected to two of its neighbors, but in this case the
first and last nodes are also connected to form a ring. Figure 5.3 shows a ring network. A ring can be
unidirectional or bidirectional. In a unidirectional ring the data can travel in only one direction, clockwise
or counterclockwise. Such a ring has a diameter of N-1, like the linear array. However, a bidirectional ring,
in which data travel in both directions, reduces the diameter by a factor of 2 (or slightly less when N is even). A
bidirectional ring with N nodes has a diameter of ⌊N/2⌋. Although this ring's diameter is much better
than that of the linear array, its configuration can still cause long communication delays between distant
nodes for large N. A bidirectional ring network’s reliability, as compared to the linear array, is also
improved. If a node should fail, effectively cutting off the connection in one direction, the other direction
can be used to complete a message transmission. Once the connection is lost between any two adjacent
nodes, the ring becomes a linear array, however.
Binary tree. Figure 5.4 represents the structure of a binary tree with seven nodes. The top node is called
the root, the four nodes at the bottom are called leaf (or terminal) nodes, and the rest of the nodes are called
intermediate nodes. In such a network, each intermediate node has two children. The root has node address
1. The addresses of the children of a node are obtained by appending 0 and 1 to the node's address that is,
the children of node x are labeled 2x and 2x+1. A binary tree with N nodes has diameter 2(h-1), where
h = ⌈log2 N⌉ is the height of the tree. The binary tree has the advantages of being expandable and having a
simple implementation. Nonetheless, it can still cause long communication delays between faraway leaf
nodes. Leaf nodes farthest away from each other must ultimately pass their message through the root. Since
traffic increases as the root is approached, leaf nodes farthest away from each other will spend the most
amount of time waiting for a message to traverse the tree from source to destination.
One desirable characteristic for an interconnection network is that data can be routed between the nodes in
a simple manner (remember, a node may represent a processor). The binary tree has a simple routing
algorithm. Let a packet denote a unit of information that a node needs to send to another node. Each packet
has a header that contains routing information, such as source address and destination address. A packet is
routed upward toward the root node until it reaches a node that is either the destination or ancestor of the
destination node. If the current node is an ancestor of the destination node, the packet is routed downward
toward the destination.
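As an illustration of this routing rule, the following C sketch (our own, not part of the original text) prints the path of a
packet, using the node numbering described above in which the root is node 1 and the children of node x are 2x and
2x+1; print_path and is_ancestor are hypothetical helper names, and node addresses are assumed to be valid (>= 1).

#include <stdio.h>

/* returns 1 if node a is node d or an ancestor of node d */
static int is_ancestor(int a, int d)
{
    while (d >= a) {
        if (d == a) return 1;
        d /= 2;                             /* move d up one level toward the root */
    }
    return 0;
}

void print_path(int src, int dst)
{
    int cur = src;
    printf("%d", cur);
    /* phase 1: route upward until we reach the destination or one of its ancestors */
    while (!is_ancestor(cur, dst)) {
        cur /= 2;                           /* go to the parent node */
        printf(" -> %d", cur);
    }
    /* phase 2: route downward toward the destination */
    while (cur != dst) {
        int child = dst;
        while (child / 2 != cur)            /* find the child of cur on the path to dst */
            child /= 2;
        cur = child;
        printf(" -> %d", cur);
    }
    printf("\n");
}

For example, print_path(4, 7) prints 4 -> 2 -> 1 -> 3 -> 7: the packet climbs to the root and then descends to the
destination leaf.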
Fat tree. One problem with the binary tree is that there can be heavy traffic toward the root node.
Consider that the root node acts as the single connection point between the left and right subtrees. As can
be observed in Figure 5.4, all messages from nodes N2, N4, and N5 to nodes N3, N6, and N7 have no choice
but to pass through the root. To reduce the effect of such a problem, the fat tree was proposed by Leiserson
[LEI 85]. Fat trees are more like real trees in which the branches get thicker near the trunk. Proceeding up
from the leaf nodes of a fat tree to the root, the number of communication links increases, and therefore the
communication bandwidth increases. The communication bandwidth of an interconnection network is the
expected number of requests that can be accepted per unit of time.
The structure of the fat tree is based on a binary tree. Each edge of the binary tree corresponds to two
channels of the fat tree. One of the channels is from parent to child, and the other is from child to parent.
The number of communication links in each channel increases as we go up the tree from the leaves and is
determined by the amount of hardware available. For example, Figure 5.5 represents a fat tree in which the
number of communication links in each channel is increased by 1 from one level of the tree to the next.
The fat tree can be used to interconnect the processors of a general-purpose parallel machine. Since its
communication bandwidth can be scaled independently from the number of processors, it provides great
flexibility in design.
For example, using the shuffle function for N=8 (i.e., 2^3 nodes) the following connections can be
established between the nodes: 0→0, 1→2, 2→4, 3→6, 4→1, 5→3, 6→5, and 7→7.
The reason that the function is called shuffle is that it reflects the process of shuffling cards. Given that
there are eight cards, the shuffle function performs a perfect playing card shuffle as follows. First, the deck
is cut in half, between cards 3 and 4. Then the two half decks are merged by selecting cards from each half
in an alternative order. Figure 5.6 represents how the cards are shuffled.
Another way to define the shuffle connection is through the decimal representation of the addresses of the
nodes. Let N=2^n be the number of nodes and i represent the decimal address of a node. For
0 ≤ i ≤ N/2 - 1, node i is connected to node 2i. For N/2 ≤ i ≤ N - 1, node i is connected to node 2i+1-N.
The exchange function is also a simple bijection function. It maps a binary address to another binary
address that differs only in the rightmost bit. It can be described as
exchange(s_{n-1} s_{n-2} ... s_1 s_0) = s_{n-1} s_{n-2} ... s_1 s_0', where s_0' is the complement of s_0.
Figure 5.7 shows the shuffle-exchange connections between nodes when N = 8.
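The two functions can be written compactly in C as a sketch (our own code, assuming N = 2^n nodes and n-bit
addresses):

unsigned shuffle(unsigned addr, int n)
{
    /* cyclic left shift of the n-bit address: s(n-1)...s1 s0 -> s(n-2)...s0 s(n-1) */
    unsigned msb = (addr >> (n - 1)) & 1u;
    return ((addr << 1) | msb) & ((1u << n) - 1u);
}

unsigned exchange(unsigned addr)
{
    /* complement the rightmost bit: s(n-1)...s1 s0 -> s(n-1)...s1 s0' */
    return addr ^ 1u;
}

For n = 3, shuffle(1) = 2, shuffle(4) = 1, and shuffle(5) = 3, matching the decimal definition given above.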
The evaluation of the polynomial is done in two phases. First, each term a_i x^i is computed at node i for i=0
to 7. Then the terms a_i x^i, for i=1 to 7, are added to produce the final result.
Figure 5.9 represents the steps involved in the computation of a_i x^i. Figure 5.9a shows the initial values of
the registers of each node. The coefficient ai, for i=0 to 7, is stored in node i. The value of the variable x is
stored in each node. The mask register of node i, for i=1, 3, 5, and 7, is set to 1; others are set to 0. In each
step of computation, every node checks the content of its mask register. When the content of the mask
register is 1, the content of the coefficient register is multiplied with the content of the variable register, and
the result is stored in the coefficient register. When the content of the mask register is zero, the content of
the coefficient register remains unchanged. The content of the variable register is multiplied with itself.
The contents of the mask registers are shuffled between the nodes using the shuffle network. Figures 5.9b,
c, and d show the values of the registers after the first step, second step, and third step, respectively. At the
end of the third step, each coefficient register contains a_i x^i.
Figure 5.9 Steps for the computation of a_i x^i. (a) Initial values. (b) Values after step 1. (c) Values after step 2.
(d) Values after step 3.
At this point, the terms a_i x^i for i=0 to 7 are added to produce the final result. To perform such a summation,
exchange connections are used in addition to shuffle connections. Figure 5.10 shows all the connections
and the initial values of the coefficient registers.
Figure 5.10 Required connections for adding the terms a_i x^i.
In each step of computation the contents of the coefficient registers are shuffled between the nodes using
the shuffle connections. Then copies of the contents of the coefficient registers are exchanged between the
nodes using the exchange connections. After the exchange is performed, each node adds the content of its
coefficient register to the value that the copy of the current content is exchanged with. After three shuffle
and exchange steps, the content of each coefficient register will be the desired sum Σ_{i=0}^{7} a_i x^i. The
following shows the three steps required to obtain the result. As you can see in the chart, after the third
step, the value Σ_{i=0}^{7} a_i x^i is stored in each coefficient register.
From this example, it should be apparent that the shuffle-exchange network provides the desired
connections for manipulating the values of certain problems efficiently.
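A small simulation in C may help make the two phases concrete. The sketch below is our own code (not from the
text); it models the coefficient, variable, and mask registers of the N = 8 nodes as arrays, and the coefficient values
and x = 2 are arbitrary example inputs.

#include <stdio.h>

#define N 8
#define NBITS 3

static unsigned shuffle(unsigned i)                 /* cyclic left shift of 3 bits */
{
    return ((i << 1) | (i >> (NBITS - 1))) & (N - 1);
}

int main(void)
{
    double a[N] = {1, 2, 3, 4, 5, 6, 7, 8};         /* example coefficients a_i */
    double x = 2.0;                                  /* example value of x */
    double coeff[N], var[N], tmp[N];
    int mask[N], m2[N];

    for (int i = 0; i < N; i++) {                    /* initial register values */
        coeff[i] = a[i];
        var[i]   = x;
        mask[i]  = i & 1;                            /* nodes 1, 3, 5, 7 start with mask = 1 */
    }

    /* Phase 1: after 3 steps node i holds a_i * x^i */
    for (int step = 0; step < NBITS; step++) {
        for (int i = 0; i < N; i++) {
            if (mask[i]) coeff[i] *= var[i];         /* multiply when the mask register is 1 */
            var[i] *= var[i];                        /* the variable register squares itself */
        }
        for (int i = 0; i < N; i++) m2[shuffle(i)] = mask[i];     /* shuffle the masks */
        for (int i = 0; i < N; i++) mask[i] = m2[i];
    }

    /* Phase 2: shuffle + exchange + add; after 3 steps every node holds the sum */
    for (int step = 0; step < NBITS; step++) {
        for (int i = 0; i < N; i++) tmp[shuffle(i)] = coeff[i];   /* shuffle the coefficients */
        for (int i = 0; i < N; i++) coeff[i] = tmp[i] + tmp[i ^ 1]; /* exchange and add */
    }

    printf("polynomial value at every node: %f\n", coeff[0]);
    return 0;
}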
Two-dimensional mesh. A two-dimensional mesh consists of k1 × k0 nodes, where k_i ≥ 2 denotes the
number of nodes along dimension i. Figure 5.11 represents a two-dimensional mesh for k0=4 and k1=2.
There are four nodes along dimension 0, and two nodes along dimension 1. As shown in Figure 5.11, in a
two-dimensional mesh network each node is connected to its north, south, east, and west neighbors. In
general, a node at row i and column j is connected to the nodes at locations (i-1, j), (i+1, j), (i, j-1), and (i,
j+1). The nodes on the edge of the network have only two or three immediate neighbors.
The diameter of a mesh network is equal to the distance between nodes at opposite corners. Thus, a two-
dimensional mesh with k1*k0 nodes has a diameter (k1 -1) + (k0-1).
In practice, two-dimensional meshes with an equal number of nodes along each dimension are often used
for connecting a set of processing nodes. For this reason in most literature the notion of two-dimensional
mesh is used without indicating the values for k1 and k0; rather, the total number of nodes is defined. A two-
dimensional mesh with k1=k0=n is usually referred to as a mesh with N nodes, where N = n^2. For example,
Figure 5.12 shows a mesh with 16 nodes. From this point forward, the term mesh will indicate a two-
dimensional mesh with an equal number of nodes along each dimension.
The routing of data through a mesh can be accomplished in a straightforward manner. The following
simple routing algorithm routes a packet from source S to destination D in a mesh with n^2 nodes. First the
quantities R = ⌊D/n⌋ - ⌊S/n⌋ and C = (D mod n) - (S mod n) are computed and placed in the packet header.
The values R and C determine the number of rows and columns that the packet needs to travel. The
direction the message takes at each node is determined by the sign of the values R and C. When R (C) is
positive, the packet travels downward (right); otherwise, the packet travels upward (left). Each time that the
packet travels from one node to the adjacent node downward, the value R is decremented by 1, and when it
travels upward, R is incremented by 1. Once R becomes 0, the packet starts traveling in the horizontal
direction. Each time that the packet travels from one node to the adjacent node in the right direction, the
value C is decremented by 1, and when it travels in the left direction, C is incremented by 1. When C
becomes 0, the packet has arrived at the destination. For example, to route a packet from node 6 (i.e., S=6)
to node 12 (i.e., D= 12), the packet goes through two paths, as shown in Figure 5.13. In this example,
R = ⌊12/4⌋ - ⌊6/4⌋ = 3 - 1 = 2,
C = (12 mod 4) - (6 mod 4) = 0 - 2 = -2.
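The rule can be expressed as a short C sketch (our own illustration); route_mesh is a hypothetical function name, and
node numbers follow the row-major numbering used in the example, so route_mesh(6, 12, 4) prints the path
6 -> 10 -> 14 -> 13 -> 12.

#include <stdio.h>

void route_mesh(int S, int D, int n)      /* n = number of nodes per dimension */
{
    int R = D / n - S / n;                /* rows to travel (positive = downward) */
    int C = D % n - S % n;                /* columns to travel (positive = right) */
    int cur = S;

    printf("%d", cur);
    while (R != 0) {                      /* vertical phase */
        cur += (R > 0) ? n : -n;
        R   += (R > 0) ? -1 : 1;
        printf(" -> %d", cur);
    }
    while (C != 0) {                      /* horizontal phase */
        cur += (C > 0) ? 1 : -1;
        C   += (C > 0) ? -1 : 1;
        printf(" -> %d", cur);
    }
    printf("\n");
}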
It should be noted that in the case just described the nodes on the edge of the mesh network have no
connections to their far neighbors. When there are such connections, the network is called a wraparound
two-dimensional mesh, or an Illiac network. An Illiac network is illustrated in Figure 5.14 for N = 16.
In general, the connections of an Illiac network can be defined by the following four functions:
I_{+1}(j) = (j + 1) mod N,
I_{-1}(j) = (j - 1) mod N,
I_{+n}(j) = (j + n) mod N,
I_{-n}(j) = (j - n) mod N,
where N is the number of nodes, 0 ≤ j < N, n is the number of nodes along any dimension, and N=n^2.
For example, in Figure 5.14, node 4 is connected to nodes 5, 3, 8, and 0, since
(4+1) mod 16 = 5, (4-1) mod 16 = 3, (4+4) mod 16 = 8, and (4-4) mod 16 = 0.
The diameter of an Illiac with N=n^2 nodes is n-1, which is shorter than that of a mesh. Although the extra
wraparound connections in Illiac allow the diameter to decrease, they increase the complexity of the design.
Figure 5.15 shows the connectivity of the nodes in a different form. This graph shows that four nodes can
be reached from any node in one step, seven nodes in two steps, and four nodes in three steps. In general,
the number of steps (recirculations) to route data from a node to any other node is upper bounded by the
diameter (i.e., n – 1).
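The four connection functions translate directly into C; the sketch below is our own and uses hypothetical function
names.

int illiac_plus1 (int j, int N)        { return (j + 1) % N; }
int illiac_minus1(int j, int N)        { return (j - 1 + N) % N; }
int illiac_plusn (int j, int n, int N) { return (j + n) % N; }
int illiac_minusn(int j, int n, int N) { return (j - n + N) % N; }

For j = 4, n = 4, and N = 16, the four functions return 5, 3, 8, and 0, as in Figure 5.14.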
To reduce the diameter of a mesh network, another variation of this network, called torus (or two-
dimensional tours), has also been proposed. As shown in Figure 5.16a, a torus is a combination of ring and
mesh networks. To make the wire length between the adjacent nodes equal, the torus may be folded as
shown in Figure 5.16b. In this way the communication delay between the adjacent nodes becomes equal.
Note that both Figures 5.16a and b provide the same connections between the nodes; in fact, Figure 5.16b is
derived from Figure 5.16a by switching the position of the rightmost two columns and the bottom two rows
of nodes. The diameter of a torus with N=n^2 nodes is 2⌊n/2⌋, which is the distance between the corner
and the center node. Note that the diameter is further decreased from the mesh network.
Figure 5.16 Different types of torus network. (a) A 4-by-4 torus network. (b) A 4-by-4 torus network with
folded connection.
The mesh network provides suitable interconnection patterns for problems whose solutions require the
computation of a set of values on a grid of points, for which the value at each point is determined based on
the values of the neighboring points. Here we consider one of these class of problems: the problem of
finding a steady-state temperature over the surface of a square slab of material whose four edges are held at
different temperatures. This problem requires the solution of the following partial differential equation,
known as Laplace's equation:
∂²U/∂x² + ∂²U/∂y² = 0,
where U is the temperature at a given point specified by the coordinates x and y on the slab.
The following describes a method, given by Slotnick [SLO 71], to solve this problem. Even if unfamiliar
with Laplace's equation, the reader should still be able to follow the description. The method is based on
the fact that the temperature at any point on the slab tends to become the average of the temperatures of
neighboring points.
Assume that the slab is covered with a mesh and that each square of the mesh has h units on each side.
Then the temperature of an interior node at coordinates x and y is the average of the temperatures of the
four neighbor nodes. That is, the temperature at node (x, y), denoted as U(x, y), equals the sum of the four
neighboring temperatures divided by 4. For example, as shown in Figure 5.17, assume that the slab can be
covered with a 16-node mesh. Here the value of U(x, y) is expressed as
U(x,y)=[U(x,y+h) + U(x+h,y) + U(x,y-h) + U(x-h,y)]/4.
Figure 5.18 illustrates an alternative representation of Figure 5.17. Here the position of the nodes is more
conveniently indicated by the integers i and j. In this case, the temperature equation can be expressed as
U(i,j)=[U(i,j+1) + U(i+1,j) + U(i,j-1) + U(i-1,j)]/4.
Assume that each node represents a processor having one register to hold the node's temperature. The nodes
on the boundary are arbitrarily held at certain fixed temperatures. Let the nodes on the bottom of the mesh
and on the right edge be held at zero degrees. The nodes along the top and left edges are set according to
their positions. The temperatures of these 12 boundary nodes do not change during the computation. The
temperatures at the 4 interior nodes are the unknowns. Initially, the temperatures at these 4 nodes are set to
zero. In the first iteration of computation, the 4 interior node processors simultaneously calculate the new
temperature values using the values initially given.
Figure 5.18 Initial values of the nodes.
Figure 5.19 represents the new values of the interior nodes after the first iteration. These values are
calculated as follows:
U(1,2)=[U(1,3)+U(2,2)+U(1,1)+U(0,2)]/4 = [8+0+0+8]/4 = 4;
U(2,2)=[U(2,3)+U(3,2)+U(2,1)+U(1,2)]/4 = [4+0+0+0]/4 = 1;
U(1,1)=[U(1,2)+U(2,1)+U(1,0)+U(0,1)]/4 = [0+0+0+4]/4 = 1;
U(2,1)=[U(2,2)+U(3,1)+U(2,0)+U(1,1)]/4 = [0+0+0+0]/4 = 0.
In the second iteration, the values of U(1,2), U(2,2), U(1,1), and U(2,1) are calculated using the new values
just obtained:
U(1,2) = [8+1+1+8]/4 = 4.5;
U(2,2) = [4+0+0+4]/4 = 2;
U(1,1) = [4+0+0+4]/4 = 2;
U(2,1) = [1+0+0+1]/4 = 0.5.
This process continues until a steady-state solution is obtained. As more iterations are performed, the
values of the interior nodes converge to the exact solution. When values for two successive iterations are
close to each other (within a specified error tolerance), the process can be stopped, and it can be said that a
steady-state solution has been reached. Figure 5.20 represents a solution obtained after 11 iterations.
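The iteration can be sketched in C as follows. The code is our own, runs the updates sequentially for clarity (on the
mesh, each interior node would compute its new value in parallel), and uses the boundary temperatures that appear in
the example; the convergence tolerance of 10^-3 is an assumption.

#include <stdio.h>
#include <math.h>

#define SIZE 4

int main(void)
{
    /* U[i][j]: i = column (0..3), j = row (0..3); the interior nodes are (1,1), (1,2), (2,1), (2,2) */
    double U[SIZE][SIZE] = {0};                    /* bottom and right edges held at zero */
    double V[SIZE][SIZE];

    U[0][1] = 4.0; U[0][2] = 8.0;                  /* left edge values from the example */
    U[1][3] = 8.0; U[2][3] = 4.0;                  /* top edge values from the example */

    for (int iter = 1; ; iter++) {
        double diff = 0.0;
        for (int i = 0; i < SIZE; i++)
            for (int j = 0; j < SIZE; j++)
                V[i][j] = U[i][j];                 /* boundary nodes never change */
        for (int i = 1; i < SIZE - 1; i++)         /* update only the interior nodes */
            for (int j = 1; j < SIZE - 1; j++) {
                V[i][j] = (U[i][j+1] + U[i+1][j] + U[i][j-1] + U[i-1][j]) / 4.0;
                diff = fmax(diff, fabs(V[i][j] - U[i][j]));
            }
        for (int i = 0; i < SIZE; i++)
            for (int j = 0; j < SIZE; j++)
                U[i][j] = V[i][j];
        if (diff < 1e-3) {                         /* steady state reached within the tolerance */
            printf("converged after %d iterations\n", iter);
            break;
        }
    }
    printf("U(1,2)=%.3f U(2,2)=%.3f U(1,1)=%.3f U(2,1)=%.3f\n",
           U[1][2], U[2][2], U[1][1], U[2][1]);
    return 0;
}

The first pass of this loop reproduces the values 4, 1, 1, and 0 computed above for the first iteration.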
n-cube or hypercube. An n-cube network, also called hypercube, consists of N=2^n nodes; n is called the
dimension of the n-cube network. When the node addresses are considered as the corners of an n-
dimensional cube, the network connects each node to its n neighbors. In an n-cube, individual nodes are
uniquely identified by n-bit addresses ranging from 0 to N-1. Given a node with binary address d, this
node is connected to all nodes whose binary addresses differ from d in exactly 1 bit. For example, in a 3-
cube, in which there are eight nodes, node 7 (111) is connected to nodes 6 (110), 5 (101), and 3 (011).
Figure 5.21 demonstrates all the connections between the nodes.
As can be seen in the 3-cube, two nodes are directly connected if their binary addresses differ by 1 bit.
This method of connection is used to control the routing of data through the network in a simple manner.
The following simple routing algorithm routes a packet from its source S = (sn-1 . . . s0) to destination D =
(dn-1 . . . d0).
1. Tag T = S ⊕ D = t_{n-1} . . . t_0 is added to the packet header at the source node (⊕ denotes the
bitwise XOR operation).
2. If t_i ≠ 0 for some 0 ≤ i ≤ n-1, then use the ith-dimension link to send the packet to a new node
with the same address as the current node except the ith bit, and change t_i to 0 in the packet
header.
3. Repeat step 2 until t_i = 0 for all 0 ≤ i ≤ n-1.
For example, as shown in Figure 5.22, to route a packet from node 0 to node 5, the packet could go through
two different paths, P1 and P2. Here T = 000 ⊕ 101 = 101. If we first consider the bit t0 and then t2, the
packet goes through the path P1. Since t0 =1, the packet is sent through the 0th-dimension link to node 1.
At node 1, t0 is set to 0; thus T now becomes equal to 100. Now, since t2=1, the packet is sent through the
second-dimension link to node 5. If, instead of t0, bit t2 is considered first, the packet goes through P2.
Figure 5.22 Different paths for routing a packet from node 0 to node 5.
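The routing algorithm above amounts to flipping, one at a time, the bits in which the source and destination addresses
differ. A short C sketch (our own, with a hypothetical function name) is:

#include <stdio.h>

void route_hypercube(unsigned S, unsigned D, int n)   /* n-cube, nodes 0 .. 2^n - 1 */
{
    unsigned T = S ^ D;                 /* tag carried in the packet header */
    unsigned cur = S;

    printf("%u", cur);
    for (int i = 0; i < n; i++) {       /* resolve bit t0 first, then t1, and so on */
        if (T & (1u << i)) {
            cur ^= (1u << i);           /* traverse the i-th dimension link */
            T  &= ~(1u << i);           /* clear t_i in the header */
            printf(" -> %u", cur);
        }
    }
    printf("\n");                       /* here T = 0 and cur equals D */
}

Calling route_hypercube(0, 5, 3) resolves t0 first and prints the path 0 -> 1 -> 5, which is path P1 of Figure 5.22.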
In the network of Figure 5.22, the maximum distance between nodes is 3. This is because the distance
between nodes is equal to the number of bit positions in which their binary addresses differ. Since each
address consists of 3 bits, the difference between two addresses can be at most 3 when every bit at the same
position differs. In general, in an n-cube the maximum distance between nodes is n, making the diameter
equal to n.
The n-cube network has several features that make it very attractive for parallel computation. It appears the
same from every node, and no node needs special treatment. It also provides n disjoint paths between a
source and a destination. Let the source be represented as S = (sn-1sn-2 . . . s0) and the destination by D =
(dn-1dn-2 . . . d0). The shortest paths can be symbolically represented as
For example, consider the 3-cube of Figure 5.21. Since n=3, there are three paths from a source, say 000, to
a destination, say 111. The paths are
000 → 001 → 011 → 111,
000 → 010 → 110 → 111, and
000 → 100 → 101 → 111.
This ability to have n alternative paths between any two nodes makes the n-cube network highly reliable if
any one (or more) paths become unusable.
Different networks, such as two-dimensional meshes and trees, can be embedded in an n-cube in such a
way that the connectivity between neighboring nodes remains consistent with their definition. Figure 5.23
shows how a 4-by-4 mesh can be embedded in a 4-cube (four-dimensional hypercube). The 4-cube’s
integrity is not compromised and is well-suited for uses like this, where a great deal of flexibility is
required. All definitional considerations for both the 4-cube and the 4-by-4 mesh, as stated earlier, are
consistent.
Figure 5.23 Embedding a 4-by-4 mesh in a 4-cube.
The interconnection supported by the n-cube provides a natural environment for implementing highly
parallel algorithms, such as sorting, merging, fast Fourier transform (FFT), and matrix operations. For
example, Batcher's bitonic merge algorithm can easily be implemented on an n-cube. This algorithm sorts
a bitonic sequence (a bitonic sequence is a sequence of nondecreasing numbers followed by a sequence of
nonincreasing numbers). Figure 5.24 presents the steps involved in merging a nondecreasing sequence
[0,4,6,9] and a nonincreasing sequence [8,5,3,1]. This algorithm performs a sequence of comparisons on
pairs of data that are successively 2^2, 2^1, and 2^0 locations apart.
At each stage of the merge each pair of data elements is compared and switched if they are not in ascending
order. This rearranging continues until the final merge with a distance of 1 puts the data into ascending
order.
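The compare-and-swap pattern of Batcher's bitonic merge can be sketched in C as follows (our own code, not from the
text); the input is the bitonic sequence of Figure 5.24.

#include <stdio.h>

void bitonic_merge(int a[], int n_items)        /* n_items must be a power of 2 */
{
    for (int dist = n_items / 2; dist >= 1; dist /= 2)
        for (int i = 0; i < n_items; i++)
            if ((i & dist) == 0 && a[i] > a[i + dist]) {   /* compare pair (i, i+dist) */
                int t = a[i]; a[i] = a[i + dist]; a[i + dist] = t;
            }
}

int main(void)
{
    /* bitonic sequence: nondecreasing [0,4,6,9] followed by nonincreasing [8,5,3,1] */
    int a[8] = {0, 4, 6, 9, 8, 5, 3, 1};
    bitonic_merge(a, 8);
    for (int i = 0; i < 8; i++) printf("%d ", a[i]);       /* prints 0 1 3 4 5 6 8 9 */
    printf("\n");
    return 0;
}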
In general, the n-cube provides the necessary connections for ascending and descending classes of parallel
algorithms. To define each of these classes, assume that there are 2^n input data items stored in 2^n locations
(or processors) 0, 1, 2, ..., 2^n - 1. An algorithm is said to be in the descending class if it performs a sequence
of basic operations on pairs of data that are successively 2^(n-1), 2^(n-2), ..., and 2^0 locations apart. (Therefore,
Batcher's algorithm belongs to this class.) In comparison, an ascending algorithm performs successively on
pairs that are 2^0, 2^1, ..., and 2^(n-1) locations apart. When n=3, Figures 5.25 and 5.26 show the required
connections for each stage of operation in this class of algorithms. As shown, the n-cube is able to
efficiently implement algorithms in descending or ascending classes.
Although the n-cube can implement this class of algorithms in n parallel steps, it requires n connections for
each node, which makes the design and expansion difficult. In other words, the n-cube provides poor
scalability and has an inefficient structure for packaging and therefore does not facilitate the increasingly
important property of modular design.
n-Dimensional mesh. An n-dimensional mesh consists of k_{n-1} × k_{n-2} × ... × k_0 nodes, where k_i ≥ 2 denotes the
number of nodes along dimension i. Each node X is identified by n coordinates x_{n-1}, x_{n-2}, ..., x_0, where
0 ≤ x_i ≤ k_i - 1 for 0 ≤ i ≤ n-1. Two nodes X=(x_{n-1}, x_{n-2}, ..., x_0) and Y=(y_{n-1}, y_{n-2}, ..., y_0) are said to be
neighbors if and only if y_i = x_i for all i, 0 ≤ i ≤ n-1, except one, j, where y_j = x_j + 1 or y_j = x_j - 1. That is, a node
may have from n to 2n neighbors, depending on its location in the mesh. The corners of the mesh have n neighbors, and
the internal nodes have 2n neighbors, while other nodes have nb neighbors, where n < nb < 2n. The diameter of
an n-dimensional mesh is Σ_{i=0}^{n-1} (k_i - 1). An n-cube is a special case of n-dimensional meshes; it is in fact an n-
dimensional mesh in which k_i = 2 for 0 ≤ i ≤ n-1. Figure 5.27 represents the structure of two three-
dimensional meshes: one for k2 = k1 = k0 = 3 and the other for k2=4, k1=3, and k0=2.
Figure 5.27 Three-dimensional meshes: (a) k2 = k1 = k0 = 3; (b) k2 = 4, k1 = 3, k0 = 2.
k-Ary n-cube. A k-ary n-cube consists of k^n nodes such that there are k nodes along each dimension.
Each node X is identified by n coordinates x_{n-1}, x_{n-2}, ..., x_0, where 0 ≤ x_i ≤ k-1 for 0 ≤ i ≤ n-1. Two nodes
X=(x_{n-1}, x_{n-2}, ..., x_0) and Y=(y_{n-1}, y_{n-2}, ..., y_0) are said to be neighbors if and only if y_i = x_i for all i,
0 ≤ i ≤ n-1, except one, j, where y_j = (x_j + 1) mod k, or y_j = (x_j - 1) mod k. That is, in contrast to the n-dimensional
mesh, a k-ary n-cube has a symmetrical topology in which each node has an equal number of neighbors. A node has n
neighbors when k=2 and 2n neighbors when k>2. The k-ary n-cube has a diameter of n⌊k/2⌋. An n-cube
is a special case of k-ary n-cubes; it is in fact a 2-ary n-cube. Figure 5.28 represents the structure of two k-
ary n-cubes: one for k=4, n=2 and the other for k=n=3. Note that a 4-ary 2-cube is actually a torus network.
Figure 5.28 (a) 4-Ary 2-cube and (b) 3-ary 3-cube networks.
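For illustration, the neighbor rule of the k-ary n-cube can be written as a small C sketch (our own code;
print_neighbors is a hypothetical name). Note that for k = 2 the +1 and -1 directions give the same neighbor, which
is why a node then has only n distinct neighbors.

#include <stdio.h>

void print_neighbors(const int x[], int n, int k)
{
    for (int j = 0; j < n; j++)                      /* vary one dimension at a time */
        for (int d = -1; d <= 1; d += 2) {           /* decrement and increment modulo k */
            printf("(");
            for (int i = n - 1; i >= 0; i--) {
                int xi = (i == j) ? (x[i] + d + k) % k : x[i];
                printf("%d%s", xi, i ? "," : "");
            }
            printf(") ");
        }
    printf("\n");
}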
SYMMETRIC AND DISTRIBUTED SHARED MEMORY ARCHITECTURES
Introduction
Parallel and distributed processing have not lost their allure since their inception in the 1960s;
the allure lies in their ability to meet a wide range of price and performance requirements.
However, in many cases these advantages were not realized due to longer design times,
limited scalability, lack of OS and programming support, and the ever increasing
performance/cost ratio of uniprocessors [Bell 99]. Historically, parallel processing
systems were classified as SIMD (Single Instruction Multiple Data) or MIMD (Multiple
Instructions Multiple Data). SIMD systems involved the use of a single control processor
and a number of arithmetic processors. The array of arithmetic units executed identical
instruction streams, but on different data items, in a lock-step fashion under the control of
the control unit. Such systems were deemed to be ideal for data-parallel applications.
Their appeal waned quickly since very few applications could garner the performance to
justify their cost. Multicomputers or multiprocessors fall under the MIMD classification,
relying on independent processors. They were able to address a broader range of
applications than SIMD systems. One of the earliest MIMD systems was the C.mmp built
at CMU that included 16 modified PDP-11/20 processors connected to 16 memory
modules via a crossbar switch. This can be viewed as a Symmetric Multiprocessor (SMP)
or a shared memory system. The next version of a multiprocessor system at CMU was
known as Cm* and can be deemed as the first hardware implemented distributed shared
memory system. It consisted of a hierarchy of processing nodes; LSI-11 processors
comprising clusters where processors of a single cluster were connected by a bus and
clusters were connected by inter-cluster connections using specialized controllers to
handle accesses to remote memory.
The next wave of multiprocessors relied on distributed memory, where processing nodes
have access only to their local memory, and access to remote data was accomplished by
request and reply messages. Numerous designs on how to interconnect the processing
nodes and memory modules were published in the literature. Examples of such message-
based systems included the Intel Paragon, nCUBE, and IBM SP systems. As compared to shared
memory systems, distributed memory (or message passing) systems can accommodate a
larger number of computing nodes. This scalability was expected to increase the
utilization of message-passing architectures.
The trend even in commercial parallel processing applications has been leaning towards
the use of small clusters of SMP systems, often interconnected to address the needs of
complex problems requiring the use of a large number of processing nodes. Even when
working with networked resources, programmers are relying on messaging standards
such as MPI (and PVM) or relying on systems software to automatically generate
message passing code from user defined shared memory programs. The reliance on
software support to provide a shared memory programming model (i.e., distributed
shared memory systems, DSMs) can be viewed as a logical evolution in parallel
processing. Distributed Shared Memory (DSM) systems aim to unify parallel processing
systems that rely on message passing with the shared memory systems. The use of
distributed memory systems as (logically) shared memory systems addresses the major
limitation of SMP’s; namely scalability.
Programming Example
In order to appreciate the differences between the shared memory and message passing
paradigms, consider the following code segments to compute inner products. The first
program was written using Pthreads (on shared memory), while the second using MPI
(for message passing systems). Both assume a master process and multiple worker
processes, and each worker process is allocated equal amounts of work by the master.
There are two major differences between the two implementations, related to how the work is
distributed and how the worker processes access the needed data. The Pthread version
shows that each worker process is given the address of the data they need for their work.
In the MPI version, the actual data is sent to the worker processors. The worker processes
of the Pthread version access the needed data directly as if the data is local. It can also be
seen that the worker processors directly accumulate their partial results in a single global
variable (using mutual exclusion). The worker processes of the MPI program are supplied
the actual data via messages, and they send their partial results back to the master for the
purpose of accumulation.
pthread_mutex_lock(&lock); /* lock */
sum += partial; /* accumulate the partial sums */
pthread_mutex_unlock(&lock); /* unlock */
return NULL;
}
for (i = 0; i < N_PROC; i++) { /* wait for all worker processes to complete */
status = pthread_join(thread[i], NULL);
if (status) exit(1);
}
#define MASTER 0
#define FROM_MASTER 1
#define TO_MASTER 2
return partial;
}
mtype = FROM_MASTER;
offset = 0;
for (i = 1; i < N_PROC; i++) { /* send messages to workers */
MPI_Send(&x[offset], SZ_WORK, MPI_DOUBLE, i, mtype,
MPI_COMM_WORLD);
MPI_Send(&y[offset], SZ_WORK, MPI_DOUBLE, i, mtype,
MPI_COMM_WORLD);
offset += SZ_WORK;
}
mtype = TO_MASTER;
for (i = 1; i < N_PROC; i++) { /* receive messages from workers */
MPI_Recv(&partial, 1, MPI_DOUBLE, i, mtype,
MPI_COMM_WORLD, &status);
sum += partial;
}
printf("Inner product is %f\n", sum);
}
else { /* worker process */
mtype = FROM_MASTER; /* receive a message from master */
MPI_Recv(x_partial, SZ_WORK, MPI_DOUBLE, MASTER, mtype,
MPI_COMM_WORLD, &status);
MPI_Recv(y_partial, SZ_WORK, MPI_DOUBLE, MASTER, mtype,
MPI_COMM_WORLD, &status);
partial = worker(SZ_WORK, x_partial, y_partial);
mtype = TO_MASTER;
MPI_Send(&partial, 1, MPI_DOUBLE, MASTER, mtype,
MPI_COMM_WORLD);
} /* send result back to master */
MPI_Finalize();
}
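The worker routine of the Pthreads version is not shown above. A minimal sketch of what such a routine might look
like is given below; the structure, the SZ_WORK value, and the variable names are our assumptions, not the original
code.

#include <pthread.h>

#define SZ_WORK 1000                     /* elements handled by each worker (assumption) */

double sum = 0.0;                        /* global accumulator shared by all workers */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;   /* protects sum */

struct work { double *x; double *y; };   /* addresses of the worker's slices of x and y */

void *worker(void *arg)
{
    struct work *w = (struct work *)arg;
    double partial = 0.0;

    for (int i = 0; i < SZ_WORK; i++)    /* access the shared data directly, as if local */
        partial += w->x[i] * w->y[i];

    pthread_mutex_lock(&lock);           /* lock */
    sum += partial;                      /* accumulate the partial sums */
    pthread_mutex_unlock(&lock);         /* unlock */
    return NULL;
}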
The granularity of the data that is shared and moved across memory hierarchies is another
design consideration. The granularity can be based on objects without semantic meaning,
based purely on a sequence of bytes (e.g., a memory word, a cache block, a page) or it
can be based on objects with semantic basis (e.g., variables, data structures or objects in
the sense of object-oriented programming model). Hardware solutions often use finer
grained objects (often without semantic meaning) while software implementations rely on
coarser grained objects.
In this section we will concentrate on three main issues in the design of hardware- or software-
based distributed shared memory systems. They are related to data coherence, memory
consistency and synchronization.
Data Coherency.
Cache Coherency.
The inconsistency that exists between main memory and write-back caches does not
cause any problems in uniprocessor systems. But techniques are needed to ensure that
consistent data is available to all processors in a multiprocessor system. Cache coherency
can be maintained either by hardware techniques or software techniques. We will first
introduce hardware solutions.
Snoopy Protocols .
These protocols are applicable to small-scale multiprocessor systems where the
processors are connected to memory via a common bus, making the shared memory
equally accessible to all processors (also known as Symmetric Multiprocessor systems,
SMP, or Uniform Memory Access systems, UMA). In addition to the shared memory, each
processor contains a local cache memory (or multi-level caches). Since all processors and
their cache memories (or the controller hardware) are connected to a common bus, the
cache memories can snoop on the bus for maintaining coherent data. Each cache line is
associated with a state, and the cache controller will modify the states to track changes to
cache lines made either locally or remotely. A hit on a read implies that the cache data is
consistent with that in main memory and copies that may exist in other processors'
caches. A read miss leads to a request for the data. This request can be satisfied by either
the main memory (if no other cache has a copy of the data), or by another cache which
has a (possibly newer) copy of the data. Initially, when only one cache has a copy, the
cache line is set to Exclusive state. However, when other caches request a read copy,
the state of the cache line (in all processors) is set to Shared.
Consider what happens when a processor attempts to write to a (local) cache line. On a
hit, if the state of the local cache line is Exclusive (or Modified), the write can proceed
without any delay, and the state is changed to Modified. This is because the Exclusive or
Modified state guarantees that no copies of the data exist in other caches. If
the local state is Shared (which implies the existence of copies of the data item in other
processors) then an invalidation signal must be broadcast on the common bus, so that all
other caches will set their cache lines to Invalid state. Following the invalidation, the
write can be completed in local cache, changing the state to Modified.
On a write miss, a request is placed on the common bus. If no other cache contains a copy,
the data comes from memory, the write can be completed by the processor, and the cache
line is set to Modified. If other caches have the requested data in Shared state, the copies
are invalidated and the write can complete with a single Modified copy. If a different
processor has a Modified copy, the data is written back to main memory and the
processor invalidates its copy. The write can now be completed, leading to a Modified
line at the requesting processor. Such snoopy protocols are sometimes called MESI,
standing for the names of states associated with cache lines: Modified, Exclusive, Shared
or Invalid. Many variations of the MESI protocol have been reported [Baer]. In general
the performance of a cache coherency protocol depends on the amount of sharing (i.e.,
number of shared cache blocks), number of copies, number of writers and granularity of
sharing.
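As an illustration, the state changes described above can be condensed into a small C sketch. This is our own
simplification, not an actual controller implementation; bus transactions, write-backs, and the invalidation broadcast
required before writing to a Shared line are only hinted at in the comments.

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;
typedef enum { LOCAL_READ, LOCAL_WRITE, BUS_READ, BUS_WRITE } event_t;

mesi_t mesi_next(mesi_t s, event_t e, int other_copies_exist)
{
    switch (e) {
    case LOCAL_READ:                      /* read hit or read miss */
        if (s == INVALID)                 /* miss: data supplied by memory or another cache */
            return other_copies_exist ? SHARED : EXCLUSIVE;
        return s;                         /* hits do not change the state */
    case LOCAL_WRITE:                     /* write: a Shared line first broadcasts an invalidation */
        return MODIFIED;
    case BUS_READ:                        /* another cache requests a read copy */
        if (s == MODIFIED || s == EXCLUSIVE)
            return SHARED;                /* a Modified line is written back before sharing */
        return s;
    case BUS_WRITE:                       /* invalidation observed on the bus */
        return INVALID;
    }
    return s;
}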
Instead of invalidating shared copies on a write, it may be possible to provide updated
copies. It may be possible with appropriate hardware to detect when a cache line is no
longer shared by other processors, eliminating update messages. The major trade-off
between update and invalidate techniques lies in the amount of bus traffic resulting from
the update messages that include data as compared to the cache misses subsequent to
invalidation messages. Update protocols are better suited for applications with a single
writer and multiple readers, while invalidation protocols are favored when multiple
writers exist.
Directory Protocols
Snoopy protocols rely on the ability to listen to and broadcast invalidations on a common
bus. However, the common bus places a limit on the number of processing nodes in a
SMP system. Large scale multiprocessor and distributed systems must use more complex
interconnection mechanisms, such as multiple buses, N-dimensional grids, Crossbar
switches and multistage interconnection networks. New techniques are needed to assure
that invalidation messages are received (and acknowledged) by all caches with copies of
the shared data. This is normally achieved by keeping a directory with main memory
units. There exists one directory entry corresponding to each cache block, and the entry
keeps track of shared copies, or the identification of the processor that contains modified
data. On a read miss, a processor requests the memory unit for data. The request may go
to a remote memory unit depending on the address. If the data is not modified, a copy is
sent to the requesting cache, and the directory entry is modified to reflect the existence of
a shared copy. If a modified copy exists at another cache, the new data is written back to
the memory, a copy of the data is provided to the requesting cache, and the directory is
marked to reflect the existence of two shared copies.
In order to handle writes, it is necessary to maintain state information with each cache
block at local caches, somewhat similar to the Snoopy protocols. On a write hit, the write
can proceed immediately if the state of the cache line is Modified. Otherwise (the state is
Shared), Invalidation message is communicated to the memory unit, which in turn sends
invalidation signals to all caches with shared copies (and receive acknowledgements).
Only after the completion of this process can the processor proceed with a write. The
directory is marked to reflect the existence of a modified copy. A write miss is handled as
a combination of read-miss and write-hit.
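A sketch of a full-map (p+1-bit) directory entry and of the write handling described above is given below. The code is
our own illustration; send_invalidate and wait_for_acks are hypothetical placeholders for interconnect operations,
and P = 32 processors is an assumption.

#include <stdint.h>

#define P 32                              /* number of processors (assumption) */

struct dir_entry {
    uint32_t sharers;                     /* p presence bits, one per processor */
    int      dirty;                       /* the "+1" bit: a modified copy exists */
    int      owner;                       /* valid only when dirty is set */
};

static void send_invalidate(int proc, long block) { (void)proc; (void)block; /* placeholder message */ }
static void wait_for_acks(int count)              { (void)count; /* placeholder: wait for acks */ }

/* processor 'writer' wants to write a block it currently holds in Shared state */
void directory_write(struct dir_entry *d, long block, int writer)
{
    int acks = 0;
    for (int p = 0; p < P; p++)
        if ((d->sharers & (1u << p)) && p != writer) {
            send_invalidate(p, block);    /* invalidate every other shared copy */
            acks++;
        }
    wait_for_acks(acks);                  /* the write may proceed only after all acknowledgements */
    d->sharers = 1u << writer;            /* only the writer keeps a copy */
    d->dirty   = 1;
    d->owner   = writer;
}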
Notice that in the approach outlined here, the directory associated with each memory unit
is responsible for tracking the shared copies and for sending invalidation signals. This is
sometimes known as p+1 directory to reflect the fact that each directory entry may need
p+1 bits to track the existence of up to p read copies and one write copy. The memory
requirements imposed by such directory methods can be alleviated by allowing fewer
copies (less than p), and the copies are tracked using "pointers" to the processors
containing copies. Whether using p+1 bits or a fixed number of pointers, copies of data
items are maintained by (centralized) directories associated with memory modules. We can
consider distributing each directory entry as follows. On the first request for data, the
memory unit (or the home memory unit) supplies the requested data, and marks the
directory with a "pointer" to the requesting processor. Future read requests will be
forwarded to the processor which has the copy, and the requesting processors are linked
together. In other words, the processors with copies of the data are thus linked, and track
all shared copies. On a write request, an invalidation signal is sent along the linked list to
all shared copies. The home memory unit can wait until invalidations are acknowledged
before permitting the writer to proceed. The home memory unit can also send the
identification of the writer so that acknowledgements to invalidations can be sent directly
to the writer. The Scalable Coherence Interface (SCI) standard uses a doubly linked list of
shared copies. This permits a processor to remove itself from the linked list when it no
longer contains a copy of the shared cache line.
Numerous variations have been proposed and implemented to improve the performance
of the directory based protocols. Hybrid techniques that combine Snoopy protocols with
Directory based protocols have also been investigated in Stanford DASH system. Such
systems can be viewed as networks of clusters, where each cluster relies on bus snooping
and use directories across clusters (see Cluster Computing).
The performance of directory-based techniques depends on the number of shared blocks,
the number of copies of individual shared blocks, whether multicasting is available, and the
number of writers. The amount of memory needed for directories depends on the granularity
of sharing, the number of processors (in the p+1 directory), and the number of shared copies
(in pointer-based methods).
Using large cache blocks can reduce certain types of overheads in maintaining coherence
as well as reduce the overall cache miss rates. However, larger cache blocks will increase
the possibility of false sharing. False sharing refers to the situation in which two or more
processors that do not really share any specific memory address nevertheless appear
to share a cache line, because the variables (or addresses) accessed by the different
processors fall into the same cache line. Compile-time analysis can detect and eliminate
unnecessary invalidations in some false-sharing cases.
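The following C sketch (our own example) shows the classic false-sharing pattern: two threads increment two distinct
counters that happen to share a cache line, so the line bounces between the two caches even though no data is
logically shared. The commented-out padded declaration shows one common fix; the 64-byte line size is an
assumption.

#include <pthread.h>
#include <stdio.h>

struct { long a; long b; } counters;           /* a and b are likely to share one cache line */
/* struct { long a; char pad[64 - sizeof(long)]; long b; } counters;   <- padded alternative */

void *bump_a(void *arg) { (void)arg; for (long i = 0; i < 10000000; i++) counters.a++; return NULL; }
void *bump_b(void *arg) { (void)arg; for (long i = 0; i < 10000000; i++) counters.b++; return NULL; }

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);   /* the two threads update different variables */
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a=%ld b=%ld\n", counters.a, counters.b);
    return 0;
}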
Software can also help in improving the performance of hardware based coherency
techniques described above. It is possible to detect when a processor no longer accesses a
cache line (or variable), and “self-invalidation” can be used to eliminate unnecessary
invalidation signals. In the simplest method (known as Indiscriminate Invalidation),
consider an indexed variable X being modified inside a loop. If we do not know how the
loop iterations will be allocated to processors, we may require each processor to read the
variable X at the start of each iteration and flush the variable back to memory (that is,
invalidate it) at the end of the loop iteration. However, this is unnecessary since not all values of
X are accessed in each iteration, and it is also possible that several contiguous iterations
may be assigned to the same processor.
In Selective invalidation technique, if static analysis reveals that a specific variable may
be modified in a loop iteration, the variable will be marked with a "Change Bit". This
implies that at the end of the loop iteration, the variable may have been modified by some
processor. The processor which actually modifies the variable resets the Change Bit since
the processor already has an updated copy. All other processors will invalidate their
copies of the variable.
A more complex technique involves the use of Version numbers with individual
variables. Each variable is associated with a Current Version Number (CVN). If static
analysis determines that a variable may be modified (by any processor), all processors
will be required to increment the CVN associated with the variable. In addition, when a
processor acquires a new copy of a variable, the variable will be associated with a Birth
Version Number (BVN) which is set to the current CVN. When a processor actually
modifies a variable, the processor will set the BVN to the CVN+1. If the BVN of a
variable in a processor is greater than the CVN of that variable, the processor has the
updated value; otherwise, the processor invalidates its copy of the variable.
Migration of processes or threads from one node to another can lead to poor cache
performance, since the migration can cause "false" sharing: the original node where the
thread resided may falsely assume that cache lines are shared with the new node to where
the thread migrated. Some software techniques to selectively invalidate cache lines when
threads migrate have been proposed.
Software aided prefetching of cache lines is often used to reduce cache misses. In shared
memory systems, prefetching may actually increase misses, unless it is possible to predict
if a prefetched cache line will be invalidated before its use.
Memory Consistency
While data coherency (or cache coherency) techniques aim at assuring that copies of
individual data items (or cache blocks) will be up to date (or that copies will be invalid), a
consistent memory implies that the view of the entire shared memory presented to all
processors will be identical. This requirement can also be stated in terms of the order in
which operations performed on shared memory will be made visible to individual
processors in a multiprocessor system. In order to understand the relationship between
the ordering of memory operations and memory consistency, consider the following
example with two processors P1 and P2. P1 performs a write to a shared variable X
(operation-1) followed by a read of variable Y (operation-2); P2 performs a read of
variable X (operation-3) followed by a write to shared variable Y (operation-4). For each
processor, we can potentially consider 4! different orders for the four operations.
However, we expect that the order in which each processor executes the operations (i.e.,
program order) be preserved. This requires that operation-1 always be executed before
operation-2, and operation-3 before operation-4. Now we have only 6 possible orders in
which the operations can appear.
While it is possible for the two processors to see the operations in different order, intuition tells us
that “correct” program behavior requires that all processors see the memory operations
performed in the same order. These two requirements (Program Order be preserved, and all
processors see the memory operations in the same order) are used to define a correct behavior of
concurrent programs, and this behavior is termed Sequential Consistency of memory.
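The two-processor example can be written as the following C sketch (our own code). Under sequential consistency,
only the six interleavings that preserve each thread's program order are legal, and both threads observe the same
global order; a real program would need atomics or fences, since plain shared accesses like these are weakly ordered
on most machines.

#include <pthread.h>
#include <stdio.h>

int X = 0, Y = 0;       /* shared variables */
int r1, r2;             /* values observed by the two reads */

void *p1(void *arg) { (void)arg; X = 1;  r1 = Y; return NULL; }  /* operation-1 then operation-2 */
void *p2(void *arg) { (void)arg; r2 = X; Y = 1;  return NULL; }  /* operation-3 then operation-4 */

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, p1, NULL);
    pthread_create(&t2, NULL, p2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* Under sequential consistency, the observed (r1, r2) must be explainable by one of the
       6 interleavings of the four operations that preserve each thread's program order. */
    printf("r1=%d r2=%d\n", r1, r2);
    return 0;
}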
Sequential Consistency.
Weak Ordering.
Weak ordering places three conditions on memory accesses:
1. Accesses to synchronization variables are sequentially consistent.
2. No access to a synchronization variable is allowed to be performed until all previous data
accesses (reads and writes) have been performed.
3. No data access (read or write) is allowed to be performed until all previous accesses to
synchronization variables have been performed.
The first condition forces a global order on all synchronization variables. Since ordinary
variables are accessed only in critical sections (after accessing synchronization variables),
Sequential consistency is assured even on ordinary variables.
The second condition implies that before a synchronization variable is released (and
subsequently obtained by a different processor), all accesses made to ordinary variables
are made globally visible to all processors. Likewise, the third condition requires that
before a processor can access ordinary variables inside a critical section, accesses to
synchronization variables must be globally visible. This forces mutual exclusion on
synchronization variables and ensures that changes made in previous critical sections are
globally visible.
Release Consistency.
While using synchronization variables (in Weak ordering), we normally associate locks
with synchronization variables. We use lock and unlock (or acquire and release)
operations on these locks. When you acquire, you have not yet made any changes to
shared variables (other processes may have). When you release a lock, you may have
updated variables and these variables must be made available to other processors. So, we
need to maintain consistent data only on a lock release. How does the performance
improve? Consistent memory is not guaranteed until a lock is released. Release consistency
imposes the following conditions:
1. Before an ordinary read or write on a shared variable is performed, all previous acquires
performed by the processor must have completed.
2. Before a release on a synchronization variable is allowed, all previous reads and writes
on ordinary variables performed by the processor must have completed.
The performance of Release consistency can be improved using "Lazy" release whereby, the
shared memory is made consistent only on acquire by a different processor.
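As a programming-level analogue of this idea (a minimal C11 sketch, not from the original text), release/acquire atomics express the same contract: every ordinary write made before the release is guaranteed to be visible to a processor whose acquire observes the released value.

#include <stdatomic.h>

int shared_data;                 /* ordinary shared variable             */
atomic_int flag = 0;             /* plays the role of the sync variable  */

/* releasing processor */
void producer(void) {
    shared_data = 42;                                      /* ordinary write         */
    atomic_store_explicit(&flag, 1, memory_order_release); /* release: prior writes  */
                                                           /* become visible         */
}

/* acquiring processor */
int consumer(void) {
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                        /* acquire: once the flag is seen, prior writes are too */
    return shared_data;          /* guaranteed to read 42 */
}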
In both Weak ordering and Release consistency models, the shared memory is made
consistent when any synchronization variable is released (or accessed). However, if we
can associate a set of shared variables with each synchronization variable, then we only
need to maintain consistency on these variables when the associated synchronization
variable is released (or accessed). Entry consistency requires that the program specify the
association between shared memory and synchronization variables. Scope consistency is
similar to Entry consistency, but the associations between shared variables and
synchronization variables are implicitly extracted. Consider the following example where
lock-1 and lock-2 are synchronization variables while A,B,C, and D are ordinary shared
variables.
P1:
    Lock lock-1
    A = 1
    Lock lock-2
    B = 1
    Unlock lock-2
    Unlock lock-1

P2:
    Lock lock-2
    C = A      ----- may not see A = 1
    D = B
    Unlock lock-2
A similar effect will be achieved in Entry consistency by associating lock-1 with variable
A and lock-2 with variables A, B, C and D.
a). Hardware prefetch for write. Here, a request for Exclusive access of a cache block is
issued even before the processor is ready to do a write, overlapping the time for
invalidations and acknowledgements needed to implement sequential consistency
correctly, with other computations. However, improper use of the prefetch may
unnecessarily invalidate cache copies at other processors and this may lead to increased
cache misses.
b). Speculative loads. Here cache blocks for load are prefetched, without changing any
exclusive accesses that are currently held by other processors. If the data is modified
before it is actually used, the prefetched copies are invalidated. Otherwise, one can
realize performance gains from the prefetch.
Summary. It appears that weak ordering is the dominant memory consistency model
supported by the processors currently used in multiprocessor systems. Sequential
consistency is then assured implicitly and transparently by software layers or by the
programmer. In addition to improved performance, weak ordering may also have benefits
in supporting fault-tolerance for applications using DSM systems, in terms of the amount
of state information that must be checkpointed [Hecht 99].
In the previous section we assumed that the programmer relies on synchronization among
the processors (or processes) while accessing shared memory in order to assure program
correctness as defined by the Sequential consistency model. The two fundamental
synchronization constructs used by programmers are mutual exclusion locks and barriers.
Mutual exclusion locks can be acquired by only one processor at a time, forcing a global
sequential order on the locks. When barriers are used a processor is forced to wait for its
partners and proceed beyond the barrier only when all partners reach the barrier. In this
section we describe how mutual exclusion locks and barriers can be supported in
hardware or software, and discuss performance of various implementations.
In this example, the value in R3 will be stored to memory only if the SC is successful, in
which case R3 will be set to a non-zero value. Otherwise, SC will fail to change the value
in memory and R3 will be set to zero.
Typically, LL stores the memory address in a special Link Register which is compared
with the memory address of a subsequent SC instruction. The Link Register is reset by any
memory accesses to the same address by other processors or on a context switch of the
current process. In a multiprocessor system, LL and SC can be implemented using cache
coherency techniques previously discussed. For example, in snoopy systems, the Link
Register is reset by snooping on the shared bus.
Notice that in this example we are actually using "Spin Locks" where a processor is not
blocked on an unsuccessful attempt to acquire a lock. Instead, the unsuccessful processor
will repeat its attempt to acquire the lock.
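As an illustration of the idea (a C11 sketch using an atomic exchange in place of the LL/SC instruction pair; not taken from the original text), a spin lock acquired in this busy-waiting fashion might look like:

#include <stdatomic.h>

atomic_int lock = 0;                       /* 0 = free, 1 = held */

void acquire(atomic_int *lk) {
    /* keep attempting the atomic exchange until the lock is observed free */
    while (atomic_exchange_explicit(lk, 1, memory_order_acquire) == 1)
        ;                                  /* spin: previous value was 1 (held) */
}

void release(atomic_int *lk) {
    atomic_store_explicit(lk, 0, memory_order_release);
}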
Shadow locks.
So far we have been assuming that the coherency of the mutual exclusion lock variable is
guaranteed on every access, which can significantly reduce the performance of shared
memory systems. Consider how the repeated attempts to acquire a spin-lock can lead to
repeated invalidations of the lock variable. In order to improve the performance, processors
are instead made to spin on a "shadow" of the lock. All spinning processors first try to
acquire the lock by accessing the lock variable in common memory. Unsuccessful processors
then cache the "locked" value in their local caches and spin on the local copies. The local
copies are invalidated when the lock becomes available.
Consider the following code segment for an implementation of a spin lock.
The first branch (BNEZ) does the spin. The second branch is needed for atomicity.
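As a rough C-level sketch of the same technique (an illustration using C11 atomics rather than the LL/SC assembly discussed above), a test-and-test-and-set spin lock that spins on a locally cached copy looks like this:

#include <stdatomic.h>

void acquire_shadow(atomic_int *lk) {
    for (;;) {
        /* spin on ordinary reads; these hit the locally cached copy */
        while (atomic_load_explicit(lk, memory_order_relaxed) == 1)
            ;
        /* lock appears free: only now attempt the expensive atomic exchange */
        if (atomic_exchange_explicit(lk, 1, memory_order_acquire) == 0)
            return;                        /* acquired */
        /* another processor won the race; go back to spinning on reads */
    }
}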
Other Variations.
Spinning on a single variable causes excessive network traffic since all unsuccessful
processors attempt to acquire the lock as soon as it becomes available. We can consider
implementing "exponential back-off" techniques whereby a processor uses different
amounts of delay between attempts to acquire a lock. Alternatively, we can associate an
"array" with the lock so that each processor spins on a different array element. A
similar effect can be achieved using "Ticket" locks. Here, an integer (or ticket number) is
assigned to each unsuccessful processor. The value of the ticket being serviced is
incremented on each release of the lock, and a processor with a matching ticket number
will then be allowed to acquire the lock. This technique is similar to how customers are
assigned a service number and they are serviced when the current number serviced
matches their number.
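A ticket lock can be sketched in C11 as follows (an illustrative sketch, not from the original text): an arriving processor takes the next ticket with an atomic fetch-and-add and then spins until the now-serving counter matches its ticket.

#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;     /* next ticket to hand out           */
    atomic_uint now_serving;     /* ticket currently holding the lock */
} ticket_lock;

void ticket_acquire(ticket_lock *l) {
    unsigned my = atomic_fetch_add_explicit(&l->next_ticket, 1,
                                            memory_order_relaxed);
    while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != my)
        ;                        /* spin until our number is being served */
}

void ticket_release(ticket_lock *l) {
    atomic_fetch_add_explicit(&l->now_serving, 1, memory_order_release);
}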
In Queue locks and Granting Locks, the unsuccessful processors are linked together so
that the current lock holder can release the lock to the next processor in the queue. Notice
that these techniques are similar to the techniques used in older systems where an
unsuccessful process is blocked and queued on the lock by the operating system.
Barrier Synchronization.
How would we implement a join (barrier) using the atomic instructions we have seen so
far? We need two locks: one to atomically acquire the join variable and increment it, and
one to make sure all processes wait until the last process arrives at the barrier. Consider
the following implementation.
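One possible counter-based form is the sense-reversing barrier sketched below in C11 (an illustration, not the original code): arrivals are counted with an atomic increment, and waiting processes spin on a flag that the last arrival flips.

#include <stdatomic.h>

atomic_int count = 0;            /* number of processes that have arrived      */
atomic_int sense = 0;            /* flips each time the barrier completes      */

void barrier(int nprocs) {
    int my_sense = !atomic_load_explicit(&sense, memory_order_relaxed);

    /* atomically join: increment the arrival count */
    if (atomic_fetch_add_explicit(&count, 1, memory_order_acq_rel) == nprocs - 1) {
        /* last process to arrive resets the count and releases the others */
        atomic_store_explicit(&count, 0, memory_order_relaxed);
        atomic_store_explicit(&sense, my_sense, memory_order_release);
    } else {
        /* wait until the last arrival flips the sense flag */
        while (atomic_load_explicit(&sense, memory_order_acquire) != my_sense)
            ;
    }
}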
[Figure: a bus-based multiprocessor - processors (P) and memories (M) connected by a shared bus.]
1. Software solutions
2. Hardware solutions
[Figure: processors (P) sharing a global cache and memory through a bus.]
Hardware Solution: Snooping Cache
Write Invalidate
When a processor writes into a cached variable, all copies of it in other processors' caches
are invalidated. These processors then have to read a valid copy either from memory (M),
or from the processor that modified the variable.
Write Broadcast
Instead of invalidating, why not broadcast the updated
value to the other processors sharing that copy?
This will act as write through for shared data, and
write back for private data.
Program 1 (process 0 and process 1).
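Based on the later discussion of "the two if statements", the program presumably has the classic form sketched below (an assumed reconstruction; x and y are shared variables, both initially 0):

int x = 0, y = 0;      /* shared, both initially 0 */

/* process 0 */
x = 1;
if (y == 0) {
    /* process 0 proceeds, believing process 1 has not yet started */
}

/* process 1 */
y = 1;
if (x == 0) {
    /* process 1 proceeds, believing process 0 has not yet started */
}

Under Sequential Consistency at most one of the two if bodies can execute; the question addressed below is what happens when the stores are buffered.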
Properties of SC.
Case 1
Consider a switch-based multiprocessor. Assume there is no cache. [Figure: processors p0, p1, p2 accessing shared locations x and y through the interconnect.]
Case 2
In a multiprocessor where processors have private caches, all invalidate signals must be acknowledged.
Write-buffers and New problems
[Figure: a processor P with a cache and a write buffer in front of memory.]
Let both x:=1 and y:=1 be written into the write buffers, but before the memory is updated,
let the two if statements be evaluated.
Introduction
Our Pattern Language (OPL), described in the following paper A Design Pattern Language for Engineering (Parallel)
Software, represents the confluence of three different directions of research. The first direction of research was aimed
at analyzing the varieties of computing, with a goal of identifying key influences on computer architecture. This
research led to the identification of the “13 dwarfs” of computing, which in turn became the “13 computational patterns”
of OPL (See Figure 1 of the following paper on OPL in this chapter). The second direction of research arose from
generations of research on architecting large pieces of software. This research led to the identification of a series of
“architectural styles” [6] that were incorporated into OPL as “structural patterns.” These computational and structural
patterns sit, side by side, at the top of OPL in Figure 1 of the following paper. The overriding vision of OPL, a unified
pattern language for designing and implementing parallel software, came from Mattson’s book Patterns for Parallel
Programming [3]. A revised version of Mattson’s pattern language constitutes the three lower levels of OPL.
For some time OPL was envisioned as a set of computational patterns (the former dwarfs) sitting on top of Mattson’s
pattern language for parallel programming [3]. The result was still less than fully satisfying because few applications
could be naturally decomposed into a single computational pattern. More generally, this generation of OPL didn’t help at all with the important
problem of how to compose computations. It was Koushik Sen who first suggested that a small set of techniques could
be used to compose all the diverse sets of computations in parallel programs.
Approaches to composing software had received considerable attention in software engineering of sequential pro-
grams, and particularly in object-oriented programming. Initially this trend of research seemed unimportant and
perhaps totally irrelevant for parallel software. Examining the software architectures developed in an object-oriented
style gave little insight into how to structure an efficient parallel program. Over time, Kurt Keutzer began to cham-
pion the notion that while software architecture did not offer immediate insights into how to parallelize software, the
need for software modularity was as great or greater in parallel software. Shaw and Garlan had already codified a
list of architectural styles [6] for composing software and creating software architectures. This list of architectural
styles seemed immediately useful for parallel software as well; however, four structural patterns particularly useful for
parallel computing were added: Map Reduce, Iterative Refinement, Puppeteer, and Arbitrary Static Task Graph, the
last of which replaced the common Call-and-Return architecture of object-oriented programming.
The past few decades have seen large fluctuations in the perceived value of parallel
computing. At times, parallel computation has optimistically been viewed as the solution
to all of our computational limitations. At other times, many have argued that it is a waste
of effort given the rate at which processor speeds and memory prices continue to improve.
Perceptions continue to vacillate between these two extremes due to a number of factors,
among them: the continual changes in the “hot” problems being solved, the programming
environments available to users, the supercomputing market, the vendors involved in build-
ing these supercomputers, and the academic community’s focus at any given point and time.
The result is a somewhat muddied picture from which it is difficult to objectively judge the
value and promise of parallel computing.
In spite of the rapid advances in sequential computing technology, the promise of par-
allel computing is the same now as it was at its inception. Namely, if users can buy fast
sequential computers with gigabytes of memory, imagine how much faster their programs
could run if many of these machines were working in cooperation! Or, imagine how much larger
a problem they could solve if the memories of many of these machines were used cooperatively!
The challenges to realizing this potential can be grouped into two main problems: the
hardware problem and the software problem. The former asks, “how do I build a parallel
machine that will allow these processors and memories to cooperate efficiently?” The
software problem asks, “given such a platform, how do I express my computation such that
it will utilize these processors and memories effectively?”
In recent years, there has been a growing awareness that while the parallel community
can build machines that are reasonably efficient and/or cheap, most programmers and sci-
entists are incapable of programming them effectively. Moreover, even the best parallel
programmers cannot do so without significant effort. The implication is that the software
problem is currently lacking in satisfactory solutions. This dissertation focuses on one
approach designed to solve that problem.
One of the fundamental concepts that was introduced to Orca C during ZPL’s inception
was the concept of the region. A region is simply a user-specified set of indices, a concept
which may seem trivially uninteresting at first glance. However, the use of regions in ZPL
has had a pervasive effect on the language’s appearance, semantics, compilation, and run-
time management, resulting in much of ZPL’s success. This dissertation defines the region
in greater depth and documents its role in defining and implementing the ZPL language.
This dissertation’s study of regions begins in the next chapter. The rest of this chapter
provides a general overview of parallel programming, summarizing the challenges inherent
in writing parallel programs, the techniques that can be used to create them, and the metrics
used to evaluate these techniques. The next section begins by providing a rough overview
of parallel architectures.
Parallel Architectures
This dissertation categorizes parallel platforms as being one of three rough types: dis-
tributed memory, shared memory, or shared address space. This taxonomy is somewhat
coarse given the wide variety of parallel architectures that have been developed, but it pro-
vides a useful characterization of current architectures for the purposes of this dissertation.
Distributed memory machines are considered to be those in which each processor has
a local memory with its own address space. A processor’s memory cannot be accessed di-
rectly by another processor, requiring both processors to be involved when communicating
values from one memory to another. Examples of distributed memory machines include
commodity Linux clusters.
Shared memory machines are those in which a single address space and global memory
are shared between multiple processors. Each processor owns a local cache, and its values
are kept coherent with the global memory by the operating system. Data can be exchanged
between processors simply by placing the values, or pointers to values, in a predefined
location and synchronizing appropriately. Examples of shared memory machines include
the SGI Origin series and the Sun Enterprise.
Shared address space architectures are those in which each processor has its own local
memory, but a single shared address space is mapped across the distinct memories. Such
architectures allow a processor to access the memories of other processors without their
direct involvement, but they differ from shared memory machines in that there is no implicit
caching of values located on remote machines. The primary example of a shared address
machine is Cray’s T3D/T3E line.
Many modern machines are also built using a combination of these technologies in
a hierarchical fashion, known as a cluster. Most clusters consist of a number of shared
memory machines connected by a network, resulting in a hybrid of shared and distributed
memory characteristics. IBM’s large-scale SP machines are an example of this design.
[Figure 1.1: the CTA machine model - processors with local memories connected by a sparse network, plus a controller for global operations.]
ZPL supports compilation and execution on these diverse architectures by describing them
using a single machine model known as the Candidate Type Architecture (CTA) [Sny86].
The CTA is a reasonably vague model, and deliberately so. It characterizes parallel ma-
chines as a group of von Neumann processors, connected by a sparse network of unspeci-
fied topology. Each processor has a local memory that it can access at unit cost. Processors
can also access other processors’ values at a cost significantly higher than unit cost by
communicating over the network. The CTA also specifies a controller used for global com-
munications and synchronization, though that will not be of concern in this discussion. See
Figure 1.1 for a simple diagram of the CTA.
Why use such an abstract model? The reason is that parallel machines vary so widely in
design that it is difficult to develop a more specific model that describes them all. The CTA
successfully abstracts the vast majority of parallel machines by emphasizing the importance
of locality and the relatively high cost of interprocessor communication. This is in direct
contrast to the overly idealized PRAM [FW78] model or the extremely parameterized LogP
model [CKP 93], neither of which form a useful foundation for a compiler concerned with
portable performance. For more details on the CTA, please refer to the literature [Sny86,
Sny95, Lin92].
Challenges to Parallel Programming
Writing parallel programs is strictly more difficult than writing sequential ones. In se-
quential programming, the programmer must design an algorithm and then express it to
the computer in some manner that is correct, clear, and efficient to execute. Parallel pro-
gramming involves these same issues, but also adds a number of additional challenges that
complicate development and have no counterpart in the sequential realm. These challenges
include: finding and expressing concurrency, managing data distributions, managing inter-
processor communication, balancing the computational load, and simply implementing the
parallel algorithm correctly. This section considers each of these challenges in turn.
Concurrency
Matrix addition: given two m × n matrices A and B, compute C = A + B, where
C[i][j] = A[i][j] + B[i][j]; each element of the result can be computed independently of all the others.
Data Distribution
Communication
Assuming that all the data that a processor needs to access cannot be made exclusively
local to that processor, some form of data transfer must be used to move remote values
to a processor’s local memory or cache. On distributed memory machines, this communi-
cation typically takes the form of explicit calls to a library designed to move values from
one processor’s memory to another. For shared memory machines, communication in-
volves cache coherence protocols to ensure that a processor’s locally cached values are
kept consistent with the main memory. In either case, communication constitutes work that
is time-consuming and which was not present in the sequential implementation. There-
fore, communication overheads must be minimized in order to maximize the benefits of
parallelism.
Over time, a number of algorithms have been developed for parallel matrix multiplica-
tion, each of which has unique concurrency, data distribution, and communication charac-
teristics. A few of these algorithms will be introduced and analyzed during the course of
the next few chapters. For now, we return to our final parallel computing challenges.
Load Balancing
The execution time of a parallel algorithm on a given processor is determined by the time
required to perform its portion of the computation plus the overhead of any time spent per-
forming communication or waiting for remote data values to arrive. The execution time of
the algorithm as a whole is determined by the longest execution time of any of the proces-
sors. For this reason, it is desirable to balance the total computation and communication
between processors in such a way that the maximum per-processor execution time is mini-
mized. This is referred to as load balancing, since the conventional wisdom is that dividing
work between the processors as evenly as possible will minimize idle time on each proces-
sor, thereby reducing the total execution time.
Load balancing a matrix addition algorithm is fairly simple due to the fact that it can
be implemented without communication. The key is simply to give each processor approx-
imately the same number of matrix values. Similarly, matrix multiplication algorithms are
typically load balanced by dividing the elements of C among the processors as evenly as
possible and trying to minimize the communication overheads required to bring remote
A and B values into the processors’ local memories.
Implementation and Debugging
Once all of the parallel design decisions above have been made, the nontrivial matter of
implementing and debugging the parallel program still remains. Programmers often imple-
ment parallel algorithms by creating a single executable that will execute on each processor.
The program is designed to perform different computations and communications based on
the processor’s unique ID to ensure that the work is divided between instances of the exe-
cutable. This is referred to as the Single Program, Multiple Data (SPMD) model, and its
attractiveness stems from the fact that only one program must be written (albeit a nontrivial
one). The alternative is to use the Multiple Program, Multiple Data (MPMD) model, in
which several cooperating programs are created for execution on the processor set. In ei-
ther case, the executables must be written to cooperatively perform the computation while
managing data locality and communication. They must also maintain a reasonably bal-
anced load across the processor set. It should be clear that implementing such a program
will inherently require greater programmer effort than writing the equivalent sequential
program.
As with any program, bugs are likely to creep into the implementation, and the effects
of these bugs can be disastrous. A simple off-by-one error can cause data to be exchanged
with the wrong processor, or can cause a program to deadlock, waiting for a message that was
never sent. Incorrect synchronization can result in data values being accessed prematurely,
or in race conditions. Bugs related to parallel issues can be nondeterministic and
show up infrequently. Or, they may occur only when using large processor sets, forcing the
programmer to sift through a large number of execution contexts to determine the cause.
In short, parallel debugging involves issues not present in the sequential world, and it can
often be a huge headache.
Summary
Computing effectively with a single processor is a challenging task. The programmer must
be concerned with creating programs that perform correctly and well. Computing with
multiple processors involves the same effort, yet adds a number of new challenges related
to the cooperation of multiple processors. None of these new factors are trivial, giving a
good indication of why programmers and scientists find parallel computing so challenging.
The design of the ZPL language strives to relieve programmers from most of the burdens
of correctly implementing a parallel program. Yet, rather than making them blind to these
details, ZPL’s regions expose the crucial parallel issues of concurrency, data distribution,
communication, and load balancing to programmers, should they care to reason about such
issues. These benefits of regions will be described in subsequent chapters. For now, we
shift our attention to the spectrum of techniques that one might consider when approaching
the task of parallel programming.
Techniques for programming parallel computers can be divided into three rough categories:
parallelizing compilers, parallel programming languages, and parallel libraries. This sec-
tion considers each approach in turn.
Parallelizing Compilers
The concept of a parallelizing compiler is an attractive one. The idea is that program-
mers will write their programs using a traditional language such as C or Fortran, and the
compiler will be responsible for managing the parallel programming challenges described
in the previous section. Such a tool is ideal because it allows programmers to express
code in a familiar, traditional manner, leaving the challenges related to parallelism to the
compiler. Examples of parallelizing compilers include SUIF, KAP, and the Cray MTA
compiler [HAA 96, KLS94, Ter99].
Listing 1.1: Sequential C Matrix Multiplication
/* initialize the result matrix C to zero */
for (i=0; i<m; i++) {
  for (k=0; k<o; k++) {
    C[i][k] = 0;
  }
}

/* accumulate the products A[i][j] * B[j][k] into C[i][k] */
for (i=0; i<m; i++) {
  for (j=0; j<n; j++) {
    for (k=0; k<o; k++) {
      C[i][k] += A[i][j] * B[j][k];
    }
  }
}
The point here is that effective parallel algorithms often differ significantly from their
sequential counterparts. While having an effective parallel compiler would be a godsend,
expecting a compiler to automatically understand an arbitrary sequential algorithm well
enough to create an efficient parallel equivalent seems a bit naive. The continuing lack of
such a compiler serves as evidence to reinforce this claim.
[Figure: the SUMMA algorithm - for each step i, the i-th column of matrix A and the i-th row of matrix B are replicated, multiplied elementwise, and the results accumulated into matrix C.]
Global-view Languages
Global-view languages are those in which the programmer specifies the behavior of their
algorithm as a whole, largely ignoring the fact that multiple processors will be used to
implement the program. The compiler is therefore responsible for managing all of the
parallel implementation details, including data distribution and communication.
Many global-view languages are rather unique, providing language-level concepts that
are tailored specifically for parallel computing. The ZPL language and its regions form
one such example. Other global-view languages include the directive-based variations of
traditional programming languages used by parallelizing compilers, since the annotated
sequential programs are global descriptions of the algorithm with no reference to individual
processors. As a simple example of a directive-based global-view language, consider the
pseudocode implementation of the SUMMA algorithm in Listing 1.2. This is essentially
a sequential description of the SUMMA algorithm with some comments (directives) that
indicate how each array should be distributed between processors.
Listing 1.2: Pseudo-Code for SUMMA Using a Global View
double A[m][n];
double B[n][o];
double C[m][o];
double ColA[m];
double RowB[o];
// distribute C [block,block]
// align A[:,:] with C[:,:]
// align B[:,:] with C[:,:]
// align ColA[:] with C[:,*]
// align RowB[:] with C[*,:]
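A sketch of the SUMMA computation over these arrays (an illustration of the algorithm's usual formulation, not part of the original listing) might be:

// initialize C
for (i = 0; i < m; i++)
  for (k = 0; k < o; k++)
    C[i][k] = 0;

// for each of the n terms, copy a column of A and a row of B,
// then accumulate their outer product into C
for (j = 0; j < n; j++) {
  for (i = 0; i < m; i++) ColA[i] = A[i][j];
  for (k = 0; k < o; k++) RowB[k] = B[j][k];
  for (i = 0; i < m; i++)
    for (k = 0; k < o; k++)
      C[i][k] += ColA[i] * RowB[k];
}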
Local-view Languages
Local-view languages are those in which the implementor is responsible for specifying the
program’s behavior on a per-processor basis. Thus, details such as communication, data
distribution, and load balancing must be handled explicitly by the programmer. A local-
view implementation of the SUMMA algorithm might appear as shown in Listing 1.3.
The chief advantage of local-view languages is that users have complete control over
the parallel implementation of their programs, allowing them to implement any parallel
algorithm that they can imagine. The drawback to these approaches is that managing the
details of a parallel program can become a painstaking venture very quickly. This contrast
can be seen even in short programs such as the implementation of SUMMA in Listing 1.3,
especially considering that the implementations of its Broadcast...(), IOwn...(),
and GlobToLoc...() routines have been omitted for brevity. The magnitude of these
details is such that they tend to make programs written in local-view languages much
more difficult to maintain and debug.
Listing 1.3: Pseudo-Code for SUMMA Using a Local View
int m_loc = m/proc_rows;
int o_loc = o/proc_cols;
int n_loc_col = n/proc_cols;
int n_loc_row = n/proc_rows;
double A[m_loc][n_loc_col];
double B[n_loc_row][o_loc];
double C[m_loc][o_loc];
double ColA[m_loc];
double RowB[o_loc];
Parallel libraries are the third approach to parallel computing considered here. These are
simply libraries designed to ease the task of utilizing a parallel computer. Once again, we
categorize these as global-view or local-view approaches.
Global-view Libraries
Global-view libraries, like their language counterparts, are those in which the programmer
is largely kept blissfully unaware of the fact that multiple processors are involved. As a
result, the vast majority of these libraries tend to support high-level numerical operations
such as matrix multiplications or solving linear equations. The number of these libraries is
overwhelming, but a few notable examples include the NAG Parallel Library, ScaLAPACK,
and PLAPACK [NAG00, BCC 97, vdG97].
The advantage to using a global-view library is that the supported routines are typically
well-tuned to take full advantage of a parallel machine’s processing power. To achieve
similar performance using a parallel language tends to require more effort than most pro-
grammers are willing to make.
The disadvantages to global-view libraries are standard ones for any library-based ap-
proach to computation. Libraries support a fixed interface, limiting their generality as com-
pared to programming languages. Libraries can either be small and extremely special-
purpose or they can be wide, either in terms of the number of routines exported or the
number of parameters passed to each routine [GL00]. For these reasons, libraries are a use-
ful tool, but often not as satisfying for expressing general computation as a programming
language.
Local-view Libraries
Like languages, libraries may also be local-view. For our purposes, local-view libraries are
those that aid in the support of processor-level operations such as communication between
processors. Local-view libraries can be evaluated much like local-view languages: they
give the programmer a great deal of explicit low-level control over a parallel machine,
but by nature this requires the explicit management of many painstaking details. Notable
examples include the MPI and SHMEM libraries [Mes94, BK94].
Summary
This section has described a number of different ways of programming parallel computers.
To summarize, general parallelizing compilers seem fairly intractable, leaving languages
and libraries as the most attractive alternatives. In each of these approaches, the tradeoff
between supporting global- and local-view approaches is often one of high-level clarity
versus low-level control. The goal of the ZPL programming language is to take advantage
of the clarity offered by a global-view language without sacrificing the programmer’s abil-
ity to understand the low-level implementation and tune their code accordingly. Further
chapters will develop this point and also provide a more comprehensive survey of parallel
programming languages and libraries.
For any of the parallel programming approaches described in the previous section, there are
a number of metrics that can be used to evaluate its effectiveness. This section describes
five of the most important metrics that will be used to evaluate parallel programming in
this dissertation: performance, clarity, portability, generality, and a programmer’s ability to
reason about the implementation.
Performance
Performance is typically viewed as the bottom line in parallel computing. Since improved
performance is often the primary motivation for using parallel computers, failing to achieve
good performance reflects poorly on a language, library, or compiler.
Figure 1.3: A Sample Speedup Graph (speedup plotted against the number of processors, from 0 to 64).
The dotted line indicates linear speedup, which represents ideal parallel performance. The “program A” line
represents an algorithm that scales quite well as the processor set size increases. The “pro-
gram B” line indicates an algorithm that does not scale nearly as well, presumably due to
parallel overheads like communication. Note that these numbers are completely fabricated
for demonstration purposes.
If the original motivating goal of running a program p times faster using p processors
is met, then the speedup is p. This is known as linear speedup. In practice, this is chal-
lenging to achieve since the parallel implementation of most interesting programs requires
work beyond that which was required for the sequential algorithm: in particular, commu-
nication and synchronization between processors. Thus, the amount of work per processor
in a parallel implementation will typically be more than 1/p of the work of the sequential
algorithm.
On the other hand, note that the parallelization of many algorithms requires allocating only
approximately 1/p of the sequential program’s memory on each processor. This causes the
working set of each processor to decrease as p increases, allowing it to make better use of
the memory hierarchy. This effect can often offset the overhead of communication, making
linear, or even superlinear, speedups possible.
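Stated as formulas (standard definitions, given here for reference): for a problem with sequential running time $T_{seq}$ and parallel running time $T_{par}(p)$ on $p$ processors,

\[
S(p) = \frac{T_{seq}}{T_{par}(p)}, \qquad E(p) = \frac{S(p)}{p},
\]

so linear speedup corresponds to $S(p) = p$, i.e. an efficiency of $E(p) = 1$.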
Parallel performance is typically reported using a graph showing speedup versus the
number of processors. Figure 1.3 shows a sample graph that displays fictional results for
a pair of programs. The speedup of program A resembles a parallel algorithm like matrix
addition that requires no communication between processors and therefore achieves nearly
linear speedup. In contrast, program B’s speedup falls away from the ideal as the number
of processors increases, as might occur in a matrix multiplication algorithm that requires
communication.
Clarity
For the purposes of this dissertation, the clarity of a parallel program will refer to how
clearly it represents the overall algorithm being expressed. For example, given that list-
ings 1.2 and 1.3 both implement the SUMMA algorithm for matrix multiplication, how
clear is each representation? Conversely, how much do the details of the parallel imple-
mentation interfere with a reader’s ability to understand an algorithm?
The importance of clarity is often brushed aside in favor of the all-consuming pursuit
of performance. However, this is a mistake that should not be made. Clarity is perhaps the
single most important factor that prevents more scientists and programmers from utilizing
parallel computers today. Local-view libraries continue to be the predominant approach to
parallel programming, yet their syntactic overheads are such that clarity is greatly compro-
mised. This requires programmers to focus most of their attention on making the program
work correctly rather than spending time implementing and improving their original algo-
rithm. Ideally, parallel programming approaches should result in clear programs that can
be readily understood.
Portability
Ideally, portability implies that a given program will behave consistently on all ma-
chines, regardless of their architectural features.
Generality
Generality simply refers to the ability of a parallel programming approach to express algo-
rithms for varying types of problems. For example, a library which only supports matrix
multiplication operations is not very general, and would not be very helpful for writing a
parallel quicksort algorithm. Conversely, a global-view functional language might make it
simple to write a parallel quicksort algorithm, but difficult to express the SUMMA matrix
multiplication algorithm efficiently. Ideally, a parallel programming approach should be as
general as possible.
Listing 1.4: Two matrix additions in C. Which one is better?
double A[m][n];
double B[m][n];
double C[m][n];
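The two loop nests that the following discussion compares are presumably of this form (a reconstruction based on that discussion, not the original listing):

/* loop nest 1: traverses each row contiguously (row-major order) */
for (i = 0; i < m; i++)
  for (j = 0; j < n; j++)
    C[i][j] = A[i][j] + B[i][j];

/* loop nest 2: walks down columns, accessing memory in a strided fashion */
for (j = 0; j < n; j++)
  for (i = 0; i < m; i++)
    C[i][j] = A[i][j] + B[i][j];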
Performance Model
This dissertation defines a performance model as the means by which programmers under-
stand the implementations of their programs. In this context, the performance model need
not be a precise tool, but simply a means of weighing different implementation alternatives
against one another.
As an example, C’s performance model indicates that the two loop nests in Listing 1.4
may perform differently in spite of the fact that they are semantically equivalent. C spec-
ifies that two-dimensional arrays are laid out in row-major order, and the memory models
of modern machines indicate that accessing memory sequentially tends to be faster than
accessing it in a strided manner. Using this information, a savvy C programmer will always
choose to implement matrix addition using the first loop nest.
Note that C does not say how much slower the second loop nest will be. In fact, it does
not even guarantee that the second loop nest will be slower. An optimizing compiler may
reorder the loops to make them equivalent to the first loop nest. Or, hardware prefetching
may detect the memory access pattern and successfully hide the memory latency normally
associated with strided array accesses. In the presence of these uncertainties, experienced
C programmers will recognize that the first loop nest should be no worse than the second.
Given the choice between the two approaches, they will choose the first implementation
every time.
C’s performance model gives the programmer some idea of how C code will be com-
piled down to a machine’s hardware, even if the programmer is unfamiliar with specific
details like the machine’s assembly language, its cache size, or its number of registers. In
the same way, a parallel programmer should have some sense of how their code is being
implemented on a parallel machine—for example, how the data and work are distributed
between the processors, when communication takes place, what kind of communication it
is, etc. Note that users of local-view languages and libraries have access to this informa-
tion, because they specify it manually. Ideally, global-view languages and libraries should
also give their users a parallel performance model with which different implementation
alternatives can be compared and evaluated.
This Dissertation
This dissertation was designed to serve many different purposes. Naturally, its most impor-
tant role is to describe the contributions that make up my doctoral research. With this goal
in mind, I have worked to create a document that examines the complete range of effects
that regions have had on the ZPL language, from their syntactic benefits to their imple-
mentation, and from their parallel implications to their ability to support advanced parallel
computations. I also designed this dissertation to serve as documentation for many of my
contributions to the ZPL compiler for use by future collaborators in the project. As such,
some sections contain low-level implementation details that may not be of interest to those
outside the ZPL community. Throughout the process of writing, my unifying concept has
been to tell the story of regions as completely and accurately as I could in the time and
space available.
In telling such a broad story, some of this dissertation’s contributions have been made
as a joint effort between myself and other members of the ZPL project—most notably
Sung-Eun Choi, Steven Deitz, E Christopher Lewis, Calvin Lin, Ton Ngo, and my advisor,
Lawrence Snyder. In describing aspects of the project that were developed as a team, my
intent is not to take credit for work that others have been involved in, but rather to make
this treatment of regions as complete and seamless as possible.
The novel contributions of this dissertation include:
A formal description and analysis of the region concept for expressing array compu-
tation, including support for replicated and privatized dimensions.
The design of the Ironman philosophy for supporting efficient paradigm-neutral com-
munications, and an instantiation of the philosophy in the form of a point-to-point
data transfer library.
A means of parameterizing regions that supports the concise and efficient expression
of hierarchical index sets and algorithms.
Region-based support for sparse computation that permits the expression of sparse
algorithms using dense syntax, and an implementation that supports general array
operations, yet can be optimized to a compact form.
The chapters of this dissertation have a consistent organization. The bulk of each chap-
ter describes its contributions. Most chapters contain an experimental evaluation of their
ideas along with a summary of previous work that is related to their contents. Each chapter
concludes with a discussion section that addresses the strengths and weaknesses of its con-
tributions, mentions side issues not covered in the chapter proper, and outlines possibilities
for future work.
This dissertation is organized as follows. The next three chapters define and analyze
the fundamental region concept. First, Chapter 2 describes the role of the region as a
syntactic mechanism for sequential array-based programming, using ZPL as its context.
Then, Chapter 3 explains the parallel implications of regions, detailing their use in defining
ZPL’s performance model. The implementation of regions and of ZPL’s runtime libraries
is covered in Chapter 4. The two chapters that follow each describe an extension to the
basic region concept designed to support more advanced parallel algorithms. The notion of
a parameterized region is defined in Chapter 5 and its use in implementing multigrid-style
computations is detailed. Chapter 6 extends the region to support sparse sets of indices,
and demonstrates its effectiveness in a number of sparse benchmarks.
An Introduction to Parallel Programming
Peter Pacheco
Compiling, then running with 4 threads:
./omp_hello 4
# include <omp.h>

In case the compiler doesn’t support OpenMP:
#ifdef _OPENMP
# include <omp.h>
#endif
# ifdef _OPENMP
int my_rank = omp_get_thread_num ( );
int thread_count = omp_get_num_threads ( );
#else
int my_rank = 0;
int thread_count = 1;
# endif
A First OpenMP Version
1) We identified two types of tasks:
a) computation of the areas of individual
trapezoids, and
b) adding the areas of trapezoids.
2) There is no communication among the
tasks in the first collection, but each task
in the first collection communicates with
task 1b.
A First OpenMP Version
3) We assumed that there would be many
more trapezoids than cores.
Mutual exclusion is needed when each thread adds its partial result into the shared total:
global_result += my_result;
Operators that can be used in an OpenMP reduction: +, *, -, &, |, ^, &&, ||
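Putting these pieces together, a first OpenMP version might look like the following sketch (an illustration using a simple serial Trap() routine and an example integrand; it is not the book's exact listing, and it assumes n is divisible by the number of threads):

#include <omp.h>

double f(double x) { return x * x; }            /* example integrand (assumed)   */

double Trap(double a, double b, int n) {        /* serial trapezoidal rule       */
    double h = (b - a) / n, sum = (f(a) + f(b)) / 2.0;
    for (int i = 1; i < n; i++) sum += f(a + i * h);
    return sum * h;
}

double global_result = 0.0;

void estimate(double a, double b, int n, int thread_count) {
#   pragma omp parallel num_threads(thread_count)
    {
        int    my_rank  = omp_get_thread_num();
        int    threads  = omp_get_num_threads();
        double h        = (b - a) / n;
        int    local_n  = n / threads;                 /* trapezoids per thread */
        double local_a  = a + my_rank * local_n * h;
        double local_b  = local_a + local_n * h;
        double my_result = Trap(local_a, local_b, local_n);

#       pragma omp critical                            /* mutual exclusion */
        global_result += my_result;
    }
}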
THE “PARALLEL FOR”
DIRECTIVE
Parallel for
Forks a team of threads to execute the
following structured block.
However, the structured block following the
parallel for directive must be a for loop.
Furthermore, with the parallel for directive
the system parallelizes the for loop by
dividing the iterations of the loop among
the threads.
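For example (a small sketch, not from the text), the iterations of a vector addition can be divided among the threads like this:

#include <omp.h>

void vector_add(double *z, const double *x, const double *y, int n) {
    int i;
#   pragma omp parallel for
    for (i = 0; i < n; i++)
        z[i] = x[i] + y[i];     /* iterations are split among the threads */
}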
Legal forms for parallelizable for
statements
Caveats
The variable index must have integer or
pointer type (e.g., it can’t be a float).
fibo[ 0 ] = fibo[ 1 ] = 1;
# pragma omp parallel for num_threads(2)
for (i = 2; i < n; i++)
    fibo[ i ] = fibo[ i - 1 ] + fibo[ i - 2 ];

this is correct:           1 1 2 3 5 8 13 21 34 55
but sometimes we get this: 1 1 2 3 5 8 0 0 0 0
Estimating π
OpenMP solution #1
loop dependency
OpenMP solution #2
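As a sketch of the idea behind the second solution (an illustration assuming the usual alternating Leibniz series; not the book's listing), each thread keeps a private partial sum via a reduction, and the alternating factor is computed from i so there is no loop-carried dependency:

#include <omp.h>

double estimate_pi(long long n, int thread_count) {
    double factor, sum = 0.0;
    long long i;

#   pragma omp parallel for num_threads(thread_count) \
        reduction(+: sum) private(factor)
    for (i = 0; i < n; i++) {
        factor = (i % 2 == 0) ? 1.0 : -1.0;  /* computed from i: no dependency */
        sum += factor / (2 * i + 1);
    }
    return 4.0 * sum;
}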
Assignment of work
using cyclic partitioning.
Our definition of function f.
Results
f(i) calls the sin function i times. Assume the time to execute f(2i) requires approximately twice as much time as the time to execute f(i).
With n = 10,000:
one thread:                      run-time = 3.67 seconds
two threads, default assignment: run-time = 2.76 seconds, speedup = 1.33
two threads, cyclic assignment:  run-time = 1.84 seconds, speedup = 1.99
The Schedule Clause
Default schedule:
Cyclic schedule:
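For instance (a sketch, not the book's listing, using the f described above), a cyclic assignment can be requested with schedule(static, 1):

#include <math.h>
#include <omp.h>

/* f(i) calls the sin function i times, as described above */
double f(int i) {
    double val = 0.0;
    for (int j = 0; j < i; j++) val += sin(j);
    return val;
}

double sum_default(int n, int thread_count) {
    double sum = 0.0;
    int i;
    /* default schedule: iterations divided into large contiguous blocks */
#   pragma omp parallel for num_threads(thread_count) reduction(+: sum)
    for (i = 0; i < n; i++) sum += f(i);
    return sum;
}

double sum_cyclic(int n, int thread_count) {
    double sum = 0.0;
    int i;
    /* cyclic schedule: iterations handed out round-robin, one at a time */
#   pragma omp parallel for num_threads(thread_count) reduction(+: sum) \
        schedule(static, 1)
    for (i = 0; i < n; i++) sum += f(i);
    return sum;
}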
PRODUCERS AND
CONSUMERS
Queues
Can be viewed as an abstraction of a line of
customers waiting to pay for their groceries in a
supermarket.
A natural data structure to use in many
multithreaded applications.
For example, suppose we have several
"producer" threads and several "consumer"
threads.
Producer threads might "produce" requests for data.
Consumer threads might "consume" the request by
finding or generating the requested data.
Message-Passing
Each thread could have a shared message
queue, and when one thread wants to
"send a message" to another thread, it
could enqueue the message in the
destination thread’s queue.
A thread could receive a message by
dequeuing the message at the head of its
message queue.
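A minimal sketch of such a shared message queue (an illustration using OpenMP locks; the names, fixed buffer size, and lack of overflow handling are assumptions, not the book's code):

#include <omp.h>

struct queue {
    int        msgs[128];     /* fixed-size circular buffer (sketch only)  */
    int        head, tail;    /* dequeue from head, enqueue at tail        */
    omp_lock_t lock;          /* protects concurrent enqueue/dequeue       */
};

void send_msg(struct queue *dest, int msg) {
    omp_set_lock(&dest->lock);
    dest->msgs[dest->tail % 128] = msg;   /* no overflow check in this sketch */
    dest->tail++;
    omp_unset_lock(&dest->lock);
}

int try_recv_msg(struct queue *my_q, int *msg) {
    int got = 0;
    omp_set_lock(&my_q->lock);
    if (my_q->head < my_q->tail) {        /* queue not empty */
        *msg = my_q->msgs[my_q->head % 128];
        my_q->head++;
        got = 1;
    }
    omp_unset_lock(&my_q->lock);
    return got;
}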
Message-Passing
Sending Messages
Receiving Messages
Termination Detection
mpiexec -n 1 ./mpi_hello
mpiexec -n 4 ./mpi_hello
MPI_Finalize
Tells MPI we’re done, so clean up anything
allocated for this program.
Basic Outline
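A sketch of what such an mpi_hello.c can look like (an illustration following the structure described in this section; details such as the message length are assumptions):

#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(void) {
    char greeting[100];
    int  comm_sz, my_rank;

    MPI_Init(NULL, NULL);                         /* start up MPI          */
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);      /* number of processes   */
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);      /* my rank among them    */

    if (my_rank != 0) {
        sprintf(greeting, "Greetings from process %d of %d!", my_rank, comm_sz);
        MPI_Send(greeting, strlen(greeting) + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    } else {
        printf("Greetings from process %d of %d!\n", my_rank, comm_sz);
        for (int q = 1; q < comm_sz; q++) {
            MPI_Recv(greeting, 100, MPI_CHAR, q, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("%s\n", greeting);
        }
    }

    MPI_Finalize();                               /* clean up MPI          */
    return 0;
}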
Communicators
A collection of processes that can send
messages to each other.
MPI_Init defines a communicator that
consists of all the processes created when
the program is started.
Called MPI_COMM_WORLD.
Communicators
my rank: the rank of the process making this call within the communicator.
SPMD
Single-Program Multiple-Data
We compile one program.
Process 0 does something different.
Receives messages and prints them while the
other processes do the work.
[Figure: process q sends a message to process r - q calls MPI_Send with dest = r, and r calls MPI_Recv with src = q.]
Receiving messages
A receiver can get a message without
knowing:
the amount of data in the message,
the sender of the message,
or the tag of the message.
status_p argument: a pointer of type MPI_Status* that can be examined to determine the
sender, the tag, and the amount of data in the message actually received.
Unpredictable output: when several processes print, the order in which their output appears is not determined.
Input
Most MPI implementations only allow
process 0 in MPI_COMM_WORLD access
to stdin.
Process 0 must read the data (scanf) and
send to the other processes.
Function for reading user input
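A sketch of such a function (an illustration in which process 0 reads the values and sends them to every other process; the book's actual routine may differ):

#include <stdio.h>
#include <mpi.h>

/* Read a, b, n on process 0 and distribute them to all other processes. */
void get_input(int my_rank, int comm_sz, double *a_p, double *b_p, int *n_p) {
    if (my_rank == 0) {
        printf("Enter a, b, and n\n");
        scanf("%lf %lf %d", a_p, b_p, n_p);
        for (int dest = 1; dest < comm_sz; dest++) {
            MPI_Send(a_p, 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
            MPI_Send(b_p, 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
            MPI_Send(n_p, 1, MPI_INT,    dest, 0, MPI_COMM_WORLD);
        }
    } else {
        MPI_Recv(a_p, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(b_p, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(n_p, 1, MPI_INT,    0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}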
COLLECTIVE
COMMUNICATION
Tree-structured communication
1. In the first phase:
(a) Process 1 sends to 0, 3 sends to 2, 5 sends to 4, and
7 sends to 6.
(b) Processes 0, 2, 4, and 6 add in the received values.
(c) Processes 2 and 6 send their new values to
processes 0 and 4, respectively.
(d) Processes 0 and 4 add the received values into their
new values.
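This tree-structured pattern is exactly what MPI's collective operations implement internally; as a sketch (assuming each process has already computed its partial sum in local_sum), a global sum can be obtained with a single call rather than hand-coded sends and receives:

#include <mpi.h>

/* combine each process's local_sum into total on process 0 */
void global_sum(double local_sum, MPI_Comm comm) {
    double total = 0.0;
    MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, comm);
    /* only process 0 receives the result in total */
}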
The i-th component of y is the dot product of the i-th row of A with x.
Matrix-vector multiplication
Multiply a matrix by a vector
Serial pseudo-code
C-style two-dimensional arrays are stored as a single one-dimensional array, row by row (row-major order).
Serial matrix-vector
multiplication
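A sketch of the serial version (treating A as a flat one-dimensional row-major array, as described above):

/* y = A * x, where A is m x n and stored row-major in a 1-D array */
void mat_vect_mult(const double A[], const double x[], double y[], int m, int n) {
    for (int i = 0; i < m; i++) {
        y[i] = 0.0;
        for (int j = 0; j < n; j++)
            y[i] += A[i * n + j] * x[j];   /* dot product of row i with x */
    }
}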
An MPI matrix-vector
multiplication function (1)
An MPI matrix-vector
multiplication function (2)
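A sketch of an MPI version (an illustration assuming a block-row distribution of A and block distributions of x and y, with the dimensions divisible by the number of processes; not the book's exact listing):

#include <mpi.h>

/* Each process owns local_m rows of A (row-major), and local_n elements of
   x and y.  The full vector x (length n) is gathered on every process
   before the local multiply. */
void mpi_mat_vect_mult(const double local_A[], const double local_x[],
                       double local_y[], double x[],
                       int local_m, int n, int local_n, MPI_Comm comm) {
    /* collect all pieces of x so every process has the whole vector */
    MPI_Allgather(local_x, local_n, MPI_DOUBLE,
                  x,       local_n, MPI_DOUBLE, comm);

    for (int i = 0; i < local_m; i++) {
        local_y[i] = 0.0;
        for (int j = 0; j < n; j++)
            local_y[i] += local_A[i * n + j] * x[j];
    }
}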
MPI DERIVED DATATYPES
Derived datatypes
Used to represent any collection of data items in
memory by storing both the types of the items
and their relative locations in memory.
The idea is that if a function that sends data
knows this information about a collection of data
items, it can collect the items from memory
before they are sent.
Similarly, a function that receives data can
distribute the items into their correct destinations
in memory when they’re received.
Derived datatypes
Formally, a derived datatype consists of a sequence of basic
MPI datatypes together with a
displacement for each of the datatypes.
Trapezoidal Rule example:
MPI_Type_create_struct
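For the Trapezoidal Rule inputs (two doubles and an int), a derived datatype can be built roughly as follows (a sketch; the variable names are assumptions):

#include <mpi.h>

void build_input_type(double *a_p, double *b_p, int *n_p,
                      MPI_Datatype *input_mpi_t) {
    int          blocklengths[3]  = {1, 1, 1};
    MPI_Datatype types[3]         = {MPI_DOUBLE, MPI_DOUBLE, MPI_INT};
    MPI_Aint     displacements[3];
    MPI_Aint     a_addr, b_addr, n_addr;

    /* displacements are measured relative to the address of a */
    MPI_Get_address(a_p, &a_addr);
    MPI_Get_address(b_p, &b_addr);
    MPI_Get_address(n_p, &n_addr);
    displacements[0] = 0;
    displacements[1] = b_addr - a_addr;
    displacements[2] = n_addr - a_addr;

    MPI_Type_create_struct(3, blocklengths, displacements, types, input_mpi_t);
    MPI_Type_commit(input_mpi_t);   /* must commit before use in communication */
}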
[Tables: run-times (in seconds), speedups, and efficiencies of parallel matrix-vector multiplication for various numbers of processes and matrix orders.]
Scalability
A program is scalable if the problem size
can be increased at a rate such that the
efficiency doesn’t decrease as the number
of processes increases.
Scalability
Programs that can maintain a constant
efficiency without increasing the problem
size are sometimes said to be strongly
scalable. Programs that can maintain a constant efficiency if the problem size increases
at the same rate as the number of processes are sometimes said to be weakly scalable.
Introduction
Many physical phenomena directly or indirectly (when solving a discrete version of a
continuous problem) involve, or can be simulated with particle systems, where each
particle interacts with all other particles according to the laws of physics. Examples
include the gravitational interaction among the stars in a galaxy or the Coulomb forces
exerted by the atoms in a molecule. The challenge of efficiently carrying out the related
calculations is generally known as the N-body problem.
Mathematically, the N-body problem can be formulated as

U(x_0) = \sum_i F(x_0, x_i)    (1)

where U(x_0) is a physical quantity at x_0 which can be obtained by summing the pairwise
interactions F(x_0, x_i) over the particles of the system. For instance, assume a system of
N particles, located at x_i and having mass m_i. The gravitational force exerted on a
particle at x having mass m is then expressed as

F(x) = \sum_{i=1}^{N} G\, m\, m_i \, \frac{x - x_i}{\|x - x_i\|^3}    (2)

where G is the gravitational constant.
The task of evaluating the function U(x_0) for all N particles using (1) requires O(n)
operations for each particle, resulting in a total complexity of O(n^2). In this paper we
will see how this complexity can be reduced to O(n log n) or O(n) by using efficient
methods to approximate the sum in the right hand term of (1), while still preserving
such important physical properties as energy and momentum.
Then, for each timestep from time t_k to t_{k+1} := t_k + \Delta t we need to integrate the right hand
term of equation (3) in order to obtain the change in position:

\Delta x_j = \iint_{[t_k,\, t_{k+1}]} F(x_j(t)) \, dt \, dt    (4)

where

F(x_j) = \sum_{i=1}^{N} G\, m_i \, \frac{x_j - x_i}{\|x_j - x_i\|^3}    (5)

(4) is a somewhat difficult integral equation, since x_j is present on both sides. Also, x_i
is dependent on t, which means we have a system of N coupled integral equations for
each time step.
A discrete version of (4) (which can be obtained by making certain assumptions) has
the general form

\Delta x_j = \sum_{i=1}^{k} c_i \, F(x_j(t + h_i)), \quad k < \infty    (6)

and is thus a linear combination of the function F evaluated at different time points;
different discrete integration schemes yield different coefficients c_i and h_i. A commonly
used integrator is the so-called Leapfrog integration scheme.
We can now formulate an algorithm for the simulation:
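A sketch of such a simulation loop (an illustrative direct-summation version in C with a simple velocity-then-position update rather than the full Leapfrog scheme; the data layout and constants are assumptions, not the paper's code):

#include <math.h>
#include <stdlib.h>

#define G 6.674e-11

typedef struct { double x[3], v[3], m; } particle;

/* advance n particles by nsteps timesteps of size dt (direct O(n^2) summation) */
void simulate(particle p[], int n, int nsteps, double dt) {
    double (*acc)[3] = malloc(n * sizeof *acc);   /* scratch accelerations */

    for (int step = 0; step < nsteps; step++) {
        /* 1. compute the acceleration of every particle from all others */
        for (int j = 0; j < n; j++) {
            acc[j][0] = acc[j][1] = acc[j][2] = 0.0;
            for (int i = 0; i < n; i++) {
                if (i == j) continue;
                double d[3], r2 = 0.0;
                for (int k = 0; k < 3; k++) {
                    d[k] = p[i].x[k] - p[j].x[k];
                    r2 += d[k] * d[k];
                }
                double inv_r3 = 1.0 / (r2 * sqrt(r2));
                for (int k = 0; k < 3; k++)
                    acc[j][k] += G * p[i].m * d[k] * inv_r3;
            }
        }
        /* 2. integrate: update every velocity, then every position */
        for (int j = 0; j < n; j++)
            for (int k = 0; k < 3; k++) {
                p[j].v[k] += dt * acc[j][k];
                p[j].x[k] += dt * p[j].v[k];
            }
    }
    free(acc);
}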
Tree codes
Let a be the radius of the smallest disc, call it D, such that the set of particles
P := (x_i1, ..., x_iN) is inside the disc. Many physical systems have the property that the
field U(x) generated by the particle set P may be very complex inside D, but
smooth (“low on information content”) at some distance c⋅a from D. The gravitational
force, for instance, has this property.
This observation is used in the so-called tree code approach to the N-body problem:
clusters of particles at sufficient distance from the “target particle” x_0 of equation (1) are
given a computationally simpler representation in order to speed up the summation. This
approach can be illustrated by the following example: when calculating the
gravitational force exerted by earth on an orbiting satellite, we do not try to sum the
gravitational forces of all atoms that constitute planet earth. Instead we approximate
the force with that of an infinitesimal particle, which has the same mass as earth and is
located at earth’s center of mass.
[Figure: the particle set P, with an evaluation point x_0 at distance ≥ c·a from it.]
Figure 2: Quadtree.
We now construct a quadtree, so that each leaf node contains only 1 particle. This is
done by recursively subdividing the computational box; each node is further subdivided
if it contains more than one particle.
Figure 3 Adaptive quadtree with one particle/leaf. The picture is from [Demmel1]
Assuming the particles are not at arbitrarily small distances from each other (at least
the machine precision sets a limit), a quadtree can be built in O(n min(b, log n)) time,
where b is the machine precision.
Now assume that the distance between a cluster and a particle must be at least the
length of the side of the cluster box in order to obtain an accurate approximation.
When calculating the force on a particle x0, the tree is recursively traversed from the
root. At each level, there may be no more than 9 boxes (the ones surrounding the box
containing the particle) which need further subdivision, limiting the number of force
calculations on the next level to 27 (= 2*3*2*3 - 9, see Figure 4). Thus at each level a
maximum of 27 O(1) operations are performed. The depth of the tree is min(b, log(n)),
yielding a total complexity (for all N particles) of O(n min(b, log n)).
[Figure 4: the grid of quadtree boxes used to compute the force on the particle marked X for θ = 1.0; the numbers 1, 2, 3 indicate the relative depth of each box, with large, shallow boxes far from X and small, deep boxes close to it.]
Tree codes thus reduce the computational complexity from O(n^2) to O(n log n) or
O(n) depending on your point of view - certainly a vast improvement! But as the
saying goes, there’s no such thing as a free lunch: tree codes are less accurate than
simple PP, and require more auxiliary storage.
Step 2
Step 2 calculates the approximations for the long-range force. The approximation is
made by considering several particles as one, with a position equal to the center of
mass of the approximated particles, and a mass the sum of the approximated particles’
masses. More formally, to find the mass and position associated with a node N:
calculate_approximations( N )
if N is a leaf node
return; // Node has a (real) particle => has mass & position
for all children n of N do
calculate_approximations( n )
M := 0
cm := (0,0)
for all children n of N do
M := M + mass of n
cm := cm + mass of n * position of n
endfor
cm := 1/M * cm
mass of N := M
position of N := cm
end
Step 3
Consider the ratio

\theta = \frac{D}{r}    (7)
where D is the size of the current node (call it A) “box” and r is the distance to the
center of mass of another node (called B). If this ratio is sufficiently small, we can use
the center of mass and mass of B to compute the force in A. If this is not the case, we
need to go to the children of B and do the same test. Figure 4 shows this for θ = 1.0;
the numbers indicate the relative depth of the nodes. It can clearly be seen that further
away from the particle x, large nodes are used, and closer to it, smaller ones. The method is
accurate to approximately 1% with θ = 1.0. Expressed in pseudocode:
treeForce(x, N)
if N is a leaf or size(N)/|x-N|<θ
return force(x,N)
else
F := 0
for all children n of N do
F := F + treeForce(x, n)
endfor
return F
endif
end
[Figure: a mesh with mesh points and numbered mesh cells; particles m_i are marked with ×.]
[Figure: a chaining mesh of mesh cells, with the cutoff radius r_e drawn around x_0.]
Figure 7: The expansion of the particles z_i converges outside the outer box (a box of side D surrounded by an outer box of side 3D).
[Figure: boxes B1, B2, B3 and B4 surrounding a box A, each related to A by an outer_shift.]
A "binary search tree" (BST) or "ordered binary tree" is a type of binary tree where the nodes are arranged in order:
for each node, all elements in its left subtree are less-or-equal to the node (<=), and all the elements in its right
subtree are greater than the node (>). The tree shown above is a binary search tree -- the "root" node is a 5, and its
left subtree nodes (1, 3, 4) are <= 5, and its right subtree nodes (6, 9) are > 5. Recursively, each of the subtrees must
also obey the binary search tree constraint: in the (1, 3, 4) subtree, the 3 is the root, the 1 <= 3 and 4 > 3. Watch out
for the exact wording in the problems -- a "binary search tree" is different from a "binary tree".
The nodes at the bottom edge of the tree have empty subtrees and are called "leaf" nodes (1, 4, 6) while the others
are "internal" nodes (3, 5, 9).
Basically, binary search trees are fast at insert and lookup. The next section presents the code for these two
algorithms. On average, a binary search tree algorithm can locate a node in an N node tree in order lg(N) time (log
base 2). Therefore, binary search trees are good for "dictionary" problems where the code inserts and looks up
information indexed by some key. The lg(N) behavior is the average case -- it's possible for a particular tree to be
much slower depending on its shape.
Strategy
Some of the problems in this article use plain binary trees, and some use binary search trees. In any case, the
problems concentrate on the combination of pointers and recursion. (See the articles linked above for pointer articles
that do not emphasize recursion.)
Each of these problems involves two things: the node/pointer structure that makes up the tree and the code
that manipulates it, and the algorithm, typically recursive, that iterates over the tree.
When thinking about a binary tree problem, it's often a good idea to draw a few little trees to think about the
various cases.
Typical Binary Tree Code in C/C++
As an introduction, we'll look at the code for the two most basic binary search tree operations -- lookup() and
insert(). The code here works for C or C++.
In C or C++, the binary tree is built with a node type like this...
struct node {
    int data;
    struct node* left;
    struct node* right;
};
Lookup()
Given a binary search tree and a "target" value, search the tree to see if it contains the target. The basic pattern of
the lookup() code occurs in many recursive tree algorithms: deal with the base case where the tree is empty, deal
with the current node, and then use recursion to deal with the subtrees. If the tree is a binary search tree, there is
often some sort of less-than test on the node to decide if the recursion should go left or right.
/*
Given a binary tree, return true if a node
with the target data is found in the tree. Recurs
down the tree, chooses the left or right
branch by comparing the target to each node.
*/
static int lookup(struct node* node, int target) {
// 1. Base case == empty tree
// in that case, the target is not found so return false
if (node == NULL) {
return(false);
}
else {
// 2. see if found here
if (target == node->data) return(true);
else {
// 3. otherwise recur down the correct subtree
if (target < node->data) return(lookup(node->left, target));
else return(lookup(node->right, target));
}
}
}
The lookup() algorithm could be written as a while-loop that iterates down the tree. Our version uses recursion to
help prepare you for the problems below that require recursion.
There is a common problem with pointer intensive code: what if a function needs to change one of the pointer
parameters passed to it? For example, the insert() function below may want to change the root pointer. In C and
C++, one solution uses pointers-to-pointers (aka "reference parameters"). That's a fine technique, but here we will
use the simpler technique that a function that wishes to change a pointer passed to it will return the new value of
the pointer to the caller. The caller is responsible for using the new value. Suppose we have a change() function
that may change the root; a call to change() will then look like this...
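(The snippet itself appears to be missing here; a minimal sketch, assuming change() takes the old root pointer and returns the new one:)
// "root" points to the current tree; change() may hand back a different node
root = change(root);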
We take the value returned by change(), and use it as the new value for root. This construct is a little awkward, but
it avoids using reference parameters which confuse some C and C++ programmers, and Java does not have reference
parameters at all. This allows us to focus on the recursion instead of the pointer mechanics.
Insert()
Insert() -- given a binary search tree and a number, insert a new node with the given number into the tree in the
correct place. The insert() code is similar to lookup(), but with the complication that it modifies the tree structure.
As described above, insert() returns the new tree pointer to use to its caller. Calling insert() with the number 5 on
this tree...
   2
  / \
 1   10
returns the tree...
   2
  / \
 1   10
     /
    5
The solution shown here introduces a newNode() helper function that builds a single node. The base-case/recursion
structure is similar to the structure in lookup() -- each call checks for the NULL case, looks at the node at hand, and
then recurs down the left or right subtree if needed.
/*
Helper function that allocates a new node
with the given data and NULL left and right
pointers.
*/
struct node* newNode(int data) {
  struct node* node = new(struct node); // "new" is like "malloc"; in plain C use malloc(sizeof(struct node))
  node->data = data;
  node->left = NULL;
  node->right = NULL;
  return(node);
}
/*
Given a binary search tree and a number, inserts a new node
with the given number in the correct place in the tree.
Returns the new root pointer which the caller should
then use (the standard trick to avoid using reference
parameters).
*/
struct node* insert(struct node* node, int data) {
  // 1. If the tree is empty, return a new, single node
  if (node == NULL) {
    return(newNode(data));
  }
  else {
    // 2. Otherwise, recur down the tree
    if (data <= node->data) node->left = insert(node->left, data);
    else node->right = insert(node->right, data);
    return(node); // return the (unchanged) node pointer
  }
}
The shape of a binary tree depends very much on the order that the nodes are inserted. In particular, if the nodes
are inserted in increasing order (1, 2, 3, 4), the tree nodes just grow to the right leading to a linked list shape where
all the left pointers are NULL. A similar thing happens if the nodes are inserted in decreasing order (4, 3, 2, 1). The
linked list shape defeats the lg(N) performance. We will not address that issue here, instead focusing on pointers
and recursion.
Reading about a data structure is a fine introduction, but at some point the only way to learn is to actually try to
solve some problems starting with a blank sheet of paper. To get the most out of these problems, you should at least
attempt to solve them before looking at the solution. Even if your solution is not quite right, you will be building up
the right skills. With any pointer-based code, it's a good idea to make memory drawings of a few simple cases to
see how the algorithm should work.
build123()
This is a very basic problem with a little pointer manipulation. (You can skip this problem if you are already
comfortable with pointers.) Write code that builds the following little 1-2-3 binary search tree...
  2
 / \
1   3
Write the code in three different ways: (a) by calling newNode() three times and using three pointer variables, (b) by calling newNode() three times and using only one pointer variable, and (c) by calling insert() three times, passing it the root pointer. (In Java, write a build123() method that operates on the receiver to change it to be the 1-2-3 tree, given the coding constraints.)
size()
This problem demonstrates simple binary tree traversal. Given a binary tree, count the number of nodes in the tree.
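The C solution for size() is not among the solution fragments later in this article; a minimal sketch of the usual recursive count:
// a sketch: an empty tree has 0 nodes, otherwise 1 plus the sizes of the subtrees
int size(struct node* node) {
  if (node == NULL) return(0);
  else return(size(node->left) + 1 + size(node->right));
}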
maxDepth()
Given a binary tree, compute its "maxDepth" -- the number of nodes along the longest path from the root node down
to the farthest leaf node. The maxDepth of the empty tree is 0, the maxDepth of the tree on the first page is 3.
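The C solution for maxDepth() is likewise not included below; a minimal sketch:
// a sketch: compute the depth of each subtree and use the larger one, plus 1 for the root
int maxDepth(struct node* node) {
  if (node == NULL) return(0);
  else {
    int lDepth = maxDepth(node->left);
    int rDepth = maxDepth(node->right);
    return((lDepth > rDepth ? lDepth : rDepth) + 1);
  }
}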
minValue()
Given a non-empty binary search tree (an ordered binary tree), return the minimum data value found in that tree.
Note that it is not necessary to search the entire tree. A maxValue() function is structurally very similar to this
function. This can be solved with recursion or with a simple while loop.
printTree()
Given a binary search tree (aka an "ordered binary tree"), iterate over the nodes to print them out in increasing
order. So the tree...
    4
   / \
  2   5
 / \
1   3
Produces the output "1 2 3 4 5". This is known as an "inorder" traversal of the tree.
Hint: For each node, the strategy is: recur left, print the node data, recur right.
printPostorder()
Given a binary tree, print out the nodes of the tree according to a bottom-up "postorder" traversal -- both subtrees of
a node are printed out completely before the node itself is printed, and each left subtree is printed before the right
subtree. So the tree...
    4
   / \
  2   5
 / \
1   3
Produces the output "1 3 2 5 4". The description is complex, but the code is simple. This is the sort of bottom-up
traversal that would be used, for example, to evaluate an expression tree where a node is an operation like '+' and
its subtrees are, recursively, the two subexpressions for the '+'.
void printPostorder(struct node* node) {
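  // (Completion sketch -- the rest of this solution appears to have been lost
  //  in extraction: visit both subtrees first, then print the node itself.)
  if (node == NULL) return;
  printPostorder(node->left);
  printPostorder(node->right);
  printf("%d ", node->data);
}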
hasPathSum()
We'll define a "root-to-leaf path" to be a sequence of nodes in a tree starting with the root node and proceeding
downward to a leaf (a node with no children). We'll say that an empty tree contains no root-to-leaf paths. So for
example, the following tree has exactly four root-to-leaf paths:
        5
       / \
      4   8
     /   / \
    11  13  4
   /  \      \
  7    2      1
Root-to-leaf paths:
path 1: 5 4 11 7
path 2: 5 4 11 2
path 3: 5 8 13
path 4: 5 8 4 1
For this problem, we will be concerned with the sum of the values of such a path -- for example, the sum of the
values on the 5-4-11-7 path is 5 + 4 + 11 + 7 = 27.
Given a binary tree and a sum, return true if the tree has a root-to-leaf path such that adding up all the values
along the path equals the given sum. Return false if no such path can be found. (Thanks to Owen Astrachan for
suggesting this problem.)
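No solution code for hasPathSum() survives in the fragments below; one possible recursive sketch (not necessarily the article's own solution) checks the remaining sum at each leaf:
// a sketch: subtract each node's value from the target sum on the way down
int hasPathSum(struct node* node, int sum) {
  if (node == NULL) return(false); // an empty tree has no root-to-leaf paths
  // at a leaf, the path works exactly when the remaining sum is this node's value
  if (node->left == NULL && node->right == NULL) return(sum == node->data);
  // otherwise try both subtrees with the node's value removed from the sum
  return(hasPathSum(node->left, sum - node->data) ||
         hasPathSum(node->right, sum - node->data));
}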
printPaths()
Given a binary tree, print out all of its root-to-leaf paths as defined above. This problem is a little harder than it
looks, since the "path so far" needs to be communicated between the recursive calls. Hint: In C, C++, and Java,
probably the best solution is to create a recursive helper function printPathsRecur(node, int path[], int pathLen),
where the path array communicates the sequence of nodes that led up to the current call. Alternately, the problem
may be solved bottom-up, with each node returning its list of paths. This strategy works quite nicely in Lisp, since
it can exploit the built in list and mapping primitives. (Thanks to Matthias Felleisen for suggesting this problem.)
Given a binary tree, print out all of its root-to-leaf paths, one per line.
mirror()
Change a tree so that the roles of the left and right pointers are swapped at every node.
So the tree...
    4
   / \
  2   5
 / \
1   3
is changed to...
    4
   / \
  5   2
     / \
    3   1
The solution is short, but very recursive. As it happens, this can be accomplished without changing the root node
pointer, so the return-the-new-root construct is not necessary. Alternately, if you do not want to change the tree
nodes, you may construct and return a new mirror tree based on the original tree.
doubleTree()
For each node in a binary search tree, create a new duplicate node, and insert the duplicate as the left child of the
original node. The resulting tree should still be a binary search tree.
So the tree...
  2
 / \
1   3
is changed to...
      2
     / \
    2   3
   /   /
  1   3
 /
1
As with the previous problem, this can be accomplished without changing the root node pointer.
sameTree()
Given two binary trees, return true if they are structurally identical -- they are made of nodes with the same values
arranged in the same way. (Thanks to Julie Zelenski for suggesting this problem.)
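The C solution for sameTree() is not among the fragments below; a minimal sketch:
// a sketch: two trees are the same if both are empty, or if the roots match
// and both pairs of subtrees are recursively the same
int sameTree(struct node* a, struct node* b) {
  if (a == NULL && b == NULL) return(true);   // both empty
  if (a == NULL || b == NULL) return(false);  // one empty, one not
  return(a->data == b->data &&
         sameTree(a->left, b->left) &&
         sameTree(a->right, b->right));
}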
countTrees()
This is not a binary tree programming problem in the ordinary sense -- it's more of a math/combinatorics recursion
problem that happens to use binary trees. (Thanks to Jerry Cain for suggesting this problem.)
Suppose you are building an N node binary search tree with the values 1..N. How many structurally different
binary search trees are there that store those values? Write a recursive function that, given the number of distinct
values, computes the number of structurally unique binary search trees that store those values. For example,
countTrees(4) should return 14, since there are 14 structurally unique binary search trees that store 1, 2, 3, and 4. The
base case is easy, and the recursion is short but dense. Your code should not construct any actual trees; it's just a
counting problem.
This background is used by the next two problems: Given a plain binary tree, examine the tree to determine if it
meets the requirement to be a binary search tree. To be a binary search tree, for every node, all of the nodes in its
left subtree must be <= the node, and all of the nodes in its right subtree must be > the node. Consider the following four
examples...
a.   5   -> TRUE
    / \
   2   7

b.   5   -> FALSE, because the 6 is not ok to the left of the 5
    / \
   6   7

c.   5   -> TRUE
    / \
   2   7
  /
 1

d.   5   -> FALSE, the 6 is ok with the 2, but the 6 is not ok with the 5
    / \
   2   7
  / \
 1   6
For the first two cases, the right answer can be seen just by comparing each node to the two nodes immediately
below it. However, the fourth case shows how checking the BST quality may depend on nodes which are several
layers apart -- the 5 and the 6 in that case.
isBST() -- version 1
Suppose you have helper functions minValue() and maxValue() that return the min or max int value from a
non-empty tree (see problem 3 above). Write an isBST() function that returns true if a tree is a binary search tree
and false otherwise. Use the helper functions, and don't forget to check every node in the tree. It's ok if your
solution is not very efficient. (Thanks to Owen Astrachan for the idea of having this problem, and comparing it to
problem 14)
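The version-1 solution is not reproduced below; a sketch of the simple-but-slow approach, assuming a maxValue() helper that mirrors minValue():
// a sketch -- correct but inefficient, since minValue()/maxValue() rescan subtrees
int isBST(struct node* node) {
  if (node == NULL) return(true); // an empty tree is a BST
  // false if the largest value on the left is bigger than this node
  if (node->left != NULL && maxValue(node->left) > node->data) return(false);
  // false if the smallest value on the right is <= this node
  if (node->right != NULL && minValue(node->right) <= node->data) return(false);
  // false if either subtree is itself not a BST
  if (!isBST(node->left) || !isBST(node->right)) return(false);
  // passing all that, it's a BST
  return(true);
}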
isBST() -- version 2
Version 1 above runs slowly since it traverses over some parts of the tree many times. A better solution looks at each
node only once. The trick is to write a utility helper function isBSTRecur(struct node* node, int min, int max) that
traverses down the tree keeping track of the narrowing min and max allowed values as it goes, looking at each node
only once. The initial values for min and max should be INT_MIN and INT_MAX -- they narrow from there.
/*
Returns true if the given tree is a binary search tree
(efficient version).
*/
int isBST2(struct node* node) {
  return(isBSTRecur(node, INT_MIN, INT_MAX));
}
/*
Returns true if the given tree is a BST and its
values are >= min and <= max.
*/
int isBSTRecur(struct node* node, int min, int max) {
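  // (Completion sketch -- the body appears to have been lost in extraction.)
  if (node == NULL) return(true); // an empty tree is a BST
  // false if this node violates the allowed min/max range
  if (node->data < min || node->data > max) return(false);
  // otherwise check the subtrees, tightening the range:
  // the left subtree must be <= this node, the right subtree must be > it
  return(isBSTRecur(node->left, min, node->data) &&
         isBSTRecur(node->right, node->data + 1, max));
}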
Tree-List
The Tree-List problem is one of the greatest recursive pointer problems ever devised, and it happens to use binary
trees as well. CSLibrary works through the Tree-List problem in detail and includes solution code in C and Java.
The problem requires an understanding of binary trees, linked lists, recursion, and pointers. It's a great problem,
but it's complex.
C/C++ Solutions
Make an attempt to solve each problem before looking at the solution -- it's the best way to learn.
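The opening of the first solution, build123a(), appears to have been lost in extraction; a minimal reconstruction (an assumption based on the fragment that follows and on the problem statement -- call newNode() three times and use three pointer variables):
// call newNode() three times, and use three pointer variables
struct node* build123a() {
  struct node* root = newNode(2);
  struct node* lChild = newNode(1);
  struct node* rChild = newNode(3);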
  root->left = lChild;
  root->right = rChild;
  return(root);
}
// call newNode() three times, and use only one local variable
struct node* build123b() {
  struct node* root = newNode(2);
  root->left = newNode(1);
  root->right = newNode(3);
  return(root);
}
/*
Build 123 by calling insert() three times.
Note that the '2' must be inserted first.
*/
struct node* build123c() {
  struct node* root = NULL;
  root = insert(root, 2);
  root = insert(root, 1);
  root = insert(root, 3);
  return(root);
}
/*
Given a non-empty binary search tree,
return the minimum data value found in that tree.
Note that the entire tree does not need to be searched.
*/
int minValue(struct node* node) {
  // loop down to find the leftmost leaf
  struct node* current = node;
  while (current->left != NULL) {
    current = current->left;
  }
  return(current->data);
}
void printTree(struct node* node) {
  if (node == NULL) return;
  printTree(node->left);
  printf("%d ", node->data);
  printTree(node->right);
}
/*
Recursive helper function -- given a node, and an array containing
the path from the root node up to but not including this node,
print out all the root-leaf paths.
*/
void printPathsRecur(struct node* node, int path[], int pathLen) {
  if (node == NULL) return;
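  // (Completion sketch -- the remainder of this solution appears to have been
  //  lost in extraction. printPath() is assumed to be a small helper that
  //  prints the ints in path[0..pathLen-1] on one line.)
  path[pathLen] = node->data; // append this node to the path array
  pathLen++;
  if (node->left == NULL && node->right == NULL) {
    // it's a leaf, so print the path that led to here
    printPath(path, pathLen);
  }
  else {
    // otherwise try both subtrees
    printPathsRecur(node->left, path, pathLen);
    printPathsRecur(node->right, path, pathLen);
  }
}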
/*
 Change a tree so that the roles of the left and right
 pointers are swapped at every node.
 So the tree...
     4
    / \
   2   5
  / \
 1   3
 is changed to...
     4
    / \
   5   2
      / \
     3   1
*/
void mirror(struct node* node) {
  if (node == NULL) {
    return;
  }
  else {
    struct node* temp;
    // do the subtrees
    mirror(node->left);
    mirror(node->right);
    // swap the pointers in this node
    temp = node->left;
    node->left = node->right;
    node->right = temp;
  }
}
/*
 For each node in a binary search tree, create a new duplicate node,
 and insert the duplicate as the left child of the original node.
 The resulting tree should still be a binary search tree.
 So the tree...
   2
  / \
 1   3
 Is changed to...
       2
      / \
     2   3
    /   /
   1   3
  /
 1
*/
void doubleTree(struct node* node) {
  struct node* oldLeft;
  if (node == NULL) return;
  // do the subtrees
  doubleTree(node->left);
  doubleTree(node->right);
  // duplicate this node to its left
  oldLeft = node->left;
  node->left = newNode(node->data);
  node->left->left = oldLeft;
}
/*
 For the key values 1...numKeys, how many structurally unique
 binary search trees are possible that store those keys?
*/
int countTrees(int numKeys) {
  if (numKeys <= 1) {
    return(1);
  }
  else {
    // there will be one value at the root, with whatever remains
    // on the left and right each forming their own subtrees.
    // Iterate through all the values that could be the root...
    int sum = 0;
    int left, right, root;
    for (root = 1; root <= numKeys; root++) {
      left = countTrees(root - 1);
      right = countTrees(numKeys - root);
      // the number of possible trees with this root == left * right
      sum += left * right;
    }
    return(sum);
  }
}
MPI
The Message Passing Interface, MPI[12], is a controlled API standard for programming a
wide array of parallel architectures. Though MPI was originally intended for classic
distributed-memory architectures, it is used on architectures ranging from networks of PCs,
via large shared-memory systems such as the SGI Origin 2000, to massively parallel
architectures such as the Cray T3D and Intel Paragon. The complete MPI API offers 186
operations, which makes it a rather complex programming API. However, most MPI
applications use only six to ten of the available operations.
MPI is intended for the Single Program Multiple Data (SPMD) programming paradigm
– all nodes run the same application-code. The SPMD paradigm is efficient and easy to use
for a large set of scientific applications with a regular execution pattern. Other, less regular,
applications are far less suited to this paradigm and implementation in MPI is tedious.
MPI's point-to-point communication comes in four shapes: standard, ready, synchronous
and buffered. A standard-send operation does not return until the send buffer has been
copied, either to another buffer below the MPI layer or to the network interface card (NIC).
A ready-send operation is not initiated until the addressed process has initiated a
corresponding receive operation. The synchronous call sends the message, but does not
return until the receiver has initiated a read of the message. The fourth model, the buffered
send, copies the message to a buffer in the MPI layer and then allows the application to
continue. Each of the four models also comes in an asynchronous (in MPI called non-
blocking) mode. The non-blocking calls return immediately, and it is the programmer's
responsibility to check that the send has completed before overwriting the buffer. Likewise,
a non-blocking receive exists, which returns immediately; the programmer needs to
ensure that the receive operation has finished before using the data.
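As an illustration (a sketch, not taken from the benchmarked codes), a non-blocking transfer in C looks roughly like this: the sender starts an MPI_Isend, may overlap computation with the transfer, and must call MPI_Wait before reusing the buffer.
#include <mpi.h>

/* Sketch: rank 0 sends a buffer to rank 1 without blocking; neither rank may
   touch the buffer again until its MPI_Wait has completed. */
void nonblocking_example(int rank) {
  double buf[1024];
  MPI_Request req;
  MPI_Status status;
  int i;

  if (rank == 0) {
    for (i = 0; i < 1024; i++) buf[i] = i; /* fill the buffer before sending */
    MPI_Isend(buf, 1024, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
    /* ... unrelated computation may overlap the transfer here ... */
    MPI_Wait(&req, &status); /* now it is safe to overwrite buf */
  } else if (rank == 1) {
    MPI_Irecv(buf, 1024, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, &status); /* now it is safe to read buf */
  }
}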
MPI supports both group broadcasting and global reductions. Being SPMD, all nodes
have to meet at a group operation, i.e. a broadcast operation blocks until all the processes in
the context have issued the broadcast operation. This is important because it turns all group-
operations into synchronization points in the application. The MPI API also supports
scatter-gather for easy exchange of large data-structures and virtual architecture topologies,
which allow source-code compatible MPI applications to execute efficiently across
different platforms.
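For example (again a sketch rather than code from the benchmarks), a broadcast must be issued by every process in the communicator, with the root rank supplying the data:
#include <mpi.h>

/* Sketch: every rank calls MPI_Bcast; rank 0 supplies the value and all ranks
   leave the call holding the same value, so the call also acts as a meeting
   point for the group. */
void broadcast_example(void) {
  int value = 0;
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0) value = 42; /* the root fills in the data to distribute */
  MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
  /* here every rank sees value == 42 */
}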
Experiment Environment
Cluster
The cluster comprises 51 Dell Precision Workstation 360s, each with a 3.2GHz Intel
Prescott processor, 2GB RAM and a 120GB Serial ATA hard-disk. The nodes are
connected using Gigabit Ethernet over two HP Procurve 2848 switches. 32 nodes are
connected to the first switch, and 19 nodes to the second switch. The two switches are
trunked with 4 copper cables, providing 4Gbit/s bandwidth between the switches, see
Figure 1. The nodes are running RedHat Linux 9 with a patched Linux 2.4.26 kernel to
support Serial ATA. Hyperthreading is switched on, and Linux is configured for Symmetric
Multiprocessor support.
Note on the hardware: the computers have a motherboard with Intel's 875P chipset. The
chipset supports Gigabit Ethernet over Intel's CSA (Communication Streaming
Architecture) bus, but Dell's implementation of the motherboards uses an Intel 82540EM
Gigabit Ethernet controller connected to the PCI bus instead.
Note on trunking: trunking is a method where traffic between two switches is load-balanced
across a set of links in order to provide a higher available bandwidth between the switches.
[Figure 1: Cluster topology - the two switches are trunked with 4 copper cables.]
MPICH
MPICH is the official reference implementation of MPI and has a strong focus on
portability. MPICH is available for all UNIX flavors and for Windows; a special GRID-enabled
version, MPICH-G2, is available for Globus[11]. Many of the MPI
implementations for specialized hardware, i.e. cluster interconnects, are based on MPICH.
MPICH version 1.2.52 is used for the experiments below.
LAM-MPI
MESH-MPI
MESH-MPI has only just been released, so the results presented here are brand-new. MESH-MPI
is 'yet another commercial MPI', but with a strong focus on performance rather than
simply improved support over the open-source versions. In addition to improved
performance, MESH-MPI also promotes true non-blocking operations, thread safety, and
scalable collective operations. Future versions are announced to add a special Low
Latency Communication library (LLC) and a Runtime Data Dependency Analysis (RDDA)
facility to schedule communication. These features are not available in the current
version, 1.0a.
Benchmarks
This section describes the benchmark suites we have chosen for examining the performance
of the three MPI implementations. One suite, Pallas, is a micro-benchmark suite, which
gives a lot of information about the performance of the different MPI functions, while the
other, NPB, is an application/kernel suite, which describes the application level
performance. The NPB suite originates from NASA and is used as the basis for deciding on
new systems at NASA. This benchmark tests both the processing power of the system and
the communication performance.
Pallas Benchmark Suite
The Pallas benchmark suite[9] from Pallas GmbH measures the performance of individual
MPI functions rather than application-level performance. The results can thus be used in
two ways: either to choose an MPI implementation that performs well for the operations
one uses, or to determine which operations perform poorly on the available MPI
implementation so that one can avoid them when coding applications. The tests/operations
that are run in Pallas are:
• PingPong
The time it takes to pass a message between two processes and back
• PingPing
The time it takes to send a message from one process to another
• SendRecv
The time it takes to send and receive a message in parallel
• Exchange
The time it takes to exchange contents of two buffers
• Allreduce
The time it takes to create a common result, i.e. a global sum
• Reduce
The same as Allreduce but the result is delivered to only one process
• Reduce Scatter
The same as Reduce but the result is distributed amongst the processes
• Allgather
The time it takes to collect partial results from all processes and deliver the data
to all processes
• Allgatherv
Same as Allgather, except that the partial results need not have the same size
• Alltoall
The time it takes for all processes to send data to all other processes and receive
from all other processes – the data sent is unique to each receiver
• Bcast
The time it takes to deliver a message to all processes
NAS Parallel Benchmarks (NPB)
NPB is available for threaded, OpenMP and MPI systems, and we naturally run the MPI
version. NPB is available with five different data-sets: A through D, plus W, which is for
workstations only. We use dataset C, since D won't fit on the cluster and C is the most
widely reported dataset.
The application kernels in NPB are:
• MG – Multigrid
• CG – Conjugate Gradient
• FT – Fast Fourier Transform
• IS – Integer Sort
• EP – Embarrassingly Parallel
• BT – Block Tridiagonal
• SP – Scalar Pentadiagonal
• LU – Lower Upper Gauss-Seidel
Results
In this section we present and analyze the results of running the benchmarks from section 3
on the systems described in section 2. All the Pallas benchmarks are run on 32 CPUs (they
require a power-of-two number of processes), as are the NPB benchmarks, except BT and SP,
which are run on 36 CPUs (they require a square number of processes).
First in the Pallas benchmark are the point-to-point experiments. The extreme case is the
concurrent SendRecv experiment, where MPICH takes more than 12 times longer than
MESH-MPI; otherwise all three implementations are fairly close. MPICH performs worse than
the other two, and the commercial MESH-MPI loses only on the PingPing experiment.
The seemingly large differences on PingPong and PingPing are less significant than they
appear: they result from the interrupt throttling rate on the Intel Ethernet chipsets, which,
when set at the recommended 8000, discretises latencies into chunks of 125us. The
difference between 62.5us and 125us would therefore probably be much smaller on other
Ethernet chipsets.
[Figure: Pallas point-to-point results for small messages - time (us) for PingPong, PingPing, SendRecv and Exchange with MESH, LAM and MPICH.]
[Figure: Pallas point-to-point results for large messages - time (us) for PingPong, PingPing, SendRecv and Exchange with MESH, LAM and MPICH; the largest values (363107 us and 746178 us) extend beyond the chart scale.]
In the collective operations, the small-data case is tested with 8B (eight bytes) rather than 0B,
because 0B group-operations are often not performed at all and the resulting times are
reported in the 0.05us range. Thus, to test the performance on small packets, we use the size
of a double-precision number. The results are shown in Figure 4.
In the collective operations, the extreme case is Allgatherv using LAM-MPI which
reports a whopping 4747us or 11 times longer than when using MESH-MPI. Except for the
Alltoall benchmark, where LAM-MPI is fastest, MESH-MPI is consistently the fastest,
and for most experiments, the advantage is significant, measured in multiples rather than
percentages. The Bcast operation, which is a frequently used operation in many
applications, shows MESH-MPI to be 7 times faster than MPICH and 12 times faster than
LAM-MPI.
[Figure 4: Pallas collective-operation results for small messages (8B) - time (us) for Allreduce, Reduce, Reduce Scatter, Allgather, Allgatherv, Alltoall and Bcast with MESH, LAM and MPICH; the largest values (2800 us and 4747 us) extend beyond the chart scale.]
[Figure: Pallas collective-operation results for large messages - time (us) for Allreduce, Reduce, Reduce Scatter and Bcast with MESH, LAM and MPICH.]
[Figure: Pallas collective-operation results for large messages - time (us) for Allgather, Allgatherv and Alltoall with MESH, LAM and MPICH.]
While micro-benchmarks are interesting from an MPI perspective, users are primarily
interested in the performance at application level. Here, according to Amdahl’s law,
improvements are limited by the fraction of time spent on MPI operations. Thus the runtime
of the NPB suite is particularly interesting, since it allows us to predict the value of running
a commercial MPI, and even to determine whether the differences in operation-level
performance can be seen at the application level.
The results are in favour of the commercial MPI; MESH-MPI finished the suite 14.5%
faster than LAM and 37.1% faster than MPICH. Considering that these are real-world
applications doing real work and taking Amdahl’s law into consideration, this is significant.
[Figure: Total NPB suite runtime in seconds - MESH: 522, LAM: 597, MPICH: 716.]
If we break down the results into the individual applications, the picture is a little less
obvious, and LAM-MPI actually outperforms MESH-MPI on two of the experiments: FT
by 3% and LU by 6%. Both of these make extensive use of the Alltoall operation,
where MESH-MPI has the biggest problems keeping up with LAM-MPI in the Pallas tests.
[Figure: NPB per-benchmark performance in MOPS for BT, CG, EP, FT, IS, LU, MG and SP with MESH, LAM and MPICH.]