Lecture 5

The document discusses the physical organization of parallel computing platforms. It describes the architecture of an ideal parallel computer, the PRAM, which consists of multiple processors that access a shared memory. PRAMs allow concurrent access to memory locations and can be categorized by how simultaneous reads and writes are handled. The document also discusses interconnection networks that connect processors and memory in parallel computers, including static (direct) networks and dynamic (indirect) networks built from switches. Common topologies such as buses, crossbars, and multistage networks like the Omega network are described.

Parallel Computing Platforms

Physical Organization of Parallel Platforms



• Architecture of an Ideal Parallel Computer


• Interconnection Networks for Parallel Computers
• Network Topologies
• Evaluating Static Interconnection Networks
• Evaluating Dynamic Interconnection Networks
• Cache Coherence in Multiprocessor Systems
Architecture of an Ideal Parallel Computer
• A natural extension of the Random Access Machine (RAM)
serial architecture is the Parallel Random Access Machine, or
PRAM.
• PRAMs consist of p processors and a global memory of
unbounded size that is uniformly accessible to all processors.
• All processors access the same address space.
• Processors share a common clock but may execute different
instructions in each cycle.
Architecture of an Ideal Parallel Computer

• Since PRAMs allow concurrent access to various memory locations, they can be divided into four subclasses, depending on how simultaneous memory accesses are handled.

• Exclusive-read, exclusive-write (EREW) PRAM.


– Access to a memory location is exclusive.
– No concurrent read or write operations are allowed.
– This is the weakest PRAM model, affording minimum
concurrency in memory access.
Architecture of an Ideal Parallel Computer

• Concurrent-read, exclusive-write (CREW) PRAM.


– Multiple read accesses to a memory location are allowed.
– However, multiple write accesses to a memory location are
serialized.
Architecture of an Ideal Parallel Computer

• Exclusive-read, concurrent-write (ERCW) PRAM.


– Multiple write accesses are allowed to a memory location
– but multiple read accesses are serialized.

• Concurrent-read, concurrent-write (CRCW) PRAM.


– Allows multiple read and write accesses to a common memory
location.
– This is the most powerful PRAM model.
Architecture of an Ideal Parallel Computer

• Several protocols are used to resolve concurrent writes. The


most frequently used protocols are as follows:
• Common: Concurrent write is allowed if all the values that
the processors are attempting to write are identical.
• Arbitrary: One arbitrarily chosen processor is allowed to proceed with the write operation and the rest fail.
• Priority: All processors are organized into a predefined
prioritized list, and the processor with the highest priority
succeeds and the rest fail.
• Sum: Sum of all the quantities is written (the sum-based write
conflict resolution model can be extended to any associative
operator defined on the quantities being written).
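
As an illustration of these protocols, here is a minimal sketch in Python (not part of the lecture; the function name and the convention that values arrive ordered by processor priority are assumptions):

```python
import random

# Sketch of CRCW write-conflict resolution. Each processor attempting a
# concurrent write to one location submits a value; 'values' is assumed
# to be ordered by processor priority (index 0 = highest priority).

def resolve_concurrent_write(values, protocol):
    """Return the value stored in the location after the concurrent write."""
    if protocol == "common":
        # Permitted only if all processors write the identical value.
        if len(set(values)) != 1:
            raise ValueError("common protocol: values differ")
        return values[0]
    if protocol == "arbitrary":
        # Any single writer may succeed; the rest fail.
        return random.choice(values)
    if protocol == "priority":
        # The highest-priority processor wins.
        return values[0]
    if protocol == "sum":
        # Any associative operator works; sum is the classic choice.
        return sum(values)
    raise ValueError(f"unknown protocol: {protocol}")

print(resolve_concurrent_write([5, 5, 5], "common"))   # 5
print(resolve_concurrent_write([1, 2, 3], "sum"))      # 6
```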
Physical Complexity of an Ideal Parallel
Computer
• Consider the implementation of an EREW PRAM as a shared-
memory computer with p processors and a global memory
of m words.
• Processors and memories are connected via switches.
• Since these switches must operate in O(1) time at the level of
words, for a system of p processors and m words, the switch
complexity is O(mp).
• Clearly, for meaningful values of p and m, a true PRAM is not
realizable.
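• For example, with p = 1,024 processors and a memory of m = 2^30 words, O(mp) amounts to roughly 2^40 switches.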
Interconnection Networks for Parallel Computers
• Interconnection networks carry data between processors and
to memory.
• Interconnects are made of switches and links (wires, fiber).
• Interconnects are classified as static or dynamic.
• Static networks consist of point-to-point communication links
among processing nodes and are also referred to as direct
networks.
• Dynamic networks are built using switches and
communication links. Dynamic networks are also referred to
as indirect networks.
Static and Dynamic Interconnection Networks

Classification of interconnection networks:


(a) a static network;
(b) a dynamic network.
Interconnection Networks

• Switches map a fixed number of inputs to outputs.


• The total number of ports on a switch is the degree of the
switch.
• Switches may also provide support for internal buffering
(when the requested output port is busy), routing (to alleviate
congestion on the network), and multicast (same output on
multiple ports).
• The cost of a switch grows as the square of the degree of the
switch, the peripheral hardware linearly as the degree, and
the packaging costs linearly as the number of pins.
Interconnection Networks:
Network Interfaces
• Processors talk to the network via a network interface.
• The network interface has input and output ports that pipe data
into and out of the network.
• It typically has the responsibility of packetizing data, computing
routing information, buffering incoming and outgoing data for
matching speeds of network and processing elements, and error
checking.
• The relative speeds of the I/O and memory buses impact the
performance of the network.
• Since I/O buses are typically slower than memory buses, the
latter can support higher bandwidth.
Network Topologies

• A wide variety of network topologies have been used in


interconnection networks.
• These topologies try to trade off cost and scalability with
performance.
• Commercial machines often implement hybrids of multiple
topologies (combinations or modifications of the pure
topologies ) for reasons of packaging, cost, and available
components.
Network Topologies: Buses

• Some of the simplest and earliest parallel machines used buses.


• A bus has the desirable property that the cost of the network
scales linearly as the number of nodes, p.
• All processors access a common bus for exchanging data.
• The distance between any two nodes is O(1) in a bus. The bus also provides a convenient broadcast medium.
• However, the bandwidth of the shared bus is a major bottleneck.
• Typical bus-based machines are limited to dozens of nodes. Sun Enterprise servers and Intel Pentium-based shared-bus multiprocessors are examples of such architectures.
Network Topologies: Buses

Bus-based interconnects (a) with no local caches; (b) with local


memory/caches.
Since much of the data accessed by processors is local to the
processor, a local memory can improve the performance of bus-
based machines.
Network Topologies: Crossbars
A crossbar network uses a p × m grid of switches to connect p inputs to m outputs in a non-blocking manner.

A completely non-blocking crossbar network connecting p


processors to b memory banks.
Network Topologies: Crossbars

• The crossbar network is a non-blocking network in the sense


that the connection of a processing node to a memory bank
does not block the connection of any other processing nodes to
other memory banks.
• Assume that the number of memory banks b is at least p;
otherwise, at any given time, there will be some processing
nodes that will be unable to access any memory banks.
• The cost of a crossbar of p processors grows as O(p²).
• This is generally difficult to scale for large values of p.
• Examples of machines that employ crossbars include the Sun
Ultra HPC 10000 and the Fujitsu VPP500.
Network Topologies: Multistage Networks

• Crossbars have excellent performance scalability but poor


cost scalability.
• Buses have excellent cost scalability, but poor performance
scalability.
• An intermediate class of networks called multistage
interconnection networks lies between these two extremes.
• It is more scalable than the bus in terms of performance and
more scalable than the crossbar in terms of cost.
Network Topologies: Multistage Networks

The schematic of a typical multistage interconnection network.


Network Topologies: Multistage Omega Network

• One of the most commonly used multistage


interconnects is the Omega network.
• This network consists of log p stages, where p is the number of inputs/outputs.
• At each stage, input i is connected to output j according to:

  j = 2i            for 0 ≤ i ≤ p/2 − 1
  j = 2i + 1 − p    for p/2 ≤ i ≤ p − 1
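
A small sketch in Python (illustrative, not from the lecture) confirms that this mapping is exactly a one-bit left cyclic rotation of the input's log p-bit address:

```python
# One stage of an Omega network: the perfect-shuffle mapping.
# p is assumed to be a power of two.

def perfect_shuffle(i, p):
    """Map input i to output j using the two-case formula."""
    return 2 * i if i < p // 2 else 2 * i + 1 - p

def shuffle_by_rotation(i, p):
    """Same mapping as a left cyclic rotation of i's log2(p) address bits."""
    bits = p.bit_length() - 1
    msb = (i >> (bits - 1)) & 1          # bit that wraps around to the end
    return ((i << 1) & (p - 1)) | msb

p = 8
for i in range(p):
    assert perfect_shuffle(i, p) == shuffle_by_rotation(i, p)
    print(i, "->", perfect_shuffle(i, p))
# Output: 0->0, 1->2, 2->4, 3->6, 4->1, 5->3, 6->5, 7->7
```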
Network Topologies: Multistage Omega Network

Each stage of the Omega network implements a perfect


shuffle as follows:

A perfect shuffle interconnection for eight inputs and outputs.


Network Topologies: Multistage Omega
Network
• The perfect shuffle patterns are connected using 2×2
switches.
• The switches operate in two modes – crossover or
passthrough.

Two switching configurations of the 2 × 2 switch:


(a) Pass-through;
(b) Cross-over.
Network Topologies: Multistage Omega Network

A complete Omega network with the perfect shuffle


interconnects and switches can now be illustrated:

A complete omega network connecting eight inputs and eight outputs.

An Omega network has (p/2) log p switching nodes, and the cost of such a network grows as Θ(p log p).
Network Topologies: Multistage Omega
Network – Routing
• Let s be the binary representation of the source and d be that of
the destination processor.
• The data traverses the link to the first switching node.
• If the most significant bits of s and d are the same, then the data is routed in pass-through mode by the switch; otherwise, it is routed in crossover mode.
• This scheme is repeated at the next switching stage using the next most significant bit.
• Traversing log p stages uses all log p bits in the binary representations of s and d.
• Note that, unlike the crossbar, the Omega network is not non-blocking: two messages may contend for the same link.
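
The routing rule can be sketched as follows (illustrative Python; the function name is an assumption):

```python
# Destination-tag routing in an Omega network with p inputs/outputs.
# At stage k the switch compares the k-th most significant bits of the
# source s and destination d: equal -> pass-through, else -> crossover.

def omega_route(s, d, p):
    """Return the switch setting used at each of the log2(p) stages."""
    stages = p.bit_length() - 1
    settings = []
    for k in range(stages - 1, -1, -1):       # most significant bit first
        same = ((s >> k) & 1) == ((d >> k) & 1)
        settings.append("pass-through" if same else "crossover")
    return settings

print(omega_route(0b010, 0b111, 8))
# ['crossover', 'pass-through', 'crossover']
```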
Network Topologies: Multistage Omega
Network – Routing

An example of blocking in an Omega network: one of the two messages (010 to 111 or 110 to 100) is blocked at link AB.
Network Topologies: Completely Connected
Network
• In a completely-connected network, each node has a direct
communication link to every other node in the network.
• This network is ideal in the sense that a node can send a
message to another node in a single step, since a
communication link exists between them.
• The number of links in the network scales as O(p²).
• While the performance scales very well, the hardware
complexity is not realizable for large values of p.
• In this sense, these networks are static counterparts of
crossbars.
Network Topologies: Completely Connected and Star
Connected Networks

(a) A completely-connected network of eight nodes; (b) a star-connected network of nine nodes.
Network Topologies: Star Connected Network

• In a star-connected network, one processor acts as the


central processor.
• Every other processor has a communication link connecting it
to this processor.
• Every node is connected only to a common node at the
center.
• Distance between any pair of nodes is O(1). However, the
central node becomes a bottleneck.
• In this sense, star connected networks are static counterparts
of buses.
Network Topologies: Linear Arrays, Meshes, and k-d
Meshes

• In a linear array, each node has two neighbors, one to its left
and one to its right.
• If the nodes at either end are connected, we refer to it as a 1-
D torus or a ring.
• A generalization to 2 dimensions has nodes with 4 neighbors,
to the north, south, east, and west.
• A further generalization to d dimensions has nodes with 2d neighbors (two per dimension).
• A special case of a d-dimensional mesh is a hypercube. Here,
d = log p, where p is the total number of nodes.
Network Topologies: Linear Arrays

Linear arrays: (a) with no wraparound links;


(b) with wraparound link.
Network Topologies: Two- and Three Dimensional
Meshes

Two and three dimensional meshes:


(a) 2-D mesh with no wraparound;
(b) 2-D mesh with wraparound link (2-D torus);
(c) a 3-D mesh with no wraparound.
Network Topologies: Hypercubes and their
Construction

Construction of hypercubes from hypercubes of lower


dimension.
Network Topologies: Properties of Hypercubes

• The distance between any two nodes is at most log p.


• Each node has log p neighbors.
• The distance between two nodes is given by the number of bit
positions at which the two nodes differ.
• For example, nodes labeled 0110 and 0101 are two links
apart, since they differ at two bit positions.
• This property is useful for deriving a number of parallel
algorithms for the hypercube architecture.
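
This bit-position property is easy to express in code. A minimal sketch in Python (the function names are illustrative):

```python
# Distance in a hypercube = Hamming distance between the node labels.

def hypercube_distance(a, b):
    """Number of links on a shortest path between nodes a and b."""
    return bin(a ^ b).count("1")

def neighbors(node, d):
    """The d neighbors of a node in a d-dimensional hypercube."""
    return [node ^ (1 << k) for k in range(d)]

print(hypercube_distance(0b0110, 0b0101))  # 2, as in the example above
print(neighbors(0b0110, 4))                # [7, 4, 2, 14]
```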
Network Topologies: Tree-Based Networks

Complete binary tree networks: (a) a static tree network; and


(b) a dynamic tree network.
Network Topologies: Tree Properties

• The distance between any two nodes is no more than 2 log p.


• Links higher up the tree potentially carry more traffic than those at the lower levels.
• For this reason, a variant called a fat tree fattens the links as we go up the tree.
• In a dynamic tree network, nodes at intermediate levels are switching nodes and the leaf nodes are processing elements.
• Trees can be laid out in 2D with no wire crossings. This is an
attractive property of trees.
Network Topologies: Fat Trees

A fat tree network of 16 processing nodes.


Evaluating Static Interconnection Networks

Diameter:
• The distance between the farthest two nodes in the network.
• The diameter of a linear array is p − 1, that of a tree and of a hypercube is O(log p), and that of a completely connected network is O(1).

Connectivity:
• The connectivity of a network is a measure of the multiplicity of
paths between any two processing nodes.
• A network with high connectivity is desirable, because it lowers
contention for communication resources.
• One measure of connectivity is the minimum number of arcs that
must be removed from the network to break it into two
disconnected networks. This is called the arc connectivity of the
network.
Evaluating Static Interconnection Networks

Bisection Width:
• The minimum number of wires you must cut to divide the
network into two equal parts.
• The bisection width of a linear array and of a tree is 1, that of a 2-D mesh is √p, that of a hypercube is p/2, and that of a completely connected network is p²/4.

Cost:
• The number of links or switches (whichever is asymptotically
higher) is a meaningful measure of the cost.
• However, a number of other factors, such as the ability to
layout the network, the length of wires, etc., also factor in to
the cost.
Evaluating Static Interconnection Networks

• The number of bits that can be communicated


simultaneously over a link connecting two nodes is
called the channel width. Channel width is equal to the
number of physical wires in each communication link.
• The peak rate at which a single physical wire can deliver
bits is called the channel rate.
• The peak rate at which data can be communicated
between the ends of a communication link is called
channel bandwidth.
• Channel bandwidth is the product of channel rate and
channel width.
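• For example, a link with a channel width of 32 wires, each delivering a channel rate of 1 Gbit/s, has a channel bandwidth of 32 Gbit/s.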
Evaluating Static Interconnection Networks

Network                     Diameter           Bisection Width   Arc Connectivity   Cost (No. of links)
Completely-connected        1                  p²/4              p − 1              p(p − 1)/2
Star                        2                  1                 1                  p − 1
Complete binary tree        2 log((p + 1)/2)   1                 1                  p − 1
Linear array                p − 1              1                 1                  p − 1
2-D mesh, no wraparound     2(√p − 1)          √p                2                  2(p − √p)
2-D wraparound mesh         2⌊√p/2⌋            2√p               4                  2p
Hypercube                   log p              p/2               log p              (p log p)/2
Wraparound k-ary d-cube     d⌊k/2⌋             2k^(d−1)          2d                 dp


Evaluating Dynamic Interconnection Networks

• A number of evaluation metrics for dynamic networks follow from the corresponding metrics for static networks.
• Since a message traversing a switch must pay an overhead, it is logical to think of each switch as a node in the network, in addition to the processing nodes.
• The diameter of the network would be the maximum distance between any two processing nodes.
• However, for all networks of interest, this is equivalent to the maximum distance between any (processing or switching) pair of nodes.
Evaluating Dynamic Interconnection Networks

• Bisection width of a dynamic network is computed by examining various equi-partitions of the processing nodes and selecting the minimum number of edges crossing the partition. In the example shown, each partition yields an edge cut of four; therefore, the bisection width of this graph is four.
Evaluating Dynamic Interconnection Networks

Network          Diameter   Bisection Width   Arc Connectivity   Cost (No. of links)
Crossbar         1          p                 1                  p²
Omega network    log p      p/2               2                  p log p
Dynamic tree     2 log p    1                 1                  2(p − 1)
Cache Coherence in Multiprocessor Systems

• Interconnects provide basic mechanisms for data transfer.


• In the case of shared address space machines, additional
hardware is required to coordinate access to data that might
have multiple copies in the network.
• The underlying technique must provide some guarantees on
the semantics.
• This guarantee is generally one of serializability, i.e., there
exists some serial order of instruction execution that
corresponds to the parallel schedule.
Cache Coherence in Multiprocessor Systems
When the value of a variable is changed, all its copies must either be invalidated or updated.

Cache coherence in multiprocessor systems:


(a) Invalidate protocol; (b) Update protocol for shared variables.
Cache Coherence: Update and Invalidate
Protocols
• If a processor just reads a value once and does not need it
again, an update protocol may generate significant overhead.
• If two processors make interleaved test and updates to a
variable, an update protocol is better.
• Both protocols suffer from false-sharing overheads: two words that are not shared by the processors nevertheless lie on the same cache line, so coherence traffic is generated even though no data is actually shared.
• In an invalidate protocol, when a processor updates its part of the
cache-line, the other copies of this line are invalidated. When other
processors try to update their parts of the cache-line, the line must
actually be fetched from the remote processor.
• Most current machines use invalidate protocols.
Maintaining Coherence Using Invalidate
Protocols
• Each copy of a data item is associated with a state.
• One example of such a set of states is shared, invalid, and dirty.
• In shared state, there are multiple valid copies of the data
item (and therefore, an invalidate would have to be generated
on an update).
• In dirty state, only one copy exists and therefore, no
invalidates need to be generated.
• In invalid state, the data copy is invalid, therefore, a read
generates a data request (and associated state changes).
Maintaining Coherence Using Invalidate
Protocols
• Initially the variable x resides in the global
memory.
• The first step executed by both processors is a
load operation on this variable.
• At this point, the state of the variable is said to be
shared.
• When processor P0 executes a store on this
variable, it marks all other copies of this variable
as invalid.
• It must also mark its own copy as modified or
dirty.
• This is done to ensure that all subsequent
accesses to this variable at other processors will
be serviced by processor P0 and not from the
memory.
Maintaining Coherence Using Invalidate
Protocols
• At this point, say, processor P1 executes
another load operation on x .
• Processor P1 attempts to fetch this
variable and, since the variable was
marked dirty by processor P0, processor
P0 services the request.
• Copies of this variable at processor P1
and the global memory are updated and
the variable re-enters the shared state.
• Thus, in this simple model, there are
three states - shared, invalid, and dirty -
that a cache line goes through.
Maintaining Coherence Using Invalidate Protocols

The solid lines depict processor actions and the dashed lines coherence actions. When a processor executes a read on an invalid block, the block is fetched and a transition is made from invalid to shared. Similarly, if a processor does a write on a shared block, the coherence protocol propagates a C_write (a coherence write) on the block. This triggers a transition from shared to invalid at all the other copies of the block.

State diagram of a simple three-state coherence protocol.
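
The state diagram can be sketched as a small simulation (illustrative Python; the class and method names are assumptions, not the lecture's notation):

```python
# Sketch of the three-state (shared/dirty/invalid) invalidate protocol.

class CacheLine:
    def __init__(self):
        self.state = "invalid"

    def read(self, others):
        if self.state == "invalid":
            # Fetch the block; a remote dirty copy services the request
            # and re-enters the shared state after writing back.
            for o in others:
                if o.state == "dirty":
                    o.state = "shared"
            self.state = "shared"

    def write(self, others):
        # C_write: invalidate every other copy, mark our copy dirty.
        for o in others:
            o.state = "invalid"
        self.state = "dirty"

p0, p1 = CacheLine(), CacheLine()
p0.read([p1]); p1.read([p0])   # both load x: both shared
p0.write([p1])                 # P0 stores x: P0 dirty, P1 invalid
p1.read([p0])                  # P1 loads x again: both shared
print(p0.state, p1.state)      # shared shared
```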
Maintaining Coherence Using Invalidate
Protocols
Example:
• Consider an example of two program segments being executed by processors P0 and P1.
• The system consists of local memories (or caches) at processors P0 and P1, and a global memory.
• The three-state protocol assumed in this example corresponds to the state diagram given on the previous slide.
• Cache lines in this system can be either shared, invalid, or dirty. Each data item (variable) is assumed to be on a different cache line.
• Initially, the two variables x and y are tagged dirty and the only copies of these variables exist in the global memory.
Maintaining Coherence Using Invalidate
Protocols

Example of parallel program execution with the simple


three-state coherence protocol.
Snoopy systems & Directory based systems

• The implementation of coherence protocols can be carried out


using a variety of hardware mechanisms – snoopy systems,
directory based systems, or combinations thereof.
Snoopy Cache Systems
Snoopy caches are typically associated with multiprocessor
systems based on broadcast interconnection networks such as a
bus or a ring.
In such systems, all processors snoop on (monitor) the bus for
transactions. This allows the processor to make state transitions
for its cache-blocks.

A simple snoopy bus based cache coherence system.


Snoopy Cache Systems
How are invalidates sent to the right processors?

In snoopy caches, all invalidates and read requests are placed on a broadcast medium; each processor snoops on this medium and performs the appropriate coherence operations locally.


Snoopy Cache Systems

• When the snoop hardware detects that a read has been


issued to a cache block that it has a dirty copy of, it
asserts control of the bus and puts the data out.
• Similarly, when the snoop hardware detects that a write
operation has been issued on a cache block that it has a
copy of, it invalidates the block.
• Other state transitions are made in this fashion locally.
Performance of Snoopy Caches

• Once copies of data are tagged dirty, all subsequent


operations can be performed locally on the cache without
generating external traffic.
• If a data item is read by a number of processors, it transitions
to the shared state in the cache and all subsequent read
operations become local.
• If processors read and update data at the same time, they
generate coherence requests on the bus - which is ultimately
bandwidth limited.
Directory Based Systems

• In snoopy caches, each coherence operation is sent to all


processors. This is an inherent limitation.
• Why not send coherence requests to only those processors
that need to be notified?
• This is done using a directory, which maintains a presence
vector for each data item (cache line) along with its global
state.
Directory Based Systems

Architecture of typical directory based systems:


(a) a centralized directory; and (b) a distributed directory.
Directory Based Systems

• When P0 and P1 access the variable x , the state of the block is


changed to shared, and the presence bits updated to indicate that
processors P0 and P1 share the block.
• When P0 executes a store on x, the state in the directory is changed
to dirty and the presence bit of P1 is reset.
• All subsequent operations on this variable performed at P0 can
proceed locally.
Directory Based Systems

• If another processor reads the value, the directory notices the dirty
tag and uses the presence bits to direct the request to the
appropriate processor.
• P0 updates the block in the memory, and sends it to the requesting
processor. The presence bits are modified to reflect this and the
state transitions to shared.
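
The sequence just described can be sketched as follows (illustrative Python; DirectoryEntry and its fields are assumed names, not from the lecture):

```python
# Sketch of a directory entry holding a state and a presence vector.

class DirectoryEntry:
    def __init__(self, num_procs):
        self.state = "uncached"
        self.presence = [False] * num_procs   # one bit per processor

    def read(self, proc):
        if self.state == "dirty":
            # Use the presence bits to find the owner and fetch the block.
            owner = self.presence.index(True)
            print(f"forward read to owner P{owner}; block written back")
        self.presence[proc] = True
        self.state = "shared"

    def write(self, proc):
        # Invalidate every other sharer named in the presence vector.
        for q, present in enumerate(self.presence):
            if present and q != proc:
                print(f"invalidate copy at P{q}")
                self.presence[q] = False
        self.presence[proc] = True
        self.state = "dirty"

x = DirectoryEntry(num_procs=4)
x.read(0); x.read(1)   # P0 and P1 share x
x.write(0)             # P0 stores: P1 invalidated, state -> dirty
x.read(2)              # P2 reads: request forwarded to P0, state -> shared
```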
Performance of Directory Based Schemes

• The need for a broadcast media is replaced by the directory.


• The additional bits to store the directory may add significant
overhead.
• From the point of view of cost, the amount of memory required to
store the directory may itself become a bottleneck as the number of
processors increases.
• The directory size grows as O(mp), where m is the number of
memory blocks and p the number of processors.
• The underlying network must be able to carry all the coherence
requests.
• The directory is a point of contention; therefore, distributed directory schemes must be used.
Distributed Directory Schemes

• In scalable architectures, memory is physically distributed across


processors.
• The corresponding presence bits of the blocks are also distributed.
• Each processor is responsible for maintaining the coherence of its own
memory blocks.
• When a processor attempts to read a block for the first time, it requests
the owner for the block. The owner suitably directs this request based
on presence and state information locally available.
• Similarly, when a processor writes into a memory block, it propagates
an invalidate to the owner, which in turn forwards the invalidate to all
processors that have a cached copy of the block.
• Communication overhead associated with state update messages is not
reduced.
Performance of Distributed Directory
Schemes
• Distributed directories permit O(p) simultaneous coherence
operations, provided the underlying network can sustain the
associated state update messages.
• From this point of view, distributed directories are inherently more
scalable than snoopy systems or centralized directory systems.
• The latency and bandwidth of the network become fundamental
performance bottlenecks for such systems.
Exercise

• A cycle in a graph is defined as a path originating and


terminating at the same node. The length of a cycle is
the number of edges in the cycle. Show that there are no
odd-length cycles in a d-dimensional hypercube.
Thank You
