Lecture 5

The document discusses the physical organization of parallel computing platforms. It describes the architecture of an ideal parallel computer, the PRAM, which consists of multiple processors that access a shared memory. PRAMs allow concurrent access to memory locations and can be categorized by how simultaneous reads and writes are handled. The document also discusses interconnection networks that connect processors and memory in parallel computers, including static (direct) networks and dynamic (indirect) networks built from switches. Common topologies such as buses, crossbars, and multistage networks like the Omega network are described.

Parallel Computing Platforms

Physical Organization of Parallel Platforms



• Architecture of an Ideal Parallel Computer


• Interconnection Networks for Parallel Computers
• Network Topologies
• Evaluating Static Interconnection Networks
• Evaluating Dynamic Interconnection Networks
• Cache Coherence in Multiprocessor Systems
Architecture of an Ideal Parallel Computer
• A natural extension of the Random Access Machine (RAM)
serial architecture is the Parallel Random Access Machine, or
PRAM.
• PRAMs consist of p processors and a global memory of
unbounded size that is uniformly accessible to all processors.
• All processors access the same address space.
• Processors share a common clock but may execute different
instructions in each cycle.
Architecture of an Ideal Parallel Computer

• Since PRAMs allow concurrent access to various memory locations, they can be divided into four subclasses, depending on how simultaneous memory accesses are handled.

• Exclusive-read, exclusive-write (EREW) PRAM.


– Access to a memory location is exclusive.
– No concurrent read or write operations are allowed.
– This is the weakest PRAM model, affording minimum
concurrency in memory access.
Architecture of an Ideal Parallel Computer

• Concurrent-read, exclusive-write (CREW) PRAM.


– Multiple read accesses to a memory location are allowed.
– However, multiple write accesses to a memory location are
serialized.
Architecture of an Ideal Parallel Computer

• Exclusive-read, concurrent-write (ERCW) PRAM.


– Multiple write accesses are allowed to a memory location
– but multiple read accesses are serialized.

• Concurrent-read, concurrent-write (CRCW) PRAM.


– Allows multiple read and write accesses to a common memory
location.
– This is the most powerful PRAM model.
Architecture of an Ideal Parallel Computer

• Several protocols are used to resolve concurrent writes. The


most frequently used protocols are as follows:
• Common: Concurrent write is allowed if all the values that
the processors are attempting to write are identical.
• Arbitrary: One arbitrarily chosen processor is allowed to proceed with the write operation and the rest fail.
• Priority: All processors are organized into a predefined
prioritized list, and the processor with the highest priority
succeeds and the rest fail.
• Sum: Sum of all the quantities is written (the sum-based write
conflict resolution model can be extended to any associative
operator defined on the quantities being written).
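
As an illustration of these protocols, here is a minimal sketch in Python (not part of the lecture; the function name and the convention that values arrive ordered by processor priority are assumptions):

```python
import random

# Sketch of CRCW write-conflict resolution. Each processor attempting a
# concurrent write to one location submits a value; 'values' is assumed
# to be ordered by processor priority (index 0 = highest priority).

def resolve_concurrent_write(values, protocol):
    """Return the value stored in the location after the concurrent write."""
    if protocol == "common":
        # Permitted only if all processors write the identical value.
        if len(set(values)) != 1:
            raise ValueError("common protocol: values differ")
        return values[0]
    if protocol == "arbitrary":
        # Any single writer may succeed; the rest fail.
        return random.choice(values)
    if protocol == "priority":
        # The highest-priority processor wins.
        return values[0]
    if protocol == "sum":
        # Any associative operator works; sum is the classic choice.
        return sum(values)
    raise ValueError(f"unknown protocol: {protocol}")

print(resolve_concurrent_write([5, 5, 5], "common"))   # 5
print(resolve_concurrent_write([1, 2, 3], "sum"))      # 6
```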
Physical Complexity of an Ideal Parallel
Computer
• Consider the implementation of an EREW PRAM as a shared-
memory computer with p processors and a global memory
of m words.
• Processors and memories are connected via switches.
• Since these switches must operate in O(1) time at the level of
words, for a system of p processors and m words, the switch
complexity is O(mp).
• Clearly, for meaningful values of p and m, a true PRAM is not
realizable.
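• For example, with p = 1,024 processors and a memory of m = 2^30 words, O(mp) amounts to roughly 2^40 switches.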
Interconnection Networks for Parallel Computers
• Interconnection networks carry data between processors and
to memory.
• Interconnects are made of switches and links (wires, fiber).
• Interconnects are classified as static or dynamic.
• Static networks consist of point-to-point communication links
among processing nodes and are also referred to as direct
networks.
• Dynamic networks are built using switches and
communication links. Dynamic networks are also referred to
as indirect networks.
Static and Dynamic Interconnection Networks

Classification of interconnection networks:


(a) a static network;
(b) a dynamic network.
Interconnection Networks

• Switches map a fixed number of inputs to outputs.


• The total number of ports on a switch is the degree of the
switch.
• Switches may also provide support for internal buffering
(when the requested output port is busy), routing (to alleviate
congestion on the network), and multicast (same output on
multiple ports).
• The cost of a switch grows as the square of the degree of the
switch, the peripheral hardware linearly as the degree, and
the packaging costs linearly as the number of pins.
Interconnection Networks:
Network Interfaces
• Processors talk to the network via a network interface.
• The network interface has input and output ports that pipe data
into and out of the network.
• It typically has the responsibility of packetizing data, computing
routing information, buffering incoming and outgoing data for
matching speeds of network and processing elements, and error
checking.
• The relative speeds of the I/O and memory buses impact the
performance of the network.
• Since I/O buses are typically slower than memory buses, the
latter can support higher bandwidth.
Network Topologies

• A wide variety of network topologies have been used in


interconnection networks.
• These topologies try to trade off cost and scalability with
performance.
• Commercial machines often implement hybrids of multiple
topologies (combinations or modifications of the pure
topologies ) for reasons of packaging, cost, and available
components.
Network Topologies: Buses

• Some of the simplest and earliest parallel machines used buses.


• A bus has the desirable property that the cost of the network
scales linearly as the number of nodes, p.
• All processors access a common bus for exchanging data.
• The distance between any two nodes is O(1) in a bus. The bus also provides a convenient broadcast medium.
• However, the bandwidth of the shared bus is a major bottleneck.
• Typical bus-based machines are limited to dozens of nodes. Sun Enterprise servers and Intel Pentium-based shared-bus multiprocessors are examples of such architectures.
Network Topologies: Buses

Bus-based interconnects (a) with no local caches; (b) with local


memory/caches.
Since much of the data accessed by processors is local to the
processor, a local memory can improve the performance of bus-
based machines.
Network Topologies: Crossbars
A crossbar network uses a p × m grid of switches to connect p inputs to m outputs in a non-blocking manner.

A completely non-blocking crossbar network connecting p


processors to b memory banks.
Network Topologies: Crossbars

• The crossbar network is a non-blocking network in the sense


that the connection of a processing node to a memory bank
does not block the connection of any other processing nodes to
other memory banks.
• Assume that the number of memory banks b is at least p;
otherwise, at any given time, there will be some processing
nodes that will be unable to access any memory banks.
• The cost of a crossbar of p processors grows as O(p²).
• This is generally difficult to scale for large values of p.
• Examples of machines that employ crossbars include the Sun
Ultra HPC 10000 and the Fujitsu VPP500.
Network Topologies: Multistage Networks

• Crossbars have excellent performance scalability but poor


cost scalability.
• Buses have excellent cost scalability, but poor performance
scalability.
• An intermediate class of networks called multistage
interconnection networks lies between these two extremes.
• It is more scalable than the bus in terms of performance and
more scalable than the crossbar in terms of cost.
Network Topologies: Multistage Networks

The schematic of a typical multistage interconnection network.


Network Topologies: Multistage Omega Network

• One of the most commonly used multistage


interconnects is the Omega network.
• This network consists of log p stages, where p is the number of inputs/outputs.
• At each stage, input i is connected to output j according to:

  j = 2i            for 0 ≤ i ≤ p/2 − 1
  j = 2i + 1 − p    for p/2 ≤ i ≤ p − 1
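
A small sketch in Python (illustrative, not from the lecture) confirms that this mapping is exactly a one-bit left cyclic rotation of the input's log p-bit address:

```python
# One stage of an Omega network: the perfect-shuffle mapping.
# p is assumed to be a power of two.

def perfect_shuffle(i, p):
    """Map input i to output j using the two-case formula."""
    return 2 * i if i < p // 2 else 2 * i + 1 - p

def shuffle_by_rotation(i, p):
    """Same mapping as a left cyclic rotation of i's log2(p) address bits."""
    bits = p.bit_length() - 1
    msb = (i >> (bits - 1)) & 1          # bit that wraps around to the end
    return ((i << 1) & (p - 1)) | msb

p = 8
for i in range(p):
    assert perfect_shuffle(i, p) == shuffle_by_rotation(i, p)
    print(i, "->", perfect_shuffle(i, p))
# Output: 0->0, 1->2, 2->4, 3->6, 4->1, 5->3, 6->5, 7->7
```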
Network Topologies: Multistage Omega Network

Each stage of the Omega network implements a perfect


shuffle as follows:

A perfect shuffle interconnection for eight inputs and outputs.


Network Topologies: Multistage Omega
Network
• The perfect shuffle patterns are connected using 2×2
switches.
• The switches operate in two modes – crossover or
passthrough.

Two switching configurations of the 2 × 2 switch:


(a) Pass-through;
(b) Cross-over.
Network Topologies: Multistage Omega Network

A complete Omega network with the perfect shuffle


interconnects and switches can now be illustrated:

A complete omega network connecting eight inputs and eight outputs.

An Omega network has (p/2) log p switching nodes, and the cost of such a network grows as Θ(p log p).
Network Topologies: Multistage Omega
Network – Routing
• Let s be the binary representation of the source and d be that of
the destination processor.
• The data traverses the link to the first switching node.
• If the most significant bits of s and d are the same, then the data is routed in pass-through mode by the switch; otherwise, it is routed in crossover mode.
• This scheme is repeated at the next switching stage using the next most significant bit.
• Traversing log p stages uses all log p bits in the binary representations of s and d.
• Note that, unlike the crossbar, the Omega network is not non-blocking: two messages may contend for the same link.
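
The routing rule can be sketched as follows (illustrative Python; the function name is an assumption):

```python
# Destination-tag routing in an Omega network with p inputs/outputs.
# At stage k the switch compares the k-th most significant bits of the
# source s and destination d: equal -> pass-through, else -> crossover.

def omega_route(s, d, p):
    """Return the switch setting used at each of the log2(p) stages."""
    stages = p.bit_length() - 1
    settings = []
    for k in range(stages - 1, -1, -1):       # most significant bit first
        same = ((s >> k) & 1) == ((d >> k) & 1)
        settings.append("pass-through" if same else "crossover")
    return settings

print(omega_route(0b010, 0b111, 8))
# ['crossover', 'pass-through', 'crossover']
```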
Network Topologies: Multistage Omega
Network – Routing

An example of blocking in an Omega network: one of the two messages (010 to 111 or 110 to 100) is blocked at link AB.
Network Topologies: Completely Connected
Network
• In a completely-connected network, each node has a direct
communication link to every other node in the network.
• This network is ideal in the sense that a node can send a
message to another node in a single step, since a
communication link exists between them.
• The number of links in the network scales as O(p²).
• While the performance scales very well, the hardware
complexity is not realizable for large values of p.
• In this sense, these networks are static counterparts of
crossbars.
Network Topologies: Completely Connected and Star
Connected Networks

(a) A completely-connected network of eight nodes; (b) a star-connected network of nine nodes.
Network Topologies: Star Connected Network

• In a star-connected network, one processor acts as the


central processor.
• Every other processor has a communication link connecting it
to this processor.
• Every node is connected only to a common node at the
center.
• Distance between any pair of nodes is O(1). However, the
central node becomes a bottleneck.
• In this sense, star connected networks are static counterparts
of buses.
Network Topologies: Linear Arrays, Meshes, and k-d
Meshes

• In a linear array, each node has two neighbors, one to its left
and one to its right.
• If the nodes at either end are connected, we refer to it as a 1-
D torus or a ring.
• A generalization to 2 dimensions has nodes with 4 neighbors,
to the north, south, east, and west.
• A further generalization to d dimensions has nodes with 2d neighbors (two per dimension).
• A special case of a d-dimensional mesh is a hypercube. Here,
d = log p, where p is the total number of nodes.
Network Topologies: Linear Arrays

Linear arrays: (a) with no wraparound links;


(b) with wraparound link.
Network Topologies: Two- and Three Dimensional
Meshes

Two and three dimensional meshes:


(a) 2-D mesh with no wraparound;
(b) 2-D mesh with wraparound link (2-D torus);
(c) a 3-D mesh with no wraparound.
Network Topologies: Hypercubes and their
Construction

Construction of hypercubes from hypercubes of lower


dimension.
Network Topologies: Properties of Hypercubes

• The distance between any two nodes is at most log p.


• Each node has log p neighbors.
• The distance between two nodes is given by the number of bit
positions at which the two nodes differ.
• For example, nodes labeled 0110 and 0101 are two links
apart, since they differ at two bit positions.
• This property is useful for deriving a number of parallel
algorithms for the hypercube architecture.
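
This bit-position property is easy to express in code. A minimal sketch in Python (the function names are illustrative):

```python
# Distance in a hypercube = Hamming distance between the node labels.

def hypercube_distance(a, b):
    """Number of links on a shortest path between nodes a and b."""
    return bin(a ^ b).count("1")

def neighbors(node, d):
    """The d neighbors of a node in a d-dimensional hypercube."""
    return [node ^ (1 << k) for k in range(d)]

print(hypercube_distance(0b0110, 0b0101))  # 2, as in the example above
print(neighbors(0b0110, 4))                # [7, 4, 2, 14]
```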
Network Topologies: Tree-Based Networks

Complete binary tree networks: (a) a static tree network; and


(b) a dynamic tree network.
Network Topologies: Tree Properties

• The distance between any two nodes is no more than 2 log p.


• Links higher up the tree potentially carry more traffic than those at the lower levels.
• For this reason, a variant called a fat tree fattens the links as we go up the tree.
• In a dynamic tree network, nodes at intermediate levels are switching nodes and the leaf nodes are processing elements.
• Trees can be laid out in 2D with no wire crossings. This is an
attractive property of trees.
Network Topologies: Fat Trees

A fat tree network of 16 processing nodes.


Evaluating Static Interconnection Networks

Diameter:
• The distance between the farthest two nodes in the network.
• The diameter of a linear array is p − 1, that of a tree and of a hypercube is O(log p), and that of a completely connected network is O(1).

Connectivity:
• The connectivity of a network is a measure of the multiplicity of
paths between any two processing nodes.
• A network with high connectivity is desirable, because it lowers
contention for communication resources.
• One measure of connectivity is the minimum number of arcs that
must be removed from the network to break it into two
disconnected networks. This is called the arc connectivity of the
network.
Evaluating Static Interconnection Networks

Bisection Width:
• The minimum number of wires you must cut to divide the
network into two equal parts.
• The bisection width of a linear array and of a tree is 1, that of a 2-D mesh is √p, that of a hypercube is p/2, and that of a completely connected network is p²/4.

Cost:
• The number of links or switches (whichever is asymptotically
higher) is a meaningful measure of the cost.
• However, a number of other factors, such as the ability to
layout the network, the length of wires, etc., also factor in to
the cost.
Evaluating Static Interconnection Networks

• The number of bits that can be communicated


simultaneously over a link connecting two nodes is
called the channel width. Channel width is equal to the
number of physical wires in each communication link.
• The peak rate at which a single physical wire can deliver
bits is called the channel rate.
• The peak rate at which data can be communicated
between the ends of a communication link is called
channel bandwidth.
• Channel bandwidth is the product of channel rate and
channel width.
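• For example, a link with a channel width of 32 wires, each delivering a channel rate of 1 Gbit/s, has a channel bandwidth of 32 Gbit/s.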
Evaluating Static Interconnection Networks

Network                     Diameter           Bisection Width   Arc Connectivity   Cost (No. of links)
Completely-connected        1                  p²/4              p − 1              p(p − 1)/2
Star                        2                  1                 1                  p − 1
Complete binary tree        2 log((p + 1)/2)   1                 1                  p − 1
Linear array                p − 1              1                 1                  p − 1
2-D mesh, no wraparound     2(√p − 1)          √p                2                  2(p − √p)
2-D wraparound mesh         2⌊√p/2⌋            2√p               4                  2p
Hypercube                   log p              p/2               log p              (p log p)/2
Wraparound k-ary d-cube     d⌊k/2⌋             2k^(d−1)          2d                 dp


Evaluating Dynamic Interconnection Networks

• A number of evaluation metrics for dynamic networks follow from the corresponding metrics for static networks.
• Since a message traversing a switch must pay an overhead, it is logical to think of each switch as a node in the network, in addition to the processing nodes.
• The diameter of the network would be the maximum distance between any two processing nodes.
• However, for all networks of interest, this is equivalent to the maximum distance between any (processing or switching) pair of nodes.
Evaluating Dynamic Interconnection Networks

• Bisection width of a dynamic network is computed by examining various equi-partitions of the processing nodes and selecting the minimum number of edges crossing the partition. In the example shown, each partition yields an edge cut of four; therefore, the bisection width of this graph is four.
Evaluating Dynamic Interconnection Networks

Network          Diameter   Bisection Width   Arc Connectivity   Cost (No. of links)
Crossbar         1          p                 1                  p²
Omega network    log p      p/2               2                  p log p
Dynamic tree     2 log p    1                 1                  2(p − 1)
Cache Coherence in Multiprocessor Systems

• Interconnects provide basic mechanisms for data transfer.


• In the case of shared address space machines, additional
hardware is required to coordinate access to data that might
have multiple copies in the network.
• The underlying technique must provide some guarantees on
the semantics.
• This guarantee is generally one of serializability, i.e., there
exists some serial order of instruction execution that
corresponds to the parallel schedule.
Cache Coherence in Multiprocessor Systems
When the value of a variable is changed, all its copies must either be invalidated or updated.

Cache coherence in multiprocessor systems:


(a) Invalidate protocol; (b) Update protocol for shared variables.
Cache Coherence: Update and Invalidate
Protocols
• If a processor just reads a value once and does not need it
again, an update protocol may generate significant overhead.
• If two processors make interleaved test and updates to a
variable, an update protocol is better.
• Both protocols suffer from false-sharing overheads: two words that are not shared by the processors nevertheless lie on the same cache line, so coherence traffic is generated even though no data is actually shared.
• In an invalidate protocol, when a processor updates its part of the
cache-line, the other copies of this line are invalidated. When other
processors try to update their parts of the cache-line, the line must
actually be fetched from the remote processor.
• Most current machines use invalidate protocols.
Maintaining Coherence Using Invalidate
Protocols
• Each copy of a data item is associated with a state.
• One example of such a set of states is shared, invalid, and dirty.
• In shared state, there are multiple valid copies of the data
item (and therefore, an invalidate would have to be generated
on an update).
• In dirty state, only one copy exists and therefore, no
invalidates need to be generated.
• In invalid state, the data copy is invalid, therefore, a read
generates a data request (and associated state changes).
Maintaining Coherence Using Invalidate
Protocols
• Initially the variable x resides in the global
memory.
• The first step executed by both processors is a
load operation on this variable.
• At this point, the state of the variable is said to be
shared.
• When processor P0 executes a store on this
variable, it marks all other copies of this variable
as invalid.
• It must also mark its own copy as modified or
dirty.
• This is done to ensure that all subsequent
accesses to this variable at other processors will
be serviced by processor P0 and not from the
memory.
Maintaining Coherence Using Invalidate
Protocols
• At this point, say, processor P1 executes
another load operation on x .
• Processor P1 attempts to fetch this
variable and, since the variable was
marked dirty by processor P0, processor
P0 services the request.
• Copies of this variable at processor P1
and the global memory are updated and
the variable re-enters the shared state.
• Thus, in this simple model, there are
three states - shared, invalid, and dirty -
that a cache line goes through.
Maintaining Coherence Using Invalidate Protocols

The solid lines depict processor actions and the dashed lines coherence actions. When a processor executes a read on an invalid block, the block is fetched and a transition is made from invalid to shared. Similarly, if a processor does a write on a shared block, the coherence protocol propagates a C_write (a coherence write) on the block. This triggers a transition from shared to invalid at all the other copies of the block.

State diagram of a simple three-state coherence protocol.
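
The state diagram can be sketched as a small simulation (illustrative Python; the class and method names are assumptions, not the lecture's notation):

```python
# Sketch of the three-state (shared/dirty/invalid) invalidate protocol.

class CacheLine:
    def __init__(self):
        self.state = "invalid"

    def read(self, others):
        if self.state == "invalid":
            # Fetch the block; a remote dirty copy services the request
            # and re-enters the shared state after writing back.
            for o in others:
                if o.state == "dirty":
                    o.state = "shared"
            self.state = "shared"

    def write(self, others):
        # C_write: invalidate every other copy, mark our copy dirty.
        for o in others:
            o.state = "invalid"
        self.state = "dirty"

p0, p1 = CacheLine(), CacheLine()
p0.read([p1]); p1.read([p0])   # both load x: both shared
p0.write([p1])                 # P0 stores x: P0 dirty, P1 invalid
p1.read([p0])                  # P1 loads x again: both shared
print(p0.state, p1.state)      # shared shared
```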
Maintaining Coherence Using Invalidate
Protocols
Example:
• Consider an example of two program segments being executed by processors P0 and P1.
• The system consists of local memories (or caches) at processors P0 and P1, and a global memory.
• The three-state protocol assumed in this example corresponds to the state diagram given on the previous slide.
• Cache lines in this system can be either shared, invalid, or dirty. Each data item (variable) is assumed to be on a different cache line.
• Initially, the two variables x and y are tagged dirty and the only copies of these variables exist in the global memory.
Maintaining Coherence Using Invalidate
Protocols

Example of parallel program execution with the simple


three-state coherence protocol.
Snoopy systems & Directory based systems

• The implementation of coherence protocols can be carried out


using a variety of hardware mechanisms – snoopy systems,
directory based systems, or combinations thereof.
Snoopy Cache Systems
Snoopy caches are typically associated with multiprocessor
systems based on broadcast interconnection networks such as a
bus or a ring.
In such systems, all processors snoop on (monitor) the bus for
transactions. This allows the processor to make state transitions
for its cache-blocks.

A simple snoopy bus based cache coherence system.


Snoopy Cache Systems
How are invalidates sent to the right processors?

In snoopy caches, all invalidates and read requests are placed on a broadcast medium; each processor snoops on this medium and performs the appropriate coherence operations locally.


Snoopy Cache Systems

• When the snoop hardware detects that a read has been


issued to a cache block that it has a dirty copy of, it
asserts control of the bus and puts the data out.
• Similarly, when the snoop hardware detects that a write
operation has been issued on a cache block that it has a
copy of, it invalidates the block.
• Other state transitions are made in this fashion locally.
Performance of Snoopy Caches

• Once copies of data are tagged dirty, all subsequent


operations can be performed locally on the cache without
generating external traffic.
• If a data item is read by a number of processors, it transitions
to the shared state in the cache and all subsequent read
operations become local.
• If processors read and update data at the same time, they
generate coherence requests on the bus - which is ultimately
bandwidth limited.
Directory Based Systems

• In snoopy caches, each coherence operation is sent to all


processors. This is an inherent limitation.
• Why not send coherence requests to only those processors
that need to be notified?
• This is done using a directory, which maintains a presence
vector for each data item (cache line) along with its global
state.
Directory Based Systems

Architecture of typical directory based systems:


(a) a centralized directory; and (b) a distributed directory.
Directory Based Systems

• When P0 and P1 access the variable x , the state of the block is


changed to shared, and the presence bits updated to indicate that
processors P0 and P1 share the block.
• When P0 executes a store on x, the state in the directory is changed
to dirty and the presence bit of P1 is reset.
• All subsequent operations on this variable performed at P0 can
proceed locally.
Directory Based Systems

• If another processor reads the value, the directory notices the dirty
tag and uses the presence bits to direct the request to the
appropriate processor.
• P0 updates the block in the memory, and sends it to the requesting
processor. The presence bits are modified to reflect this and the
state transitions to shared.
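
The sequence just described can be sketched as follows (illustrative Python; DirectoryEntry and its fields are assumed names, not from the lecture):

```python
# Sketch of a directory entry holding a state and a presence vector.

class DirectoryEntry:
    def __init__(self, num_procs):
        self.state = "uncached"
        self.presence = [False] * num_procs   # one bit per processor

    def read(self, proc):
        if self.state == "dirty":
            # Use the presence bits to find the owner and fetch the block.
            owner = self.presence.index(True)
            print(f"forward read to owner P{owner}; block written back")
        self.presence[proc] = True
        self.state = "shared"

    def write(self, proc):
        # Invalidate every other sharer named in the presence vector.
        for q, present in enumerate(self.presence):
            if present and q != proc:
                print(f"invalidate copy at P{q}")
                self.presence[q] = False
        self.presence[proc] = True
        self.state = "dirty"

x = DirectoryEntry(num_procs=4)
x.read(0); x.read(1)   # P0 and P1 share x
x.write(0)             # P0 stores: P1 invalidated, state -> dirty
x.read(2)              # P2 reads: request forwarded to P0, state -> shared
```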
Performance of Directory Based Schemes

• The need for a broadcast media is replaced by the directory.


• The additional bits to store the directory may add significant
overhead.
• From the point of view of cost, the amount of memory required to
store the directory may itself become a bottleneck as the number of
processors increases.
• The directory size grows as O(mp), where m is the number of
memory blocks and p the number of processors.
• The underlying network must be able to carry all the coherence
requests.
• The directory is a point of contention; therefore, distributed directory schemes must be used.
Distributed Directory Schemes

• In scalable architectures, memory is physically distributed across


processors.
• The corresponding presence bits of the blocks are also distributed.
• Each processor is responsible for maintaining the coherence of its own
memory blocks.
• When a processor attempts to read a block for the first time, it requests
the owner for the block. The owner suitably directs this request based
on presence and state information locally available.
• Similarly, when a processor writes into a memory block, it propagates
an invalidate to the owner, which in turn forwards the invalidate to all
processors that have a cached copy of the block.
• Communication overhead associated with state update messages is not
reduced.
Performance of Distributed Directory
Schemes
• Distributed directories permit O(p) simultaneous coherence
operations, provided the underlying network can sustain the
associated state update messages.
• From this point of view, distributed directories are inherently more
scalable than snoopy systems or centralized directory systems.
• The latency and bandwidth of the network become fundamental
performance bottlenecks for such systems.
Exercise

• A cycle in a graph is defined as a path originating and


terminating at the same node. The length of a cycle is
the number of edges in the cycle. Show that there are no
odd-length cycles in a d-dimensional hypercube.
Thank You
