PPC Unit 5 Question Bank and Answers

The document discusses the architecture of CC-NUMA and vector-parallel machines such as the Cray Y-MP. It also covers memory models in MIMD machines and the butterfly network used in the BBN Butterfly parallel computer.

1. What is CC-NUMA machine?

Ans. Stanford University’s directory architecture for shared memory (DASH) project of
the early 1990s had the goal of building an experimental cache-coherent
multiprocessor. The 64-processor prototype resulting from this project, along with the
associated theoretical developments and performance evaluations, contributed insight
and specific techniques to the design of scalable distributed shared-memory machines.
DASH has a two-level processor-to-memory interconnection structure and a
corresponding two-level cache coherence scheme (Fig. below). Within a cluster of 4–16
processors, access to main memory occurs via a shared bus. Each processor in a cluster
has a private instruction cache, a separate data cache, and a Level-2 cache. The
instruction and data caches use the write-through policy, whereas write-back is the
update policy of the Level-2 cache. The clusters are interconnected by a pair of
wormhole-routed 2D mesh networks: a request mesh, which carries remote memory
access requests, and a reply mesh, which routes data and acknowledgments back to
the requesting cluster. Normally, a processor can access its own cache in one clock
cycle, the caches of processors in the same cluster in a few tens of clock cycles, and
remote data in hundreds of clock cycles. Thus, data access locality, which is the norm in
most applications, leads to better performance. Inside a cluster, cache coherence is
enforced by a snoopy protocol, while across clusters, coherence is maintained by a
write-invalidate directory protocol built on the release consistency model for improved
efficiency. The unit of data sharing is a block or cache line. The directory entry for a
block in the home cluster holds its state (uncached, shared, or dirty) and includes a bit-
vector indicating the presence or absence of the cache line in each cache. Remote
memory accesses, as well as exchanges required to maintain data coherence, are
orchestrated via point-to-point wormhole-routed messages that are sent between
cluster directories over 16-bit-wide channels.
When the required data are not found in the local cluster, an access request is
sent to the cluster holding the home directory, which then initiates appropriate actions
based on the type of request and the state of the requested data. In the case of a read
request, the following will happen:

1. For a shared or uncached block, the data are sent to the requester from the home cluster and the directory entry is updated to include the requester as a sharer (the only sharer, in the uncached case).

2. For a dirty block, a message is sent to the cluster holding the single up-to-date copy.
This remote cluster then sends a shared copy of the block to the requesting cluster
and also performs a sharing write-back to the home cluster.

A write (read-exclusive) request will trigger the following actions by the home
directory, with the end result of supplying the requester with an exclusive copy of the
block and invalidating all other copies, if any:
3. For a shared or uncached block, the data are sent to the requester and the directory
entry is updated to indicate that the block is now dirty. Additionally, for a shared
block, invalidation messages are sent to all caches that hold copies of the block, with
the expectation that they will acknowledge the invalidation to the requester (new
owner).

4. For a dirty block, the request is forwarded to the appropriate cluster. The latter then sends an exclusive copy of the block to the requesting cluster, invalidates its own copy, and informs the home cluster of the ownership transfer.
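The following C sketch illustrates, in schematic form, the home directory's decision logic for the read and read-exclusive cases described above. The data structures, the 32-bit presence vector, and the printed "messages" are assumptions invented for this sketch; it is not the actual DASH protocol implementation.

#include <stdio.h>
#include <stdint.h>

typedef enum { UNCACHED, SHARED, DIRTY } BlockState;

typedef struct {
    BlockState state;    /* uncached, shared, or dirty               */
    uint32_t   sharers;  /* presence bit-vector, one bit per cluster */
    int        owner;    /* owning cluster when the state is DIRTY   */
} DirEntry;

static void send(const char *msg, int cluster) {
    printf("  -> %s to cluster %d\n", msg, cluster);
}

/* Read request from cluster req arriving at the home directory. */
void handle_read(DirEntry *d, int req) {
    if (d->state == DIRTY) {
        /* Owner supplies a shared copy to the requester and performs a
           sharing write-back to the home cluster (case 2 above). */
        send("forwarded read request", d->owner);
        d->state    = SHARED;
        d->sharers |= (1u << d->owner) | (1u << req);
    } else {  /* SHARED or UNCACHED: home supplies the data (case 1 above) */
        send("data reply from home", req);
        d->state    = SHARED;
        d->sharers |= 1u << req;
    }
}

/* Write (read-exclusive) request from cluster req. */
void handle_write(DirEntry *d, int req) {
    if (d->state == DIRTY) {
        /* Owner supplies an exclusive copy to the requester (case 4 above). */
        send("forwarded read-exclusive request", d->owner);
    } else {
        send("data reply from home", req);
        /* Invalidate all other sharers; acks go to the requester (case 3 above). */
        for (int c = 0; c < 32; c++)
            if (((d->sharers >> c) & 1u) && c != req)
                send("invalidate (ack to requester)", c);
    }
    d->state   = DIRTY;
    d->sharers = 1u << req;
    d->owner   = req;
}

int main(void) {
    DirEntry d = { UNCACHED, 0, -1 };
    puts("cluster 2 reads:");  handle_read(&d, 2);
    puts("cluster 5 reads:");  handle_read(&d, 5);
    puts("cluster 7 writes:"); handle_write(&d, 7);
    puts("cluster 2 reads:");  handle_read(&d, 2);
    return 0;
}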

2. Describe the architecture of vector parallel Cray Y-MP machine.

Ans: The Cray Y-MP series of vector-parallel computers was introduced in the late 1980s, following several earlier Cray vector supercomputers, including the Cray-1, Cray-2, and Cray X-MP. Subsequently, the Cray C-90 series of machines was introduced as an enhanced and scaled-up version of the Y-MP. The Cray Y-MP consisted of a relatively
small number (up to eight) of very powerful vector processors. A vector processor
essentially executes one instruction on a large number of data items with a great deal
of overlap. Such vector processors can thus be viewed as time-multiplexed
implementations of SIMD parallel processing. With this view, the Cray Y-MP, and more
generally vector-parallel machines, should be classified as hybrid SIMD/MIMD
machines.
Figure below shows the Cray Y-MP processor and its links to the central memory and inter-processor communication network. Each processor has four ports to access central memory, with each port capable of delivering 128 bits per clock cycle (4 ns).
Thus, a CPU can fetch two operands (a vector element and a scalar), store one value,
and perform I/O simultaneously. The computation section of the CPU is divided into
four subsystems as follows:

1. Vector integer operations are performed by separate function units for add/subtract,
shift, logic, and bit-counting (e.g., determining the weight or parity of a word).

2. Vector floating-point operations are performed by separate function units for add/subtract, multiply, and reciprocal approximation. The latter function unit is used in the first step of a division operation x/y: the approximation to 1/y that it provides is refined in a few iterations to derive an accurate value for 1/y, which is multiplied by x in the final step to complete the division operation (a numerical sketch of this scheme appears after this list).

3. Scalar integer operations are performed by separate integer function units for
addition/subtraction, shift, logic, and bit-counting.

4. The add/subtract and multiply operations needed in address computations are performed by separate function units within an address subsystem that also has two sets of eight address registers (these 32-bit address registers and their associated function units are not shown in Fig. above).
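The reciprocal-approximation scheme of item 2 can be sketched numerically in C. Newton-Raphson refinement (r <- r*(2 - y*r)) is one standard way to carry out the iterative refinement; the low-precision seed and the iteration count below are illustrative assumptions, not the actual Cray hardware algorithm.

#include <stdio.h>

double divide_via_reciprocal(double x, double y) {
    /* A low-precision reciprocal stands in for the hardware's rough 1/y estimate. */
    double r = (double)(1.0f / (float)y);
    for (int i = 0; i < 3; i++)      /* each step roughly doubles the number of accurate bits */
        r = r * (2.0 - y * r);
    return x * r;                    /* the final multiply completes x/y */
}

int main(void) {
    printf("%.15f\n", divide_via_reciprocal(355.0, 113.0));  /* approx. 3.14159292 */
    return 0;
}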

The eight vector registers, each holding a vector of length 64 or a segment of a longer
vector, allow computation and memory accesses to be overlapped. As new data are
being loaded into two registers and emptied from a third one, other vector registers can
supply the operands and receive the results of vector instructions. Vector function units
can be chained to allow the next data-dependent vector computation to begin before
the current one has stored all of its results in a vector register. For example, a vector
multiply–add operation can be done by chaining of the floating-point multiply and add
units. This will cause the add unit to begin its vector operation as soon as the multiply
unit has deposited its first result in a vector register.
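A back-of-the-envelope C sketch shows why chaining pays off. The functional-unit latencies below are assumed values for illustration only, not actual Y-MP figures; the point is that without chaining the add unit waits for the entire product vector, whereas with chaining it starts as soon as the first product element appears.

#include <stdio.h>

int main(void) {
    int n = 64;             /* vector length (one vector register)         */
    int mul_latency = 7;    /* assumed pipeline depth of the multiply unit */
    int add_latency = 6;    /* assumed pipeline depth of the add unit      */

    int unchained = (mul_latency + n) + (add_latency + n);  /* add waits for all products          */
    int chained   = mul_latency + add_latency + n;          /* add consumes results as they appear */

    printf("multiply-add on %d elements without chaining: %d cycles\n", n, unchained);
    printf("multiply-add on %d elements with chaining:    %d cycles\n", n, chained);
    return 0;
}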

3. Explain various memory models of MIMD machine.


Ans:
On MIMD machines, a method is required to allow the independent processors to
exchange data when needed. In the early days of parallel computing, each
manufacturer would produce a series of communication routines that could be called
from within a program to carry out message passing tasks. Nowadays, the situation has
dramatically improved with the implementation of standard message passing routines
across a range of platforms. This allows parallel programs to be run without significant changes on everything from workstation clusters to large-scale parallel supercomputers. The two most common (and useful) interfaces are PVM (Parallel Virtual Machine) and MPI (Message Passing Interface), both of which work with Fortran or C.
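As a minimal illustration (in C, one of the languages mentioned above), the following MPI program has process 0 send an integer to process 1. It is a generic sketch, typically compiled with an MPI wrapper such as mpicc and run with mpirun -np 2; details vary by platform.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I? */

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);             /* to rank 1, tag 0 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}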

4. Explain MIN-based butterfly network with the help of a diagram.

Ans: The Butterfly parallel processor of Bolt, Beranek, and Newman became available
in 1983. It is a general-purpose parallel computer that is particularly suitable for signal
processing applications. The BBN Butterfly was built of 2–256 nodes (boards), each
holding an MC68000 processor with up to 4 MB of memory, interconnected by a 4-ary
wrapped butterfly network. Typical memory referencing instructions took 2 μs to
execute when they accessed local memory, while remote accesses required 6 μs. The
relatively small difference between the latencies of local and remote memory accesses
leads us to classify the BBN Butterfly as a UMA machine. The structure of each node is
shown in Fig. below. A microcoded processor node controller (PNC) is responsible for
initiating all messages sent over the switch and for receiving messages from it. It also
handles all memory access requests, using the memory management unit for
translating virtual addresses to physical addresses. PNC also augments the
functionality of the main processor in performing operations needed for parallel
processing (such as test-and-set, queuing, and scheduling), easily enforcing the
atomicity requirements in view of its sole control of memory.
The wrapped 4-ary butterfly network of the BBN Butterfly required four stages
of 4×4 bit-serial switches, implemented as custom VLSI chips, to accommodate the
largest 256-processor configuration. A small, 16-node version of the network is
depicted in Fig. below. Routing through the network was done by attaching the binary
destination address as a routing tag to the head of a packet, with each switch using and
discarding 2 bits of this tag.
For example, to send a message to Node 9 = (1001)₂ in Fig. below, the least-significant 2 bits would be used to select the second output of the switch at the first level, and the most-significant 2 bits would indicate the choice of the third output in the second-level switch. In typical applications, message collision did not present any
problem and the latency for remote memory accesses was dominated by the bit-serial
transmission time through the network. Because the probability of some switch failing
increases with the network size, BBN Butterfly systems with more than 16 processing
nodes were configured, through the inclusion of extra switches, to have redundant
paths. Besides improving the reliability, these redundant paths also offered
performance benefits by reducing message collisions.
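The destination-tag routing just described can be sketched in a few lines of C. The sketch assumes, as in the text, that each 4×4 switch consumes 2 bits of the destination address, least-significant digit first, and it numbers output ports from 0 (so "port 1" is the second output).

#include <stdio.h>

int main(void) {
    int dest   = 9;   /* node 9 = (1001) in binary                            */
    int stages = 2;   /* 16-node, 4-ary butterfly: two levels of 4x4 switches */

    for (int s = 0; s < stages; s++) {
        int port = (dest >> (2 * s)) & 3;   /* next 2-bit digit of the routing tag */
        printf("stage %d switch: take output port %d\n", s, port);
    }
    return 0;
}

For destination 9 this prints port 1 at the first stage and port 2 at the second, matching the second and third outputs mentioned above.
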
5. Explain with the help of a diagram:
a) UMA
b) NUMA
c) CC-NUMA
d) COMA

Ans: Shared-memory implementations vary greatly in the hardware architecture that they use and in the programming model (logical user view) that they support. With
respect to hardware architecture, shared-memory implementations can be classified
according to the placement of the main memory modules within the system (central or
distributed) and whether or not multiple copies of modifiable data are allowed to
coexist (single- or multiple-copy). The resulting four-way classification is depicted in
Fig. below.

With a central main memory, access to all memory addresses takes the same amount
of time, leading to the designation uniform memory access (UMA). In such machines,
data distribution among the main memory modules is important only to the extent that
it leads to more efficient conflict-free parallel access to data items that are likely to be
needed in succession. If multiple copies of modifiable data are to be maintained within
processor caches in a UMA system, then cache coherence becomes an issue and we
have the class of CC-UMA systems. Simple UMA has been used more widely in practice.
An early, and highly influential, system of this type was Carnegie-Mellon University’s C.mmp, built of 16 PDP-11 minicomputers in the mid-1970s. It had
both a crossbar and a bus for inter-processor communication via shared variables or
message passing. When memory is distributed among processing nodes, access to
locations in the global address space will involve different delays depending on the
current location of the data. The access delay may range from tens of nanoseconds for locally available data, through somewhat higher values for data in nearby nodes, to several microseconds for data located in distant nodes. This variance of
access delay has led to the designation nonuniform memory access (NUMA).

6. Explain Interconnection network for Message Passing Scheme.

Ans: When more than one processor needs to access a memory structure,
interconnection networks are needed to route data:
• from processors to memories (concurrent access to a shared memory structure), or
• from one PE (processor + memory) to another (to provide a message-passing
facility).

Interconnection networks for message passing are of three basic types:

1. Shared-medium networks. Only one of the units linked to a shared-medium network is allowed to use it at any given time. Nodes connected to the network typically have
request, drive, and receive circuits. Given the single-user requirement, an arbitration
mechanism is needed to decide which one of the requesting nodes can use the shared
network. The two most commonly used shared-medium networks are backplane buses
and local area networks (LANs). The bus arbitration mechanism is different for
synchronous and asynchronous buses. In bus transactions that involve a request and a
response, a split-transaction protocol is often used so that other nodes can use the bus
while the request of one node is being processed at the other end. To ease the
congestion on a shared bus, multiple buses or hierarchical bus networks may be used.
For LANs, the most commonly used arbitration protocol is based on contention: the nodes detect the idle/busy state of the shared medium, transmit when they observe the idle state, and consider the transmission to have failed when they detect a “collision” (a toy sketch of this contention scheme appears after this list). Token-based protocols, which implement some form of rotating priority, are also used.

2. Router-based networks. Such networks, also known as direct networks, are based on
each node (with one or more processors) having a dedicated router that is linked
directly to one or more other routers. The local node(s) connected to the router inject
messages into the network through the injection channel and remove incoming
messages through the ejection channel. Locally injected messages compete with
messages that are passing through the router for the use of output channels. The link
controllers handle interfacing considerations of the physical channels. The queues hold
messages that cannot be forwarded because of contention for the output links. Various
switching strategies (e.g., packet or wormhole) and routing algorithms (e.g., tag-based
or use of routing tables) can be implemented in the router.

3. Switch-based networks. Such networks, also known as indirect networks, are based on
crossbars or regularly interconnected (multistage) networks of simpler switches.
Typically, the communication path between any two nodes goes through one or more
switches. The path to be taken by a message is either predetermined at the source
node and included as part of the message header or else it is computed on the fly at
intermediate nodes based on the source and destination addresses. Switch-based
networks can be classified as unidirectional or bidirectional. In unidirectional networks,
each switch port is either input or output, whereas in bidirectional networks, ports can
be used for either input or output. By superimposing two unidirectional networks, one
can build a full-duplex bidirectional network that can route messages in both directions
simultaneously. A bidirectional switch can be used in forward mode, in backward mode,
or in turnaround mode, where in the latter mode, connections are made between
terminals on the same side of the switch.
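The contention-based arbitration described under shared-medium networks (item 1) can be sketched in C as follows. The medium model, probabilities, and back-off limit are invented for illustration; this is not a faithful Ethernet implementation.

#include <stdio.h>
#include <stdlib.h>

/* Toy stand-ins for carrier sensing and collision detection. */
int medium_busy(void)        { return rand() % 4 == 0; }
int collision_detected(void) { return rand() % 3 == 0; }

void send_frame(int node) {
    for (int attempt = 1; ; attempt++) {
        while (medium_busy())                 /* 1. listen until the medium is idle   */
            ;
        printf("node %d transmitting (attempt %d)\n", node, attempt);
        if (!collision_detected())            /* 2. no collision: transmission stands */
            return;
        int cap = attempt < 10 ? attempt : 10;
        int backoff = rand() % (1 << cap);    /* 3. random back-off, then retry       */
        printf("node %d: collision detected, backing off %d slots\n", node, backoff);
    }
}

int main(void) {
    srand(1);
    send_frame(3);
    return 0;
}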

7. Explain Data Parallel SIMD Machine

Ans: Data-parallel SIMD machines occupy a special place in the history of parallel
processing. The first supercomputer ever built was a SIMD machine. Some of the most
cost-effective parallel computers in existence are of the SIMD variety. You can now buy
a SIMD array processor attachment for your personal computer that gives you
supercomputer-level performance on some problems for a workstation price. However,
because SIMD machines are often built from custom components, they have suffered a
few setbacks in recent years. Example SIMD machines range from the pioneering ILLIAC IV, through early massively parallel processors (the Goodyear MPP and DAP), to more recent general-purpose machines (the TMC CM-2 and MasPar MP-2).
Figure below depicts the functional view of an associative memory. There are m
memory cells that store data words, each of which has one or more tag bits for use as
markers. The control unit broadcasts data and commands to all cells. A typical search
instruction has a comparand and a mask word as its parameters. The mask specifies
which bits or fields within the cells are to be searched and the comparand provides the
bit values of interest.
Each cell has comparison logic built in and stores the result of its comparison in
the response or tag store. The tag bits can be included in the search criteria, thus
allowing composite searches to be programmed (e.g., searching only among the cells
that responded or failed to respond to a previous search instruction). Such searches,
along with the capability to read, write, multiwrite (write a value into all cells that have
a particular tag bit set), or perform global tag operations (e.g., detecting the presence
or absence of responders or their multiplicity), allow search operations such as the
following to be effectively programmed:

• Exact-match search: locating data based on partial knowledge of contents
• Inexact-match searches: finding numerically or logically proximate values
• Membership searches: identifying all members of a particular set
• Relational searches: determining values that are less than, less than or equal, and so forth
• Interval searches: marking items that are between limits or not between limits
• Extrema searches: min- or max-finding, next higher, next lower
• Rank-based selection: selecting kth or k largest/smallest elements
• Ordered retrieval: repeated max- or min-finding with elimination (sorting)
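The comparand/mask search described above can be sketched in C as follows: every cell compares its masked contents with the masked comparand and raises its tag bit on a match (the loop stands in for what the hardware does in all cells at once). The cell contents and field layout are made-up illustrations.

#include <stdio.h>
#include <stdint.h>

#define CELLS 8

int main(void) {
    uint16_t cell[CELLS] = { 0x12A4, 0x56A4, 0x9BC1, 0x00A4,
                             0x77F2, 0x30A4, 0x56A5, 0x1111 };
    uint16_t comparand = 0x00A4;   /* bit values of interest               */
    uint16_t mask      = 0x00FF;   /* search only the low-order byte field */
    int tag[CELLS];

    for (int i = 0; i < CELLS; i++) {              /* conceptually done in parallel */
        tag[i] = ((cell[i] ^ comparand) & mask) == 0;
        if (tag[i])
            printf("cell %d responds (contents 0x%04X)\n", i, cell[i]);
    }
    return 0;
}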

Additionally, arithmetic operations, such as computing the global sum or adding two
fields in a subset of AM cells, can be effectively programmed using bit-serial algorithms.
Associative processors (APs) are AMs that have been augmented with more flexible processing logic. From an architectural standpoint, APs can be divided into four classes:

1. Fully parallel (word-parallel, bit-parallel) APs have comparison logic associated with
each bit of stored data. In simple exact-match searches, the logic associated with each
bit generates a local match or mismatch signal. These local signals are then combined
to produce the cell match or mismatch result. In more complicated searches, the bit
logic typically receives partial search results from a neighboring bit position and
generates partial results to be passed on to the next bit position.

2. Bit-serial (word-parallel, bit-serial) systems process an entire bit-slice of data, containing 1 bit of every word, simultaneously, but go through multiple bits of the
search field sequentially. Bit-serial systems have been dominant in practice because
they allow the most cost-effective implementations using low-cost, high-density, off-
the-shelf RAM chips.

3. Word-serial (word-serial, bit-parallel) APs based on electronic circulating memories represent the hardware counterparts of programmed linear search. Even though
several such systems were built in the 1960s, they do not appear to be cost-effective
with today’s technology.

4. Block-oriented (block-parallel, word-serial, bit/byte-serial) systems represent a compromise between bit-serial and word-serial systems in an effort to make large
systems practically realizable. Some block-oriented AP systems are based on
augmenting the read/write logic associated with each head of a head-per-track disk so
that it can search the track contents as they pass underneath. Such a mechanism can
act as a filter between the database and a fast sequential computer or as a special-
purpose database search engine.

8. Explain Processor and memory technologies

Ans: Commodity microprocessors are improving in performance at an astonishing rate. Over the past two decades, microprocessor clock rates have improved by a factor of 100, from a few megahertz to hundreds of megahertz. Gigahertz processors are not far off. In the same time frame, memory chip capacity has gone up by a factor of 10⁴, from
16 Kb to 256 Mb. Gigabit memory chips are now beginning to appear. Along with
speed, the functionality of microprocessors has also improved drastically. This is a
direct result of the larger number of transistors that can be accommodated on one chip.
In the past 20 years, the number of transistors on a microprocessor chip has grown by a
factor of 10³, from tens of thousands (Intel 8086) to a few tens of millions (Intel Pentium
Pro). Older microprocessors contained an ALU for integer arithmetic within the basic
CPU chip and a floating-point coprocessor on a separate chip, but increasing VLSI
circuit density has led to the trend of integrating both units on a single microchip, while
still leaving enough room for large on-chip memories (typically used for an instruction
cache, a data cache, and a Level-2 cache). As an example of modern microprocessors,
we briefly describe a member of Intel’s Pentium family of microprocessors: the Intel
Pentium Pro, also known as Intel P6 (Fig. below).

The primary design goal for the Intel P6 was to achieve the highest possible
performance, while keeping the external appearances compatible with the Pentium
and using the same mass production technology. The Intel P6 has a 32-bit architecture,
internally using a 64-bit data bus, 36-bit addresses, and an 86-bit floating-point format.
In the terminology of modern microprocessors, P6 is superscalar and superpipelined:
superscalar because it can execute multiple independent instructions concurrently in its
many functional units; superpipelined because its instruction execution pipeline with
14+ stages is very deep. The Intel P6 is capable of glueless multiprocessing with up to
four processors, operates at 150–200 MHz, and has 21M
transistors, roughly one-fourth of which are for the CPU and the rest for the on-chip
cache memory. Because high performance in the Intel P6 is gained by out-of-order and
speculative instruction execution, a key component in the design is a reservation
station that is essentially a hardware-level scheduler of micro-operations.

Each instruction is converted to one or more micro-operations, which are then executed in arbitrary order whenever their required operands are available. The result
of a micro-operation is sent to both the reservation station and a special unit called the
reorder buffer. The latter unit is responsible for making sure that program execution
remains consistent by committing the results of micro-operations. There is a full
crossbar between all five ports of the reservation station so that any returning result
can be forwarded directly to any other unit for the next clock cycle. Fetching, decoding,
and setting up the components of an instruction in the reservation station takes eight
clock cycles and is performed as an eight-stage pipelined operation. The retirement
process, mentioned above, takes three clock cycles and is also pipelined. Sandwiched
between the above two pipelines is a variable-length pipeline for instruction execution.
For this middle part of instruction execution, the reservation station needs two cycles
to ascertain that the operands are available and to schedule the micro-operation on an
appropriate unit.
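The interplay of the reservation station and the reorder buffer can be caricatured in C: micro-operations execute in arbitrary order once their operands are ready, while results are committed strictly in program order. The micro-op list, dependencies, and single-cycle timing are assumptions for this sketch, not the actual P6 design.

#include <stdbool.h>
#include <stdio.h>

#define NUOPS 4

typedef struct {
    const char *name;
    int  src1, src2;   /* indices of producing micro-ops, -1 if none */
    bool executed;     /* result produced (possibly out of order)    */
} MicroOp;

int main(void) {
    /* micro-op 2 depends on 0 and 1; micro-op 3 depends on 2 and 0 */
    MicroOp rs[NUOPS] = {
        { "load r1",       -1, -1, false },
        { "load r2",       -1, -1, false },
        { "mul  r3,r1,r2",  0,  1, false },
        { "add  r4,r3,r1",  2,  0, false },
    };

    int committed = 0, cycle = 0;
    while (committed < NUOPS) {
        cycle++;
        /* Reservation station: any micro-op whose operands are available may
           execute this cycle, regardless of program order. */
        for (int i = NUOPS - 1; i >= 0; i--) {
            bool ready = !rs[i].executed
                      && (rs[i].src1 < 0 || rs[rs[i].src1].executed)
                      && (rs[i].src2 < 0 || rs[rs[i].src2].executed);
            if (ready) {
                rs[i].executed = true;
                printf("cycle %d: executed  %s\n", cycle, rs[i].name);
            }
        }
        /* Reorder buffer: results are committed strictly in program order. */
        while (committed < NUOPS && rs[committed].executed) {
            printf("cycle %d: committed %s\n", cycle, rs[committed].name);
            committed++;
        }
    }
    return 0;
}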

The operation itself takes one cycle for register-to-register integer add and
longer for more complex functions. The multiplicity of functional units with different
latencies is why out-of-order and speculative execution (e.g., branch prediction) are
crucial to high performance. With a great deal of functionality plus on-chip memory
already available, a natural question relates to the way in which additional transistors
might be utilized. One alternative is to build multiple processors on the same chip.
Custom microchips housing several simple processors have long been used in the
design of (massively) parallel computers. Commercially available SIMD parallel systems
of the late 1980s already contained tens of bit-serial processors on each chip and more
recent products offer hundreds of such processors per chip (thousands on one PC
board). Microchips containing multiple general-purpose processors and associated
memory constitute a plausible way of utilizing the higher densities that are becoming
available to us. From past experience with parallel computers requiring custom chips, it
appears that custom chip development for one or a few parallel computers will not be
economically viable. Instead, off-the-shelf components will likely become available as
standard building blocks for parallel systems. No matter how many processors we can
put on one chip, the demand for greater performance, created by novel applications or
larger-scale versions of existing ones, will sustain the need for integrating multiple
chips into systems with even higher levels of parallelism.

With tens to tens of thousands of processors afforded by billion-transistor chips, small-scale parallel systems utilizing powerful general-purpose processors, as well as
multi-million-processor massively parallel systems, will become not only realizable but
also quite cost-effective. Fortunately, the issues involved in the design of single-chip
multiprocessors and massively parallel systems, as well as their use in synthesizing
larger parallel systems, are no different from the current problems facing parallel
computer designers. Given that interconnects have already become the limiting factor,
regardless of the number of processors on a chip, we need to rely on multilevel
hierarchical or recursive architectures.

9. Explain coarse grain and fine grain:

Ans: Depending on the complexity of processing nodes, three categories of message-passing MIMD computers can be distinguished:
1. Coarse-grain parallelism. Processing nodes are complete (perhaps large, multi-board)
computers that work on sizable sub-problems, such as complete programs or tasks, and
communicate or synchronize with each other rather infrequently.

2. Medium-grain parallelism. Processing nodes might be based on standard micros that execute smaller chunks of the application program (e.g., subtasks, processes, threads)
and that communicate or synchronize with greater frequency.

3. Fine-grain parallelism. Processing nodes might be standard micros or custom-built processing elements (perhaps with multiple PEs fitting on one chip) that execute small
pieces of the application and need constant communication or synchronization.

10. What is Shared Medium network?

Ans: A shared-medium network is a local area network (LAN) that shares its total available bandwidth among all transmitting stations. Ethernet is the primary example, although Token Ring and FDDI networks were earlier examples. In the past, when shared-medium LANs ran out of capacity to serve their users effectively, they were upgraded by replacing the network hubs with switches.
