PPC Unit 5 Question Bank and Answers
Ans: Stanford University’s directory architecture for shared memory (DASH) project of
the early 1990s had the goal of building an experimental cache-coherent
multiprocessor. The 64-processor prototype resulting from this project, along with the
associated theoretical developments and performance evaluations, contributed insight
and specific techniques to the design of scalable distributed shared-memory machines.
DASH has a two-level processor-to-memory interconnection structure and a
corresponding two-level cache coherence scheme (Fig. below). Within a cluster of 4–16
processors, access to main memory occurs via a shared bus. Each processor in a cluster
has a private instruction cache, a separate data cache, and a Level-2 cache. The
instruction and data caches use the write-through policy, whereas write-back is the
update policy of the Level-2 cache. The clusters are interconnected by a pair of
wormhole-routed 2D mesh networks: a request mesh, which carries remote memory
access requests, and a reply mesh, which routes data and acknowledgments back to
the requesting cluster. Normally, a processor can access its own cache in one clock
cycle, the caches of processors in the same cluster in a few tens of clock cycles, and
remote data in hundreds of clock cycles. Thus, data access locality, which is the norm in
most applications, leads to better performance. Inside a cluster, cache coherence is
enforced by a snoopy protocol, while across clusters, coherence is maintained by a
write-invalidate directory protocol built on the release consistency model for improved
efficiency. The unit of data sharing is a block or cache line. The directory entry for a
block in the home cluster holds its state (uncached, shared, or dirty) and includes a
bit-vector indicating the presence or absence of the cache line in each cache. Remote
memory accesses, as well as exchanges required to maintain data coherence, are
orchestrated via point-to-point wormhole-routed messages that are sent between
cluster directories over 16-bit-wide channels.
When the required data are not found in the local cluster, an access request is
sent to the cluster holding the home directory, which then initiates appropriate actions
based on the type of request and the state of the requested data. In the case of a read
request, the following will happen:
1. For a shared or uncached block, the data are sent to the requester from the home
cluster and the directory entry is updated to include the new (only) sharer.
2. For a dirty block, a message is sent to the cluster holding the single up-to-date copy.
This remote cluster then sends a shared copy of the block to the requesting cluster
and also performs a sharing write-back to the home cluster.
A write (read-exclusive) request will trigger the following actions by the home
directory, with the end result of supplying the requester with an exclusive copy of the
block and invalidating all other copies, if any:
1. For a shared or uncached block, the data are sent to the requester and the directory
entry is updated to indicate that the block is now dirty. Additionally, for a shared
block, invalidation messages are sent to all caches that hold copies of the block, with
the expectation that they will acknowledge the invalidation to the requester (new
owner).
2. For a dirty block, the request is forwarded to the appropriate cluster. The latter then
sends an exclusive copy of the block to the requesting cluster and notifies the home
cluster of the ownership change, so that the directory entry points to the new owner.
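The read and write cases above can be captured compactly in code. The following Python sketch models the home directory's handling of read and read-exclusive requests, using the three block states and the presence bit-vector described in the text; the message tuples and function names are illustrative assumptions, not the actual DASH implementation.

```python
# Minimal sketch of DASH-style home-directory actions. States and the
# presence bit-vector follow the text; messages are (type, from, to)
# tuples, a simplification for illustration only.

UNCACHED, SHARED, DIRTY = "uncached", "shared", "dirty"

class DirectoryEntry:
    def __init__(self, num_clusters):
        self.state = UNCACHED
        self.presence = [False] * num_clusters  # one bit per cluster

def handle_read(entry, requester):
    """Home-directory actions for a read request."""
    if entry.state in (UNCACHED, SHARED):
        # Case 1: supply data from home; record the new sharer.
        entry.state = SHARED
        entry.presence[requester] = True
        return [("data_reply", "home", requester)]
    # Case 2: forward to the dirty cluster, which sends a shared copy
    # to the requester and a sharing write-back to home.
    owner = entry.presence.index(True)
    entry.state = SHARED
    entry.presence[requester] = True
    return [("forward_read", "home", owner),
            ("shared_copy", owner, requester),
            ("sharing_writeback", owner, "home")]

def handle_read_exclusive(entry, requester):
    """Home-directory actions for a write (read-exclusive) request."""
    msgs = []
    if entry.state in (UNCACHED, SHARED):
        # Case 1: send data; any sharers are invalidated and acknowledge
        # the invalidation directly to the requester (the new owner).
        msgs.append(("exclusive_data", "home", requester))
        for c, present in enumerate(entry.presence):
            if present and c != requester:
                msgs += [("invalidate", "home", c),
                         ("inval_ack", c, requester)]
    else:
        # Case 2: forward to the current owner, which hands over an
        # exclusive copy and notifies home of the ownership change.
        owner = entry.presence.index(True)
        msgs += [("forward_rdex", "home", owner),
                 ("exclusive_copy", owner, requester),
                 ("ownership_transfer", owner, "home")]
    entry.state = DIRTY
    entry.presence = [c == requester for c in range(len(entry.presence))]
    return msgs
```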
Ans: The Cray Y-MP series of vector-parallel computers were introduced in the late
1980s, following several previous Cray vector supercomputers including Cray-1, Cray-2,
and Cray X-MP. Subsequently, the Cray C-90 series of machines were introduced as
enhanced and scaled-up versions of the Y-MP. The Cray Y-MP consisted of a relatively
small number (up to eight) of very powerful vector processors. A vector processor
essentially executes one instruction on a large number of data items with a great deal
of overlap. Such vector processors can thus be viewed as time-multiplexed
implementations of SIMD parallel processing. With this view, the Cray Y-MP, and more
generally vector-parallel machines, should be classified as hybrid SIMD/MIMD
machines.
Figure below shows the Cray Y-MP processor and its links to the central memory
and interprocessor communication network. Each processor has four ports to access
central memory, with each port capable of delivering 128 bits per clock cycle (4 ns).
Thus, a CPU can fetch two operands (a vector element and a scalar), store one value,
and perform I/O simultaneously. The computation section of the CPU is divided into
four subsystems as follows:
1. Vector integer operations are performed by separate function units for add/subtract,
shift, logic, and bit-counting (e.g., determining the weight or parity of a word).
2. Floating-point operations, in vector or scalar mode, are performed by add/subtract,
multiply, and reciprocal-approximation function units, the last of these used in
implementing division.
3. Scalar integer operations are performed by separate integer function units for
addition/subtraction, shift, logic, and bit-counting.
4. Memory address calculations are performed by separate address add and address
multiply units.
The eight vector registers, each holding a vector of length 64 or a segment of a longer
vector, allow computation and memory accesses to be overlapped. As new data are
being loaded into two registers and emptied from a third one, other vector registers can
supply the operands and receive the results of vector instructions. Vector function units
can be chained to allow the next data-dependent vector computation to begin before
the current one has stored all of its results in a vector register. For example, a vector
multiply–add operation can be done by chaining of the floating-point multiply and add
units. This will cause the add unit to begin its vector operation as soon as the multiply
unit has deposited its first result in a vector register.
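A rough timing model makes the benefit of chaining concrete. The pipeline latencies below are illustrative assumptions, not actual Y-MP figures; the point is only that chaining overlaps the two pipelines almost completely:

```python
# Back-of-the-envelope model of vector chaining for a multiply-add,
# assuming (hypothetically) a 7-cycle multiply pipeline, a 6-cycle add
# pipeline, and 64-element vectors, with one element entering each
# pipeline per clock cycle.

N = 64           # vector length (one vector register)
MUL, ADD = 7, 6  # assumed pipeline latencies in clock cycles

# Without chaining: the add must wait until the multiply has written
# all N results into a vector register.
unchained = (MUL + N) + (ADD + N)

# With chaining: the add unit starts as soon as the first multiply
# result is deposited in the vector register.
chained = MUL + ADD + N

print(f"without chaining: {unchained} cycles")  # 141
print(f"with chaining:    {chained} cycles")    # 77
```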
Ans: The Butterfly parallel processor of Bolt, Beranek, and Newman became available
in 1983. It is a general-purpose parallel computer that is particularly suitable for signal
processing applications. The BBN Butterfly was built of 2–256 nodes (boards), each
holding an MC68000 processor with up to 4 MB of memory, interconnected by a 4-ary
wrapped butterfly network. Typical memory referencing instructions took 2 μs to
execute when they accessed local memory, while remote accesses required 6 μs. The
relatively small difference between the latencies of local and remote memory accesses
leads us to classify the BBN Butterfly as a UMA machine. The structure of each node is
shown in Fig. below. A microcoded processor node controller (PNC) is responsible for
initiating all messages sent over the switch and for receiving messages from it. It also
handles all memory access requests, using the memory management unit for
translating virtual addresses to physical addresses. PNC also augments the
functionality of the main processor in performing operations needed for parallel
processing (such as test-and-set, queuing, and scheduling), easily enforcing the
atomicity requirements in view of its sole control of memory.
The wrapped 4-ary butterfly network of the BBN Butterfly required four stages
of 4×4 bit-serial switches, implemented as custom VLSI chips, to accommodate the
largest 256-processor configuration. A small, 16-node version of the network is
depicted in Fig. below. Routing through the network was done by attaching the binary
destination address as a routing tag to the head of a packet, with each switch using and
discarding 2 bits of this tag.
For example, to send a message to Node 9 = (1001)₂ in Fig. below, the least-
significant 2 bits would be used to select the second output of the switch at the first
level and the most-significant 2 bits would indicate the choice of the third output in the
second-level switch. In typical applications, message collision did not present any
problem and the latency for remote memory accesses was dominated by the bit-serial
transmission time through the network. Because the probability of some switch failing
increases with the network size, BBN Butterfly systems with more than 16 processing
nodes were configured, through the inclusion of extra switches, to have redundant
paths. Besides improving the reliability, these redundant paths also offered
performance benefits by reducing message collisions.
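The tag-based routing just described is simple enough to express in a few lines. This Python sketch assumes the 2-bits-per-stage scheme of the text and reproduces the Node 9 example:

```python
# Destination-tag routing in a 4-ary butterfly: each 4x4 switch uses
# and discards 2 bits of the routing tag, least-significant bits first,
# as in the Node 9 example above.

def route(dest, levels=2):
    """Return the output port (0-3) chosen at each switch level."""
    ports = []
    tag = dest
    for _ in range(levels):
        ports.append(tag & 0b11)  # switch consumes the low 2 bits...
        tag >>= 2                 # ...and discards them
    return ports

# Node 9 = (1001)2: low bits 01 -> port 1 (second output) at level 1;
# high bits 10 -> port 2 (third output) at level 2.
print(route(9))  # [1, 2]
```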
5. Explain with the help of diagrams:
a) UMA
b) NUMA
c) CC-NUMA
d) COMA
Ans: With a central main memory, access to all memory addresses takes the same amount
of time, leading to the designation uniform memory access (UMA). In such machines,
data distribution among the main memory modules is important only to the extent that
it leads to more efficient conflict-free parallel access to data items that are likely to be
needed in succession. If multiple copies of modifiable data are to be maintained within
processor caches in a UMA system, then cache coherence becomes an issue and we
have the class of CC-UMA systems. Simple UMA has been used more widely in practice.
An early, and highly influential, system of this type was Carnegie-Mellon University’s
C.mmp system that was built of 16 PDP-11 minicomputers in the mid-1970s. It had
both a crossbar and a bus for inter-processor communication via shared variables or
message passing. When memory is distributed among processing nodes, access to
locations in the global address space will involve different delays depending on the
current location of the data. The access delay may range from tens of nanoseconds for
locally available data, through somewhat longer times for data in nearby nodes, to
several microseconds for data located in distant nodes. This variance of access delay
has led to the designation nonuniform memory access (NUMA). If caching of nonlocal
data is allowed and coherence among the cached copies is enforced in hardware, the
result is a cache-coherent NUMA (CC-NUMA) machine; the Stanford DASH, described
earlier, is an example. Finally, in the cache-only memory architecture (COMA), the
memory at each node is organized as a large cache, so that data items have no fixed
home location and automatically migrate to, or are replicated in, the nodes that use
them.
Ans: When more than one processor needs to access a memory structure,
interconnection networks are needed to route data:
• from processors to memories (concurrent access to a shared memory structure), or
• from one PE (processor + memory) to another (to provide a message-passing
facility).
Such networks fall into three broad categories:
1. Shared-medium networks. In such networks, a single transmission medium (e.g., a bus
or ring) is shared by all nodes, so that only one message can be in transit at any given
time; an arbitration mechanism decides which of the contending nodes transmits next.
2. Router-based networks. Such networks, also known as direct networks, are based on
each node (with one or more processors) having a dedicated router that is linked
directly to one or more other routers. The local node(s) connected to the router inject
messages into the network through the injection channel and remove incoming
messages through the ejection channel. Locally injected messages compete with
messages that are passing through the router for the use of output channels. The link
controllers handle interfacing considerations of the physical channels. The queues hold
messages that cannot be forwarded because of contention for the output links. Various
switching strategies (e.g., packet or wormhole) and routing algorithms (e.g., tag-based
or use of routing tables) can be implemented in the router.
3. Switch-based networks. Such networks, also known as indirect networks, are based on
crossbars or regularly interconnected (multistage) networks of simpler switches.
Typically, the communication path between any two nodes goes through one or more
switches. The path to be taken by a message is either predetermined at the source
node and included as part of the message header or else it is computed on the fly at
intermediate nodes based on the source and destination addresses. Switch-based
networks can be classified as unidirectional or bidirectional. In unidirectional networks,
each switch port is either input or output, whereas in bidirectional networks, ports can
be used for either input or output. By superimposing two unidirectional networks, one
can build a full-duplex bidirectional network that can route messages in both directions
simultaneously. A bidirectional switch can be used in forward mode, in backward mode,
or in turnaround mode, where in the latter mode, connections are made between
terminals on the same side of the switch.
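The two path-selection strategies mentioned above can be contrasted in a minimal Python sketch: source routing precomputes the whole path and carries it in the message header, while distributed routing computes one hop at a time from the destination address. The 2-bits-per-stage port selection is borrowed from the butterfly example and is only an assumption about the switch radix:

```python
# Source routing vs. on-the-fly (distributed) routing, simplified.

def source_route(dest, stages):
    """Built once at the source and included in the message header."""
    return [(dest >> (2 * s)) & 0b11 for s in range(stages)]

def next_hop(dest, stage):
    """Computed on the fly at an intermediate switch."""
    return (dest >> (2 * stage)) & 0b11

header = source_route(9, 2)                 # [1, 2], carried with the message
hops = [next_hop(9, s) for s in range(2)]   # recomputed hop by hop
assert header == hops == [1, 2]             # same path either way
```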
Ans: Data-parallel SIMD machines occupy a special place in the history of parallel
processing. The first supercomputer ever built was a SIMD machine. Some of the most
cost-effective parallel computers in existence are of the SIMD variety. You can now buy
a SIMD array processor attachment for your personal computer that gives you
supercomputer-level performance on some problems for a workstation price. However,
because SIMD machines are often built from custom components, they have suffered a
few setbacks in recent years. In this chapter, after reviewing some of the reasons for
these setbacks and evaluating SIMD prospects in the future, we review several example
SIMD machines, from the pioneering ILLIAC IV, through early massively parallel
processors (Goodyear MPP and DAP), to more recent general-purpose machines (TMC
CM-2 and MasPar MP-2).
Figure below depicts the functional view of an associative memory. There are m
memory cells that store data words, each of which has one or more tag bits for use as
markers. The control unit broadcasts data and commands to all cells. A typical search
instruction has a comparand and a mask word as its parameters. The mask specifies
which bits or fields within the cells are to be searched and the comparand provides the
bit values of interest.
Each cell has comparison logic built in and stores the result of its comparison in
the response or tag store. The tag bits can be included in the search criteria, thus
allowing composite searches to be programmed (e.g., searching only among the cells
that responded or failed to respond to a previous search instruction). Such searches,
along with the capability to read, write, multiwrite (write a value into all cells that have
a particular tag bit set), or perform global tag operations (e.g., detecting the presence
or absence of responders or their multiplicity), allow a variety of search operations
(e.g., exact-match, threshold, and extremum searches) to be effectively programmed.
Additionally, arithmetic operations, such as computing the global sum or adding two
fields in a subset of AM cells, can be effectively programmed using bit-serial algorithms.
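A small software model may help make the masked search mechanism concrete. The word width, cell contents, and method names below are illustrative assumptions, but the masked search, the composite search over prior responders, the multiwrite, and the global tag operation mirror the capabilities listed above:

```python
# Sketch of an associative memory with a comparand/mask search and
# per-cell tag (response) bits.

class AssociativeMemory:
    def __init__(self, words):
        self.words = list(words)       # m data words
        self.tags = [0] * len(words)   # one response/tag bit per cell

    def search(self, comparand, mask, among_responders=False):
        """Set each cell's tag if its masked bits match the comparand."""
        for i, w in enumerate(self.words):
            if among_responders and not self.tags[i]:
                continue               # composite search: prior responders only
            self.tags[i] = int((w & mask) == (comparand & mask))

    def multiwrite(self, value, mask):
        """Write the masked field into every cell whose tag bit is set."""
        for i in range(len(self.words)):
            if self.tags[i]:
                self.words[i] = (self.words[i] & ~mask) | (value & mask)

    def any_responders(self):
        return any(self.tags)          # a global tag operation

am = AssociativeMemory([0b1010, 0b1110, 0b0010, 0b1011])
am.search(comparand=0b1010, mask=0b1100)   # match on the high 2 bits only
print(am.tags)                              # [1, 0, 0, 1]
```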
Associative processors (APs) are AMs that have been augmented with more flexible
processing logic. From an architectural standpoint, APs can be divided into four
classes:
1. Fully parallel (word-parallel, bit-parallel) APs have comparison logic associated with
each bit of stored data. In simple exact-match searches, the logic associated with each
bit generates a local match or mismatch signal. These local signals are then combined
to produce the cell match or mismatch result. In more complicated searches, the bit
logic typically receives partial search results from a neighboring bit position and
generates partial results to be passed on to the next bit position.
2. Bit-serial (word-parallel, bit-serial) APs process one bit position of all words in each
cycle, using very simple logic per cell; searches and arithmetic are realized as bit-serial
algorithms.
3. Word-serial (word-serial, bit-parallel) APs examine one word at a time with parallel
comparison logic over its bits, essentially a hardware implementation of a sequential
search loop.
4. Block-oriented APs are compromises between the bit-serial and word-serial extremes,
dividing the memory into blocks and associating a small amount of search logic with
each block.
Ans: The primary design goal for the Intel P6 was to achieve the highest possible
performance, while keeping the external appearances compatible with the Pentium
and using the same mass production technology. The Intel P6 has a 32-bit architecture,
internally using a 64-bit data bus, 36-bit addresses, and an 86-bit floating-point format.
In the terminology of modern microprocessors, P6 is superscalar and superpipelined:
superscalar because it can execute multiple independent instructions concurrently in its
many functional units; superpipelined because its instruction execution pipeline with
14+ stages is very deep. The Intel P6 is capable of glueless multiprocessing with up to
four processors, operates at 150–200 MHz, and has 21M
transistors, roughly one-fourth of which are for the CPU and the rest for the on-chip
cache memory. Because high performance in the Intel P6 is gained by out-of-order and
speculative instruction execution, a key component in the design is a reservation
station that is essentially a hardware-level scheduler of micro-operations.
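The role of the reservation station can be suggested with a highly simplified sketch. Nothing below is P6-specific; it is a generic model, under assumed names and structures, of buffering micro-operations until their operands are ready and dispatching them out of program order:

```python
# Generic reservation-station sketch: hold micro-ops until all source
# operands are available, then issue them, possibly out of order.

from dataclasses import dataclass

@dataclass
class MicroOp:
    op: str
    srcs: list   # source register names
    dst: str     # destination register name

ready_regs = {"r1", "r2", "r5"}   # registers whose values are available
station: list[MicroOp] = []       # pending micro-ops, in program order

def dispatch():
    """Issue every micro-op whose operands are all ready (out of order)."""
    issued = []
    for uop in list(station):
        if all(s in ready_regs for s in uop.srcs):
            station.remove(uop)
            issued.append(uop)
            ready_regs.add(uop.dst)   # models forwarding the result to waiters
    return issued

station += [MicroOp("add", ["r1", "r2"], "r3"),
            MicroOp("mul", ["r3", "r4"], "r6"),   # waits on r3 and r4
            MicroOp("sub", ["r5", "r1"], "r7")]   # independent: can go early

print([u.op for u in dispatch()])  # ['add', 'sub'] -- 'mul' still waits on r4
```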
The execution of a micro-operation itself takes one cycle for a register-to-register integer add and
longer for more complex functions. The multiplicity of functional units with different
latencies is why out-of-order and speculative execution (e.g., branch prediction) are
crucial to high performance. With a great deal of functionality plus on-chip memory
already available, a natural question relates to the way in which additional transistors
might be utilized. One alternative is to build multiple processors on the same chip.
Custom microchips housing several simple processors have long been used in the
design of (massively) parallel computers. Commercially available SIMD parallel systems
of the late 1980s already contained tens of bit-serial processors on each chip and more
recent products offer hundreds of such processors per chip (thousands on one PC
board). Microchips containing multiple general-purpose processors and associated
memory constitute a plausible way of utilizing the higher densities that are becoming
available to us. From past experience with parallel computers requiring custom chips, it
appears that custom chip development for one or a few parallel computers will not be
economically viable. Instead, off-the-shelf components will likely become available as
standard building blocks for parallel systems. No matter how many processors we can
put on one chip, the demand for greater performance, created by novel applications or
larger-scale versions of existing ones, will sustain the need for integrating multiple
chips into systems with even higher levels of parallelism.
Ans: A shared-media local area network (LAN) is one in which the total available
bandwidth is shared among all transmitting stations. Ethernet is the primary example;
Token Ring and FDDI networks are earlier examples. In the past, when shared-media
LANs ran out of capacity to serve their users effectively, they were upgraded by
replacing the network hubs with switches.