MODULE-4
Multiprocessors and Multicomputers
In this chapter, we study system architectures of multiprocessors and multicomputers. Various cache coherence protocols, synchronization methods, crossbar switches, multiport memory, and multistage networks are described for building multiprocessor systems. Then we discuss multicomputers with distributed memories which are not globally shared. The Intel Paragon is used as a case study. Message-passing mechanisms required with multicomputers are also reviewed. Single-address-space multicomputers will be studied in Chapter 9.
In packet switching, the information is broken into small packets individually competing for a path in the
network.
Fig. 7.1 Interconnection structures in a generalized multiprocessor system with local memory, private caches, shared memory, and shared peripherals (disk units, backup storage, printer, terminals, network)
Network control strategy is classified as centralized or distributed. With centralized control, a global controller receives requests from all devices attached to the network and grants the network access to one or more requesters. In a distributed system, requests are handled by local devices independently.
the interface logic. An I/O or network interface chip or board uses a data bus. Each of these local buses consists of signal and utility lines.
[Figure 7.2: a CPU board and a memory board plugged into a system bus on the backplane, with local peripherals attached via a SCSI bus and buffered data buses connecting disk units, a printer or plotter, and a network (Ethernet etc.)]
Backplane Bus A backplane is a printed circuit on which many connectors are used to plug in functional boards. A system bus, consisting of shared signal paths and utility lines, is built on the backplane. This system bus provides a common communication path among all plug-in boards.
Several backplane bus standards have been developed over time, such as the VME bus (IEEE Standard 1014-1987), Multibus II (IEEE Standard 1296-1987), and Futurebus+ (IEEE Standard 896.1-1991), as introduced in Chapter 5. However, point-to-point switched interconnects have emerged as more efficient alternatives, as discussed in Chapters 5 and 13.
I/O Bus Input/output devices are connected to a computer system through an I/O bus such as the SCSI (Small Computer Systems Interface) bus. This bus is made of coaxial cables with taps connecting disks,
printer, and other devices to a processor through an I/O controller (Fig. 7.2). Special interface logic is used to connect various board types to the backplane bus.
Complete specifications for a bus system include logical, electrical, and mechanical properties, various
application profiles, and interface requirements. Our study will be confined to the logical and application
aspects of system buses. Emphasis will be placed on the scalability and bus support for cache coherence and
fast synchronization.
For example, the core of the Encore Multimax multiprocessor was the Nanobus, consisting of 20 slots, a 32-bit address, a 64-bit data path, and a 14-bit vector bus, and operating at a clock rate of 12.5 MHz with a total memory bandwidth of 100 Mbytes/s. The Sequent multiprocessor bus had a 64-bit data path, a 10-MHz clock rate, and a 32-bit address, for a channel bandwidth of 80 Mbytes/s. A write-back private cache was used to reduce the bus traffic by 50%.
Digital bus interconnects can be adopted in commercial systems ranging from workstations to minicomputers, mainframes, and multiprocessors. Hierarchical bus systems can be used to build medium-sized multiprocessors with fewer than 100 processors. However, the bus approach is limited by bandwidth scalability and the packaging technology employed.
Hierarchical Buses and Caches Wilson (1987) proposed a hierarchical cache/bus architecture as shown in Fig. 7.3. This is a multilevel tree structure in which the leaf nodes are processors and their private caches (denoted Pi and C1i in Fig. 7.3). These are divided into several clusters, each of which is connected through a cluster bus.
Fig. 7.3 A hierarchical cache/bus architecture for designing a scalable multiprocessor, with processors and first-level caches on cluster buses and second-level caches on an intercluster bus (Courtesy of Wilson; reprinted from Proc. of Annual Int. Symp. on Computer Architecture, 1987)
An intercluster bus is used to provide communications among the clusters. Second-level caches (denoted as C2i) are used between each cluster bus and the intercluster bus. Each second-level cache must have a capacity that is at least an order of magnitude larger than the sum of the capacities of all first-level caches connected beneath it.
Each single cluster operates as a single-bus system. Snoopy bus coherence protocols can be used to establish consistency among first-level caches belonging to the same cluster. Second-level caches are used to extend consistency from each local cluster to the upper level.
The upper-level caches form another level of shared memory between each cluster and the main memory modules connected to the intercluster bus. Most memory requests should be satisfied at the lower-level caches. Intercluster cache coherence is controlled among the second-level caches and the resulting effects are passed to the lower level.
Example 7.1 Encore Ultramax multiprocessor architecture
The Ultramax had a two-level hierarchical-bus architecture as depicted in Fig. 7.4. The Ultramax architecture was very similar to that characterized by Wilson, except that the global Nanobus was used only for intercluster communications.
Fig. 7.4 The Ultramax multiprocessor architecture using hierarchical buses with multiple clusters (Courtesy of Encore Computer Corporation, 1987). Legends: P = Processor, PC = Private Cache, MM = Main Memory, SC = Shared Cache, RS = Route Switch
The shared memories were distributed to all clusters instead of being connected to the intercluster bus. The cluster caches formed the second-level caches and performed the same filtering and cache coherence control for remote accesses as in Wilson's scheme. When an access request reached the top bus, it would be routed down to the cluster memory that matched it with the reference address.
The idea of using bridges between multiprocessor clusters is to allow transactions initiated on a local bus to be completed on a remote bus. As exemplified in Fig. 7.5, multiple buses are used to build a very large system consisting of three multiprocessor clusters. The bus used in this example is Futurebus+, but the basic idea is more general. Bridges are used to interface the clusters. The main functions of a bridge include communication protocol conversion, interrupt handling in split transactions, and serving as cache and memory agents.
Fig. 7.5 A multiprocessor system using multiple Futurebus+ segments, with bridges and message interfaces connecting three clusters of processors, caches, memories, and I/O processors (SCSI, LAN, ISDN, connections to a supercomputer and a visualization monitor) (Reprinted with permission from IEEE Standard 896.1-1991, copyright © 1991 by IEEE, Inc.)
reaching their destination. A single-stage network is cheaper to build, but multiple passes may be needed to establish certain connections. The crossbar switch and multiport memory organization are both single-stage networks.
A multistage network consists of more than one stage of switch boxes. Such a network should be able to connect from any input to any output. We will study unidirectional multistage networks in Section 7.1.3. The choice of interstage connection patterns determines the network connectivity. These patterns may be the same or different at different stages, depending on the class of networks to be designed. The Omega network, Flip network, and Baseline networks are all multistage networks.
Blocking versus Nonblocking Networks A multistage network is called blocking if the simultaneous connections of some multiple input-output pairs may result in conflicts in the use of switches or communication links.
Examples of blocking networks include the Omega (Lawrie, 1975), Baseline (Wu and Feng, 1980), Banyan (Goke and Lipovski, 1973), and Delta networks (Patel, 1979). Some blocking networks are equivalent after graph transformations. In fact, most multistage networks are blocking in nature. In a blocking network, multiple passes through the network may be needed to achieve certain input-output connections.
A multistage network is called nonblocking if it can perform all possible connections between inputs and outputs by rearranging its connections. In such a network, a connection path can always be established between any input-output pair. The Benes networks (Benes, 1965) have such a capability. However, Benes networks require almost twice the number of stages to achieve the nonblocking connections. The Clos networks (Clos, 1953) can also perform all permutations in a single pass without blocking. Certain subclasses of blocking networks can also be made nonblocking if extra stages are added or connections are restricted. The blocking problem can be avoided by using combining networks to be described in the next section.
Crossbar Networks In a crossbar network, every input port is connected to a free output port through a crosspoint switch (circles in Fig. 2.26a) without blocking. A crossbar network is a single-stage network built with unary switches at the crosspoints.
Once the data is read from memory, its value is returned to the requesting processor along the same crosspoint switch. In general, such a crossbar network requires the use of n × m crosspoint switches. A square crossbar (n = m) can implement any of the n! permutations without blocking.
As introduced earlier, a crossbar switch network is a single-stage, nonblocking, permutation network. Each crosspoint in a crossbar network is a unary switch which can be set open or closed, providing a point-to-point connection path between the source and destination.
All processors can send memory requests independently and asynchronously. This poses the problem of multiple requests destined for the same memory module at the same time. In such cases, only one of the requests is serviced at a time. Let us characterize below the crosspoint switching operations.
Crosspoint Switch Design Out of n crosspoint switches in each column of an n × m crossbar mesh, only one can be connected at a time. To resolve the contention for each memory module, each crosspoint switch must be designed with extra hardware.
Furthermore, each crosspoint switch requires the use of a large number of connecting lines accommodating address, data path, and control signals. This means that each crosspoint has a complexity matching that of a bus of the same width.
For an n × n crossbar network, this implies that n² sets of crosspoint switches and a large number of lines are needed. What this amounts to is a crossbar network requiring extensive hardware when n is very large. So far only relatively small crossbar networks with n ≤ 16 have been built into commercial machines.
On each row of the crossbar mesh, multiple crosspoint switches can be connected simultaneously. Simultaneous data transfers can take place in a crossbar between n pairs of processors and memories.
Figure 7.6 shows the schematic design of a row of crosspoint switches in a single crossbar network. Multiplexer modules are used to select one of n read or write requests for service. Each processor sends in an independent request, and the arbitration logic makes the selection based on certain fairness or priority rules. A sketch of such an arbiter is given below.
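The arbitration just described can be illustrated with a short C sketch. This is not the actual hardware design: the rotating-priority rule, the structure fields, and the function name are assumptions made for the example; a real arbiter is combinational logic in the crosspoint column.

```c
#include <stdbool.h>

#define NPROC 16   /* number of processors attached to the crossbar column */

/* One crosspoint column serves a single memory module.  At most one of
 * the pending requests may be granted per memory cycle; a rotating
 * (round-robin) priority pointer gives every processor a fair chance. */
typedef struct {
    bool request[NPROC];   /* request line from each processor        */
    int  last_grant;       /* processor granted in the previous cycle */
} CrossbarColumn;

/* Returns the index of the processor granted this cycle, or -1 if no
 * request is pending.  The grant index also acts as the control code
 * that selects one of the n sets of data/address/read-write lines at
 * the multiplexer tree (a 4-bit code for n = 16).                    */
int arbitrate(CrossbarColumn *col)
{
    for (int offset = 1; offset <= NPROC; offset++) {
        int p = (col->last_grant + offset) % NPROC;
        if (col->request[p]) {
            col->request[p] = false;   /* acknowledge: request consumed */
            col->last_grant = p;
            return p;
        }
    }
    return -1;
}
```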
Fig. 7.6 Schematic design of a row of crosspoint switches in a crossbar network: a multiplexer tree selects one of the n sets of data, address, and read/write lines from the n processors for the shared memory module (Mi), and arbitration logic returns request/acknowledge signals to the processors
For example, a 4-bit control signal will be generated for n = 16 processors. Note that n sets of data, address, and read/write lines are connected to the input of the multiplexer tree. Based on the control signal received, only one out of n sets of information lines is selected as the output of the multiplexer tree.
The memory address is entered for both read and write access. In the case of read, the data fetched from memory are returned to the selected processor in the reverse direction using the data path established. In the case of write, the data on the data path are stored in memory.
Acknowledge signals are used to indicate the arbitration result to all requesting processors. These signals initiate data transfer and are used to avoid conflicts. Note that the data path established is bidirectional, in order to serve both read and write requests for different memory cycles.
Crossbar Limitations A single processor can send many requests to multiple memory modules. For an n × n crossbar network, at most n memory words can be delivered to at most n processors in each cycle.
The crossbar network offers the highest bandwidth of n data transfers per cycle, as compared with only one data transfer per bus cycle. Since all necessary switching and conflict resolution logic are built into the crosspoint switch, the processor interface and memory port logic are much simplified and cheaper. A crossbar network is cost-effective only for small multiprocessors with a few processors accessing a few memory modules. A single-stage crossbar network is not expandable once it is built.
Redundancy or parity-check lines can be built into each crosspoint switch to enhance the fault tolerance
and reliability of the crossbar network.
Multiport Memory Because building a crossbar network into a large system is cost-prohibitive, some mainframe multiprocessors used a multiport memory organization. The idea is to move all crosspoint arbitration and switching functions associated with each memory module into the memory controller.
Thus the memory module becomes more expensive due to the added access ports and associated logic, as demonstrated in Fig. 7.7a. The circles in the diagram represent n switches tied to n input ports of a memory module. Only one of n processor requests can be honored at a time.
The multiport memory organization is a compromise solution between a low-cost, low-performance bus system and a high-cost, high-bandwidth crossbar system. The contention bus is time-shared by all processors and device modules attached. The multiport memory must resolve conflicts among processors.
This memory structure becomes expensive when m and n become large. A typical mainframe multiprocessor configuration may have n = 4 processors and m = 16 memory modules. A multiport memory multiprocessor is not scalable because once the ports are fitted, no more processors can be added without redesigning the memory controller.
Another drawback is the need for a large number of interconnection cables and connectors when the configuration becomes large. The ports of each memory module in Fig. 7.7b are prioritized. Some of the processors are CPUs, some are I/O processors, and some are connected to dedicated processors.
Fig. 7.7 Multiport memory organizations for multiprocessor systems: (a) n processors sharing m memory modules through multiple access ports; (b) memory ports prioritized or privileged in each module by numbers (Courtesy of P. H. Enslow, ACM Computing Surveys, March 1977)
For example, the Univac 1100/94 multiprocessor consisted of four CPUs, four I/O processors, and two scientific vector processors connected to four shared-memory modules, each of which was 10-way ported. The access to these ports was prioritized under operating system control. In other multiprocessors, part of the memory module can be made private with ports accessible only to the owner processors.
Routing in Omega Networks We have defined the Omega network in Chapter 2. In what follows, we describe the message-routing algorithm and broadcast capability of an Omega network. This class of network was built into the Illinois Cedar multiprocessor (Kuck et al., 1987), into the IBM RP3 (Pfister et al., 1985), and into the NYU Ultracomputer (Gottlieb et al., 1983). An 8-input Omega network is shown in Fig. 7.8.
In general, an n-input Omega network has log2 n stages. The stages are labeled from 0 to log2 n - 1 from the input end to the output end. Data routing is controlled by inspecting the destination code in binary. When the ith high-order bit of the destination code is a 0, a 2 × 2 switch at stage i connects the input to the upper output. Otherwise, the input is directed to the lower output.
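The destination-tag rule above amounts to a one-line decision per stage. The following C sketch (illustrative only; the function name and printed output are assumptions, not part of any machine's implementation) traces the path of one message through an n-input Omega network.

```c
#include <stdio.h>

/* Route one message through an n-input Omega network (n a power of 2).
 * At stage i (counting from the input end), the switch inspects the ith
 * high-order bit of the destination: 0 selects the upper output of the
 * 2 x 2 switch, 1 selects the lower output.                            */
void omega_route(unsigned dest, unsigned n)
{
    int stages = 0;
    for (unsigned m = n; m > 1; m >>= 1) stages++;   /* stages = log2(n) */

    for (int i = 0; i < stages; i++) {
        unsigned bit = (dest >> (stages - 1 - i)) & 1u;
        printf("stage %d: %s output\n", i, bit ? "lower" : "upper");
    }
}

int main(void)
{
    /* Destination 011 in an 8-input network: upper, lower, lower,
     * matching the routing through switches A, B, and C in Fig. 7.8a. */
    omega_route(3, 8);
    return 0;
}
```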
Two switch settings are shown in Figs. 7.8a and b with respect to permutations π1 = (0, 7, 6, 4, 2) (1, 3) (5) and π2 = (0, 6, 4, 7, 3) (1, 5) (2), respectively.
The switch settings in Fig. 7.8a are for the implementation of π1, which maps 0 → 7, 7 → 6, 6 → 4, 4 → 2, 2 → 0, 1 → 3, 3 → 1, 5 → 5. Consider the routing of a message from input 001 to output 011. This involves the use of switches A, B, and C. Since the most significant bit of the destination 011 is a "zero", switch A must be set straight so that the input 001 is connected to the upper output (labeled 2). The middle bit in 011 is a "one", thus input 4 to switch B is connected to the lower output with a "crossover" connection. The least significant bit in 011 is a "one", implying a flat connection in switch C. Similarly, the switches A, E, and D are set for routing a message from input 101 to output 101. There exists no conflict in all the switch settings needed to implement the permutation π1 in Fig. 7.8a.
Now consider implementing the permutation π2 in the 8-input Omega network (Fig. 7.8b). Conflicts in switch settings do exist in three switches identified as F, G, and H. The conflicts occurring at F are caused by the desired routings 000 → 110 and 100 → 111. Since both destination addresses have a leading bit 1, both inputs to switch F must be connected to the lower output. To resolve the conflicts, one request must be blocked.
Similarly, we see conflicts at switch G between 011 → 000 and 111 → 011, and at switch H between 101 → 001 and 011 → 000. At switches I and J, broadcast is used from one input to two outputs, which is allowed if the hardware is built to have four legitimate states as shown in Fig. 2.24a. The above example indicates the fact that not all permutations can be implemented in one pass through the Omega network.
The Omega network is a blocking network. In case of blocking, one can establish the conflicting connections in several passes. For the example π2, we can connect 000 → 110, 001 → 101, 010 → 010, 101 → 001, 110 → 100 in the first pass and 011 → 000, 100 → 111, 111 → 011 in the second pass. In general, if 2 × 2 switch boxes are used, an n-input Omega network can implement n^(n/2) permutations in a single pass. There are n! permutations in total.
Fig. 7.8 Two switch settings of an 8 × 8 Omega network built with 2 × 2 switches
For n = 8, this implies that only 8^4/8! = 4096/40320 = 0.1016 = 10.16% of all permutations are implementable in a single pass through an 8-input Omega network. All others will cause blocking and demand up to three passes to be realized. In general, a maximum of log2 n passes are needed for an n-input Omega network. Blocking is not a desired feature in any multistage network, since it lowers the effective bandwidth.
The Omega network can also be used to broadcast data from one source to many destinations, as exemplified in Fig. 7.9a, using the upper broadcast or lower broadcast switch settings. In Fig. 7.9a, the message at input 001 is being broadcast to all eight outputs through a binary tree connection.
The two-way shuffle interstage connections can be replaced by four-way shuffle interstage connections when 4 × 4 switch boxes are used as building blocks, as exemplified in Fig. 7.9b for a 16-input Omega network with log4 16 = 2 stages.
Fig. 7.9 Broadcast capability of an Omega network built with 4 × 4 switches
Note that a four-way shuffle corresponds to dividing the 16 inputs into four equal subsets and then shuffling them evenly among the four subsets. When k × k switch boxes are used, one can define a k-way shuffle function to build an even larger Omega network with log_k n stages.
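As a side note, the k-way shuffle itself can be written as a single index transformation: viewing an index in base k, the shuffle rotates its digits left by one position. A minimal sketch (function name assumed) for n = k^s inputs:

```c
/* k-way perfect shuffle of n = k^s inputs.  Writing an index as s base-k
 * digits, the shuffle rotates the digits left by one position.  For
 * k = 2 this is the ordinary perfect shuffle used by the Omega network;
 * for k = 4 it is the four-way shuffle of Fig. 7.9b.                    */
unsigned k_shuffle(unsigned index, unsigned n, unsigned k)
{
    /* Rotating left by one base-k digit equals multiplying by k modulo n
     * and carrying the overflowing high digit into the low position.    */
    return (index * k) % n + (index * k) / n;
}
```

For example, with k = 2 and n = 8, index 3 (011) maps to 6 (110), and with k = 4 and n = 16, index 7 (13 in base 4) maps to 13 (31 in base 4).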
Routing in Butterfly Networks This class of networks is constructed with crossbar switches as building blocks. Figure 7.10 shows two Butterfly networks of different sizes. Figure 7.10a shows a 64-input Butterfly network built with two stages (2 = log8 64) of 8 × 8 crossbar switches. The eight-way shuffle function is used to establish the interstage connections between stage 0 and stage 1. In Fig. 7.10b, a three-stage Butterfly network is constructed for 512 inputs, again with 8 × 8 crossbar switches. Each of the 64 × 64 boxes in Fig. 7.10b is identical to the two-stage Butterfly network in Fig. 7.10a.
In total, sixteen 8 × 8 crossbar switches are used in Fig. 7.10a and 16 × 8 + 8 × 8 = 192 are used in Fig. 7.10b. Larger Butterfly networks can be modularly constructed using more stages. Note that no broadcast connections are allowed in a Butterfly network, making these networks a restricted subclass of Omega networks.
Fig. 7.10 Modular construction of Butterfly switch networks with 8 × 8 crossbar switches: (a) a two-stage 64 × 64 Butterfly switch network built with 16 8 × 8 crossbar switches and eight-way shuffle interstage connections; (b) a three-stage 512 × 512 Butterfly network built from 64 × 64 modules (Courtesy of BBN Advanced Computers, Inc., 1990)
The Hot-Spot Problem When the network traffic is nonuniform, a hot spot may appear, corresponding to a certain memory module being excessively accessed by many processors at the same time. For example, a semaphore variable being used as a synchronization barrier may become a hot spot, since it is shared by many processors.
Hot spots may degrade the network performance significantly. In the NYU Ultracomputer and the IBM RP3 multiprocessor, a combining mechanism was added to the Omega network. The purpose was to combine multiple requests heading for the same destination at the switch points where conflicts take place. An atomic read-modify-write primitive, Fetch&Add(x, e), has been developed to perform parallel memory updates using the combining network.
Fetch&Add This atomic memory operation is effective in implementing an N-way synchronization with a complexity independent of N. In a Fetch&Add(x, e) operation, x is an integer variable in shared memory and e is an integer increment. When a single processor executes this operation, the semantics is
Fetch&Add(x, e)
{temp ← x;
 x ← temp + e;                (7.1)
 return temp}
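On hardware without a combining network, the same semantics are provided by a processor's atomic read-modify-write instruction. A minimal sketch, assuming C11 atomics (the wrapper name fetch_and_add is ours):

```c
#include <stdatomic.h>

/* Fetch&Add(x, e): atomically return the old value of x while adding e
 * to it, exactly the semantics of Eq. (7.1).  atomic_fetch_add performs
 * the read-modify-write as one indivisible operation.                   */
int fetch_and_add(_Atomic int *x, int e)
{
    return atomic_fetch_add(x, e);   /* returns temp, the old value of x */
}
```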
When N processes attempt Fetch&Add(x, e) at the same memory word simultaneously, the memory is updated only once, following a serialization principle. The sum of the N increments, e1 + e2 + ... + eN, is produced in any arbitrary serialization of the N requests.
This sum is added to the memory word x, resulting in a new value x + e1 + e2 + ... + eN. The values returned to the N requests are all unique, depending on the serialization order followed. The net result is similar to a sequential execution of N Fetch&Adds but is performed in one indivisible operation. Two simultaneous requests are combined in a switch as illustrated in Fig. 7.11.
One of the following operations will be performed if processor P1 executes Ans1 ← Fetch&Add(x, e1) and P2 executes Ans2 ← Fetch&Add(x, e2) simultaneously on the shared variable x. If the request from P1 is executed ahead of that from P2, the following values are returned:
Ans1 ← x
Ans2 ← x + e1                (7.2)
Regardless of the execution order, the value x + e1 + e2 is stored in memory. It is the responsibility of the switch box to form the sum e1 + e2, transmit the combined request Fetch&Add(x, e1 + e2), store the value e1 (or e2) in a wait buffer of the switch, and return the values x and x + e1 to satisfy the original requests Fetch&Add(x, e1) and Fetch&Add(x, e2), respectively, as illustrated in Fig. 7.11 in four steps.
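The four steps can be sketched in C as two helper routines, one applied when the requests meet in the switch and one when the reply returns from memory. The types and names below are illustrative assumptions; a real combining switch performs these steps in hardware.

```c
/* Two requests Fetch&Add(x, e1) from P1 and Fetch&Add(x, e2) from P2
 * meet at a combining switch (Fig. 7.11).                              */
typedef struct {
    int saved_e1;              /* wait buffer holding P1's increment     */
} CombiningSwitch;

/* Steps 1-3: form e1 + e2, save e1, and return the increment carried by
 * the single combined request Fetch&Add(x, e1 + e2) sent to memory.    */
int combine_requests(CombiningSwitch *sw, int e1, int e2)
{
    sw->saved_e1 = e1;
    return e1 + e2;
}

/* Step 4: decombine the reply.  old_x is the value memory returned.
 * P1 receives x and P2 receives x + e1, as in Eq. (7.2).               */
void decombine_reply(const CombiningSwitch *sw, int old_x,
                     int *ans_p1, int *ans_p2)
{
    *ans_p1 = old_x;
    *ans_p2 = old_x + sw->saved_e1;
}
```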
Fig. 7.11 Two Fetch&Add operations are combined to access a shared variable simultaneously via a combining network (the switch forms the sum e1 + e2, stores e1 in its wait buffer, transmits the combined request to memory, and returns x to P1 and x + e1 to P2)
Applications and Drawbacks The Fetch&Add primitive is very effective in accessing sequentially allocated queue structures in parallel, or in forking out parallel processes with identical code that operate on different data sets.
Consider the parallel execution of N independent iterations of the following Do loop by p processors:
Doall N = 1 to 100
  {Code using N}
Endall
Each processor executes a Fetch&Add on N before working on a specific iteration of the loop. In this case, a unique value of N is returned to each processor, which is used in the code segment. The code for each processor is written as follows, with N being initialized as 1:
n ← Fetch&Add(N, 1)
While (n ≤ 100) Doall
  {Code using n}
  n ← Fetch&Add(N, 1)
Endall
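A minimal shared-memory rendering of this self-scheduling loop, using POSIX threads and a C11 atomic counter (the thread count, loop body, and names are placeholders, not part of the original formulation):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define P 4                       /* number of worker threads (processors) */
static _Atomic int N = 1;         /* shared loop index, initialized to 1    */

static void *worker(void *arg)
{
    (void)arg;
    /* Each thread repeatedly claims a unique iteration with Fetch&Add. */
    for (int n = atomic_fetch_add(&N, 1); n <= 100;
         n = atomic_fetch_add(&N, 1)) {
        printf("iteration %d\n", n);          /* {Code using n} */
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[P];
    for (int i = 0; i < P; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < P; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```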
The advantage of using a combining network to implement the Fetch&Add operation comes at a significant increase in network cost. According to NYU Ultracomputer experience, message queueing and combining in each bidirectional 2 × 2 switch box increased the network cost by a factor of at least 6.
Additional switch cycles are also needed to make the entire operation an atomic memory operation. This may increase the network latency significantly. Multistage combining networks have the potential of supporting large-scale multiprocessors with thousands of processors. The problem of increased cost and latency may be alleviated with the use of faster and cheaper switching technology in the future.
Multistage Networks in Real Systems The IBM RP3 was designed to include 512 processors using a high-speed Omega network for reads or writes and a combining network for synchronization using Fetch&Adds. A 128-port Omega network in the RP3 had a bandwidth of 13 Gbytes/s using a 50-MHz clock.
Multistage Omega networks were also built into the Cedar multiprocessor (Kuck et al., 1986) at the University of Illinois and into the Ultracomputer (Gottlieb et al., 1983) at New York University.
The BBN Butterfly processor (TC2000) used 8 × 8 crossbar switch modules to build a two-stage 64 × 64 Butterfly network for a 64-processor system, and a three-stage 512 × 512 Butterfly switch (see Fig. 7.10) for a 512-processor system in the TC2000 Series. The switch hardware was clocked at 38 MHz with a 1-byte data path. The maximum interprocessor bandwidth for a 64-processor TC2000 was designed at 2.4 Gbytes/s.
The Cray Y-MP multiprocessor used 64-, 128-, or 256-way interleaved memory banks, each of which could be accessed via four ports. Crossbar networks were used between the processors and memory banks in all Cray multiprocessors. The Alliant FX/2800 used crossbar interconnects between seven four-processor (i860) boards plus one I/O board and eight shared, interleaved cache boards, which were connected to the physical memory via a memory bus.
a shared data element which has been referenced by both processors. Before update, the three copies of X are consistent.
If processor P1 writes new data X′ into the cache, the same copy will be written immediately into the shared memory under a write-through policy. In this case, inconsistency occurs between the two copies (X′ and X) in the two caches (Fig. 7.12a).
On the other hand, inconsistency may also occur when a write-back policy is used, as shown on the right in Fig. 7.12a. The main memory will be updated eventually when the modified data in the cache are replaced or invalidated.
Process Migration and I/O Figure 7.12b shows the occurrence of inconsistency after a process containing a shared variable X migrates from processor 1 to processor 2 using the write-back cache on the right. In the middle, a process migrates from processor 2 to processor 1 when using write-through caches.
Fig. 7.12 Cache coherence problems in data sharing and in process migration (Adapted from Dubois, Scheurich, and Briggs, 1988): (a) inconsistency in sharing of writable data (before update, write-through, write-back); (b) inconsistency after process migration (before migration, write-through, write-back)
In both cases, inconsistency appears between the two cache copies, labeled X and X′. Special precautions must be exercised to avoid such inconsistencies. A coherence protocol must be established before processes can safely migrate from one processor to another.
Inconsistency problems may occur during I/O operations that bypass the caches.
When the I/O processor loads new data X′ into the main memory, bypassing the write-through caches (middle diagram in Fig. 7.13a), inconsistency occurs between cache 1 and the shared memory. When outputting a datum directly from the shared memory (bypassing the caches), the write-back caches also create inconsistency.
One possible solution to the I/O inconsistency problem is to attach the I/O processors (IOP1 and IOP2) to the private caches (C1 and C2), respectively, as shown in Fig. 7.13b. This way I/O processors share caches with the CPUs. The I/O consistency can be maintained if cache-to-cache consistency is maintained via the bus. An obvious shortcoming of this scheme is the likely increase in cache perturbations and the poor locality of I/O data, which may result in higher miss ratios.
Fig. 7.13 Cache inconsistency after an I/O operation and a possible solution (Adapted from Dubois, Scheurich, and Briggs, 1988): (a) I/O operations bypassing the cache (write-through input, write-back output); (b) a possible solution with I/O processors (IOPi) attached to the private caches (Ci)
Two Protocol Approaches Many of the early commercially available multiprocessors used bus-based memory systems. A bus is a convenient device for ensuring cache coherence because it allows all processors in the system to observe ongoing memory transactions. If a bus transaction threatens the consistent state of a locally cached object, the cache controller can take appropriate actions to invalidate the local copy. Protocols using this mechanism to ensure coherence are called snoopy protocols because each cache snoops on the transactions of other caches.
On the other hand, scalable multiprocessor systems interconnect processors using short point-to-point links in direct or multistage networks. Unlike the situation in buses, the bandwidth of these networks increases as more processors are added to the system. However, such networks do not have a convenient snooping mechanism and do not provide an efficient broadcast capability. In such systems, the cache coherence problem can be solved using some variant of directory schemes.
In general, a cache coherence protocol consists of the set of possible states in the local caches, the state in the shared memory, and the state transitions caused by the messages transported through the interconnection network to keep memory coherent. In what follows, we first describe the snoopy protocols and then the directory-based protocols. Other approaches to designing a scalable cache coherence interface will be studied in Chapter 9.
Fig. 7.14 Write-invalidate and write-update coherence protocols for write-through caches (I: invalidate): (a) consistent copies of block X in shared memory and three processor caches; (b) after a write-invalidate operation by P1; (c) after a write-update operation by P1
Write-Through Caches The states of a cache block copy change with respect to read, write, and replacement operations in the cache. Figure 7.15 shows the state transitions for two basic write-invalidate snoopy protocols developed for write-through and write-back caches, respectively. A block copy of a write-through cache i attached to processor i can assume one of two possible cache states: valid or invalid (Fig. 7.15a).
A remote processor is denoted j, where j ≠ i. For each of the two cache states, six possible events may take place. Note that all cache copies of the same block use the same transition graph in making state changes.
In a valid state (Fig. 7.15a), all processors can read (R(i), R(j)) safely. Local processor i can also write (W(i)) safely in a valid state. The invalid state corresponds to the case of the block either being invalidated or being replaced (Z(i) or Z(j)).
Whenever a remote processor writes (W(j)) into its cache copy, all other cache copies become invalidated. The cache block in cache i becomes valid whenever a successful read (R(i)) or write (W(i)) is carried out by the local processor i.
The fraction of write cycles on the bus is higher than the fraction of read cycles in a write-through cache, due to the need for request invalidations. The cache directory (registration of cache states) can be made in dual copies or dual-ported to filter out most invalidations. In case locks are cached, an atomic Test&Set must be enforced.
Write-Back Caches The valid state of a write-back cache can be further split into two cache states, labeled RW (read-write) and RO (read-only) as shown in Fig. 7.15b. The INV (invalidated or not-in-cache) cache state is equivalent to the invalid state mentioned before. This three-state coherence scheme corresponds to an ownership protocol.
Fig. 7.15 State transition graphs for two write-invalidate snoopy protocols, with R(i)/W(i)/Z(i) denoting read/write/replace by the local processor and R(j)/W(j)/Z(j) by a remote processor: (a) write-through cache (states valid and invalid); (b) write-back cache (RW: read-write, RO: read-only, INV: invalidated or not in cache)
When the memory owns a block, caches can contain only RO copies of the block. In other words, multiple copies may exist in the RO state and every processor having a copy (called a keeper of the copy) can read (R(i), R(j)) the copy safely.
The INV state is entered whenever a remote processor writes (W(j)) its local copy or the local processor replaces (Z(i)) its own block copy. The RW state corresponds to only one cache copy existing in the entire system, owned by the local processor i. Read (R(i)) and write (W(i)) can be safely performed in the RW state. From either the RO state or the INV state, the cache block becomes uniquely owned when a local write (W(i)) takes place.
Other state transitions in Fig. 7.15b can be similarly figured out. Before a block is modified, ownership for exclusive access must first be obtained by a read-only bus transaction which is broadcast to all caches and memory. If a modified block copy exists in a remote cache, memory must first be updated, the copy invalidated, and ownership transferred to the requesting cache.
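The RW/RO/INV transitions just described can be condensed into a small next-state function. This is an illustrative sketch of the ownership protocol of Fig. 7.15b, not a complete cache controller; the event encoding and names are assumptions.

```c
typedef enum { INV, RO, RW } WbState;    /* invalid, read-only, read-write */

typedef enum {
    R_LOCAL, W_LOCAL, Z_LOCAL,           /* R(i), W(i), Z(i) */
    R_REMOTE, W_REMOTE                   /* R(j), W(j)       */
} WbEvent;

/* Next state of one block copy in write-back cache i. */
WbState wb_next_state(WbState s, WbEvent e)
{
    switch (e) {
    case W_LOCAL:                  /* local write obtains exclusive ownership */
        return RW;                 /* from RO or INV (and stays in RW)        */
    case R_LOCAL:
        return (s == INV) ? RO : s;  /* read-miss fetches a read-only copy    */
    case W_REMOTE:                 /* remote write invalidates this copy      */
    case Z_LOCAL:                  /* local replacement discards it           */
        return INV;
    case R_REMOTE:                 /* remote read: an RW owner writes back    */
        return (s == RW) ? RO : s; /* and keeps a read-only copy              */
    }
    return s;
}
```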
Write-Once Protocol James Goodman (1983) proposed a cache coherence protocol for bus-based multiprocessors. This scheme combines the advantages of both write-through and write-back invalidations. In order to reduce bus traffic, the very first write of a cache block uses a write-through policy.
This will result in a consistent memory copy while all other cache copies are invalidated. After the first write, shared memory is updated using a write-back policy. This scheme can be described by the four-state transition graph shown in Fig. 7.16. The four cache states are defined below:
Fig. 7.16 Goodman's write-once cache coherence protocol using the write-invalidate policy on write-back caches (Adapted from James Goodman, 1983; reprinted from Stenström, IEEE Computer, June 1990). Solid lines: commands issued by the local processor; dashed lines: commands issued by remote processors via the system bus.
- Valid: The cache block, which is consistent with the memory copy, has been read from shared memory and has not been modified.
- Invalid: The block is not found in the cache or is inconsistent with the memory copy.
- Reserved: Data has been written exactly once since being read from shared memory. The cache copy is consistent with the memory copy, which is the only other copy.
- Dirty: The cache block has been modified (written) more than once, and the cache copy is the only one in the system (thus inconsistent with all other copies).
To maintain consistency, the protocol requires two different sets of commands. The solid lines in Fig. 7.16 correspond to access commands issued by a local processor, labeled read-miss, write-hit, and write-miss. Whenever a read-miss occurs, the valid state is entered.
The first write-hit leads to the reserved state. The second write-hit leads to the dirty state, and all future write-hits stay in the dirty state. Whenever a write-miss occurs, the cache block enters the dirty state.
The dashed lines correspond to invalidation commands issued by remote processors via the snoopy bus. The read-invalidate command reads a block and invalidates all other copies. The write-invalidate command invalidates all other copies of a block. The bus-read command corresponds to a normal memory read by a remote processor via the bus.
Cache Events and Actions The memory-access and invalidation commands trigger the following events and actions (a state-machine sketch follows this list):
- Read-miss: When a processor wants to read a block that is not in the cache, a read-miss occurs. A bus-read operation will be initiated. If no dirty copy exists, then main memory has a consistent copy and supplies a copy to the requesting cache. If a dirty copy does exist in a remote cache, that cache will inhibit the main memory and send a copy to the requesting cache. In all cases, the cache copy will enter the valid state after a read-miss.
- Write-hit: If the copy is in the dirty or reserved state, the write can be carried out locally and the new state is dirty. If the copy is in the valid state, a write-invalidate command is broadcast to all caches, invalidating their copies. The shared memory is written through, and the resulting state is reserved after this first write.
- Write-miss: When a processor fails to write in a local cache, the copy must come either from the main memory or from a remote cache with a dirty block. This is accomplished by sending a read-invalidate command which will invalidate all other cache copies. The local copy is thus updated and ends up in a dirty state.
- Read-hit: Read-hits can always be performed in a local cache without causing a state transition or using the snoopy bus for invalidation.
- Block replacement: If a copy is dirty, it has to be written back to main memory by block replacement. If the copy is clean (i.e., in either the valid, reserved, or invalid state), no write-back is needed on replacement.
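The sketch referred to above condenses these events into a next-state function for one cache block. The encoding and helper names are ours; bus actions (write-through of the first write, write-back of a dirty copy, invalidation commands) are noted only in comments.

```c
typedef enum { INVALID_S, VALID_S, RESERVED_S, DIRTY_S } WriteOnceState;

typedef enum {
    EV_READ_HIT, EV_READ_MISS, EV_WRITE_HIT, EV_WRITE_MISS,   /* local  */
    EV_REMOTE_READ_INV, EV_REMOTE_WRITE_INV, EV_BUS_READ      /* remote */
} WriteOnceEvent;

/* Next state of a cache block under Goodman's write-once protocol. */
WriteOnceState write_once_next(WriteOnceState s, WriteOnceEvent e)
{
    switch (e) {
    case EV_READ_HIT:
        return s;                        /* no transition, no bus traffic   */
    case EV_READ_MISS:
        return VALID_S;                  /* copy supplied by memory or by a
                                            remote dirty cache (bus-read)   */
    case EV_WRITE_HIT:
        if (s == VALID_S)
            return RESERVED_S;           /* first write: write-through      */
        return DIRTY_S;                  /* later writes: write-back only   */
    case EV_WRITE_MISS:
        return DIRTY_S;                  /* read-invalidate, then write     */
    case EV_REMOTE_READ_INV:
    case EV_REMOTE_WRITE_INV:
        return INVALID_S;                /* another cache claims the block  */
    case EV_BUS_READ:                    /* remote read: a dirty or reserved */
        return (s == DIRTY_S || s == RESERVED_S) ? VALID_S : s;
    }                                    /* copy loses its exclusive status  */
    return s;
}
```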
Goodman's write-once protocol demands special bus lines to inhibit the main memory when the memory copy is invalid, and a bus-read operation is needed after a read-miss. Most standard buses cannot support this inhibition operation.
The IEEE Futurebus+ proposed to include this special bus provision. Using a write-through policy for the first write and a write-back policy for all additional writes eliminates unnecessary invalidations.
Snoopy cache protocols are popular in bus-based multiprocessors because of their simplicity of implementation. The write-invalidate policies were implemented on the Sequent Symmetry multiprocessor and on the Alliant FX multiprocessor.
Besides the DEC Firefly multiprocessor, the Xerox Palo Alto Research Center implemented another write-update protocol for its Dragon multiprocessor workstation. The Dragon protocol avoids updating memory until replacement, in order to improve the efficiency of intercache transfers.
Multilevel Cache Coherence To maintain consistency among cache copies at various levels, Wilson proposed an extension to the write-invalidate protocol used on a single bus. Consistency among cache copies at the same level is maintained in the same way as described above. Consistency of caches at different levels is illustrated in Fig. 7.3.
An invalidation must propagate vertically up and down in order to invalidate all copies in the shared caches at level 2. Suppose processor P1 issues a write request. The write request propagates up to the highest level and invalidates copies in the second-level caches and the other first-level caches holding the block, as shown by the arrows to all the shaded copies.
High-level caches such as C20 keep track of dirty blocks beneath them. A subsequent read request issued by P7 will propagate up the hierarchy because no copies exist at the lower levels. When it reaches the top level, cache C20 issues a flush request down to the cache holding the dirty copy, and the dirty copy is supplied to the private cache associated with processor P7. Note that higher-level caches act as filters for consistency control. An invalidation command or a read request will not propagate down to clusters that do not contain a copy of the corresponding block. The cache C21 acts in this manner.
Protocol Performance Issues The performance of any snoopy protocol depends heavily on the workload patterns and implementation efficiency. The main motivation for using the snooping mechanism is to reduce bus traffic, with a secondary goal of reducing the effective memory-access time. Cache performance is very sensitive to the block size in write-invalidate protocols, but not in write-update protocols.
For a uniprocessor system, bus traffic and memory-access time are mainly contributed by cache misses. The miss ratio decreases when block size increases. However, as the block size increases to a data pollution point, the miss ratio starts to increase. For larger caches, the data pollution point appears at a larger block size.
For a system requiring extensive process migration or synchronization, the write-invalidate protocol will perform better. However, a cache miss can result from an invalidation initiated by another processor prior to the cache access. Such invalidation misses may increase bus traffic and thus should be reduced.
Extensive simulation results have suggested that bus traffic in a multiprocessor may increase when the block size increases. Write-invalidate also facilitates the implementation of synchronization primitives. Typically, the average number of invalidated cache copies is rather small (one or two) in a small multiprocessor.
The write-update protocol requires a bus broadcast capability. This protocol also can avoid the ping-pong effect on data shared between multiple caches. Reducing the sharing of data will lessen bus traffic in a write-update multiprocessor. However, write-update cannot be used with long write bursts. Only through extensive program traces (trace-driven simulation) can one reveal the cache behavior, hit ratio, bus traffic, and effective memory-access time.
Directory Structures In a multistage or packet-switched network, cache coherence is supported by using cache directories to store information on where copies of cache blocks reside. Various directory-based protocols differ mainly in how the directory maintains information and what information it stores.
Tang (1976) proposed the first directory scheme, which used a central directory containing duplicates of all cache directories. This central directory, providing all the information needed to enforce consistency, is usually very large and must be associatively searched, like the individual cache directories. Contention and long search times are two drawbacks in using a central directory for a large multiprocessor.
A distributed-directory scheme was proposed by Censier and Feautrier (1978). Each memory module maintains a separate directory which records the state and presence information for each memory block. The state information is local, but the presence information indicates which caches have a copy of the block.
In Fig. 7.17, a read-miss (thin lines) in cache 2 results in a request sent to the memory module. The memory controller retransmits the request to the dirty copy in cache 1. This cache writes back its copy. The memory module can then supply a copy to the requesting cache. In the case of a write-hit at cache 1 (bold lines), a command is sent to the memory controller, which sends invalidations to all caches (cache 2) marked in the presence vector residing in the directory D1.
Fig. 7.17 Basic concept of a directory-based cache coherence scheme (Courtesy of Censier and Feautrier, IEEE Trans. Computers, Dec. 1978)
A cache-coherence protocol that does not use broadcasts must store the locations of all cached copies of each block of shared data. This list of cached locations, whether centralized or distributed, is called a cache directory. A directory entry for each block of data contains a number of pointers to specify the locations of copies of the block. Each directory entry also contains a dirty bit to specify whether a particular cache has permission to write the associated block of data.
Different types of directory protocols fall under three primary categories: full-map directories, limited directories, and chained directories. Full-map directories store enough data associated with each block in global memory so that every cache in the system can simultaneously store a copy of any block of data. That is, each directory entry contains N pointers, where N is the number of processors in the system.
Limited directories differ from full-map directories in that they have a fixed number of pointers per entry, regardless of the system size. Chained directories emulate the full-map schemes by distributing the directory
among the caches. The following descriptions of the three classes of cache directories are based on the original classification by Chaiken, Fields, Kurihara, and Agarwal (1990).
Full-Map Directories The full-map protocol implements directory entries with one bit per processor and a dirty bit. Each bit represents the status of the block in the corresponding processor's cache (present or absent). If the dirty bit is set, then one and only one processor's bit is set and that processor can write into the block.
A cache maintains two bits of state per block. One bit indicates whether a block is valid, and the other indicates whether a valid block may be written. The cache coherence protocol must keep the state bits in the memory directory and those in the cache consistent.
Figure 7.18a illustrates three different states of a full-map directory. In the first state, location X is missing in all of the caches in the system. The second state results from three caches (C1, C2, and C3) requesting copies of location X. Three pointers (processor bits) are set in the entry to indicate the caches that have copies of the block of data. In the first two states, the dirty bit on the left side of the directory entry is set to clean (C), indicating that no processor has permission to write to the block of data. The third state results from cache C3 requesting write permission for the block. In the final state, the dirty bit is set to dirty (D), and there is a single pointer to the block of data in cache C3.
Let us examine the transition from the second state to the third state in more detail. Once processor P3 issues the write to cache C3, the following events will take place (a code sketch after this list illustrates the directory side of the sequence):
(1) Cache C3 detects that the block containing location X is valid but that the processor does not have permission to write to the block, indicated by the block's write-permission bit in the cache.
(2) Cache C3 issues a write request to the memory module containing location X and stalls processor P3.
(3) The memory module issues invalidate requests to caches C1 and C2.
(4) Caches C1 and C2 receive the invalidate requests, set the appropriate bit to indicate that the block containing location X is invalid, and send acknowledgments back to the memory module.
(5) The memory module receives the acknowledgments, sets the dirty bit, clears the pointers to caches C1 and C2, and sends write permission to cache C3.
(6) Cache C3 receives the write permission message, updates the state in the cache, and reactivates processor P3.
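A minimal data-structure sketch of steps (1) to (6), reduced to the directory side and with the invalidations modeled as synchronous calls (all names and simplifications are ours):

```c
#include <stdbool.h>

#define NCACHES 64               /* number of processors / caches */

/* Full-map directory entry: one presence bit per cache plus a dirty bit. */
typedef struct {
    bool present[NCACHES];
    bool dirty;
} DirEntry;

/* Handle a write request from cache 'writer' for the block described by
 * this entry (steps 3-5 of the example).  A real memory module sends the
 * invalidations and waits for all acknowledgments before granting write
 * permission, which is what preserves sequential consistency.           */
void directory_handle_write(DirEntry *d, int writer,
                            void (*invalidate)(int cache_id))
{
    for (int c = 0; c < NCACHES; c++) {
        if (c != writer && d->present[c]) {
            invalidate(c);        /* send invalidate; assume ack received */
            d->present[c] = false;
        }
    }
    d->present[writer] = true;    /* single pointer remains               */
    d->dirty = true;              /* cache 'writer' may now write         */
}
```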
The memory module waits to receive the acknowledgments before allowing processor P3 to complete its write transaction. By waiting for acknowledgments, the protocol guarantees that the memory system ensures sequential consistency. The full-map protocol provides a useful upper bound for the performance of centralized directory-based cache coherence. However, it is not scalable due to excessive memory overhead.
Because the size of the directory entry associated with each block of memory is proportional to the number of processors, the memory consumed by the directory is proportional to the size of memory O(N) multiplied by the size of the directory entry O(N). Thus, the total memory overhead scales as the square of the number of processors, O(N²).
Limited Directories Limited directory protocols are designed to solve the directory size problem.
Restricting the number of simultaneously cached copies of any particular block of data limits the growth of
the directory to a constant factor.
A directory protocol can be classified as Dir_i X using the notation from Agarwal et al. (1988). The symbol i stands for the number of pointers, and X is NB for a scheme with no broadcast. A full-map scheme without
broadcast is represented as Dir_N NB. A limited directory protocol that uses i < N pointers is denoted Dir_i NB. The limited directory protocol is similar to the full-map directory, except in the case when more than i caches request read copies of a particular block of data.
Fig. 7.18 (a) Three states of a full-map directory: location X missing in all caches, read copies in caches C1, C2, and C3 with the dirty bit clean, and a single dirty copy in cache C3
Figure 7.18b shows the situation when three caches request read copies in a memory system with a Dir_2 NB protocol. In this case, we can view the two-pointer directory as a two-way set-associative cache of pointers to shared copies. When cache C3 requests a copy of location X, the memory module must invalidate the copy in either cache C1 or cache C2. This process of pointer replacement is called eviction. Since the directory acts as a set-associative cache, it must have a pointer replacement policy.
If the multiprocessor exhibits processor locality in the sense that in any given interval of time only a small subset of all the processors access a given memory word, then a limited directory is sufficient to capture this small worker set of processors.
Directory pointers in a Dir_i NB protocol encode binary processor identifiers, so each pointer requires log2 N bits of memory, where N is the number of processors in the system. Given the same assumptions as for the full-map protocol, the memory overhead of limited directory schemes grows as O(N log2 N).
These protocols are considered scalable with respect to memory overhead because the resources required to implement them grow approximately linearly with the number of processors in the system. Dir_i B protocols allow more than i copies of each block of data to exist, but they resort to a broadcast mechanism when more than i cached copies of a block need to be invalidated. However, point-to-point interconnection networks do not provide an efficient systemwide broadcast capability. In such networks, it is difficult to determine the completion of a broadcast to ensure sequential consistency.
Chained Directories Chained directories realize the scalability of limited directories without restricting the number of shared copies of data blocks. This type of cache coherence scheme is called a chained scheme because it keeps track of shared copies of data by maintaining a chain of directory pointers.
The simpler of the two schemes implements a singly linked chain, which is best described by example (Fig. 7.18c). Suppose there are no shared copies of location X. If processor P1 reads location X, the memory sends a copy to cache C1, along with a chain termination (CT) pointer. The memory also keeps a pointer to cache C1. Subsequently, when processor P2 reads location X, the memory sends a copy to cache C2, along with the pointer to cache C1. The memory then keeps a pointer to cache C2.
By repeating the above step, all of the caches can hold a copy of the location X. If processor P3 writes to location X, it is necessary to send a data invalidation message down the chain. To ensure sequential consistency, the memory module denies processor P3 write permission until the processor with the chain termination pointer acknowledges the invalidation of the chain. Perhaps this scheme should be called a gossip protocol (as opposed to a snoopy protocol) because information is passed from individual to individual rather than being spread by covert observation.
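A C sketch of the singly linked (gossip) chain, with in-memory pointers standing in for the messages; the names and the synchronous invalidation loop are illustrative assumptions:

```c
#define CT (-1)                   /* chain-termination (CT) pointer */

typedef struct { int head; } ChainedDirEntry;   /* per memory block */

typedef struct {
    int next;                     /* next cache down the chain, or CT */
    int has_copy;
} CacheLine;

/* Processor p reads location X: memory supplies a copy plus the current
 * head pointer, and the new reader becomes the head of the chain.       */
void chained_read(ChainedDirEntry *dir, CacheLine caches[], int p)
{
    caches[p].has_copy = 1;
    caches[p].next = dir->head;   /* pointer to the previous reader or CT */
    dir->head = p;
}

/* Processor p writes location X: an invalidation travels down the chain.
 * Write permission is granted only after the cache holding the CT pointer
 * acknowledges, which preserves sequential consistency.                   */
void chained_write(ChainedDirEntry *dir, CacheLine caches[], int p)
{
    for (int c = dir->head; c != CT; ) {
        int next = caches[c].next;
        caches[c].has_copy = 0;   /* invalidation message to cache c */
        caches[c].next = CT;
        c = next;
    }
    dir->head = p;                /* writer holds the sole (dirty) copy */
    caches[p].has_copy = 1;
    caches[p].next = CT;
}
```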
The possibility of cache block replacement complicates chained-directory protocols.
Suppose that caches C1 through CN all have copies of location X and that location X and location Y map to the same (direct-mapped) cache line. If processor Pi reads location Y, it must first evict location X from its cache, with the following possibilities:
(1) Send a message down the chain to cache Ci-1 with a pointer to cache Ci+1 and splice Ci out of the chain, or
(2) Invalidate location X in cache Ci+1 through cache CN.
The second scheme can be implemented by a less complex protocol than the first. In either case, sequential
consistency is maintained by locking the memory location while invalidations are in progress. Another
solution to the replacement problem is to use a doubly linked chain. This scheme maintains forward and
backward chain pointers for each cached copy so that the protocol does not have to traverse the chain when
there is a cache replacement. The doubly linked directory optimizes the replacement condition at the cost of a larger average message block size (due to the transmission of extra directory pointers), twice the pointer memory in the caches, and a more complex coherence protocol.
Although the chained protocols are more complex than the limited directory protocols, they are still scalable in terms of the amount of memory used for the directories. The pointer sizes grow as the logarithm of the number of processors, and the number of pointers per cache or memory block is independent of the number of processors.
Cache Design Alternatives The relative merits of physical address caches and virtual address caches have to be judged based on the access time, the aliasing problem, the flushing problem, OS kernel overhead, special tagging at the process level, and cost/performance considerations. Beyond the use of private caches, three design alternatives are suggested below.
Each of the design alternatives has its own advantages and shortcomings. There exists insufficient evidence to determine whether any of the alternatives is always better or worse than the use of private caches. More research and trace data are needed to apply these cache architectures in designing high-performance multiprocessors.
Shared Cache An alternative approach to maintaining cache coherence is to eliminate the problem completely by using shared caches attached to shared-memory modules. No private caches are allowed in this case. This approach reduces the main memory access time but contributes very little to reducing the overall memory-access time and to resolving access conflicts.
Shared caches can be built as second-level caches. Sometimes, one can make the second-level caches partially shared by different clusters of processors. Various cache architectures are possible if private and shared caches are both used in a memory hierarchy. The use of shared caches alone may work against the scalability of the entire system. Tradeoffs between using private caches, caches shared by multiprocessor clusters, and shared main memory are interesting topics for further research.
Non-cacheable Data Another approach is not to cache shared writable data. Shared data are non-cacheable, and only instructions or private data are cacheable in local caches. Shared data include locks, process queues, and any other data structures protected by critical sections.
The compiler must tag data as either cacheable or non-cacheable. Special hardware tagging must be used to distinguish them. Cache systems with cacheable and non-cacheable blocks demand more support from hardware and compilers.
Cache Flushing A third approach is to use cache flushing every time a synchronization primitive is executed. This may work well with transaction processing multiprocessor systems. Cache flushes are slow unless special hardware is used. This approach does not solve the I/O and process migration problems.
Flushing can be made very selective by the compiler in order to increase efficiency. Cache flushing at synchronization, I/O, and process migration may be carried out unconditionally or selectively. Cache flushing is more often used with virtual address caches.
Synchronization enforces correct sequencing of processors and ensures mutually exclusive access to shared writable data. Synchronization can be implemented in software, firmware, and hardware through controlled sharing of data and control information in memory.
Multiprocessor systems use hardware mechanisms to implement low-level or primitive synchronization operations, or use software (operating system) level synchronization mechanisms such as semaphores or monitors. Only hardware synchronization mechanisms are studied below. Software approaches to synchronization will be treated in Chapter 10.
Atomic Operations Most multiprocessors are equipped with hardware mechanisms for enforcing atomic operations such as memory read, write, or read-modify-write operations, which can be used to implement some synchronization primitives. Besides atomic memory operations, some interprocessor interrupts can be used for synchronization purposes. For example, the synchronization primitives Test&Set (lock) and Reset (lock) are defined below:

Test&Set (lock)
    temp ← lock; lock ← 1;
    return temp                                    (7.4)
Reset (lock)
    lock ← 0
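For illustration only, the sketch below mimics these primitives in software; the mutex stands in for the bus-level atomicity that a real read-modify-write instruction provides, so this is a behavioral model rather than the hardware mechanism itself.

```python
import threading

class SpinLock:
    """Spin lock built from the Test&Set / Reset pair defined above."""
    def __init__(self):
        self._guard = threading.Lock()   # emulates hardware atomicity
        self.lock = 0                    # the shared variable 'lock'

    def test_and_set(self) -> int:
        with self._guard:                # atomic read-modify-write
            temp, self.lock = self.lock, 1
            return temp                  # old value: 0 means the lock was free

    def reset(self) -> None:
        self.lock = 0                    # Reset(lock): lock <- 0

    def acquire(self) -> None:
        while self.test_and_set() == 1:  # spin until Test&Set returns 0
            pass

    def release(self) -> None:
        self.reset()
```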
This demonstrates the ability to use the control bit X_i to signal the completion of a process on processor i. The bit X_i is set to 1 when a process is initiated and reset to 0 when the process finishes its execution.
When all processes finish their jobs, the X_i bits from the participating processors are all set to 0, and the barrier line is then raised to high (1), signaling that the synchronization barrier has been crossed. This timing is watched by all processors through snooping on the Y bit. Thus only one barrier line is needed to monitor the initiation and completion of a single synchronization involving many concurrent processes.
Fig. 7.19 The synchronization of four independent processes on four processors using one wired-NOR barrier line (Adapted from Hwang and Shang, Proc. Int. Conf. Parallel Processing, 1991)
Multiple barrier lines can be used simultaneously to monitor several synchronization points.
Figure 7.19 shows the synchronization of four processes residing on four processors using one barrier line. Note that other barrier lines can be used to synchronize other processes at the same time in a multiprogrammed multiprocessor environment.
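A minimal software model of this mechanism is sketched below (illustrative only; names such as barrier_line are hypothetical): each process sets its X bit to 1 when it is initiated and clears it when it finishes, and the wired-NOR line goes high only when every X bit is 0.

```python
def barrier_line(x_bits):
    # Wired-NOR of the X_i control bits: 1 only when all bits have been reset to 0.
    return int(not any(x_bits))

# Four processes on four processors sharing one barrier line (cf. Fig. 7.19).
x = [1, 1, 1, 1]               # all four processes have been initiated
assert barrier_line(x) == 0    # barrier not yet crossed
for i in range(4):             # processes finish one by one and reset X_i
    x[i] = 0
print(barrier_line(x))         # -> 1: the synchronization barrier has been crossed
```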
Example 7.2 Wired barrier synchronization of five partially ordered processes (Hwang and Shang, 1991)
If the synchronization pattern is predicted at compile time, then one can follow the precedence graph of a partially ordered set of processes to perform multiple synchronizations, as demonstrated in Fig. 7.20.
Fig. 7.20 The synchronization of five partially ordered processes (P1 to P5) using wired-NOR barrier lines (Adapted from Hwang and Shang, Proc. Int. Conf. Parallel Processing, 1991): (a) synchronization patterns, (b) precedence graph
Here five processes (P1, P2, ..., P5) are synchronized by snooping on five barrier lines corresponding to five synchronization points labeled a, b, c, d, e. At step 0 the control vectors need to be initialized. All five processes are synchronized at point a. The crossing of barrier a is signaled by monitor bit Y1, which is observable by all processors.
Barriers b and c can be monitored simultaneously using two lines, as shown in steps 2a and 2b. Only four steps are needed to complete the entire process. Note that only one copy of the monitor vector Y is maintained in the shared memory. The bus interface logic of each processor module has a copy of Y for local monitoring purposes, as shown in Fig. 7.20c.
Separate control vectors are used in the local processors. The above dynamic barrier synchronization is possible only if the synchronization pattern is predicted at compile time and process preemption is not allowed. One can also use the barrier wires along with counting semaphores in memory to support multiprogrammed multiprocessors in which preemption is allowed.
Fig. 7.21 Design choices made in the past for developing message-passing multicomputers compared to those made for other parallel computers (Courtesy of Intel Scientific Computers, 1988). The four choices shown are control selection (MIMD, MPMD, SPMD rather than SIMD), interconnection selection (message passing rather than switching), memory selection (distributed rather than shared memory), and processor selection (low-cost rather than expensive processors); machines named include nCUBE, Intel, AMT, TMC, Sequent, Alliant, BBN, and the IBM RP3.
In selecting a control strategy, designers of multicomputers chose the asynchronous MIMD, MPMD, and SPMD operations, rather than the SIMD lockstep operations as in the CM-2 and DAP. Even though both support massive parallelism, the SIMD approach offers little or no opportunity to utilize existing multiprocessor code because radical changes must be made in the programming style.
On the other hand, multicomputers allow the use of existing software with minor changes from that developed for multiprocessors or for other types of parallel computers.
First Generation Caltech's Cosmic Cube (Seitz, 1983) was the first of the first-generation multicomputers. The Intel iPSC/1, Ametek S/14, and nCUBE/10 were various evolutions of the original Cosmic Cube.
For example, the iPSC/1 used 80286 processors with 512 Kbytes of local memory per node. Each node was implemented on a single printed-circuit board with eight I/O ports. Seven I/O ports were used to form a seven-dimensional hypercube. The eighth port was used for an Ethernet connection from each node to the host.
Table 7.1 summarizes the important parameters used in designing the early three generations of multicomputers. The communication latency (for a 100-byte message) was rather long in the early 1980s. The 3-to-1 ratio between remote and local communication latencies was caused by the use of a store-and-forward routing scheme where the latency is proportional to the number of hops between two communicating nodes.
(Modified from Athas and Seitz, "Multicomputers: Message-Passing Concurrent Computers", IEEE Computer, August 1988.)
Vector hardware was added on a separate board attached to each processing node board. Or one could use the second board to hold extended local memory. The host used in the iPSC/1 was an Intel 310 microcomputer. All I/O had to be done through the host.
The Second Generation A major improvement of the second generation included the use of better processors, such as the i386 in the iPSC/2 and the i860 in the iPSC/860 and in the Delta. The nCUBE/2 implemented 64 custom-designed VLSI processors on a single PC board. The memory per node was also increased to 10 times that of the first generation.
Most importantly, hardware-supported routing, such as wormhole routing, reduced the communication latency significantly from 6000 μs to less than 5 μs. In fact, the latency for remote and local communications became almost the same, independent of the number of hops between any two nodes.
The architecture of a typical second-generation multicomputer is shown in Fig. 7.22. This corresponds to a 16-node mesh-connected architecture. Mesh-routing chips (MRCs) are used to establish the four-neighbor mesh network. All the mesh communication channels and MRCs are built on a backplane.
Fig. 7.22 A typical second-generation multicomputer: a 16-node mesh of computer nodes interconnected by mesh-routing chips (MRCs), with a file system, display generator, and Ethernet attached at the mesh boundary
Each node is implemented on a PC board plugged into the backplane at the proper MRC position. All I/O devices, graphics, and the host are connected to the periphery (boundary) of the mesh. The Intel Delta system had such a mesh architecture.
Another representative system was the nCUBE/2, which implemented a hypercube with up to 8192 nodes with a total of 512 Gbytes of distributed memory. Note that some parameters in Table 7.1 have been updated from the conservative estimates made by Athas and Seitz in 1988. Typical figures representative of current systems can be found in Chapter 13.
The Supernode 1000 was a Transputer-based multicomputer produced by Parsys Ltd., England. Another second-generation system was Ametek's Series 2010, made with 25-MHz M68020 processors using a mesh-routed architecture with 225-Mbytes/s channels.
The Third Generation These designs laid the foundation for the current generation of multicomputers. Caltech had the Mosaic C project designed to use VLSI-implemented nodes, each containing a 14-MIPS processor, 16-Mbytes/s routing channels, and 16 Kbytes of RAM integrated on a single chip.
The full size of the Mosaic was targeted to have a total of 16,384 nodes organized in a three-dimensional mesh architecture. MIT built the J-machine, which it planned to extend to a 65K-node multicomputer with VLSI nodes interconnected by a three-dimensional mesh network. We will study the J-machine experience in Section 9.3.2.
The J-machine planned to use message-driven processors to reduce the message handling overhead to less than 1 μs. Each processor chip would contain a 512-Kbit DRAM, a 32-bit processor, a floating-point unit, and a communication controller. The communication latency in such systems was later reduced to a few ns using high-speed links and sophisticated communication protocols.
The significant reduction of overhead in communication and synchronization would permit the execution of much shorter tasks, with sizes of 5 μs per processor in the J-machine, as opposed to executing tasks of 100 μs in the iPSC/1. This implies that concurrency may increase from 10^2 in the iPSC/1 to 10^5 in the J-machine.
The first two generations of multicomputers have been called medium-grain systems. With a significant reduction in communication latency, the third-generation systems may be called fine-grain multicomputers.
Research is also underway to combine the private virtual address spaces distributed over the nodes into a globally shared virtual memory in MPP multicomputers. Instead of page-oriented message passing, the fine-grain system may require block-level cache communications. This fine-grain and shared virtual memory approach can in theory combine the relative merits of multiprocessors and multicomputers in a heterogeneous processing (HP) environment.
Fig. 7.23 The Intel Paragon system architecture: a mesh of compute nodes with I/O columns of service, tape, Ethernet, HIPPI, and SCSI/disk nodes (Courtesy of Intel Supercomputer Systems Division, 1991)
The processors used in the I/O columns were Intel i386's, which supervised the massive data transfers between the disk arrays and the computational array during I/O operations. The system I/O column was made up of six service nodes, two tape nodes, two Ethernet nodes, and a HIPPI node. The service nodes were used for system diagnosis and handling of interrupts. The tape nodes were used for backup storage.
The Ethernet and HIPPI nodes were used for fast gateway connections with the outside world. Collectively, a 17,000-MIPS performance was claimed possible on the 570 numeric and disk I/O nodes involved in program execution. The system was designed to run iPSC/860-compatible software.
Node and Router Architecture The Paragon was designed as an experimental system. One unit was built and delivered to Caltech in May 1991 for research use by a consortium of 13 national laboratories and universities. The typical node architecture is shown in Fig. 7.24.
Fig. 7.24 The typical node architecture: processor, local memory, and an external I/O interface on the node board, with a router (on the backplane) connecting the node to the mesh communication channels
Each node was on a separate board. For numeric nodes, the processor and floating-point units were on the same i860 chip. The local memory took up most of the board space. The external I/O interface was implemented only on the boundary nodes of the computational array. The message I/O interface was required for message passing between local nodes and the mesh network. The mesh-connected router is shown in Fig. 7.25.
Fig. 7.25 The structure of a mesh-connected router with four pairs of I/O channels (north, south, east, west) connected to neighboring routers and a fifth pair connected to the local node (IC: input controller, FB: flit buffer)
Each router had 10 I/O ports, 5 for input and 5 for output. Four pairs of I/O channels were used for mesh connection to the four neighbors at the north, south, east, and west nodes.
Flow control digit (flit) buffers were used at the end of input channels to hold the incoming flits. The concept of flits will be clarified in the next section. Besides four pairs of external channels, a fifth pair was used for internal connection between the router and the local node. A 5 × 5 crossbar switch was used to establish a connection between any input channel and any output channel.
The functions of the hardware router included pipelined message routing at the flit level and resolving buffer or channel deadlock situations to achieve deadlock-free routing. In the next section, we will explain various routing mechanisms and deadlock avoidance schemes.
All the I/O channels shown in Figs. 7.24 and 7.25 are physical channels which allow only one message (flit) to pass at a time. Through time-sharing, one can also implement virtual channels to multiplex the use of physical channels, as described in the next section.
MESSAGE-PASSING MECHANISMS
Message passing in a multicomputer network demands special hardware and software support. In this section, we study the store-and-forward and wormhole routing schemes and analyze their communication latencies. We introduce the concept of virtual channels. Deadlock situations in a message-passing network are examined. We show how to avoid deadlocks using virtual channels.
Both deterministic and adaptive routing algorithms are presented for achieving deadlock-free message routing. We first study deterministic dimension-order routing schemes such as E-cube routing for hypercubes and X-Y routing for two-dimensional meshes. Then we discuss adaptive routing using virtual channels or virtual subnets. Besides one-to-one unicast routing, we will consider one-to-many multicast and one-to-all broadcast operations using virtual subnets and greedy routing algorithms.
Fig. 7.26 The format of message, packets, and flits (flow control digits) used as information units of communication in a message-passing network (R: routing information, S: sequence number, D: data-only flits)
A packet is the basic unit containing the destination address for routing purposes. Because different packets may arrive at the destination asynchronously, a sequence number is needed in each packet to allow reassembly of the message transmitted.
A packet can be further divided into a number of fixed-length flits (flow control digits). Routing information (destination) and the sequence number occupy the header flits. The remaining flits are the data elements of a packet.
In multicomputers with store-and-forward routing, packets are the smallest unit of information transmission. In wormhole-routed networks, packets are further subdivided into flits. The flit length is often affected by the network size.
The packet length is determined by the routing scheme and network implementation. Typical packet lengths range from 64 to 512 bits. The sequence number may occupy one to two flits depending on the message length. Other factors affecting the choice of packet and flit sizes include channel bandwidth, router design, network traffic intensity, etc.
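The message/packet/flit hierarchy can be made concrete with a small sketch (field layout and sizes are arbitrary illustrations, not the format of any particular machine): each packet carries routing information and a sequence number in its header flits, followed by fixed-length data flits.

```python
def packetize(message: bytes, packet_bytes: int, flit_bytes: int, dest: int):
    """Split a message into packets, and each packet into fixed-length flits.
    Header flits carry the routing information (R) and sequence number (S)."""
    packets = []
    for seq, start in enumerate(range(0, len(message), packet_bytes)):
        payload = message[start:start + packet_bytes]
        header = [("R", dest), ("S", seq)]
        data_flits = [("D", payload[i:i + flit_bytes])
                      for i in range(0, len(payload), flit_bytes)]
        packets.append(header + data_flits)
    return packets

# A 12-byte message split into 8-byte packets of 2-byte flits, destined for node 5.
for pkt in packetize(b"hello world!", packet_bytes=8, flit_bytes=2, dest=5):
    print(pkt)
```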
Store-and-Forward Routing Packets are the basic unit of information flow in a store-and-forward network. The concept is illustrated in Fig. 7.27a. Each node is required to use a packet buffer. A packet is transmitted from a source node to a destination node through a sequence of intermediate nodes.
When a packet reaches an intermediate node, it is first stored in the buffer. Then it is forwarded to the next node if the desired output channel and a packet buffer in the receiving node are both available.
The latency in store-and-forward networks is directly proportional to the distance (the number of hops) between the source and the destination. This routing scheme was implemented in the first generation of multicomputers.
Wormhole Routing By subdividing the packet into smaller flits, later generations of multicomputers implement the wormhole routing scheme, as illustrated in Fig. 7.27b. Flit buffers are used in the hardware routers attached to nodes. The transmission from the source node to the destination node is done through a sequence of routers.
Fig. 7.27 Store-and-forward routing and wormhole routing between a source node and a destination node through intermediate nodes (Courtesy of Lionel Ni, 1991)
All the flits in the same packet are transmitted in order as inseparable companions in a pipelined fashion. The packet can be visualized as a railroad train with an engine car (the header flit) towing a long sequence of box cars (data flits).
Only the header flit knows where the train (packet) is going. All the data flits (box cars) must follow the header flit. Different packets can be interleaved during transmission. However, the flits from different packets cannot be mixed up; otherwise they may be towed to the wrong destinations.
We prove below that wormhole routing has a latency almost independent of the distance between the
source and the destination.
Asynchronous Pipelining The pipelining of successive flits in a packet is done asynchronously using a handshaking protocol, as shown in Fig. 7.28. Along the path, a 1-bit ready/request (R/A) line is used between adjacent routers.
When the receiving router (D) is ready (Fig. 7.28a) to receive a flit (i.e. the flit buffer is available), it pulls the R/A line low. When the sending router (S) is ready (Fig. 7.28b), it raises the line high and transmits flit i through the channel.
While the flit is being received by D (Fig. 7.28c), the R/A line is kept high. After flit i is removed from D's buffer (i.e. is transmitted to the next node) (Fig. 7.28d), the cycle repeats itself for the transmission of the next flit i + 1 until the entire packet is transmitted.
Fig. 7.28 Handshaking protocol between two wormhole routers S and D connected by a channel (Courtesy of Lionel Ni, 1991): (a) D is ready to receive a flit, (b) S is ready to send flit i, (c) flit i is received by D, (d) flit i is removed from D's buffer and flit i + 1 arrives at S's buffer
Asynchronous pipelining can be very efficient, and the clock used can be faster than that used in a synchronous pipeline. However, the pipeline can be stalled if flit buffers or successive channels along the path are not available during certain cycles. Should that happen, the packet can be buffered, blocked, discarded, or detoured. We will discuss these flow control methods in Section 14.3.
Latency Analysis A time comparison between store-and-forward and wormhole-routed networks is given in Fig. 7.29. Let L be the packet length (in bits), W the channel bandwidth (in bits/s), D the distance (number of nodes traversed minus 1), and F the flit length (in bits).
Fig. 7.29 Time-space diagrams of (a) store-and-forward routing and (b) wormhole routing over nodes N1 to N4, showing the latencies T_SF and T_WH
The latency T_SF for a store-and-forward network is expressed by

T_SF = (L/W)(D + 1)                                    (7.5)
The latency T_WH for a wormhole-routed network is expressed by

T_WH = L/W + (F/W)D                                    (7.6)
Equation 7.5 implies that T_SF is directly proportional to D. In Eq. 7.6, T_WH ≈ L/W if L >> F. Thus the distance D has a negligible effect on the routing latency.
We have ignored the network startup latency and the block time due to resource shortage (such as channels being busy or buffers being full, etc.). The channel propagation delay has also been ignored because it is much smaller than the terms in T_SF or T_WH.
According to the estimates given in Table 7.1, a typical first-generation value of T_SF is between 2000 and 6000 μs, while a typical value of T_WH is 5 μs or less. Current systems employ much faster processors, data links and routers. Both the latency figures above would therefore be smaller, but wormhole routing would still have a much lower latency than packet store-and-forward routing.
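The two expressions are easy to evaluate numerically; the sketch below uses illustrative parameter values only (chosen to be of the same order as the first-generation figures above, not measured data) and shows that T_WH is nearly flat in D while T_SF grows linearly.

```python
def t_sf(L, W, D):
    # Store-and-forward latency, Eq. 7.5: the whole packet is relayed D + 1 times.
    return (L / W) * (D + 1)

def t_wh(L, W, D, F):
    # Wormhole latency, Eq. 7.6: only the header experiences the per-hop delay.
    return L / W + (F / W) * D

L, W, F = 800, 2e6, 16        # 100-byte packet, 2 Mbit/s channel, 16-bit flits
for D in (1, 5, 10):
    print(f"D={D}:  T_SF={t_sf(L, W, D)*1e6:7.1f} us   T_WH={t_wh(L, W, D, F)*1e6:7.2f} us")
```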
Comparing the setup in Fig. 7.30 with that in Fig. 7.28, the difference lies in the added buffers at both ends. The sharing of a physical channel by a set of virtual channels is conducted by time-multiplexing on a flit-by-flit basis.
Fig. 7.31 Deadlock situations caused by a circular wait at buffers or at communication channels: (a) a buffer deadlock among four nodes (A, B, C, D) with store-and-forward routing using packet buffers, (b) a channel deadlock among four nodes with wormhole routing; shaded boxes are flit buffers
Four flits from four messages occupy the four channels simultaneously. If none of the channels in the cycle is freed, the deadlock situation will continue. Circular waits are further illustrated in Fig. 7.32 using a channel-dependence graph.
The channels involved are represented by nodes, and directed arrows are used to show the dependence relations among them. A deadlock avoidance scheme is presented using virtual channels.
Deadlock Avoidance By adding two virtual channels, V3 and V4 in Fig. 7.32c, one can break the deadlock cycle. A modified channel-dependence graph is obtained by using the virtual channels V3 and V4, after the use of channel C2, instead of reusing C3 and C4.
The cycle in Fig. 7.32b is thereby converted to a spiral, thus avoiding a deadlock. Channel multiplexing can be done at the flit level or at the packet level if the packet length is sufficiently short. Virtual channels can be implemented with either unidirectional channels or bidirectional channels.
Fig. 7.32 Deadlock avoidance using virtual channels to convert a cycle to a spiral on a channel-dependence graph: (c) adding two virtual channels (V3, V4), (d) a modified channel-dependence graph using the virtual channels
The use of virtual channels may reduce the effective channel bandwidth available to each request. There exists a tradeoff between network throughput and communication latency in determining the degree to which virtual channels are used. High-speed multiplexing is required for implementing a large number of virtual channels.
Packet Collision Resolution In order to move a flit between adjacent nodes in a pipeline of channels, three elements must be present: (1) the source buffer holding the flit, (2) the channel being allocated, and (3) the receiver buffer accepting the flit.
When two packets reach the same node, they may request the same receiver buffer or the same outgoing channel. Two arbitration decisions must be made: (i) Which packet will be allocated the channel? and (ii) What will be done with the packet denied the channel? These decisions lead to the four methods illustrated in Fig. 7.33 for coping with the packet collision problem.
Figure 7.33 illustrates four methods for resolving the conflict between two packets competing for the use of the same outgoing channel at an intermediate node. Packet 1 is allocated the channel, and packet 2 is denied. A buffering method has been proposed with the virtual cut-through routing scheme devised by Kermani and Kleinrock (1979).
Packet 2 is temporarily stored in a packet buffer. When the channel becomes available later, it will be transmitted then. This buffering approach has the advantage of not wasting the resources already allocated. However, it requires the use of a large buffer to hold the entire packet.
Furthermore, the packet buffers along the communication path should not form a cycle, as shown in Fig. 7.31a. The packet buffer may, however, cause significant storage delay. The virtual cut-through method offers a compromise by combining the store-and-forward and wormhole routing schemes. When collisions do not occur, the scheme should perform as well as wormhole routing. In the worst case, it will behave like a store-and-forward network.
Pure wormhole routing uses a blocking policy in case of packet collision, as illustrated in Fig. 7.33b. The second packet is blocked from advancing; however, it is not abandoned. Figure 7.33c shows the discard policy, which simply drops the packet being blocked from passing through.
The fourth policy is called detour (Fig. 7.33d). The blocked packet is routed to a detour channel. The blocking policy is economical to implement but may result in the idling of resources allocated to the blocked packet.
Fig. 7.33 Flow control methods for resolving a collision between two packets requesting the same outgoing channel (packet 1 being allocated the channel and packet 2 being denied)
The discard policy may result in a severe waste of resources, and it demands packet retransmission and acknowledgment. Otherwise, a packet may be lost after discarding. This policy is rarely used now because of its unstable packet delivery rate. The BBN Butterfly network had used this discard policy.
Detour routing offers more flexibility in packet routing. However, the detour may waste more channel resources than necessary to reach the destination. Furthermore, a re-routed packet may enter a cycle of livelock, which wastes network resources. Both the Connection Machine and the Denelcor HEP had used this detour policy.
In practice, some multicomputer networks use hybrid policies which may combine the advantages of some of the above flow control policies.
Dimension-Order Routing Packet routing can be conducted deterministically or adaptively. In deterministic routing, the communication path is completely determined by the source and destination addresses. In other words, the routing path is uniquely predetermined in advance, independent of network conditions.
Adaptive routing may depend on network conditions, and alternate paths are possible. In both types of routing, deadlock-free algorithms are desired. Two such deterministic routing algorithms are given below, based on a concept called dimension-order routing.
Dimension-order routing requires the selection of successive channels to follow a specific order based on the dimensions of a multidimensional network. In the case of a two-dimensional mesh network, the scheme is called X-Y routing because a routing path along the X-dimension is decided first before choosing a path along the Y-dimension. For hypercube (or n-cube) networks, the scheme is called E-cube routing as originally proposed by Sullivan and Bashkow (1977). These two routing algorithms are described below by presenting examples.
E-cube Routing on a Hypercube Consider an n-cube with N = 2^n nodes. Each node b is binary-coded as b = b_{n-1} b_{n-2} ... b_1 b_0. Thus the source node is s = s_{n-1} ... s_1 s_0 and the destination node is d = d_{n-1} ... d_1 d_0. We want to determine a route from s to d with a minimum number of steps.
We denote the n dimensions as i = 1, 2, ..., n, where the ith dimension corresponds to the (i - 1)st bit in the node address. Let v = v_{n-1} ... v_1 v_0 be any node along the route. The route is uniquely determined as follows:
1. Compute the direction bit r_i = s_{i-1} ⊕ d_{i-1} for all n dimensions (i = 1, ..., n). Start the following with dimension i = 1 and v = s.
2. Route from the current node v to the next node v ⊕ 2^{i-1} if r_i = 1. Skip this step if r_i = 0.
3. Move to dimension i + 1 (i.e. i ← i + 1). If i ≤ n, go to step 2, else done.
Example 7.4 E-cube routing on a four-dimensional hypercube
The above E-cube routing algorithm is illustrated with the example in Fig. 7.34. Now n = 4, s = 0110, and d = 1101. Thus r = r4 r3 r2 r1 = 1011. Route from s to s ⊕ 2^0 = 0111 since r1 = 0 ⊕ 1 = 1. Route from v = 0111 to v ⊕ 2^1 = 0101 since r2 = 1 ⊕ 0 = 1. Skip dimension i = 3 because r3 = 1 ⊕ 1 = 0. Route from v = 0101 to v ⊕ 2^3 = 1101 = d since r4 = 1.
Fig. 7.34 E-cube routing on a four-dimensional hypercube: source s = 0110, destination d = 1101, route 0110 → 0111 → 0101 → 1101
The route selected is shown in Fig. 7.34 by arrows. Note that the route is determined from dimension 1 to dimension 4 in order. If the ith bit of s and d agree, no routing is needed along dimension i. Otherwise, move from the current node to the other node along the same dimension. The procedure is repeated until the destination is reached.
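A direct transcription of the three-step procedure is sketched below (node addresses are plain integers and dimension i corresponds to bit i - 1); it reproduces the route of Example 7.4.

```python
def e_cube_route(s: int, d: int, n: int):
    """E-cube routing on an n-cube: return the sequence of nodes from s to d."""
    r = s ^ d                        # direction bits r_i = s_{i-1} XOR d_{i-1}
    v, path = s, [s]
    for i in range(1, n + 1):        # visit dimensions 1..n in order
        if (r >> (i - 1)) & 1:       # route only along dimensions where bits differ
            v ^= 1 << (i - 1)        # v <- v XOR 2^(i-1)
            path.append(v)
    return path

# Example 7.4: s = 0110, d = 1101 on a 4-cube.
print([format(v, "04b") for v in e_cube_route(0b0110, 0b1101, 4)])
# -> ['0110', '0111', '0101', '1101']
```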
X-Y Routing on a 2D Mesh The same idea is applicable to mesh-connected networks. X-Y routing is illustrated by the example in Fig. 7.35. From any source node s = (x1, y1) to any destination node d = (x2, y2), route from s along the X-axis first until the path reaches column x2, where d is located. Then route to d along the Y-axis.
There are four possible X-Y routing patterns corresponding to the east-north, east-south, west-north, and west-south paths chosen.
Example 7.5 X-Y routing on a 2D mesh-connected multicomputer
Four (source, destination) pairs are shown in Fig. 7.35 to illustrate the four possible routing patterns on a two-dimensional mesh.
An east-north route is needed from node (2,1) to node (7,6). An east-south route is set up from node (0,7) to node (4,2). A west-south route is needed from node (5,4) to node (2,0). The fourth route is west-north bound from node (6,3) to node (1,5). If the X-dimension is always routed first and then the Y-dimension, a deadlock or circular wait situation will not exist.
Fig. 7.35 Four (source, destination) pairs and their X-Y routes on a two-dimensional mesh, illustrating the east-north, east-south, west-south, and west-north routing patterns
It is left as an exercise for the reader to prove that both the E-cube and X-Y schemes result in deadlock-free routing. Both can be applied in either store-and-forward or wormhole-routed networks, resulting in a minimal route with the shortest distance between source and destination.
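For comparison, the X-Y scheme can be sketched the same way (coordinates are (x, y) pairs; the X-dimension is always exhausted before any Y-dimension hop is taken).

```python
def xy_route(src, dst):
    """Deterministic X-Y routing on a 2D mesh: route along X first, then along Y."""
    (x, y), (x2, y2) = src, dst
    path = [(x, y)]
    while x != x2:                       # X-dimension first
        x += 1 if x2 > x else -1
        path.append((x, y))
    while y != y2:                       # then the Y-dimension
        y += 1 if y2 > y else -1
        path.append((x, y))
    return path

# East-north example from Fig. 7.35: node (2,1) to node (7,6), a minimal 10-hop route.
print(xy_route((2, 1), (7, 6)))
```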
However, the same dimension-order routing scheme cannot produce minimal routes for torus networks. Nonminimal routing algorithms, producing deadlock-free routes, allow packets to traverse longer paths, sometimes to reduce network traffic or for other reasons.
Adaptive Routing The main purpose of using adaptive routing is to achieve efficiency and avoid deadlock. The concept of virtual channels makes adaptive routing more economical and feasible to implement. We have shown in Fig. 7.32 how to apply virtual channels for this purpose. The idea can be further extended by having virtual channels in all connections along the same dimension of a mesh-connected network (Fig. 7.36).
Fig. 7.36 Adaptive X-Y routing using virtual channels to avoid deadlock; only westbound and eastbound traffic are deadlock-free (Courtesy of Lionel Ni, 1991): (a) original mesh without virtual channels, (b) two pairs of virtual channels in the Y-dimension, (c) channels used for a westbound message, (d) channels used for an eastbound message
In what follows, we consider the requirements for implementing multicast, broadcast, and conference communication patterns. Of course, all patterns can be implemented with multiple unicasts sequentially, or even simultaneously if resource conflicts can be avoided. Special routing schemes must be used to implement these multi-destination patterns.
Routing Efficiency Two commonly used efficiency parameters are channel bandwidth and communication latency. The channel bandwidth at any time instant (or during any time period) indicates the effective data transmission rate achieved to deliver the messages. The latency is indicated by the packet transmission delay involved.
An optimally routed network should achieve both maximum bandwidth and minimum latency for the communication patterns involved. However, these two parameters are not totally independent. Achieving maximum bandwidth may not necessarily achieve minimum latency at the same time, and vice versa. Depending on the switching technology used, latency is the more important issue in a store-and-forward network, while in general bandwidth affects efficiency more in a wormhole-routed network.
Example 7.7 Multicast and broadcast on a mesh-connected computer
Multicast routing is implemented on a 3 × 3 mesh in Fig. 7.37. The source node is identified as S, which transmits a packet to five destinations labeled Di for i = 1, 2, ..., 5.
Fig. 7.37 Multiple unicasts, multicast patterns, and a broadcast tree on a 3 × 4 mesh computer: (a) five unicasts with traffic = 13 and distance = 4, (b) a multicast pattern with traffic = 7 and distance = 4, (c) another multicast pattern with traffic = 6 and distance = 5, (d) broadcast to all nodes via a tree (numbers in nodes correspond to levels of the tree)
This five-destination multicast can be implemented by five unicasts, as shown in Fig. 7.37a. The X-Y routing traffic requires the use of 1 + 3 + 4 + 3 + 2 = 13 channels, and the latency is 4 for the longest path, leading to D3.
A multicast can be implemented by replicating the packet at an intermediate node, so that multiple copies of the packet reach their destinations with significantly reduced channel traffic.
Two multicast routes are given in Figs. 7.37b and 7.37c, resulting in traffic of 7 and 6, respectively. On a wormhole-routed network, the multicast route in Fig. 7.37c is better. For a store-and-forward network, the route in Fig. 7.37b is better and has a shorter latency.
A four-level spanning tree is used from node S to broadcast a packet to all the mesh nodes in Fig. 7.37d. Nodes reached at level i of the tree have latency i. This broadcast tree should result in minimum latency as well as in minimum traffic.
Example 7.8 Multicast and broadcast on a hypercube computer
To broadcast on an n-cube, a similar spanning tree is used to reach all nodes within a latency of n. This is illustrated in Fig. 7.38a for a 4-cube rooted at node 0000. Again, minimum traffic should result with a broadcast tree for a hypercube.
Fig. 7.38 Broadcast tree and multicast tree on a 4-cube using a greedy algorithm (Lan, Esfahanian, and Ni, 1990): (a) a broadcast tree rooted at node 0000, (b) a multicast tree from node 0101 to seven destination nodes
A greedy multicast tree is shown in Fig. 7.38b for sending a packet from node 0101 to seven destination nodes. The greedy multicast algorithm is based on sending the packet through the dimension(s) which can reach the greatest number of remaining destinations.
Starting from the source node S = 0101, there are two destinations via dimension 2 and five destinations via dimension 4. Therefore, the first-level channels used are 0101 → 0111 and 0101 → 1101.
From node 1101, there are three destinations reachable via dimension 2 and four destinations via dimension 1. Thus the second-level channels used include 1101 → 1111, 1101 → 1100, and 0111 → 0110.
Similarly, the remaining destinations can be reached with the third-level channels 1111 → 1110, 1111 → 1011, 1100 → 1000, and 0110 → 0010, and the fourth-level channel 1110 → 1010.
Extending the multicast tree, one should compare the reachability via all dimensions before selecting certain dimensions to obtain a minimum cover set for the destination nodes. In case of a tie between two dimensions, selecting any one of them is sufficient. Therefore, the tree may not be uniquely generated.
It has been proved that this greedy multicast algorithm requires the least number of traffic channels compared with multiple unicasts or a broadcast tree. To implement multicast operations on wormhole-routed networks, the router in each node should be able to replicate the data in the flit buffer.
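A simplified sketch of the greedy idea is given below; it is an illustrative reconstruction, not the exact Lan-Esfahanian-Ni algorithm. At each node it forwards a copy across the dimension that is useful to the largest number of still-uncovered destinations, hands that subset of destinations to the copy, and repeats until every destination is covered.

```python
def greedy_multicast(src: int, dests: set, n: int):
    """Build a multicast tree on an n-cube; returns the list of channels (edges) used."""
    dests = {d for d in dests if d != src}
    edges, remaining = [], set(dests)
    while remaining:
        # For each dimension, the destinations that get one hop closer by crossing it.
        covered = {j: {d for d in remaining if (src ^ d) >> j & 1} for j in range(n)}
        best = max(covered, key=lambda j: len(covered[j]))   # greedy dimension choice
        neighbor = src ^ (1 << best)
        edges.append((src, neighbor))
        edges += greedy_multicast(neighbor, covered[best], n)
        remaining -= covered[best]
    return edges

# Illustrative run (destination set chosen arbitrarily, not the one in Fig. 7.38b).
for a, b in greedy_multicast(0b0101, {0b0111, 0b1111, 0b0010}, 4):
    print(format(a, "04b"), "->", format(b, "04b"))
```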
In order to synchronize the growth of a multicast tree or a broadcast tree, all outgoing channels at the same level of the tree must be ready before transmission can be pushed one level down. Otherwise, additional buffering is needed at intermediate nodes.
Virtual Networks Consider a mesh with dual virtual channels along both dimensions, as shown in Fig. 7.39a. These virtual channels can be used to generate four possible virtual networks. For west-north traffic, the virtual network in Fig. 7.39b should be used.
Fig. 7.39 Four virtual networks implementable from a dual-channel mesh: (a) the mesh with dual virtual channels in both dimensions, (b) west-north subnet, (c) east-north subnet, (d) west-south subnet, (e) east-south subnet
Similarly, one can construct three other virtual nets for the other traffic orientations. Note that no cycle is possible on any of the virtual networks. Thus deadlock can be completely avoided when X-Y routing is implemented on these networks.
If both pairs between adjacent nodes are physical channels, then any two of the four virtual networks can be used simultaneously without conflict. If only one pair of physical channels is shared by the dual virtual channels between adjacent nodes, then only (b) and (e) or (c) and (d) can be used simultaneously.
Other combinations, such as (b) and (c), or (b) and (d), or (c) and (e), or (d) and (e), cannot coexist at the same time due to a shortage of channels.
Obviously, adding channels to the network will increase the adaptivity in making routing decisions.
However, the increased cost can be appreciable and thus prevent the use of redundancy.
Network Partitioning The concept of virtual networks leads to the partitioning of a given physical network into logical subnetworks for multicast communications. The idea is illustrated in Fig. 7.40.
Fig. 7.40 Partitioning of a 6 × 8 mesh into four subnets for a multicast from source node (4,2). Shaded nodes are along the boundary of adjacent subnets (Courtesy of Lin, McKinley, and Ni, 1991)
Suppose source node (4, 2) wants to transmit to a subset of nodes in the 6 × 8 mesh. The mesh is partitioned into four logical subnets. All traffic heading for the east and north uses the subnet at the upper right corner.
Similarly, one constructs three other subnets at the remaining corners of the mesh. Nodes in the fifth column and third row are along the boundary between subnets. Essentially, the traffic is directed outward from the center node (4, 2). There is no deadlock if an X-Y multicast is performed in this partitioned mesh.
Similarly, one can partition a binary n-cube into 2^(n-1) subcubes to provide deadlock-free adaptive routing. Each subcube has n + 1 levels with 2^n virtual channels per level for the bidirectional network. The number of required virtual channels increases rapidly with n. It has been shown that for low-dimensional cubes (n = 2 to 4), this method is best for general-purpose routing.
Summary
In a multiprocessor system, interconnects between sub-systems such as processors, memories and network controllers play a crucial role in determining system performance. The earliest multiprocessor systems were bus-based, with shared main memory. The bus is a simple interconnect but it has limitations in scalability. Hierarchical bus systems can address the problem to a limited extent, but as systems grow larger, more sophisticated and scalable system interconnects are needed.
A network may be of blocking or non-blocking type. We studied the crossbar network and the basic design of a row of crosspoint switches, with its arbitration and multiplexer modules. While it has better aggregate bandwidth than the bus, the crossbar network also has limitations of scalability. Multi-port memory can be used to enhance the aggregate bandwidth of a memory module.
We studied Omega and Butterfly multistage networks. Larger Omega networks can be built using 2×2 and 4×4 basic switches, while the Butterfly network is built from modules of crossbar switches. When network traffic is non-uniform, so-called 'hot spots' may develop which may degrade network performance. The concept of combining networks was developed in an attempt to address this performance limitation.
We studied the related issues of maintaining cache coherence and synchronization. Write operations on shared cache data, process migration and I/O operations can cause loss of cache coherence. If all the caches are on a common bus, then a snoopy bus protocol can be used to maintain cache coherence. Directory-based cache coherence protocols, using full-map, limited or chained directories, can be used on more general types of system interconnects. Details of the schemes vary between write-back and write-through types of cache.
Hardware synchronization mechanisms between processors make use of atomic operations typified by Test&Set. However, at a still lower level of hardware, in theory wired barrier synchronization can also be used, of which we saw examples.
Three early generations of multicomputer systems were studied, providing a picture of how multicomputer architecture has evolved over time. Broadly, the trend has been from expensive to low-cost processors, from shared to distributed memory, and (with higher speed processors) to higher speed interconnects. We studied the Intel Paragon system as a specific example, laying the basis to review more recent advances in Chapter 13.
Message-passing communication uses networks of point-to-point links, the basic aim of routing protocols being to achieve low network latency and high bandwidth. We studied the typical formats of messages, packets, and flits (flow control digits); routing schemes were studied from the points of view of latency analysis and the avoidance of deadlocks. We examined the important concepts of virtual channels, wormhole routing, flow control, collision resolution, dimension-order routing, and multicast communication.
Exercises
Problem 7.1 Consider a multiprocessor with n processors and m shared-memory modules, all connected to the same backplane bus (the Data Transfer Bus, DTB) with a central arbiter as depicted below:
(a) Calculate the memory bandwidth, defined as the average number of memory words transferred per second over the DTB, if n = 8, m = 16, τ = 10 ns, and c = 8τ = 80 ns.
(b) Calculate the memory utilization, defined as the average number of requests accepted by all memory modules per memory cycle, using the same set of parameters used in part (a).
(Diagram: processors and shared-memory modules M1, ..., Mm attached to the Data Transfer Bus through a central arbiter.)
... interstage connection pattern from b^n inputs to b^n outputs.
(d) Figure out a simple routing scheme to control the switch settings from stage to stage in a b^n × b^n Delta network with n stages.
(e) What is the relationship between Omega networks and Delta networks?

Problem 7.7 Prove the following properties associated with multistage Omega networks using different-sized building blocks:
(a) Prove that the number of legitimate states (connections) in a k × k switch module equals k^k.
(b) Determine the percentage of permutations that can be realized in one pass through a 64-input Omega network built with 2 × 2 switch modules.
(c) Repeat part (b) for a 64-input Omega network built with 8 × 8 switch modules.
(d) Repeat part (b) for a 512-input Omega network built with 8 × 8 switch modules.

Problem 7.8 Consider the interleaved execution of k programs in a multiprogrammed multiprocessor using m wired-NOR synchronization lines on n processors, as described in Fig. 7.19a.
In general, the number m_i of barrier lines needed for program i is estimated as m_i = b_i ⌈q_i / P_i⌉ + 1, where b_i = the number of barriers demanded in program i, q_i = the number of processes created in program i, and P_i = the number of processors allocated to program i.
Thus m = m_1 + m_2 + ... + m_k. For simplicity assume b_i = b and q_i = q for i = 1, 2, ..., k, and that P_i = min(n/k, q) processors are allocated to each program i.
Prove that m can be approximated by b·q·k²/n + k, or that the degree of multiprogramming is k ≤ [-n + √(n² + 4bqmn)] / (2bq) in such a multiprocessor system. Note that bq represents the number of required synchronization points, which depends on the parallelism profiles in user programs. For fixed values of b, q and n, the maximally allowed multiprogramming degree k increases with respect to m.

Problem 7.9 Wilson (1987) proposed a hierarchical cache/bus architecture (Fig. 7.3) and outlined how multilevel cache coherence can be enforced by extending the write-invalidate protocol. Can you figure out a write-broadcast protocol for achieving multilevel cache coherence on the same hardware platform? Comment on the relative merits of the two protocols. Feel free to modify the hardware in Fig. 7.3 if needed to implement the write-broadcast protocol on the hierarchical bus/cache architecture.

Problem 7.10 Answer the following questions on design choices of multicomputers made in the past:
(a) Why were low-cost processors chosen over expensive processors as processing nodes?
(b) Why was distributed memory chosen over shared memory?
(c) Why was message passing chosen over address switching?
(d) Why was MIMD, MPMD, or SPMD control chosen over SIMD data parallelism?

Problem 7.11 Explain the following terms associated with multicomputer networks and message-passing mechanisms:
(a) Message, packets, and flits.
(b) Store-and-forward routing at packet level.
(c) Wormhole routing at flit level.
(d) Virtual channels versus physical channels.
(e) Buffer deadlock versus channel deadlock.
(f) Buffering flow control using virtual cut-through routing.
(g) Blocking flow control in wormhole routing.
(h) Discard and retransmission flow control.
(i) Detour flow control after being blocked.
(j) Virtual networks and subnetworks.
Problem 7.18 Consider the implementation of Goodman's write-once cache coherence protocol in a bus-connected multiprocessor system. Specify the use of additional bus lines to inhibit the main memory when the memory copy is invalid. Also specify all other hardware mechanisms and software support needed for an economical and fast implementation of the Goodman protocol.
Explain why this protocol will reduce bus traffic and how unnecessary invalidations can be eliminated. Consult if necessary the two related papers published by Goodman in 1983 and 1990.

Problem 7.19 Study the paper by Archibald and Baer (1986) which evaluated various cache coherence protocols using a multiprocessor simulation model. Explain the Dragon protocol implemented in the Dragon multiprocessor workstation at the Xerox Palo Alto Research Center. Compare the relative merits of the Goodman protocol, the Firefly protocol, and the Dragon protocol in the context of implementation requirements and expected performance.

Problem 7.20 The Cedar multiprocessor at Illinois was built with a clustered Omega network as shown below. Four 8 × 4 crossbar switches were used in the first stage and four 4 × 8 crossbar switches were used in the second stage. There were 32 processors and 32 memory modules, divided into four clusters with eight of each per cluster.
(Diagram: processors connected through Stage 1 and Stage 2 crossbar switches to the memories.)
(a) Figure out a fixed priority scheme to avoid conflicts in using the crossbar switches for nonblocking connections. For simplicity ...
(b) ... switches. Design a two-stage Cedar network to provide switched connections between 64 processors and 64 memory modules, again in a clustered manner similar to the above Cedar network design.
(c) Further expand the Cedar network to three stages using 8 × 8 crossbar switches as building blocks to connect 512 processors and 512 memory modules. Show the schematic interconnections in all three stages from the input end to the output end.
In general, vector processing is faster and more efficient than scalar processing. Both pipelined processors and SIMD computers can perform vector operations. Vector processing reduces the software overhead incurred in the maintenance of looping control, reduces memory-access conflicts, and above all matches nicely with the pipelining and segmentation concepts to generate one result per clock cycle continuously.
Depending on the speed ratio between vector and scalar operations (including startup delays and other overheads) and on the vectorization ratio in user programs, a vector processor executing a well-vectorized code can easily achieve a speedup of 10 to 20 times, as compared with scalar processing on conventional machines.
Of course, the enhanced performance comes with increased hardware and compiler costs, as expected. A compiler capable of vectorization is called a vectorizing compiler or simply a vectorizer. For successful vector processing, one needs to make improvements in vector hardware, vectorizing compilers, and programming skills specially targeted at vector machines.
Vector Instruction Types We briefly introduced basic vector instructions in Chapter 4. Characterized below are vector instructions for register-based, pipelined vector machines. Six types of vector instructions are illustrated in Figs. 8.1 and 8.2. We define these vector instruction types by mathematical mappings between their working registers or memory, where vector operands are stored.
Fig. 8.1 Vector instruction types: (a) vector-vector instructions, (b) vector-scalar instructions, (c) vector-memory instructions (vector load and vector store through the memory path)
(1) Vector-vector instructions As shown in Fig. 8.1a, one or two vector operands are fetched from the respective vector registers, enter through a functional pipeline unit, and produce results in another vector register. These instructions are defined by the following two mappings:

f1 : Vi → Vj                                    (8.1)
f2 : Vj × Vk → Vi                               (8.2)
Examples are V1 = sin(V2) and V3 = V1 + V2 for the mappings f1 and f2, respectively, where Vi for i = 1, 2, and 3 are vector registers.
(2) Vector-scalar instructions Figure 8.1b shows a vector-scalar instruction corresponding to the following mapping:

f3 : s × Vk → Vi                                (8.3)

An example is a scalar product s × V1 = V3, in which the elements of V1 are each multiplied by a scalar s to produce vector V3 of equal length.
(3) Vector-memory instructions This corresponds to vector load or vector store (Fig. 8.1c), element by element, between a vector register (V) and the memory (M), as defined below:

f4 : M → V    vector load                       (8.4)
f5 : V → M    vector store                      (8.5)

The offsets (indices) from the base address are retrieved from the vector register V0. The effective memory addresses are obtained by adding the base address to the indices.
Fig. 8.2 Gather, scatter, and masking operations on the Cray Y-MP (Courtesy of Cray Research, 1990): (a) gather instruction, (b) scatter instruction, (c) masking instruction
The scatter instruction reverses the mapping operations, as illustrated in Fig. 8.2b. Both the VL and A0 registers are embedded in the instruction.
The masking instruction is shown in Fig. 8.2c for compressing a long vector into a short index vector. The contents of vector register V0 are tested for zero or nonzero elements. A masking register (VM) is used to store the test results. After testing and forming the masking vector in VM, the corresponding nonzero indices are stored in the V1 register. The VL register indicates the length of the vector being tested.
The gather, scatter, and masking instructions are very useful in handling sparse vectors or sparse matrices often encountered in practical vector processing applications. Sparse matrices are those in which most of the entries are zeros. Advanced vector processors implement these instructions directly in hardware.
The above instruction types cover the most important ones. A given specific vector processor may implement an instruction set containing only a subset or even a superset of the above instructions.
Vector Operand Specifications Vector operands may have arbitrary length. Vector elements are not
necessarily stored in contiguous memory locations. For example, the entries in a matrix may be stored in
row-major or in column-major order. Each row, column, or diagonal of the matrix can be used as a vector.
When row elements are stored in contiguous locations with a unit stride, the column elements are stored
with a stride of n, where n is the matrix order. Similarly, the diagonal elements are separated by a stride
of n + 1.
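To make these stride rules concrete, the short sketch below (an illustration with a hypothetical base address and matrix order) lists the word addresses of a row, a column, and the main diagonal of an n × n matrix stored in row-major order.

    def vector_addresses(base, stride, length):
        # Addresses of 'length' elements starting at 'base', 'stride' words apart.
        return [base + i * stride for i in range(length)]

    n = 4              # matrix order (hypothetical)
    base = 100         # base address of the matrix (hypothetical)

    row      = vector_addresses(base, 1,     n)   # unit stride
    column   = vector_addresses(base, n,     n)   # stride n
    diagonal = vector_addresses(base, n + 1, n)   # stride n + 1
    print(row, column, diagonal)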
To access a vector in memory, one must specify its base address, stride, and length. Since each vector
register has a fixed number of component registers, only a segment of the vector can be loaded into the vector
register in a fixed number of cycles. Long vectors must be segmented and processed one segment at a time.
Vector operands should be stored in memory to allow pipelined or parallel access. The memory system for
a vector processor must be specifically designed to enable fast vector access. The access rate should match
the pipeline rate. In fact, the access path is often itself pipelined and is called an access pipe. These vector-
access memory organizations are described below.
C-Access Memory Organization The m-way low-order interleaved memory structure shown in
Figs. 5.15a and 5.16 allows m memory words to be accessed concurrently in an overlapped manner. This
concurrent access has been called C-access, as illustrated in Fig. 5.16b.
The access cycles in different memory modules are staggered. The low-order a bits select the modules,
and the high-order b bits select the word within each module, where m = 2^a and a + b = n is the address length.
To access a vector with a stride of 1, successive addresses are latched in the address buffer at the rate of
one per cycle. Effectively it takes m minor cycles to fetch m words, which equals one (major) memory cycle
as stated in Eq. 5.4 and Fig. 5.16b.
If the stride is 2, successive accesses must be separated by two minor cycles in order to avoid access
conflicts. This reduces the memory throughput by one-half. If the stride is 3, there is no module conflict and
the maximum throughput (m words) results. In general, C-access will yield the maximum throughput of m
words per memory cycle if the stride is relatively prime to m, the number of interleaved memory modules.
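The rule that C-access achieves its full rate only when the stride is relatively prime to m can be checked with a short sketch; m and the strides below are illustrative values, not tied to a particular machine.

    from math import gcd

    def c_access_words_per_cycle(m, stride):
        # With low-order interleaving, module = address mod m, so a stride s
        # touches m / gcd(m, s) distinct modules before repeating; that is
        # roughly the number of words deliverable per major memory cycle.
        return m // gcd(m, stride)

    m = 8   # 8-way interleaving (hypothetical)
    for s in (1, 2, 3, 4, 8):
        print("stride", s, "->", c_access_words_per_cycle(m, s), "words per cycle")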
S-Access Memory Organization The low-order interleaved memory can be rearranged to allow
simultaneous access, or S-access, as illustrated in Fig. 8.3a. In this case, all memory modules are accessed
simultaneously in a synchronized manner. Again the high-order (n - a) bits select the same offset word from
each module.
[Fig. 8.3 The S-access interleaved memory for vector operand access: (a) S-access organization for an m-way interleaved memory, with the high-order address bits applied to all memory modules, a data latch on each module, and the low-order address bits multiplexing one word out per minor cycle]
At the end of each memory cycle (Fig. 8.3b), m = 2^a consecutive words are latched in the data buffers
simultaneously. The low-order a bits are then used to multiplex the m words out, one per minor cycle.
If the minor cycle is chosen to be 1/m of the major memory cycle (Eq. 5.4), then it takes two memory cycles
to access m consecutive words.
However, if the access phase of the last access is overlapped with the fetch phase of the current access
(Fig. 8.3b), effectively m words take only one memory cycle to access. If the stride is greater than 1, the
throughput decreases, roughly in proportion to the stride.
C/S-Access Memory Organization A memory organization in which C-access and S-access are
combined is called C/S-access. This scheme is shown in Fig. 8.4, where n access buses are used with m
interleaved memory modules attached to each bus. The m modules on each bus are m-way interleaved to
allow C-access. The n buses operate in parallel to allow S-access. In each memory cycle, at most m · n words
are fetched if the n buses are fully used with pipelined memory accesses.
[Fig. 8.4 The C/S-access memory organization: n processors attached through a system interconnect to n buses, each bus serving m interleaved memory modules]
The C/S-access memory is suitable for use in vector multiprocessor configurations. It provides parallel
pipelined access of a vector data set with high bandwidth. A special vector cache design is needed within each
processor in order to guarantee smooth data movement between the memory and multiple vector processors.
The Cray Research Series Seymour Cray founded Cray Research, Inc. in 1972. Since then, hundreds
of units of Cray supercomputers have been produced and installed worldwide. As we shall see in Chapter 13,
the company has gone through a change of name and an evolution of its product line.
The Cray 1 was introduced in 1975. An enhanced version, the Cray 1S, was produced in 1979. It was the
first ECL-based supercomputer with a 12.5-ns clock cycle. High degrees of pipelining and vector processing
were the major features of these machines.
Ten functional pipelines could run simultaneously in the Cray 1S to achieve a computing power equivalent
to that of 10 IBM 3033's or CDC 7600's. Only batch processing with a single user was allowed when
the Cray 1 was initially introduced, using the Cray Operating System (COS) with a Fortran 77 compiler (CFT Version 2.1).
The Cray X-MP Series introduced multiprocessor configurations in 1983. Steve Chen led the effort at Cray
Research in developing this series using one to four Cray 1-equivalent CPUs with shared memory. A unique
feature introduced with the X-MP models was shared register clusters for fast interprocessor communications
without going through the shared memory.
Besides 128 Mbytes of shared memory, the X-MP system had 1 Gbyte of solid-state storage (SSD) as
extended shared memory. The clock rate was also reduced to 8.5 ns. The peak performance of the X-MP
416 was 840 Mflops when eight vector pipelines for add and multiply were used simultaneously across four
processors.
The successor to the Cray X-MP was the Cray Y-MP, introduced in 1988 with up to eight processors in a
single system using a 6-ns clock rate and 256 Mbytes of shared memory.
The Cray Y-MP C-90 was introduced in 1990 to offer an integrated system with 16 processors using a
4.2-ns clock. We will study the Y-MP 816 and C-90 models in detail in the next section.
Another product line was the Cray 2S, introduced in 1985. The system allowed up to four processors with
2 Gbytes of shared memory and a 4.1-ns clock. A major contribution of the Cray 2 was the switch from
batch processing under COS to multiuser UNIX System V on a supercomputer. This led to the UNICOS operating
system, derived from UNIX System V and Berkeley BSD 4.3, variants of which are currently in use in some Cray
computer systems.
The Cyber/ETA Series Control Data Corporation (CDC) introduced its first supercomputer, the STAR-100,
in 1973. The Cyber 205, produced in 1982, was its successor. The Cyber 205 ran at a 20-ns clock rate, using up to
four vector pipelines in a uniprocessor configuration.
Different from the register-to-register architecture used in Cray and other supercomputers, the Cyber
205 and its successor, the ETA 10, had a memory-to-memory architecture with longer vector instructions
containing memory addresses.
The largest ETA 10 consisted of 8 CPUs sharing memory and 18 I/O processors. The peak performance
of the ETA 10 was targeted for 10 Gflops. Both the Cyber and the ETA Series are no longer in production but
were in use for many years at several supercomputer centers.
Japanese Supercomputers NEC produced the SX-X Series with a claimed peak performance of 22 Gflops
in 1991. Fujitsu produced the VP-2000 Series with a 5-Gflops peak performance at the same time. These two
machines used 2.9- and 3.2-ns clocks, respectively.
Shared communication registers and reconfigurable vector registers were special features in these
machines. Hitachi offered the 820 Series providing a 3-Gflops peak performance. Japanese supercomputers
were at one time strong in high-speed hardware and interactive vectorizing compilers.
The NEC SX-X 44 NEC claimed that this machine was the fastest vector supercomputer (22 Gflops peak)
ever built up to 1992. The architecture is shown in Fig. 8.5. One of the major contributions to this performance
was the use of a 2.9-ns clock cycle based on VLSI and high-density packaging.
There were four arithmetic processors communicating through either the shared registers or the shared
memory of 2 Gbytes. There were four sets of vector pipelines per processor, each set consisting of two add/
shift and two multiply/logical pipelines. Therefore, 64-way parallelism was obtained with four processors,
similar to that in the C-90.
Besides the vector unit, a high-speed scalar unit employed a RISC architecture with 128 scalar registers.
Instruction reordering was supported to exploit higher parallelism. The main memory was 1024-way
interleaved. The extended memory of up to 16 Gbytes provided a maximum transfer rate of 2.75 Gbytes/s.
A maximum of four I/O processors could be configured to accommodate a 1-Gbyte/s data transfer rate per
I/O processor. The system could provide a maximum of 256 channels for high-speed network, graphics, and
peripheral operations. The support included 100-Mbytes/s channels.
[Fig. 8.5 The NEC SX-X 44 vector supercomputer architecture (Courtesy of NEC, 1991). Captions: XMU, extended memory unit; IOP, I/O processors (4); DCP, data control processors (2); AP, arithmetic processors (4); MMU, main memory unit; DPM, data control processor memory. Each arithmetic processor contains four sets of vector pipelines for add/shift and multiply/logical vector operations, plus a scalar unit with scalar registers, a cache, and a scalar pipe.]
Relative Vector/Scalar Performance Let r be the vector/scalar speed ratio and f the vectorization ratio.
By Amdahl's law in Section 3.3.1, the following relative performance can be defined:
P = 1/((1 - f) + f/r) = r/((1 - f)r + f)    (8.11)
This relative performance indicates the speedup of vector processing over scalar processing.
The hardware speed ratio r is the designer's choice. The vectorization ratio f reflects the percentage of code
in a user program which is vectorized.
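A direct transcription of Eq. 8.11 makes this sensitivity easy to see; the values of r and f below are illustrative only.

    def relative_performance(f, r):
        # Eq. 8.11: speedup of vector over scalar processing for a
        # vectorization ratio f and vector/scalar speed ratio r.
        return 1.0 / ((1.0 - f) + f / r)

    for r in (5, 10, 25):
        for f in (0.3, 0.7, 0.9):
            print(f"r={r:2d} f={f:.1f}  P={relative_performance(f, r):.2f}")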
The relative performance is rather sensitive to the value of f. This value can be increased by using a
better vectorizing compiler or through user program transformations. The following example shows the IBM
experience in vector processing with the 3090/VF computer system.
Example 8.2 The vector/scalar relative performance of the IBM 3090/VF
Figure 8.6 plots the relative performance P as a function of r with f as a running parameter. The higher the
value of f, the higher the relative speedup. The IBM 3090 with vector facility (VF) was a high-end mainframe
with add-on vector hardware.
[Fig. 8.6 Speedup performance of vector processing over scalar processing in the IBM 3090/VF: relative performance P plotted against r = 1 to 10 with the vectorization ratio f (30% to 80%) as a running parameter (Courtesy of IBM Corporation)]
The designers of the 3090/VF chose a speed ratio in the range 3 ≤ r ≤ 5 because IBM wanted a balance
between business and scientific applications. When the program is 70% vectorized, one expects a maximum
speedup of about 2.3. However, for f ≤ 30%, the speedup is reduced to less than 1.3.
The IBM designers did not choose a high speed ratio because they did not expect user programs to be
highly vectorizable. When f is low, the speedup cannot be high, even with a very high r. In fact, the limiting
case is P → 1 as f → 0.
On the other hand, P → r as f → 1. Scientific supercomputer designers like Cray and the Japanese
manufacturers often chose a much higher speed ratio, say, 10 ≤ r ≤ 25, because they expected a higher
vectorization ratio f in user programs, or they used better vectorizers to increase the ratio to a desired level.
Huge advances have taken place in the underlying technologies, and especially in VLSI technology,
over the last two decades. We shall see that these advances, summarized briefly in Chapter 13, have defined
the direction of advances in computer architecture over this period. Powerful single-chip processors, as
well as multi-core systems-on-a-chip, provide High Performance Computing (HPC) today. Such HPC systems
typically make use of MIMD and/or SPMD configurations with a large number of processors.
The advent of superscalar processors has resulted in vector processing instructions being built into powerful
processors, rather than into specialized processors. Thus the ideas we have studied in this section have made
their appearance in capabilities such as Streaming SIMD Extensions (SSE) in processors (see Chapter 13).
We may say that the concepts of vector processing remain valid today, but their implementations vary with
advances in technology.
MULTIVECTOR MULTIPROCESSORS
The architectural design of supercomputers continues to be upgraded based on advances
in technology and past experience. Design rules are provided for high performance, and
we review these rules in case studies of well-known early supercomputers, high-end mainframes, and
minisupercomputers. The trends toward scalable architectures in building MPP systems for supercomputing
are also assessed, while recent developments will be discussed in Chapter 13.
Architecture Design Goals Smith, Hsu, and Hsiung (1990) identified the following four major challenges
in the development of future general-purpose supercomputers:
* Maintaining a good vector/scalar performance balance.
* Supporting scalability with an increasing number of processors.
* Increasing memory system capacity and performance.
* Providing high-performance I/O and an easy-access network.
Balanced Vector/Scalar Ratio In a supercomputer, separate hardware resources with different speeds are
dedicated to concurrent vector and scalar operations. Scalar processing is indispensable for general-purpose
architectures. Vector processing is needed for regularly structured parallelism in scientific and engineering
computations. These two types of computations must be balanced.
The vector balance point is defined as the percentage of vector code in a program required to achieve
equal utilization of vector and scalar hardware. In other words, we expect equal time spent in vector and
scalar hardware so that no resources will be idle.
Example 8.3 Vector/scalar balance point in supercomputer design (Smith, Hsu, and Hsiung, 1990)
If a system is capable of 9 Mflops in vector mode and 1 Mflops in scalar mode, equal time will be spent in
each mode if the code is 90% vector and 10% scalar, resulting in a vector balance point of 0.9.
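The numbers in Example 8.3 can be verified directly; the sketch below, with an arbitrary workload of 1 Mflop, computes the time spent in each mode and shows that the two are equal at f = 0.9.

    def mode_times(total_mflop, f, vector_rate, scalar_rate):
        # Time in vector mode and in scalar mode, with rates in Mflops.
        return total_mflop * f / vector_rate, total_mflop * (1.0 - f) / scalar_rate

    t_vec, t_sca = mode_times(1.0, 0.9, vector_rate=9.0, scalar_rate=1.0)
    print(t_vec, t_sca)   # 0.1 and 0.1: equal time, so the balance point is 0.9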
It may not be optimal for a system to spend equal time in vector and scalar modes. However, the vector
balance point should be maintained sufficiently high, matching the level of vectorization in user programs.
Vector performance can be enhanced with replicated functional unit pipelines in each processor. Another
approach is to apply deeper pipelining on vector units with a double or triple clock rate with respect to scalar
pipeline operations. Longer vectors are required to really achieve the target performance.
Vector/Scalar Performance In Figs. 8.7a and 8.7b, the single-processor vector performance and scalar
performance are shown, based on running the Livermore Fortran loops on Cray Research and Japanese
supercomputers of the 1980s and early 1990s. The scalar performance of these supercomputers increases
along the dashed lines in the figure.
One of the contributing factors to vector capability is the high clock rate; other factors include the use of
a better compiler and the optimization support provided.
Table 8.2 compares the vector and scalar performances of seven supercomputers of that period. Note
that these supercomputers have a 90% or higher vector balance point. The higher the vector/scalar ratio, the
heavier the dependence on a high degree of vectorization in the object code.
Source: Smith et al., "Future General-Purpose Supercomputing," IEEE Supercomputing Conference, 1990.
The above approach is quite different from the design of comparable IBM vector machines, which
maintained a low vector/scalar ratio between 3 and 5. The idea was to make a good compromise between the
demands of scalar and vector processing for general-purpose applications.
I/O and Networking Performance With the aggregate speed of supercomputers increasing at least
three to five times each generation, problem size has been increasing accordingly, as have I/O bandwidth
requirements. Figure 8.7c illustrates the aggregate I/O bandwidths supported by supercomputer systems of
the period up to the early 1990s.
[Fig. 8.7 Some reported supercomputer performance data for machines of 1976-1990, including the Hitachi S-820, Fujitsu VP series, NEC SX series, Cray 1, Cray X-MP/4, Cray Y-MP, and Cray 2: (a) vector performance (Mflops), (b) scalar performance (Mflops), (c) I/O performance (Mbytes/s) (Source: Smith, Hsu, and Hsiung, IEEE Supercomputing Conference, 1990)]
I/O is defined as the transfer of data between the processor/memory and peripherals or a network. In
the earlier generation of supercomputers, I/O bandwidths were not always well correlated with computational
performance. I/O processor architectures were implemented by Cray Research with two different approaches.
The first approach is exemplified by the Cray Y-MP I/O subsystem, which used I/O processors that were
flexible and could do complex processing. The second approach was used in the Cray 2, where a simple
front-end processor controlled high-speed channels, with most of the I/O management being done by the
mainframe's operating system.
Today an aggregate I/O transfer rate of more than 100 Gbytes/s is needed in supercomputers connected to
high-speed disk arrays and networks. Support for high-speed networking has become a major component of
the I/O architecture in supercomputers.
Memory Demand The main memory sizes and extended memory sizes of supercomputers of the 1980s and
early 1990s are shown in Fig. 8.8. A large-scale memory system must provide a low latency for scalar
processing, a high bandwidth for vector and parallel processing, and a large size for grand challenge problems
and high throughput.
[Fig. 8.8 Supercomputer memory capacities in Mbytes, main and extended (Source: Smith, Hsu, and Hsiung, IEEE Supercomputing Conference, 1990)]
To achieve the above goals, an effective memory hierarchy is necessary. A typical hierarchy may consist of
data files or disks, extended memory in dynamic RAMs, a fast shared memory in static RAMs, and a cache/
local memory using RAM arrays.
Over the last two decades, with advances in VLSI technology, the processing power available on a chip
has tended to double every two years or so. Memory sizes available on a chip have also grown rapidly;
however, as we shall see in Chapter 13, the memory speeds achievable, i.e. read and write cycle times,
have grown much less rapidly than processor performance. Therefore the relative speed mismatch between
processors and memory, which has been a feature of computer systems from their earliest days, has widened
much further over the last two decades. This has necessitated the development of more sophisticated memory
latency hiding techniques, such as wider memory access paths and multi-level cache memories.
Supporting Scalability Multiprocessor supercomputers must be designed to support the triad of scalar,
vector, and parallel processing. The dominant scalability problem involves support of shared memory with
an increasing number of processors and memory ports. Increasing memory-access latency and interprocessor
communication overhead impose additional constraints on scalability.
Scalable architectures include multistage interconnection networks in flat systems, hierarchical clustered
systems, and multidimensional spanning buses, ring, mesh, or torus networks with a distributed shared
memory. Table 8.3 summarizes the key features of three representative multivector supercomputers of the 1990s.
[Fig. 8.9 Cray Y-MP 816 system organization (Courtesy of Cray Research, 1991): eight CPUs share the central memory, the I/O control, the interprocessor communication section, and the real-time clock; each CPU contains vector, scalar, and address registers and functional units (add/subtract, multiply, shift, logic, population count, reciprocal approximation), a vector length register, instruction buffers, exchange parameter registers, a programmable clock, a status register, and a performance monitor]
The central memory was divided into 256 interleaved banks. Overlapping memory access was made
possible through memory interleaving via four memory-access ports per CPU. A 6-ns clock period was used
in the CPU design.
The central memory offered 16M-, 32M-, 64M-, and 128M-word options with a maximum size of
1 Gbyte. The SSD options were from 32M to 512M words, or up to 4 Gbytes.
The four memory-access ports allowed each CPU to perform two scalar and vector fetches, one store, and
one independent I/O simultaneously. These parallel memory accesses were also pipelined to make vector
read and vector write possible.
The system had built-in resolution hardware to minimize the delays caused by memory conflicts. To
protect data, single-error correction/double-error detection (SECDED) logic was used in central memory and
on the data channels to and from central memory.
The CPU computation section consisted of 14 functional units divided into vector, scalar, address, and
control sections (Fig. 8.9). Both scalar and vector instructions could be executed in parallel. All arithmetic
was register-to-register. Eight of the 14 functional units could be used by vector instructions.
Large numbers of address, scalar, vector, intermediate, and temporary registers were used. Flexible
chaining of functional pipelines was made possible through the use of registers and multiple memory-access
and arithmetic/logic pipelines. Both 64-bit floating-point and 64-bit integer arithmetic were performed.
Large instruction caches (buffers) were used to hold 512 16-bit instruction parcels at a time.
The interprocessor communication section of the mainframe contained clusters of shared registers for fast
synchronization purposes. Each cluster consisted of shared address, shared scalar, and semaphore registers.
Note that vector data communication among the CPUs was done through the shared memory.
The real-time clock consisted of a 64-bit counter that advanced one count each clock period. Because the
clock advanced synchronously with program execution, it could be used to time the execution to an exact
clock count.
The I/O section supported three channel types with transfer rates of 6 Mbytes/s, 100 Mbytes/s, and
1 Gbyte/s. The IOS and SSD were high-speed data transfer devices designed to support the mainframe
processing through eight caches.
Example 8.4 The multistage crossbar network in the Cray Y-MP 816
The interconnections between the 8 CPUs and 256 memory banks in the Cray Y-MP 816 were implemented
with a multistage crossbar network, logically depicted in Fig. 8.10. The building blocks were 4 × 4 and 8 × 8
crossbar switches and 1 × 8 demultiplexers.
[Fig. 8.10 Schematic logic diagram of the crossbar network between 8 processors and 256 memory banks in the Cray Y-MP 816; each processor connects through crossbar stages and demultiplexers to memory subsections holding interleaved bank numbers such as 0, 4, 8, ..., 252 and 1, 5, 9, ..., 253, up to bank 255]
The network was controlled by a form of circuit switching, where all conflicts were worked out early in the
memory-access process and all requests from a given port returned to the port in order.
The use of a multistage network instead of a single-stage crossbar for processor-memory connections
was aimed at enhancing scalability in the building of even larger systems with 64 or 1024 processors.
However, crossbar networks work well only for small systems. To enhance scalability, emphasis should be given
to data routing, heavier reliance on processor-based local memory (as in the Cray 2), or the use of clustered
structures (as in the Cedar multiprocessor) to offset any increased latency when system size increases.
The C-90 and Clusters The C-90 was further enhanced in technology and scaled in size from the Y-MP
Series. The architectural features of the C-90/16256 are summarized in Table 8.3. The system was built with
16 CPUs, each of which was similar to that used in the Y-MP. The system used up to 256 megawords
(2 Gbytes) of shared main memory among the 16 processors. Up to 16 Gbytes of SSD memory was available
as optional secondary main memory. In each cycle, two vector pipes and two functional units could operate in
parallel, producing four vector results per clock. This implied a four-way parallelism within each processor.
Thus 16 processors could deliver a maximum of 64 vector results per clock cycle.
The C-90 used the UNICOS operating system, which was extended from UNIX System V and Berkeley
BSD 4.3. The C-90 could be driven by a number of host machines. Vectorizing compilers were available for
Fortran 77 and C on the system. The 64-way parallelism, coupled with a 4.2-ns clock cycle, led to a peak
performance of 16 Gflops. The system had a maximum I/O bandwidth of 13.6 Gbytes/s.
Multiple C-90's could be used in a clustered configuration in order to solve large-scale problems. As
illustrated in Fig. 8.11, four C-90 clusters were connected to a group of SSDs via 1000-Mbytes/s channels.
Each C-90 cluster was allowed to access only its own main memory. However, the clusters shared access to
the SSDs. In other words, large data sets in the SSD could be shared by the four clusters of C-90's. The clusters
could also communicate with each other through a shared semaphore unit. Only synchronization and control
information was passed via the semaphore unit. In this sense, the C-90 clusters were loosely coupled, but
collectively they could provide a maximum of 256-way parallelism. For computations which were well
partitioned and balanced among the clusters, a maximum peak performance of 64 Gflops was possible for a
four-cluster configuration.
[Fig. 8.11 Four Cray Y-MP C-90's (16 processors each) connected to a common SSD, forming a loosely coupled 64-way parallel system]
The Cray/MPP System Massively parallel processing (MPP) systems have the potential for tackling
highly parallel problems. Standard off-the-shelf microprocessors may have deficiencies when used as
building blocks of an MPP system. What is needed is a balanced system that matches fast processor speed
with fast I/O, fast memory access, and capable software. Cray Research announced its MPP development in
October 1992. The development plan sheds some light on the trend toward MPP from the standpoint of a
major supercomputer manufacturer.
Most of the early RISC microprocessors lacked the communication, memory, and synchronization
features needed for efficient MPP systems. Cray Research planned to circumvent these shortcomings by
surrounding the RISC chip with powerful communications hardware, besides exploiting Cray's expertise
in supercomputer packaging and cooling. In this way, thousands of commodity RISC processors would
be transformed into a supercomputer-class MPP system that could address terabytes of memory, minimize
communication overhead, and provide flexible, lightweight synchronization in a UNIX environment.
Cray's first MPP system was code-named T3D because a three-dimensional, dense torus network was
used to interconnect the machine resources. The heart of Cray's T3D was a scalable macroarchitecture that
combined DEC Alpha microprocessors through a low-latency interconnect network that had a bisection
bandwidth an order of magnitude greater than that of existing MPP systems. The T3D system was designed
to work jointly with the Cray Y-MP C-90 or the large-memory M-90 in a closely coupled fashion. Specific
features of the MPP macroarchitecture are summarized below:
(1) The T3D was an MIMD machine that could be dynamically partitioned to emulate SIMD or
multicomputer MIMD operations. The 3-D torus operated at a 150-MHz clock matching that of
the Alpha chips. High-speed bidirectional switching nodes were built into the T3D network so that
interprocessor communications could be handled without interrupting the PEs attached to the nodes.
The T3D network was designed to be scalable from tens to thousands of PEs.
(2) The system used a globally addressable, physically distributed memory. Because the memory was
logically shared, any PE could access the memory of any other processing element without explicit
message passing and without involving the remote PE. As a result, the system could be scaled to
address terabytes of memory. Latency hiding (to be studied in Chapter 9) was supported by data
prefetching, fast synchronization, and parallel I/O. These were supported by dedicated hardware. For
example, special remote-access hardware was provided to hide the long latency in memory accesses.
Fast synchronization support included special primitives for data-parallel and message-passing
programming paradigms.
(3) The Cray/MPP used a Mach-based microkernel operating system. Each PE had a microkernel that
managed communications with other PEs and with the closely coupled Y-MP vector processors.
Software portability was a major design goal in the Cray/MPP Series. Software-configurable redundant
hardware was included so that processing could continue in the event of a PE failure.
(4) The Cray CFT77 compiler was modified with extended directives for MPP applications. Program
debugging and performance tools were developed.
[Fig. 8.13 The Fujitsu VP2000 Series supercomputer architecture, showing channel processors, a scalar unit with buffer storage, and the vector units (Courtesy of Fujitsu, 1991)]
Example 8.5 Reconfigurable vector register file in the Fujitsu VP2000
Vector registers in Cray and Fujitsu machines are illustrated in Fig. 8.14. Cray machines used 8 vector
registers, each with a fixed length of 64 component registers. Each component register was 64 bits wide,
as shown in Fig. 8.14a.
[Fig. 8.14 Vector register files: (a) eight vector registers (8 × 64 × 64 bits) of fixed length on Cray machines; (b) reconfigurable vector registers on the Fujitsu VP2000]
A component counter was built within each Cray vector register to keep track of the number of vector
elements fetched or processed. A segment of a 64-element subvector was held as a package in each vector
register. Long vectors had to be divided into 64-element segments before they could be processed in a
pipelined fashion.
In an early model of the Fujitsu VP2000, the vector registers were reconfigurable to have variable lengths.
The purpose was to dynamically match the register length with the vector length being processed.
As illustrated in Fig. 8.14b, a total of 64 Kbytes in the register file could be configured into 8, 16, 32, 64,
128, or 256 vector registers with 1024, 512, 256, 128, 64, or 32 component registers, respectively. All
component registers were 64 bits in length.
In the following Fortran Do loop, the three-dimensional vectors are indexed by I with constant
values of J and K in the second and third dimensions.
      DO 10 I = 0, 31
        ZZ0(I) = U(I,J,K) - U(I,J-1,K)
        ZZ1(I) = V(I,J,K) - V(I,J-1,K)
Software support for parallel and vector processing in such supercomputers will be treated in Part IV.
This includes multitasking, macrotasking, microtasking, autotasking, and interactive compiler optimization
techniques for vectorization or parallelization.
The VPP500 This was a later supercomputer series from Fujitsu, called the vector parallel processor. The
architecture of the VPP500 was scalable from 7 to 222 PEs, offering a highly parallel MIMD multivector
system. The peak performance was targeted for 355 Gflops. Figure 8.15 shows the architecture of the VPP500
used as a back-end machine attached to a VP2000 or a VPX200 host.
[Fig. 8.15 The architecture of the Fujitsu VPP500, showing control processors, data transfer units, main storage units, and the scalar unit and vector-unit load/store pipelines within each processing element]
DEC VAX 9000 Even though the VAX 9000 did not provide Gflops performance, the design represented a
typical mainframe approach to high-performance computing. The architecture is shown in Fig. 8.16a.
Multichip packaging technology was used to build the VAX 9000. It offered 40 times the VAX 11/780
performance per processor. With a four-processor configuration, this implied 157 times the 11/780
performance. When used for transaction processing, 70 TPS was reported on a uniprocessor. The peak vector
processing rate ranged from 125 to 500 Mflops.
The system control unit utilized a crossbar switch providing four simultaneous 500-Mbytes/s data
transfers. Besides incorporating interconnect logic, the crossbar was designed to monitor the contents of
cache memories, tracking the most up-to-date cache content to maintain coherence.
Up to 512 Mbytes of main memory were available using 1-Mbit DRAMs on 64-Mbyte arrays. Up to
2 Gbytes of extended memory were available using 4-Mbit DRAMs. Various I/O channels provided an
aggregate data transfer rate of 320 Mbytes/s. The crossbar had eight ports to four processors, two memory
modules, and two I/O controllers. Each port had a maximum transfer rate of 1 Gbyte/s, much higher than in
bus-connected systems.
Each vector processor (VBOX) was equipped with an add and a multiply pipeline using vector registers
and a mask/address generator, as shown in Fig. 8.16b. Vector instructions were fetched through the memory
unit (MBOX), decoded in the IBOX, and issued to the VBOX by the EBOX. Scalar operations were directly
executed in the EBOX.
[Fig. 8.16 The DEC VAX 9000 architecture: (a) up to four CPUs with caches, two memory modules of up to 1 Gbyte each, and two I/O controls (up to 12 I/O interfaces per XMI) connected by a crossbar switch with 500-Mbytes/s read/write paths and a service processor; (b) the vector processor (VBOX) with vector control, a vector register unit, and a mask/address generator]
The vector register file consisted of 16 × 64 × 64 bits, divided into sixteen 64-element vector registers. No
instruction took more than five cycles. The vector processor generated two 64-bit results per cycle, and the
vector pipelines could be chained for dot-product operations.
The VAX 9000 could run either the VMS or the ULTRIX operating system. The service processor in
Fig. 8.16a used four MicroVAX processors devoted to system, disk/tape, and user interface control and to
monitoring 20,000 scan points throughout the system for reliable operation and fault diagnosis.
Minisupercomputers These were a class of low-cost supercomputer systems with a performance of about
5 to 15% and a cost of 3 to 10% of that of a full-scale supercomputer. Representative systems of the early
1990s include the Convex C series, Alliant FX series, Encore Multimax series, and Sequent Symmetry series.
Some of these minisupercomputers have been introduced in Chapters 1 and 7. Most of them had an open
architecture using standard off-the-shelf processors and UNIX systems.
Both scalar and vector processing was supported in these multiprocessor systems with shared memory and
peripherals. Most of these systems were built with a graphics subsystem for visualization and performance-
tuning purposes.
Supercomputing Workstations In the early 1990s, high-performance workstations were being produced
by Sun Microsystems, IBM, DEC, HP, Silicon Graphics, and Stardent using the state-of-the-art superscalar
RISC processors introduced in Chapters 4 and 6. Most of these workstations had a uniprocessor configuration
with built-in graphics support but no vector hardware.
Silicon Graphics produced the 4-D Series using four R3000 CPUs in a single workstation without vector
hardware. Stardent Computer Systems produced a departmental supercomputer, called the Stardent 3000,
with custom-designed vector hardware.
The Stardent 3000 The Stardent 3000 was a multiprocessor workstation that evolved from the TITAN
architecture developed by Ardent Computer Corporation. The architecture and graphics subsystem of the
Stardent 3000 are depicted in Fig. 8.17. Two buses were used for communication between the four CPUs,
memory, I/O, and graphics subsystems (Fig. 8.17a).
The system featured R3000/R3010 processor/floating-point units. The vector processors were custom-
designed. A 32-MHz clock was used. There were 128 Kbytes of cache; one half was used for instructions and
the other half for data.
The buses carried 32-bit addresses and 64-bit data and operated at 16 MHz. They were rated at
128 Mbytes/s each. The R-bus was dedicated to data transfers from memory to the vector processor, and the
S-bus handled all other transfers. The system could support a maximum of 512 Mbytes of memory.
A full graphics subsystem is shown in Fig. 8.17b. It consisted of two boards that were tightly coupled to
both the CPUs and memory. These boards incorporated rasterizers (pixel and polygon processors), frame
buffers, Z-buffers, and additional overlay and control planes.
[Fig. 8.17 The Stardent 3000 visualization departmental supercomputer (Courtesy of Stardent Computer, 1990): (a) scalar and vector processors and main memory (8 MB to 512 MB) on the S-bus (128 MB/s) and R-bus (128 MB/s), with a 32-bit DMA system bus interface; (b) the graphics subsystem architecture with display memory, overlay planes, image planes, and a polygon/pixel processor on a graphics expansion board]
The Stardent system was designed for numerically intensive computing with two- and three-dimensional
rendering graphics. One or two I/O processors were connected to SCSI or VME buses and other peripherals
or Ethernet connections. The peak performance was estimated at 32 to 128 MIPS, 16 to 64 scalar Mflops, and
32 to 128 vector Mflops. A scoreboard, crossbar switch, and arithmetic pipelines were implemented in each
vector processor.
Gordon Bell, chief architect of the VAX Series and of the TITAN/Stardent architecture, identified 11 rules
of minisupercomputer design in 1989. These rules call for performance-directed design, balanced scalar/
vector operations, avoiding holes in the performance space, achieving peaks in performance even on a single
program, providing a decade of addressing space, making the computer easy to use, building on others' work,
always looking ahead to the next generation, and expecting the unexpected with slack resources.
The LINPACK Results LINPACK is a general-purpose Fortran library of mathematical software for solving
dense linear systems of equations of order 100 or higher. LINPACK is very sensitive to vector operations and
the degree of vectorization by the compiler. It has been used to predict computer performance in scientific
and engineering areas.
Many published Mflops and Gflops results are based on running the LINPACK code with prespecified
compilers. LINPACK programs can be characterized as having a high percentage of floating-point arithmetic
operations.
In solving a linear system of n equations, the total number of arithmetic operations involved is estimated
as 2n^3/3 + 2n^2, where n = 100 in the LINPACK experiments.
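The operation count converts to a Mflops rating in the obvious way; the run time used below is a made-up figure included only to show the arithmetic.

    def linpack_ops(n):
        # Estimated arithmetic operations for an n x n dense system.
        return 2 * n**3 / 3 + 2 * n**2

    def mflops(n, seconds):
        return linpack_ops(n) / seconds / 1.0e6

    print(linpack_ops(100))      # roughly 6.9e5 operations for n = 100
    print(mflops(1000, 2.5))     # a hypothetical 2.5-s solve of n = 1000 gives about 268 Mflops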
Over many years, Dongarra compared the performance of various computer systems in solving dense
systems of linear equations. His performance experiments involved about 100 computers.
The timing information presented in this report reflects the floating-point, parallel, and vector processing
capabilities of the machines tested. Since the original reports are quite long, only brief excerpts are quoted
in Table 8.5.
The second column reports LINPACK performance results based on a matrix of order n = 100 in a Fortran
environment. The third column shows the results of solving a system of equations of order n = 1000 with no
restriction on the method or its implementation. The last column lists the theoretical peak performance of the
machines.
The LINPACK results reported in the second column of Table 8.5 were for a small problem size of
100 unknowns. No changes were made in the LINPACK software to exploit vector capabilities on multiple
processors in the machines being evaluated. The compilers of some machines might generate optimized code
that itself accessed special hardware features.
The third column corresponds to a much larger problem size of 1000 unknowns. All possible optimization
means, including user optimizations of the software, were allowed to achieve as high an execution rate as
possible, called the best-effort Mflops.
The theoretical peak can easily be calculated by counting the maximum number of floating-point additions
and multiplications that can be completed during a period of time, usually the cycle time of the machine.
Table 8.5 Performance in Solving a System of Linear Equations
Source: Jack Dongarra, "Performance of Various Computers Using Standard Linear Equations Software," Computer
Science Dept., Univ. of Tennessee, Knoxville, TN 37996-1301, March 1992.
COMPOUND VECTOR PROCESSING
In this section, we study compound vector operations. Multipipeline chaining and networking
techniques are described and design examples given. A graph transformation approach
is presented for setting up pipeline networks to implement compound vector functions, which are either
specified by the programmer or detected by an intelligent compiler.
A compound vector function (CVF) is a composite function of vector operations converted from a looping
structure of linked scalar operations, where the index I implies that all vector operations involve N elements.
Compound Vector Functions Table 8.6 lists a number of example CVFs involving one-dimensional
vectors indexed by I. The same concept can be generalized to multidimensional vectors with multiple indices.
For simplicity, we discuss only CVFs defined over one-dimensional vectors. Typical operations appearing
in these CVFs include load, store, multiply, divide, logical, and shifting vector operations. We use a slash to
represent the divide operation. All vector operations are defined on a component-wise basis unless otherwise
specified.
The purpose of studying CVFs is to explore opportunities for concurrent processing of linked vector
operations. The numbers of available vector registers and functional pipelines impose some limitations on
how many CVFs can be executed simultaneously.
Example 8.8 Pipeline chaining on Cray supercomputers and on the Cray X-MP (Courtesy of Cray Research, Inc., 1985)
The Cray 1 had one memory-access pipe for either load or store, but not for both at the same time. The Cray
X-MP had three memory-access pipes, two for vector load and one for vector store. These three access pipes
could be used simultaneously.
To implement the SAXPY code on the Cray 1, the five vector operations are divided into three chains. The
first chain has only one vector operation, load Y. The second chain links the load X to the scalar-vector multiply
(S × X) operation and then to the vector add operation. The last chain is for store Y, as illustrated in Fig. 8.18a.
The same set of vector operations was implemented on the Cray X-MP in a single chain, as shown in
Fig. 8.18b, because three memory-access pipes could be used simultaneously. The chain links the five vector
operations in a single connected cascade.
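For reference, SAXPY computes Y(i) = S × X(i) + Y(i) element by element; the sketch below simply names the five vector operations and the chain groupings described above (a functional description, not Cray microcode).

    def saxpy(s, X, Y):
        # Five vector operations: load Y, load X, multiply S*X, add, store Y.
        # Cray 1 chains:  {load Y}, {load X -> S*X -> add}, {store Y}.
        # Cray X-MP:      all five operations linked in a single chain.
        return [s * x + y for x, y in zip(X, Y)]

    print(saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))   # [12.0, 24.0, 36.0]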
[Fig. 8.18 Multipipeline chaining on the Cray 1 and Cray X-MP for executing the SAXPY code Y(1:N) = S × X(1:N) + Y(1:N) (Courtesy of Cray Research, 1985): (a) limited chaining using only one memory-access pipe in the Cray 1; (b) complete chaining using three memory-access pipes in the Cray X-MP, with vector registers linking the access, multiply, and add pipes]
To compare the time required for chaining these pipelines, Fig. 8.19a shows that roughly 5n cycles are
needed to perform the five vector operations sequentially without any overlapping or chaining. The Cray 1
requires about 3n cycles to execute, corresponding to about n cycles for each vector chain. The Cray X-MP
requires about n cycles to execute.
[Fig. 8.19 Timing for executing the SAXPY code Y(1:N) = S × X(1:N) + Y(1:N) under different memory-access capabilities (Courtesy of Cray Research, 1985): (a) sequential execution without chaining; (c) chaining with two load pipes, two arithmetic pipes, and one store pipe (n = time to produce n elements)]
In Fig. 8.19, the pipeline flow-through latencies (startup delays) are denoted as s, m, and a for the memory-
access pipe, the multiply pipe, and the add pipe, respectively. These latencies equal the lengths of the individual
pipelines. The exact cycle counts can be slightly greater than 5n, 3n, and n due to these extra
delays.
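Treating the startup delays as one small lump (a simplification of the figure, since the exact composition differs case by case), the three organizations can be compared with a rough cycle model; the numbers below are placeholders.

    def saxpy_cycle_estimates(n, startup):
        # Approximate counts from Fig. 8.19: 5n, 3n, and n cycles plus a small
        # startup term collecting the pipe flow-through latencies s, m, and a.
        return {"no chaining": 5 * n + startup,
                "Cray 1 (3 chains)": 3 * n + startup,
                "Cray X-MP (1 chain)": 1 * n + startup}

    print(saxpy_cycle_estimates(n=64, startup=20))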
The above example clearly demonstrates the advantages of vector chaining. A meaningful chain must
link two or more pipelines. As far as execution time is concerned, the longer the chain, the better the
performance. The degree of chaining is indicated by the number of distinct pipeline units that can be linked
together.
Vector chaining effectively increases the overall pipeline length by adding the pipeline stages of all
functional units in the chain to form a single long pipeline. The potential speedup of this long pipeline is
certainly greater according to Eq. 5.5.
Chaining Limitations The number of vector operations in a CVF must be small enough to make chaining
possible. Vector chaining is limited by the small number of functional pipelines available in a vector processor.
Furthermore, the limited number of vector registers imposes an additional limit on chaining.
For example, the Cray Y-MP had only eight vector registers. Suppose all memory pipes are used in a
vector chain. This requires that three vector registers (two for vector load and one for vector store) be
reserved at the beginning and end of the chaining operations. The remaining five vector registers are used for
arithmetic, logic, and shift operations.
The number of interface registers required between two adjacent pipeline units is at least one, and sometimes
two for two source vectors. Thus, the number of non-memory-access vector operations implementable with
the remaining five vector registers cannot be greater than five. In practice, this number is between two and
three.
The actual degree of chaining depends on how many of the vector operations involved are binary or unary
and how many use scalar or vector registers. If they are all binary operations, each requiring two source
vector registers, then only two or three vector operations can be sandwiched between the memory-access
operations. Thus a single chain on the Cray Y-MP could link at most five or six vector operations including
the memory-access operations.
Vector Recurrence These are a special class of vector loops in which the outputs of a functional pipeline
may feed back into one of its own source vector registers. In other words, a vector register is used for holding
the source operands and the result elements simultaneously.
This has been done on Cray machines using a component counter associated with each vector register. In
each pipeline cycle, the vector register is used like a shift register at the component level. When a component
operand is "shifted" out of the vector register and enters the functional pipeline, a result component can enter
the vacated component register during the same cycle. The component counter must keep track of the shifting
operations until all 64 components of the result are loaded into the vector register.
Recursive vector summation is often needed in scientific and statistical computations. For example, the
dot product of two vectors, A · B = Σ(a_i × b_i), can be implemented using recursion. Another example is
polynomial evaluation over vector operands.
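One common way to organize such a recursive summation is pairwise (halving) reduction; the sketch below applies it to a dot product and is an algorithmic illustration only, not the Cray register-level mechanism.

    def recursive_sum(v):
        # Pairwise reduction: repeatedly add element i to element i + half
        # until a single partial sum remains.
        v = list(v)
        while len(v) > 1:
            half, odd = divmod(len(v), 2)
            v = [v[i] + v[half + i] for i in range(half)] + ([v[-1]] if odd else [])
        return v[0]

    def dot(a, b):
        # A . B = sum of a_i * b_i, reduced with the recursion above.
        return recursive_sum([x * y for x, y in zip(a, b)])

    print(dot([1, 2, 3, 4], [5, 6, 7, 8]))   # 70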
Summary Our discussion of vector processing and pipeline chaining is based on a load-store architecture
using vector registers in all vector instructions. The number of functional units has increased steadily in
supercomputers; both the Cray C-90 and the NEC SX-X offered 16-way parallelism within each processor.
The degree of chaining can certainly increase if the vector register file becomes larger and scoreboarding
techniques are applied to ensure functional unit independence and to resolve data dependence or resource
dependence problems. The use of multiport memory is crucial to enabling large vector chains.
Vector looping, chaining, and recursion represent the state of the art in extending pipelining for vector
processing. Furthermore, one can use masking, scatter, and gather instructions to manipulate sparse vectors
or sparse matrices containing a large number of dummy zero entries. A vector processor cannot be considered
versatile unless it is designed to handle both dense and sparse vectors effectively.
The set of functional pipelines should be able to handle important vector arithmetic, logic, shifting, and
masking operations. Each FPi is pipelined with k_i stages. The output terminals of each BCN are buffered with
programmable delays. BCN1 is used to establish the dynamic connections between the register file and the
FPs. BCN2 sets up the dynamic connections among the FPs.
For simplicity, we call a pipeline network a pipenet. Conventional pipelines or pipeline chains are special
cases of pipenets. Note that a pipenet is programmable with dynamic connectivity. This represents the
fundamental difference between a static systolic array and a dynamic pipenet. In a way, one can visualize
pipenets as programmable systolic arrays. The programmability sets up the dynamic connections, as well as
the number of delays along some connection paths.
Setup of the Pipenet Figures 8.20a through 8.20d show how to convert a program graph into a pipenet.
Whenever a CVF is to be evaluated, the crossbar networks are programmed to set up a connectivity pattern
among the FPs that matches the data flow pattern in the CVF.
The program graph represents the data flow pattern in a given CVF. Nodes on the graph correspond to
vector operators, and edges show the data dependences, with delays properly labeled, among the operators.
The program graph in Fig. 8.20a corresponds to the following CVF:
E(I) = [A(I) × B(I) + B(I) × C(I)] / [B(I) × C(I) × (C(I) + D(I))]    (8.17)
for I = 1, 2, ..., n. This CVF has four input vectors A(I), B(I), C(I), and D(I) and one output vector E(I), which
demand five memory-access operations. In addition, there are seven vector arithmetic operations involved.
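Taking Eq. 8.17 as reconstructed above (the exact grouping in the original figure may differ slightly), the sketch below evaluates the CVF element by element and reuses the B(I) × C(I) product once, which is why six functional pipelines suffice for the seven vector operations.

    def cvf_8_17(A, B, C, D):
        # Element-wise evaluation of Eq. 8.17 with the product b*c computed
        # once and used in both the numerator and the denominator.
        E = []
        for a, b, c, d in zip(A, B, C, D):
            bc = b * c
            E.append((a * b + bc) / (bc * (c + d)))
        return E

    print(cvf_8_17([1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]))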
[Fig. 8.20 The concept of a pipenet and its implementation model (reprinted from Hwang and Xu, IEEE Transactions on Computers, Jan. 1988): (a) the program graph for the CVF of Eq. 8.17 with MPY, ADD, and DIV nodes; (b) the pipenet implementation; (c) the crossbar implementation with numbered feedback connections; (d) the implementation model with a register file, FPs, and two buffered crossbar networks (BCN1, BCN2) with programmable delays]
In other words, the above CVF demands a chaining degree of 11 if one considers implementing it with a
chain of memory-access and arithmetic pipelines. This high degree of chaining is very difficult to implement
with a limited number of FPs and vector registers. However, the CVF can be easily implemented with a
pipenet, as shown in Fig. 8.20b.
Six FPs are employed to implement the seven vector operations because the product vector B(I) × C(I),
once generated, can be used in both the denominator and the numerator. We assume two, four, and six
pipeline stages in the ADD, MPY, and DIV units, respectively. Two noncompute delays are inserted,
each with two clock delays, along two of the connecting paths. The purpose is to equalize all the path delays
from the input end to the output end.
The connections among the FPs and the two inserted delays are shown in Fig. 8.20c for a crossbar-
connected vector processor. The feedback connections are identified by numbers. The delays are set up in the
appropriate buffers at the output terminals identified as 4 and 5. Usually, these buffers allow a range of delays
to be set up at the time the resources are scheduled.
The program graph can be specified either by the programmer or by a compiler. Various connection patterns
in the crossbar networks can be prestored for implementing each CVF type. Once the CVF is decoded, the
corresponding connection pattern is enabled for setup dynamically.
Program Graph Transformations The program graph in Fig. 8.20a is acyclic or loop-free, without feedback
connections. An almost trivial mapping is used to establish the pipenet (Fig. 8.20b). In general, the mapping
cannot be obtained directly without some graph transformations. We describe these transformations below
with a concrete example CVF, corresponding to the cyclic graph shown in Fig. 8.21a.
On a directed program graph, nodal delays correspond to the appropriate FPs, and edge delays are the
signal flow delays along the connecting paths between FPs. For simplicity, each delay is counted as one
pipeline clock cycle.
A cycle in a graph is a sequence of nodes and edges which starts and ends with the same node. We will
consider a k-graph, a synchronous program graph in which all nodes have a delay of k cycles. A 0-graph is
called a systolic program graph.
The following two lemmas provide basic tools for converting a given program graph into an equivalent graph. The equivalence is defined up to graph isomorphism and with the same input/output behaviors.
Lemma 1: Adding k delays to any node in a systolic program graph and then subtracting k delays from all incoming edges to that node will produce an equivalent program graph.
Lemma 2: An equivalent program graph is generated if all nodal and edge delays are multiplied by the same positive integer, called the scaling constant.
To implement a CVF by setting up a pipenet in a vector processor, one needs first to represent the CVF as a systolic graph with zero nodal delays and positive edge delays. Only a systolic graph can be converted to a pipenet, as exemplified below.
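Because the two lemmas simply move or scale delays, they are easy to experiment with on a small graph data structure. The following minimal sketch is our own illustration (not taken from the text); it assumes a hypothetical dictionary representation in which nodal and edge delays are kept in separate maps, and the node and edge names are made up for the example.

# Sketch of the two graph-transformation lemmas on a program graph.
#   node_delay[v]      : pipeline delay (in cycles) of operator node v
#   edge_delay[(u, v)] : delay on the directed edge u -> v

def apply_lemma1(node_delay, edge_delay, v, k):
    """Lemma 1: add k delays to node v and subtract k delays from every
    incoming edge of v; the input/output behavior is preserved."""
    incoming = [e for e in edge_delay if e[1] == v]
    assert all(edge_delay[e] >= k for e in incoming), "not enough edge delay to move"
    node_delay[v] += k
    for e in incoming:
        edge_delay[e] -= k

def apply_lemma2(node_delay, edge_delay, s):
    """Lemma 2: multiply all nodal and edge delays by the scaling constant s."""
    for v in node_delay:
        node_delay[v] *= s
    for e in edge_delay:
        edge_delay[e] *= s

# Example: a systolic (0-graph) fragment MPY1 -> ADD with four cycles of edge delay.
nodes = {"MPY1": 0, "ADD": 0}
edges = {("MPY1", "ADD"): 4}
apply_lemma1(nodes, edges, "ADD", 4)   # ADD becomes a 4-stage node; the edge delay drops to 0
print(nodes, edges)                    # {'MPY1': 0, 'ADD': 4} {('MPY1', 'ADD'): 0}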
Example 8.9  Program graph transformation to set up a pipenet (Hwang and Xu, 1988)
Consider the systolic program graph in Fig. 8.21a. This graph represents the following set of CVFs:
E(I) = [B(I) × C(I)] + [C(I) × D(I)]
F(I) = [C(I) × D(I)] × [C(I − 2) × D(I − 2)]      (8.18)
G(I) = [F(I)/F(I − 1)] × G(I − 4)
Two multiply operators (MPY1 and MPY2) and one add operator (ADD) are applied to evaluate the vector E(I) from the input end (Vin) to the output end (Vout) in Fig. 8.21a. The same operator MPY2 is applied twice, with different delays (four and six cycles), before its results are multiplied by MPY3 to generate the output vector F(I). Finally, the divide (DIV) and multiply (MPY4) operators are applied to generate the output vector G(I).
Applying Lemma 1, we add four-cycle delays to each operator node and subtract four-cycle delays from all incoming edges. The transformed graph is obtained in Fig. 8.21b. This is a 4-graph with all nodal delays equal to four cycles. Therefore, one can construct a pipenet with all FPs having four pipeline stages, as shown in Fig. 8.21c. The two graphs shown in Figs. 8.21b and 8.21c are indeed isomorphic.
[Figure 8.21: (a) the systolic program graph for the CVFs of Eq. 8.18; (b) the equivalent 4-graph obtained by applying Lemma 1; (c) the pipenet implementation with inserted delays between the pipelines]
Fig. 8.21  From synchronous program graph to pipenet implementation (Reprinted from Hwang and Xu, IEEE Transactions on Computers, Jan. 1988)
The inserted delays correspond to the edge delays on the transformed graph. These delays can be implemented with programmable delays in the buffered crossbar networks shown in Fig. 8.20b. Note that the only self-reflecting cycle, at node MPY4, represents the recursion defined in the equation for vector G(I). No scaling is applied in this graph transformation.
The systolic program graph in Fig. 8.21a can be obtained by intuitive reasoning and delay analysis as shown above. Systematic procedures for converting any set of CVFs into systolic program graphs were reported in the original paper by Hwang and Xu (1988).
If the systolic graph so obtained does not have enough edge delays to be transferred into the operator nodes, we have to multiply the edge delays by a scaling constant s, applying Lemma 2. The pipenet clock rate must then be reduced by s times. This means that successive vector elements entering the pipenet must be separated by s cycles to avoid collisions in the respective pipelines.
Performance Evaluation   The above graph transformation technique has been applied in developing various pipenets for implementing CVFs embedded in Livermore loops. Speedup improvements of between 2 and 12 were obtained, as compared with implementing them on vector hardware without chaining or networking.
In order to build the multipipeline networking capabilities described above into future vector processors, Fortran and other vector languages must be extended to represent CVFs under various conditions. Automatic compiler techniques need to be developed to convert from vector expressions to systolic graphs and then to pipeline nets. Therefore, new hardware and software mechanisms are needed to support compound vector processing. This hardware approach can be one or two orders of magnitude faster than a software implementation.
[Figure 8.22: two SIMD computer organizations. (a) The distributed-memory model: an array control unit (with a scalar processor, control memory holding program and data, and a host computer) broadcasts vector instructions over a broadcast bus to processing elements (PEs) with local memories (LMs), which are interconnected by a data-routing network backed by mass storage. (b) The shared-memory model: the PEs access shared memory modules through an alignment network.]
An instruction is sent to the control unit for decoding. If it is a scalar or program control operation, it will be directly executed by a scalar processor attached to the control unit. If the decoded instruction is a vector operation, it will be broadcast to all the PEs for parallel execution.
Partitioned data sets are distributed to all the local memories attached to the PEs through a vector data bus. The PEs are interconnected by a data-routing network which performs inter-PE data communications such as shifting, permutation, and other routing operations. The data-routing network is under program control through the control unit. The PEs are synchronized in hardware by the control unit.
In other words, the same instruction is executed by all the PEs in the same cycle. However, masking logic is provided to enable or disable any PE from participating in a given instruction cycle. The Illiac IV was such an early SIMD machine, consisting of 64 PEs with local memories interconnected by an 8 × 8 mesh with wraparound connections (Fig. 2.18b).
Almost all SIMD machines built have been based on the distributed-memory model. Various SIMD machines differ mainly in the data-routing network chosen for inter-PE communications. The four-neighbor mesh architecture has been the most popular choice in the past. Besides the Illiac IV, the Goodyear MPP and the AMT DAP610 were also implemented with the two-dimensional mesh. Variations from the mesh are the hypercube embedded in a mesh implemented in the CM-2, and the X-Net plus a multistage crossbar router implemented in the MasPar MP-1.
Shared-Memory Model   In Fig. 8.22b, we show a variation of the SIMD computer using shared memory among the PEs. An alignment network is used as the inter-PE memory communication network. Again, this network is controlled by the control unit.
The Burroughs Scientific Processor (BSP) had adopted this architecture, with n = 16 PEs updating m = 17 shared-memory modules through a 16 × 17 alignment network. It should be noted that the value m is often chosen to be relatively prime with respect to n, so that parallel memory access can be achieved through skewing without conflicts.
The alignment network must be properly set to avoid access conflicts. Most SIMD computers were built with distributed memories. Some SIMD computers used bit-slice PEs, such as the DAP610 and CM-200. Both bit-slice and word-parallel SIMD computers are studied below.
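The effect of choosing m relatively prime to n can be checked with a few lines of code. The sketch below is only an illustration; it assumes simple low-order interleaving (module number = address mod m), which is our own simplification rather than the BSP's actual skewing scheme.

# Why m = 17 memory modules can serve n = 16 PEs without conflict: for any stride
# that is relatively prime to m, the 16 simultaneous addresses fall into 16 distinct modules.
from math import gcd

def modules_hit(base, stride, n, m):
    """Return the set of modules accessed when n PEs fetch
    elements base, base+stride, ..., base+(n-1)*stride."""
    return {(base + i * stride) % m for i in range(n)}

n, m = 16, 17
for stride in (1, 2, 4, 8, 16):              # typical row/column/diagonal strides
    assert gcd(stride, m) == 1
    hits = modules_hit(0, stride, n, m)
    print(f"stride {stride:2d}: {len(hits)} distinct modules")   # always 16, i.e. no conflicts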
SIMD Instructions   SIMD computers execute vector instructions for arithmetic, logic, data-routing, and masking operations over vector quantities. In bit-slice SIMD machines, the vectors are nothing but binary vectors. In word-parallel SIMD machines, the vector components are 4- or 8-byte numerical values.
All SIMD instructions must use vector operands of equal length n, where n is the number of PEs. SIMD instructions are similar to those used in pipelined vector processors, except that temporal parallelism in pipelines is replaced by spatial parallelism in multiple PEs.
The data-routing instructions include permutations, broadcasts, multicasts, and various rotate and shift operations. Masking operations are used to enable or disable a subset of PEs in any instruction cycle.
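As a rough illustration of masked execution (not tied to any particular machine's instruction set), the sketch below applies one vector instruction to all PEs but lets only the enabled PEs commit their results; the function name and data are hypothetical.

def simd_masked_add(a, b, mask):
    """Elementwise a[i] + b[i], performed only where mask[i] == 1;
    disabled PEs keep their old value of a[i]."""
    return [ai + bi if m else ai for ai, bi, m in zip(a, b, mask)]

a    = [1, 2, 3, 4]
b    = [10, 10, 10, 10]
mask = [1, 0, 1, 0]                    # enable PE0 and PE2 only
print(simd_masked_add(a, b, mask))     # [11, 2, 13, 4]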
Host and I/O   All I/O activities are handled by the host computer in the above SIMD organizations. A special control memory is used between the host and the array control unit. This is a staging memory for holding programs and data.
Divided data sets are distributed to the local memories (Fig. 8.22a) or to the shared-memory modules (Fig. 8.22b) before starting the program execution. The host manages the mass storage and graphics display of computational results. The scalar processor operates concurrently with the PE array under the coordination of the control unit.
8.4.1 The CM-2 Architecture
The Connection Machine CM-2 produced by Thinking Machines Corporation was a fine-grain MPP computer using thousands of bit-slice PEs in parallel to achieve a peak processing speed of above 10 Gflops. We describe the parallel architecture built into the CM-2. Parallel software developed with the CM-2 will be discussed in Chapter 10.
Program Execution Paradigm   All programs started execution on a front-end, which issued microinstructions to the back-end processing array when data-parallel operations were desired. The sequencer broke down these microinstructions and broadcast them to all data processors in the array.
Data sets and results could be exchanged between the front-end and the processing array in one of three ways: broadcasting, global combining, and a scalar memory bus, as depicted in Fig. 8.23. Broadcasting was carried out through the broadcast bus to all data processors at once.
[Figure 8.23: the CM-2 organization, with front-end computers and sequencers driving the array of data processors, which are interconnected by the router/NEWS/scanning networks and attached to I/O controllers and a framebuffer]
Global combining allowed the front-end to obtain the sum, largest value, logical OR, etc., of values, one from each processor. The scalar bus allowed the front-end to read or to write one 32-bit value at a time from or to the memories attached to the data processors. Both VAX and Symbolics machines were used as the front-end and as hosts.
The Processing Array   The CM-2 was a back-end machine for data-parallel computation. The processing array contained from 4K to 64K bit-slice data processors (or PEs), all of which were controlled by a sequencer as shown in Fig. 8.23.
The sequencer decoded microinstructions from the front-end and broadcast nanoinstructions to the processors in the array. All processors could access their memories simultaneously. All processors executed the broadcast instructions in a lockstep manner.
The processors exchanged data among themselves in parallel through the router, NEWS grids, or a scanning mechanism. These network elements were also connected to I/O interfaces. A mass storage subsystem, called the data vault, was connected through the I/O for storing up to 60 Gbytes of data.
Processing Nodes   Figure 8.24 shows the CM-2 processor chips with memory and floating-point chips. Each data processing node contained 32 bit-slice data processors, an optional floating-point accelerator, and interfaces for interprocessor communication. Each data processor was implemented with a 3-input and 2-output bit-slice ALU and associated latches and a memory interface. This ALU could perform bit-serial full-adder and Boolean logic operations.
[Figure 8.24: two 16-processor chips sharing memory chips over a 22-bit data path, each chip carrying NEWS, router, and hypercube interfaces, together with a floating-point interface chip and a floating-point execution chip (single or double precision)]
Fig. 8.24  A CM-2 processing node consisting of two processor chips and some memory and floating-point chips (Courtesy of Thinking Machines Corporation, 1990)
The processor chips were paired in each node, sharing a group of memory chips. Each processor chip contained 16 processors. The parallel instruction set, called Paris, included nanoinstructions for memory load and store, arithmetic and logic, control of the router, NEWS grid, and hypercube interface, floating-point, I/O, and diagnostic operations.
The memory data path was 22 bits (16 data and 6 ECC) per processor chip. The 18-bit memory address allowed 2^18 = 256K memory words (512 Kbytes of data) shared by 32 processors. The floating-point chip handled 32-bit operations at a time. Intermediate computational results could be stored back into the memory for subsequent use. Note that integer arithmetic was carried out directly by the processors in a bit-serial fashion.
Hypercube Routers   Special hardware was built on each processor chip for data routing among the processors. The router nodes on all processor chips were wired together to form a Boolean n-cube. A full configuration of the CM-2 had 4096 router nodes on processor chips interconnected as a 12-dimensional hypercube.
Each router node was connected to 12 other router nodes, including its paired node (Fig. 8.24). All 16 processors belonging to the same node were equally capable of sending a message from one vertex to any other processor at another vertex of the 12-cube. The following example clarifies this message-passing concept.
Example 8.10  Message routing on the CM-2 hypercube (Thinking Machines Corporation, 1990)
On each vertex of the 12-cube, the processors are numbered 0 through 15. The hypercube routers are numbered 0 through 4095 at the 4096 vertices. Processor 5 on router node 7 is thus identified as the 117th processor in the entire system because 16 × 7 + 5 = 117.
Suppose processor 117 wants to send a message to processor 361, which is located at processor 9 on router node 22 (16 × 22 + 9 = 361). Since router node 7 = (000000000111)₂ and router node 22 = (000000010110)₂, they differ at dimension 0 and dimension 4.
This message must traverse dimensions 0 and 4 to reach its destination. From router node 7, the message is first directed to router node 6 = (000000000110)₂ through dimension 0 and then to router node 22 through dimension 4, if there is no contention for hypercube wires. On the other hand, if router 7 has another message using the dimension 0 wire, the message can be routed first through dimension 4 to router 23 = (000000010111)₂ and then to the final destination through dimension 0 to avoid channel conflicts.
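A small sketch of this routing calculation is given below. It assumes a fixed low-to-high dimension order (plain e-cube routing); as the example notes, the CM-2 router could also reorder the dimensions to avoid contention, which the sketch does not model.

def processor_id(router_node, pe):
    return 16 * router_node + pe           # 16 bit-slice processors per router node

def hypercube_route(src_node, dst_node):
    """Return the sequence of router nodes visited, correcting one
    differing address bit (dimension) at a time, lowest dimension first."""
    path, node = [src_node], src_node
    diff, dim = src_node ^ dst_node, 0
    while diff:
        if diff & 1:
            node ^= (1 << dim)             # cross the hypercube wire in this dimension
            path.append(node)
        diff >>= 1
        dim += 1
    return path

print(processor_id(7, 5), processor_id(22, 9))   # 117 361
print(hypercube_route(7, 22))                    # [7, 6, 22]: dimension 0, then dimension 4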
The NEWS Grid   Within each processor chip, the 16 physical processors could be arranged as an 8 × 2, 1 × 16, 4 × 4, 4 × 2 × 2, or 2 × 2 × 2 × 2 grid, and so on. Sixty-four virtual processors could be assigned to each physical processor. These 64 virtual processors could be imagined to form an 8 × 8 grid within the chip.
The "NEWS" grid was based on the fact that each processor has a north, east, west, and south neighbor in the various grid configurations. Furthermore, a subset of the hypercube wires could be chosen to connect the 4096 nodes (chips) as a two-dimensional grid of any shape, 64 × 64 being one of the possible grid configurations.
By coupling the internal grid configuration within each node with the global grid configuration, one could
arrange the processors in NEWS grids of any shape involving any number of dimensions. These flexible
interconnections among the processors made it very efficient to route data on dedicated grid configurations
based on the application requirements.
Scanning and Spread Mechanisms   Besides dynamic reconfiguration in NEWS grids through the hypercube routers, the CM-2 had been built with special hardware support for scanning or spreading across NEWS grids. These were very powerful parallel operations for fast data combining or spreading throughout the entire array.
Scanning on NEWS grids combined communication and computation. The operation could simultaneously scan every row of a grid along a particular dimension for the partial sum of that row, the largest or smallest value, or bitwise OR, AND, or exclusive OR. Scanning operations could be expanded to cover all elements of an array.
Spreading could send a value to all other processors across the chips. A single-bit value could be spread from one chip to all other chips along the hypercube wires in only 75 steps. Variants of scans and spreads were built into the Paris instructions for ease of access.
I/O and Data Vault   The Connection Machine emphasized massive parallelism in computing as well as in visualization of computational results. High-speed I/O channels were available, from 2 to 16 channels, for data and/or image I/O operations. Peripheral devices attached to I/O channels included a data vault, a CM-HIPPI system, a CM-IOP system, and a VMEbus interface controller, as illustrated in Fig. 8.23. The data vault was a disk-based mass storage system for storing program files and large databases.
Major Applications   The CM-2 was applied in almost all the MPP and grand challenge applications introduced in Chapter 3. Specifically, the Connection Machine series was applied in document retrieval using relevance feedback, in memory-based reasoning as in the medical diagnostic system called QUACK for simulating the diagnosis of a disease, and in bulk processing of natural languages.
Other applications of the CM-2 included SPICE-like VLSI circuit analysis and layout, computational fluid dynamics, signal/image/vision processing and integration, neural network simulation and connectionist modeling, dynamic programming, context-free parsing, ray tracing graphics, and computational geometry problems. As the CM-2 was upgraded to the CM-5, the applications domain was expected to expand accordingly.
The MasPar MP-1   The MP-1 architecture consisted of four subsystems: the PE array, the array control unit (ACU), a UNIX subsystem with standard I/O, and a high-speed I/O subsystem, as depicted in Fig. 8.25a. The UNIX subsystem handled traditional serial processing. The high-speed I/O, working together with the PE array, handled massively parallel computing.
The MP-1 family included configurations with 1024, 4096, and up to 16,384 processors. The peak performance of the 16K-processor configuration was 26,000 MIPS in 32-bit RISC integer operations.
The system also had a peak floating-point capability of 1.5 Gflops in single-precision and 650 Mflops in double-precision operations.
Array Control Unit   The ACU was a 14-MIPS scalar RISC processor using a demand-paging instruction memory. The ACU fetched and decoded MP-1 instructions, computed addresses and scalar data values, issued control signals to the PE array, and monitored the status of the PE array.
Like the sequencer in the CM-2, the ACU was microcoded to achieve horizontal control of the PE array. Most scalar ACU instructions executed in one 70-ns clock. The whole ACU was implemented on one PC board.
A separately implemented functional unit, called a memory machine, was used in parallel with the ACU. The memory machine performed PE array load and store operations, while the ACU broadcast arithmetic, logic, and routing instructions to the PEs for parallel execution.
[Figure 8.25: (a) the MP-1 system, showing the array control unit and the PE array, a UNIX subsystem with standard I/O (disk array, tape, Ethernet, console), and a high-speed I/O subsystem (FDDI, HIPPI, frame buffer) attached to I/O devices; (b) the array of PE clusters on the processor boards]
Fig. 8.25  The MasPar MP-1 architecture (Courtesy of MasPar Computer Corporation, 1990)
The PE Array   Each processor board had 1024 PEs and associated memory arranged as 64 PE clusters (PECs) with 16 PEs per cluster. Figure 8.25b shows the inter-PEC connections on each processor board. Each PEC chip was connected to eight neighbors via the X-Net mesh and to a global multistage crossbar router network, labeled S1, S2, and S3 in Fig. 8.25b.
[Figure 8.26: (a) a PE cluster, showing 16 PEs (PE0 to PE15) sharing router, broadcast, and reduction connections; (b) a processor element and its memory (PMEM), with the ALU, exponent and mantissa units, flag unit, registers, control, address and ECC units, and the internal buses to external memory]
Fig. 8.26  Processing elements and memory design in the MasPar MP-1 (Courtesy of MasPar Computer Corporation, 1990)
Each PE cluster (Fig. 8.26a) was composed of 16 PEs and 16 processor memories (PEMs). The PEs were logically arranged as a 4 × 4 array for the X-Net two-dimensional mesh interconnections. The 16 PEs in a cluster shared an access port to the multistage crossbar router. Interprocessor communications were carried out via three mechanisms:

(1) ACU-PE array communications.
(2) X-Net nearest-neighbor communications.
(3) Global crossbar router communications.

The first mechanism supported ACU instruction/data broadcasts to all PEs in the array simultaneously and performed global reductions on parallel data to recover scalar values from the array. The other two IPC mechanisms are described separately below.
X-Net Mesh Interconnect   The X-Net interconnect directly connected each PE with its eight neighbors in the two-dimensional mesh. Each PE had four connections at its diagonal corners, forming an X pattern similar to the BLITZEN X grid network (Davis and Reif, 1986). A tri-state node at each X intersection permitted communication with any of the eight neighbors using only four wires per PE.
The connections to the PE array edges were wrapped around to form a 2-D torus. The torus structure is symmetric, facilitates several important matrix algorithms, and can emulate a one-dimensional ring with two X-Net steps. The aggregate X-Net communication bandwidth was 18 Gbytes/s in the largest MP-1 configuration.
Multistage Crossbar Interconnect   The router network provided global communication between all PEs and formed the basis for the MP-1 I/O system. The three router stages implemented the function of a 1024 × 1024 crossbar switch. Three router chips were used on each processor board.
Each PE cluster shared an originating port connected to router stage S1 and a target port connected to router stage S3. Connections were established from an originating PE through stages S1, S2, and S3 and then to the target PE. The full MP-1 configuration had 1024 PE clusters, so each stage had 1024 router ports. The router supported up to 1024 simultaneous connections with an aggregate bandwidth of 1.3 Gbytes/s.
Processor Elements and Memory   The PE design had mostly data path logic and no instruction fetch or decode logic. The design is detailed in Fig. 8.26b. Both integer and floating-point computations executed in each PE with a register-based RISC architecture. Load and store instructions moved data between the PEM and the register set.
Each PE had forty 32-bit registers available to the programmer and eight 32-bit registers for system use. The registers were bit and byte addressable. Each PE had a 4-bit integer ALU, a 1-bit logic unit, a 64-bit mantissa unit, a 16-bit exponent unit, and a flag unit. The NIBBLE bus was four bits wide and the BIT bus was one bit wide. The PEM could be directly or indirectly addressed with a maximum aggregate memory bandwidth of 12 Gbytes/s.
Most data movement within each PE occurred on the NIBBLE bus and the BIT bus. Different functional units within the PE could be simultaneously active during each microstep. In other words, integer, Boolean, and floating-point operations could all be performed at the same time. Each PE ran with a slow clock, while the system speed was obtained through massive parallelism like that implemented in the CM-2.
Parallel Disk Arrays   Another feature worthy of mention is the massively parallel I/O architecture implemented in the MP-1. The PE array (Fig. 8.25a) communicated with a parallel disk array through the high-speed I/O subsystem, which was essentially implemented by the 1.3-Gbytes/s global router network.
The disk array provided up to 17.3 Gbytes of formatted capacity with a 9-Mbytes/s sustained disk I/O rate. The parallel disk array was a necessity to support data-parallel computation and to provide file system transparency and multilevel fault tolerance.
The grand challenge applications drive the development of present and future MPP systems to achieve higher and higher performance goals. The Connection Machine model CM-5 was the most innovative effort of Thinking Machines Corporation toward this end. We describe below the innovations surrounding the CM-5 architectural development, its building blocks, and the application paradigms.
[Figure 8.27: the CM-5 building blocks, showing processing nodes (P/M), control processors (CP), and I/O interfaces, all attached to the data network, the control network, and the diagnostic network]
Fig. 8.27  The network architecture of the Connection Machine CM-5 (Courtesy of Leiserson et al., Thinking Machines Corporation, 1991)
Input and output were provided via high-bandwidth I/O interfaces to graphics devices, mass secondary storage such as a data vault, and high-performance networks. Additional low-speed I/O was provided by Ethernet connections to the control processors. The largest configuration was expected to occupy a space of 30 m × 30 m, and was designed for a peak performance of over 1 Tflops.
The Network Functions   The building blocks were interconnected by three networks: a data network, a control network, and a diagnostic network. The data network provided high-performance, point-to-point data communications between the processing nodes. The control network provided cooperative operations, including broadcast, synchronization, and scans, as well as system management functions.
The diagnostic network allowed "back-door" access to all system hardware to test system integrity and to detect and isolate errors. The data and control networks were connected to processing nodes, control processors, and I/O channels via network interfaces.
The CM-5 architecture was considered universal because it was optimized for data-parallel processing of large and complex problems. The data parallelism could be implemented in SIMD mode, multiple-SIMD mode, or synchronized MIMD mode.
The data and control networks were designed to have good scalability, making the machine size limited by the affordable cost rather than by any architectural or engineering constraint. In other words, the networks depended on no specific types of processors. When new technological advances arrived, they could be easily incorporated into the architecture. The network interfaces were designed to provide an abstract view of the networks.
The System Operation   The system operated one or more user partitions. Each partition consisted of a control processor, a collection of processing nodes, and dedicated portions of the data and control networks.
Figure 8.28 illustrates the distributed control on the CM-5 obtained through the dynamic use of the two interprocessor communication networks. Major system management functions, services, and data distribution are summarized in this diagram.
[Figure 8.28: distributed control on the CM-5, with control processors providing UNIX OS services, partition management, and device management; user processing on the partitioned processing nodes; and file systems and I/O management on storage devices and interfaces, all interacting through the data network and the control network]
Fig. 8.28  Distributed control on the CM-5 with concurrent user partitions and I/O activities (Courtesy of Thinking Machines Corporation, 1992)
The partitioning of resources was managed by a system executive. The control processor assigned to each partition behaved like a partition manager. Each user process executed on a single partition but could exchange data with processes on other partitions. Since all partitions utilized UNIX time-sharing and security features, each allowed multiple users to access the partition, while ensuring no conflicts or interference.
Access to system functions was classified as either privileged or nonprivileged. Access to the data and control networks within a partition was nonprivileged; these accesses could be executed directly by user code without system calls. Thus, OS kernel overhead could be eliminated in network communication within a user task. Access to the diagnostic network, to shared I/O resources, and to other partitions was privileged and could only be accomplished via system calls.
Some control processors in the CM-5 were assigned to manage the I/O devices and interfaces. This organization allowed a process on any partition to access any I/O device, and ensured that access to one device did not impede access to other devices. Functionally, the system operations, as depicted in Fig. 8.28,
were divided into user-oriented partitions, I/O services based upon system calls, dynamic control of the data and control networks, and system management and diagnostics.
The two networks could download user code from a control processor to the processing nodes, pass I/O requests, transfer messages of all sorts between control processors, and transfer data among nodes and I/O devices, either in a single partition or among different partitions. The I/O capacity could be scaled with increasing numbers of processing nodes or of control partitions. The CM-5 embodied the features of hardware modularity, distributed control, latency tolerance, and user abstraction; all of these are needed for scalable computing.
Fat Trees   A fat tree is more like a real tree in that it becomes thicker as it acquires more leaves. Processing nodes, control processors, and I/O channels are located at the leaves of a fat tree. A binary fat tree was illustrated in Fig. 2.17c. The internal nodes are switches. Unlike an ordinary binary tree, the channel capacities of a fat tree increase as we ascend from leaves to root.
The hierarchical nature of a fat tree can be exploited to give each user partition a dedicated subtree, which cannot be interfered with by any other partition's message traffic. The CM-5 data network was actually implemented with a 4-ary fat tree as shown in Fig. 8.29. Each of the internal switch nodes was made up of several router chips. Each router chip was connected to four child chips and either two or four parent chips.
Fig. 8.29  CM-5 data network implemented with a 4-ary fat tree (Courtesy of Leiserson et al., Thinking Machines Corporation, 1991)
To implement the partitions, one could allocate different subtrees to handle different partitions. The size of the subtrees varied with different partition demands. The I/O channels were assigned to another subtree, which was not devoted to any user partition. The I/O subtree was accessed as a shared system resource. In many ways, the data network functioned like a hierarchical system bus, except that there was no interference among partitioned subtrees. All leaf nodes had unique physical addresses.
The Data Network   To route a message from one processing node to another, the message was sent up the tree to the least common ancestor of the two processors and then down to the destination.
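The up/down rule can be sketched in a few lines, assuming a hypothetical leaf numbering in which the leaf addresses of a tree of height L run from 0 to 4^L - 1; the function names and the example heights are our own illustration, not the CM-5 addressing scheme.

def lca_level(src, dst, height):
    """Number of levels a message must climb: the lowest level at which
    src and dst fall inside the same 4-ary subtree."""
    level = 0
    while src != dst:
        src //= 4
        dst //= 4
        level += 1
    return min(level, height)

def fat_tree_route(src, dst, height):
    up = lca_level(src, dst, height)
    return {"up_hops": up, "down_hops": up}

print(fat_tree_route(5, 6, 4))    # same 4-leaf subtree: 1 hop up, 1 hop down
print(fat_tree_route(0, 63, 4))   # leaves 0 and 63 share only a 64-leaf subtree: 3 up, 3 down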
In the 4-ary fat-tree implementation (Fig. 8.29) of the data network, each connection provided a link to another chip with a raw bandwidth of 20 Mbytes/s in each direction. By selecting at each level of the tree whether two or four parent links were used, the bandwidths between nodes in the fat tree could be adjusted. Flow control was provided on each link.
Each processor had two connections to the data network, corresponding to a raw bandwidth of 40 Mbytes/s in and out of each leaf node. In the first two levels, each router chip used only two parent connections to the next higher level, yielding an aggregate bandwidth of 160 Mbytes/s out of a subtree with 16 leaf nodes. All router chips higher than the second level used four parent connections, which yielded an aggregate bandwidth of 10 Gbytes/s in each direction, from one half of a 2K-node system to the other.
The bandwidth continued to scale linearly up to 16,384 nodes, the largest CM-5 configuration planned. In larger machines, transmission-line techniques were to be used to pipeline bits across long wires, thereby overcoming the bandwidth limitation that would otherwise be imposed by wire latency.
As a message went up the tree, it would have several choices as to which parent connection to take. The decision was resolved by pseudo-randomly selecting from among those links that were unobstructed by other messages. After reaching the least common ancestor of the source and destination nodes, the message took a single available path of links down to the destination. The pseudo-random choice at each level automatically balanced the load on the network and avoided undue congestion caused by pathological message sets.
The data network chips were driven by a 40-MHz clock. The first two levels were routed through backplanes. The wires on higher levels were routed through cables, which could be either 9 or 26 ft in length. Message routing was based on the wormhole concept discussed in Section 7.4.
Faulty processing nodes or connection links could be mapped out of the system and quarantined. This allowed the system to remain functional while servicing and testing the mapped-out portion. The data network was acyclic from input to output, which precluded deadlock from occurring, provided the network promised to eventually deliver all messages injected into it and the processors promised to eventually remove all messages from the network after they were successfully delivered.
The Control Network   The architecture of the control network was that of a complete binary tree with all system components at the leaves. Each user partition was assigned to a subtree of the network. Processing nodes were located at leaves of the subtree, and a control processor was mapped into the partition at an additional leaf. The control processor executed the scalar part of the code, while the processing nodes executed the data-parallel part.
Unlike the variable-length messages transmitted by the data network, control network packets had a fixed length of 65 bits. There were three major types of operations on the control network: broadcasting, combining, and global operations. These operations provided interprocessor communications. Separate FIFOs in the network interface were assigned to each type of control operation.
The control network provided the mechanisms allowing data-parallel code to be executed efficiently and supported MIMD execution for general-purpose applications. The binary tree architecture made the control network simpler to implement than the fat tree used in the data network. The control network had the additional switching capability to map around faults and to connect any of the control processors to any user partition using an off-line routing strategy.
The Diagnostic Network   This network was needed for upgrading system availability. Built-in testability was achieved with scan-based diagnostics. Again, this network was organized as a (not necessarily complete) binary tree for its simplicity in addressing. One or more diagnostic processors were at the root. The leaves were pods, and each pod was a physical subsystem, such as a board or a backplane. There was a unique path from the root to each pod being tested.
The diagnostic network allowed groups of pods to be addressed according to a "hypercube-address" scheme. A special diagnostic interface was designed to form an in-system check of the integrity of all CM-5 chips that supported the JTAG (Joint Test Action Group) standard and all networks. It provided scan access to all chips supporting the JTAG standard and programmable ad hoc access to non-JTAG chips. The network itself was completely testable and diagnosable. It was able to map out and ignore faulty or powered-down parts of the machine.
[Figure 8.30: the control processor, consisting of a CPU with memory, standard I/O with a LAN connection, and a CM-5 network interface]
Fig. 8.30  The control processor in the CM-5 (Courtesy of Thinking Machines Corporation, 1991)
Each control processor ran CMOST, a UNIX-based OS with extensions for managing the parallel processing resources of the CM-5. Some control processors managed computational resources in user partitions. Others were used to manage I/O resources. Control processors specialized in managerial functions rather than computational functions. For this reason, high-performance arithmetic accelerators were not needed. Instead, additional I/O connections were provided in control processors.
Processing Nodes   Figure 8.31 shows the basic structure of a processing node. It was a SPARC-based processor with a memory subsystem, consisting of a memory controller and 8, 16, or 32 Mbytes of DRAM memory. The internal bus was 64 bits wide.
[Figure 8.31: the processing node, with DRAM memory connected over 64-bit paths (plus ECC) to a memory controller, which shares a 64-bit bus with the RISC processor and the network interface]
Fig. 8.31  The processing node in the CM-5 (Courtesy of Thinking Machines Corporation, 1992)
The SPARC processor was chosen for its multiwindow feature, which facilitated fast context switching. This was very crucial to the dynamic use of the processing nodes in different user partitions at different times. The network interface connected the node to the rest of the system through the control and data networks. The use of a hardware arithmetic accelerator to augment the processor was optional.
Vector Units   As illustrated in Fig. 8.32a, vector units could be added between the memory banks and the system bus as an optional feature. The vector units would replace the memory controller in Fig. 8.31. Each vector unit had a dedicated 72-bit path to its attached memory bank, providing a peak memory bandwidth of 128 Mbytes/s per vector unit.
The vector unit executed vector instructions issued by the scalar processor and performed all the functions of a memory controller, including the generation and checking of ECC (error-correcting code) bits. As detailed in Fig. 8.32b, each vector unit had a vector instruction decoder, a pipelined ALU, and sixty-four 64-bit registers, like a conventional vector processor.
[Figure 8.32: the processing node with vector units; each vector unit contains a pipelined ALU, a 64 × 64-bit register file, and memory-controller functions, and sits between its memory bank and the 64-bit node bus shared by the RISC processor and the network interface]
Fig. 8.32  The processing node with vector units in the CM-5 (Courtesy of Thinking Machines Corporation, 1992)
Each vector instruction could be issued to a specific vector unit, to a pair of units, or broadcast to all four units at once. The scalar processor took care of address translation and loop control, overlapping them with vector unit operations. Together, the vector units provided 512 Mbytes/s of memory bandwidth and 128 Mflops of 64-bit peak performance per node. In this sense, each processing node of the CM-5 was itself a supercomputer. Collectively, 16K processing nodes would yield a peak performance of 2^14 × 2^7 = 2^21 Mflops = 2 Tflops.
Initially, SPARC processors were used in implementing the control processors and processing nodes. As processor technology advanced, other new processors could also be combined in the system. The network architecture was designed to be independent of the processors chosen, except for the network interfaces, which would need some minor modifications when new processors were used.
Replication   Recall the broadcast operation, where a single value may be replicated into as many copies as needed and distributed to all processors, as illustrated in Fig. 8.33a. Other duplication operations include the spreading of a column vector into all the columns of a matrix (Fig. 8.33b), the expansion of a short vector into a long vector (Fig. 8.33c), and a completely irregular duplication (Fig. 8.33d).
[Figure 8.33: replication operations on the CM-5, showing (a) broadcast of a single value to all processors, (b) spreading a column vector into all columns of a matrix, (c) expansion over variable-length vectors, and (d) completely irregular duplication]
Replication plays a fundamental role in matrix arithmetic and vector processing, especially on a data-parallel machine. Replication is carried out through the control network in four kinds of broadcasting schemes: user broadcast, supervisor broadcast, interrupt broadcast, and utility broadcast. These operations can be used to download code and to distribute data, to implement fast barrier synchronization, and to configure partitions through the OS.
Reduction   Vector reduction was implemented on the CM-2 by first scanning, and on the CM-5 the mechanism was further generalized as the opposite of replication. As illustrated in Fig. 8.34, global reduction produces the sum of vector components (Fig. 8.34a). Similarly, the row/column reductions produce the sums per each row or column of a matrix (Fig. 8.34b).
"v'ariahle—le:ngth vectors were reduced in chunks ofa long vector (Fig. 8.34-c). The same idea was applied
to a oomplelely irregular set as well (Fig. B.34d_]. In general, reduction functions include the maximum, the
minimum, the average, the dot. product, the sum, logical AND, logical UR, etc. Fast scanning and combining
are necessities in implementing these operation
[Figure 8.34: reduction operations on the CM-5, showing (a) the global sum of a vector, (b) row/column reductions of a matrix, (c) reductions over variable-length vectors, and (d) reduction over a completely irregular set]
Fig. 8.34  Reduction operations on the CM-5 (Courtesy of Thinking Machines Corporation, 1991)
Four types of combining operations, namely reduction, forward scan (parallel prefix), backward scan (parallel suffix), and router done, were supported by the control network. We will describe parallel prefix shortly. Router done refers to the detection of the completion of a message-routing cycle, based on Kirchhoff's current law, in that the network interfaces keep track of the number of messages entering and leaving the data network. When a round of message sending and acknowledging is complete, the net "current" (messages) in and out of a port should be zero.
Permutation   Data-parallel computing relies on permutation for fast exchange of data among processing nodes. Figure 8.35 illustrates four cases of permutations performed on the CM-5. These permutation operations are often needed in matrix transpose, reversing a vector, shifting a multidimensional grid, and FFT butterfly operations.
[Figure 8.35: permutation operations on the CM-5, showing (a) a 1-D nearest-neighbor shift, (b) a 2-D row/column shift, and two further, more general permutation patterns]
Fig. 8.35  Permutation operations for interprocessor communications on the CM-5 (Courtesy of Thinking Machines Corporation, 1991)
Parallel Prefix   This is a kind of combining operation supported by the control network. A parallel prefix operation delivers to the ith processor the result of applying one of the five reduction operators to the values in the preceding i − 1 processors, in the linear order given by data address.
The idea is illustrated in Fig. 8.36 with four examples. Figure 8.36a shows the one-dimensional sum-prefix, in which, for example, the fourth output 12 is the sum of the first four input elements (1 + 2 + 5 + 4 = 12). The two-dimensional row/column sum-prefix (Fig. 8.36b) can be similarly performed using the forward-scanning mechanism.
Figure 8.36c computes the one-dimensional prefix-sum on sections of a long vector independently. Figure 8.36d shows the forward scanning along linked lists to produce the prefix-sums as outputs.
Many prefix and suffix scanning operations appear to be inherently sequential processes. But the scanning and combining mechanisms on the CM-5 could complete the process in approximately log₂ n steps, where n is the array length involved. For example, on the CM-5 a parallel prefix operation on a vector of 1000 entries could be finished in 10 steps instead of 1000 steps.
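The log-step behavior can be illustrated with a short simulation of an inclusive scan in the Hillis/Steele style. This is only an illustration of the idea, not the CM-5 control-network implementation; the first four input elements follow the sum-prefix example above, and the remaining elements are arbitrary.

def parallel_prefix_sum(values):
    """Inclusive sum-prefix computed in ceil(log2 n) data-parallel steps."""
    x = list(values)
    n, d, steps = len(x), 1, 0
    while d < n:
        # in one parallel step, every "processor" i >= d adds the value d positions away
        x = [x[i] + x[i - d] if i >= d else x[i] for i in range(n)]
        d *= 2
        steps += 1
    return x, steps

result, steps = parallel_prefix_sum([1, 2, 5, 4, 6, 3, 9, 2])
print(result)     # [1, 3, 8, 12, 18, 21, 30, 32]; the fourth output is 12, as in Fig. 8.36a
print(steps)      # 3 steps for 8 elements (log2 8), instead of 8 sequential additions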
[Figure 8.36: parallel prefix operations on the CM-5, showing (a) a 1-D sum-prefix, (b) a 2-D row/column sum-prefix, (c) sum-prefixes over variable-length vectors, and (d) sum-prefixes along linked lists]
Fig. 8.36  Parallel prefix operations on the CM-5 (Courtesy of Thinking Machines Corporation, 1991)
Summary
By around 1970, computer systems based on the basic single-processor von Neumann architecture had become well established, with products from several computer companies available in the market. In the search for higher processing power, especially for scientific and engineering applications, the earliest supercomputers made heavy use of vector processing concepts, while the concepts of shared-bus multiprocessors and SIMD systems were also beginning to emerge at around that time.
We started this chapter with a study of the basic vector processing concepts, vector instruction types, and interleaved vector memory access schemes. Vector instruction types include vector-vector, vector-scalar, vector-memory, vector reduction, gather and scatter, and masking operations. Examples were studied of the early supercomputers based on vector processing concepts, including systems produced by the two pioneer supercomputer companies Cray and CDC.
Our study of multivector computers (i.e. systems based on multiple vector processors) began with the basic system design rules for achieving the target performance. These design rules can be related to processing power, I/O and networking, memory bandwidth, and scalability. As specific examples, multivector systems and early massively parallel processing (MPP) systems introduced by Cray were studied, as were Fujitsu multivector systems. Also reviewed in brief were mainframe systems provided with vector processing capability, and the so-called mini-supercomputers which emerged with advances in electronic technology.
The concept of compound vector processing arises from the search for more efficient processing of vector data. Scientific and engineering applications make use of such vector operations, and therefore system architects have always looked for ways to map them efficiently onto the underlying vector processing hardware. The concepts of vector loops and chaining, and of multi-pipeline networking, have also been developed with the aim of providing efficient support for compound vector processing.
SIMD computer systems may be of one of two basic types: with distributed memory modules and with shared memory modules. Specific examples were discussed of two innovative SIMD systems: Connection Machine 2 (CM-2), with processors based on bit-slice technology, and MasPar MP-1, with its specially designed processors. Both systems used sophisticated system interconnects and had the capability to connect thousands of processors. However, for good technological reasons, the architectural trend later turned away from SIMD systems and towards massively parallel MIMD (or SPMD) systems.
Connection Machine 5 (CM-5) represents the shift towards massively parallel MIMD architecture which occurred in the mid-1990s. The main factor behind this shift was the availability of low-cost but powerful processors, made possible by rapid advances in the underlying VLSI technology. CM-5 innovations included the use of a large number of RISC processors, a sophisticated data network (using a fat tree), and special hardware features to support efficient and versatile interprocessor communication, which included useful operations such as replication, reduction, and permutation.
Exercises
Problem 8.1  Explain the structural and operational differences between register-to-register and memory-to-memory architectures in building multipipelined supercomputers for vector processing. Comment on the advantages and disadvantages in using SIMD computers as compared with the use of pipelined supercomputers for vector processing.

Problem 8.2  Explain the following terms related to vector processing:
(a) Vector and scalar balance point.
(b) Vectorization ratio in user code.
(c) Vectorization compiler or vectorizer.
(d) Vector reduction instructions.
(e) Gather and scatter instructions.
(f) Sparse matrix and masking instruction.

Problem 8.3  Explain the following memory organizations for vector accesses:
(a) S-access memory organization.
(b) C-access memory organization.
(c) C/S-access memory organization.

Problem 8.4  Distinguish among the following vector processing machines in terms of architecture, performance range, and cost-effectiveness:
(a) Full-scale vector supercomputers.
(b) High-end mainframes or near-supercomputers.
(c) Minisupercomputers or supercomputing workstations.

Problem 8.5  Explain the following terms associated with compound vector processing:
(a) Compound vector functions.
(b) Vector loops and pipeline chaining.
(c) Systolic program graphs.
(d) Pipeline nets or pipenets.

Problem 8.6  Answer the following questions related to the architecture and operations of the Connection Machine CM-2:
(a) Describe the processing node architecture, including the processor, memory, floating-point unit, and network interface.
(b) Describe the hypercube router and the NEWS grid and explain their uses.
(c) Explain the scanning and spread mechanisms and their applications on the CM-2.
(d) Explain the concepts of broadcasting, global combining, and virtual processors in the use of the CM-2.

Problem 8.7  Answer the following questions about the MasPar MP-1:
(a) Explain the X-Net mesh interconnect (the PE array) built into the MP-1.
(b) Explain how the multistage crossbar router works for global communication between all PEs.
(c) Explain the computing granularity on PEs and how fast I/O is performed on the MP-1.

Problem 8.8  Answer the following questions about the Connection Machine CM-5:
(a) What is a fat tree, and what is its application in constructing the data network in the CM-5?
(b) What are user partitions and their resource requirements?
(c) Explain the functions of the control processors, of the control network, and of the diagnostic network.
(d) Explain how vector processing is supported in each processing node.

Problem 8.9  Give examples, different from those in Figs. 8.33 through 8.36, to explain the concepts of replication, reduction, permutation, and parallel prefix operations on the CM-5. Check the Technical Summary of the CM-5 published by Thinking Machines Corporation if additional reading is needed.

Problem 8.10  On a Fujitsu VP2000, the vector processing unit was equipped with two load/store pipelines plus five functional pipelines as shown in Fig. 8.13. Consider the execution of the following compound vector function:
A(I) = B(I) × C(I) + D(I) × E(I) + F(I) × G(I)
for I = 1, 2, ..., N. Initially, all vector operands are in memory, and the final vector result must be stored in memory.
(a) Show a pipeline-chaining diagram, similar to Fig. 8.18, for executing this CVF.
(b) Show a space-time diagram, similar to Fig. 8.19, for pipelined execution of the CVF. Note that two vector loads can be carried out simultaneously on the two vector-access pipes. At the end of the computation, one of the two access pipes is used for storing the A array.

Problem 8.11  The following sequence of compound vector functions is to be executed on a Cray X-MP type vector processor:
A(I) = B(I) + s × C(I)
D(I) = s × B(I) × C(I)
E(I) = C(I) × (C(I) − B(I))
where B(I) and C(I) are each 64-element vectors originally stored in memory. The resulting vectors A(I), D(I), and E(I) must be stored back into memory after the computation.
(a) Write 11 vector instructions in proper order to execute the above CVFs on a Cray X-MP type vector processor with two vector-load pipes and one vector-store pipe, which can be used simultaneously with the remaining functional pipelines.
(b) Show a space-time diagram, similar to Fig. 8.19, for achieving maximally chained vector operations for executing the above CVFs in minimum time.
(c) Show the potential speedup of the above vector chaining operations over the chaining operations on the Cray 1, which had only one memory-access pipe.

Problem 8.12  Consider a vector computer which can operate in one of two execution modes at a time: one is the vector mode with an execution rate of Rv = 2000 Mflops, and the other is the scalar mode with an execution rate of Rs = 200 Mflops. Let α be the percentage of code that is vectorizable in a typical program mix for this computer.
(a) Derive an expression for the average execution rate Ra for this computer.
(b) Plot Ra as a function of α in the range (0, 1).
(c) Determine the vectorization ratio α needed in order to achieve an average execution rate of Ra = 1500 Mflops.
(d) Suppose α = 0.7. What value of Rv is needed to achieve Ra = 400 Mflops?

Problem 8.13  Describe an algorithm using add, multiply, and data-routing operations to compute the expression s = A1 × B1 + A2 × B2 + ... + A32 × B32 with minimum time in each of the following two computer systems. It is assumed that add and multiply require two and four time units, respectively. The time required for instruction/data fetches from memory and decoding delays are ignored. All instructions and data are assumed already loaded into the relevant PEs. Determine the minimum compute time in each ... maximum 64-way parallelism in their vector operations.

Problem 8.15  Devise a minimum-time algorithm to multiply two 64 × 64 matrices, A = [aij] and B = [bij], on an SIMD machine consisting of 64 PEs with local memory. The 64 PEs are interconnected by a 2D 8 × 8 torus with bidirectional links.
(a) Show the initial distribution of the input matrix elements [aij] and [bij] on the PE memories.
(b) Specify the SIMD instructions needed to carry out the matrix multiplication. Assume that each PE can perform one multiply, one add, or one shift (shifting data to one of its four neighbors) operation per cycle.
You should first compute all the multiply and add operations on local data before starting to route data to neighboring PEs. The SIMD shift operations can be either east, west, south, or north with wraparound connections on the torus.
9.1  LATENCY-HIDING TECHNIQUES
Massively parallel and scalable systems may typically use distributed shared memory. The access of remote memory significantly increases memory latency. Furthermore, processor speed has been increasing at a much faster rate than memory speed. Thus any scalable multiprocessor or large-scale multicomputer must rely on the use of latency-reducing, -tolerating, or -hiding mechanisms. Four latency-hiding mechanisms are studied below for enhancing scalability and programmability.
Latency hiding can be accomplished through four complementary approaches: (i) using prefetching techniques, which bring instructions or data close to the processor before they are actually needed; (ii) using coherent caches supported by hardware to reduce cache misses; (iii) using relaxed memory consistency models, which allow buffering and pipelining of memory references; and (iv) using multiple-context support to allow a processor to switch from one context to another when a long-latency operation is encountered.
The first three mechanisms are described in this section, supported by simulation results obtained by Stanford researchers. Multiple contexts will be treated with multithreaded processors and system architectures in Sections 9.2 and 9.4. However, the effect of multiple contexts is shown here in combination with other latency-hiding mechanisms.
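As a back-of-the-envelope illustration of the prefetching idea, the sketch below compares a loop that stalls on every remote access with one that issues the access for the next iteration one step ahead; the cycle counts are made-up illustrative numbers, not measurements from any of the systems discussed here.

def loop_time(n_iters, compute, latency, prefetch):
    """Rough cycle count for a loop whose every iteration needs one remote access."""
    if not prefetch:
        return n_iters * (latency + compute)       # stall on every remote access
    # with prefetching, the access for iteration i+1 overlaps the compute of iteration i
    per_iter = max(latency, compute)
    return latency + n_iters * per_iter            # only the first access is fully exposed

print(loop_time(100, compute=80, latency=100, prefetch=False))  # 18000 cycles
print(loop_time(100, compute=80, latency=100, prefetch=True))   # 10100 cycles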
[Figure 9.1: each processor with a first-level cache, write buffer, and secondary cache; the processing nodes are grouped into clusters (Cluster 1 through Cluster n) that are joined by an interconnection network]
Fig. 9.1  A scalable coherent cache multiprocessor with distributed shared memory, modeled after the Stanford Dash (Courtesy of Anoop Gupta et al., Proc. 1991 Ann. Int. Symp. Computer Arch.)
Cache coherence was maintained using an invalidating, distributed directory-based protocol (Section 7.2.3). For each memory block, the directory kept track of the remote nodes caching it. When a write occurred, point-to-point messages were sent to invalidate remote copies of the block. Acknowledgment messages were used to inform the originating node when an invalidation was completed.
Two levels of local cache were used per processing node. Loads and writes were separated with the use of write buffers for implementing weaker memory consistency models. The main memory was shared by all processing nodes in the same cluster. To facilitate prefetching and the directory-based coherence protocol, directory memory and remote-access caches were used for each cluster. The remote-access cache was shared by all processors in the cluster.
The SVM Concept  Figure 9.2 shows the structure of a distributed shared memory. A global virtual address space is shared among processors residing at a large number of loosely coupled processing nodes. This shared virtual memory (SVM) concept was introduced in Section 4.4.1. Implementation and management issues of SVM are discussed below.
[Figure omitted: each processing node (Node 1 through Node N) has a CPU, local memory, and a page table; pages from the local memories are mapped, with access attributes (nil, read-only, writable), into a single shared virtual address space spanning all nodes.]

Fig. 9.2 The concept of distributed shared memory with a global virtual address space shared among all processors on loosely coupled processing nodes in a massively parallel architecture (Courtesy of Kai Li, 1992)
Shared virtual memory was first developed in a Ph.D. thesis by Li (1986) at Yale University. The idea is to implement coherent shared memory on a network of processors without physically shared memory. The coherent mapping of SVM on a message-passing multicomputer architecture is shown in Fig. 9.2. The system uses virtual addresses instead of physical addresses for memory references.

Each virtual address space can be as large as a single node can provide and is shared by all nodes in the system. Li (1988) implemented the first SVM system, IVY, on a network of Apollo workstations. The SVM address space is organized in pages which can be accessed by any node in the system. A memory-mapping manager on each node views its local memory as a large cache of pages for its associated processor.
Page Swapping  According to Kai Li (1992), pages that are marked read-only can have copies residing in the physical memories of other processors. A page currently being written may reside in only one local memory. When a processor writes a page that is also on other processors, it must update the page and then invalidate all copies on the other processors. Li described the page swapping as follows:

A memory reference causes a page fault when the page containing the memory location is not in a processor's local memory. When a page fault occurs, the memory manager retrieves the missing page from the memory of another processor. If there is a page frame available on the receiving node, the page is moved
in. Otherwise, the SVM system uses page replacement policies to find an available page frame, swapping its contents to the sending node.

A hardware MMU can set the access rights (nil, read-only, writable) so that a memory access violating memory coherence will cause a page fault. The memory coherence problem is solved in IVY through distributed fault handlers and their servers. To client programs, this mechanism is completely transparent.

The large virtual address space allows programs to be larger in code and data space than the physical memory on a single node. This SVM approach offers the ease of shared-variable programming in a message-passing environment. In addition, it improves software portability and enhances system scalability through modular memory growth.
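As a concrete illustration of the fault-handling path just described, the following C sketch models page ownership and FIFO page replacement on a handful of nodes. The node and page counts, the function names, and the single-writable-copy simplification are all assumptions made for the example; the real IVY fault handlers, servers, and access-rights machinery are not modeled.

#include <stdio.h>

#define PAGES   8   /* pages in the shared virtual address space    */
#define FRAMES  4   /* physical page frames per node (toy numbers)  */
#define NODES   3

typedef struct {
    int resident[FRAMES];   /* page held by each local frame, or -1 */
    int next_victim;        /* trivial FIFO replacement pointer     */
} node_t;

static node_t node[NODES];
static int owner_of[PAGES]; /* node whose memory currently holds the writable copy */

/* Access 'page' from node 'n'. If the page is not resident, take a
 * (simulated) page fault: pick a victim frame, swap its contents back to
 * the sending node, fetch the page from its current owner, and record the
 * new owner. Actual data movement and access rights are not modeled.    */
static void access_page(int n, int page)
{
    node_t *self = &node[n];
    int f, from;

    for (f = 0; f < FRAMES; f++)
        if (self->resident[f] == page)
            return;                              /* hit: nothing to do */

    from = owner_of[page];                       /* page fault          */
    f = self->next_victim;
    if (self->resident[f] >= 0) {                /* no free frame       */
        printf("node %d swaps page %d out to node %d\n",
               n, self->resident[f], from);
        owner_of[self->resident[f]] = from;      /* victim now lives at the sender */
    }
    printf("node %d fetches page %d from node %d\n", n, page, from);
    self->resident[f] = page;
    self->next_victim = (f + 1) % FRAMES;
    owner_of[page] = n;                          /* single writable copy moves here */
}

int main(void)
{
    int i, f, p;
    for (i = 0; i < NODES; i++) {
        for (f = 0; f < FRAMES; f++) node[i].resident[f] = -1;
        node[i].next_victim = 0;
    }
    for (p = 0; p < PAGES; p++) owner_of[p] = 0; /* node 0 initially owns every page */

    access_page(1, 3);   /* remote fault: page 3 fetched from node 0 */
    access_page(1, 3);   /* local hit                                */
    access_page(2, 3);   /* remote fault: page 3 fetched from node 1 */
    return 0;
}

Running the sketch simply traces which node fetches or swaps out which page; message traffic, read-only replication, and MMU protection are deliberately left out.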
Example SVM Systems  Nitzberg and Lo (1991) conducted a survey of SVM research systems. Excerpted from their survey, descriptions of four representative SVM systems are summarized in Table 9.1. Dash implemented SVM with a directory-based coherence protocol. Linda offered a shared associative object memory with access functions. Plus used a write-update coherence protocol and performed replication only by program request. Shiva extended the IVY system for the Intel iPSC/2 hypercube. In using SVM systems, there exists a tendency to use large block (page) sizes as units of coherence. This tends to increase false-sharing activity.
Table 9.1 Representative SVM Research Systems (Excerpts from Nitzberg and Lo, IEEE Computer, August 1991)

System and developer: Stanford Dash (Lenoski, Laudon, Gharachorloo, Gupta, and Hennessy, 1988-).
Implementation and structures: Mesh-connected network of Silicon Graphics 4D/340 workstations with added hardware for coherent caches and prefetching.
Coherence semantics and protocol: Release memory consistency with write-invalidate protocol.
Special mechanisms for performance and synchronization: Relaxed coherence, prefetching, and queued locks for synchronization.

System and developer: Yale Linda (Carriero and Gelernter, 1982-).
Implementation and structures: Software-implemented system based on the concept of tuple space with access functions to achieve coherence via virtual memory management.
Coherence semantics and protocol: Coherence varied with environment; hashing used in associative search; no mutable data.
Special mechanisms for performance and synchronization: Linda could be implemented for many languages and machines using C-Linda or Fortran-Linda interfaces.

System and developer: CMU Plus (Bisiani and Ravishankar, 1988-).
Implementation and structures: A hardware implementation using the MC 88000, Caltech mesh, and Plus kernel.
Coherence semantics and protocol: Used processor consistency, nondemand write-update coherence, delayed operations.
Special mechanisms for performance and synchronization: Pages for sharing, words for coherence, complex synchronization instructions.

System and developer: Princeton Shiva (Li and Schaefer, 1988).
Implementation and structures: Software-based system for the Intel iPSC/2 with a Shiva/native operating system.
Coherence semantics and protocol: Sequential consistency, write-invalidate protocol, 4-Kbyte page swapping.
Special mechanisms for performance and synchronization: Used data structure compaction, messages for semaphores and signal-wait, distributed memory as backing store.
Scalability issues of SVM architectures include determining the sizes of data structures for maintaining memory coherence and how to take advantage of the fast data transmission among distributed memories in order to implement large SVM address spaces. Data structure compaction and page swapping can simplify the design of a large SVM address space without using disks as backing stores. A number of alternative choices are given in Li (1992).
Benefits of Prefetching  The benefits of prefetching come from several sources. The most obvious benefit occurs when a prefetch is issued early enough in the code so that the line is already in the cache by the time it is referenced. However, prefetching can improve performance even when this is not possible (e.g. when the address of a data structure cannot be determined until immediately before it is referenced). If multiple prefetches are issued back to back to fetch the data structure, the latency of all but the first prefetched reference can be hidden due to the pipelining of the memory accesses.

Prefetching offers another benefit in multiprocessors that use an ownership-based cache coherence protocol. If a cache block line is to be modified, prefetching it directly with ownership can significantly reduce the write latencies and the ensuing network traffic for obtaining ownership. Network traffic is reduced in read-modify-write instructions, since prefetching with ownership avoids first fetching a read-shared copy.
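As a small illustration of the first benefit, the fragment below issues a software prefetch for the next record of a linked structure while the current record is being processed. The particle structure and field names are invented for the example, and __builtin_prefetch is the GCC/Clang builtin standing in for whatever prefetch instruction a given machine exposes; treat this as a sketch of the technique, not as the instrumentation used in the Dash study.

#include <stddef.h>

/* A particle record, loosely in the spirit of the MP3D data structures
 * (field names here are purely illustrative). */
struct particle {
    double pos[3], vel[3];
    struct particle *next;
};

/* Advance every particle by dt, prefetching the next record while the
 * current one is being processed so that part of its miss latency is
 * overlapped with the computation on the current record.            */
void advance(struct particle *head, double dt)
{
    for (struct particle *p = head; p != NULL; p = p->next) {
        if (p->next)
            __builtin_prefetch(p->next, 1);   /* 1 = prefetch with intent to write */
        for (int i = 0; i < 3; i++)
            p->pos[i] += p->vel[i] * dt;
    }
}

Passing 1 as the second argument requests the line with intent to write, which corresponds to the prefetch-with-ownership idea above: the block arrives in an exclusive state, so the later update does not need a second ownership transaction.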
Benchmark Results  Stanford researchers (Gupta, Hennessy, Gharachorloo, Mowry, and Weber, 1991) reported some benchmark results for evaluating various latency-hiding mechanisms. Benchmark programs included a particle-based three-dimensional simulator used in aeronautics (MP3D), an LU-decomposition program (LU), and a digital logic simulation program (PTHOR). The effect of prefetching is illustrated in Fig. 9.3 for running the MP3D code on a simulated Dash multiprocessor (Fig. 9.1).
[Figure omitted: bar chart of normalized execution time for five prefetching strategies (nopf, pf1, pf2, pf3, pf4), with each bar broken down into prefetch, synchronization, write-buffer, read, and busy components. Reported prefetch coverage for the five strategies is 0%, 37%, 91%, 91%, and 95%, obtained with 0, 1, 2, 6, and 16 extra source lines, respectively.]

Fig. 9.3 Effect of various prefetching strategies for running the MP3D benchmark on a simulated Dash multiprocessor (Courtesy of Anoop Gupta et al., 1991)
The simulation runs involved 10,000 particles in a 64 × 8 × 3 space array with five time steps. Five prefetching strategies were tested (nopf, pf1, pf2, pf3, and pf4 in Fig. 9.3). These strategies range from no prefetching (nopf) to prefetching of the particle record in the same iteration or pipelined across increasing numbers of iterations (pf1 through pf4). The bar diagrams in Fig. 9.3 show the execution times normalized with respect to the nopf strategy. Each bar shows a breakdown of the times required for prefetches, synchronization operations, using write buffers, reads, and busy time in computing.

The end result was that prefetches were issued for up to 95% of the misses that occurred in the case without prefetching (referred to as the coverage factor in Fig. 9.3). Prefetching yielded significant time reduction in synchronization operations, using write buffers, and performing read operations. The best speedup achieved in Fig. 9.3 is 1.86, when the pf4 prefetching strategy is compared with the nopf strategy. Still the prefetching benefits would be application-dependent. To introduce the prefetches in the MP3D code, only 16 lines of extra code were added to the source code.
Dash Experience  We evaluate the benefits when both private and shared read-write data are cacheable, as allowed by the Dash hardware coherent caches, versus the case where only private data are cacheable. Figure 9.4 presents a breakdown of the normalized execution times with and without caching of shared data for each of the applications. Private data are cached in both cases.
[Figure omitted: bar chart of normalized execution time for MP3D, LU, and PTHOR, each with and without caching of shared data; each bar is broken down into busy, read-miss, write-miss, and synchronization components.]

Fig. 9.4 Effect of caching shared data in simulated Dash benchmark experiments (Courtesy of Gupta et al., Proc. Int. Symp. Comput. Archit., Toronto, Canada, May 1991)
The execution time of each application is normalized to the execution time of the case where shared data is not cached. The bottom section of each bar represents the busy time or useful cycles executed by the processor. The section above it represents the time that the processor is stalled waiting for reads. The section above that is the amount of time the processor is stalled waiting for writes to be completed. The top section, labeled "synchronization," accounts for the time the processor is stalled due to locks and barriers.

Benefits of Caching  As expected, the caching of shared read-write data provided substantial gains in performance, with benefits ranging from 2.2- to 2.7-fold improvement for the three Stanford benchmark programs. The largest benefit came from a reduction in the number of cycles wasted due to read misses. The cycles wasted due to write misses were also reduced, although the magnitude of the benefits varied across the three programs due to different write-hit ratios.

The cache-hit ratios achieved by MP3D, LU, and PTHOR were 80, 66, and 77%, respectively, for shared-read references, and 75, 97, and 47% for shared-write references. It is interesting to note that these hit ratios are substantially lower than the usual uniprocessor hit ratios.

The low hit ratios arise from several factors: The data set size for engineering applications is large, parallelism decreases spatial locality in the application, and communication among processors results in invalidation misses. Still, hardware cache coherence is an effective technique for substantially increasing the performance with no assistance from the compiler or programmer.
9.1.4 Scalable Coherence Interface

A scalable coherence interconnect structure with low latency is needed to extend from conventional bused backplanes to a fully duplex, point-to-point interface specification. The scalable coherence interface (SCI), which was introduced in Chapter 5, is specified in IEEE Standard 1596-1992. SCI supports unidirectional point-to-point connections, with two such links between each pair of nodes; packet-based communication is used, with routing.

Up to 64K processors, memory modules, or I/O nodes can effectively interface with a shared SCI interconnect. The cache coherence protocols used in SCI are directory-based. A sharing list is used to chain the distributed directories together for reference purposes.
SCI Interconnect Models  SCI defines the interface between nodes and the external interconnect, using 16-bit links with a bandwidth of up to 1 Gbyte/s per link. As a result, backplane buses have been replaced by unidirectional point-to-point links. A typical SCI configuration is shown in Fig. 9.5a. Each SCI node can be a processor with attached memory and I/O devices. The SCI interconnect can assume a ring structure or a crossbar switch as depicted in Figs. 9.5b and 9.5c, respectively, among other configurations.
[Figure omitted: (a) a typical SCI configuration with nodes attached by input and output links and a converter bridging to a VME bus; (b) a ring of nodes for point-to-point transactions; (c) a crossbar multiprocessor configuration.]

Fig. 9.5 SCI interconnection configurations (Reprinted with permission from the IEEE Standard 1596-1992, copyright © 1992 by IEEE, Inc.)
Each node has an input link and an output link which are connected from or to the SCI ring or crossbar. The bandwidth of SCI links depends on the physical standard chosen to implement the links and interfaces.

In such an environment, the concept of broadcast bus-based transactions is abandoned. Coherence protocols are based on point-to-point transactions initiated by a requester and completed by a responder. A ring interconnect provides the simplest feedback connections among the nodes.

The converter in Fig. 9.5a is used to bridge the SCI ring to the VME bus as shown. A mesh of rings can also be considered using some bridging modules. The bandwidth, arbitration, and addressing mechanisms of an SCI ring significantly outperform backplane buses. By eliminating the snoopy cache controllers, the SCI is also less expensive per node, but the main advantage lies in its low latency and scalability.

Although SCI is scalable, the amount of memory used in the cache directories also scales up well. The performance of the SCI protocol does not scale, since when the sharing list is long, invalidations take a proportionately longer time.
Sharing-List Structures  Sharing lists are used in SCI to build chained directories for cache coherence use. The length of the sharing lists is effectively unbounded. Sharing lists are dynamically created, pruned, and destroyed. Each coherently cached block is entered onto a list of processors sharing the block.

Processors have the option of bypassing the coherence protocols for locally cached data. Cache blocks of 64 bytes are assumed. By distributing the directories among the sharing processors, SCI avoids scaling limitations imposed by using a central directory. Communications among sharing processors are supported by heavily shared memory controllers, as shown in Fig. 9.6.
[Figure omitted: processors with cached copies of a block are chained by forward and backward pointers into a sharing list anchored at the memory directory.]

Fig. 9.6 SCI cache coherence protocol with distributed directories (Courtesy of D. V. James et al., IEEE Computer, 1990)
Other blocks may be locally cached and are not visible to the coherence protocols. For every block address, the memory and cache entries have additional tag bits which are used to identify the first processor (head) in the sharing list and to link the previous and following nodes.

Doubly linked lists are maintained between processors in the sharing list, with forward and backward pointers as shown by the double arrows in each link. Noncoherent copies may also be made coherent by page-level control. However, such higher-level software coherence protocols are beyond the scope of the SCI standard.
Sharing-List Creation  The states of the sharing list are defined by the state of the memory and the states of the list entries. Normally, the shared memory is either in a home (uncached) or a cached (sharing-list) state. The sharing-list entries specify the location of the entry in a multiple-entry sharing list, identify the only entry in the list, or specify the entry's cache properties, such as clean, dirty, valid, or stale.

The head processor is always responsible for list management. The stable and legal combinations of the memory and entry states can specify uncached data, clean or dirty data at various locations, and cached writable or stale data.

The memory is initially in the home state (uncached), and all cache copies are invalid. Sharing-list creation begins at the cache where an entry is changed from an invalid to a pending state. When a read-cache transaction is directed from a processor to the memory controller, the memory state is changed from uncached to cached and the requested data is returned.

The requester's cache entry state is then changed from a pending state to an only-clean state. Sharing-list creation is illustrated in Fig. 9.7a. Multiple requests can be simultaneously generated, but they are processed sequentially by the memory controller.
[Figure omitted: (a) sharing-list creation, in which the first requester becomes the head of a one-entry list; (b) sharing-list update, in which a new requester is prepended and the old head becomes the second entry.]

Fig. 9.7 Sharing-list creation and update examples (Courtesy of D. V. James et al., IEEE Computer, 1990)
Sharing-List Updates  For subsequent memory accesses, the memory state is cached, and the cache head of the sharing list has possibly dirty data. As illustrated in Fig. 9.7b, a new requester (cache A) first directs its read-cache transaction to memory but receives a pointer to cache B instead of the requested data.

A second cache-to-cache transaction, called prepend, is directed from cache A to cache B. Cache B then sets its backward pointer to point to cache A and returns the requested data. The dashed lines correspond to transactions between a processor and memory or another processor. The solid lines are sharing-list pointers. After the transaction, the inserted cache A becomes the new head, and the old head, cache B, is in the middle as shown by the new sharing list on the right in Fig. 9.7b.

Any sharing-list entry may delete itself from the list. Details of entry deletions are left as an exercise for the reader. Simultaneous deletions never generate deadlocks or starvation. However, the addition of new sharing-list entries must be performed in first-in-first-out order in order to avoid potential deadlocking dependences.

The head of the sharing list has the authority to purge other entries from the list to obtain an exclusive entry. Others may reenter as a new list head. Purges are performed sequentially. The chained-directory coherence protocols are fault-tolerant in that dirty data is never lost when transactions are discarded.
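The prepend transaction of Fig. 9.7b is, at bottom, insertion of a new head into a doubly linked list whose entries live in different caches. The C sketch below models only the pointer manipulation on a single machine; the entry structure, the function names, and the purge loop are illustrative assumptions and ignore the pending states, memory-side tag updates, and the actual message exchanges of the SCI protocol.

#include <stdio.h>
#include <stdlib.h>

/* One cache entry participating in an SCI-style sharing list for a block. */
typedef struct entry {
    int           node_id;  /* processor holding the cached copy       */
    struct entry *fwd;      /* pointer toward the tail (older sharers) */
    struct entry *back;     /* pointer toward the head (newer sharers) */
} entry_t;

/* Memory-side state for the block: either uncached (NULL) or the head. */
static entry_t *head = NULL;

/* A read-cache request from 'node_id': the new entry is prepended and
 * becomes the head, as in the sharing-list update of Fig. 9.7b.        */
static entry_t *prepend_sharer(int node_id)
{
    entry_t *e = malloc(sizeof *e);
    e->node_id = node_id;
    e->fwd  = head;          /* old head (if any) now follows the new entry  */
    e->back = NULL;          /* the head has no predecessor                  */
    if (head)
        head->back = e;      /* old head sets its backward pointer to the new head */
    head = e;
    return e;
}

/* The head purges every other entry to obtain an exclusive copy (a write). */
static void purge_others(void)
{
    while (head && head->fwd) {
        entry_t *victim = head->fwd;
        head->fwd = victim->fwd;
        if (victim->fwd) victim->fwd->back = head;
        free(victim);        /* an invalidation message would be sent here */
    }
}

int main(void)
{
    prepend_sharer(2);       /* first reader: only-clean entry          */
    prepend_sharer(5);       /* second reader prepends and becomes head */
    purge_others();          /* head purges the list before writing     */
    printf("head of sharing list: node %d\n", head->node_id);
    return 0;
}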
Implementation Issues  SCI was developed to support multiprocessor systems with thousands of processors by providing a coherent distributed-cache image of distributed shared memory and bridges that interface with existing or future buses. It can support various multiprocessor topologies using Omega or crossbar networks.

Differential emitter-coupled logic (ECL) signaling works well at SCI clock rates. The original SCI implementation uses a 16-bit data path at 1 ns per word. The interface is synchronously clocked. Several models of clock distribution are supported. With distributed shared memory and distributed cache coherence protocols, the boundary between multiprocessors and multicomputers has become blurred in MIMD systems of this class.
Processor Consistency  Goodman (1989) introduced the processor consistency (PC) model in which writes issued by each individual processor are always in program order. However, the order of writes from two different processors can be out of program order. In other words, consistency in writes is observed in each processor, but the order of reads from each processor is not restricted as long as they do not involve other processors.

The PC model relaxes the SC model by removing some restrictions on writes from different processors. This opens up more opportunities for write buffering and pipelining. Two conditions related to other processors are required for ensuring processor consistency:

(1) Before a read is allowed to perform with respect to any other processor, all previous read accesses must be performed.
(2) Before a write is allowed to perform with respect to any other processor, all previous read or write accesses must be performed.

These conditions allow reads following a write to bypass the write. To avoid deadlock, the implementation should guarantee that a write that appears previously in program order will eventually be performed.
Release Consistency  One of the most relaxed memory models is the release consistency (RC) model introduced by Gharachorloo et al (1990). Release consistency requires that synchronization accesses in the program be identified and classified as either acquires (e.g. locks) or releases (e.g. unlocks). An acquire is a read operation (which can be part of a read-modify-write) that gains permission to access a set of data, while a release is a write operation that gives away such permission. This information is used to provide flexibility in buffering and pipelining of accesses between synchronization points.

The main advantage of the relaxed models is the potential for increased performance by hiding as much write latency as possible. The main disadvantage is increased hardware complexity and a more complex programming model. Three conditions ensure release consistency:

(1) Before an ordinary read or write access is allowed to perform with respect to any other processor, all previous acquire accesses must be performed.
(2) Before a release access is allowed to perform with respect to any other processor, all previous ordinary load and store accesses must be performed.
(3) Special accesses are processor-consistent with one another.

The ordering restrictions imposed by weak consistency are not present in release consistency. Instead, release consistency requires processor consistency and not sequential consistency.

Release consistency can be satisfied by (i) stalling the processor on an acquire access until it completes, and (ii) delaying the completion of a release access until all previous memory accesses complete. Intuitive definitions of the four memory consistency models, the SC, WC, PC, and RC, are summarized in Fig. 9.8.
[Figure omitted: the four consistency models ordered from strong to relaxed.]

Fig. 9.8 Intuitive definitions of four memory consistency models. The arrows point from strong to relaxed consistencies (Courtesy of Nitzberg and Lo, IEEE Computer, August 1991)
The cost of implementing RC over that for SC arises from the extra hardware cost of providing a lockup-free cache and keeping track of multiple outstanding requests. Although this cost is not negligible, the same hardware features are also required to support prefetching and multiple contexts.
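The acquire/release classification maps directly onto the memory-order annotations of modern languages. The C11 sketch below (using <stdatomic.h> and, where available, C11 threads) illustrates that correspondence rather than the Dash hardware: the producer's ordinary write may be buffered freely but must become visible before the release store, and the consumer's acquire load orders the ordinary read that follows it.

#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

static int        payload;      /* ordinary shared data      */
static atomic_int flag = 0;     /* synchronization variable  */

/* Producer: the ordinary write may be buffered or pipelined, but it must be
 * performed before the release store (condition (2) of release consistency). */
static int producer(void *arg)
{
    (void)arg;
    payload = 42;
    atomic_store_explicit(&flag, 1, memory_order_release);   /* release */
    return 0;
}

/* Consumer: the acquire load must perform before the ordinary read that
 * follows it (condition (1)), so it is guaranteed to observe payload == 42. */
static int consumer(void *arg)
{
    (void)arg;
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)  /* acquire */
        ;                                   /* spin until the release is seen */
    printf("payload = %d\n", payload);
    return 0;
}

int main(void)
{
    thrd_t p, c;
    thrd_create(&c, consumer, NULL);
    thrd_create(&p, producer, NULL);
    thrd_join(p, NULL);
    thrd_join(c, NULL);
    return 0;
}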
Effect of Release Consistency  Figure 9.9 presents the breakdown of execution times under SC and RC for the three applications. The execution times are normalized to those shown in Fig. 9.4 with shared data cached. As can be seen from the results, RC removes all idle time due to write-miss latency.
[Figure omitted: bar chart comparing normalized execution times under SC and RC for MP3D, LU, and PTHOR.]

Fig. 9.9 Effect of relaxing the shared-memory model from sequential consistency (SC) to release consistency (RC) (Courtesy of Gupta et al., Proc. Int. Symp. Comput. Archit., Toronto, Canada, May 1991)
The gains are large in MP3D and PTHOR since the write-miss time constitutes a large portion of the execution time under SC (35 and 20%, respectively), while the gain is small in LU due to the relatively small write-miss time under SC (7%).
Effect of Combining Mechanisms  The effect of combining various latency-hiding mechanisms is illustrated by Fig. 9.10, based on the MP3D benchmark results obtained at Stanford University. The idea of using multiple-context processors will be described in Section 9.2. However, the effect of integrating MC with other latency-hiding mechanisms is presented below.

The busy parts of the execution times in Fig. 9.10 are equal in all combinations. This is the CPU busy time for executing the MP3D program. The idle part in the bar diagram corresponds to memory latency and includes all cache-miss penalties. All the times are normalized with respect to the execution time (100 units) required in a cache-coherent system. The leftmost time bar (with 241 units) corresponds to the worst case of using a private cache exclusively without shared reads or writes. Long overhead is experienced in this case due to excessive cache misses. The use of a cache-coherent system shows a 2.41-fold improvement over the private case. All the remaining cases are assumed to use hardware coherent caches.

The use of release consistency shows a 35% further improvement over the coherent system. The adding of prefetching reduces the time further to 44 units. The best case is the combination of using coherent caches, RC, and multiple contexts (MC). The rightmost time bar is obtained from applying all four mechanisms. The combined results show an overall speedup of 4 to 7 over the case of using private caches.

The above and other uncited benchmark results reported at Stanford suggest that a coherent cache and relaxed consistency uniformly improve performance. The improvements due to prefetching and multiple
contexts are sizable but are much more application-dependent. Combinations of the various latency-hiding mechanisms generally attain a better performance than each one on its own.
[Figure omitted: bar chart of normalized execution times for the MP3D benchmark under successive combinations of mechanisms, from a private-cache system (241 units) down to coherent caches combined with release consistency (RC), prefetching, and multiple contexts (MC); each bar is split into busy and idle time.]

Fig. 9.10 Effect of combining various latency-hiding mechanisms for the MP3D benchmark on a simulated Dash multiprocessor (Courtesy of Gupta, 1991)
PRINCIPLES OF MULTITHREADING

This section considers multithreaded processors and multidimensional system architectures. Only control-flow approaches are described here. Fine-grain machines are studied in Section 9.3, von Neumann multithreading in Section 9.4, and dataflow multithreading in Section 9.5. Recent developments in multithreading support by processor hardware are discussed in Chapters 12 and 13.
Architecture Environment  One possible multithreaded MPP system is modeled by a network of processor (P) and memory (M) nodes as depicted in Fig. 9.11a. The distributed memories form a global address space. Four machine parameters are defined below to analyze the performance of this network:
[Figure omitted: (a) a multithreaded MPP modeled as processor (P) and memory (M) nodes on an interconnection network, annotated with the latency L, the number of interleaved threads N, the context-switching overhead C, and the run length R between switches; (b) the multithreaded computation model, showing a sequential thread, scheduling overhead, parallel threads of computation, intercomputer communication over the distributed memories, and thread synchronization overhead (Courtesy of Gordon Bell, Commun. ACM, August 1992).]

Fig. 9.11 Multithreaded architecture and its computation model for a massively parallel processing system
(1) The latency (L): This is the communication latency on a remote memory access. The value of L includes the network delays, cache-miss penalty, and delays caused by contention in split transactions.

(2) The number of threads (N): This is the number of threads that can be interleaved in each processor. A thread is represented by a context consisting of a program counter, a register set, and the required context status words.

(3) The context-switching overhead (C): This refers to the cycles lost in performing context switching in a processor. This time depends on the switch mechanism and the amount of processor state devoted to maintaining active threads.

(4) The interval between switches (R): This refers to the cycles between switches triggered by remote references. The inverse p = 1/R is called the rate of requests for remote accesses. This reflects a combination of program behavior and memory system design.
In order to increase efficiency, one approach is to reduce the rate of requests by using distributed coherent caches. Another is to eliminate processor waiting through multithreading. The basic concept of multithreading is described below.
Multithreaded Computations  Bell (1992) has described the structure of the multithreaded parallel computation model shown in Fig. 9.11b. The computation starts with a sequential thread (1), followed
by supervisory scheduling (2) where the processors begin threads of computation (3), by intercomputer messages that update variables among the nodes when the computer has a distributed memory (4), and finally by synchronization prior to beginning the next unit of parallel work (5).

The communication overhead period (4) inherent in distributed memory structures is usually distributed throughout the computation and is possibly completely overlapped. Message-passing overhead (send and receive calls) in multicomputers can be reduced by specialized hardware operating in parallel with computation.

Communication bandwidth limits granularity, since a certain amount of data has to be transferred with other nodes in order to complete a computational grain. Message-passing calls (4) and synchronization (5) are nonproductive. Fast mechanisms to reduce or to hide these delays are therefore needed. Multithreading is not capable of speedup in the execution of single threads, while weak ordering or relaxed consistency models are capable of doing this.
Example 9.1 Latency problems for remote loads or synchronizing loads (Rishiyur Nikhil, 1992)

The remote-load situation is illustrated in Fig. 9.12a. Variables A and B are located on nodes N2 and N3, respectively. They need to be brought to node N1 to compute the difference A - B in variable C. The basic computation demands the execution of two remote loads and then the subtraction.
[Figure omitted: (a) the remote-loads problem, in which node N1 issues rload pA to node N2 and rload pB to node N3 and then computes C = A - B; (b) the synchronizing-loads problem, in which A and B are computed concurrently and node N1 must be notified when A and B are ready.]

Fig. 9.12 Two common problems caused by asynchrony and communication latency in massively parallel processors (Courtesy of R. S. Nikhil, Digital Equipment Corporation, 1991)
Let pA and pB be the pointers to A and B, respectively. The two rloads can be issued from the same thread or from two different threads. The context of the computation on N1 is represented by the variable CTXT. It can be a stack pointer, a frame pointer, a current-object pointer, a process identifier, etc. In general, variable names like vA, vB, and C are interpreted relative to CTXT.

In Fig. 9.12b, the idling due to synchronizing loads is illustrated. In this case, A and B are computed by concurrent processes, and we are not sure exactly when they will be ready for node N1 to read. The ready signals (Ready1 and Ready2) may reach node N1 asynchronously. This is a typical situation in the producer-consumer problem. Busy-waiting may result.

The key issue involved in remote loads is how to avoid idling in node N1 during the load operations. The latency caused by remote loads is an architectural property. The latency caused by synchronizing loads also depends on scheduling and the time it takes to compute A and B, which may be much longer than the transit latency. The synchronization latency is often unpredictable, while the remote-load latencies are often predictable.
Multithreading Solutions  One solution to asynchrony problems is to multiplex among many threads: when one thread issues a remote-load request, the processor begins work on another thread, and so on (Fig. 9.13a). Clearly, the cost of thread switching should be much smaller than the latency of the remote load, or else the processor might as well wait for the remote load's response.

As the internode latency increases, more threads are needed to hide it effectively. Another concern is to make sure that messages carry continuations. Suppose, after issuing a remote load from thread T1 (Fig. 9.13a), we switch to thread T2, which also issues a remote load. The responses may not return in the same order. This may be caused by requests traveling different distances, through varying degrees of congestion, to destination nodes whose loads differ greatly, etc.

One way to cope with the problem is to associate each remote load and response with an identifier for the appropriate thread, so that it can be re-enabled on the arrival of a response. These thread identifiers are referred to as continuations on messages. A large continuation name space should be provided to name an adequate number of threads waiting for remote responses.

The size of the hardware-supported continuation name space varies greatly in different system designs: from 1 in the Dash, 4 in the Alewife, 64 in the HEP, and 1024 in the Tera (Section 9.4) to the local memory address space in the Monsoon, Hybrid Dataflow/von Neumann, MDP (Section 9.3), and *T (Section 9.5). Of course, if the hardware-supported name space is small, one can always virtualize it by multiplexing in software, but this has an associated overhead.
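A minimal sketch of the continuation idea follows: every outstanding remote load carries the identifier of the thread that issued it, and the response re-enables exactly that thread regardless of the order in which replies arrive. The thread table, the tag format, and the function names are invented for illustration and do not correspond to any particular machine.

#include <stdio.h>
#include <stdbool.h>

#define MAX_THREADS 8     /* size of the continuation name space (assumed) */

typedef struct {
    bool ready;           /* can this thread be scheduled again?  */
    int  value;           /* value delivered by the last response */
} thread_t;

static thread_t threads[MAX_THREADS];

/* Issue a split-phase remote load: mark the thread blocked and return the
 * tag (the continuation) that will travel with the outgoing request.     */
static int issue_rload(int tid)
{
    threads[tid].ready = false;
    printf("thread T%d issues a remote load and is switched out\n", tid);
    return tid;
}

/* A response arrives, possibly out of order: the continuation tag tells us
 * which thread to re-enable, independent of the order of the replies.    */
static void deliver_response(int tag, int value)
{
    threads[tag].value = value;
    threads[tag].ready = true;
    printf("response for T%d arrives (value %d): thread re-enabled\n", tag, value);
}

int main(void)
{
    int t1 = issue_rload(1);     /* T1 issues rload pA           */
    int t2 = issue_rload(2);     /* T2 issues rload pB           */
    deliver_response(t2, 7);     /* replies return out of order  */
    deliver_response(t1, 3);
    return 0;
}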
Distributed Caching  The concept of distributed caching is shown in Fig. 9.13b. Every memory location has an owner node. For example, N1 owns B and N2 owns A. The directories are used to contain import-export lists and state whether the data is shared (for reads, many caches may hold copies) or exclusive (for writes, one cache holds the current value).

The directories multiplex among a small number of contexts to cover the cache loading effects. The MIT Alewife, KSR-1, and Stanford Dash have implemented directory-based coherence protocols. It should be noted that distributed caching offers a solution for the remote-loads problem, but not for the synchronizing-
loads problem. Multithreading offers a solution for remote loads and possibly for synchronizing loads. However, the two approaches can be combined to solve both types of remote-access problems.
[Figure omitted: (a) multithreading, in which a thread on node N1 issues a remote load and the processor switches to another context until the response returns; (b) distributed caching, in which each location has an owner node and the directories record import/export lists with shared or exclusive states.]

Fig. 9.13 Two solutions for overcoming the asynchrony problem (Courtesy of R. S. Nikhil, Digital Equipment Corporation, 1991)
The Enhanced Processor Model  A conventional single-thread processor will wait during a remote reference, so we may say it is idle for a period of time L. A multithreaded processor, as modeled in Fig. 9.14a, will suspend the current context and switch to another, so after some fixed number of cycles it will again be busy doing useful work, even though the remote reference is outstanding. Only if all the contexts are suspended (blocked) will the processor be idle.

Clearly, the objective is to maximize the fraction of time that the processor is busy, so we will use the efficiency of the processor as our performance index, given by
Efficiency = busy / (busy + switching + idle)    (9.1)

where busy, switching, and idle represent the amount of time, measured over some large interval, that the processor is in the corresponding state. The basic idea behind a multithreaded machine is to interleave the execution of several contexts in order to dramatically reduce the value of idle, but without overly increasing the magnitude of switching.
The state of a processor is determined by the disposition of the various contexts on the processor. During its lifetime, a context cycles through the following states: ready, running, leaving, and blocked. There can be at most one context running or leaving. A processor is busy if there is a context in the running state; it is switching while making the transition from one context to another, i.e. when a context is leaving. Otherwise, all contexts are blocked and we say the processor is idle.

A running context keeps the processor busy until it issues an operation that requires a context switch. The context then spends C cycles in the leaving state, then goes into the blocked state for L cycles, and finally re-enters the ready state. Eventually the processor will choose it and the cycle will start again.
The abstract model shown in Fig. 9.14a assumes one thread per context, and each context is represented by its own program counter (PC), register set, and process status word (PSW). An example multithreaded processor in which three thread slots (N = 3) are provided is shown in Fig. 9.14b.
[Figure omitted: (a) an abstract multithreaded processor model with N contexts, each holding its own PC, register set, and PSW, and a context selector feeding the execution pipeline; (b) a three-thread processor example with per-thread instruction queue units, a shared instruction cache and fetch unit, multiple functional units (ALUs, barrel shifter, integer multiplier, FP adder, FP multiplier, FP converter, load/store units), a data cache, and queue registers.]

Fig. 9.14 A multithreaded processor model and a three-thread processor example (Courtesy of H. Hirata et al., Proc. 19th Int. Symp. Comput. Archit., Australia, May 1992)
An instruction queue unit has a buffer which saves some instructions succeeding the instruction indicated by the program counter. The buffer size needs to be at least B = N × C words, where N is the number of thread slots and C is the number of cycles required to access the instruction cache.

An instruction fetch unit fetches at most B instructions for one thread every C cycles from the instruction cache and attempts to fill the buffers in the instruction queue unit. This fetching operation is done in an interleaved fashion for multiple threads. So, on the average, the buffer in one instruction queue unit is filled once in B cycles.

When one of the threads encounters a branch instruction, however, that thread can preempt the prefetching operation. The instruction cache and fetch unit might become a bottleneck for a processor with many thread slots. In such cases, a bigger and/or faster cache and another fetch unit would be needed.
Processor Efficiencies  A single-thread processor executes a context until a remote reference is issued (R cycles) and then is idle until the reference completes (L cycles). There is no context switch and obviously no switch overhead. We can model this behavior as an alternating renewal process having a cycle of R + L. In terms of Eq. 9.1, R and L correspond to the amount of time during a cycle that the processor is busy and idle, respectively. Thus the efficiency of a single-threaded machine is given by

E1 = R / (R + L) = 1 / (1 + L/R)    (9.2)
This shows clearly the performance degradation of such a processor in a parallel system with a large memory latency.

With multiple contexts, memory latency can be hidden by switching to a new context, but we assume that the switch takes C cycles of overhead. Assuming the run length between switches is constant, with a sufficient number of contexts there is always a context ready to execute when a switch occurs, so the processor is never idle. The processor efficiency is analyzed below under two different conditions as illustrated in Fig. 9.15.
[Figure omitted: (a) a snapshot of context switching in the saturation region, where a ready context is always available after the C-cycle switch; (b) a snapshot in the linear region, where the processor idles between contexts; (c) processor efficiency plotted against the number of contexts, rising linearly and then flattening at saturation.]

Fig. 9.15 Context switching and processor efficiency as a function of the number of contexts (Courtesy of Rafael Saavedra, 1992)
(1) Saturation region. In this saturated region, the processor operates with maximum utilization. The cycle of the renewal process in this case is R + C, and the efficiency is simply

Esat = R / (R + C) = 1 / (1 + C/R)    (9.3)
Observe that the efficiency in saturation is independent of the latency and also does not change with a further increase in the number of contexts.

Saturation is achieved when the time the processor spends servicing the other threads exceeds the time required to process a request, i.e., when (N - 1)(R + C) > L. This gives the saturation point, under constant run length, as
Nsat = L / (R + C) + 1    (9.4)
(2) Linear region. When the number of contexts is below the saturation point, there may be no ready contexts after a context switch, so the processor will experience idle cycles. The time required to switch to a ready context, execute it until a remote reference is issued, and process the reference is equal to R + C + L. Assuming N is below the saturation point, during this time all the other contexts have a turn in the processor. Thus, the efficiency is given by

Elin = N R / (R + C + L)    (9.5)

Observe that the efficiency increases linearly with the number of contexts until the saturation point is reached and beyond that remains constant. The equation for Elin gives the fundamental limit on the efficiency of a multithreaded processor and underlines the importance of the ratio C/R. Unless the context switch is extremely cheap, the remote reference rate must be kept low.
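The two regimes of Eqs. 9.3 through 9.5 can be folded into one small routine. The C sketch below simply evaluates the formulas for assumed values of R, L, and C; it is not the Markov model cited below.

#include <stdio.h>

/* Multithreaded processor efficiency for N contexts, run length R, remote
 * latency L, and context-switch overhead C (all in cycles), following
 * Eqs. 9.3-9.5: linear below the saturation point and flat above it.    */
static double efficiency(double N, double R, double L, double C)
{
    double n_sat = L / (R + C) + 1.0;          /* Eq. 9.4: saturation point  */
    if (N >= n_sat)
        return R / (R + C);                    /* Eq. 9.3: saturation region */
    return N * R / (R + C + L);                /* Eq. 9.5: linear region     */
}

int main(void)
{
    double R = 16.0, L = 128.0, C = 4.0;       /* illustrative values only */
    for (int N = 1; N <= 12; N++)
        printf("N = %2d  efficiency = %.2f\n", N, efficiency(N, R, L, C));
    /* Single-threaded efficiency, Eq. 9.2, for comparison: */
    printf("single-threaded (Eq. 9.2): %.2f\n", R / (R + L));
    return 0;
}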
Figures 9.15a and 9.15b show snapshots of context switching in the saturation and linear regions, respectively. The processor efficiency is plotted as a function of the number of contexts in Fig. 9.15c.

In Fig. 9.16, the processor efficiency is plotted as a function of the memory latency L with an average run length R = 16 cycles. The C = 0 curve corresponds to zero switching overhead. With C = 16 cycles, about 50% efficiency can be achieved. These results are based on a Markov model of multithreaded architecture by Saavedra (1992). It should be noted that multithreading increases both processor efficiency and network traffic. Tradeoffs do exist between these two opposing goals, and this has been discussed in a paper by Agarwal (1992).
[Fig. 9.16 omitted: processor efficiency plotted against memory latency L for switching overheads C = 0, 1, 4, and 16 cycles, with (a) two contexts per processor and (b) six contexts per processor.]
[Figure omitted: examples of multiprocessor architectures organized along one and two dimensions; 1-D ring examples include the Maryland Zmob, the CDC Cyberplus, and the KSR-1, followed by 2-D mesh examples.]
Two-dimensional meshes were adopted in the Stanford Dash, the MIT Alewife, the Wisconsin Multicube, the Intel Paragon, and the Caltech Mosaic C. A three-dimensional mesh/torus was implemented in the MIT J-Machine, the Tera computer, and in the Cray/MPP architecture, called T3D. The USC orthogonal multiprocessor (OMP) could be extended to higher dimensions. However, it becomes more difficult to build higher-dimensional architectures with conventional circuit boards.

Instead of using hierarchical buses or switched network architectures in one dimension, multiprocessor architectures can be extended to a higher dimensionality or multiplicity along each dimension. The concepts are described below for the two- and three-dimensional meshes proposed for the Multicube and OMP architectures, respectively.
The Wisconsin Multicube  This architecture was proposed by Goodman and Woest (1988) at the University of Wisconsin. It employed a snooping cache system over a grid of buses, as shown in Fig. 9.18a. Each processor was connected to a multilevel cache.
[Figure omitted: (a) the Wisconsin Multicube, with processors and two-level (processor and snooping) caches on a grid of row and column buses and memory attached to the column buses; (b) the orthogonal multiprocessor (OMP) architecture, with n processors accessing an n × n mesh of interleaved memory modules over row and column buses under a memory controller; (c) an OMP(3,4) configuration, in which 16 processors orthogonally access 64 memory modules over spanning buses.]

Fig. 9.18 The Multicube and orthogonal multiprocessor architectures (Courtesy of Goodman and Woest, 1988, and of Hwang et al., 1989)
The first-level cache, called the processor cache, was a high-performance SRAM cache designed with the traditional goal of minimizing memory latency. A second-level cache, referred to as the snooping cache, was a very large cache designed to minimize bus traffic.

Each snooping cache monitored two buses, a row bus and a column bus, in order to maintain data consistency among the snooping caches. Consistency between the two cache levels was maintained by using a write-through strategy to ensure that the processor cache is always a strict subset of the snooping cache. The main memory was divided up among the column buses. All processors tied to the same column shared the same home memory. The row buses were used for intercolumn communication and cache coherence control.

The proposed architecture was an example of a new class of interconnection topologies, the multicube, consisting of N = n^k processors, where each processor was connected to k buses and each bus was connected to n processors. The hypercube is a special case where n = 2. The Wisconsin Multicube was a two-dimensional multicube (k = 2), where n scaled to about 32, resulting in a proposed system of over 1000 processors.
The Orthogonal Multiprocessor  In the proposed OMP architecture (Fig. 9.18b), n processors simultaneously access n rows or n columns of interleaved memory modules. The n × n memory mesh is interleaved in both dimensions. In other words, each row is n-way interleaved and so is each column of memory modules. There are 2n logical buses spanning in two orthogonal directions.

The synchronized row access or column access must be performed exclusively. In fact, the row bus Ri and the column bus Ci can be the same physical bus because only one of the two will be used at a time. The memory controller (MC) in Fig. 9.18b synchronizes the row access and column access of the shared memory.

The OMP architecture supports special-purpose computations in which data sets can be regularly arranged as matrices. Simulated performance results obtained at USC verified the effectiveness of using an OMP in matrix algebraic computations or in image processing operations.

In Fig. 9.18b, each of the memory modules Mij is shared by two processors Pi and Pj. In other words, the physical address space of processor Pi covers the ith row or the ith column of the memory mesh. The OMP is well suited for SPMD operations, in which n processors are synchronized at the memory-access level when data sets are vectorized in matrix format.
Multidimensional Extensions  The above OMP architecture can be generalized to higher dimensions. A generalized orthogonal multiprocessor is denoted as an OMP(n, k), where n is the dimension and k is the multiplicity. There are p = k^(n-1) processors and m = k^n memory modules in the system, where p ≥ n and p ≥ k. The system uses p memory buses, each spanning into n dimensions. But only one dimension is used in a given memory cycle. There are k memory modules attached to each spanning bus.

Each module is connected to n out of p buses through an n-way switch. It should be noted that the dimension n corresponds to the number of accessible ports that each memory module has. This implies that each module is shared by n out of p = k^(n-1) processors. For example, the architecture of an OMP(3,4) is shown in Fig. 9.18c, where the circles represent memory modules, the squares processor modules, and the circles inside squares computer modules.

The 16 processors orthogonally access 64 memory modules via 16 buses, each spanning into three directions, called the x-access, y-access, and z-access, respectively. Various sizes of OMP architecture for different values of n and k are given in Table 9.2. A five-dimensional OMP with multiplicity k = 16 has 64K processors.
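The processor and memory-module counts follow directly from the two formulas above; the short C program below evaluates p = k^(n-1) and m = k^n for a few assumed configurations and reproduces the 64K-processor figure quoted for n = 5, k = 16.

#include <stdio.h>

/* Integer power helper for computing OMP(n, k) sizes. */
static unsigned long ipow(unsigned long base, unsigned exp)
{
    unsigned long r = 1;
    while (exp--) r *= base;
    return r;
}

int main(void)
{
    /* Sample (n, k) pairs; any values could be substituted here. */
    struct { unsigned n, k; } cfg[] = { {2, 4}, {3, 4}, {5, 16} };
    for (unsigned i = 0; i < 3; i++) {
        unsigned long p = ipow(cfg[i].k, cfg[i].n - 1);   /* processors     */
        unsigned long m = ipow(cfg[i].k, cfg[i].n);       /* memory modules */
        printf("OMP(%u,%u): %lu processors, %lu memory modules\n",
               cfg[i].n, cfg[i].k, p, m);
    }
    /* OMP(3,4) gives 16 processors and 64 modules; OMP(5,16) gives 65536 (64K) processors. */
    return 0;
}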
FINE-GRAIN MULTICOMPUTERS

Traditionally, shared-memory multiprocessors like the Cray Y-MP were used to perform coarse-grain computations in which each processor executed programs having tasks of a few seconds or longer. Message-passing multicomputers are used to execute medium-grain programs with approximately 10-ms task size as in the iPSC/1. In order to build MPP systems, we may have to explore a higher degree of parallelism by making the task grain size even smaller.

Fine-grain parallelism was utilized in SIMD or data-parallel computers like the CM-2 or on the message-driven J-Machine and Mosaic C to be described below. We first characterize fine-grain parallelism and discuss the network architectures proposed for such systems. Special attention is paid to the efficient hardware or software mechanisms developed for achieving fine-grain MIMD computation.
Latency Analysis  The computing granularity and communication latency of leading early examples of multiprocessors, data-parallel computers, and medium- and fine-grain multicomputers are summarized in Table 9.3. These table entries summarize what we have learned in Chapters 7 and 8. Four attributes are identified to characterize these machines. Only typical values for a typical program mix are shown. The intention is to show the order of magnitude in these entries.

The communication latency Tc measures the data or message transfer time on a system interconnect. This corresponds to the shared-memory access time on the Cray Y-MP, the time required to send a 32-bit value across the hypercube network in the CM-2, and the network latency on the iPSC/1 or J-Machine. The synchronization overhead Ts is the processing time required on a processor, or on a PE, or on a processing node of a multicomputer for the purpose of synchronization.

The sum Tc + Ts gives the total time required for IPC. The shared-memory Cray Y-MP had a short Tc but a long Ts. The SIMD machine CM-2 had a short Ts but a long Tc. The long latency of the iPSC/1 made it unattractive based on fast-advancing standards. The MIT J-Machine was designed to make a major improvement in both of these communication delays.
Fine-Grain Parallelism  The grain size Tg is measured by the execution time of a typical program, including both computing time and communication time involved. Supercomputers handle large grains. Both the CM-2 and the J-Machine were designed as fine-grain machines. The iPSC/1 was a relatively medium-grain machine compared with the rest.

Large grain implies lower concurrency or a lower DOP (degree of parallelism). Fine grain leads to a much higher DOP and also to higher communication overhead. SIMD machines used hardwired synchronization and massive parallelism to overcome the problems of long network latency and slow processor speed. Fine-grain multicomputers, like the J-Machine and Caltech Mosaic, were designed to lower both the grain size and the communication overhead compared to those of traditional multicomputers.
Table 9.3 Fine-Grain, Medium-Grain, and Coarse-Grain Machine Characteristics of Some Example Systems

Cray Y-MP: communication latency Tc = 40 ns via shared memory; synchronization overhead Ts = 20 μs; grain size Tg = 20 s; concurrency (DOP) 2-16; a coarse-grain supercomputer.

Connection Machine CM-2: Tc = 600 μs per 32-bit transfer; Ts = 125 ns per bit-slice operation in lock step; Tg = 4 μs per 32-bit result per PE instruction; DOP 4K-64K; fine-grain data parallelism.

Intel iPSC/1: Tc = 5 ms; Ts = 500 μs; Tg = 10 ms; DOP 8-128; a medium-grain multicomputer.

MIT J-Machine: Tc = 2 μs; Ts = 1 μs; Tg = 5 μs; DOP 1K-64K; a fine-grain multicomputer.
The MDP Design  The MDP chip included a processor, a 4096-word by 36-bit memory, and a built-in router with network ports as shown in Fig. 9.19. An on-chip memory controller with error checking and correction (ECC) capability permitted local memory to be expanded to 1 million words by adding external DRAM chips. The processor was message-driven in the sense that it executed functions in response to messages, via the dispatch mechanism. No receive instruction was needed.
[Figure omitted: (a) the MDP as a component with a memory port, six two-way network ports, and a diagnostic port; (b) the MDP chip floor plan; (c) the internal blocks, including the microprocessor (prefetch, control, register file, ALU), the address arithmetic unit, the routers with network input and output interfaces, the external-memory interface, and the diagnostic interface.]

Fig. 9.19 The message-driven processor (MDP) architecture (Courtesy of W. Dally et al.; reprinted with permission from IEEE Micro, April 1992)
The MDP created a task to handle each arriving message. Messages carrying these tasks drove each computation. The MDP was a general-purpose multicomputer processing node that provided the communication, synchronization, and global naming mechanisms required to efficiently support fine-grain, concurrent programming models. The grain size was as small as 8-word objects or 20-instruction tasks. As we have seen, fine-grain programs typically execute from 10 to 100 instructions between communication and synchronization actions.

MDP chips provided inexpensive processing nodes with plentiful VLSI commodity parts to construct the Jellybean Machine (J-Machine) multicomputer. As shown in Fig. 9.19a, the MDP appeared as a component with a memory port, six two-way network ports, and a diagnostic port.

The memory port provided a direct interface to up to 1M words of ECC DRAM, consisting of 11 multiplexed address lines, a 12-bit data bus, and 3 control signals. Prototype J-Machines used three 1M × 4 static-column DRAMs to form a four-chip processing node with 262,144 words of memory. The DRAMs cycled three times to access a 36-bit data word and a fourth time to check or update the ECC check bits.

The network ports connected MDPs together in a three-dimensional mesh network. Each of the six ports corresponded to one of the six cardinal directions (+x, -x, +y, -y, +z, -z) and consisted of nine data and six control lines. Each port connected directly to the opposite port on an adjacent MDP.

The diagnostic port could issue supervisory commands and read and write MDP memory from a console processor (host). Using this port, a host could read or write at any location in the MDP's address space, as well as reset, interrupt, halt, or single-step the processor. The MDP chip floor plan is shown in Fig. 9.19b.
Figure 9.19c shows the components built inside the MDP chip. The chip included a conventional microprocessor with prefetch, control, register file and ALU (RALU), and memory blocks. The network communication subsystem comprised the routers and the network input and output interfaces. The address arithmetic unit (AAU) provided addressing functions. The MDP also included a DRAM interface, a control clock, and a diagnostic interface.
Instruction-Set Architecture  The MDP extended a conventional microprocessor instruction-set architecture with instructions to support parallel processing. The instruction set contained fixed-format, three-address instructions. Two 17-bit instructions fit into each 36-bit word, with 2 bits reserved for type checking.

Separate register sets were provided to support rapid switching among three execution levels: background, priority 0 (P0), and priority 1 (P1). The MDP executed at the background level while no message created a task, and initiated execution upon message arrival at the P0 or P1 level depending on the message priority.

The P1 level had higher priority than the P0 level. The register set at each priority level included four GPRs, four address registers, four ID registers, and one instruction pointer (IP). The ID registers were not used in the background register set.
Communication Support  The MDP provided hardware support for end-to-end message delivery including formatting, injection, delivery, buffer allocation, buffering, and task scheduling. An MDP transmitted a message using a series of SEND instructions, each of which injected one or two words into the network at either priority 0 or 1.

Consider the following MDP assembly code for sending a four-word message using three variants of the SEND instruction.
SEND Rtl,tl ; send net address (priority 0)
SEND2 R1,R2,U ; header and receiver [priority 0)
SENDZE R3-,[3 ,A3],0 ; selector and continuation end message [priority 0)
The first SEND instruction reads the absolute address of the destination node in <X, Y, Z> format from
R0 and forwards it to the network hardware. The SEND2 instruction reads the first two words of the message
out of registers R1 and R2 and enqueues them for transmission. The final instruction enqueues two additional
words of data, one from R3 and one from memory. The use of the SEND2E instruction marks the end of the
message and causes it to be transmitted into the network.
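As a rough illustration of these semantics, the following Python sketch (not MDP code) models how the three SEND variants accumulate words and inject the finished message into the network; the register contents and the queue model are invented purely for illustration.

    # Illustrative model of SEND / SEND2 / SEND2E message injection (assumptions,
    # not MDP hardware behavior).
    class SendUnit:
        def __init__(self):
            self.queue = []        # words enqueued for the message being built
            self.injected = []     # finished messages handed to the network

        def send(self, word):            # SEND: enqueue one word
            self.queue.append(word)

        def send2(self, w1, w2):         # SEND2: enqueue two words
            self.queue += [w1, w2]

        def send2e(self, w1, w2):        # SEND2E: two words, then end of message
            self.queue += [w1, w2]
            self.injected.append(self.queue)   # message is now transmitted
            self.queue = []

    regs = {"R0": (1, 5, 2), "R1": "header", "R2": "receiver", "R3": "selector"}
    mem_word = "continuation"            # stands in for the [3,A3] memory operand

    net = SendUnit()
    net.send(regs["R0"])                 # SEND   R0,0   (routing address <X,Y,Z>)
    net.send2(regs["R1"], regs["R2"])    # SEND2  R1,R2,0
    net.send2e(regs["R3"], mem_word)     # SEND2E R3,[3,A3],0
    print(net.injected)                  # one message: address word plus four data words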
Tl1e J-Machine was a three-dimensional mesh with two-way channels, dimension-order routing. and
blocking flow control (Fig. 9.20). The Faces of the network cube were open for use as l/D ports to the
machine. Each channel could sustain a data rate of 233 Mbps {million bits per second}. All three dimensions
could operate simultaneously for an aggregate data rate of 36-4 Mbps per node.
Fig. 9.20 E-cube routing from node (1, 5, 2) to node (5, 1, 3) on a 6-ary 3-cube
Message Format and Routing  The J-Machine used deterministic dimension-order E-cube routing. As
shown in Fig. 9.20, all messages routed first in the x-dimension, then in the y-dimension, and then in the
z-dimension. Since messages routed in dimension order and messages running in opposite directions along
the same dimension cannot block, resource cycles were thus avoided, making the network provably deadlock-
free.
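The following short Python sketch illustrates the dimension-order rule described above on a 3-D mesh; the coordinates used are example values, not J-Machine parameters.

    # Minimal sketch of dimension-order (E-cube) routing: correct x first, then y, then z.
    def dimension_order_route(src, dst):
        """Return the hop-by-hop path from src to dst, one dimension at a time."""
        path, cur = [src], list(src)
        for dim in range(3):                          # x, then y, then z
            step = 1 if dst[dim] > cur[dim] else -1
            while cur[dim] != dst[dim]:
                cur[dim] += step
                path.append(tuple(cur))
        return path

    print(dimension_order_route((1, 5, 2), (5, 1, 3)))
    # Because every message corrects x before y before z, two messages travelling in
    # opposite directions along a dimension can never wait on each other, so the
    # resource-dependence graph contains no cycles.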
The MDP supported a broad range of parallel programming models, including shared memory, data-
parallel, dataflow, actor, and explicit message passing, by providing a low-overhead primitive mechanism for
communication, synchronization, and naming.
Its communication mechanisms permitted a user-level task on one node to send a message to any other
node in a 4096-node machine in less than 2 μs. This process did not consume any processing resources on
intermediate nodes, and it automatically allocated buffer memory on the receiving node. On message arrival,
the receiving node created and dispatched a task in less than 1 μs.
Presence tags provided synchronization on all storage locations. Three separate register sets allowed fast
context switching. A translation mechanism maintained bindings between arbitrary names and values and
supported a global virtual address space. These mechanisms were selected to be general and amenable to
efficient hardware implementation. The J-Machine used wormhole routing and blocking flow control. A
combining-tree approach was used for synchronization.
The Router Design  The routers formed the switches in a J-Machine network and delivered messages
to their destinations. As shown in Fig. 9.21a, the MDP contained three independent routers, one for each
bidirectional dimension of the network.
Each router contained two separate virtual networks with different priorities that shared the same physical
channels. The priority-1 network could preempt the wires even if the priority-0 network was congested or
jammed. The priority levels supported multi-threaded operations.
Each of the router data paths contained buffers, comparators, and output arbitration (Fig. 9.21). On each
data path, a comparator compared the head flit, which contained the destination address in that dimension, to
the node coordinate. If the head flit did not match, the message continued in the current direction. Otherwise
the message was routed to the next dimension.
A message entering the dimension competed with messages continuing in the dimension at a two-to-
one switch. Once a message was granted this switch, all other input was locked out for the duration of the
message. Once the head flit of the message had set up the route, subsequent flits followed directly behind it.
Fig. 9.21 Priority control and dimension-order router design in the MDP chip (Courtesy of W. Dally et al.; reprinted with permission from IEEE Micro, April 1992)
Two priorities of messages shared the physical wires but used completely separate buffers and routing
logic. This allowed priority-1 messages to proceed through blockages at priority 0. Without this ability, the
system would not be able to redistribute data that caused hot spots in the network.
Synchronization  The MDP synchronized using message dispatch and presence tags on all storage. Because
each message arrival dispatched a process, messages could signal events on remote nodes. For example, in
the following combining-tree example, each COMBINE message signals its own arrival and initiates the
COMBINE routine.
In response to an arriving message, the processor may set presence tags for task synchronization. For
example, access to the value produced by the combining tree may be synchronized by initially tagging as
empty the location that will hold this value. An attempt to read this location before the combining tree has
written it will raise an exception and suspend the reading task until the root of the tree writes the value.
Example 9.4 Using a combining tree for synchronization of events (W. Dally et al., 1992)
A combining tree is shown in Fig. 9.22. This tree sums results produced by a distributed computation. Each
node sums the input values as they arrive and then passes a result message to its parent.
Fig. 9.22 A combining tree for internode communication or synchronization (Courtesy of W. Dally et al., 1992)
A pair of SEND instructions was used to send the COMBINE message to a node. Upon message arrival,
the MDP buffered the message and created a task to execute the following COMBINE routine written in
MDP assembly code:
COMBINE: MOVE [1,A3], COMB          ; get node pointer from message
         MOVE [2,A3], R1            ; get value from message
         ADD  R1, COMB.VALUE, R1
         MOVE R1, COMB.VALUE        ; store result
         MOVE COMB.COUNT, R2        ; get Count
         ADD  R2, -1, R2
         MOVE R2, COMB.COUNT        ; store decremented Count
         BNZ  R2, DONE
         MOVE HEADER, R0            ; get message header
         SEND2 COMB.PARENT_NODE, R0 ; send message to parent
         SEND2E COMB.PARENT, R1     ; with value
DONE:    SUSPEND
If the node was idle, execution of this routine began three cycles after message arrival. The routine loaded
the combining-node pointer and value from the message, performed the required add and decrement, and, if
Count reached zero, sent a message to its parent.
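A compact Python sketch of the same combining-tree logic follows; the tree shape and input values are invented for illustration, and message passing is modeled as a direct method call rather than a network message.

    # Illustrative combining tree: each node adds an arriving value to its partial
    # sum, decrements a count, and forwards the combined value to its parent only
    # when the count reaches zero (mirroring the COMBINE routine above).
    class CombineNode:
        def __init__(self, expected, parent=None):
            self.value, self.count, self.parent = 0, expected, parent

        def combine(self, v):                 # plays the role of a COMBINE message
            self.value += v
            self.count -= 1
            if self.count == 0 and self.parent is not None:
                self.parent.combine(self.value)   # "send message to parent"

    root = CombineNode(expected=2)
    left, right = CombineNode(2, root), CombineNode(2, root)
    for leaf_value in (3, 4):
        left.combine(leaf_value)
    for leaf_value in (5, 6):
        right.combine(leaf_value)
    print(root.value)   # 18: the sum produced by the distributed computation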
Research Issues  The J-Machine was an exploratory research project. Rather than being specialized for
a single model of computation, the MDP incorporated primitive mechanisms for efficient communication,
synchronization, and naming. The machine was used as a platform for software experiments in fine-grain
parallel programming.
Reducing the grain size of a program increases both the potential speedup due to parallel execution and
the potential overhead associated with parallelism. Special hardware mechanisms for reducing the overhead
due to communication, process switching, synchronization, and multi-threading were therefore central to
the design of the MDP. Software issues such as load balancing, scheduling, and locality also remained open
questions.
The MIT research group led by Dally implemented two languages on the J-Machine: the actor language
Concurrent Smalltalk and the dataflow language Id. The machine's mechanisms also supported dataflow
and object-oriented programming models using a global name space. The use of a few simple mechanisms
provided orders of magnitude lower communication and synchronization overhead than was possible with
multicomputers built from then available off-the-shelf microprocessors.
From Cosmic Cube to Mosaic C  The evolution from the Cosmic Cube to the Mosaic is an example of
one type of scaling track in which advances in technology are employed to reimplement nodes of a similar
logical complexity but which are faster and smaller, have lower power, and are less expensive. The progress
in microelectronics over the preceding decade was such that Mosaic nodes were about 60 times faster, used
about 20 times less power, were about 100 times smaller, and were (in constant dollars) about 25 times less
expensive to manufacture than Cosmic Cube nodes.
Fig. 9.23 The Caltech Mosaic architecture (Courtesy of C. Seitz, 1991)
Each Mosaic node included 64 Kbytes of memory, an 11-MIPS processor, a packet interface, and a
router. The nodes were tied together with a 60-Mbytes/s, two-dimensional routing-mesh network (Fig. 9.23).
The compilation-based programming system allowed fine-grain reactive-process message-passing programs
to be expressed in C+-, an extension of C++, and the run-time system performed automatic distributed
management of system resources.
Mosaic C Node  The Mosaic C multicomputer node was a single 9.25 mm × 10.00 mm chip fabricated in
a 1.2-μm-feature-size, two-level-metal CMOS process. At 5-V operation, the synchronous parts of the chip
operated with large margins at a 30-MHz clock rate, and the chip dissipated about 0.5 W.
The processor also included two program counters and two sets of general-purpose registers to allow
zero-time context switching between user programs and message handling. Thus, when the packet interface
received a complete packet, received the header of a packet, completed the sending of a packet, exhausted the
allocated space for receiving packets, or any of several other events that could be selected, it could interrupt
the processor by switching it instantly to the message-handling context.
Instead of several hundred instructions for handling a packet, the Mosaic typically required only about 10
instructions. The number of clock cycles for the message-handling routines could be reduced to insignificance
by placing them in hardware, but the Caltech group chose the more flexible software mechanism so that they
could experiment with different message-handling strategies.
Mosaic C 8 × 8 Mesh Boards  The choice of a two-dimensional mesh for the Mosaic was based on a 1989
engineering analysis; originally, a three-dimensional mesh network was planned. But the mutual fit of the
two-dimensional mesh network and the circuit board medium provided high packaging density and allowed
the high-speed signals between the routers to be conveyed on shorter wires.
Sixty-four Mosaic chips were packaged by tape-automated bonding (TAB) in an 8 × 8 array on a circuit
board. These boards allowed the construction of arbitrarily large, two-dimensional arrays of nodes using
stacking connectors. This style of packaging was meant to demonstrate some of the density, scaling, and
testing advantages of mesh-connected systems. Host-interface boards were also used to connect the Mosaic
arrays and workstations.
Applications and Future Trends  Charles Seitz determined that the most profitable niche and scaling
track for the multicomputer, a highly scalable and economical MIMD architecture, was the fine-grain
multicomputer. The Mosaic C demonstrated many of the advantages of this architecture, but the major part
of the Mosaic experiment was to explore the programmability and application span of this class of machine.
The Mosaic may be taken as the origin of two scaling tracks: (1) Single-chip nodes are a technologically
attractive point in the design space of multicomputers. Constant-node-size scaling results in single-chip
nodes of increasing memory size, processing capability, and communication bandwidth in larger systems
than centralized shared-memory multiprocessors. (2) It was also forecast that constant-node-complexity
scaling would allow a Mosaic 8 × 8 board to be implemented as a single chip, with about 20 times the
performance per node, within 10 years. In this context, see also the discussion in Chapter 13.
A 16K-node machine was constructed at Caltech to explore the programmability and application span
of the Mosaic C architecture for large-scale computing problems. For the loosely coupled computations in
which it excels, a multicomputer can be more economically implemented as a network of high-performance
workstations connected by a high-bandwidth local-area network. In fact, the Mosaic components and
programming tools were used by a USC Information Sciences Institute project (led by Danny Cohen, 1992) to
implement a 400-Mbit/s ATOMIC local-area network for this purpose.
The Prototype Architecture  A high-level organization of the Dash architecture was illustrated in Fig.
9.1 when we studied the various latency-hiding techniques. The Dash prototype is illustrated in Fig. 9.24.
It incorporated up to 64 MIPS R3000/R3010 microprocessors with 16 clusters of 4 PEs each. The cluster
hardware was modified from Silicon Graphics 4D/340 nodes with new directory and reply controller boards
as depicted in Fig. 9.24a.
The interconnection network among the 16 multiprocessor clusters was a pair of wormhole-routed mesh
networks. The channel width was 16 bits with a 50-ns fall-through time and a 35-ns cycle time. One mesh
network was used to request remote memory, and the other was a reply mesh as depicted in Fig. 9.24b, where
the small squares at mesh intersections are the 5 × 5 mesh routers.
The Dash designers claimed scalability for the Dash approach. Although the prototype was limited to
at most 16 clusters (a 4 × 4 mesh), due to the limited physical memory addressability (256 Mbytes) of the
4D/340 system, the system was scalable to support hundreds to thousands of processors.
To use the 4D/340 in the Dash, the Stanford team made minor modifications to the existing system boards
and designed a pair of new boards to support the directory memory and intercluster interface. The main
modification to the existing boards was to add a bus retry signal, to be used when a request required service
from a remote cluster.
The central bus arbiter was modified to accept a mask from the directory. The mask held off a processor's
retry until the remote request was serviced. This effectively created a split-transaction bus protocol for
requests requiring remote service.
The new directory controller boards contained the directory memory, the intercluster coherence state
machines and buffers, and a local section of the global interconnection network. The directory logic was
split between the two logic boards along the lines of the logic used for outbound and inbound portions of
intercluster transactions.
[Figure: (a) a node cluster built from a modified Silicon Graphics Power Station 4D/340 with four MIPS R3000 processors (33 MHz), a snoopy bus, and globally addressed memory; (b) two wormhole-routed 2D meshes (request and reply) with 120-MB/s links connecting the node clusters]
Fig. 9.24 The Stanford Dash prototype system (Courtesy of D. Lenoski et al., Proc. Int. Symp. Computer Architecture, Australia, May 1992)
The mesh networks supported a scalable local and global memory bandwidth. The single-address space
with coherent caches permitted incremental porting or tuning of applications, and exploited temporal and
spatial locality. Other factors contributing to improved performance included mechanisms for reducing and
tolerating latency, and well-designed I/O capabilities.
Dash Memory Hierarchy  Dash implemented an invalidation-based cache coherence protocol. A memory
location could be in one of three states: uncached-remote, shared-remote, or dirty-remote.
The directory kept the summary information for each memory block, specifying its state and the clusters
cacheing it. The Dash memory system could be logically broken into four levels of hierarchy, as illustrated
in Fig. 9.25c.
The first level was the processor cache, which was designed to match the processor speed and support
snooping from the bus. It took only one clock to access the processor cache. A request that could not be
serviced by the processor cache was sent to the local cluster. The prototype allowed 30 processor clocks to
access the local cluster. This level included the other processors' caches within the requesting processor's
cluster.
Otherwise, the request was sent to the home cluster level. The home level consisted of the cluster that
contained the directory and physical memory for a given memory address. It took 100 processor clocks to
access the directory at the home level. For many accesses (for instance, most private data references), the
local and home cluster were the same, and the hierarchy collapsed to three levels. In general, however, a
request would travel through the interconnection network to the home cluster.
The home cluster could usually satisfy the request immediately, but if the directory entry was in a dirty
state, or in a shared state when the requesting processor requested exclusive access, the fourth level had to
be accessed. The remote cluster level for a memory block consisted of the clusters marked by the directory
as holding a copy of the block. It took 135 processor clocks to access processor caches in remote clusters in
the prototype design.
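As a back-of-the-envelope illustration of what this hierarchy implies, the Python sketch below weights the four latencies quoted above by assumed hit fractions to estimate the average access time; the fractions are purely illustrative assumptions, not Dash measurements.

    # Weighted average memory access time across the four Dash hierarchy levels.
    latency = {"processor_cache": 1, "local_cluster": 30,
               "home_cluster": 100, "remote_cluster": 135}      # processor clocks
    hit_fraction = {"processor_cache": 0.90, "local_cluster": 0.05,
                    "home_cluster": 0.04, "remote_cluster": 0.01}  # assumed mix

    avg = sum(latency[level] * hit_fraction[level] for level in latency)
    print(f"average access time ~ {avg:.1f} processor clocks")     # ~ 7.8 clocks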
The Directory Protocol  The directory memory relieved the processor caches of snooping on memory
requests by keeping track of which caches held each memory block. In the home node, there was a directory
entry per block frame. Each entry contained one presence bit per processor cache. In addition, a state bit
indicated whether the block was uncached, shared in multiple caches, or held exclusively by one cache (i.e.
whether the block was dirty).
Using the state and presence bits, the memory could tell which caches needed to be invalidated when a
location was written. Likewise, the directory indicated whether the memory copy of the block was up-to-date
or which cache held the most recent copy.
By using the directory memory, a node writing a location could send point-to-point invalidation or update
messages to the processors actually cacheing that block. This is in contrast to the invalidating broadcast
required by the snoopy protocol. The scalability of the Dash depended on this ability to avoid broadcasts.
Another important attribute of a directory-based protocol is that it does not depend on any specific
interconnection network topology. As a result, the designer can readily use any of the low-latency scalable
networks, such as meshes or hypercubes, that were originally developed for message-passing machines.
Example 9.5 Cache coherence protocol using distributed directories in the Dash multiprocessor (Daniel Lenoski and John Hennessy et al., 1992)
Figure 9.25a illustrates the flow of a read request to remote memory with the directory in a dirty-remote
state. The read request is forwarded to the owning dirty cluster. The owning cluster sends out two messages
in response to the read. A message containing the data is sent directly to the requesting cluster, and a sharing
writeback request is sent to the home cluster. The sharing writeback request writes the cache block back to
memory and also updates the directory.
[Figure panels: (a) Read of dirty remote cache block; (b) Write to shared remote cache block]
Fig. 9.25 Two examples of a directory-based cache coherence protocol in the Dash (Courtesy of Lenoski and Hennessy, 1992)
This protocol reduces latency by permitting the dirty cluster to respond directly to the requesting cluster.
In addition, this forwarding strategy allows the directory controller to simultaneously process many requests
(i.e. to be multithreaded) without the added complexity of maintaining the state of outstanding requests.
Serialization is reduced to the time of a single intercluster bus transaction. The only resource held while
intercluster messages are being sent is a single entry in the originating cluster's remote-access cache.
Figure 9.25b shows the corresponding sequence for a write operation that requires remote service. The
invalidation-based protocol requires the processor (actually the write buffer) to acquire exclusive ownership
of the cache block before completing the store. Thus, if a write is made to a block that the processor does not
have cached, or only has cached in a shared state, the processor issues a read-exclusive request on the local
bus.
In this case, no other cache holds the block entry dirty in the local cluster, so a RdEx Request (message
1) is sent to the home cluster. As before, a remote-access cache entry is allocated in the local cluster. At the
home cluster, the pseudo-CPU issues the read-exclusive request to the bus. The directory indicates that the
line is in the shared state. This results in the directory controller sending a RdEx Reply (message 2a) to the
local cluster and invalidation requests (Inv-Req, message 2b) to the sharing cluster.
The home cluster owns the block, so it can immediately update the directory to the dirty state, indicating
that the local cluster now holds an exclusive copy of the memory line. The RdEx Reply message is received
in the local cluster by the reply controller, which can then satisfy the read-exclusive request.
To ensure consistency at release points, however, the remote-access cache entry is deallocated only when
it receives the number of invalidate acknowledgments (Inv-Ack, message 3) equal to an invalidation count
sent in the original reply message.
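A simplified Python sketch of the directory bookkeeping described in this example is given below; it tracks presence bits and a dirty flag per block and returns the invalidation count on a read-exclusive request. The class and cluster numbers are illustrative assumptions, and the actual Dash message traffic, caches, and pseudo-CPU are elided.

    # Toy directory entry: one presence bit per cluster plus a dirty indication.
    class DirectoryEntry:
        def __init__(self, n_clusters):
            self.presence = [False] * n_clusters   # which clusters cache the block
            self.dirty = False                     # held exclusively by one cluster

        def read(self, requester):
            if self.dirty:                 # forward to the owning dirty cluster, which
                self.dirty = False         # supplies the data and shares it back home
            self.presence[requester] = True

        def read_exclusive(self, requester):
            sharers = [c for c, p in enumerate(self.presence) if p and c != requester]
            self.presence = [False] * len(self.presence)
            self.presence[requester] = True
            self.dirty = True
            return len(sharers)            # invalidation count carried in the RdEx reply

    d = DirectoryEntry(n_clusters=16)
    d.read(3); d.read(7)                   # two clusters come to share the block
    print(d.read_exclusive(5))             # 2 Inv-Acks expected before the requesting
                                           # cluster's remote-access cache entry is freed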
The Dash prototype with 64 nodes was rather small in size. If each processor had a five-issue superscalar
operation with a 100-MHz clock, an extended machine with 2K nodes would have the potential to become a
system with 1 tera operations per second, with higher performance at higher clock rates.
This demands an integrated implementation with lower overhead in the scalable directory structure. A
three-dimensional torus network was considered with 16-bit data paths, a 20-ns fall-through delay, and a
4-ns cycle time. The access time ratio among the four levels of memory hierarchy was to be approximately
1 : 5 : 16 : 80 : 120, where 1 corresponds to one processor clock. The larger version of DASH was not implemented;
however, the concept of distributed directory-based cache coherence was validated.
[Figure: search engines and ALLCACHE routers/directories (ARD) connected by unidirectional slotted rings; each Ring:0 connects 8-32 nodes, and a higher-level search engine ring interconnects the Ring:0 groups]
Fig. 9.26 The KSR-1 architecture with a slotted ring for communication (Courtesy of Kendall Square Research Corporation, 1991)
[Figure: the two-level KSR ring hierarchy — a Ring:1 interconnecting multiple Ring:0 groups, with a local cache and local cache directory at each node; a request circulates the rings until a responding processor supplies the data]
Each node comprised a primary cache, acting as a 32-Mbyte primary memory, and a 64-bit superscalar
processor with roughly the same performance as an IBM RS/6000 operating at the same clock rate. The
superscalar processors containing 64 floating-point and 32 fixed-point registers of 64 bits were designed for
both scalar and vector operations.
For example, 16 elements could be prefetched at one time. A processor also had a 0.5-Mbyte subcache
supplying 20 million accesses per second to the processor (a computational efficiency of 0.5). A processor
operated at 20 MHz and was fabricated in 1.2-μm CMOS.
The processor, without caches, contained 3.9 million transistors on 15 types of 12 custom chips. Three-
quarters of each processor consisted of the search engine responsible for migrating data to and from other
nodes, for maintaining memory coherence throughout the system using distributed directories, and for ring
control.
The ALLCACHE Memory  The KSR-1 eliminated the memory hierarchy found in conventional computers
and the corresponding physical memory addressing overhead. Instead, it offered a single-level memory,
called ALLCACHE by KSR designers. This ALLCACHE design represented the confluence of cache and
shared virtual memory concepts that exploit locality required by scalable distributed computing. Each local
cache had a capacity of 32 Mbytes (2^25 bytes). The global virtual address space had 2^40 bytes.
Bell (1992) considered the KSR machine the most likely blueprint for future scalable MPP systems. This
was a revolutionary architecture and thus was more controversial when it was first introduced in 1991. The
architecture provided size (including I/O) and generation scalability in that every node was identical, and it
offered an efficient environment for both arbitrary workloads and sequential to parallel processing through a
large hardware-supported address space with an unlimited number of processors.
Programming Model  The KSR machine provided a strict sequentially consistent programming model
and dynamic management of memory through hardware migration and replication of data throughout the
distributed processor memory nodes using its ALLCACHE mechanism.
With sequential consistency, every processor returns the latest value of a written variable, and the results of
an execution on multiple processors appear as some interleaving of operations of individual nodes when executed
on a multithreaded machine. With ALLCACHE, an address became a name, and this name automatically
migrated throughout the system and was associated with a processor in a cache-like fashion as needed.
Copies of a given cell were made by the hardware and sent to other nodes to reduce access time. A
processor could prefetch data into a local cache and post-store data for other cells. The hardware was designed
to exploit spatial and temporal locality.
For example, in the SPMD programming model, copies of the program moved dynamically and were
cached in each of the operating nodes' primary and processor caches. Data such as elements of a matrix
moved to the nodes as required simply by accessing the data, and the processor had instructions to prefetch
data to the processor's registers. When a processor wrote to an address, all cells were updated and thus
memory coherence was maintained. Data movement occurred in subpages of 128 bytes of the 16K pages.
Environment and Performance  Every known form of parallelism was supported via the KSR's Mach-
based operating system. Multiple users could run multiple sessions comprising multiple applications or
multiple processes (each with an independent address space), each of which might consist of multiple threads
of control running and simultaneously sharing a common address space. Message passing was supported by
pointer passing in the shared memory to avoid data copying and enhance performance.
The KSR also provided a commercial programming environment for transaction processing that accessed
relational databases in parallel with unlimited scalability as an alternative to multicomputers formed from
multiprocessor mainframes. A 1K-node system provided almost two orders of magnitude more processing
power, primary memory, I/O bandwidth, and mass storage capacity than a multiprocessor mainframe available
at that time.
For example, unlike other contemporary candidates, a 1088-node system could be configured with
15.3 terabytes of disk memory, providing 500 times the capacity of its main memory. The 32- and 320-node
systems were designed to deliver over 1000 and 10,000 transactions per second, respectively, giving them
over 100 times the throughput of a multiprocessor mainframe available at the time.
With rapid advances in VLSI and interconnect technologies, the mid-1990s saw a major shakeout in
the supercomputer business. Kendall Square Research, the developers of the KSR-1 and its sequel KSR-2
systems, were forced to exit from the hardware business during that period. As in the case of other innovative
and pioneering attempts at the development of parallel computer architectures, knowledge gained from the
KSR development was also useful in the design and development of MPP computer systems of subsequent
generations. Our next case study on MPP systems will also bring out clearly this important point.
[Figure panels: (a) a 3D toroidal mesh (16 × 16 × 16) with X-, Y-, and Z-links; (b) a sparse 4 × 4 × 4 torus with X-links and Y-links missing on alternate Z-layers, respectively]
Fig. 9.28 The Tera multiprocessor and its three-dimensional sparse torus architecture shown with a 4 × 4 × 4 configuration (Courtesy of Tera Computer Company, 1991)
Second, it was important that the architecture be applicable to a wide spectrum of problems. Programs that
do not vectorize well, perhaps because of a preponderance of scalar operations or too frequent conditional
branches, will execute efficiently as long as there is sufficient parallelism to keep the processors busy.
Virtually any parallelism applicable in the total computational workload can be turned into speed, from
operation-level parallelism within program basic blocks to multiuser time and space sharing.
A third goal was ease of compiler implementation. Although the instruction set did have a few unusual
features, they did not pose unduly difficult problems for the code generator. There were no register or
memory addressing constraints and only three addressing modes. Condition code setting was consistent and
orthogonal.
Because the architecture permitted free exchange of spatial and temporal locality for parallelism, a highly
optimizing compiler could improve locality and trade the parallelism thereby saved for more speed. On the
other hand, if there was sufficient parallelism, the compiler could exploit it efficiently.
The Sparse Three-Dimensional Torus  The interconnection network was a three-dimensional sparsely
populated torus (Fig. 9.28b) of pipelined packet-switching nodes, each of which was linked to some of its
neighbors. Each link could transport a packet containing source and destination addresses, an operation, and
64 data bits in both directions simultaneously on every clock tick. Some of the nodes were also linked to
resources, i.e. processors, data memory units, I/O processors, and I/O cache units.
Instead of locating the processors on one side of the network and the memories on the other (a "dance hall"
configuration), the resources were distributed more-or-less uniformly throughout the network. This permitted
data to be placed in memory units near the appropriate processor when possible, and otherwise generally
maximized the distance between possibly interfering resources.
The interconnection network of one 256-processor Tera system contained 4096 nodes arranged in a 16 ×
16 × 16 toroidal mesh; i.e. the mesh "wrapped around" in all three dimensions. Of the 4096 nodes, 1280 were
attached to the resources, i.e. the 256 processors, 512 data memory units, 256 I/O processors, and 256 I/O
cache units. The 2816 remaining nodes did not have resources attached but still provided message bandwidth.
To increase node performance, some of the links were omitted. If the three directions are named x, y, and
z, then x-links and y-links were omitted on alternate z-layers (Fig. 9.28b). This reduces the node degree from
6 to 4, or from 7 to 5, counting the resource link. In spite of its missing links, the bandwidth of the network
was very large.
Any plane bisecting the network crossed at least 256 links, giving the network a data bisection bandwidth
of one 64-bit data word per processor per tick in each direction. This bandwidth was needed to support
shared-memory addressing in the event that all 256 processors addressed memory on the other side of some
bisecting plane simultaneously.
As the Tera architecture scaled to larger numbers of processors p, the number of network nodes grew as
p^(3/2) rather than as the p log p associated with the more commonly used multistage networks. To see this, we
first assume that memory latency is fully masked by parallelism only when the number of messages being
routed by the network is at least p × l, where l is the (round-trip) latency. Since messages occupy volume,
the network must have a volume proportional to p × l; since the speed of light is finite, the volume is also
proportional to l^3, and therefore l is proportional to p^(1/2) rather than log p.
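The algebra behind this claim can be written out in a few lines; the LaTeX fragment below is just a restatement of the argument above, not additional material from the Tera designers.

    % Volume must accommodate p*l in-flight messages, and volume grows as the
    % cube of the network's linear extent (hence of its latency l):
    \[
      \text{volume} \propto p\,l, \qquad \text{volume} \propto l^{3}
      \;\Longrightarrow\; l^{3} \propto p\,l
      \;\Longrightarrow\; l \propto p^{1/2}
      \;\Longrightarrow\; \text{nodes} \propto p\,l \propto p^{3/2}.
    \]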
Pipelined Support  Each processor in a Tera computer could execute multiple instruction streams (threads)
simultaneously. In the initial implementation, as few as 1 or as many as 128 program counters could be active
at once. On every tick of the clock, the processor logic selected a ready-to-execute thread and allowed it to
issue its next instruction. Since instruction interpretation was completely pipelined by the processor and
by the network and memories as well (Fig. 9.29), a new instruction from a different thread could be issued
during each tick without interfering with its predecessors.
When an instruction finished, the thread to which it belonged became ready to execute the next instruction.
As long as there were enough threads in the processor so that the average instruction latency was filled with
instructions from other threads, the processor was fully utilized. Thus, it was only necessary to have enough
threads to hide the expected latency (perhaps 70 ticks on average); once latency was hidden, the processor
would run at peak performance and additional threads would not speed the result.
If a thread were not allowed to issue its next instruction until the previous instruction completed, then
approximately 70 different threads would be required on each processor to hide the expected latency. The
lookahead described later allowed threads to issue multiple instructions in parallel, thereby reducing the
number of threads needed to achieve peak performance.
As seen in Fig. 9.29, three operations could be executed simultaneously per instruction per processor. The
M-pipeline was for memory-access operations, the A-pipeline for arithmetic operations, and the C-pipeline
for control or arithmetic operations. The instructions were 64 bits wide. If more than one operation in an
instruction specified the same register or setting of condition codes, the priority was M > A > C.
[Fig. 9.29 The Tera instruction pipeline: instruction fetch issuing to the M-, A-, and C-pipelines, with register write-back and the interconnection network/memory pipeline]
It was estimated that a peak speed of 1G operations per second could be achieved per processor if driven
by a 333-MHz clock. However, a particular thread would not exceed about 100M operations per second
because of interleaved execution. The processor pipeline was rather deep, about 70 ticks, as compared with
8 ticks in the earlier HEP pipeline.
Thread State and Management  Figure 9.30 shows that each thread had the following state associated
with it: one 64-bit stream status word (SSW), thirty-two 64-bit general-purpose registers (R0 through R31),
and eight 64-bit target registers (T0 through T7).
Fig. 9.30 The thread management scheme used in the Tera computer (Courtesy of Tera Computer Company, 1992)
Context switching was so rapid that the processor had no time to swap the processor-resident thread state.
Instead, it had 128 of everything, i.e. 128 SSWs, 4096 general-purpose registers, and 1024 target registers. It
is appropriate to compare these registers in both quantity and function to vector registers or words of caches
in other architectures. In all three cases, the objective is to improve locality and avoid reloading data.
Program addresses were 32 bits in length. Each thread's current program counter (PC) was located in
the lower half of its SSW. The upper half described various modes (e.g. floating-point rounding, lookahead
disable), the trap disable mask (e.g. data alignment, floating overflow), and the four most recently generated
condition codes.
Most operations had a TEST variant which emitted a condition code; and branch operations could
examine any subset of the last four condition codes emitted and branch appropriately. Also associated with
each thread were thirty-two 64-bit general-purpose registers. Register R0 was special in that it read as 0 and
output to it was discarded. Otherwise, all general-purpose registers were identical.
The target registers were used as branch targets. The format of the target registers was identical to that of
the SSW, though most control transfer operations used only the low 32 bits to determine a new PC. Separating
the determination of the branch target address from the decision to branch allowed the hardware to prefetch
instructions at the branch targets, thus avoiding delay when the branch decision was made. Using target
registers also made branch operations smaller, resulting in tighter loops. There were also skip operations
which obviated the need to set targets for short forward branches.
One target register (T0) pointed to the trap handler, which was nominally an unprivileged program. When
a trap occurred, the effect was as if a coroutine call to T0 had been executed. This made trap handling
extremely lightweight and independent of the operating system. Trap handlers could be changed by the user
to achieve specific trap capabilities and priorities without loss of efficiency.
Explicit-Dependence Lookahead  If there were enough threads executing on each processor to hide the
pipeline latency (about 70 ticks), then the machine would run at peak performance. However, if each thread
could execute some of its instructions in parallel (e.g. two successive loads), then fewer threads and parallel
activities would be required to achieve peak performance.
The obvious solution was to introduce instruction lookahead; the difficulty was that the traditional
register reservation approach requires far too much scoreboard bandwidth in this kind of architecture. Either
multithreading or horizontal instructions alone would preclude scoreboarding.
The Tera architecture used a new technique called explicit-dependence lookahead. Each instruction
contained a 3-bit lookahead field that explicitly specified how many instructions from this thread would be
issued before encountering an instruction that depended on the current one. Since seven was the maximum
possible lookahead value, at most 8 instructions and 24 operations could be concurrently executing from each
thread.
A thread was ready to issue a new instruction when all instructions with lookahead values referring to the
new instruction had completed. Thus, if each thread maintained a lookahead of seven, then nine threads were
needed to hide 72 ticks of latency.
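The arithmetic behind these thread counts is simple; the following Python snippet is just an illustrative check of the numbers quoted above, not Tera documentation.

    # With a lookahead of L, each thread keeps L+1 instructions in flight, so
    # hiding a latency of T ticks needs ceil(T / (L + 1)) threads.
    from math import ceil

    def threads_needed(latency_ticks, lookahead):
        return ceil(latency_ticks / (lookahead + 1))

    print(threads_needed(72, 7))   # 9 threads, as quoted in the text
    print(threads_needed(70, 0))   # about 70 threads with no lookahead at all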
Lookahead across one or more branch operations was handled by specifying the minimum of all distances
involved. The variant branch operations JUMP_OFTEN and JUMP_SELDOM, for high- and low-probability
branches, respectively, facilitated optimization by providing a barrier to lookahead along the less likely path.
There were also SKIP_OFTEN and SKIP_SELDOM operations. The overall approach was conceptually sim-
ilar to exposed-pipeline lookahead except that the quanta were instructions instead of ticks.
Advantages and Drawbacks  The Tera used multiple contexts to hide latency. The machine performed a
context switch every clock cycle. Both pipeline latency and memory latency were hidden in the HEP/Tera
approach. The major focus was on latency tolerance rather than latency reduction.
With 128 contexts per processor, a large number (2K) of registers must be shared finely between threads.
Thread creation must be very cheap (a few clock cycles). Tagged memory and registers with full/empty
bits were used for synchronization. As long as there was plenty of parallelism in user programs to hide
latency and plenty of compiler support, the performance was potentially very high.
However, these Tera advantages were embedded in a number of potential drawbacks. The performance
must be bad for limited parallelism, such as guaranteed low single-context performance. A large number of
contexts (threads) demanded lots of registers and other hardware resources, which in turn implied higher cost
and complexity. Finally, the limited focus on latency reduction and cacheing entailed lots of slack parallelism
to hide latency as well as lots of memory bandwidth; both required a higher cost for building the machine.
In the year 1996, the independent company Cray Research, Inc. founded by Seymour Cray merged with the
high-performance graphics workstation producer Silicon Graphics, Inc. (SGI); Cray Research then became a
business division of SGI. In the year 2000, Tera Computer Company, originators and developers of the Tera
MTA massively parallel system which we have studied in this section, took over Cray Research. The merged
company was named Cray, Inc., and it is in active operation today (see www.cray.com). Cray has continued
with the development of the MTA architecture, as we shall review in Chapter 13.
Dataflow Graphs  We have seen a dataflow graph in Fig. 2.13. Dataflow graphs can be used as a machine
language in dataflow computers. Another example of a dataflow graph (Fig. 9.31a) is given below.
[Figure panels: (a) a dataflow graph as a machine language — the cos x graph with constant divisors 2, 24, and 720; (b) the evolution of dataflow machine projects, from the MIT Tagged-Token Dataflow Architecture and the Manchester Dataflow machine, through the explicit token store machines (ETL Sigma-1, MIT/Motorola Monsoon, ETL EM-4), to the MIT/Motorola *T]
Fig. 9.31 An example dataflow graph and dataflow machine projects
Example 9.7 The dataflow graph for the calculation of cos x (Arvind, 1991)
This dataflow graph shows how to obtain an approximation of cos x by the following power series
computation:

cos x ≈ 1 − x²/2! + x⁴/4! − x⁶/6! = 1 − x²/2 + x⁴/24 − x⁶/720        (9.6)
The corresponding dataflow graph consists of nine operators (actors or nodes). The edges in the graph
interconnect the operator nodes. The successive powers of x are obtained by repeated multiplications. The
constants (divisors) are fed into the nodes directly. All intermediate results are forwarded among the nodes.
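An illustrative Python rendering of this nine-operator graph follows; each assignment plays the role of a node firing once its operands are available, with intermediate powers of x forwarded to the later nodes.

    # Nine "node firings" mirroring Eq. 9.6: three multiplies, three divides by
    # the constants 2, 24 and 720, and three add/subtract operations.
    import math

    def cos_approx(x):
        x2 = x * x          # multiply node: x^2
        x4 = x2 * x2        # multiply node: x^4
        x6 = x4 * x2        # multiply node: x^6
        t1 = x2 / 2         # divide nodes with constant divisors
        t2 = x4 / 24
        t3 = x6 / 720
        return 1 - t1 + t2 - t3   # final add/subtract nodes

    print(cos_approx(0.5), math.cos(0.5))   # 0.87758..., close to the true value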
Static versus Dynamic Dataflow  Static dataflow computers simply disallow more than one token to
reside on any one arc, which is enforced by the firing rule: A node is enabled as soon as tokens are present
on all input arcs and there is no token on any of its output arcs. Jack Dennis proposed the very first static
dataflow computer in 1974.
The static firing rule is difficult to implement in hardware. Special feedback acknowledge signals are
needed to secure the correct token passing between producing nodes and consuming nodes. Also, the static
rule makes it very inefficient to process arrays of data. The number of acknowledge signals can grow too fast
to be supported by hardware.
However, static dataflow inspired the development of dynamic dataflow computers, which were researched
vigorously at MIT and in Japan. In a dynamic architecture, each data token is tagged with a context descriptor,
called a tagged token. The firing rule of tagged-token dataflow is changed to: A node is enabled as soon as
tokens with identical tags are present at each of its input arcs.
With tagged tokens, tag matching becomes necessary. Special hardware mechanisms are needed to achieve
this. In the rest of this section, we discuss only dynamic dataflow computers. Arvind of MIT pioneered the
development of tagged-token architecture for dynamic dataflow computers.
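The tagged-token firing rule stated above can be mimicked in a few lines of Python; the node, tag, and arc names below are invented purely for illustration and stand in for the hardware tag-matching store.

    # A node fires only when tokens with identical tags are present on all inputs.
    from collections import defaultdict

    class Node:
        def __init__(self, n_inputs, op):
            self.n_inputs, self.op = n_inputs, op
            self.waiting = defaultdict(dict)          # tag -> {arc: value}

        def receive(self, tag, arc, value):
            self.waiting[tag][arc] = value
            if len(self.waiting[tag]) == self.n_inputs:   # all operands matched
                operands = self.waiting.pop(tag)
                return self.op(*(operands[a] for a in sorted(operands)))
            return None                                    # still waiting for a match

    add = Node(2, lambda a, b: a + b)
    print(add.receive(tag=("loop", 1), arc=0, value=10))   # None: one operand only
    print(add.receive(tag=("loop", 2), arc=0, value=99))   # None: different tag
    print(add.receive(tag=("loop", 1), arc=1, value=32))   # 42: tags matched, node fires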
Although data dependence does exist in dataflow graphs, it does not force unnecessary sequentialization,
and dataflow computers schedule instructions according to the availability of the operands. Conceptually,
"token"-carrying values flow along the edges of the graph. Values or tokens may be memory locations.
Each instruction waits for tokens on all inputs, consumes input tokens, computes output values based on
input values, and produces tokens on outputs. No further restriction on instruction ordering is imposed. No
side effects are produced by the execution of instructions in a dataflow computer. Both dataflow graphs and
machines implement only functional languages.
Pure Dataflow Machines  Figure 9.31b shows the evolution of dataflow computers. The MIT tagged-
token dataflow architecture (TTDA) (Arvind et al., 1983), the Manchester Dataflow Computer (Gurd and
Watson, 1982), and the ETL Sigma-1 (Hiraki and Shimada, 1987) were all pure dataflow computers. The
TTDA was simulated but never built. The Manchester machine was actually built and became operational in
mid-1982. It operated asynchronously using a separate clock for each processing element with a performance
comparable to that of the VAX/780.
The ETL Sigma-1 was developed at the Electrotechnical Laboratory, Tsukuba, Japan. It consisted of 128
PEs fully synchronous with a 10-MHz clock. It implemented the I-structure memory proposed by Arvind.
The full configuration became operational in 1987 and achieved a 170-Mflops performance. The major
problem in using the Sigma-1 was the lack of a high-level language for users.
Explicit Token Store Machines  These were successors to the pure dataflow machines. The basic idea is to
eliminate associative token matching. The waiting token memory is directly addressed, with the use of full/
empty bits. This idea was used in the MIT/Motorola Monsoon (Papadopoulos and Culler, 1988) and in the
ETL EM-4 system (Sakai et al., 1989).
Multithreading was supported in Monsoon using multiple register sets. Thread-based programming was
conceptually introduced in Monsoon. The maximum configuration built consisted of eight processors and
eight I-structure memory modules using an 8 × 8 crossbar network. It became operational in 1991.
EM-4 was an extension of the Sigma-1. It was designed for 1024 nodes, but only an 80-node prototype
became operational in 1990. The prototype achieved 815 MIPS in an 80 × 80 matrix multiplication benchmark.
We will study the details of EM-4 in Section 9.5.2.
Hybrid and Unified Architectures  These are architectures combining positive features from the von
Neumann and dataflow architectures. The best research examples include the MIT P-RISC (Nikhil and
Arvind, 1988), the IBM Empire (Iannucci et al., 1991), and the MIT/Motorola *T (Nikhil, Papadopoulos,
Arvind, and Greiner, 1991).
P-RISC was a "RISC-ified" dataflow architecture. It allowed tighter encodings of the dataflow graphs
and produced longer threads for better performance. This was achieved by splitting "complex" dataflow
instructions into separate "simple" component instructions that could be composed by the compiler. It
used traditional instruction sequencing. It performed all intraprocessor communication via memory and
implemented "joins" explicitly using memory locations.
P-RISC replaced some of the dataflow synchronization with conventional program counter-based
synchronization. IBM Empire was a von Neumann/dataflow hybrid architecture under development at IBM,
based on the thesis of Iannucci (1988). The *T was a later effort at MIT joining both the dataflow and von
Neumann ideas, to be discussed in Section 9.5.3.
The Node Architecture  The internal design of the processor chip and of the node memory are shown
in Fig. 9.32b. The processor chip communicated with the network through a 3 × 3 crossbar switch unit.
The processor and its memory were interfaced with a memory control unit. The memory was used to hold
programs (template segments) as well as tokens (operand segments, heaps, or frames) waiting to be fetched.
The processor consisted of six component units. The input buffer was used as a token store with a capacity
of 32 words. The fetch-match unit fetched tokens from the memory and performed tag-matching operations
among the tokens fetched in. Instructions were directly fetched from the memory through the memory
controller.
The heart of the processor was the execution unit, which fetched instructions until the end of a thread.
Instructions with matching tokens were executed. Instructions could emit tokens or write to registers.
Instructions were fetched continually using traditional sequencing (PC + 1 or branch) until a "stop" flag was
raised to indicate the end of a thread. Then another pair of tokens was accepted. Each instruction in a thread
specified the two sources for the next instruction in the thread.
[Figure: EM-4 nodes (EMC-R processor plus memory) interconnected by an Omega network; within each node, the memory holds programs (template segments) and tokens (operand segments, heaps/frames) with present bits, while the processor comprises an input buffer (token store/waiting queue), a fetch-match unit, an instruction fetch/control unit, an execution unit, and a 3 × 3 crossbar switching unit to the network]
Fig. 9.32 The ETL EM-4 dataflow architecture (Courtesy of S. Sakai, Y. Yamaguchi et al., Electrotechnical Laboratory, Tsukuba, Japan, 1991)
The same idea was used as in Monsoon for token matching, but with different encoding. All data tokens
were 32 bits, and instruction words were 33 bits. EM-4 supported remote loads and synchronizing loads. The
full/empty bits present in memory words were used to synchronize remote loads associated with different
threads.
[Figure panel (c): internal node architecture with a network interface unit, 64-MB node memory and memory controller, an MC 88110 data processor with a message coprocessor, and a synchronization coprocessor (sP)]
Fig. 9.33 The MIT/Motorola *T prototype multithreaded architecture (Courtesy of Nikhil, Papadopoulos, and Arvind, Proc. 19th Int. Symp. Computer Architecture, Australia, May 1992)
Research Experiments  The *T prototype was used to test the effectiveness of the unified architecture
in supporting multithreading operations. The development of *T was influenced by other multithreaded
architectures, including Tera, Alewife, and J-Machine.
The I-structure semantics was also implemented in *T. Full/empty bits were used on producer-
consumer variables. *T treated messages as virtual continuations. Thus busy-waiting was eliminated. Other
optimizations in *T included speculative avoidance of the extra loads and stores through multithreading and
coherent cacheing.
The *T designers wanted to provide a superset of the capabilities of Tera, J-Machine, and EM-4. Compiler
techniques developed for these machines were expected to be applicable to *T. To achieve these goals, a
promising approach was to start with declarative languages while the compiler could aim to extract a large
amount of fine-grain parallelism.
Multithreading in Perspective  The Dash, KSR-1, and Alewife leveraged existing processor technology.
The advantages of these directory-based cacheing systems include compatibility with existing hardware and
software. But they offer a less aggressive pursuit of parallelism and depend heavily on compilers to obtain
locality. The synchronizing loads are still problematic in these distributed cacheing solutions.
In von Neumann multithreading approaches, the HEP/Tera replicated the conventional instruction stream.
Synchronizing-load problems were solved by a hardware trap and software. Hybrid architectures, such as
Empire, replicated conventional instruction streams, but they did not preserve registers across threads. The
synchronizing loads were entirely supported in hardware. J-Machine supported three instruction streams
(priorities). It grew out of message-passing machines but added support for global addressing. Remote
synchronizing loads were supported by software convention.
In the dataflow approaches, the system-level view has stayed constant from the Tagged-Token Dataflow
Architecture to the *T. The various designs differ in internal node architecture, with trends toward the
removal of intra-node synchronization, using longer threads, high-speed registers, and compatibility with
existing machine codes. The *T designers claimed that the unification of dataflow and von Neumann ideas
would support a scalable shared-memory programming model using existing SIMD/SPMD codes.
Summary
Computer systems have always operated with processors having much faster cycle times than main
memories. With steady advances in VLSI technology over the years, both processors and main memories
have become faster, but the relative speed mismatch between them has in fact widened over the years.
Latency hiding techniques are therefore devised to allow processors to operate at high efficiency in spite of
having to access slower memories from time to time; use of cache memories is a common latency hiding
technique. In the context of Massively Parallel Processing (MPP) systems, other technical challenges also
confront system designers in minimizing the impact of memory access latencies.
In this chapter, we studied some basic latency hiding techniques applicable to such systems, namely:
shared virtual memory with some specific examples; prefetching techniques and their effectiveness; and
the use of distributed coherent caches. Scalable Coherent Interface (SCI) provides cache coherence with
distributed directories and sharing lists. We studied several relaxed memory consistency models which
can permit greater exploitation of parallelism in applications; the impact of relaxed consistency models
while running three specific applications was presented.
Principles of multi-threading were introduced, with specific attention paid to the technical factors
relevant to system design, namely: communication latency on remote access, number of threads, context-
switching overhead, and the interval between context switches. Multiple-context processors have been
designed to provide hardware support for single-cycle context switching. Possible context-switching
policies were studied, along with their impact on system efficiency. Multidimensional architectures were
reviewed as a possible platform for multi-threaded systems.
Fine-grain multicomputers are specially designed to provide efficient support for fine-grain
parallelism in applications. The MIT J-Machine was studied from the points of view of its overall
system design, its Message-Driven Processor (MDP) and instruction set architecture, and the message
format and routing employed in its 3-dimensional mesh. The design goal of the Caltech Mosaic C system
was to exploit the advances which had taken place in VLSI and packaging technologies; we studied
the basic node design with its two contexts (for user program and message handler), and the basic
8 × 8 mesh design employed in the system.
In the category of scalable multithreaded architectures, the Stanford Dash multiprocessor system
utilized directory-based cache coherence in a single address-space distributed memory system. The Kendall
Square Research KSR-1 system employed a cache-only memory design with a ring-based interconnect.
The Tera multiprocessor system relied for its performance on a large degree of multi-threading and
aggressive use of pipelining throughout the system, with a sparse 3-dimensional torus interconnect.
We also studied the basic concepts and evolution of dataflow and hybrid architectures, from the first
introduction of the concept in 1974 by Jack Dennis at MIT. Specific dataflow and hybrid systems studied in
this context were the ETL EM-4 system developed in Japan, and the MIT/Motorola *T prototype system.
(d) Repeat the above for a two-dimensional fast Fourier transform over N × N sample points on an n-processor OMP, where N = nk for some integer k ≥ 2. The idea of performing a two-dimensional FFT on an OMP is to perform a one-dimensional FFT along one dimension in a row-access mode. All n processors then synchronize, switch to a column-access mode, and perform another one-dimensional FFT along the second dimension. First try the case where N = 8, n = 4, and k = 2, and then work out the general case for large N >> n.

Problem 9.5 The following questions are related to shared virtual memory:
(a) Why has shared virtual memory (SVM) become a necessity in building a scalable system with memories physically distributed over a large number of processing nodes?
(b) What are the major differences in implementing SVM at the cache block level and the page level?

Problem 9.9 Why are hypercube networks (binary n-cube networks), which were very popular in first-generation multicomputers, being replaced by 2D or 3D meshes or tori in the second and third generations of multicomputers?

Problem 9.10 Answer the following questions on the SCI standard:
(a) Explain the sharing-list creation and update methods used in the IEEE Scalable Coherence Interface (SCI) standard.
(b) Comment on the advantages and disadvantages of chained directories for cache coherence control in large-scale multiprocessor systems.

Problem 9.11 Compare the four context-switching policies: switch on cache miss, switch on every load, switch on every instruction (cycle by cycle), and switch on block of instructions.
(a) What are the advantages and shortcomings of each policy?
(b) What additional research would be needed to make an optimal choice among these policies?