MODULE-4
Multiprocessors and Multicomputers
In this chapter, we study system architectures of multiprocessors and multicomputers. Various cache coherence protocols, synchronization methods, crossbar switches, multiport memory, and multistage networks are described for building multiprocessor systems. Then we discuss multicomputers with distributed memories which are not globally shared. The Intel Paragon is used as a case study. Message-passing mechanisms required with multicomputers are also reviewed. Single-address-space multicomputers will be studied in Chapter 9.
In packet switching, the information is broken into small packets individually competing for a path in the
network.
Fig. 7.1 Interconnection structures in a generalized multiprocessor system with local memory, private caches, shared memory, and shared peripherals (disk units, backup storage, printer, terminals, network)
Network control strategy is classified as centralized or distributed. With centralized control, a global controller receives requests from all devices attached to the network and grants the network access to one or more requesters. In a distributed system, requests are handled by local devices independently.
the interface logic. An I/O or network interface chip or board uses a data bus. Each of these local buses consists of signal and utility lines.
[Figure 7.2: a CPU board and a memory board plugged into a system bus on the backplane, with local peripherals attached via a SCSI bus and buffered data buses connecting disk units, a printer or plotter, and a network (Ethernet etc.)]
Backplane Bus A backplane is a printed circuit on which many connectors are used to plug in functional boards. A system bus, consisting of shared signal paths and utility lines, is built on the backplane. This system bus provides a common communication path among all plug-in boards.
Several backplane bus standards have been developed over time, such as the VME bus (IEEE Standard 1014-1987), Multibus II (IEEE Standard 1296-1987), and Futurebus+ (IEEE Standard 896.1-1991), as introduced in Chapter 5. However, point-to-point switched interconnects have emerged as more efficient alternatives, as discussed in Chapters 5 and 13.
I/O Bus Input/output devices are connected to a computer system through an I/O bus such as the SCSI (Small Computer Systems Interface) bus. This bus is made of coaxial cables with taps connecting disks,
printer, and other devices to a processor through an I/O controller (Fig. 7.2). Special interface logic is used to connect various board types to the backplane bus.
Complete specifications for a bus system include logical, electrical, and mechanical properties, various
application profiles, and interface requirements. Our study will be confined to the logical and application
aspects of system buses. Emphasis will be placed on the scalability and bus support for cache coherence and
fast synchronization.
For example, the core of the Encore Multimax multiprocessor was the Nanobus, consisting of 20 slots, a 32-bit address, a 64-bit data path, and a 14-bit vector bus, and operating at a clock rate of 12.5 MHz with a total memory bandwidth of 100 Mbytes/s. The Sequent multiprocessor bus had a 64-bit data path, a 10-MHz clock rate, and a 32-bit address, for a channel bandwidth of 80 Mbytes/s. A write-back private cache was used to reduce the bus traffic by 50%.
Digital bus interconnects can be adopted in commercial systems ranging from workstations to minicomputers, mainframes, and multiprocessors. Hierarchical bus systems can be used to build medium-sized multiprocessors with fewer than 100 processors. However, the bus approach is limited by bandwidth scalability and the packaging technology employed.
Hierarchical Buses and Caches Wilson (1987) proposed a hierarchical cache/bus architecture as shown in Fig. 7.3. This is a multilevel tree structure in which the leaf nodes are processors and their private caches (denoted Pi and C1i in Fig. 7.3). These are divided into several clusters, each of which is connected through a cluster bus.
Fig. 7.3 A hierarchical cache/bus architecture for designing a scalable multiprocessor, with processors and first-level caches on cluster buses and second-level caches on an intercluster bus (Courtesy of Wilson; reprinted from Proc. of Annual Int. Symp. on Computer Architecture, 1987)
An intercluster bus is used to provide communications among the clusters. Second-level caches (denoted as C2i) are used between each cluster bus and the intercluster bus. Each second-level cache must have a capacity that is at least an order of magnitude larger than the sum of the capacities of all first-level caches connected beneath it.
Each single cluster operates as a single-bus system. Snoopy bus coherence protocols can be used to establish consistency among first-level caches belonging to the same cluster. Second-level caches are used to extend consistency from each local cluster to the upper level.
The upper-level caches form another level of shared memory between each cluster and the main memory modules connected to the intercluster bus. Most memory requests should be satisfied at the lower-level caches. Intercluster cache coherence is controlled among the second-level caches and the resulting effects are passed to the lower level.
Example 7.1 Encore Ultramax multiprocessor architecture
The Ultramax had a two-level hierarchical-bus architecture as depicted in Fig. 7.4. The Ultramax architecture was very similar to that characterized by Wilson, except that the global Nanobus was used only for intercluster communications.
Fig. 7.4 The Ultramax multiprocessor architecture using hierarchical buses with multiple clusters (Courtesy of Encore Computer Corporation, 1987). Legends: P = Processor, PC = Private Cache, MM = Main Memory, SC = Shared Cache, RS = Route Switch
The shared memories were distributed to all clusters instead of being connected to the intercluster bus. The cluster caches formed the second-level caches and performed the same filtering and cache coherence control for remote accesses as in Wilson's scheme. When an access request reached the top bus, it would be routed down to the cluster memory that matched it with the reference address.
The idea of using bridges between multiprocessor clusters is to allow transactions initiated on a local bus to be completed on a remote bus. As exemplified in Fig. 7.5, multiple buses are used to build a very large system consisting of three multiprocessor clusters. The bus used in this example is Futurebus+, but the basic idea is more general. Bridges are used to interface the clusters. The main functions of a bridge include communication protocol conversion, interrupt handling in split transactions, and serving as cache and memory agents.
Fig. 7.5 A multiprocessor system using multiple Futurebus+ segments, with bridges and message interfaces connecting three clusters of processors, caches, memories, and I/O processors (SCSI, LAN, ISDN, connections to a supercomputer and a visualization monitor) (Reprinted with permission from IEEE Standard 896.1-1991, copyright © 1991 by IEEE, Inc.)
reaching their destination. A single-stage network is cheaper to build, but multiple passes may be needed to establish certain connections. The crossbar switch and multiport memory organization are both single-stage networks.
A multistage network consists of more than one stage of switch boxes. Such a network should be able to connect from any input to any output. We will study unidirectional multistage networks in Section 7.1.3. The choice of interstage connection patterns determines the network connectivity. These patterns may be the same or different at different stages, depending on the class of networks to be designed. The Omega network, Flip network, and Baseline networks are all multistage networks.
Blocking versus Nonblocking Networks A multistage network is called blocking if the simultaneous connections of some multiple input-output pairs may result in conflicts in the use of switches or communication links.
Examples of blocking networks include the Omega (Lawrie, 1975), Baseline (Wu and Feng, 1980), Banyan (Goke and Lipovski, 1973), and Delta networks (Patel, 1979). Some blocking networks are equivalent after graph transformations. In fact, most multistage networks are blocking in nature. In a blocking network, multiple passes through the network may be needed to achieve certain input-output connections.
A multistage network is called nonblocking if it can perform all possible connections between inputs and outputs by rearranging its connections. In such a network, a connection path can always be established between any input-output pair. The Benes networks (Benes, 1965) have such a capability. However, Benes networks require almost twice the number of stages to achieve the nonblocking connections. The Clos networks (Clos, 1953) can also perform all permutations in a single pass without blocking. Certain subclasses of blocking networks can also be made nonblocking if extra stages are added or connections are restricted. The blocking problem can be avoided by using combining networks to be described in the next section.
Crossbar Networks In a crossbar network, every input port is connected to a free output port through a crosspoint switch (circles in Fig. 2.26a) without blocking. A crossbar network is a single-stage network built with unary switches at the crosspoints.
Once the data is read from memory, its value is returned to the requesting processor along the same crosspoint switch. In general, such a crossbar network requires the use of n × m crosspoint switches. A square crossbar (n = m) can implement any of the n! permutations without blocking.
As introduced earlier, a crossbar switch network is a single-stage, nonblocking, permutation network. Each crosspoint in a crossbar network is a unary switch which can be set open or closed, providing a point-to-point connection path between the source and destination.
All processors can send memory requests independently and asynchronously. This poses the problem of multiple requests destined for the same memory module at the same time. In such cases, only one of the requests is serviced at a time. Let us characterize below the crosspoint switching operations.
Crosspoint Switch Design Out of n crosspoint switches in each column of an n × m crossbar mesh, only one can be connected at a time. To resolve the contention for each memory module, each crosspoint switch must be designed with extra hardware.
Furthermore, each crosspoint switch requires the use of a large number of connecting lines accommodating address, data path, and control signals. This means that each crosspoint has a complexity matching that of a bus of the same width.
For an n × n crossbar network, this implies that n² sets of crosspoint switches and a large number of lines are needed. What this amounts to is a crossbar network requiring extensive hardware when n is very large. So far only relatively small crossbar networks with n ≤ 16 have been built into commercial machines.
On each row of the crossbar mesh, multiple crosspoint switches can be connected simultaneously. Simultaneous data transfers can take place in a crossbar between n pairs of processors and memories.
Figure 7.6 shows the schematic design of a row of crosspoint switches in a single crossbar network. Multiplexer modules are used to select one of n read or write requests for service. Each processor sends in an independent request, and the arbitration logic makes the selection based on certain fairness or priority rules. A sketch of such an arbiter is given below.
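The arbitration just described can be illustrated with a short C sketch. This is not the actual hardware design: the rotating-priority rule, the structure fields, and the function name are assumptions made for the example; a real arbiter is combinational logic in the crosspoint column.

```c
#include <stdbool.h>

#define NPROC 16   /* number of processors attached to the crossbar column */

/* One crosspoint column serves a single memory module.  At most one of
 * the pending requests may be granted per memory cycle; a rotating
 * (round-robin) priority pointer gives every processor a fair chance. */
typedef struct {
    bool request[NPROC];   /* request line from each processor        */
    int  last_grant;       /* processor granted in the previous cycle */
} CrossbarColumn;

/* Returns the index of the processor granted this cycle, or -1 if no
 * request is pending.  The grant index also acts as the control code
 * that selects one of the n sets of data/address/read-write lines at
 * the multiplexer tree (a 4-bit code for n = 16).                    */
int arbitrate(CrossbarColumn *col)
{
    for (int offset = 1; offset <= NPROC; offset++) {
        int p = (col->last_grant + offset) % NPROC;
        if (col->request[p]) {
            col->request[p] = false;   /* acknowledge: request consumed */
            col->last_grant = p;
            return p;
        }
    }
    return -1;
}
```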
Fig. 7.6 Schematic design of a row of crosspoint switches in a crossbar network: a multiplexer tree selects one of the n sets of data, address, and read/write lines from the n processors for the shared memory module (Mi), and arbitration logic returns request/acknowledge signals to the processors
For example, a 4-bit control signal will be generated for n = 16 processors. Note that n sets of data, address, and read/write lines are connected to the input of the multiplexer tree. Based on the control signal received, only one out of n sets of information lines is selected as the output of the multiplexer tree.
The memory address is entered for both read and write access. In the case of read, the data fetched from memory are returned to the selected processor in the reverse direction using the data path established. In the case of write, the data on the data path are stored in memory.
Acknowledge signals are used to indicate the arbitration result to all requesting processors. These signals initiate data transfer and are used to avoid conflicts. Note that the data path established is bidirectional, in order to serve both read and write requests for different memory cycles.
Crossbar Limitations A single processor can send many requests to multiple memory modules. For an n × n crossbar network, at most n memory words can be delivered to at most n processors in each cycle.
The crossbar network offers the highest bandwidth of n data transfers per cycle, as compared with only one data transfer per bus cycle. Since all necessary switching and conflict resolution logic are built into the crosspoint switch, the processor interface and memory port logic are much simplified and cheaper. A crossbar network is cost-effective only for small multiprocessors with a few processors accessing a few memory modules. A single-stage crossbar network is not expandable once it is built.
Redundancy or parity-check lines can be built into each crosspoint switch to enhance the fault tolerance
and reliability of the crossbar network.
Multiport Memory Because building a crossbar network into a large system is cost-prohibitive, some mainframe multiprocessors used a multiport memory organization. The idea is to move all crosspoint arbitration and switching functions associated with each memory module into the memory controller.
Thus the memory module becomes more expensive due to the added access ports and associated logic, as demonstrated in Fig. 7.7a. The circles in the diagram represent n switches tied to n input ports of a memory module. Only one of n processor requests can be honored at a time.
The multiport memory organization is a compromise solution between a low-cost, low-performance bus system and a high-cost, high-bandwidth crossbar system. The contention bus is time-shared by all processors and device modules attached. The multiport memory must resolve conflicts among processors.
This memory structure becomes expensive when m and n become large. A typical mainframe multiprocessor configuration may have n = 4 processors and m = 16 memory modules. A multiport memory multiprocessor is not scalable because once the ports are fitted, no more processors can be added without redesigning the memory controller.
Another drawback is the need for a large number of interconnection cables and connectors when the configuration becomes large. The ports of each memory module in Fig. 7.7b are prioritized. Some of the processors are CPUs, some are I/O processors, and some are connected to dedicated processors.
Fig. 7.7 Multiport memory organizations for multiprocessor systems: (a) n processors sharing m memory modules through multiple access ports; (b) memory ports prioritized or privileged in each module by numbers (Courtesy of P. H. Enslow, ACM Computing Surveys, March 1977)
For example, the Univac 1100/94 multiprocessor consisted of four CPUs, four I/O processors, and two scientific vector processors connected to four shared-memory modules, each of which was 10-way ported. The access to these ports was prioritized under operating system control. In other multiprocessors, part of the memory module can be made private with ports accessible only to the owner processors.
Routing in Omega Networks We have defined the Omega network in Chapter 2. In what follows, we describe the message-routing algorithm and broadcast capability of an Omega network. This class of network was built into the Illinois Cedar multiprocessor (Kuck et al., 1987), into the IBM RP3 (Pfister et al., 1985), and into the NYU Ultracomputer (Gottlieb et al., 1983). An 8-input Omega network is shown in Fig. 7.8.
In general, an n-input Omega network has log2 n stages. The stages are labeled from 0 to log2 n - 1 from the input end to the output end. Data routing is controlled by inspecting the destination code in binary. When the ith high-order bit of the destination code is a 0, a 2 × 2 switch at stage i connects the input to the upper output. Otherwise, the input is directed to the lower output.
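The destination-tag rule above amounts to a one-line decision per stage. The following C sketch (illustrative only; the function name and printed output are assumptions, not part of any machine's implementation) traces the path of one message through an n-input Omega network.

```c
#include <stdio.h>

/* Route one message through an n-input Omega network (n a power of 2).
 * At stage i (counting from the input end), the switch inspects the ith
 * high-order bit of the destination: 0 selects the upper output of the
 * 2 x 2 switch, 1 selects the lower output.                            */
void omega_route(unsigned dest, unsigned n)
{
    int stages = 0;
    for (unsigned m = n; m > 1; m >>= 1) stages++;   /* stages = log2(n) */

    for (int i = 0; i < stages; i++) {
        unsigned bit = (dest >> (stages - 1 - i)) & 1u;
        printf("stage %d: %s output\n", i, bit ? "lower" : "upper");
    }
}

int main(void)
{
    /* Destination 011 in an 8-input network: upper, lower, lower,
     * matching the routing through switches A, B, and C in Fig. 7.8a. */
    omega_route(3, 8);
    return 0;
}
```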
Two switch settings are shown in Figs. 7.8a and b with respect to permutations π1 = (0, 7, 6, 4, 2) (1, 3) (5) and π2 = (0, 6, 4, 7, 3) (1, 5) (2), respectively.
The switch settings in Fig. 7.8a are for the implementation of π1, which maps 0 → 7, 7 → 6, 6 → 4, 4 → 2, 2 → 0, 1 → 3, 3 → 1, 5 → 5. Consider the routing of a message from input 001 to output 011. This involves the use of switches A, B, and C. Since the most significant bit of the destination 011 is a "zero", switch A must be set straight so that the input 001 is connected to the upper output (labeled 2). The middle bit in 011 is a "one", thus input 4 to switch B is connected to the lower output with a "crossover" connection. The least significant bit in 011 is a "one", implying a flat connection in switch C. Similarly, the switches A, E, and D are set for routing a message from input 101 to output 101. There exists no conflict in all the switch settings needed to implement the permutation π1 in Fig. 7.8a.
Now consider implementing the permutation π2 in the 8-input Omega network (Fig. 7.8b). Conflicts in switch settings do exist in three switches identified as F, G, and H. The conflicts occurring at F are caused by the desired routings 000 → 110 and 100 → 111. Since both destination addresses have a leading bit 1, both inputs to switch F must be connected to the lower output. To resolve the conflicts, one request must be blocked.
Similarly, we see conflicts at switch G between 011 → 000 and 111 → 011, and at switch H between 101 → 001 and 011 → 000. At switches I and J, broadcast is used from one input to two outputs, which is allowed if the hardware is built to have four legitimate states as shown in Fig. 2.24a. The above example indicates the fact that not all permutations can be implemented in one pass through the Omega network.
The Omega network is a blocking network. In case of blocking, one can establish the conflicting connections in several passes. For the example π2, we can connect 000 → 110, 001 → 101, 010 → 010, 101 → 001, 110 → 100 in the first pass and 011 → 000, 100 → 111, 111 → 011 in the second pass. In general, if 2 × 2 switch boxes are used, an n-input Omega network can implement n^(n/2) permutations in a single pass. There are n! permutations in total.
Fig. 7.8 Two switch settings of an 8 × 8 Omega network built with 2 × 2 switches
For n = 8, this implies that only 8^4/8! = 4096/40320 = 0.1016 = 10.16% of all permutations are implementable in a single pass through an 8-input Omega network. All others will cause blocking and demand up to three passes to be realized. In general, a maximum of log2 n passes are needed for an n-input Omega network. Blocking is not a desired feature in any multistage network, since it lowers the effective bandwidth.
The Omega network can also be used to broadcast data from one source to many destinations, as exemplified in Fig. 7.9a, using the upper broadcast or lower broadcast switch settings. In Fig. 7.9a, the message at input 001 is being broadcast to all eight outputs through a binary tree connection.
The two-way shuffle interstage connections can be replaced by four-way shuffle interstage connections when 4 × 4 switch boxes are used as building blocks, as exemplified in Fig. 7.9b for a 16-input Omega network with log4 16 = 2 stages.
Fig. 7.9 Broadcast capability of an Omega network built with 4 × 4 switches
Note that a four-way shuffle corresponds to dividing the 16 inputs into four equal subsets and then shuffling them evenly among the four subsets. When k × k switch boxes are used, one can define a k-way shuffle function to build an even larger Omega network with log_k n stages.
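As a side note, the k-way shuffle itself can be written as a single index transformation: viewing an index in base k, the shuffle rotates its digits left by one position. A minimal sketch (function name assumed) for n = k^s inputs:

```c
/* k-way perfect shuffle of n = k^s inputs.  Writing an index as s base-k
 * digits, the shuffle rotates the digits left by one position.  For
 * k = 2 this is the ordinary perfect shuffle used by the Omega network;
 * for k = 4 it is the four-way shuffle of Fig. 7.9b.                    */
unsigned k_shuffle(unsigned index, unsigned n, unsigned k)
{
    /* Rotating left by one base-k digit equals multiplying by k modulo n
     * and carrying the overflowing high digit into the low position.    */
    return (index * k) % n + (index * k) / n;
}
```

For example, with k = 2 and n = 8, index 3 (011) maps to 6 (110), and with k = 4 and n = 16, index 7 (13 in base 4) maps to 13 (31 in base 4).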
Routing in Butterfly Networks This class of networks is constructed with crossbar switches as building blocks. Figure 7.10 shows two Butterfly networks of different sizes. Figure 7.10a shows a 64-input Butterfly network built with two stages (2 = log8 64) of 8 × 8 crossbar switches. The eight-way shuffle function is used to establish the interstage connections between stage 0 and stage 1. In Fig. 7.10b, a three-stage Butterfly network is constructed for 512 inputs, again with 8 × 8 crossbar switches. Each of the 64 × 64 boxes in Fig. 7.10b is identical to the two-stage Butterfly network in Fig. 7.10a.
In total, sixteen 8 × 8 crossbar switches are used in Fig. 7.10a and 16 × 8 + 8 × 8 = 192 are used in Fig. 7.10b. Larger Butterfly networks can be modularly constructed using more stages. Note that no broadcast connections are allowed in a Butterfly network, making these networks a restricted subclass of Omega networks.
Fig. 7.10 Modular construction of Butterfly switch networks with 8 × 8 crossbar switches: (a) a two-stage 64 × 64 Butterfly switch network built with 16 8 × 8 crossbar switches and eight-way shuffle interstage connections; (b) a three-stage 512 × 512 Butterfly network built from 64 × 64 modules (Courtesy of BBN Advanced Computers, Inc., 1990)
The Hot-Spot Problem When the network traffic is nonuniform, a hot spot may appear, corresponding to a certain memory module being excessively accessed by many processors at the same time. For example, a semaphore variable being used as a synchronization barrier may become a hot spot, since it is shared by many processors.
Hot spots may degrade the network performance significantly. In the NYU Ultracomputer and the IBM RP3 multiprocessor, a combining mechanism was added to the Omega network. The purpose was to combine multiple requests heading for the same destination at the switch points where conflicts take place. An atomic read-modify-write primitive, Fetch&Add(x, e), has been developed to perform parallel memory updates using the combining network.
Fetch&Add This atomic memory operation is effective in implementing an N-way synchronization with a complexity independent of N. In a Fetch&Add(x, e) operation, x is an integer variable in shared memory and e is an integer increment. When a single processor executes this operation, the semantics is
Fetch&Add(x, e)
{temp ← x;
 x ← temp + e;                (7.1)
 return temp}
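On hardware without a combining network, the same semantics are provided by a processor's atomic read-modify-write instruction. A minimal sketch, assuming C11 atomics (the wrapper name fetch_and_add is ours):

```c
#include <stdatomic.h>

/* Fetch&Add(x, e): atomically return the old value of x while adding e
 * to it, exactly the semantics of Eq. (7.1).  atomic_fetch_add performs
 * the read-modify-write as one indivisible operation.                   */
int fetch_and_add(_Atomic int *x, int e)
{
    return atomic_fetch_add(x, e);   /* returns temp, the old value of x */
}
```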
When N processes attempt Fetch&Add(x, e) at the same memory word simultaneously, the memory is updated only once, following a serialization principle. The sum of the N increments, e1 + e2 + ... + eN, is produced in any arbitrary serialization of the N requests.
This sum is added to the memory word x, resulting in a new value x + e1 + e2 + ... + eN. The values returned to the N requests are all unique, depending on the serialization order followed. The net result is similar to a sequential execution of N Fetch&Adds but is performed in one indivisible operation. Two simultaneous requests are combined in a switch as illustrated in Fig. 7.11.
One of the following operations will be performed if processor P1 executes Ans1 ← Fetch&Add(x, e1) and P2 executes Ans2 ← Fetch&Add(x, e2) simultaneously on the shared variable x. If the request from P1 is executed ahead of that from P2, the following values are returned:
Ans1 ← x
Ans2 ← x + e1                (7.2)
Regardless of the execution order, the value x + e1 + e2 is stored in memory. It is the responsibility of the switch box to form the sum e1 + e2, transmit the combined request Fetch&Add(x, e1 + e2), store the value e1 (or e2) in a wait buffer of the switch, and return the values x and x + e1 to satisfy the original requests Fetch&Add(x, e1) and Fetch&Add(x, e2), respectively, as illustrated in Fig. 7.11 in four steps.
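The four steps can be sketched in C as two helper routines, one applied when the requests meet in the switch and one when the reply returns from memory. The types and names below are illustrative assumptions; a real combining switch performs these steps in hardware.

```c
/* Two requests Fetch&Add(x, e1) from P1 and Fetch&Add(x, e2) from P2
 * meet at a combining switch (Fig. 7.11).                              */
typedef struct {
    int saved_e1;              /* wait buffer holding P1's increment     */
} CombiningSwitch;

/* Steps 1-3: form e1 + e2, save e1, and return the increment carried by
 * the single combined request Fetch&Add(x, e1 + e2) sent to memory.    */
int combine_requests(CombiningSwitch *sw, int e1, int e2)
{
    sw->saved_e1 = e1;
    return e1 + e2;
}

/* Step 4: decombine the reply.  old_x is the value memory returned.
 * P1 receives x and P2 receives x + e1, as in Eq. (7.2).               */
void decombine_reply(const CombiningSwitch *sw, int old_x,
                     int *ans_p1, int *ans_p2)
{
    *ans_p1 = old_x;
    *ans_p2 = old_x + sw->saved_e1;
}
```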
Fig. 7.11 Two Fetch&Add operations are combined to access a shared variable simultaneously via a combining network (the switch forms the sum e1 + e2, stores e1 in its wait buffer, transmits the combined request to memory, and returns x to P1 and x + e1 to P2)
Applications and Drawbacks The Fetch&Add primitive is very effective in accessing sequentially allocated queue structures in parallel, or in forking out parallel processes with identical code that operate on different data sets.
Consider the parallel execution of N independent iterations of the following Do loop by p processors:
Doall N = 1 to 100
  {Code using N}
Endall
Each processor executes a Fetch&Add on N before working on a specific iteration of the loop. In this case, a unique value of N is returned to each processor, which is used in the code segment. The code for each processor is written as follows, with N being initialized as 1:
n ← Fetch&Add(N, 1)
While (n ≤ 100) Doall
  {Code using n}
  n ← Fetch&Add(N, 1)
Endall
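A minimal shared-memory rendering of this self-scheduling loop, using POSIX threads and a C11 atomic counter (the thread count, loop body, and names are placeholders, not part of the original formulation):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define P 4                       /* number of worker threads (processors) */
static _Atomic int N = 1;         /* shared loop index, initialized to 1    */

static void *worker(void *arg)
{
    (void)arg;
    /* Each thread repeatedly claims a unique iteration with Fetch&Add. */
    for (int n = atomic_fetch_add(&N, 1); n <= 100;
         n = atomic_fetch_add(&N, 1)) {
        printf("iteration %d\n", n);          /* {Code using n} */
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[P];
    for (int i = 0; i < P; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < P; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```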
The advantage of using a combining network to implement the Fetch&Add operation comes at a significant increase in network cost. According to NYU Ultracomputer experience, message queueing and combining in each bidirectional 2 × 2 switch box increased the network cost by a factor of at least 6.
Additional switch cycles are also needed to make the entire operation an atomic memory operation. This may increase the network latency significantly. Multistage combining networks have the potential of supporting large-scale multiprocessors with thousands of processors. The problem of increased cost and latency may be alleviated with the use of faster and cheaper switching technology in the future.
Multistage Networks in Real Systems The IBM RP3 was designed to include 512 processors using a high-speed Omega network for reads or writes and a combining network for synchronization using Fetch&Adds. A 128-port Omega network in the RP3 had a bandwidth of 13 Gbytes/s using a 50-MHz clock.
Multistage Omega networks were also built into the Cedar multiprocessor (Kuck et al., 1986) at the University of Illinois and into the Ultracomputer (Gottlieb et al., 1983) at New York University.
The BBN Butterfly processor (TC2000) used 8 × 8 crossbar switch modules to build a two-stage 64 × 64 Butterfly network for a 64-processor system, and a three-stage 512 × 512 Butterfly switch (see Fig. 7.10) for a 512-processor system in the TC2000 Series. The switch hardware was clocked at 38 MHz with a 1-byte data path. The maximum interprocessor bandwidth for a 64-processor TC2000 was designed at 2.4 Gbytes/s.
The Cray Y-MP multiprocessor used 64-, 128-, or 256-way interleaved memory banks, each of which could be accessed via four ports. Crossbar networks were used between the processors and memory banks in all Cray multiprocessors. The Alliant FX/2800 used crossbar interconnects between seven four-processor (i860) boards plus one I/O board and eight shared, interleaved cache boards, which were connected to the physical memory via a memory bus.
a shared data element which has been referenced by both processors. Before update, the three copies of X are consistent.
If processor P1 writes new data X′ into the cache, the same copy will be written immediately into the shared memory under a write-through policy. In this case, inconsistency occurs between the two copies (X′ and X) in the two caches (Fig. 7.12a).
On the other hand, inconsistency may also occur when a write-back policy is used, as shown on the right in Fig. 7.12a. The main memory will be updated eventually when the modified data in the cache are replaced or invalidated.
Process Migration and I/O Figure 7.12b shows the occurrence of inconsistency after a process containing a shared variable X migrates from processor 1 to processor 2 using the write-back cache on the right. In the middle, a process migrates from processor 2 to processor 1 when using write-through caches.
Fig. 7.12 Cache coherence problems in data sharing and in process migration (Adapted from Dubois, Scheurich, and Briggs, 1988): (a) inconsistency in sharing of writable data (before update, write-through, write-back); (b) inconsistency after process migration (before migration, write-through, write-back)
In both cases, inconsistency appears between the two cache copies, labeled X and X′. Special precautions must be exercised to avoid such inconsistencies. A coherence protocol must be established before processes can safely migrate from one processor to another.
Inconsistency problems may occur during I/O operations that bypass the caches.
When the I/O processor loads new data X′ into the main memory, bypassing the write-through caches (middle diagram in Fig. 7.13a), inconsistency occurs between cache 1 and the shared memory. When outputting a datum directly from the shared memory (bypassing the caches), the write-back caches also create inconsistency.
One possible solution to the I/O inconsistency problem is to attach the I/O processors (IOP1 and IOP2) to the private caches (C1 and C2), respectively, as shown in Fig. 7.13b. This way I/O processors share caches with the CPUs. The I/O consistency can be maintained if cache-to-cache consistency is maintained via the bus. An obvious shortcoming of this scheme is the likely increase in cache perturbations and the poor locality of I/O data, which may result in higher miss ratios.
Fig. 7.13 Cache inconsistency after an I/O operation and a possible solution (Adapted from Dubois, Scheurich, and Briggs, 1988): (a) I/O operations bypassing the cache (write-through input, write-back output); (b) a possible solution with I/O processors (IOPi) attached to the private caches (Ci)
Two Protocol Approaches Many of the early commercially available multiprocessors used bus-based memory systems. A bus is a convenient device for ensuring cache coherence because it allows all processors in the system to observe ongoing memory transactions. If a bus transaction threatens the consistent state of a locally cached object, the cache controller can take appropriate actions to invalidate the local copy. Protocols using this mechanism to ensure coherence are called snoopy protocols because each cache snoops on the transactions of other caches.
On the other hand, scalable multiprocessor systems interconnect processors using short point-to-point links in direct or multistage networks. Unlike the situation in buses, the bandwidth of these networks increases as more processors are added to the system. However, such networks do not have a convenient snooping mechanism and do not provide an efficient broadcast capability. In such systems, the cache coherence problem can be solved using some variant of directory schemes.
In general, a cache coherence protocol consists of the set of possible states in the local caches, the state in the shared memory, and the state transitions caused by the messages transported through the interconnection network to keep memory coherent. In what follows, we first describe the snoopy protocols and then the directory-based protocols. Other approaches to designing a scalable cache coherence interface will be studied in Chapter 9.
Fig. 7.14 Write-invalidate and write-update coherence protocols for write-through caches (I: invalidate): (a) consistent copies of block X in shared memory and three processor caches; (b) after a write-invalidate operation by P1; (c) after a write-update operation by P1
Write-Through Caches The states of a cache block copy change with respect to read, write, and replacement operations in the cache. Figure 7.15 shows the state transitions for two basic write-invalidate snoopy protocols developed for write-through and write-back caches, respectively. A block copy of a write-through cache i attached to processor i can assume one of two possible cache states: valid or invalid (Fig. 7.15a).
A remote processor is denoted j, where j ≠ i. For each of the two cache states, six possible events may take place. Note that all cache copies of the same block use the same transition graph in making state changes.
In a valid state (Fig. 7.15a), all processors can read (R(i), R(j)) safely. Local processor i can also write (W(i)) safely in a valid state. The invalid state corresponds to the case of the block either being invalidated or being replaced (Z(i) or Z(j)).
Whenever a remote processor writes (W(j)) into its cache copy, all other cache copies become invalidated. The cache block in cache i becomes valid whenever a successful read (R(i)) or write (W(i)) is carried out by the local processor i.
The fraction of write cycles on the bus is higher than the fraction of read cycles in a write-through cache, due to the need for request invalidations. The cache directory (registration of cache states) can be made in dual copies or dual-ported to filter out most invalidations. In case locks are cached, an atomic Test&Set must be enforced.
Write-Back Caches The valid state of a write-back cache can be further split into two cache states, labeled RW (read-write) and RO (read-only) as shown in Fig. 7.15b. The INV (invalidated or not-in-cache) cache state is equivalent to the invalid state mentioned before. This three-state coherence scheme corresponds to an ownership protocol.
Fig. 7.15 State transition graphs for two write-invalidate snoopy protocols, with R(i)/W(i)/Z(i) denoting read/write/replace by the local processor and R(j)/W(j)/Z(j) by a remote processor: (a) write-through cache (states valid and invalid); (b) write-back cache (RW: read-write, RO: read-only, INV: invalidated or not in cache)
When the memory owns a block, caches can contain only RO copies of the block. In other words, multiple copies may exist in the RO state and every processor having a copy (called a keeper of the copy) can read (R(i), R(j)) the copy safely.
The INV state is entered whenever a remote processor writes (W(j)) its local copy or the local processor replaces (Z(i)) its own block copy. The RW state corresponds to only one cache copy existing in the entire system, owned by the local processor i. Read (R(i)) and write (W(i)) can be safely performed in the RW state. From either the RO state or the INV state, the cache block becomes uniquely owned when a local write (W(i)) takes place.
Other state transitions in Fig. 7.15b can be similarly figured out. Before a block is modified, ownership for exclusive access must first be obtained by a read-only bus transaction which is broadcast to all caches and memory. If a modified block copy exists in a remote cache, memory must first be updated, the copy invalidated, and ownership transferred to the requesting cache.
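The RW/RO/INV transitions just described can be condensed into a small next-state function. This is an illustrative sketch of the ownership protocol of Fig. 7.15b, not a complete cache controller; the event encoding and names are assumptions.

```c
typedef enum { INV, RO, RW } WbState;    /* invalid, read-only, read-write */

typedef enum {
    R_LOCAL, W_LOCAL, Z_LOCAL,           /* R(i), W(i), Z(i) */
    R_REMOTE, W_REMOTE                   /* R(j), W(j)       */
} WbEvent;

/* Next state of one block copy in write-back cache i. */
WbState wb_next_state(WbState s, WbEvent e)
{
    switch (e) {
    case W_LOCAL:                  /* local write obtains exclusive ownership */
        return RW;                 /* from RO or INV (and stays in RW)        */
    case R_LOCAL:
        return (s == INV) ? RO : s;  /* read-miss fetches a read-only copy    */
    case W_REMOTE:                 /* remote write invalidates this copy      */
    case Z_LOCAL:                  /* local replacement discards it           */
        return INV;
    case R_REMOTE:                 /* remote read: an RW owner writes back    */
        return (s == RW) ? RO : s; /* and keeps a read-only copy              */
    }
    return s;
}
```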
Write-Once Protocol James Goodman (1983) proposed a cache coherence protocol for bus-based multiprocessors. This scheme combines the advantages of both write-through and write-back invalidations. In order to reduce bus traffic, the very first write of a cache block uses a write-through policy.
This will result in a consistent memory copy while all other cache copies are invalidated. After the first write, shared memory is updated using a write-back policy. This scheme can be described by the four-state transition graph shown in Fig. 7.16. The four cache states are defined below:
Fig. 7.16 Goodman's write-once cache coherence protocol using the write-invalidate policy on write-back caches (Adapted from James Goodman, 1983; reprinted from Stenström, IEEE Computer, June 1990). Solid lines: commands issued by the local processor; dashed lines: commands issued by remote processors via the system bus.
- Valid: The cache block, which is consistent with the memory copy, has been read from shared memory and has not been modified.
- Invalid: The block is not found in the cache or is inconsistent with the memory copy.
- Reserved: Data has been written exactly once since being read from shared memory. The cache copy is consistent with the memory copy, which is the only other copy.
- Dirty: The cache block has been modified (written) more than once, and the cache copy is the only one in the system (thus inconsistent with all other copies).
To maintain consistency, the protocol requires two different sets of commands. The solid lines in Fig. 7.16 correspond to access commands issued by a local processor, labeled read-miss, write-hit, and write-miss. Whenever a read-miss occurs, the valid state is entered.
The first write-hit leads to the reserved state. The second write-hit leads to the dirty state, and all future write-hits stay in the dirty state. Whenever a write-miss occurs, the cache block enters the dirty state.
The dashed lines correspond to invalidation commands issued by remote processors via the snoopy bus. The read-invalidate command reads a block and invalidates all other copies. The write-invalidate command invalidates all other copies of a block. The bus-read command corresponds to a normal memory read by a remote processor via the bus.
Cache Events and Actions The memory-access and invalidation commands trigger the following events and actions (a state-machine sketch follows this list):
- Read-miss: When a processor wants to read a block that is not in the cache, a read-miss occurs. A bus-read operation will be initiated. If no dirty copy exists, then main memory has a consistent copy and supplies a copy to the requesting cache. If a dirty copy does exist in a remote cache, that cache will inhibit the main memory and send a copy to the requesting cache. In all cases, the cache copy will enter the valid state after a read-miss.
- Write-hit: If the copy is in the dirty or reserved state, the write can be carried out locally and the new state is dirty. If the copy is in the valid state, a write-invalidate command is broadcast to all caches, invalidating their copies. The shared memory is written through, and the resulting state is reserved after this first write.
- Write-miss: When a processor fails to write in a local cache, the copy must come either from the main memory or from a remote cache with a dirty block. This is accomplished by sending a read-invalidate command which will invalidate all other cache copies. The local copy is thus updated and ends up in a dirty state.
- Read-hit: Read-hits can always be performed in a local cache without causing a state transition or using the snoopy bus for invalidation.
- Block replacement: If a copy is dirty, it has to be written back to main memory by block replacement. If the copy is clean (i.e., in either the valid, reserved, or invalid state), no write-back is needed on replacement.
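The sketch referred to above condenses these events into a next-state function for one cache block. The encoding and helper names are ours; bus actions (write-through of the first write, write-back of a dirty copy, invalidation commands) are noted only in comments.

```c
typedef enum { INVALID_S, VALID_S, RESERVED_S, DIRTY_S } WriteOnceState;

typedef enum {
    EV_READ_HIT, EV_READ_MISS, EV_WRITE_HIT, EV_WRITE_MISS,   /* local  */
    EV_REMOTE_READ_INV, EV_REMOTE_WRITE_INV, EV_BUS_READ      /* remote */
} WriteOnceEvent;

/* Next state of a cache block under Goodman's write-once protocol. */
WriteOnceState write_once_next(WriteOnceState s, WriteOnceEvent e)
{
    switch (e) {
    case EV_READ_HIT:
        return s;                        /* no transition, no bus traffic   */
    case EV_READ_MISS:
        return VALID_S;                  /* copy supplied by memory or by a
                                            remote dirty cache (bus-read)   */
    case EV_WRITE_HIT:
        if (s == VALID_S)
            return RESERVED_S;           /* first write: write-through      */
        return DIRTY_S;                  /* later writes: write-back only   */
    case EV_WRITE_MISS:
        return DIRTY_S;                  /* read-invalidate, then write     */
    case EV_REMOTE_READ_INV:
    case EV_REMOTE_WRITE_INV:
        return INVALID_S;                /* another cache claims the block  */
    case EV_BUS_READ:                    /* remote read: a dirty or reserved */
        return (s == DIRTY_S || s == RESERVED_S) ? VALID_S : s;
    }                                    /* copy loses its exclusive status  */
    return s;
}
```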
Goodman's write-once protocol demands special bus lines to inhibit the main memory when the memory copy is invalid, and a bus-read operation is needed after a read-miss. Most standard buses cannot support this inhibition operation.
The IEEE Futurebus+ proposed to include this special bus provision. Using a write-through policy for the first write and a write-back policy for all additional writes eliminates unnecessary invalidations.
Snoopy cache protocols are popular in bus-based multiprocessors because of their simplicity of implementation. The write-invalidate policies were implemented on the Sequent Symmetry multiprocessor and on the Alliant FX multiprocessor.
Besides the DEC Firefly multiprocessor, the Xerox Palo Alto Research Center implemented another write-update protocol for its Dragon multiprocessor workstation. The Dragon protocol avoids updating memory until replacement, in order to improve the efficiency of intercache transfers.
Multilevel Cache Coherence To maintain consistency among cache copies at various levels, Wilson proposed an extension to the write-invalidate protocol used on a single bus. Consistency among cache copies at the same level is maintained in the same way as described above. Consistency of caches at different levels is illustrated in Fig. 7.3.
An invalidation must propagate vertically up and down in order to invalidate all copies in the shared caches at level 2. Suppose processor P1 issues a write request. The write request propagates up to the highest level and invalidates copies in the second-level caches and the other first-level caches holding the block, as shown by the arrows to all the shaded copies.
High-level caches such as C20 keep track of dirty blocks beneath them. A subsequent read request issued by P7 will propagate up the hierarchy because no copies exist at the lower levels. When it reaches the top level, cache C20 issues a flush request down to the cache holding the dirty copy, and the dirty copy is supplied to the private cache associated with processor P7. Note that higher-level caches act as filters for consistency control. An invalidation command or a read request will not propagate down to clusters that do not contain a copy of the corresponding block. The cache C21 acts in this manner.
Protocol Performance Issues The performance of any snoopy protocol depends heavily on the workload patterns and implementation efficiency. The main motivation for using the snooping mechanism is to reduce bus traffic, with a secondary goal of reducing the effective memory-access time. Cache performance is very sensitive to the block size in write-invalidate protocols, but not in write-update protocols.
For a uniprocessor system, bus traffic and memory-access time are mainly contributed by cache misses. The miss ratio decreases when block size increases. However, as the block size increases to a data pollution point, the miss ratio starts to increase. For larger caches, the data pollution point appears at a larger block size.
For a system requiring extensive process migration or synchronization, the write-invalidate protocol will perform better. However, a cache miss can result from an invalidation initiated by another processor prior to the cache access. Such invalidation misses may increase bus traffic and thus should be reduced.
Extensive simulation results have suggested that bus traffic in a multiprocessor may increase when the block size increases. Write-invalidate also facilitates the implementation of synchronization primitives. Typically, the average number of invalidated cache copies is rather small (one or two) in a small multiprocessor.
The write-update protocol requires a bus broadcast capability. This protocol also can avoid the ping-pong effect on data shared between multiple caches. Reducing the sharing of data will lessen bus traffic in a write-update multiprocessor. However, write-update cannot be used with long write bursts. Only through extensive program traces (trace-driven simulation) can one reveal the cache behavior, hit ratio, bus traffic, and effective memory-access time.
Directory Structures In a multistage or packet-switched network, cache coherence is supported by using cache directories to store information on where copies of cache blocks reside. Various directory-based protocols differ mainly in how the directory maintains information and what information it stores.
Tang (1976) proposed the first directory scheme, which used a central directory containing duplicates of all cache directories. This central directory, providing all the information needed to enforce consistency, is usually very large and must be associatively searched, like the individual cache directories. Contention and long search times are two drawbacks in using a central directory for a large multiprocessor.
A distributed-directory scheme was proposed by Censier and Feautrier (1978). Each memory module maintains a separate directory which records the state and presence information for each memory block. The state information is local, but the presence information indicates which caches have a copy of the block.
In Fig. 7.17, a read-miss (thin lines) in cache 2 results in a request sent to the memory module. The memory controller retransmits the request to the dirty copy in cache 1. This cache writes back its copy. The memory module can then supply a copy to the requesting cache. In the case of a write-hit at cache 1 (bold lines), a command is sent to the memory controller, which sends invalidations to all caches (cache 2) marked in the presence vector residing in the directory D1.
Fig. 7.17 Basic concept of a directory-based cache coherence scheme (Courtesy of Censier and Feautrier, IEEE Trans. Computers, Dec. 1978)
A cache-coherence protocol that does not use broadcasts must store the locations of all cached copies of each block of shared data. This list of cached locations, whether centralized or distributed, is called a cache directory. A directory entry for each block of data contains a number of pointers to specify the locations of copies of the block. Each directory entry also contains a dirty bit to specify whether a particular cache has permission to write the associated block of data.
Different types of directory protocols fall under three primary categories: full-map directories, limited directories, and chained directories. Full-map directories store enough data associated with each block in global memory so that every cache in the system can simultaneously store a copy of any block of data. That is, each directory entry contains N pointers, where N is the number of processors in the system.
Limited directories differ from full-map directories in that they have a fixed number of pointers per entry, regardless of the system size. Chained directories emulate the full-map schemes by distributing the directory
among the caches. The following descriptions of the three classes of cache directories are based on the original classification by Chaiken, Fields, Kurihara, and Agarwal (1990).
Full-Map Directories The full-map protocol implements directory entries with one bit per processor and a dirty bit. Each bit represents the status of the block in the corresponding processor's cache (present or absent). If the dirty bit is set, then one and only one processor's bit is set and that processor can write into the block.
A cache maintains two bits of state per block. One bit indicates whether a block is valid, and the other indicates whether a valid block may be written. The cache coherence protocol must keep the state bits in the memory directory and those in the cache consistent.
Figure 7.18a illustrates three different states of a full-map directory. In the first state, location X is missing in all of the caches in the system. The second state results from three caches (C1, C2, and C3) requesting copies of location X. Three pointers (processor bits) are set in the entry to indicate the caches that have copies of the block of data. In the first two states, the dirty bit on the left side of the directory entry is set to clean (C), indicating that no processor has permission to write to the block of data. The third state results from cache C3 requesting write permission for the block. In the final state, the dirty bit is set to dirty (D), and there is a single pointer to the block of data in cache C3.
Let us examine the transition from the second state to the third state in more detail. Once processor P3 issues the write to cache C3, the following events will take place (a code sketch after this list illustrates the directory side of the sequence):
(1) Cache C3 detects that the block containing location X is valid but that the processor does not have permission to write to the block, indicated by the block's write-permission bit in the cache.
(2) Cache C3 issues a write request to the memory module containing location X and stalls processor P3.
(3) The memory module issues invalidate requests to caches C1 and C2.
(4) Caches C1 and C2 receive the invalidate requests, set the appropriate bit to indicate that the block containing location X is invalid, and send acknowledgments back to the memory module.
(5) The memory module receives the acknowledgments, sets the dirty bit, clears the pointers to caches C1 and C2, and sends write permission to cache C3.
(6) Cache C3 receives the write permission message, updates the state in the cache, and reactivates processor P3.
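A minimal data-structure sketch of steps (1) to (6), reduced to the directory side and with the invalidations modeled as synchronous calls (all names and simplifications are ours):

```c
#include <stdbool.h>

#define NCACHES 64               /* number of processors / caches */

/* Full-map directory entry: one presence bit per cache plus a dirty bit. */
typedef struct {
    bool present[NCACHES];
    bool dirty;
} DirEntry;

/* Handle a write request from cache 'writer' for the block described by
 * this entry (steps 3-5 of the example).  A real memory module sends the
 * invalidations and waits for all acknowledgments before granting write
 * permission, which is what preserves sequential consistency.           */
void directory_handle_write(DirEntry *d, int writer,
                            void (*invalidate)(int cache_id))
{
    for (int c = 0; c < NCACHES; c++) {
        if (c != writer && d->present[c]) {
            invalidate(c);        /* send invalidate; assume ack received */
            d->present[c] = false;
        }
    }
    d->present[writer] = true;    /* single pointer remains               */
    d->dirty = true;              /* cache 'writer' may now write         */
}
```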
The memory module waits to receive the acknowledgments before allowing processor P3 to complete its write transaction. By waiting for acknowledgments, the protocol guarantees that the memory system ensures sequential consistency. The full-map protocol provides a useful upper bound for the performance of centralized directory-based cache coherence. However, it is not scalable due to excessive memory overhead.
Because the size of the directory entry associated with each block of memory is proportional to the number of processors, the memory consumed by the directory is proportional to the size of memory O(N) multiplied by the size of the directory entry O(N). Thus, the total memory overhead scales as the square of the number of processors, O(N²).
Limited Directories Limited directory protocols are designed to solve the directory size problem.
Restricting the number of simultaneously cached copies of any particular block of data limits the growth of
the directory to a constant factor.
A directory protocol can be classified as Dir_i X using the notation from Agarwal et al. (1988). The symbol i stands for the number of pointers, and X is NB for a scheme with no broadcast. A full-map scheme without
broadcast is represented as Dir_N NB. A limited directory protocol that uses i < N pointers is denoted Dir_i NB. The limited directory protocol is similar to the full-map directory, except in the case when more than i caches request read copies of a particular block of data.
Fig. 7.18 (a) Three states of a full-map directory: location X missing in all caches, read copies in caches C1, C2, and C3 with the dirty bit clean, and a single dirty copy in cache C3
Figure 7.18b shows the situation when three caches request read copies in a memory system with a Dir_2 NB protocol. In this case, we can view the two-pointer directory as a two-way set-associative cache of pointers to shared copies. When cache C3 requests a copy of location X, the memory module must invalidate the copy in either cache C1 or cache C2. This process of pointer replacement is called eviction. Since the directory acts as a set-associative cache, it must have a pointer replacement policy.
If the multiprocessor exhibits processor locality in the sense that in any given interval of time only a small subset of all the processors access a given memory word, then a limited directory is sufficient to capture this small worker set of processors.
Directory pointers in a Dir_i NB protocol encode binary processor identifiers, so each pointer requires log2 N bits of memory, where N is the number of processors in the system. Given the same assumptions as for the full-map protocol, the memory overhead of limited directory schemes grows as O(N log2 N).
These protocols are considered scalable with respect to memory overhead because the resources required to implement them grow approximately linearly with the number of processors in the system. Dir_i B protocols allow more than i copies of each block of data to exist, but they resort to a broadcast mechanism when more than i cached copies of a block need to be invalidated. However, point-to-point interconnection networks do not provide an efficient systemwide broadcast capability. In such networks, it is difficult to determine the completion of a broadcast to ensure sequential consistency.
Chained Directories Chained directories realize the scalability of limited directories without restricting the number of shared copies of data blocks. This type of cache coherence scheme is called a chained scheme because it keeps track of shared copies of data by maintaining a chain of directory pointers.
The simpler of the two schemes implements a singly linked chain, which is best described by example (Fig. 7.18c). Suppose there are no shared copies of location X. If processor P1 reads location X, the memory sends a copy to cache C1, along with a chain termination (CT) pointer. The memory also keeps a pointer to cache C1. Subsequently, when processor P2 reads location X, the memory sends a copy to cache C2, along with the pointer to cache C1. The memory then keeps a pointer to cache C2.
By repeating the above step, all of the caches can hold a copy of the location X. If processor P3 writes to location X, it is necessary to send a data invalidation message down the chain. To ensure sequential consistency, the memory module denies processor P3 write permission until the processor with the chain termination pointer acknowledges the invalidation of the chain. Perhaps this scheme should be called a gossip protocol (as opposed to a snoopy protocol) because information is passed from individual to individual rather than being spread by covert observation.
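A C sketch of the singly linked (gossip) chain, with in-memory pointers standing in for the messages; the names and the synchronous invalidation loop are illustrative assumptions:

```c
#define CT (-1)                   /* chain-termination (CT) pointer */

typedef struct { int head; } ChainedDirEntry;   /* per memory block */

typedef struct {
    int next;                     /* next cache down the chain, or CT */
    int has_copy;
} CacheLine;

/* Processor p reads location X: memory supplies a copy plus the current
 * head pointer, and the new reader becomes the head of the chain.       */
void chained_read(ChainedDirEntry *dir, CacheLine caches[], int p)
{
    caches[p].has_copy = 1;
    caches[p].next = dir->head;   /* pointer to the previous reader or CT */
    dir->head = p;
}

/* Processor p writes location X: an invalidation travels down the chain.
 * Write permission is granted only after the cache holding the CT pointer
 * acknowledges, which preserves sequential consistency.                   */
void chained_write(ChainedDirEntry *dir, CacheLine caches[], int p)
{
    for (int c = dir->head; c != CT; ) {
        int next = caches[c].next;
        caches[c].has_copy = 0;   /* invalidation message to cache c */
        caches[c].next = CT;
        c = next;
    }
    dir->head = p;                /* writer holds the sole (dirty) copy */
    caches[p].has_copy = 1;
    caches[p].next = CT;
}
```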
The possibility of cache block replacement complicates chained-directory protocols.
Suppose that caches C1 through CN all have copies of location X and that location X and location Y map to the same (direct-mapped) cache line. If processor Pi reads location Y, it must first evict location X from its cache, with the following possibilities:
(1) Send a message down the chain to cache Ci-1 with a pointer to cache Ci+1 and splice Ci out of the chain, or
(2) Invalidate location X in cache Ci+1 through cache CN.
The second scheme can be implemented by a less complex protocol than the first. In either case, sequential
consistency is maintained by locking the memory location while invalidations are in progress. Another
solution to the replacement problem is to use a doubly linked chain. This scheme maintains forward and
backward chain pointers for each cached copy so that the protocol does not have to traverse the chain when
there is a cache replacement. The doubly linked directory optimizes the replacement condition at the cost of a larger average message block size (due to the transmission of extra directory pointers), twice the pointer memory in the caches, and a more complex coherence protocol.
Although the chained protocols are more complex than the limited directory protocols, they are still scalable in terms of the amount of memory used for the directories. The pointer sizes grow as the logarithm of the number of processors, and the number of pointers per cache or memory block is independent of the number of processors.
Cache Design Alternatives The relative merits of physical address caches and virtual address caches have to be judged based on the access time, the aliasing problem, the flushing problem, OS kernel overhead, special tagging at the process level, and cost/performance considerations. Beyond the use of private caches, three design alternatives are suggested below.
Each of the design alternatives has its own advantages and shortcomings. There exists insufficient evidence to determine whether any of the alternatives is always better or worse than the use of private caches. More research and trace data are needed to apply these cache architectures in designing high-performance multiprocessors.
Shared Cache An alternative approach to maintaining cache coherence is to eliminate the problem completely by using shared caches attached to shared-memory modules. No private caches are allowed in this case. This approach reduces the main memory access time but contributes very little to reducing the overall memory-access time and to resolving access conflicts.
Shared caches can be built as second-level caches. Sometimes, one can make the second-level caches partially shared by different clusters of processors. Various cache architectures are possible if private and shared caches are both used in a memory hierarchy. The use of shared caches alone may work against the scalability of the entire system. Tradeoffs between using private caches, caches shared by multiprocessor clusters, and shared main memory are interesting topics for further research.
Non-cacheable Data Another approach is not to cache shared writable data. Shared data are non-cacheable, and only instructions or private data are cacheable in local caches. Shared data include locks, process queues, and any other data structures protected by critical sections.
The compiler must tag data as either cacheable or non-cacheable. Special hardware tagging must be used to distinguish them. Cache systems with cacheable and non-cacheable blocks demand more support from hardware and compilers.
Cache Flushing A third approach is to use cache flushing every time a synchronization primitive is executed. This may work well with transaction processing multiprocessor systems. Cache flushes are slow unless special hardware is used. This approach does not solve the I/O and process migration problems.
Flushing can be made very selective by the compiler in order to increase efficiency. Cache flushing at synchronization, I/O, and process migration may be carried out unconditionally or selectively. Cache flushing is more often used with virtual address caches.
Synchronization enforces correct sequencing of processors and ensures mutually exclusive access to shared writable data. Synchronization can be implemented in software, firmware, and hardware through controlled sharing of data and control information in memory.
Multiprocessor systems use hardware mechanisms to implement low-level or primitive synchronization operations, or use software (operating system) level synchronization mechanisms such as semaphores or monitors. Only hardware synchronization mechanisms are studied below. Software approaches to synchronization will be treated in Chapter 10.
Atomic Operations Most multiprocessors are equipped with hardware mechanisms for enforcing atomic operations such as memory read, write, or read-modify-write operations, which can be used to implement some synchronization primitives. Besides atomic memory operations, some interprocessor interrupts can be used for synchronization purposes. For example, the synchronization primitives Test&Set (lock) and Reset (lock) are defined below:

Test&Set (lock)
    temp ← lock; lock ← 1;
    return temp                                    (7.4)
Reset (lock)
    lock ← 0
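For illustration only, the sketch below mimics these primitives in software; the mutex stands in for the bus-level atomicity that a real read-modify-write instruction provides, so this is a behavioral model rather than the hardware mechanism itself.

```python
import threading

class SpinLock:
    """Spin lock built from the Test&Set / Reset pair defined above."""
    def __init__(self):
        self._guard = threading.Lock()   # emulates hardware atomicity
        self.lock = 0                    # the shared variable 'lock'

    def test_and_set(self) -> int:
        with self._guard:                # atomic read-modify-write
            temp, self.lock = self.lock, 1
            return temp                  # old value: 0 means the lock was free

    def reset(self) -> None:
        self.lock = 0                    # Reset(lock): lock <- 0

    def acquire(self) -> None:
        while self.test_and_set() == 1:  # spin until Test&Set returns 0
            pass

    def release(self) -> None:
        self.reset()
```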
This demonstrates the ability to use the control bit X_i to signal the completion of a process on processor i. The bit X_i is set to 1 when a process is initiated and reset to 0 when the process finishes its execution.
When all processes finish their jobs, the X_i bits from the participating processors are all set to 0, and the barrier line is then raised to high (1), signaling that the synchronization barrier has been crossed. This timing is watched by all processors through snooping on the Y bit. Thus only one barrier line is needed to monitor the initiation and completion of a single synchronization involving many concurrent processes.
Fig. 7.19 The synchronization of four independent processes on four processors using one wired-NOR barrier line (Adapted from Hwang and Shang, Proc. Int. Conf. Parallel Processing, 1991)
Multiple barrier lines can be used simultaneously to monitor several synchronization points.
Figure 7.19 shows the synchronization of four processes residing on four processors using one barrier line. Note that other barrier lines can be used to synchronize other processes at the same time in a multiprogrammed multiprocessor environment.
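A minimal software model of this mechanism is sketched below (illustrative only; names such as barrier_line are hypothetical): each process sets its X bit to 1 when it is initiated and clears it when it finishes, and the wired-NOR line goes high only when every X bit is 0.

```python
def barrier_line(x_bits):
    # Wired-NOR of the X_i control bits: 1 only when all bits have been reset to 0.
    return int(not any(x_bits))

# Four processes on four processors sharing one barrier line (cf. Fig. 7.19).
x = [1, 1, 1, 1]               # all four processes have been initiated
assert barrier_line(x) == 0    # barrier not yet crossed
for i in range(4):             # processes finish one by one and reset X_i
    x[i] = 0
print(barrier_line(x))         # -> 1: the synchronization barrier has been crossed
```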
Example 7.2 Wired barrier synchronization of five partially ordered processes (Hwang and Shang, 1991)
If the synchronization pattern is predicted at compile time, then one can follow the precedence graph of a partially ordered set of processes to perform multiple synchronizations, as demonstrated in Fig. 7.20.
Fig. 7.20 The synchronization of five partially ordered processes (P1 to P5) using wired-NOR barrier lines (Adapted from Hwang and Shang, Proc. Int. Conf. Parallel Processing, 1991): (a) synchronization patterns, (b) precedence graph
Here five processes (P1, P2, ..., P5) are synchronized by snooping on five barrier lines corresponding to five synchronization points labeled a, b, c, d, e. At step 0 the control vectors need to be initialized. All five processes are synchronized at point a. The crossing of barrier a is signaled by monitor bit Y1, which is observable by all processors.
Barriers b and c can be monitored simultaneously using two lines, as shown in steps 2a and 2b. Only four steps are needed to complete the entire process. Note that only one copy of the monitor vector Y is maintained in the shared memory. The bus interface logic of each processor module has a copy of Y for local monitoring purposes, as shown in Fig. 7.20c.
Separate control vectors are used in the local processors. The above dynamic barrier synchronization is possible only if the synchronization pattern is predicted at compile time and process preemption is not allowed. One can also use the barrier wires along with counting semaphores in memory to support multiprogrammed multiprocessors in which preemption is allowed.
Fig. 7.21 Design choices made in the past for developing message-passing multicomputers compared to those made for other parallel computers (Courtesy of Intel Scientific Computers, 1988). The four choices shown are control selection (MIMD, MPMD, SPMD rather than SIMD), interconnection selection (message passing rather than switching), memory selection (distributed rather than shared memory), and processor selection (low-cost rather than expensive processors); machines named include nCUBE, Intel, AMT, TMC, Sequent, Alliant, BBN, and the IBM RP3.
In selecting a control strategy, designers of multicomputers chose the asynchronous MIMD, MPMD, and SPMD operations, rather than the SIMD lockstep operations as in the CM-2 and DAP. Even though both support massive parallelism, the SIMD approach offers little or no opportunity to utilize existing multiprocessor code because radical changes must be made in the programming style.
On the other hand, multicomputers allow the use of existing software with minor changes from that developed for multiprocessors or for other types of parallel computers.
First Generation Caltech's Cosmic Cube (Seitz, 1983) was the first of the first-generation multicomputers. The Intel iPSC/1, Ametek S/14, and nCUBE/10 were various evolutions of the original Cosmic Cube.
For example, the iPSC/1 used 80286 processors with 512 Kbytes of local memory per node. Each node was implemented on a single printed-circuit board with eight I/O ports. Seven I/O ports were used to form a seven-dimensional hypercube. The eighth port was used for an Ethernet connection from each node to the host.
Table 7.1 summarizes the important parameters used in designing the early three generations of multicomputers. The communication latency (for a 100-byte message) was rather long in the early 1980s. The 3-to-1 ratio between remote and local communication latencies was caused by the use of a store-and-forward routing scheme where the latency is proportional to the number of hops between two communicating nodes.
(Modified from Athas and Seitz, "Multicomputers: Message-Passing Concurrent Computers", IEEE Computer, August 1988.)
Vector hardware was added on a separate board attached to each processing node board. Or one could use the second board to hold extended local memory. The host used in the iPSC/1 was an Intel 310 microcomputer. All I/O had to be done through the host.
The Second Generation A major improvement of the second generation included the use of better processors, such as the i386 in the iPSC/2 and the i860 in the iPSC/860 and in the Delta. The nCUBE/2 implemented 64 custom-designed VLSI processors on a single PC board. The memory per node was also increased to 10 times that of the first generation.
Most importantly, hardware-supported routing, such as wormhole routing, reduced the communication latency significantly from 6000 μs to less than 5 μs. In fact, the latency for remote and local communications became almost the same, independent of the number of hops between any two nodes.
The architecture of a typical second-generation multicomputer is shown in Fig. 7.22. This corresponds to a 16-node mesh-connected architecture. Mesh-routing chips (MRCs) are used to establish the four-neighbor mesh network. All the mesh communication channels and MRCs are built on a backplane.
Fig. 7.22 A typical second-generation multicomputer: a 16-node mesh of computer nodes interconnected by mesh-routing chips (MRCs), with a file system, display generator, and Ethernet attached at the mesh boundary
Each node is implemented on a PC board plugged into the backplane at the proper MRC position. All I/O devices, graphics, and the host are connected to the periphery (boundary) of the mesh. The Intel Delta system had such a mesh architecture.
Another representative system was the nCUBE/2, which implemented a hypercube with up to 8192 nodes with a total of 512 Gbytes of distributed memory. Note that some parameters in Table 7.1 have been updated from the conservative estimates made by Athas and Seitz in 1988. Typical figures representative of current systems can be found in Chapter 13.
The Supernode 1000 was a Transputer-based multicomputer produced by Parsys Ltd., England. Another second-generation system was Ametek's Series 2010, made with 25-MHz M68020 processors using a mesh-routed architecture with 225-Mbytes/s channels.
The Third Generation These designs laid the foundation for the current generation of multicomputers. Caltech had the Mosaic C project designed to use VLSI-implemented nodes, each containing a 14-MIPS processor, 16-Mbytes/s routing channels, and 16 Kbytes of RAM integrated on a single chip.
The full size of the Mosaic was targeted to have a total of 16,384 nodes organized in a three-dimensional mesh architecture. MIT built the J-machine, which it planned to extend to a 65K-node multicomputer with VLSI nodes interconnected by a three-dimensional mesh network. We will study the J-machine experience in Section 9.3.2.
The J-machine planned to use message-driven processors to reduce the message handling overhead to less than 1 μs. Each processor chip would contain a 512-Kbit DRAM, a 32-bit processor, a floating-point unit, and a communication controller. The communication latency in such systems was later reduced to a few ns using high-speed links and sophisticated communication protocols.
The significant reduction of overhead in communication and synchronization would permit the execution of much shorter tasks, with sizes of 5 μs per processor in the J-machine, as opposed to executing tasks of 100 μs in the iPSC/1. This implies that concurrency may increase from 10^2 in the iPSC/1 to 10^5 in the J-machine.
The first two generations of multicomputers have been called medium-grain systems. With a significant reduction in communication latency, the third-generation systems may be called fine-grain multicomputers.
Research is also underway to combine the private virtual address spaces distributed over the nodes into a globally shared virtual memory in MPP multicomputers. Instead of page-oriented message passing, the fine-grain system may require block-level cache communications. This fine-grain and shared virtual memory approach can in theory combine the relative merits of multiprocessors and multicomputers in a heterogeneous processing (HP) environment.
Fig. 7.23 The Intel Paragon system architecture: a mesh of compute nodes with I/O columns of service, tape, Ethernet, HIPPI, and SCSI/disk nodes (Courtesy of Intel Supercomputer Systems Division, 1991)
The processors used in the I/O columns were Intel i386's, which supervised the massive data transfers between the disk arrays and the computational array during I/O operations. The system I/O column was made up of six service nodes, two tape nodes, two Ethernet nodes, and a HIPPI node. The service nodes were used for system diagnosis and handling of interrupts. The tape nodes were used for backup storage.
The Ethernet and HIPPI nodes were used for fast gateway connections with the outside world. Collectively, a 17,000-MIPS performance was claimed possible on the 570 numeric and disk I/O nodes involved in program execution. The system was designed to run iPSC/860-compatible software.
Node and Router Architecture The Paragon was designed as an experimental system. One unit was built and delivered to Caltech in May 1991 for research use by a consortium of 13 national laboratories and universities. The typical node architecture is shown in Fig. 7.24.
Fig. 7.24 The typical node architecture: processor, local memory, and an external I/O interface on the node board, with a router (on the backplane) connecting the node to the mesh communication channels
Each node was on a separate board. For numeric nodes, the processor and floating-point units were on the same i860 chip. The local memory took up most of the board space. The external I/O interface was implemented only on the boundary nodes of the computational array. The message I/O interface was required for message passing between local nodes and the mesh network. The mesh-connected router is shown in Fig. 7.25.
Fig. 7.25 The structure of a mesh-connected router with four pairs of I/O channels (north, south, east, west) connected to neighboring routers and a fifth pair connected to the local node (IC: input controller, FB: flit buffer)
Each router had 10 I/O ports, 5 for input and 5 for output. Four pairs of I/O channels were used for mesh connection to the four neighbors at the north, south, east, and west nodes.
Flow control digit (flit) buffers were used at the end of input channels to hold the incoming flits. The concept of flits will be clarified in the next section. Besides four pairs of external channels, a fifth pair was used for internal connection between the router and the local node. A 5 × 5 crossbar switch was used to establish a connection between any input channel and any output channel.
The functions of the hardware router included pipelined message routing at the flit level and resolving buffer or channel deadlock situations to achieve deadlock-free routing. In the next section, we will explain various routing mechanisms and deadlock avoidance schemes.
All the I/O channels shown in Figs. 7.24 and 7.25 are physical channels which allow only one message (flit) to pass at a time. Through time-sharing, one can also implement virtual channels to multiplex the use of physical channels, as described in the next section.
MESSAGE-PASSING MECHANISMS
Message passing in a multicomputer network demands special hardware and software support. In this section, we study the store-and-forward and wormhole routing schemes and analyze their communication latencies. We introduce the concept of virtual channels. Deadlock situations in a message-passing network are examined. We show how to avoid deadlocks using virtual channels.
Both deterministic and adaptive routing algorithms are presented for achieving deadlock-free message routing. We first study deterministic dimension-order routing schemes such as E-cube routing for hypercubes and X-Y routing for two-dimensional meshes. Then we discuss adaptive routing using virtual channels or virtual subnets. Besides one-to-one unicast routing, we will consider one-to-many multicast and one-to-all broadcast operations using virtual subnets and greedy routing algorithms.
Fig. 7.26 The format of message, packets, and flits (flow control digits) used as information units of communication in a message-passing network (R: routing information, S: sequence number, D: data-only flits)
A packet is the basic unit containing the destination address for routing purposes. Because different packets may arrive at the destination asynchronously, a sequence number is needed in each packet to allow reassembly of the message transmitted.
A packet can be further divided into a number of fixed-length flits (flow control digits). Routing information (destination) and the sequence number occupy the header flits. The remaining flits are the data elements of a packet.
In multicomputers with store-and-forward routing, packets are the smallest unit of information transmission. In wormhole-routed networks, packets are further subdivided into flits. The flit length is often affected by the network size.
The packet length is determined by the routing scheme and network implementation. Typical packet lengths range from 64 to 512 bits. The sequence number may occupy one to two flits depending on the message length. Other factors affecting the choice of packet and flit sizes include channel bandwidth, router design, network traffic intensity, etc.
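The message/packet/flit hierarchy can be made concrete with a small sketch (field layout and sizes are arbitrary illustrations, not the format of any particular machine): each packet carries routing information and a sequence number in its header flits, followed by fixed-length data flits.

```python
def packetize(message: bytes, packet_bytes: int, flit_bytes: int, dest: int):
    """Split a message into packets, and each packet into fixed-length flits.
    Header flits carry the routing information (R) and sequence number (S)."""
    packets = []
    for seq, start in enumerate(range(0, len(message), packet_bytes)):
        payload = message[start:start + packet_bytes]
        header = [("R", dest), ("S", seq)]
        data_flits = [("D", payload[i:i + flit_bytes])
                      for i in range(0, len(payload), flit_bytes)]
        packets.append(header + data_flits)
    return packets

# A 12-byte message split into 8-byte packets of 2-byte flits, destined for node 5.
for pkt in packetize(b"hello world!", packet_bytes=8, flit_bytes=2, dest=5):
    print(pkt)
```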
Store-and-Forward Routing Packets are the basic unit of information flow in a store-and-forward network. The concept is illustrated in Fig. 7.27a. Each node is required to use a packet buffer. A packet is transmitted from a source node to a destination node through a sequence of intermediate nodes.
When a packet reaches an intermediate node, it is first stored in the buffer. Then it is forwarded to the next node if the desired output channel and a packet buffer in the receiving node are both available.
The latency in store-and-forward networks is directly proportional to the distance (the number of hops) between the source and the destination. This routing scheme was implemented in the first generation of multicomputers.
Wormhole Routing By subdividing the packet into smaller flits, later generations of multicomputers implement the wormhole routing scheme, as illustrated in Fig. 7.27b. Flit buffers are used in the hardware routers attached to nodes. The transmission from the source node to the destination node is done through a sequence of routers.
Fig. 7.27 Store-and-forward routing and wormhole routing between a source node and a destination node through intermediate nodes (Courtesy of Lionel Ni, 1991)
All the flits in the same packet are transmitted in order as inseparable companions in a pipelined fashion. The packet can be visualized as a railroad train with an engine car (the header flit) towing a long sequence of box cars (data flits).
Only the header flit knows where the train (packet) is going. All the data flits (box cars) must follow the header flit. Different packets can be interleaved during transmission. However, the flits from different packets cannot be mixed up; otherwise they may be towed to the wrong destinations.
We prove below that wormhole routing has a latency almost independent of the distance between the
source and the destination.
Asynchronous Pipelining The pipelining of successive flits in a packet is done asynchronously using a handshaking protocol, as shown in Fig. 7.28. Along the path, a 1-bit ready/request (R/A) line is used between adjacent routers.
When the receiving router (D) is ready (Fig. 7.28a) to receive a flit (i.e. the flit buffer is available), it pulls the R/A line low. When the sending router (S) is ready (Fig. 7.28b), it raises the line high and transmits flit i through the channel.
While the flit is being received by D (Fig. 7.28c), the R/A line is kept high. After flit i is removed from D's buffer (i.e. is transmitted to the next node) (Fig. 7.28d), the cycle repeats itself for the transmission of the next flit i + 1 until the entire packet is transmitted.
Fig. 7.28 Handshaking protocol between two wormhole routers S and D connected by a channel (Courtesy of Lionel Ni, 1991): (a) D is ready to receive a flit, (b) S is ready to send flit i, (c) flit i is received by D, (d) flit i is removed from D's buffer and flit i + 1 arrives at S's buffer
Asynchronous pipelining can be very efficient, and the clock used can be faster than that used in a synchronous pipeline. However, the pipeline can be stalled if flit buffers or successive channels along the path are not available during certain cycles. Should that happen, the packet can be buffered, blocked, discarded, or detoured. We will discuss these flow control methods in Section 14.3.
Latency Analysis A time comparison between store-and-forward and wormhole-routed networks is given in Fig. 7.29. Let L be the packet length (in bits), W the channel bandwidth (in bits/s), D the distance (number of nodes traversed minus 1), and F the flit length (in bits).
Fig. 7.29 Time-space diagrams of (a) store-and-forward routing and (b) wormhole routing over nodes N1 to N4, showing the latencies T_SF and T_WH
The latency T_SF for a store-and-forward network is expressed by

T_SF = (L/W)(D + 1)                                    (7.5)
The latency T_WH for a wormhole-routed network is expressed by

T_WH = L/W + (F/W)D                                    (7.6)
Equation 7.5 implies that T_SF is directly proportional to D. In Eq. 7.6, T_WH ≈ L/W if L >> F. Thus the distance D has a negligible effect on the routing latency.
We have ignored the network startup latency and the block time due to resource shortage (such as channels being busy or buffers being full, etc.). The channel propagation delay has also been ignored because it is much smaller than the terms in T_SF or T_WH.
According to the estimates given in Table 7.1, a typical first-generation value of T_SF is between 2000 and 6000 μs, while a typical value of T_WH is 5 μs or less. Current systems employ much faster processors, data links and routers. Both the latency figures above would therefore be smaller, but wormhole routing would still have a much lower latency than packet store-and-forward routing.
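The two expressions are easy to evaluate numerically; the sketch below uses illustrative parameter values only (chosen to be of the same order as the first-generation figures above, not measured data) and shows that T_WH is nearly flat in D while T_SF grows linearly.

```python
def t_sf(L, W, D):
    # Store-and-forward latency, Eq. 7.5: the whole packet is relayed D + 1 times.
    return (L / W) * (D + 1)

def t_wh(L, W, D, F):
    # Wormhole latency, Eq. 7.6: only the header experiences the per-hop delay.
    return L / W + (F / W) * D

L, W, F = 800, 2e6, 16        # 100-byte packet, 2 Mbit/s channel, 16-bit flits
for D in (1, 5, 10):
    print(f"D={D}:  T_SF={t_sf(L, W, D)*1e6:7.1f} us   T_WH={t_wh(L, W, D, F)*1e6:7.2f} us")
```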
Comparing the setup in Fig. 7.30 with that in Fig. 7.28, the difference lies in the added buffers at both ends. The sharing of a physical channel by a set of virtual channels is conducted by time-multiplexing on a flit-by-flit basis.
Fig. 7.31 Deadlock situations caused by a circular wait at buffers or at communication channels: (a) a buffer deadlock among four nodes (A, B, C, D) with store-and-forward routing using packet buffers, (b) a channel deadlock among four nodes with wormhole routing; shaded boxes are flit buffers
Four flits from four messages occupy the four channels simultaneously. If none of the channels in the cycle is freed, the deadlock situation will continue. Circular waits are further illustrated in Fig. 7.32 using a channel-dependence graph.
The channels involved are represented by nodes, and directed arrows are used to show the dependence relations among them. A deadlock avoidance scheme is presented using virtual channels.
Deadlock Avoidance By adding two virtual channels, V3 and V4 in Fig. 7.32c, one can break the deadlock cycle. A modified channel-dependence graph is obtained by using the virtual channels V3 and V4, after the use of channel C2, instead of reusing C3 and C4.
The cycle in Fig. 7.32b is thereby converted to a spiral, thus avoiding a deadlock. Channel multiplexing can be done at the flit level or at the packet level if the packet length is sufficiently short. Virtual channels can be implemented with either unidirectional channels or bidirectional channels.
Fig. 7.32 Deadlock avoidance using virtual channels to convert a cycle to a spiral on a channel-dependence graph: (c) adding two virtual channels (V3, V4), (d) a modified channel-dependence graph using the virtual channels
The use of virtual channels may reduce the effective channel bandwidth available to each request. There exists a tradeoff between network throughput and communication latency in determining the degree to which virtual channels are used. High-speed multiplexing is required for implementing a large number of virtual channels.
Packet Collision Resolution In order to move a flit between adjacent nodes in a pipeline of channels, three elements must be present: (1) the source buffer holding the flit, (2) the channel being allocated, and (3) the receiver buffer accepting the flit.
When two packets reach the same node, they may request the same receiver buffer or the same outgoing channel. Two arbitration decisions must be made: (i) Which packet will be allocated the channel? and (ii) What will be done with the packet denied the channel? These decisions lead to the four methods illustrated in Fig. 7.33 for coping with the packet collision problem.
Figure 7.33 illustrates four methods for resolving the conflict between two packets competing for the use of the same outgoing channel at an intermediate node. Packet 1 is allocated the channel, and packet 2 is denied. A buffering method has been proposed with the virtual cut-through routing scheme devised by Kermani and Kleinrock (1979).
Packet 2 is temporarily stored in a packet buffer. When the channel becomes available later, it will be transmitted then. This buffering approach has the advantage of not wasting the resources already allocated. However, it requires the use of a large buffer to hold the entire packet.
Furthermore, the packet buffers along the communication path should not form a cycle, as shown in Fig. 7.31a. The packet buffer may, however, cause significant storage delay. The virtual cut-through method offers a compromise by combining the store-and-forward and wormhole routing schemes. When collisions do not occur, the scheme should perform as well as wormhole routing. In the worst case, it will behave like a store-and-forward network.
Pure wormhole routing uses a blocking policy in case of packet collision, as illustrated in Fig. 7.33b. The second packet is blocked from advancing; however, it is not abandoned. Figure 7.33c shows the discard policy, which simply drops the packet being blocked from passing through.
The fourth policy is called detour (Fig. 7.33d). The blocked packet is routed to a detour channel. The blocking policy is economical to implement but may result in the idling of resources allocated to the blocked packet.
Fig. 7.33 Flow control methods for resolving a collision between two packets requesting the same outgoing channel (packet 1 being allocated the channel and packet 2 being denied)
The discard policy may result in a severe waste of resources, and it demands packet retransmission and acknowledgment. Otherwise, a packet may be lost after discarding. This policy is rarely used now because of its unstable packet delivery rate. The BBN Butterfly network had used this discard policy.
Detour routing offers more flexibility in packet routing. However, the detour may waste more channel resources than necessary to reach the destination. Furthermore, a re-routed packet may enter a cycle of livelock, which wastes network resources. Both the Connection Machine and the Denelcor HEP had used this detour policy.
In practice, some multicomputer networks use hybrid policies which may combine the advantages of some of the above flow control policies.
Dimension-Order Routing Packet routing can be conducted deterministically or adaptively. In deterministic routing, the communication path is completely determined by the source and destination addresses. In other words, the routing path is uniquely predetermined in advance, independent of network conditions.
Adaptive routing may depend on network conditions, and alternate paths are possible. In both types of routing, deadlock-free algorithms are desired. Two such deterministic routing algorithms are given below, based on a concept called dimension-order routing.
Dimension-order routing requires the selection of successive channels to follow a specific order based on the dimensions of a multidimensional network. In the case of a two-dimensional mesh network, the scheme is called X-Y routing because a routing path along the X-dimension is decided first before choosing a path along the Y-dimension. For hypercube (or n-cube) networks, the scheme is called E-cube routing as originally proposed by Sullivan and Bashkow (1977). These two routing algorithms are described below by presenting examples.
E-cube Routing on a Hypercube Consider an n-cube with N = 2^n nodes. Each node b is binary-coded as b = b_{n-1} b_{n-2} ... b_1 b_0. Thus the source node is s = s_{n-1} ... s_1 s_0 and the destination node is d = d_{n-1} ... d_1 d_0. We want to determine a route from s to d with a minimum number of steps.
We denote the n dimensions as i = 1, 2, ..., n, where the ith dimension corresponds to the (i - 1)st bit in the node address. Let v = v_{n-1} ... v_1 v_0 be any node along the route. The route is uniquely determined as follows:
1. Compute the direction bit r_i = s_{i-1} ⊕ d_{i-1} for all n dimensions (i = 1, ..., n). Start the following with dimension i = 1 and v = s.
2. Route from the current node v to the next node v ⊕ 2^{i-1} if r_i = 1. Skip this step if r_i = 0.
3. Move to dimension i + 1 (i.e. i ← i + 1). If i ≤ n, go to step 2, else done.
Example 7.4 E-cube routing on a four-dimensional hypercube
The above E-cube routing algorithm is illustrated with the example in Fig. 7.34. Now n = 4, s = 0110, and d = 1101. Thus r = r4 r3 r2 r1 = 1011. Route from s to s ⊕ 2^0 = 0111 since r1 = 0 ⊕ 1 = 1. Route from v = 0111 to v ⊕ 2^1 = 0101 since r2 = 1 ⊕ 0 = 1. Skip dimension i = 3 because r3 = 1 ⊕ 1 = 0. Route from v = 0101 to v ⊕ 2^3 = 1101 = d since r4 = 1.
Fig. 7.34 E-cube routing on a four-dimensional hypercube: source s = 0110, destination d = 1101, route 0110 → 0111 → 0101 → 1101
The route selected is shown in Fig. 7.34 by arrows. Note that the route is determined from dimension 1 to dimension 4 in order. If the ith bit of s and d agree, no routing is needed along dimension i. Otherwise, move from the current node to the other node along the same dimension. The procedure is repeated until the destination is reached.
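A direct transcription of the three-step procedure is sketched below (node addresses are plain integers and dimension i corresponds to bit i - 1); it reproduces the route of Example 7.4.

```python
def e_cube_route(s: int, d: int, n: int):
    """E-cube routing on an n-cube: return the sequence of nodes from s to d."""
    r = s ^ d                        # direction bits r_i = s_{i-1} XOR d_{i-1}
    v, path = s, [s]
    for i in range(1, n + 1):        # visit dimensions 1..n in order
        if (r >> (i - 1)) & 1:       # route only along dimensions where bits differ
            v ^= 1 << (i - 1)        # v <- v XOR 2^(i-1)
            path.append(v)
    return path

# Example 7.4: s = 0110, d = 1101 on a 4-cube.
print([format(v, "04b") for v in e_cube_route(0b0110, 0b1101, 4)])
# -> ['0110', '0111', '0101', '1101']
```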
X-Y Routing on a 2D Mesh The same idea is applicable to mesh-connected networks. X-Y routing is illustrated by the example in Fig. 7.35. From any source node s = (x1, y1) to any destination node d = (x2, y2), route from s along the X-axis first until the path reaches column x2, where d is located. Then route to d along the Y-axis.
There are four possible X-Y routing patterns corresponding to the east-north, east-south, west-north, and west-south paths chosen.
Example 7.5 X-Y routing on a 2D mesh-connected multicomputer
Four (source, destination) pairs are shown in Fig. 7.35 to illustrate the four possible routing patterns on a two-dimensional mesh.
An east-north route is needed from node (2,1) to node (7,6). An east-south route is set up from node (0,7) to node (4,2). A west-south route is needed from node (5,4) to node (2,0). The fourth route is west-north bound from node (6,3) to node (1,5). If the X-dimension is always routed first and then the Y-dimension, a deadlock or circular wait situation will not exist.
Fig. 7.35 Four (source, destination) pairs and their X-Y routes on a two-dimensional mesh, illustrating the east-north, east-south, west-south, and west-north routing patterns
It is left as an exercise for the reader to prove that both the E-cube and X-Y schemes result in deadlock-free routing. Both can be applied in either store-and-forward or wormhole-routed networks, resulting in a minimal route with the shortest distance between source and destination.
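For comparison, the X-Y scheme can be sketched the same way (coordinates are (x, y) pairs; the X-dimension is always exhausted before any Y-dimension hop is taken).

```python
def xy_route(src, dst):
    """Deterministic X-Y routing on a 2D mesh: route along X first, then along Y."""
    (x, y), (x2, y2) = src, dst
    path = [(x, y)]
    while x != x2:                       # X-dimension first
        x += 1 if x2 > x else -1
        path.append((x, y))
    while y != y2:                       # then the Y-dimension
        y += 1 if y2 > y else -1
        path.append((x, y))
    return path

# East-north example from Fig. 7.35: node (2,1) to node (7,6), a minimal 10-hop route.
print(xy_route((2, 1), (7, 6)))
```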
However, the same dimension-order routing scheme cannot produce minimal routes for torus networks. Nonminimal routing algorithms, producing deadlock-free routes, allow packets to traverse longer paths, sometimes to reduce network traffic or for other reasons.
Adaptive Routing The main purpose of using adaptive routing is to achieve efficiency and avoid deadlock. The concept of virtual channels makes adaptive routing more economical and feasible to implement. We have shown in Fig. 7.32 how to apply virtual channels for this purpose. The idea can be further extended by having virtual channels in all connections along the same dimension of a mesh-connected network (Fig. 7.36).
Fig. 7.36 Adaptive X-Y routing using virtual channels to avoid deadlock; only westbound and eastbound traffic are deadlock-free (Courtesy of Lionel Ni, 1991): (a) original mesh without virtual channels, (b) two pairs of virtual channels in the Y-dimension, (c) channels used for a westbound message, (d) channels used for an eastbound message
In what follows, we consider the requirements for implementing multicast, broadcast, and conference communication patterns. Of course, all patterns can be implemented with multiple unicasts sequentially, or even simultaneously if resource conflicts can be avoided. Special routing schemes must be used to implement these multi-destination patterns.
Routing Efficiency Two commonly used efficiency parameters are channel bandwidth and communication latency. The channel bandwidth at any time instant (or during any time period) indicates the effective data transmission rate achieved to deliver the messages. The latency is indicated by the packet transmission delay involved.
An optimally routed network should achieve both maximum bandwidth and minimum latency for the communication patterns involved. However, these two parameters are not totally independent. Achieving maximum bandwidth may not necessarily achieve minimum latency at the same time, and vice versa. Depending on the switching technology used, latency is the more important issue in a store-and-forward network, while in general bandwidth affects efficiency more in a wormhole-routed network.
Example 7.7 Multicast and broadcast on a mesh-connected computer
Multicast routing is implemented on a 3 × 3 mesh in Fig. 7.37. The source node is identified as S, which transmits a packet to five destinations labeled Di for i = 1, 2, ..., 5.
Fig. 7.37 Multiple unicasts, multicast patterns, and a broadcast tree on a 3 × 4 mesh computer: (a) five unicasts with traffic = 13 and distance = 4, (b) a multicast pattern with traffic = 7 and distance = 4, (c) another multicast pattern with traffic = 6 and distance = 5, (d) broadcast to all nodes via a tree (numbers in nodes correspond to levels of the tree)
This five-destination multicast can be implemented by five unicasts, as shown in Fig. 7.37a. The X-Y routing traffic requires the use of 1 + 3 + 4 + 3 + 2 = 13 channels, and the latency is 4 for the longest path, leading to D3.
A multicast can be implemented by replicating the packet at an intermediate node, so that multiple copies of the packet reach their destinations with significantly reduced channel traffic.
Two multicast routes are given in Figs. 7.37b and 7.37c, resulting in traffic of 7 and 6, respectively. On a wormhole-routed network, the multicast route in Fig. 7.37c is better. For a store-and-forward network, the route in Fig. 7.37b is better and has a shorter latency.
A four-level spanning tree is used from node S to broadcast a packet to all the mesh nodes in Fig. 7.37d. Nodes reached at level i of the tree have latency i. This broadcast tree should result in minimum latency as well as in minimum traffic.
Example 7.8 Multicast and broadcast on a hypercube computer
To broadcast on an n-cube, a similar spanning tree is used to reach all nodes within a latency of n. This is illustrated in Fig. 7.38a for a 4-cube rooted at node 0000. Again, minimum traffic should result with a broadcast tree for a hypercube.
Fig. 7.38 Broadcast tree and multicast tree on a 4-cube using a greedy algorithm (Lan, Esfahanian, and Ni, 1990): (a) a broadcast tree rooted at node 0000, (b) a multicast tree from node 0101 to seven destination nodes
A greedy multicast tree is shown in Fig. 7.38b for sending a packet from node 0101 to seven destination nodes. The greedy multicast algorithm is based on sending the packet through the dimension(s) which can reach the greatest number of remaining destinations.
Starting from the source node S = 0101, there are two destinations via dimension 2 and five destinations via dimension 4. Therefore, the first-level channels used are 0101 → 0111 and 0101 → 1101.
From node 1101, there are three destinations reachable via dimension 2 and four destinations via dimension 1. Thus the second-level channels used include 1101 → 1111, 1101 → 1100, and 0111 → 0110.
Similarly, the remaining destinations can be reached with the third-level channels 1111 → 1110, 1111 → 1011, 1100 → 1000, and 0110 → 0010, and the fourth-level channel 1110 → 1010.
Extending the multicast tree, one should compare the reachability via all dimensions before selecting certain dimensions to obtain a minimum cover set for the destination nodes. In case of a tie between two dimensions, selecting any one of them is sufficient. Therefore, the tree may not be uniquely generated.
It has been proved that this greedy multicast algorithm requires the least number of traffic channels compared with multiple unicasts or a broadcast tree. To implement multicast operations on wormhole-routed networks, the router in each node should be able to replicate the data in the flit buffer.
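A simplified sketch of the greedy idea is given below; it is an illustrative reconstruction, not the exact Lan-Esfahanian-Ni algorithm. At each node it forwards a copy across the dimension that is useful to the largest number of still-uncovered destinations, hands that subset of destinations to the copy, and repeats until every destination is covered.

```python
def greedy_multicast(src: int, dests: set, n: int):
    """Build a multicast tree on an n-cube; returns the list of channels (edges) used."""
    dests = {d for d in dests if d != src}
    edges, remaining = [], set(dests)
    while remaining:
        # For each dimension, the destinations that get one hop closer by crossing it.
        covered = {j: {d for d in remaining if (src ^ d) >> j & 1} for j in range(n)}
        best = max(covered, key=lambda j: len(covered[j]))   # greedy dimension choice
        neighbor = src ^ (1 << best)
        edges.append((src, neighbor))
        edges += greedy_multicast(neighbor, covered[best], n)
        remaining -= covered[best]
    return edges

# Illustrative run (destination set chosen arbitrarily, not the one in Fig. 7.38b).
for a, b in greedy_multicast(0b0101, {0b0111, 0b1111, 0b0010}, 4):
    print(format(a, "04b"), "->", format(b, "04b"))
```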
In order to synchronize the growth of a multicast tree or a broadcast tree, all outgoing channels at the same level of the tree must be ready before transmission can be pushed one level down. Otherwise, additional buffering is needed at intermediate nodes.
Virtual Networks Consider a mesh with dual virtual channels along both dimensions, as shown in Fig. 7.39a. These virtual channels can be used to generate four possible virtual networks. For west-north traffic, the virtual network in Fig. 7.39b should be used.
Fig. 7.39 Four virtual networks implementable from a dual-channel mesh: (a) the mesh with dual virtual channels in both dimensions, (b) west-north subnet, (c) east-north subnet, (d) west-south subnet, (e) east-south subnet
Similarly, one can construct three other virtual nets for the other traffic orientations. Note that no cycle is possible on any of the virtual networks. Thus deadlock can be completely avoided when X-Y routing is implemented on these networks.
If both pairs between adjacent nodes are physical channels, then any two of the four virtual networks can be used simultaneously without conflict. If only one pair of physical channels is shared by the dual virtual channels between adjacent nodes, then only (b) and (e) or (c) and (d) can be used simultaneously.
Other combinations, such as (b) and (c), or (b) and (d), or (c) and (e), or (d) and (e), cannot coexist at the same time due to a shortage of channels.
Obviously, adding channels to the network will increase the adaptivity in making routing decisions.
However, the increased cost can be appreciable and thus prevent the use of redundancy.
Network Partitioning The concept of virtual networks leads to the partitioning of a given physical network into logical subnetworks for multicast communications. The idea is illustrated in Fig. 7.40.
Fig. 7.40 Partitioning of a 6 × 8 mesh into four subnets for a multicast from source node (4,2). Shaded nodes are along the boundary of adjacent subnets (Courtesy of Lin, McKinley, and Ni, 1991)
Suppose source node (4, 2) wants to transmit to a subset of nodes in the 6 × 8 mesh. The mesh is partitioned into four logical subnets. All traffic heading for the east and north uses the subnet at the upper right corner.
Similarly, one constructs three other subnets at the remaining corners of the mesh. Nodes in the fifth column and third row are along the boundary between subnets. Essentially, the traffic is directed outward from the center node (4, 2). There is no deadlock if an X-Y multicast is performed in this partitioned mesh.
Similarly, one can partition a binary n-cube into 2^(n-1) subcubes to provide deadlock-free adaptive routing. Each subcube has n + 1 levels with 2^n virtual channels per level for the bidirectional network. The number of required virtual channels increases rapidly with n. It has been shown that for low-dimensional cubes (n = 2 to 4), this method is best for general-purpose routing.
Summary
In a multiprocessor system, interconnects between sub-systems such as processors, memories and network controllers play a crucial role in determining system performance. The earliest multiprocessor systems were bus-based, with shared main memory. The bus is a simple interconnect but it has limitations in scalability. Hierarchical bus systems can address the problem to a limited extent, but as systems grow larger, more sophisticated and scalable system interconnects are needed.
A network may be of blocking or non-blocking type. We studied the crossbar network and the basic design of a row of crosspoint switches, with its arbitration and multiplexer modules. While it has better aggregate bandwidth than the bus, the crossbar network also has limitations of scalability. Multi-port memory can be used to enhance the aggregate bandwidth of a memory module.
We studied Omega and Butterfly multistage networks. Larger Omega networks can be built using 2×2 and 4×4 basic switches, while the Butterfly network is built from modules of crossbar switches. When network traffic is non-uniform, so-called 'hot spots' may develop which may degrade network performance. The concept of combining networks was developed in an attempt to address this performance limitation.
We studied the related issues of maintaining cache coherence and synchronization. Write operations on shared cache data, process migration and I/O operations can cause loss of cache coherence. If all the caches are on a common bus, then a snoopy bus protocol can be used to maintain cache coherence. Directory-based cache coherence protocols, using full-map, limited or chained directories, can be used on more general types of system interconnects. Details of the schemes vary between write-back and write-through types of cache.
Hardware synchronization mechanisms between processors make use of atomic operations typified by Test&Set. However, at a still lower level of hardware, in theory wired barrier synchronization can also be used, of which we saw examples.
Three early generations of multicomputer systems were studied, providing a picture of how multicomputer architecture has evolved over time. Broadly, the trend has been from expensive to low-cost processors, from shared to distributed memory, and (with higher speed processors) to higher speed interconnects. We studied the Intel Paragon system as a specific example, laying the basis to review more recent advances in Chapter 13.
Message-passing communication uses networks of point-to-point links, the basic aim of routing protocols being to achieve low network latency and high bandwidth. We studied the typical formats of messages, packets, and flits (flow control digits); routing schemes were studied from the points of view of latency analysis and the avoidance of deadlocks. We examined the important concepts of virtual channels, wormhole routing, flow control, collision resolution, dimension-order routing, and multicast communication.
Exercises
Problem 7.1 Consider a multiprocessor with n processors and m shared-memory modules, all connected to the same backplane bus (the Data Transfer Bus, DTB) with a central arbiter as depicted below:
(a) Calculate the memory bandwidth, defined as the average number of memory words transferred per second over the DTB, if n = 8, m = 16, τ = 10 ns, and c = 8τ = 80 ns.
(b) Calculate the memory utilization, defined as the average number of requests accepted by all memory modules per memory cycle, using the same set of parameters used in part (a).
(Diagram: processors and shared-memory modules M1, ..., Mm attached to the Data Transfer Bus through a central arbiter.)
... interstage connection pattern from b^n inputs to b^n outputs.
(d) Figure out a simple routing scheme to control the switch settings from stage to stage in a b^n × b^n Delta network with n stages.
(e) What is the relationship between Omega networks and Delta networks?

Problem 7.7 Prove the following properties associated with multistage Omega networks using different-sized building blocks:
(a) Prove that the number of legitimate states (connections) in a k × k switch module equals k^k.
(b) Determine the percentage of permutations that can be realized in one pass through a 64-input Omega network built with 2 × 2 switch modules.
(c) Repeat part (b) for a 64-input Omega network built with 8 × 8 switch modules.
(d) Repeat part (b) for a 512-input Omega network built with 8 × 8 switch modules.

Problem 7.8 Consider the interleaved execution of k programs in a multiprogrammed multiprocessor using m wired-NOR synchronization lines on n processors, as described in Fig. 7.19a.
In general, the number m_i of barrier lines needed for program i is estimated as m_i = b_i ⌈q_i / P_i⌉ + 1, where b_i = the number of barriers demanded in program i, q_i = the number of processes created in program i, and P_i = the number of processors allocated to program i.
Thus m = m_1 + m_2 + ... + m_k. For simplicity assume b_i = b and q_i = q for i = 1, 2, ..., k, and that P_i = min(n/k, q) processors are allocated to each program i.
Prove that m can be approximated by b·q·k²/n + k, or that the degree of multiprogramming is k ≤ [-n + √(n² + 4bqmn)] / (2bq) in such a multiprocessor system. Note that bq represents the number of required synchronization points, which depends on the parallelism profiles in user programs. For fixed values of b, q and n, the maximally allowed multiprogramming degree k increases with respect to m.

Problem 7.9 Wilson (1987) proposed a hierarchical cache/bus architecture (Fig. 7.3) and outlined how multilevel cache coherence can be enforced by extending the write-invalidate protocol. Can you figure out a write-broadcast protocol for achieving multilevel cache coherence on the same hardware platform? Comment on the relative merits of the two protocols. Feel free to modify the hardware in Fig. 7.3 if needed to implement the write-broadcast protocol on the hierarchical bus/cache architecture.

Problem 7.10 Answer the following questions on design choices of multicomputers made in the past:
(a) Why were low-cost processors chosen over expensive processors as processing nodes?
(b) Why was distributed memory chosen over shared memory?
(c) Why was message passing chosen over address switching?
(d) Why was MIMD, MPMD, or SPMD control chosen over SIMD data parallelism?

Problem 7.11 Explain the following terms associated with multicomputer networks and message-passing mechanisms:
(a) Message, packets, and flits.
(b) Store-and-forward routing at packet level.
(c) Wormhole routing at flit level.
(d) Virtual channels versus physical channels.
(e) Buffer deadlock versus channel deadlock.
(f) Buffering flow control using virtual cut-through routing.
(g) Blocking flow control in wormhole routing.
(h) Discard and retransmission flow control.
(i) Detour flow control after being blocked.
(j) Virtual networks and subnetworks.
Problem 7.18 Consider the implementation of Goodman's write-once cache coherence protocol in a bus-connected multiprocessor system. Specify the use of additional bus lines to inhibit the main memory when the memory copy is invalid. Also specify all other hardware mechanisms and software support needed for an economical and fast implementation of the Goodman protocol.
Explain why this protocol will reduce bus traffic and how unnecessary invalidations can be eliminated. Consult if necessary the two related papers published by Goodman in 1983 and 1990.

Problem 7.19 Study the paper by Archibald and Baer (1986) which evaluated various cache coherence protocols using a multiprocessor simulation model. Explain the Dragon protocol implemented in the Dragon multiprocessor workstation at the Xerox Palo Alto Research Center. Compare the relative merits of the Goodman protocol, the Firefly protocol, and the Dragon protocol in the context of implementation requirements and expected performance.

Problem 7.20 The Cedar multiprocessor at Illinois was built with a clustered Omega network as shown below. Four 8 × 4 crossbar switches were used in the first stage and four 4 × 8 crossbar switches were used in the second stage. There were 32 processors and 32 memory modules, divided into four clusters with eight of each per cluster.
(Diagram: processors connected through Stage 1 and Stage 2 crossbar switches to the memories.)
(a) Figure out a fixed priority scheme to avoid conflicts in using the crossbar switches for nonblocking connections. For simplicity ...
(b) ... switches. Design a two-stage Cedar network to provide switched connections between 64 processors and 64 memory modules, again in a clustered manner similar to the above Cedar network design.
(c) Further expand the Cedar network to three stages using 8 × 8 crossbar switches as building blocks to connect 512 processors and 512 memory modules. Show the schematic interconnections in all three stages from the input end to the output end.
In general, vector processing is faster and more efficient than scalar processing. Both pipelined processors and SIMD computers can perform vector operations. Vector processing reduces the software overhead incurred in the maintenance of looping control, reduces memory-access conflicts, and above all matches nicely with the pipelining and segmentation concepts to generate one result per clock cycle continuously.
Depending on the speed ratio between vector and scalar operations (including startup delays and other overheads) and on the vectorization ratio in user programs, a vector processor executing a well-vectorized code can easily achieve a speedup of 10 to 20 times, as compared with scalar processing on conventional machines.
Of course, the enhanced performance comes with increased hardware and compiler costs, as expected. A compiler capable of vectorization is called a vectorizing compiler or simply a vectorizer. For successful vector processing, one needs to make improvements in vector hardware, vectorizing compilers, and programming skills specially targeted at vector machines.
Vector Instruction Types We briefly introduced basic vector instructions in Chapter 4. Characterized below are vector instructions for register-based, pipelined vector machines. Six types of vector instructions are illustrated in Figs. 8.1 and 8.2. We define these vector instruction types by mathematical mappings between their working registers or memory, where vector operands are stored.
Fig. 8.1 Vector instruction types: (a) vector-vector instructions, (b) vector-scalar instructions, (c) vector-memory instructions (vector load and vector store through the memory path)
(1) Vector-vector instructions As shown in Fig. 8.1a, one or two vector operands are fetched from the respective vector registers, enter through a functional pipeline unit, and produce results in another vector register. These instructions are defined by the following two mappings:

f1 : Vi → Vj                                    (8.1)
f2 : Vj × Vk → Vi                               (8.2)
Examples are V1 = sin(V2) and V3 = V1 + V2 for the mappings f1 and f2, respectively, where Vi for i = 1, 2, and 3 are vector registers.
(2) Vector-scalar instructions Figure 8.1b shows a vector-scalar instruction corresponding to the following mapping:

f3 : s × Vk → Vi                                (8.3)

An example is a scalar product s × V1 = V3, in which the elements of V1 are each multiplied by a scalar s to produce vector V3 of equal length.
(3) Vector-memory instructions This corresponds to vector load or vector store (Fig. 8.1c), element by element, between a vector register (V) and the memory (M), as defined below:

f4 : M → V    vector load                       (8.4)
f5 : V → M    vector store                      (8.5)

The offsets (indices) from the base address are retrieved from the vector register V0. The effective memory addresses are obtained by adding the base address to the indices.
Fig. 8.2 Gather, scatter, and masking operations on the Cray Y-MP (Courtesy of Cray Research, 1990): (a) gather instruction, (b) scatter instruction, (c) masking instruction
The scatter instruction reverses the mapping operations, as illustrated in Fig. 8.2b. Both the VL and A0 registers are embedded in the instruction.
The masking instruction is shown in Fig. 8.2c for compressing a long vector into a short index vector. The contents of vector register V0 are tested for zero or nonzero elements. A masking register (VM) is used to store the test results. After testing and forming the masking vector in VM, the corresponding nonzero indices are stored in the V1 register. The VL register indicates the length of the vector being tested.
The gather, scatter, and masking instructions are very useful in handling sparse vectors or sparse matrices often encountered in practical vector processing applications. Sparse matrices are those in which most of the entries are zeros. Advanced vector processors implement these instructions directly in hardware.
The above instruction types cover the most important ones. A given specific vector processor may implement an instruction set containing only a subset or even a superset of the above instructions.
Vector Operand Specifications Vector operands may have arbitrary length. Vector elements are not
necessarily stored in contiguous memory locations. For example, the entries in a matrix may be stored in
row-major or in column-major order. Each row, column, or diagonal of the matrix can be used as a vector.
When row elements are stored in contiguous locations with a unit stride, the column elements are stored
with a stride of n, where n is the matrix order. Similarly, the diagonal elements are separated by a stride
of n + 1.
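To make these stride rules concrete, the short sketch below (an illustration with a hypothetical base address and matrix order) lists the word addresses of a row, a column, and the main diagonal of an n × n matrix stored in row-major order.

    def vector_addresses(base, stride, length):
        # Addresses of 'length' elements starting at 'base', 'stride' words apart.
        return [base + i * stride for i in range(length)]

    n = 4              # matrix order (hypothetical)
    base = 100         # base address of the matrix (hypothetical)

    row      = vector_addresses(base, 1,     n)   # unit stride
    column   = vector_addresses(base, n,     n)   # stride n
    diagonal = vector_addresses(base, n + 1, n)   # stride n + 1
    print(row, column, diagonal)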
To access a vector in memory, one must specify its base address, stride, and length. Since each vector
register has a fixed number of component registers, only a segment of the vector can be loaded into the vector
register in a fixed number of cycles. Long vectors must be segmented and processed one segment at a time.
Vector operands should be stored in memory to allow pipelined or parallel access. The memory system for
a vector processor must be specifically designed to enable fast vector access. The access rate should match
the pipeline rate. In fact, the access path is often itself pipelined and is called an access pipe. These vector-
access memory organizations are described below.
C-Access Memory Organization The m-way low-order interleaved memory structure shown in
Figs. 5.15a and 5.16 allows m memory words to be accessed concurrently in an overlapped manner. This
concurrent access has been called C-access, as illustrated in Fig. 5.16b.
The access cycles in different memory modules are staggered. The low-order a bits select the modules,
and the high-order b bits select the word within each module, where m = 2^a and a + b = n is the address length.
To access a vector with a stride of 1, successive addresses are latched in the address buffer at the rate of
one per cycle. Effectively it takes m minor cycles to fetch m words, which equals one (major) memory cycle
as stated in Eq. 5.4 and Fig. 5.16b.
If the stride is 2, successive accesses must be separated by two minor cycles in order to avoid access
conflicts. This reduces the memory throughput by one-half. If the stride is 3, there is no module conflict and
the maximum throughput (m words) results. In general, C-access will yield the maximum throughput of m
words per memory cycle if the stride is relatively prime to m, the number of interleaved memory modules.
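The rule that C-access achieves its full rate only when the stride is relatively prime to m can be checked with a short sketch; m and the strides below are illustrative values, not tied to a particular machine.

    from math import gcd

    def c_access_words_per_cycle(m, stride):
        # With low-order interleaving, module = address mod m, so a stride s
        # touches m / gcd(m, s) distinct modules before repeating; that is
        # roughly the number of words deliverable per major memory cycle.
        return m // gcd(m, stride)

    m = 8   # 8-way interleaving (hypothetical)
    for s in (1, 2, 3, 4, 8):
        print("stride", s, "->", c_access_words_per_cycle(m, s), "words per cycle")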
S-Access Memory Organization The low-order interleaved memory can be rearranged to allow
simultaneous access, or S-access, as illustrated in Fig. 8.3a. In this case, all memory modules are accessed
simultaneously in a synchronized manner. Again the high-order (n - a) bits select the same offset word from
each module.
[Fig. 8.3 The S-access interleaved memory for vector operand access: (a) S-access organization for an m-way interleaved memory, with the high-order address bits applied to all memory modules, a data latch on each module, and the low-order address bits multiplexing one word out per minor cycle]
At the end of each memory cycle (Fig. 8.3b), m = 2^a consecutive words are latched in the data buffers
simultaneously. The low-order a bits are then used to multiplex the m words out, one per minor cycle.
If the minor cycle is chosen to be 1/m of the major memory cycle (Eq. 5.4), then it takes two memory cycles
to access m consecutive words.
However, if the access phase of the last access is overlapped with the fetch phase of the current access
(Fig. 8.3b), effectively m words take only one memory cycle to access. If the stride is greater than 1, the
throughput decreases, roughly in proportion to the stride.
C/S-Access Memory Organization A memory organization in which C-access and S-access are
combined is called C/S-access. This scheme is shown in Fig. 8.4, where n access buses are used with m
interleaved memory modules attached to each bus. The m modules on each bus are m-way interleaved to
allow C-access. The n buses operate in parallel to allow S-access. In each memory cycle, at most m · n words
are fetched if the n buses are fully used with pipelined memory accesses.
[Fig. 8.4 The C/S-access memory organization: n processors attached through a system interconnect to n buses, each bus serving m interleaved memory modules]
The C/S-access memory is suitable for use in vector multiprocessor configurations. It provides parallel
pipelined access of a vector data set with high bandwidth. A special vector cache design is needed within each
processor in order to guarantee smooth data movement between the memory and multiple vector processors.
The Cray Research Series Seymour Cray founded Cray Research, Inc. in 1972. Since then, hundreds
of units of Cray supercomputers have been produced and installed worldwide. As we shall see in Chapter 13,
the company has gone through a change of name and an evolution of its product line.
The Cray 1 was introduced in 1975. An enhanced version, the Cray 1S, was produced in 1979. It was the
first ECL-based supercomputer with a 12.5-ns clock cycle. High degrees of pipelining and vector processing
were the major features of these machines.
Ten functional pipelines could run simultaneously in the Cray 1S to achieve a computing power equivalent
to that of 10 IBM 3033's or CDC 7600's. Only batch processing with a single user was allowed when
the Cray 1 was initially introduced, using the Cray Operating System (COS) with a Fortran 77 compiler (CFT Version 2.1).
The Cray X-MP Series introduced multiprocessor configurations in 1983. Steve Chen led the effort at Cray
Research in developing this series using one to four Cray 1-equivalent CPUs with shared memory. A unique
feature introduced with the X-MP models was shared register clusters for fast interprocessor communications
without going through the shared memory.
Besides 128 Mbytes of shared memory, the X-MP system had 1 Gbyte of solid-state storage (SSD) as
extended shared memory. The clock rate was also reduced to 8.5 ns. The peak performance of the X-MP
416 was 840 Mflops when eight vector pipelines for add and multiply were used simultaneously across four
processors.
The successor to the Cray X-MP was the Cray Y-MP, introduced in 1988 with up to eight processors in a
single system using a 6-ns clock rate and 256 Mbytes of shared memory.
The Cray Y-MP C-90 was introduced in 1990 to offer an integrated system with 16 processors using a
4.2-ns clock. We will study the Y-MP 816 and C-90 models in detail in the next section.
Another product line was the Cray 2S, introduced in 1985. The system allowed up to four processors with
2 Gbytes of shared memory and a 4.1-ns clock. A major contribution of the Cray 2 was the switch from
batch processing under COS to multiuser UNIX System V on a supercomputer. This led to the UNICOS operating
system, derived from UNIX System V and Berkeley BSD 4.3, variants of which are currently in use in some Cray
computer systems.
The Cyber/ETA Series Control Data Corporation (CDC) introduced its first supercomputer, the STAR-100,
in 1973. The Cyber 205, produced in 1982, was its successor. The Cyber 205 ran at a 20-ns clock rate, using up to
four vector pipelines in a uniprocessor configuration.
Different from the register-to-register architecture used in Cray and other supercomputers, the Cyber
205 and its successor, the ETA 10, had a memory-to-memory architecture with longer vector instructions
containing memory addresses.
The largest ETA 10 consisted of 8 CPUs sharing memory and 18 I/O processors. The peak performance
of the ETA 10 was targeted for 10 Gflops. Both the Cyber and the ETA Series are no longer in production but
were in use for many years at several supercomputer centers.
Japanese Supercomputers NEC produced the SX-X Series with a claimed peak performance of 22 Gflops
in 1991. Fujitsu produced the VP-2000 Series with a 5-Gflops peak performance at the same time. These two
machines used 2.9- and 3.2-ns clocks, respectively.
Shared communication registers and reconfigurable vector registers were special features in these
machines. Hitachi offered the 820 Series providing a 3-Gflops peak performance. Japanese supercomputers
were at one time strong in high-speed hardware and interactive vectorizing compilers.
The NEC SX-X 44 NEC claimed that this machine was the fastest vector supercomputer (22 Gflops peak)
ever built up to 1992. The architecture is shown in Fig. 8.5. One of the major contributions to this performance
was the use of a 2.9-ns clock cycle based on VLSI and high-density packaging.
There were four arithmetic processors communicating through either the shared registers or the shared
memory of 2 Gbytes. There were four sets of vector pipelines per processor, each set consisting of two add/
shift and two multiply/logical pipelines. Therefore, 64-way parallelism was obtained with four processors,
similar to that in the C-90.
Besides the vector unit, a high-speed scalar unit employed a RISC architecture with 128 scalar registers.
Instruction reordering was supported to exploit higher parallelism. The main memory was 1024-way
interleaved. The extended memory of up to 16 Gbytes provided a maximum transfer rate of 2.75 Gbytes/s.
A maximum of four I/O processors could be configured to accommodate a 1-Gbyte/s data transfer rate per
I/O processor. The system could provide a maximum of 256 channels for high-speed network, graphics, and
peripheral operations. The support included 100-Mbytes/s channels.
[Fig. 8.5 The NEC SX-X 44 vector supercomputer architecture (Courtesy of NEC, 1991). Captions: XMU, extended memory unit; IOP, I/O processors (4); DCP, data control processors (2); AP, arithmetic processors (4); MMU, main memory unit; DPM, data control processor memory. Each arithmetic processor contains four sets of vector pipelines for add/shift and multiply/logical vector operations, plus a scalar unit with scalar registers, a cache, and a scalar pipe.]
Relative Vector/Scalar Performance Let r be the vector/scalar speed ratio and f the vectorization ratio.
By Amdahl's law in Section 3.3.1, the following relative performance can be defined:
P = 1/((1 - f) + f/r) = r/((1 - f)r + f)    (8.11)
This relative performance indicates the speedup of vector processing over scalar processing.
The hardware speed ratio r is the designer's choice. The vectorization ratio f reflects the percentage of code
in a user program which is vectorized.
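A direct transcription of Eq. 8.11 makes this sensitivity easy to see; the values of r and f below are illustrative only.

    def relative_performance(f, r):
        # Eq. 8.11: speedup of vector over scalar processing for a
        # vectorization ratio f and vector/scalar speed ratio r.
        return 1.0 / ((1.0 - f) + f / r)

    for r in (5, 10, 25):
        for f in (0.3, 0.7, 0.9):
            print(f"r={r:2d} f={f:.1f}  P={relative_performance(f, r):.2f}")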
The relative performance is rather sensitive to the value of f. This value can be increased by using a
better vectorizing compiler or through user program transformations. The following example shows the IBM
experience in vector processing with the 3090/VF computer system.
Example 8.2 The vector/scalar relative performance of the IBM 3090/VF
Figure 8.6 plots the relative performance P as a function of r with f as a running parameter. The higher the
value of f, the higher the relative speedup. The IBM 3090 with vector facility (VF) was a high-end mainframe
with add-on vector hardware.
[Fig. 8.6 Speedup performance of vector processing over scalar processing in the IBM 3090/VF: relative performance P plotted against r = 1 to 10 with the vectorization ratio f (30% to 80%) as a running parameter (Courtesy of IBM Corporation)]
The designers of the 3090/VF chose a speed ratio in the range 3 ≤ r ≤ 5 because IBM wanted a balance
between business and scientific applications. When the program is 70% vectorized, one expects a maximum
speedup of about 2.3. However, for f ≤ 30%, the speedup is reduced to less than 1.3.
The IBM designers did not choose a high speed ratio because they did not expect user programs to be
highly vectorizable. When f is low, the speedup cannot be high, even with a very high r. In fact, the limiting
case is P → 1 as f → 0.
On the other hand, P → r as f → 1. Scientific supercomputer designers like Cray and the Japanese
manufacturers often chose a much higher speed ratio, say, 10 ≤ r ≤ 25, because they expected a higher
vectorization ratio f in user programs, or they used better vectorizers to increase the ratio to a desired level.
Huge advances have taken place in the underlying technologies, and especially in VLSI technology,
over the last two decades. We shall see that these advances, summarized briefly in Chapter 13, have defined
the direction of advances in computer architecture over this period. Powerful single-chip processors, as
well as multi-core systems-on-a-chip, provide High Performance Computing (HPC) today. Such HPC systems
typically make use of MIMD and/or SPMD configurations with a large number of processors.
The advent of superscalar processors has resulted in vector processing instructions being built into powerful
processors, rather than into specialized processors. Thus the ideas we have studied in this section have made
their appearance in capabilities such as Streaming SIMD Extensions (SSE) in processors (see Chapter 13).
We may say that the concepts of vector processing remain valid today, but their implementations vary with
advances in technology.
MULTIVECTOR MULTIPROCESSORS
The architectural design of supercomputers continues to be upgraded based on advances
in technology and past experience. Design rules are provided for high performance, and
we review these rules in case studies of well-known early supercomputers, high-end mainframes, and
minisupercomputers. The trends toward scalable architectures in building MPP systems for supercomputing
are also assessed, while recent developments will be discussed in Chapter 13.
Architecture Design Goals Smith, Hsu, and Hsiung (1990) identified the following four major challenges
in the development of future general-purpose supercomputers:
* Maintaining a good vector/scalar performance balance.
* Supporting scalability with an increasing number of processors.
* Increasing memory system capacity and performance.
* Providing high-performance I/O and an easy-access network.
Balanced Vector/Scalar Ratio In a supercomputer, separate hardware resources with different speeds are
dedicated to concurrent vector and scalar operations. Scalar processing is indispensable for general-purpose
architectures. Vector processing is needed for regularly structured parallelism in scientific and engineering
computations. These two types of computations must be balanced.
The vector balance point is defined as the percentage of vector code in a program required to achieve
equal utilization of vector and scalar hardware. In other words, we expect equal time spent in vector and
scalar hardware so that no resources will be idle.
Example 8.3 Vector/scalar balance point in supercomputer design (Smith, Hsu, and Hsiung, 1990)
If a system is capable of 9 Mflops in vector mode and 1 Mflops in scalar mode, equal time will be spent in
each mode if the code is 90% vector and 10% scalar, resulting in a vector balance point of 0.9.
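The numbers in Example 8.3 can be verified directly; the sketch below, with an arbitrary workload of 1 Mflop, computes the time spent in each mode and shows that the two are equal at f = 0.9.

    def mode_times(total_mflop, f, vector_rate, scalar_rate):
        # Time in vector mode and in scalar mode, with rates in Mflops.
        return total_mflop * f / vector_rate, total_mflop * (1.0 - f) / scalar_rate

    t_vec, t_sca = mode_times(1.0, 0.9, vector_rate=9.0, scalar_rate=1.0)
    print(t_vec, t_sca)   # 0.1 and 0.1: equal time, so the balance point is 0.9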
It may not be optimal for a system to spend equal time in vector and scalar modes. However, the vector
balance point should be maintained sufficiently high, matching the level of vectorization in user programs.
Vector performance can be enhanced with replicated functional unit pipelines in each processor. Another
approach is to apply deeper pipelining on vector units with a double or triple clock rate with respect to scalar
pipeline operations. Longer vectors are required to really achieve the target performance.
Vector/Scalar Performance In Figs. 8.7a and 8.7b, the single-processor vector performance and scalar
performance are shown, based on running the Livermore Fortran loops on Cray Research and Japanese
supercomputers of the 1980s and early 1990s. The scalar performance of these supercomputers increases
along the dashed lines in the figure.
One of the contributing factors to vector capability is the high clock rate; other factors include the use of
a better compiler and the optimization support provided.
Table 8.2 compares the vector and scalar performances of seven supercomputers of that period. Note
that these supercomputers have a 90% or higher vector balance point. The higher the vector/scalar ratio, the
heavier the dependence on a high degree of vectorization in the object code.
Source: Smith et al., "Future General-Purpose Supercomputing," IEEE Supercomputing Conference, 1990.
The above approach is quite different from the design of comparable IBM vector machines, which
maintained a low vector/scalar ratio between 3 and 5. The idea was to make a good compromise between the
demands of scalar and vector processing for general-purpose applications.
I/O and Networking Performance With the aggregate speed of supercomputers increasing at least
three to five times each generation, problem size has been increasing accordingly, as have I/O bandwidth
requirements. Figure 8.7c illustrates the aggregate I/O bandwidths supported by supercomputer systems of
the period up to the early 1990s.
[Fig. 8.7 Some reported supercomputer performance data for machines of 1976-1990, including the Hitachi S-820, Fujitsu VP series, NEC SX series, Cray 1, Cray X-MP/4, Cray Y-MP, and Cray 2: (a) vector performance (Mflops), (b) scalar performance (Mflops), (c) I/O performance (Mbytes/s) (Source: Smith, Hsu, and Hsiung, IEEE Supercomputing Conference, 1990)]
I/O is defined as the transfer of data between the processor/memory and peripherals or a network. In
the earlier generation of supercomputers, I/O bandwidths were not always well correlated with computational
performance. I/O processor architectures were implemented by Cray Research with two different approaches.
The first approach is exemplified by the Cray Y-MP I/O subsystem, which used I/O processors that were
flexible and could do complex processing. The second approach was used in the Cray 2, where a simple
front-end processor controlled high-speed channels, with most of the I/O management being done by the
mainframe's operating system.
Today an aggregate I/O transfer rate of more than 100 Gbytes/s is needed in supercomputers connected to
high-speed disk arrays and networks. Support for high-speed networking has become a major component of
the I/O architecture in supercomputers.
Memory Demand The main memory sizes and extended memory sizes of supercomputers of the 1980s and
early 1990s are shown in Fig. 8.8. A large-scale memory system must provide a low latency for scalar
processing, a high bandwidth for vector and parallel processing, and a large size for grand challenge problems
and high throughput.
[Fig. 8.8 Supercomputer memory capacities in Mbytes, main and extended (Source: Smith, Hsu, and Hsiung, IEEE Supercomputing Conference, 1990)]
To achieve the above goals, an effective memory hierarchy is necessary. A typical hierarchy may consist of
data files or disks, extended memory in dynamic RAMs, a fast shared memory in static RAMs, and a cache/
local memory using RAM arrays.
Over the last two decades, with advances in VLSI technology, the processing power available on a chip
has tended to double every two years or so. Memory sizes available on a chip have also grown rapidly;
however, as we shall see in Chapter 13, the memory speeds achievable, i.e. read and write cycle times,
have grown much less rapidly than processor performance. Therefore the relative speed mismatch between
processors and memory, which has been a feature of computer systems from their earliest days, has widened
much further over the last two decades. This has necessitated the development of more sophisticated memory
latency hiding techniques, such as wider memory access paths and multi-level cache memories.
Supporting Scalability Multiprocessor supercomputers must be designed to support the triad of scalar,
vector, and parallel processing. The dominant scalability problem involves support of shared memory with
an increasing number of processors and memory ports. Increasing memory-access latency and interprocessor
communication overhead impose additional constraints on scalability.
Scalable architectures include multistage interconnection networks in flat systems, hierarchical clustered
systems, and multidimensional spanning buses, ring, mesh, or torus networks with a distributed shared
memory. Table 8.3 summarizes the key features of three representative multivector supercomputers of the 1990s.
[Fig. 8.9 Cray Y-MP 816 system organization (Courtesy of Cray Research, 1991): eight CPUs share the central memory, the I/O control, the interprocessor communication section, and the real-time clock; each CPU contains vector, scalar, and address registers and functional units (add/subtract, multiply, shift, logic, population count, reciprocal approximation), a vector length register, instruction buffers, exchange parameter registers, a programmable clock, a status register, and a performance monitor]
The central memory was divided into 256 interleaved banks. Overlapping memory access was made
possible through memory interleaving via four memory-access ports per CPU. A 6-ns clock period was used
in the CPU design.
The central memory offered 16M-, 32M-, 64M-, and 128M-word options with a maximum size of
1 Gbyte. The SSD options were from 32M to 512M words, or up to 4 Gbytes.
The four memory-access ports allowed each CPU to perform two scalar and vector fetches, one store, and
one independent I/O simultaneously. These parallel memory accesses were also pipelined to make vector
read and vector write possible.
The system had built-in resolution hardware to minimize the delays caused by memory conflicts. To
protect data, single-error correction/double-error detection (SECDED) logic was used in central memory and
on the data channels to and from central memory.
The CPU computation section consisted of 14 functional units divided into vector, scalar, address, and
control sections (Fig. 8.9). Both scalar and vector instructions could be executed in parallel. All arithmetic
was register-to-register. Eight of the 14 functional units could be used by vector instructions.
Large numbers of address, scalar, vector, intermediate, and temporary registers were used. Flexible
chaining of functional pipelines was made possible through the use of registers and multiple memory-access
and arithmetic/logic pipelines. Both 64-bit floating-point and 64-bit integer arithmetic were performed.
Large instruction caches (buffers) were used to hold 512 16-bit instruction parcels at a time.
The interprocessor communication section of the mainframe contained clusters of shared registers for fast
synchronization purposes. Each cluster consisted of shared address, shared scalar, and semaphore registers.
Note that vector data communication among the CPUs was done through the shared memory.
The real-time clock consisted of a 64-bit counter that advanced one count each clock period. Because the
clock advanced synchronously with program execution, it could be used to time the execution to an exact
clock count.
The I/O section supported three channel types with transfer rates of 6 Mbytes/s, 100 Mbytes/s, and
1 Gbyte/s. The IOS and SSD were high-speed data transfer devices designed to support the mainframe
processing through eight caches.
Example 8.4 The multistage crossbar network in the Cray Y-MP 816
The interconnections between the 8 CPUs and 256 memory banks in the Cray Y-MP 816 were implemented
with a multistage crossbar network, logically depicted in Fig. 8.10. The building blocks were 4 × 4 and 8 × 8
crossbar switches and 1 × 8 demultiplexers.
[Fig. 8.10 Schematic logic diagram of the crossbar network between 8 processors and 256 memory banks in the Cray Y-MP 816; each processor connects through crossbar stages and demultiplexers to memory subsections holding interleaved bank numbers such as 0, 4, 8, ..., 252 and 1, 5, 9, ..., 253, up to bank 255]
The network was controlled by a form of circuit switching, where all conflicts were worked out early in the
memory-access process and all requests from a given port returned to the port in order.
The use of a multistage network instead of a single-stage crossbar for processor-memory connections
was aimed at enhancing scalability in the building of even larger systems with 64 or 1024 processors.
However, crossbar networks work well only for small systems. To enhance scalability, emphasis should be given
to data routing, heavier reliance on processor-based local memory (as in the Cray 2), or the use of clustered
structures (as in the Cedar multiprocessor) to offset any increased latency when system size increases.
The C-90 and Clusters The C-90 was further enhanced in technology and scaled in size from the Y-MP
Series. The architectural features of the C-90/16256 are summarized in Table 8.3. The system was built with
16 CPUs, each of which was similar to that used in the Y-MP. The system used up to 256 megawords
(2 Gbytes) of shared main memory among the 16 processors. Up to 16 Gbytes of SSD memory was available
as optional secondary main memory. In each cycle, two vector pipes and two functional units could operate in
parallel, producing four vector results per clock. This implied a four-way parallelism within each processor.
Thus 16 processors could deliver a maximum of 64 vector results per clock cycle.
The C-90 used the UNICOS operating system, which was extended from UNIX System V and Berkeley
BSD 4.3. The C-90 could be driven by a number of host machines. Vectorizing compilers were available for
Fortran 77 and C on the system. The 64-way parallelism, coupled with a 4.2-ns clock cycle, led to a peak
performance of 16 Gflops. The system had a maximum I/O bandwidth of 13.6 Gbytes/s.
Multiple C-90's could be used in a clustered configuration in order to solve large-scale problems. As
illustrated in Fig. 8.11, four C-90 clusters were connected to a group of SSDs via 1000-Mbytes/s channels.
Each C-90 cluster was allowed to access only its own main memory. However, the clusters shared access to
the SSDs. In other words, large data sets in the SSD could be shared by the four clusters of C-90's. The clusters
could also communicate with each other through a shared semaphore unit. Only synchronization and control
information was passed via the semaphore unit. In this sense, the C-90 clusters were loosely coupled, but
collectively they could provide a maximum of 256-way parallelism. For computations which were well
partitioned and balanced among the clusters, a maximum peak performance of 64 Gflops was possible for a
four-cluster configuration.
[Fig. 8.11 Four Cray Y-MP C-90's (16 processors each) connected to a common SSD, forming a loosely coupled 64-way parallel system]
The Cray/MPP System Massively parallel processing (MPP) systems have the potential for tackling
highly parallel problems. Standard off-the-shelf microprocessors may have deficiencies when used as
building blocks of an MPP system. What is needed is a balanced system that matches fast processor speed
with fast I/O, fast memory access, and capable software. Cray Research announced its MPP development in
October 1992. The development plan sheds some light on the trend toward MPP from the standpoint of a
major supercomputer manufacturer.
Most of the early RISC microprocessors lacked the communication, memory, and synchronization
features needed for efficient MPP systems. Cray Research planned to circumvent these shortcomings by
surrounding the RISC chip with powerful communications hardware, besides exploiting Cray's expertise
in supercomputer packaging and cooling. In this way, thousands of commodity RISC processors would
be transformed into a supercomputer-class MPP system that could address terabytes of memory, minimize
communication overhead, and provide flexible, lightweight synchronization in a UNIX environment.
Cray's first MPP system was code-named T3D because a three-dimensional, dense torus network was
used to interconnect the machine resources. The heart of Cray's T3D was a scalable macroarchitecture that
combined DEC Alpha microprocessors through a low-latency interconnect network that had a bisection
bandwidth an order of magnitude greater than that of existing MPP systems. The T3D system was designed
to work jointly with the Cray Y-MP C-90 or the large-memory M-90 in a closely coupled fashion. Specific
features of the MPP macroarchitecture are summarized below:
(1) The T3D was an MIMD machine that could be dynamically partitioned to emulate SIMD or
multicomputer MIMD operations. The 3-D torus operated at a 150-MHz clock matching that of
the Alpha chips. High-speed bidirectional switching nodes were built into the T3D network so that
interprocessor communications could be handled without interrupting the PEs attached to the nodes.
The T3D network was designed to be scalable from tens to thousands of PEs.
(2) The system used a globally addressable, physically distributed memory. Because the memory was
logically shared, any PE could access the memory of any other processing element without explicit
message passing and without involving the remote PE. As a result, the system could be scaled to
address terabytes of memory. Latency hiding (to be studied in Chapter 9) was supported by data
prefetching, fast synchronization, and parallel I/O. These were supported by dedicated hardware. For
example, special remote-access hardware was provided to hide the long latency in memory accesses.
Fast synchronization support included special primitives for data-parallel and message-passing
programming paradigms.
(3) The Cray/MPP used a Mach-based microkernel operating system. Each PE had a microkernel that
managed communications with other PEs and with the closely coupled Y-MP vector processors.
Software portability was a major design goal in the Cray/MPP Series. Software-configurable redundant
hardware was included so that processing could continue in the event of a PE failure.
(4) The Cray CFT77 compiler was modified with extended directives for MPP applications. Program
debugging and performance tools were developed.
[Fig. 8.13 The Fujitsu VP2000 Series supercomputer architecture, showing channel processors, a scalar unit with buffer storage, and the vector units (Courtesy of Fujitsu, 1991)]
Example 8.5 Reconfigurable vector register file in the Fujitsu VP2000
Vector registers in Cray and Fujitsu machines are illustrated in Fig. 8.14. Cray machines used 8 vector
registers, each with a fixed length of 64 component registers. Each component register was 64 bits wide,
as shown in Fig. 8.14a.
[Fig. 8.14 Vector register files: (a) eight vector registers (8 × 64 × 64 bits) of fixed length on Cray machines; (b) reconfigurable vector registers on the Fujitsu VP2000]
A component counter was built within each Cray vector register to keep track of the number of vector
elements fetched or processed. A segment of a 64-element subvector was held as a package in each vector
register. Long vectors had to be divided into 64-element segments before they could be processed in a
pipelined fashion.
In an early model of the Fujitsu VP2000, the vector registers were reconfigurable to have variable lengths.
The purpose was to dynamically match the register length with the vector length being processed.
As illustrated in Fig. 8.14b, a total of 64 Kbytes in the register file could be configured into 8, 16, 32, 64,
128, or 256 vector registers with 1024, 512, 256, 128, 64, or 32 component registers, respectively. All
component registers were 64 bits in length.
In the following Fortran Do loop, the three-dimensional vectors are indexed by I with constant
values of J and K in the second and third dimensions.
      DO 10 I = 0, 31
        ZZ0(I) = U(I,J,K) - U(I,J-1,K)
        ZZ1(I) = V(I,J,K) - V(I,J-1,K)
Software support for parallel and vector processing in such supercomputers will be treated in Part IV.
This includes multitasking, macrotasking, microtasking, autotasking, and interactive compiler optimization
techniques for vectorization or parallelization.
The VPP500 This was a later supercomputer series from Fujitsu, called the vector parallel processor. The
architecture of the VPP500 was scalable from 7 to 222 PEs, offering a highly parallel MIMD multivector
system. The peak performance was targeted for 355 Gflops. Figure 8.15 shows the architecture of the VPP500
used as a back-end machine attached to a VP2000 or a VPX200 host.
[Fig. 8.15 The architecture of the Fujitsu VPP500, showing control processors, data transfer units, main storage units, and the scalar unit and vector-unit load/store pipelines within each processing element]
DEC VAX 9000 Even though the VAX 9000 did not provide Gflops performance, the design represented a
typical mainframe approach to high-performance computing. The architecture is shown in Fig. 8.16a.
Multichip packaging technology was used to build the VAX 9000. It offered 40 times the VAX 11/780
performance per processor. With a four-processor configuration, this implied 157 times the 11/780
performance. When used for transaction processing, 70 TPS was reported on a uniprocessor. The peak vector
processing rate ranged from 125 to 500 Mflops.
The system control unit utilized a crossbar switch providing four simultaneous 500-Mbytes/s data
transfers. Besides incorporating interconnect logic, the crossbar was designed to monitor the contents of
cache memories, tracking the most up-to-date cache content to maintain coherence.
Up to 512 Mbytes of main memory were available using 1-Mbit DRAMs on 64-Mbyte arrays. Up to
2 Gbytes of extended memory were available using 4-Mbit DRAMs. Various I/O channels provided an
aggregate data transfer rate of 320 Mbytes/s. The crossbar had eight ports to four processors, two memory
modules, and two I/O controllers. Each port had a maximum transfer rate of 1 Gbyte/s, much higher than in
bus-connected systems.
Each vector processor (VBOX) was equipped with an add and a multiply pipeline using vector registers
and a mask/address generator, as shown in Fig. 8.16b. Vector instructions were fetched through the memory
unit (MBOX), decoded in the IBOX, and issued to the VBOX by the EBOX. Scalar operations were directly
executed in the EBOX.
[Fig. 8.16 The DEC VAX 9000 architecture: (a) up to four CPUs with caches, two memory modules of up to 1 Gbyte each, and two I/O controls (up to 12 I/O interfaces per XMI) connected by a crossbar switch with 500-Mbytes/s read/write paths and a service processor; (b) the vector processor (VBOX) with vector control, a vector register unit, and a mask/address generator]
The vector register file consisted of 16 × 64 × 64 bits, divided into sixteen 64-element vector registers. No
instruction took more than five cycles. The vector processor generated two 64-bit results per cycle, and the
vector pipelines could be chained for dot-product operations.
The VAX 9000 could run either the VMS or the ULTRIX operating system. The service processor in
Fig. 8.16a used four MicroVAX processors devoted to system, disk/tape, and user interface control and to
monitoring 20,000 scan points throughout the system for reliable operation and fault diagnosis.
Minisupercomputers These were a class of low-cost supercomputer systems with a performance of about
5 to 15% and a cost of 3 to 10% of that of a full-scale supercomputer. Representative systems of the early
1990s include the Convex C series, Alliant FX series, Encore Multimax series, and Sequent Symmetry series.
Some of these minisupercomputers have been introduced in Chapters 1 and 7. Most of them had an open
architecture using standard off-the-shelf processors and UNIX systems.
Both scalar and vector processing was supported in these multiprocessor systems with shared memory and
peripherals. Most of these systems were built with a graphics subsystem for visualization and performance-
tuning purposes.
Supercomputing Workstations In the early 1990s, high-performance workstations were being produced
by Sun Microsystems, IBM, DEC, HP, Silicon Graphics, and Stardent using the state-of-the-art superscalar
RISC processors introduced in Chapters 4 and 6. Most of these workstations had a uniprocessor configuration
with built-in graphics support but no vector hardware.
Silicon Graphics produced the 4-D Series using four R3000 CPUs in a single workstation without vector
hardware. Stardent Computer Systems produced a departmental supercomputer, called the Stardent 3000,
with custom-designed vector hardware.
The Stardent 3000 The Stardent 3000 was a multiprocessor workstation that evolved from the TITAN
architecture developed by Ardent Computer Corporation. The architecture and graphics subsystem of the
Stardent 3000 are depicted in Fig. 8.17. Two buses were used for communication between the four CPUs,
memory, I/O, and graphics subsystems (Fig. 8.17a).
The system featured R3000/R3010 processor/floating-point units. The vector processors were custom-
designed. A 32-MHz clock was used. There were 128 Kbytes of cache; one half was used for instructions and
the other half for data.
The buses carried 32-bit addresses and 64-bit data and operated at 16 MHz. They were rated at
128 Mbytes/s each. The R-bus was dedicated to data transfers from memory to the vector processor, and the
S-bus handled all other transfers. The system could support a maximum of 512 Mbytes of memory.
A full graphics subsystem is shown in Fig. 8.17b. It consisted of two boards that were tightly coupled to
both the CPUs and memory. These boards incorporated rasterizers (pixel and polygon processors), frame
buffers, Z-buffers, and additional overlay and control planes.
[Fig. 8.17 The Stardent 3000 visualization departmental supercomputer (Courtesy of Stardent Computer, 1990): (a) scalar and vector processors and main memory (8 MB to 512 MB) on the S-bus (128 MB/s) and R-bus (128 MB/s), with a 32-bit DMA system bus interface; (b) the graphics subsystem architecture with display memory, overlay planes, image planes, and a polygon/pixel processor on a graphics expansion board]
The Stardent system was designed for numerically intensive computing with two- and three-dimensional
rendering graphics. One or two I/O processors were connected to SCSI or VME buses and other peripherals
or Ethernet connections. The peak performance was estimated at 32 to 128 MIPS, 16 to 64 scalar Mflops, and
32 to 128 vector Mflops. A scoreboard, crossbar switch, and arithmetic pipelines were implemented in each
vector processor.
Gordon Bell, chief architect of the VAX Series and of the TITAN/Stardent architecture, identified 11 rules
of minisupercomputer design in 1989. These rules call for performance-directed design, balanced scalar/
vector operations, avoiding holes in the performance space, achieving peaks in performance even on a single
program, providing a decade of addressing space, making the computer easy to use, building on others' work,
always looking ahead to the next generation, and expecting the unexpected with slack resources.
The LINPACK Results LINPACK is a general-purpose Fortran library of mathematical software for solving
dense linear systems of equations of order 100 or higher. LINPACK is very sensitive to vector operations and
the degree of vectorization by the compiler. It has been used to predict computer performance in scientific
and engineering areas.
Many published Mflops and Gflops results are based on running the LINPACK code with prespecified
compilers. LINPACK programs can be characterized as having a high percentage of floating-point arithmetic
operations.
In solving a linear system of n equations, the total number of arithmetic operations involved is estimated
as 2n^3/3 + 2n^2, where n = 100 in the LINPACK experiments.
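The operation count converts to a Mflops rating in the obvious way; the run time used below is a made-up figure included only to show the arithmetic.

    def linpack_ops(n):
        # Estimated arithmetic operations for an n x n dense system.
        return 2 * n**3 / 3 + 2 * n**2

    def mflops(n, seconds):
        return linpack_ops(n) / seconds / 1.0e6

    print(linpack_ops(100))      # roughly 6.9e5 operations for n = 100
    print(mflops(1000, 2.5))     # a hypothetical 2.5-s solve of n = 1000 gives about 268 Mflops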
Over many years, Dongarra compared the performance of various computer systems in solving dense
systems of linear equations. His performance experiments involved about 100 computers.
The timing information presented in this report reflects the floating-point, parallel, and vector processing
capabilities of the machines tested. Since the original reports are quite long, only brief excerpts are quoted
in Table 8.5.
The second column reports LINPACK performance results based on a matrix of order n = 100 in a Fortran
environment. The third column shows the results of solving a system of equations of order n = 1000 with no
restriction on the method or its implementation. The last column lists the theoretical peak performance of the
machines.
The LINPACK results reported in the second column of Table 8.5 were for a small problem size of
100 unknowns. No changes were made in the LINPACK software to exploit vector capabilities on multiple
processors in the machines being evaluated. The compilers of some machines might generate optimized code
that itself accessed special hardware features.
The third column corresponds to a much larger problem size of 1000 unknowns. All possible optimization
means, including user optimizations of the software, were allowed to achieve as high an execution rate as
possible, called the best-effort Mflops.
The theoretical peak can easily be calculated by counting the maximum number of floating-point additions
and multiplications that can be completed during a period of time, usually the cycle time of the machine.
Table 8.5 Performance in Solving a System of Linear Equations
Source: Jack Dongarra, "Performance of Various Computers Using Standard Linear Equations Software," Computer
Science Dept., Univ. of Tennessee, Knoxville, TN 37996-1301, March 1992.
COMPOUND VECTOR PROCESSING
In this section, we study compound vector operations. Multipipeline chaining and networking
techniques are described and design examples given. A graph transformation approach
is presented for setting up pipeline networks to implement compound vector functions, which are either
specified by the programmer or detected by an intelligent compiler.
A compound vector function (CVF) is a composite function of vector operations converted from a looping
structure of linked scalar operations, where the index I implies that all vector operations involve N elements.
Compound Vector Functions Table 8.6 lists a number of example CVFs involving one-dimensional
vectors indexed by I. The same concept can be generalized to multidimensional vectors with multiple indices.
For simplicity, we discuss only CVFs defined over one-dimensional vectors. Typical operations appearing
in these CVFs include load, store, multiply, divide, logical, and shifting vector operations. We use a slash to
represent the divide operation. All vector operations are defined on a component-wise basis unless otherwise
specified.
The purpose of studying CVFs is to explore opportunities for concurrent processing of linked vector
operations. The numbers of available vector registers and functional pipelines impose some limitations on
how many CVFs can be executed simultaneously.
Example 8.8 Pipeline chaining on Cray supercomputers and on the Cray X-MP (Courtesy of Cray Research, Inc., 1985)
The Cray 1 had one memory-access pipe for either load or store, but not for both at the same time. The Cray
X-MP had three memory-access pipes, two for vector load and one for vector store. These three access pipes
could be used simultaneously.
To implement the SAXPY code on the Cray 1, the five vector operations are divided into three chains. The
first chain has only one vector operation, load Y. The second chain links the load X to the scalar-vector multiply
(S × X) operation and then to the vector add operation. The last chain is for store Y, as illustrated in Fig. 8.18a.
The same set of vector operations was implemented on the Cray X-MP in a single chain, as shown in
Fig. 8.18b, because three memory-access pipes could be used simultaneously. The chain links the five vector
operations in a single connected cascade.
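For reference, SAXPY computes Y(i) = S × X(i) + Y(i) element by element; the sketch below simply names the five vector operations and the chain groupings described above (a functional description, not Cray microcode).

    def saxpy(s, X, Y):
        # Five vector operations: load Y, load X, multiply S*X, add, store Y.
        # Cray 1 chains:  {load Y}, {load X -> S*X -> add}, {store Y}.
        # Cray X-MP:      all five operations linked in a single chain.
        return [s * x + y for x, y in zip(X, Y)]

    print(saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))   # [12.0, 24.0, 36.0]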
[Fig. 8.18 Multipipeline chaining on the Cray 1 and Cray X-MP for executing the SAXPY code Y(1:N) = S × X(1:N) + Y(1:N) (Courtesy of Cray Research, 1985): (a) limited chaining using only one memory-access pipe in the Cray 1; (b) complete chaining using three memory-access pipes in the Cray X-MP, with vector registers linking the access, multiply, and add pipes]
To compare the time required for chaining these pipelines, Fig. 8.19a shows that roughly 5n cycles are
needed to perform the five vector operations sequentially without any overlapping or chaining. The Cray 1
requires about 3n cycles to execute, corresponding to about n cycles for each vector chain. The Cray X-MP
requires about n cycles to execute.
[Fig. 8.19 Timing for executing the SAXPY code Y(1:N) = S × X(1:N) + Y(1:N) under different memory-access capabilities (Courtesy of Cray Research, 1985): (a) sequential execution without chaining; (c) chaining with two load pipes, two arithmetic pipes, and one store pipe (n = time to produce n elements)]
In Fig. 8.19, the pipeline flow-through latencies (startup delays) are denoted as s, m, and a for the memory-
access pipe, the multiply pipe, and the add pipe, respectively. These latencies equal the lengths of the individual
pipelines. The exact cycle counts can be slightly greater than 5n, 3n, and n due to these extra
delays.
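Treating the startup delays as one small lump (a simplification of the figure, since the exact composition differs case by case), the three organizations can be compared with a rough cycle model; the numbers below are placeholders.

    def saxpy_cycle_estimates(n, startup):
        # Approximate counts from Fig. 8.19: 5n, 3n, and n cycles plus a small
        # startup term collecting the pipe flow-through latencies s, m, and a.
        return {"no chaining": 5 * n + startup,
                "Cray 1 (3 chains)": 3 * n + startup,
                "Cray X-MP (1 chain)": 1 * n + startup}

    print(saxpy_cycle_estimates(n=64, startup=20))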
The above example clearly demonstrates the advantages of vector chaining. A meaningful chain must
link two or more pipelines. As far as execution time is concerned, the longer the chain, the better the
performance. The degree of chaining is indicated by the number of distinct pipeline units that can be linked
together.
Vector chaining effectively increases the overall pipeline length by adding the pipeline stages of all
functional units in the chain to form a single long pipeline. The potential speedup of this long pipeline is
certainly greater according to Eq. 5.5.
Chaining Limitations The number of vector operations in a CVF must be small enough to make chaining
possible. Vector chaining is limited by the small number of functional pipelines available in a vector processor.
Furthermore, the limited number of vector registers imposes an additional limit on chaining.
For example, the Cray Y-MP had only eight vector registers. Suppose all memory pipes are used in a
vector chain. This requires that three vector registers (two for vector load and one for vector store) be
reserved at the beginning and end of the chaining operations. The remaining five vector registers are used for
arithmetic, logic, and shift operations.
The number of interface registers required between two adjacent pipeline units is at least one, and sometimes
two for two source vectors. Thus, the number of non-memory-access vector operations implementable with
the remaining five vector registers cannot be greater than five. In practice, this number is between two and
three.
The actual degree of chaining depends on how many of the vector operations involved are binary or unary
and how many use scalar or vector registers. If they are all binary operations, each requiring two source
vector registers, then only two or three vector operations can be sandwiched between the memory-access
operations. Thus a single chain on the Cray Y-MP could link at most five or six vector operations including
the memory-access operations.
Vector Recurrence These are a special class of vector loops in which the outputs of a functional pipeline
may feed back into one of its own source vector registers. In other words, a vector register is used for holding
the source operands and the result elements simultaneously.
This has been done on Cray machines using a component counter associated with each vector register. In
each pipeline cycle, the vector register is used like a shift register at the component level. When a component
operand is "shifted" out of the vector register and enters the functional pipeline, a result component can enter
the vacated component register during the same cycle. The component counter must keep track of the shifting
operations until all 64 components of the result are loaded into the vector register.
Recursive vector summation is often needed in scientific and statistical computations. For example, the
dot product of two vectors, A · B = Σ(a_i × b_i), can be implemented using recursion. Another example is
polynomial evaluation over vector operands.
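One common way to organize such a recursive summation is pairwise (halving) reduction; the sketch below applies it to a dot product and is an algorithmic illustration only, not the Cray register-level mechanism.

    def recursive_sum(v):
        # Pairwise reduction: repeatedly add element i to element i + half
        # until a single partial sum remains.
        v = list(v)
        while len(v) > 1:
            half, odd = divmod(len(v), 2)
            v = [v[i] + v[half + i] for i in range(half)] + ([v[-1]] if odd else [])
        return v[0]

    def dot(a, b):
        # A . B = sum of a_i * b_i, reduced with the recursion above.
        return recursive_sum([x * y for x, y in zip(a, b)])

    print(dot([1, 2, 3, 4], [5, 6, 7, 8]))   # 70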
Summary Our discussion of vector processing and pipeline chaining is based on a load-store architecture
using vector registers in all vector instructions. The number of functional units has increased steadily in
supercomputers; both the Cray C-90 and the NEC SX-X offered 16-way parallelism within each processor.
The degree of chaining can certainly increase if the vector register file becomes larger and scoreboarding
techniques are applied to ensure functional unit independence and to resolve data dependence or resource
dependence problems. The use of multiport memory is crucial to enabling large vector chains.
Vector looping, chaining, and recursion represent the state of the art in extending pipelining for vector
processing. Furthermore, one can use masking, scatter, and gather instructions to manipulate sparse vectors
or sparse matrices containing a large number of dummy zero entries. A vector processor cannot be considered
versatile unless it is designed to handle both dense and sparse vectors effectively.
The set of functional pipelines should be able to handle important vector arithmetic, logic, shifting, and
masking operations. Each FPi is pipelined with k_i stages. The output terminals of each BCN are buffered with
programmable delays. BCN1 is used to establish the dynamic connections between the register file and the
FPs. BCN2 sets up the dynamic connections among the FPs.
For simplicity, we call a pipeline network a pipenet. Conventional pipelines or pipeline chains are special
cases of pipenets. Note that a pipenet is programmable with dynamic connectivity. This represents the
fundamental difference between a static systolic array and a dynamic pipenet. In a way, one can visualize
pipenets as programmable systolic arrays. The programmability sets up the dynamic connections, as well as
the number of delays along some connection paths.
Setup of the Pipenet Figures 8.20a through 8.20d show how to convert a program graph into a pipenet.
Whenever a CVF is to be evaluated, the crossbar networks are programmed to set up a connectivity pattern
among the FPs that matches the data flow pattern in the CVF.
The program graph represents the data flow pattern in a given CVF. Nodes on the graph correspond to
vector operators, and edges show the data dependences, with delays properly labeled, among the operators.
The program graph in Fig. 8.20a corresponds to the following CVF:
E(I) = [A(I) × B(I) + B(I) × C(I)] / [B(I) × C(I) × (C(I) + D(I))]    (8.17)
for I = 1, 2, ..., n. This CVF has four input vectors A(I), B(I), C(I), and D(I) and one output vector E(I), which
demand five memory-access operations. In addition, there are seven vector arithmetic operations involved.
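Taking Eq. 8.17 as reconstructed above (the exact grouping in the original figure may differ slightly), the sketch below evaluates the CVF element by element and reuses the B(I) × C(I) product once, which is why six functional pipelines suffice for the seven vector operations.

    def cvf_8_17(A, B, C, D):
        # Element-wise evaluation of Eq. 8.17 with the product b*c computed
        # once and used in both the numerator and the denominator.
        E = []
        for a, b, c, d in zip(A, B, C, D):
            bc = b * c
            E.append((a * b + bc) / (bc * (c + d)))
        return E

    print(cvf_8_17([1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]))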
[Fig. 8.20 The concept of a pipenet and its implementation model (reprinted from Hwang and Xu, IEEE Transactions on Computers, Jan. 1988): (a) the program graph for the CVF of Eq. 8.17 with MPY, ADD, and DIV nodes; (b) the pipenet implementation; (c) the crossbar implementation with numbered feedback connections; (d) the implementation model with a register file, FPs, and two buffered crossbar networks (BCN1, BCN2) with programmable delays]
In other words, the above CVF demands a chaining degree of 11 if one considers implementing it with a
chain of memory-access and arithmetic pipelines. This high degree of chaining is very difficult to implement
with a limited number of FPs and vector registers. However, the CVF can be easily implemented with a
pipenet, as shown in Fig. 8.20b.
Six FPs are employed to implement the seven vector operations because the product vector B(I) × C(I),
once generated, can be used in both the denominator and the numerator. We assume two, four, and six
pipeline stages in the ADD, MPY, and DIV units, respectively. Two noncompute delays are inserted,
each with two clock delays, along two of the connecting paths. The purpose is to equalize all the path delays
from the input end to the output end.
The connections among the FPs and the two inserted delays are shown in Fig. 8.20c for a crossbar-
connected vector processor. The feedback connections are identified by numbers. The delays are set up in the
appropriate buffers at the output terminals identified as 4 and 5. Usually, these buffers allow a range of delays
to be set up at the time the resources are scheduled.
The program graph can be specified either by the programmer or by a compiler. Various connection patterns
in the crossbar networks can be prestored for implementing each CVF type. Once the CVF is decoded, the
corresponding connection pattern is enabled for setup dynamically.
Program Graph Transformations The program graph in Fig. 8.20a is acyclic or loop-free, without feedback
connections. An almost trivial mapping is used to establish the pipenet (Fig. 8.20b). In general, the mapping
cannot be obtained directly without some graph transformations. We describe these transformations below
with a concrete example CVF, corresponding to the cyclic graph shown in Fig. 8.21a.
On a directed program graph, nodal delays correspond to the appropriate FPs, and edge delays are the
signal flow delays along the connecting paths between FPs. For simplicity, each delay is counted as one
pipeline clock cycle.
A cycle in a graph is a sequence of nodes and edges which starts and ends with the same node. We will
consider a k-graph, a synchronous program graph in which all nodes have a delay of k cycles. A 0-graph is
called a systolic program graph.
The following two lemmas provide basic tools for converting a given program graph into an equivalent graph. The equivalence is defined up to graph isomorphism and with the same input/output behaviors.
Lemma 1: Adding k delays to any node in a systolic program graph and then subtracting k delays from all incoming edges to that node will produce an equivalent program graph.
Lemma 2: An equivalent program graph is generated if all nodal and edge delays are multiplied by the same positive integer, called the scaling constant.
To implement a CVF by setting up a pipenet in a vector processor, one needs first to represent the CVF as a systolic graph with zero nodal delays and positive edge delays. Only a systolic graph can be converted to a pipenet, as exemplified below.
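Because the two lemmas simply move or scale delays, they are easy to experiment with on a small graph data structure. The following minimal sketch is our own illustration (not taken from the text); it assumes a hypothetical dictionary representation in which nodal and edge delays are kept in separate maps, and the node and edge names are made up for the example.

# Sketch of the two graph-transformation lemmas on a program graph.
#   node_delay[v]      : pipeline delay (in cycles) of operator node v
#   edge_delay[(u, v)] : delay on the directed edge u -> v

def apply_lemma1(node_delay, edge_delay, v, k):
    """Lemma 1: add k delays to node v and subtract k delays from every
    incoming edge of v; the input/output behavior is preserved."""
    incoming = [e for e in edge_delay if e[1] == v]
    assert all(edge_delay[e] >= k for e in incoming), "not enough edge delay to move"
    node_delay[v] += k
    for e in incoming:
        edge_delay[e] -= k

def apply_lemma2(node_delay, edge_delay, s):
    """Lemma 2: multiply all nodal and edge delays by the scaling constant s."""
    for v in node_delay:
        node_delay[v] *= s
    for e in edge_delay:
        edge_delay[e] *= s

# Example: a systolic (0-graph) fragment MPY1 -> ADD with four cycles of edge delay.
nodes = {"MPY1": 0, "ADD": 0}
edges = {("MPY1", "ADD"): 4}
apply_lemma1(nodes, edges, "ADD", 4)   # ADD becomes a 4-stage node; the edge delay drops to 0
print(nodes, edges)                    # {'MPY1': 0, 'ADD': 4} {('MPY1', 'ADD'): 0}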
Example 8.9  Program graph transformation to set up a pipenet (Hwang and Xu, 1988)
Consider the systolic program graph in Fig. 8.21a. This graph represents the following set of CVFs:
E(I) = [B(I) × C(I)] + [C(I) × D(I)]
F(I) = [C(I) × D(I)] × [C(I − 2) × D(I − 2)]      (8.18)
G(I) = [F(I)/F(I − 1)] × G(I − 4)
Two multiply operators (MPY1 and MPY2) and one add operator (ADD) are applied to evaluate the vector E(I) from the input end (Vin) to the output end (Vout) in Fig. 8.21a. The same operator MPY2 is applied twice, with different delays (four and six cycles), before its results are multiplied by MPY3 to generate the output vector F(I). Finally, the divide (DIV) and multiply (MPY4) operators are applied to generate the output vector G(I).
Applying Lemma 1, we add four-cycle delays to each operator node and subtract four-cycle delays from all incoming edges. The transformed graph is obtained in Fig. 8.21b. This is a 4-graph with all nodal delays equal to four cycles. Therefore, one can construct a pipenet with all FPs having four pipeline stages, as shown in Fig. 8.21c. The two graphs shown in Figs. 8.21b and 8.21c are indeed isomorphic.
[Figure 8.21: (a) the systolic program graph for the CVFs of Eq. 8.18; (b) the equivalent 4-graph obtained by applying Lemma 1; (c) the pipenet implementation with inserted delays between the pipelines]
Fig. 8.21  From synchronous program graph to pipenet implementation (Reprinted from Hwang and Xu, IEEE Transactions on Computers, Jan. 1988)
The inserted delays correspond to the edge delays on the transformed graph. These delays can be implemented with programmable delays in the buffered crossbar networks shown in Fig. 8.20b. Note that the only self-reflecting cycle, at node MPY4, represents the recursion defined in the equation for vector G(I). No scaling is applied in this graph transformation.
The systolic program graph in Fig. 8.21a can be obtained by intuitive reasoning and delay analysis as shown above. Systematic procedures for converting any set of CVFs into systolic program graphs were reported in the original paper by Hwang and Xu (1988).
If the systolic graph so obtained does not have enough edge delays to be transferred into the operator nodes, we have to multiply the edge delays by a scaling constant s, applying Lemma 2. The pipenet clock rate must then be reduced by s times. This means that successive vector elements entering the pipenet must be separated by s cycles to avoid collisions in the respective pipelines.
Performance Evaluation   The above graph transformation technique has been applied in developing various pipenets for implementing CVFs embedded in Livermore loops. Speedup improvements of between 2 and 12 were obtained, as compared with implementing them on vector hardware without chaining or networking.
In order to build the multipipeline networking capabilities described above into future vector processors, Fortran and other vector languages must be extended to represent CVFs under various conditions. Automatic compiler techniques need to be developed to convert from vector expressions to systolic graphs and then to pipeline nets. Therefore, new hardware and software mechanisms are needed to support compound vector processing. This hardware approach can be one or two orders of magnitude faster than a software implementation.
[Figure 8.22: two SIMD computer organizations. (a) The distributed-memory model: an array control unit (with a scalar processor, control memory holding program and data, and a host computer) broadcasts vector instructions over a broadcast bus to processing elements (PEs) with local memories (LMs), which are interconnected by a data-routing network backed by mass storage. (b) The shared-memory model: the PEs access shared memory modules through an alignment network.]
An instruction is sent to the control unit for decoding. If it is a scalar or program control operation, it will be directly executed by a scalar processor attached to the control unit. If the decoded instruction is a vector operation, it will be broadcast to all the PEs for parallel execution.
Partitioned data sets are distributed to all the local memories attached to the PEs through a vector data bus. The PEs are interconnected by a data-routing network which performs inter-PE data communications such as shifting, permutation, and other routing operations. The data-routing network is under program control through the control unit. The PEs are synchronized in hardware by the control unit.
In other words, the same instruction is executed by all the PEs in the same cycle. However, masking logic is provided to enable or disable any PE from participating in a given instruction cycle. The Illiac IV was such an early SIMD machine, consisting of 64 PEs with local memories interconnected by an 8 × 8 mesh with wraparound connections (Fig. 2.18b).
Almost all SIMD machines built have been based on the distributed-memory model. Various SIMD machines differ mainly in the data-routing network chosen for inter-PE communications. The four-neighbor mesh architecture has been the most popular choice in the past. Besides the Illiac IV, the Goodyear MPP and the AMT DAP610 were also implemented with the two-dimensional mesh. Variations from the mesh are the hypercube embedded in a mesh implemented in the CM-2, and the X-Net plus a multistage crossbar router implemented in the MasPar MP-1.
Shared-Memory Model   In Fig. 8.22b, we show a variation of the SIMD computer using shared memory among the PEs. An alignment network is used as the inter-PE memory communication network. Again, this network is controlled by the control unit.
The Burroughs Scientific Processor (BSP) had adopted this architecture, with n = 16 PEs updating m = 17 shared-memory modules through a 16 × 17 alignment network. It should be noted that the value m is often chosen to be relatively prime with respect to n, so that parallel memory access can be achieved through skewing without conflicts.
The alignment network must be properly set to avoid access conflicts. Most SIMD computers were built with distributed memories. Some SIMD computers used bit-slice PEs, such as the DAP610 and CM-200. Both bit-slice and word-parallel SIMD computers are studied below.
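The effect of choosing m relatively prime to n can be checked with a few lines of code. The sketch below is only an illustration; it assumes simple low-order interleaving (module number = address mod m), which is our own simplification rather than the BSP's actual skewing scheme.

# Why m = 17 memory modules can serve n = 16 PEs without conflict: for any stride
# that is relatively prime to m, the 16 simultaneous addresses fall into 16 distinct modules.
from math import gcd

def modules_hit(base, stride, n, m):
    """Return the set of modules accessed when n PEs fetch
    elements base, base+stride, ..., base+(n-1)*stride."""
    return {(base + i * stride) % m for i in range(n)}

n, m = 16, 17
for stride in (1, 2, 4, 8, 16):              # typical row/column/diagonal strides
    assert gcd(stride, m) == 1
    hits = modules_hit(0, stride, n, m)
    print(f"stride {stride:2d}: {len(hits)} distinct modules")   # always 16, i.e. no conflicts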
SIMD Instructions   SIMD computers execute vector instructions for arithmetic, logic, data-routing, and masking operations over vector quantities. In bit-slice SIMD machines, the vectors are nothing but binary vectors. In word-parallel SIMD machines, the vector components are 4- or 8-byte numerical values.
All SIMD instructions must use vector operands of equal length n, where n is the number of PEs. SIMD instructions are similar to those used in pipelined vector processors, except that temporal parallelism in pipelines is replaced by spatial parallelism in multiple PEs.
The data-routing instructions include permutations, broadcasts, multicasts, and various rotate and shift operations. Masking operations are used to enable or disable a subset of PEs in any instruction cycle.
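As a rough illustration of masked execution (not tied to any particular machine's instruction set), the sketch below applies one vector instruction to all PEs but lets only the enabled PEs commit their results; the function name and data are hypothetical.

def simd_masked_add(a, b, mask):
    """Elementwise a[i] + b[i], performed only where mask[i] == 1;
    disabled PEs keep their old value of a[i]."""
    return [ai + bi if m else ai for ai, bi, m in zip(a, b, mask)]

a    = [1, 2, 3, 4]
b    = [10, 10, 10, 10]
mask = [1, 0, 1, 0]                    # enable PE0 and PE2 only
print(simd_masked_add(a, b, mask))     # [11, 2, 13, 4]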
Host and I/O   All I/O activities are handled by the host computer in the above SIMD organizations. A special control memory is used between the host and the array control unit. This is a staging memory for holding programs and data.
Divided data sets are distributed to the local memories (Fig. 8.22a) or to the shared-memory modules (Fig. 8.22b) before starting the program execution. The host manages the mass storage and graphics display of computational results. The scalar processor operates concurrently with the PE array under the coordination of the control unit.
8.4.1 The CM-2 Architecture
The Connection Machine CM-2 produced by Thinking Machines Corporation was a fine-grain MPP computer using thousands of bit-slice PEs in parallel to achieve a peak processing speed of above 10 Gflops. We describe the parallel architecture built into the CM-2. Parallel software developed with the CM-2 will be discussed in Chapter 10.
Program Execution Paradigm   All programs started execution on a front-end, which issued microinstructions to the back-end processing array when data-parallel operations were desired. The sequencer broke down these microinstructions and broadcast them to all data processors in the array.
Data sets and results could be exchanged between the front-end and the processing array in one of three ways: broadcasting, global combining, and a scalar memory bus, as depicted in Fig. 8.23. Broadcasting was carried out through the broadcast bus to all data processors at once.
[Figure 8.23: the CM-2 organization, with front-end computers and sequencers driving the array of data processors, which are interconnected by the router/NEWS/scanning networks and attached to I/O controllers and a framebuffer]
Global combining allowed the front-end to obtain the sum, largest value, logical OR, etc., of values, one from each processor. The scalar bus allowed the front-end to read or to write one 32-bit value at a time from or to the memories attached to the data processors. Both VAX and Symbolics machines were used as the front-end and as hosts.
The Processing Array   The CM-2 was a back-end machine for data-parallel computation. The processing array contained from 4K to 64K bit-slice data processors (or PEs), all of which were controlled by a sequencer as shown in Fig. 8.23.
The sequencer decoded microinstructions from the front-end and broadcast nanoinstructions to the processors in the array. All processors could access their memories simultaneously. All processors executed the broadcast instructions in a lockstep manner.
The processors exchanged data among themselves in parallel through the router, NEWS grids, or a scanning mechanism. These network elements were also connected to I/O interfaces. A mass storage subsystem, called the data vault, was connected through the I/O for storing up to 60 Gbytes of data.
Processing Nodes   Figure 8.24 shows the CM-2 processor chips with memory and floating-point chips. Each data processing node contained 32 bit-slice data processors, an optional floating-point accelerator, and interfaces for interprocessor communication. Each data processor was implemented with a 3-input and 2-output bit-slice ALU and associated latches and a memory interface. This ALU could perform bit-serial full-adder and Boolean logic operations.
[Figure 8.24: two 16-processor chips sharing memory chips over a 22-bit data path, each chip carrying NEWS, router, and hypercube interfaces, together with a floating-point interface chip and a floating-point execution chip (single or double precision)]
Fig. 8.24  A CM-2 processing node consisting of two processor chips and some memory and floating-point chips (Courtesy of Thinking Machines Corporation, 1990)
The processor chips were paired in each node, sharing a group of memory chips. Each processor chip contained 16 processors. The parallel instruction set, called Paris, included nanoinstructions for memory load and store, arithmetic and logic, control of the router, NEWS grid, and hypercube interface, floating-point, I/O, and diagnostic operations.
The memory data path was 22 bits (16 data and 6 ECC) per processor chip. The 18-bit memory address allowed 2^18 = 256K memory words (512 Kbytes of data) shared by 32 processors. The floating-point chip handled 32-bit operations at a time. Intermediate computational results could be stored back into the memory for subsequent use. Note that integer arithmetic was carried out directly by the processors in a bit-serial fashion.
Hypercube Routers   Special hardware was built on each processor chip for data routing among the processors. The router nodes on all processor chips were wired together to form a Boolean n-cube. A full configuration of the CM-2 had 4096 router nodes on processor chips interconnected as a 12-dimensional hypercube.
Each router node was connected to 12 other router nodes, including its paired node (Fig. 8.24). All 16 processors belonging to the same node were equally capable of sending a message from one vertex to any other processor at another vertex of the 12-cube. The following example clarifies this message-passing concept.
Example 8.10  Message routing on the CM-2 hypercube (Thinking Machines Corporation, 1990)
On each vertex of the 12-cube, the processors are numbered 0 through 15. The hypercube routers are numbered 0 through 4095 at the 4096 vertices. Processor 5 on router node 7 is thus identified as the 117th processor in the entire system because 16 × 7 + 5 = 117.
Suppose processor 117 wants to send a message to processor 361, which is located at processor 9 on router node 22 (16 × 22 + 9 = 361). Since router node 7 = (000000000111)₂ and router node 22 = (000000010110)₂, they differ at dimension 0 and dimension 4.
This message must traverse dimensions 0 and 4 to reach its destination. From router node 7, the message is first directed to router node 6 = (000000000110)₂ through dimension 0 and then to router node 22 through dimension 4, if there is no contention for hypercube wires. On the other hand, if router 7 has another message using the dimension 0 wire, the message can be routed first through dimension 4 to router 23 = (000000010111)₂ and then to the final destination through dimension 0 to avoid channel conflicts.
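A small sketch of this routing calculation is given below. It assumes a fixed low-to-high dimension order (plain e-cube routing); as the example notes, the CM-2 router could also reorder the dimensions to avoid contention, which the sketch does not model.

def processor_id(router_node, pe):
    return 16 * router_node + pe           # 16 bit-slice processors per router node

def hypercube_route(src_node, dst_node):
    """Return the sequence of router nodes visited, correcting one
    differing address bit (dimension) at a time, lowest dimension first."""
    path, node = [src_node], src_node
    diff, dim = src_node ^ dst_node, 0
    while diff:
        if diff & 1:
            node ^= (1 << dim)             # cross the hypercube wire in this dimension
            path.append(node)
        diff >>= 1
        dim += 1
    return path

print(processor_id(7, 5), processor_id(22, 9))   # 117 361
print(hypercube_route(7, 22))                    # [7, 6, 22]: dimension 0, then dimension 4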
The NEWS Grid   Within each processor chip, the 16 physical processors could be arranged as an 8 × 2, 1 × 16, 4 × 4, 4 × 2 × 2, or 2 × 2 × 2 × 2 grid, and so on. Sixty-four virtual processors could be assigned to each physical processor. These 64 virtual processors could be imagined to form an 8 × 8 grid within the chip.
The "NEWS" grid was based on the fact that each processor has a north, east, west, and south neighbor in the various grid configurations. Furthermore, a subset of the hypercube wires could be chosen to connect the 4096 nodes (chips) as a two-dimensional grid of any shape, 64 × 64 being one of the possible grid configurations.
By coupling the internal grid configuration within each node with the global grid configuration, one could
arrange the processors in NEWS grids of any shape involving any number of dimensions. These flexible
interconnections among the processors made it very efficient to route data on dedicated grid configurations
based on the application requirements.
Scanning and Spread Mechanisms   Besides dynamic reconfiguration in NEWS grids through the hypercube routers, the CM-2 had been built with special hardware support for scanning or spreading across NEWS grids. These were very powerful parallel operations for fast data combining or spreading throughout the entire array.
Scanning on NEWS grids combined communication and computation. The operation could simultaneously scan every row of a grid along a particular dimension for the partial sum of that row, the largest or smallest value, or bitwise OR, AND, or exclusive OR. Scanning operations could be expanded to cover all elements of an array.
Spreading could send a value to all other processors across the chips. A single-bit value could be spread from one chip to all other chips along the hypercube wires in only 75 steps. Variants of scans and spreads were built into the Paris instructions for ease of access.
I/O and Data Vault   The Connection Machine emphasized massive parallelism in computing as well as in visualization of computational results. High-speed I/O channels were available, from 2 to 16 channels, for data and/or image I/O operations. Peripheral devices attached to I/O channels included a data vault, a CM-HIPPI system, a CM-IOP system, and a VMEbus interface controller, as illustrated in Fig. 8.23. The data vault was a disk-based mass storage system for storing program files and large databases.
Major Applications   The CM-2 was applied in almost all the MPP and grand challenge applications introduced in Chapter 3. Specifically, the Connection Machine series was applied in document retrieval using relevance feedback, in memory-based reasoning as in the medical diagnostic system called QUACK for simulating the diagnosis of a disease, and in bulk processing of natural languages.
Other applications of the CM-2 included SPICE-like VLSI circuit analysis and layout, computational fluid dynamics, signal/image/vision processing and integration, neural network simulation and connectionist modeling, dynamic programming, context-free parsing, ray tracing graphics, and computational geometry problems. As the CM-2 was upgraded to the CM-5, the applications domain was expected to expand accordingly.
The MasPar MP-1   The MP-1 architecture consisted of four subsystems: the PE array, the array control unit (ACU), a UNIX subsystem with standard I/O, and a high-speed I/O subsystem, as depicted in Fig. 8.25a. The UNIX subsystem handled traditional serial processing. The high-speed I/O, working together with the PE array, handled massively parallel computing.
The MP-1 family included configurations with 1024, 4096, and up to 16,384 processors. The peak performance of the 16K-processor configuration was 26,000 MIPS in 32-bit RISC integer operations.
The system also had a peak floating-point capability of 1.5 Gflops in single-precision and 650 Mflops in double-precision operations.
Array Control Unit   The ACU was a 14-MIPS scalar RISC processor using a demand-paging instruction memory. The ACU fetched and decoded MP-1 instructions, computed addresses and scalar data values, issued control signals to the PE array, and monitored the status of the PE array.
Like the sequencer in the CM-2, the ACU was microcoded to achieve horizontal control of the PE array. Most scalar ACU instructions executed in one 70-ns clock. The whole ACU was implemented on one PC board.
A separately implemented functional unit, called a memory machine, was used in parallel with the ACU. The memory machine performed PE array load and store operations, while the ACU broadcast arithmetic, logic, and routing instructions to the PEs for parallel execution.
[Figure 8.25: (a) the MP-1 system, showing the array control unit and the PE array, a UNIX subsystem with standard I/O (disk array, tape, Ethernet, console), and a high-speed I/O subsystem (FDDI, HIPPI, frame buffer) attached to I/O devices; (b) the array of PE clusters on the processor boards]
Fig. 8.25  The MasPar MP-1 architecture (Courtesy of MasPar Computer Corporation, 1990)
The PE Array   Each processor board had 1024 PEs and associated memory arranged as 64 PE clusters (PECs) with 16 PEs per cluster. Figure 8.25b shows the inter-PEC connections on each processor board. Each PEC chip was connected to eight neighbors via the X-Net mesh and to a global multistage crossbar router network, labeled S1, S2, and S3 in Fig. 8.25b.
[Figure 8.26: (a) a PE cluster, showing 16 PEs (PE0 to PE15) sharing router, broadcast, and reduction connections; (b) a processor element and its memory (PMEM), with the ALU, exponent and mantissa units, flag unit, registers, control, address and ECC units, and the internal buses to external memory]
Fig. 8.26  Processing elements and memory design in the MasPar MP-1 (Courtesy of MasPar Computer Corporation, 1990)
Each PE cluster (Fig. 8.26a) was composed of 16 PEs and 16 processor memories (PEMs). The PEs were logically arranged as a 4 × 4 array for the X-Net two-dimensional mesh interconnections. The 16 PEs in a cluster shared an access port to the multistage crossbar router. Interprocessor communications were carried out via three mechanisms:

(1) ACU-PE array communications.
(2) X-Net nearest-neighbor communications.
(3) Global crossbar router communications.

The first mechanism supported ACU instruction/data broadcasts to all PEs in the array simultaneously and performed global reductions on parallel data to recover scalar values from the array. The other two IPC mechanisms are described separately below.
X-Net Mesh Interconnect   The X-Net interconnect directly connected each PE with its eight neighbors in the two-dimensional mesh. Each PE had four connections at its diagonal corners, forming an X pattern similar to the BLITZEN X grid network (Davis and Reif, 1986). A tri-state node at each X intersection permitted communication with any of the eight neighbors using only four wires per PE.
The connections to the PE array edges were wrapped around to form a 2-D torus. The torus structure is symmetric, facilitates several important matrix algorithms, and can emulate a one-dimensional ring with two X-Net steps. The aggregate X-Net communication bandwidth was 18 Gbytes/s in the largest MP-1 configuration.
Multistage Crossbar Interconnect   The router network provided global communication between all PEs and formed the basis for the MP-1 I/O system. The three router stages implemented the function of a 1024 × 1024 crossbar switch. Three router chips were used on each processor board.
Each PE cluster shared an originating port connected to router stage S1 and a target port connected to router stage S3. Connections were established from an originating PE through stages S1, S2, and S3 and then to the target PE. The full MP-1 configuration had 1024 PE clusters, so each stage had 1024 router ports. The router supported up to 1024 simultaneous connections with an aggregate bandwidth of 1.3 Gbytes/s.
Processor Elements and Memory   The PE design had mostly data path logic and no instruction fetch or decode logic. The design is detailed in Fig. 8.26b. Both integer and floating-point computations executed in each PE with a register-based RISC architecture. Load and store instructions moved data between the PEM and the register set.
Each PE had forty 32-bit registers available to the programmer and eight 32-bit registers for system use. The registers were bit and byte addressable. Each PE had a 4-bit integer ALU, a 1-bit logic unit, a 64-bit mantissa unit, a 16-bit exponent unit, and a flag unit. The NIBBLE bus was four bits wide and the BIT bus was one bit wide. The PEM could be directly or indirectly addressed with a maximum aggregate memory bandwidth of 12 Gbytes/s.
Most data movement within each PE occurred on the NIBBLE bus and the BIT bus. Different functional units within the PE could be simultaneously active during each microstep. In other words, integer, Boolean, and floating-point operations could all be performed at the same time. Each PE ran with a slow clock, while the system speed was obtained through massive parallelism like that implemented in the CM-2.
Parallel Disk Arrays   Another feature worthy of mention is the massively parallel I/O architecture implemented in the MP-1. The PE array (Fig. 8.25a) communicated with a parallel disk array through the high-speed I/O subsystem, which was essentially implemented by the 1.3-Gbytes/s global router network.
The disk array provided up to 17.3 Gbytes of formatted capacity with a 9-Mbytes/s sustained disk I/O rate. The parallel disk array was a necessity to support data-parallel computation and to provide file system transparency and multilevel fault tolerance.
The grand challenge applications drive the development of present and future MPP systems to achieve higher and higher performance goals. The Connection Machine model CM-5 was the most innovative effort of Thinking Machines Corporation toward this end. We describe below the innovations surrounding the CM-5 architectural development, its building blocks, and the application paradigms.
[Figure 8.27: the CM-5 building blocks, showing processing nodes (P/M), control processors (CP), and I/O interfaces, all attached to the data network, the control network, and the diagnostic network]
Fig. 8.27  The network architecture of the Connection Machine CM-5 (Courtesy of Leiserson et al., Thinking Machines Corporation, 1991)
Input and output were provided via high-bandwidth I/O interfaces to graphics devices, mass secondary storage such as a data vault, and high-performance networks. Additional low-speed I/O was provided by Ethernet connections to the control processors. The largest configuration was expected to occupy a space of 30 m × 30 m, and was designed for a peak performance of over 1 Tflops.
The Network Functions   The building blocks were interconnected by three networks: a data network, a control network, and a diagnostic network. The data network provided high-performance, point-to-point data communications between the processing nodes. The control network provided cooperative operations, including broadcast, synchronization, and scans, as well as system management functions.
The diagnostic network allowed "back-door" access to all system hardware to test system integrity and to detect and isolate errors. The data and control networks were connected to processing nodes, control processors, and I/O channels via network interfaces.
The CM-5 architecture was considered universal because it was optimized for data-parallel processing of large and complex problems. The data parallelism could be implemented in SIMD mode, multiple-SIMD mode, or synchronized MIMD mode.
The data and control networks were designed to have good scalability, making the machine size limited by the affordable cost rather than by any architectural or engineering constraint. In other words, the networks depended on no specific types of processors. When new technological advances arrived, they could be easily incorporated into the architecture. The network interfaces were designed to provide an abstract view of the networks.
The System Operation   The system operated one or more user partitions. Each partition consisted of a control processor, a collection of processing nodes, and dedicated portions of the data and control networks.
Figure 8.28 illustrates the distributed control on the CM-5 obtained through the dynamic use of the two interprocessor communication networks. Major system management functions, services, and data distribution are summarized in this diagram.
[Figure 8.28: distributed control on the CM-5, with control processors providing UNIX OS services, partition management, and device management; user processing on the partitioned processing nodes; and file systems and I/O management on storage devices and interfaces, all interacting through the data network and the control network]
Fig. 8.28  Distributed control on the CM-5 with concurrent user partitions and I/O activities (Courtesy of Thinking Machines Corporation, 1992)
The partitioning of resources was managed by a system executive. The control processor assigned to each partition behaved like a partition manager. Each user process executed on a single partition but could exchange data with processes on other partitions. Since all partitions utilized UNIX time-sharing and security features, each allowed multiple users to access the partition, while ensuring no conflicts or interference.
Access to system functions was classified as either privileged or nonprivileged. Access to the data and control networks within a partition was nonprivileged; these accesses could be executed directly by user code without system calls. Thus, OS kernel overhead could be eliminated in network communication within a user task. Access to the diagnostic network, to shared I/O resources, and to other partitions was privileged and could only be accomplished via system calls.
Some control processors in the CM-5 were assigned to manage the I/O devices and interfaces. This organization allowed a process on any partition to access any I/O device, and ensured that access to one device did not impede access to other devices. Functionally, the system operations, as depicted in Fig. 8.28,
were divided into user-oriented partitions, I/O services based upon system calls, dynamic control of the data and control networks, and system management and diagnostics.
The two networks could download user code from a control processor to the processing nodes, pass I/O requests, transfer messages of all sorts between control processors, and transfer data among nodes and I/O devices, either in a single partition or among different partitions. The I/O capacity could be scaled with increasing numbers of processing nodes or of control partitions. The CM-5 embodied the features of hardware modularity, distributed control, latency tolerance, and user abstraction; all of these are needed for scalable computing.
Fat Trees   A fat tree is more like a real tree in that it becomes thicker as it acquires more leaves. Processing nodes, control processors, and I/O channels are located at the leaves of a fat tree. A binary fat tree was illustrated in Fig. 2.17c. The internal nodes are switches. Unlike an ordinary binary tree, the channel capacities of a fat tree increase as we ascend from leaves to root.
The hierarchical nature of a fat tree can be exploited to give each user partition a dedicated subtree, which cannot be interfered with by any other partition's message traffic. The CM-5 data network was actually implemented with a 4-ary fat tree as shown in Fig. 8.29. Each of the internal switch nodes was made up of several router chips. Each router chip was connected to four child chips and either two or four parent chips.
Fig. 8.29  CM-5 data network implemented with a 4-ary fat tree (Courtesy of Leiserson et al., Thinking Machines Corporation, 1991)
To implement the partitions, one could allocate different subtrees to handle different partitions. The size of the subtrees varied with different partition demands. The I/O channels were assigned to another subtree, which was not devoted to any user partition. The I/O subtree was accessed as a shared system resource. In many ways, the data network functioned like a hierarchical system bus, except that there was no interference among partitioned subtrees. All leaf nodes had unique physical addresses.
The Data Network   To route a message from one processing node to another, the message was sent up the tree to the least common ancestor of the two processors and then down to the destination.
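The up/down rule can be sketched in a few lines, assuming a hypothetical leaf numbering in which the leaf addresses of a tree of height L run from 0 to 4^L - 1; the function names and the example heights are our own illustration, not the CM-5 addressing scheme.

def lca_level(src, dst, height):
    """Number of levels a message must climb: the lowest level at which
    src and dst fall inside the same 4-ary subtree."""
    level = 0
    while src != dst:
        src //= 4
        dst //= 4
        level += 1
    return min(level, height)

def fat_tree_route(src, dst, height):
    up = lca_level(src, dst, height)
    return {"up_hops": up, "down_hops": up}

print(fat_tree_route(5, 6, 4))    # same 4-leaf subtree: 1 hop up, 1 hop down
print(fat_tree_route(0, 63, 4))   # leaves 0 and 63 share only a 64-leaf subtree: 3 up, 3 down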
In the 4-ary fat-tree implementation (Fig. 8.29) of the data network, each connection provided a link to another chip with a raw bandwidth of 20 Mbytes/s in each direction. By selecting at each level of the tree whether two or four parent links were used, the bandwidths between nodes in the fat tree could be adjusted. Flow control was provided on each link.
Each processor had two connections to the data network, corresponding to a raw bandwidth of 40 Mbytes/s in and out of each leaf node. In the first two levels, each router chip used only two parent connections to the next higher level, yielding an aggregate bandwidth of 160 Mbytes/s out of a subtree with 16 leaf nodes. All router chips higher than the second level used four parent connections, which yielded an aggregate bandwidth of 10 Gbytes/s in each direction, from one half of a 2K-node system to the other.
The bandwidth continued to scale linearly up to 16,384 nodes, the largest CM-5 configuration planned. In larger machines, transmission-line techniques were to be used to pipeline bits across long wires, thereby overcoming the bandwidth limitation that would otherwise be imposed by wire latency.
As a message went up the tree, it would have several choices as to which parent connection to take. The decision was resolved by pseudo-randomly selecting from among those links that were unobstructed by other messages. After reaching the least common ancestor of the source and destination nodes, the message took a single available path of links down to the destination. The pseudo-random choice at each level automatically balanced the load on the network and avoided undue congestion caused by pathological message sets.
The data network chips were driven by a 40-MHz clock. The first two levels were routed through backplanes. The wires on higher levels were routed through cables, which could be either 9 or 26 ft in length. Message routing was based on the wormhole concept discussed in Section 7.4.
Faulty processing nodes or connection links could be mapped out of the system and quarantined. This allowed the system to remain functional while servicing and testing the mapped-out portion. The data network was acyclic from input to output, which precluded deadlock from occurring, provided the network promised to eventually deliver all messages injected into it and the processors promised to eventually remove all messages from the network after they were successfully delivered.
The Control Network   The architecture of the control network was that of a complete binary tree with all system components at the leaves. Each user partition was assigned to a subtree of the network. Processing nodes were located at leaves of the subtree, and a control processor was mapped into the partition at an additional leaf. The control processor executed the scalar part of the code, while the processing nodes executed the data-parallel part.
Unlike the variable-length messages transmitted by the data network, control network packets had a fixed length of 65 bits. There were three major types of operations on the control network: broadcasting, combining, and global operations. These operations provided interprocessor communications. Separate FIFOs in the network interface were assigned to each type of control operation.
The control network provided the mechanisms allowing data-parallel code to be executed efficiently and supported MIMD execution for general-purpose applications. The binary tree architecture made the control network simpler to implement than the fat tree used in the data network. The control network had the additional switching capability to map around faults and to connect any of the control processors to any user partition using an off-line routing strategy.
The Diagnostic Network   This network was needed for upgrading system availability. Built-in testability was achieved with scan-based diagnostics. Again, this network was organized as a (not necessarily complete) binary tree for its simplicity in addressing. One or more diagnostic processors were at the root. The leaves were pods, and each pod was a physical subsystem, such as a board or a backplane. There was a unique path from the root to each pod being tested.
The diagnostic network allowed groups of pods to be addressed according to a "hypercube-address" scheme. A special diagnostic interface was designed to form an in-system check of the integrity of all CM-5 chips that supported the JTAG (Joint Test Action Group) standard and all networks. It provided scan access to all chips supporting the JTAG standard and programmable ad hoc access to non-JTAG chips. The network itself was completely testable and diagnosable. It was able to map out and ignore faulty or powered-down parts of the machine.
[Figure 8.30: the control processor, consisting of a CPU with memory, standard I/O with a LAN connection, and a CM-5 network interface]
Fig. 8.30  The control processor in the CM-5 (Courtesy of Thinking Machines Corporation, 1991)
Each control processor ran CMOST, a UNIX-based OS with extensions for managing the parallel processing resources of the CM-5. Some control processors managed computational resources in user partitions. Others were used to manage I/O resources. Control processors specialized in managerial functions rather than computational functions. For this reason, high-performance arithmetic accelerators were not needed. Instead, additional I/O connections were provided in control processors.
Processing Nodes   Figure 8.31 shows the basic structure of a processing node. It was a SPARC-based processor with a memory subsystem, consisting of a memory controller and 8, 16, or 32 Mbytes of DRAM memory. The internal bus was 64 bits wide.
[Figure 8.31: the processing node, with DRAM memory connected over 64-bit paths (plus ECC) to a memory controller, which shares a 64-bit bus with the RISC processor and the network interface]
Fig. 8.31  The processing node in the CM-5 (Courtesy of Thinking Machines Corporation, 1992)
The SPARC processor was chosen for its multiwindow feature, which facilitated fast context switching. This was very crucial to the dynamic use of the processing nodes in different user partitions at different times. The network interface connected the node to the rest of the system through the control and data networks. The use of a hardware arithmetic accelerator to augment the processor was optional.
Vector Units   As illustrated in Fig. 8.32a, vector units could be added between the memory banks and the system bus as an optional feature. The vector units would replace the memory controller in Fig. 8.31. Each vector unit had a dedicated 72-bit path to its attached memory bank, providing a peak memory bandwidth of 128 Mbytes/s per vector unit.
The vector unit executed vector instructions issued by the scalar processor and performed all the functions of a memory controller, including the generation and checking of ECC (error-correcting code) bits. As detailed in Fig. 8.32b, each vector unit had a vector instruction decoder, a pipelined ALU, and sixty-four 64-bit registers, like a conventional vector processor.
[Figure 8.32: the processing node with vector units; each vector unit contains a pipelined ALU, a 64 × 64-bit register file, and memory-controller functions, and sits between its memory bank and the 64-bit node bus shared by the RISC processor and the network interface]
Fig. 8.32  The processing node with vector units in the CM-5 (Courtesy of Thinking Machines Corporation, 1992)
Each vector instruction could be issued to a specific vector unit, to a pair of units, or broadcast to all four units at once. The scalar processor took care of address translation and loop control, overlapping them with vector unit operations. Together, the vector units provided 512 Mbytes/s of memory bandwidth and 128 Mflops of 64-bit peak performance per node. In this sense, each processing node of the CM-5 was itself a supercomputer. Collectively, 16K processing nodes would yield a peak performance of 2^14 × 2^7 = 2^21 Mflops = 2 Tflops.
Initially, SPARC processors were used in implementing the control processors and processing nodes. As processor technology advanced, other new processors could also be combined in the system. The network architecture was designed to be independent of the processors chosen, except for the network interfaces, which would need some minor modifications when new processors were used.
Replication   Recall the broadcast operation, where a single value may be replicated into as many copies as needed and distributed to all processors, as illustrated in Fig. 8.33a. Other duplication operations include the spreading of a column vector into all the columns of a matrix (Fig. 8.33b), the expansion of a short vector into a long vector (Fig. 8.33c), and a completely irregular duplication (Fig. 8.33d).
[Figure 8.33: replication operations on the CM-5, showing (a) broadcast of a single value to all processors, (b) spreading a column vector into all columns of a matrix, (c) expansion over variable-length vectors, and (d) completely irregular duplication]
Replication plays a fundamental role in matrix arithmetic and vector processing, especially on a data-parallel machine. Replication is carried out through the control network in four kinds of broadcasting schemes: user broadcast, supervisor broadcast, interrupt broadcast, and utility broadcast. These operations can be used to download code and to distribute data, to implement fast barrier synchronization, and to configure partitions through the OS.
Reduction   Vector reduction was implemented on the CM-2 by first scanning, and on the CM-5 the mechanism was further generalized as the opposite of replication. As illustrated in Fig. 8.34, global reduction produces the sum of vector components (Fig. 8.34a). Similarly, the row/column reductions produce the sums per each row or column of a matrix (Fig. 8.34b).
"v'ariahle—le:ngth vectors were reduced in chunks ofa long vector (Fig. 8.34-c). The same idea was applied
to a oomplelely irregular set as well (Fig. B.34d_]. In general, reduction functions include the maximum, the
minimum, the average, the dot. product, the sum, logical AND, logical UR, etc. Fast scanning and combining
are necessities in implementing these operation
[Figure 8.34: reduction operations on the CM-5, showing (a) the global sum of a vector, (b) row/column reductions of a matrix, (c) reductions over variable-length vectors, and (d) reduction over a completely irregular set]
Fig. 8.34  Reduction operations on the CM-5 (Courtesy of Thinking Machines Corporation, 1991)
Four types of combining operations, namely reduction, forward scan (parallel prefix), backward scan (parallel suffix), and router done, were supported by the control network. We will describe parallel prefix shortly. Router done refers to the detection of the completion of a message-routing cycle, based on Kirchhoff's current law, in that the network interfaces keep track of the number of messages entering and leaving the data network. When a round of message sending and acknowledging is complete, the net "current" (messages) in and out of a port should be zero.
Permutation   Data-parallel computing relies on permutation for fast exchange of data among processing nodes. Figure 8.35 illustrates four cases of permutations performed on the CM-5. These permutation operations are often needed in matrix transpose, reversing a vector, shifting a multidimensional grid, and FFT butterfly operations.
[Figure 8.35: permutation operations on the CM-5, showing (a) a 1-D nearest-neighbor shift, (b) a 2-D row/column shift, and two further, more general permutation patterns]
Fig. 8.35  Permutation operations for interprocessor communications on the CM-5 (Courtesy of Thinking Machines Corporation, 1991)
Parallel Prefix   This is a kind of combining operation supported by the control network. A parallel prefix operation delivers to the ith processor the result of applying one of the five reduction operators to the values in the preceding i − 1 processors, in the linear order given by data address.
The idea is illustrated in Fig. 8.36 with four examples. Figure 8.36a shows the one-dimensional sum-prefix, in which, for example, the fourth output 12 is the sum of the first four input elements (1 + 2 + 5 + 4 = 12). The two-dimensional row/column sum-prefix (Fig. 8.36b) can be similarly performed using the forward-scanning mechanism.
Figure 8.36c computes the one-dimensional prefix-sum on sections of a long vector independently. Figure 8.36d shows the forward scanning along linked lists to produce the prefix-sums as outputs.
Many prefix and suffix scanning operations appear to be inherently sequential processes. But the scanning and combining mechanisms on the CM-5 could complete the process in approximately log₂ n steps, where n is the array length involved. For example, on the CM-5 a parallel prefix operation on a vector of 1000 entries could be finished in 10 steps instead of 1000 steps.
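The log-step behavior can be illustrated with a short simulation of an inclusive scan in the Hillis/Steele style. This is only an illustration of the idea, not the CM-5 control-network implementation; the first four input elements follow the sum-prefix example above, and the remaining elements are arbitrary.

def parallel_prefix_sum(values):
    """Inclusive sum-prefix computed in ceil(log2 n) data-parallel steps."""
    x = list(values)
    n, d, steps = len(x), 1, 0
    while d < n:
        # in one parallel step, every "processor" i >= d adds the value d positions away
        x = [x[i] + x[i - d] if i >= d else x[i] for i in range(n)]
        d *= 2
        steps += 1
    return x, steps

result, steps = parallel_prefix_sum([1, 2, 5, 4, 6, 3, 9, 2])
print(result)     # [1, 3, 8, 12, 18, 21, 30, 32]; the fourth output is 12, as in Fig. 8.36a
print(steps)      # 3 steps for 8 elements (log2 8), instead of 8 sequential additions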
[Figure 8.36: parallel prefix operations on the CM-5, showing (a) a 1-D sum-prefix, (b) a 2-D row/column sum-prefix, (c) sum-prefixes over variable-length vectors, and (d) sum-prefixes along linked lists]
Fig. 8.36  Parallel prefix operations on the CM-5 (Courtesy of Thinking Machines Corporation, 1991)
Summary
By around 1970, computer systems based on the basic single-processor von Neumann architecture had become well established, with products from several computer companies available in the market. In the search for higher processing power, especially for scientific and engineering applications, the earliest supercomputers made heavy use of vector processing concepts, while the concepts of shared-bus multiprocessors and SIMD systems were also beginning to emerge at around that time.
We started this chapter with a study of the basic vector processing concepts, vector instruction types, and interleaved vector memory access schemes. Vector instruction types include vector-vector, vector-scalar, vector-memory, vector reduction, gather and scatter, and masking operations. Examples were studied of the early supercomputers based on vector processing concepts, including systems produced by the two pioneer supercomputer companies Cray and CDC.
Our study of multivector computers (i.e. systems based on multiple vector processors) began with the basic system design rules for achieving the target performance. These design rules can be related to processing power, I/O and networking, memory bandwidth, and scalability. As specific examples, multivector systems and early massively parallel processing (MPP) systems introduced by Cray were studied, as were Fujitsu multivector systems. Also reviewed in brief were mainframe systems provided with vector processing capability, and the so-called mini-supercomputers which emerged with advances in electronic technology.
The concept of compound vector processing arises from the search for more efficient processing of vector data. Scientific and engineering applications make use of such vector operations, and therefore system architects have always looked for ways to map them efficiently onto the underlying vector processing hardware. The concepts of vector loops and chaining, and of multi-pipeline networking, have also been developed with the aim of providing efficient support for compound vector processing.
SIMD computer systems may be of one of two basic types: with distributed memory modules and with shared memory modules. Specific examples were discussed of two innovative SIMD systems: Connection Machine 2 (CM-2), with processors based on bit-slice technology, and MasPar MP-1, with its specially designed processors. Both systems used sophisticated system interconnects and had the capability to connect thousands of processors. However, for good technological reasons, the architectural trend later turned away from SIMD systems and towards massively parallel MIMD (or SPMD) systems.
Connection Machine 5 (CM-5) represents the shift towards massively parallel MIMD architecture which occurred in the mid-1990s. The main factor behind this shift was the availability of low-cost but powerful processors, made possible by rapid advances in the underlying VLSI technology. CM-5 innovations included the use of a large number of RISC processors, a sophisticated data network (using a fat tree), and special hardware features to support efficient and versatile interprocessor communication, which included useful operations such as replication, reduction, and permutation.
Exercises
Problem 8.1  Explain the structural and operational differences between register-to-register and memory-to-memory architectures in building multipipelined supercomputers for vector processing. Comment on the advantages and disadvantages in using SIMD computers as compared with the use of pipelined supercomputers for vector processing.

Problem 8.2  Explain the following terms related to vector processing:
(a) Vector and scalar balance point.
(b) Vectorization ratio in user code.
(c) Vectorization compiler or vectorizer.
(d) Vector reduction instructions.
(e) Gather and scatter instructions.
(f) Sparse matrix and masking instruction.

Problem 8.3  Explain the following memory organizations for vector accesses:
(a) S-access memory organization.
(b) C-access memory organization.
(c) C/S-access memory organization.

Problem 8.4  Distinguish among the following vector processing machines in terms of architecture, performance range, and cost-effectiveness:
(a) Full-scale vector supercomputers.
(b) High-end mainframes or near-supercomputers.
(c) Minisupercomputers or supercomputing workstations.

Problem 8.5  Explain the following terms associated with compound vector processing:
(a) Compound vector functions.
(b) Vector loops and pipeline chaining.
(c) Systolic program graphs.
(d) Pipeline nets or pipenets.

Problem 8.6  Answer the following questions related to the architecture and operations of the Connection Machine CM-2:
(a) Describe the processing node architecture, including the processor, memory, floating-point unit, and network interface.
(b) Describe the hypercube router and the NEWS grid and explain their uses.
(c) Explain the scanning and spread mechanisms and their applications on the CM-2.
(d) Explain the concepts of broadcasting, global combining, and virtual processors in the use of the CM-2.

Problem 8.7  Answer the following questions about the MasPar MP-1:
(a) Explain the X-Net mesh interconnect (the PE array) built into the MP-1.
(b) Explain how the multistage crossbar router works for global communication between all PEs.
(c) Explain the computing granularity on PEs and how fast I/O is performed on the MP-1.

Problem 8.8  Answer the following questions about the Connection Machine CM-5:
(a) What is a fat tree, and what is its application in constructing the data network in the CM-5?
(b) What are user partitions and their resource requirements?
(c) Explain the functions of the control processors, of the control network, and of the diagnostic network.
(d) Explain how vector processing is supported in each processing node.

Problem 8.9  Give examples, different from those in Figs. 8.33 through 8.36, to explain the concepts of replication, reduction, permutation, and parallel prefix operations on the CM-5. Check the Technical Summary of the CM-5 published by Thinking Machines Corporation if additional reading is needed.

Problem 8.10  On a Fujitsu VP2000, the vector processing unit was equipped with two load/store pipelines plus five functional pipelines as shown in Fig. 8.13. Consider the execution of the following compound vector function:
A(I) = B(I) × C(I) + D(I) × E(I) + F(I) × G(I)
for I = 1, 2, ..., N. Initially, all vector operands are in memory, and the final vector result must be stored in memory.
(a) Show a pipeline-chaining diagram, similar to Fig. 8.18, for executing this CVF.
(b) Show a space-time diagram, similar to Fig. 8.19, for pipelined execution of the CVF. Note that two vector loads can be carried out simultaneously on the two vector-access pipes. At the end of the computation, one of the two access pipes is used for storing the A array.

Problem 8.11  The following sequence of compound vector functions is to be executed on a Cray X-MP type vector processor:
A(I) = B(I) + s × C(I)
D(I) = s × B(I) × C(I)
E(I) = C(I) × (C(I) − B(I))
where B(I) and C(I) are each 64-element vectors originally stored in memory. The resulting vectors A(I), D(I), and E(I) must be stored back into memory after the computation.
(a) Write 11 vector instructions in proper order to execute the above CVFs on a Cray X-MP type vector processor with two vector-load pipes and one vector-store pipe, which can be used simultaneously with the remaining functional pipelines.
(b) Show a space-time diagram, similar to Fig. 8.19, for achieving maximally chained vector operations for executing the above CVFs in minimum time.
(c) Show the potential speedup of the above vector chaining operations over the chaining operations on the Cray 1, which had only one memory-access pipe.

Problem 8.12  Consider a vector computer which can operate in one of two execution modes at a time: one is the vector mode with an execution rate of Rv = 2000 Mflops, and the other is the scalar mode with an execution rate of Rs = 200 Mflops. Let α be the percentage of code that is vectorizable in a typical program mix for this computer.
(a) Derive an expression for the average execution rate Ra for this computer.
(b) Plot Ra as a function of α in the range (0, 1).
(c) Determine the vectorization ratio α needed in order to achieve an average execution rate of Ra = 1500 Mflops.
(d) Suppose α = 0.7. What value of Rv is needed to achieve Ra = 400 Mflops?

Problem 8.13  Describe an algorithm using add, multiply, and data-routing operations to compute the expression s = A1 × B1 + A2 × B2 + ... + A32 × B32 with minimum time in each of the following two computer systems. It is assumed that add and multiply require two and four time units, respectively. The time required for instruction/data fetches from memory and decoding delays are ignored. All instructions and data are assumed already loaded into the relevant PEs. Determine the minimum compute time in each ... maximum 64-way parallelism in their vector operations.

Problem 8.15  Devise a minimum-time algorithm to multiply two 64 × 64 matrices, A = [aij] and B = [bij], on an SIMD machine consisting of 64 PEs with local memory. The 64 PEs are interconnected by a 2D 8 × 8 torus with bidirectional links.
(a) Show the initial distribution of the input matrix elements [aij] and [bij] on the PE memories.
(b) Specify the SIMD instructions needed to carry out the matrix multiplication. Assume that each PE can perform one multiply, one add, or one shift (shifting data to one of its four neighbors) operation per cycle.
You should first compute all the multiply and add operations on local data before starting to route data to neighboring PEs. The SIMD shift operations can be either east, west, south, or north with wraparound connections on the torus.
9.1  LATENCY-HIDING TECHNIQUES
Massively parallel and scalable systems may typically use distributed shared memory. The access of remote memory significantly increases memory latency. Furthermore, processor speed has been increasing at a much faster rate than memory speed. Thus any scalable multiprocessor or large-scale multicomputer must rely on the use of latency-reducing, -tolerating, or -hiding mechanisms. Four latency-hiding mechanisms are studied below for enhancing scalability and programmability.
Latency hiding can be accomplished through four complementary approaches: (i) using prefetching techniques, which bring instructions or data close to the processor before they are actually needed; (ii) using coherent caches supported by hardware to reduce cache misses; (iii) using relaxed memory consistency models, which allow buffering and pipelining of memory references; and (iv) using multiple-context support to allow a processor to switch from one context to another when a long-latency operation is encountered.
The first three mechanisms are described in this section, supported by simulation results obtained by Stanford researchers. Multiple contexts will be treated with multithreaded processors and system architectures in Sections 9.2 and 9.4. However, the effect of multiple contexts is shown here in combination with other latency-hiding mechanisms.
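As a back-of-the-envelope illustration of the prefetching idea, the sketch below compares a loop that stalls on every remote access with one that issues the access for the next iteration one step ahead; the cycle counts are made-up illustrative numbers, not measurements from any of the systems discussed here.

def loop_time(n_iters, compute, latency, prefetch):
    """Rough cycle count for a loop whose every iteration needs one remote access."""
    if not prefetch:
        return n_iters * (latency + compute)       # stall on every remote access
    # with prefetching, the access for iteration i+1 overlaps the compute of iteration i
    per_iter = max(latency, compute)
    return latency + n_iters * per_iter            # only the first access is fully exposed

print(loop_time(100, compute=80, latency=100, prefetch=False))  # 18000 cycles
print(loop_time(100, compute=80, latency=100, prefetch=True))   # 10100 cycles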
[Figure 9.1: each processor with a first-level cache, write buffer, and secondary cache; the processing nodes are grouped into clusters (Cluster 1 through Cluster n) that are joined by an interconnection network]
Fig. 9.1  A scalable coherent cache multiprocessor with distributed shared memory, modeled after the Stanford Dash (Courtesy of Anoop Gupta et al., Proc. 1991 Ann. Int. Symp. Computer Arch.)
Cache coherence was maintained using an invalidating, distributed directory-based protocol (Section 7.2.3). For each memory block, the directory kept track of the remote nodes caching it. When a write occurred, point-to-point messages were sent to invalidate remote copies of the block. Acknowledgment messages were used to inform the originating node when an invalidation was completed.
Two levels of local cache were used per processing node. Loads and writes were separated with the use of write buffers for implementing weaker memory consistency models. The main memory was shared by all processing nodes in the same cluster. To facilitate prefetching and the directory-based coherence protocol, directory memory and remote-access caches were used for each cluster. The remote-access cache was shared by all processors in the cluster.
The SVM Concept  Figure 9.2 shows the structure of a distributed shared memory. A global virtual address space is shared among processors residing at a large number of loosely coupled processing nodes. This shared virtual memory (SVM) concept was introduced in Section 4.4.1. Implementation and management issues of SVM are discussed below.
[Figure omitted: each processing node (Node 1 through Node N) has a CPU, local memory, and a page table; pages from the local memories are mapped, with access attributes (nil, read-only, writable), into a single shared virtual address space spanning all nodes.]

Fig. 9.2 The concept of distributed shared memory with a global virtual address space shared among all processors on loosely coupled processing nodes in a massively parallel architecture (Courtesy of Kai Li, 1992)
Shared virtual memory was first developed in a Ph.D. thesis by Li (1986) at Yale University. The idea is to implement coherent shared memory on a network of processors without physically shared memory. The coherent mapping of SVM on a message-passing multicomputer architecture is shown in Fig. 9.2. The system uses virtual addresses instead of physical addresses for memory references.

Each virtual address space can be as large as a single node can provide and is shared by all nodes in the system. Li (1988) implemented the first SVM system, IVY, on a network of Apollo workstations. The SVM address space is organized in pages which can be accessed by any node in the system. A memory-mapping manager on each node views its local memory as a large cache of pages for its associated processor.
Page Swapping  According to Kai Li (1992), pages that are marked read-only can have copies residing in the physical memories of other processors. A page currently being written may reside in only one local memory. When a processor writes a page that is also on other processors, it must update the page and then invalidate all copies on the other processors. Li described the page swapping as follows:

A memory reference causes a page fault when the page containing the memory location is not in a processor's local memory. When a page fault occurs, the memory manager retrieves the missing page from the memory of another processor. If there is a page frame available on the receiving node, the page is moved
in. Otherwise, the SVM system uses page replacement policies to find an available page frame, swapping its contents to the sending node.

A hardware MMU can set the access rights (nil, read-only, writable) so that a memory access violating memory coherence will cause a page fault. The memory coherence problem is solved in IVY through distributed fault handlers and their servers. To client programs, this mechanism is completely transparent.

The large virtual address space allows programs to be larger in code and data space than the physical memory on a single node. This SVM approach offers the ease of shared-variable programming in a message-passing environment. In addition, it improves software portability and enhances system scalability through modular memory growth.
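As a concrete illustration of the fault-handling path just described, the following C sketch models page ownership and FIFO page replacement on a handful of nodes. The node and page counts, the function names, and the single-writable-copy simplification are all assumptions made for the example; the real IVY fault handlers, servers, and access-rights machinery are not modeled.

#include <stdio.h>

#define PAGES   8   /* pages in the shared virtual address space    */
#define FRAMES  4   /* physical page frames per node (toy numbers)  */
#define NODES   3

typedef struct {
    int resident[FRAMES];   /* page held by each local frame, or -1 */
    int next_victim;        /* trivial FIFO replacement pointer     */
} node_t;

static node_t node[NODES];
static int owner_of[PAGES]; /* node whose memory currently holds the writable copy */

/* Access 'page' from node 'n'. If the page is not resident, take a
 * (simulated) page fault: pick a victim frame, swap its contents back to
 * the sending node, fetch the page from its current owner, and record the
 * new owner. Actual data movement and access rights are not modeled.    */
static void access_page(int n, int page)
{
    node_t *self = &node[n];
    int f, from;

    for (f = 0; f < FRAMES; f++)
        if (self->resident[f] == page)
            return;                              /* hit: nothing to do */

    from = owner_of[page];                       /* page fault          */
    f = self->next_victim;
    if (self->resident[f] >= 0) {                /* no free frame       */
        printf("node %d swaps page %d out to node %d\n",
               n, self->resident[f], from);
        owner_of[self->resident[f]] = from;      /* victim now lives at the sender */
    }
    printf("node %d fetches page %d from node %d\n", n, page, from);
    self->resident[f] = page;
    self->next_victim = (f + 1) % FRAMES;
    owner_of[page] = n;                          /* single writable copy moves here */
}

int main(void)
{
    int i, f, p;
    for (i = 0; i < NODES; i++) {
        for (f = 0; f < FRAMES; f++) node[i].resident[f] = -1;
        node[i].next_victim = 0;
    }
    for (p = 0; p < PAGES; p++) owner_of[p] = 0; /* node 0 initially owns every page */

    access_page(1, 3);   /* remote fault: page 3 fetched from node 0 */
    access_page(1, 3);   /* local hit                                */
    access_page(2, 3);   /* remote fault: page 3 fetched from node 1 */
    return 0;
}

Running the sketch simply traces which node fetches or swaps out which page; message traffic, read-only replication, and MMU protection are deliberately left out.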
Example SVM Systems  Nitzberg and Lo (1991) conducted a survey of SVM research systems. Excerpted from their survey, descriptions of four representative SVM systems are summarized in Table 9.1. Dash implemented SVM with a directory-based coherence protocol. Linda offered a shared associative object memory with access functions. Plus used a write-update coherence protocol and performed replication only by program request. Shiva extended the IVY system for the Intel iPSC/2 hypercube. In using SVM systems, there exists a tendency to use large block (page) sizes as units of coherence. This tends to increase false-sharing activity.
Table 9.1 Representative SVM Research Systems (Excerpts from Nitzberg and Lo, IEEE Computer, August 1991)

System and developer: Stanford Dash (Lenoski, Laudon, Gharachorloo, Gupta, and Hennessy, 1988-).
Implementation and structures: Mesh-connected network of Silicon Graphics 4D/340 workstations with added hardware for coherent caches and prefetching.
Coherence semantics and protocol: Release memory consistency with write-invalidate protocol.
Special mechanisms for performance and synchronization: Relaxed coherence, prefetching, and queued locks for synchronization.

System and developer: Yale Linda (Carriero and Gelernter, 1982-).
Implementation and structures: Software-implemented system based on the concept of tuple space with access functions to achieve coherence via virtual memory management.
Coherence semantics and protocol: Coherence varied with environment; hashing used in associative search; no mutable data.
Special mechanisms for performance and synchronization: Linda could be implemented for many languages and machines using C-Linda or Fortran-Linda interfaces.

System and developer: CMU Plus (Bisiani and Ravishankar, 1988-).
Implementation and structures: A hardware implementation using the MC 88000, Caltech mesh, and Plus kernel.
Coherence semantics and protocol: Used processor consistency, nondemand write-update coherence, delayed operations.
Special mechanisms for performance and synchronization: Pages for sharing, words for coherence, complex synchronization instructions.

System and developer: Princeton Shiva (Li and Schaefer, 1988).
Implementation and structures: Software-based system for the Intel iPSC/2 with a Shiva/native operating system.
Coherence semantics and protocol: Sequential consistency, write-invalidate protocol, 4-Kbyte page swapping.
Special mechanisms for performance and synchronization: Used data structure compaction, messages for semaphores and signal-wait, distributed memory as backing store.
Scalability issues of SVM architectures include determining the sizes of data structures for maintaining memory coherence and how to take advantage of the fast data transmission among distributed memories in order to implement large SVM address spaces. Data structure compaction and page swapping can simplify the design of a large SVM address space without using disks as backing stores. A number of alternative choices are given in Li (1992).
Benefits of Prefetching  The benefits of prefetching come from several sources. The most obvious benefit occurs when a prefetch is issued early enough in the code so that the line is already in the cache by the time it is referenced. However, prefetching can improve performance even when this is not possible (e.g. when the address of a data structure cannot be determined until immediately before it is referenced). If multiple prefetches are issued back to back to fetch the data structure, the latency of all but the first prefetched reference can be hidden due to the pipelining of the memory accesses.

Prefetching offers another benefit in multiprocessors that use an ownership-based cache coherence protocol. If a cache block line is to be modified, prefetching it directly with ownership can significantly reduce the write latencies and the ensuing network traffic for obtaining ownership. Network traffic is reduced in read-modify-write instructions, since prefetching with ownership avoids first fetching a read-shared copy.
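As a small illustration of the first benefit, the fragment below issues a software prefetch for the next record of a linked structure while the current record is being processed. The particle structure and field names are invented for the example, and __builtin_prefetch is the GCC/Clang builtin standing in for whatever prefetch instruction a given machine exposes; treat this as a sketch of the technique, not as the instrumentation used in the Dash study.

#include <stddef.h>

/* A particle record, loosely in the spirit of the MP3D data structures
 * (field names here are purely illustrative). */
struct particle {
    double pos[3], vel[3];
    struct particle *next;
};

/* Advance every particle by dt, prefetching the next record while the
 * current one is being processed so that part of its miss latency is
 * overlapped with the computation on the current record.            */
void advance(struct particle *head, double dt)
{
    for (struct particle *p = head; p != NULL; p = p->next) {
        if (p->next)
            __builtin_prefetch(p->next, 1);   /* 1 = prefetch with intent to write */
        for (int i = 0; i < 3; i++)
            p->pos[i] += p->vel[i] * dt;
    }
}

Passing 1 as the second argument requests the line with intent to write, which corresponds to the prefetch-with-ownership idea above: the block arrives in an exclusive state, so the later update does not need a second ownership transaction.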
Benchmark Results  Stanford researchers (Gupta, Hennessy, Gharachorloo, Mowry, and Weber, 1991) reported some benchmark results for evaluating various latency-hiding mechanisms. Benchmark programs included a particle-based three-dimensional simulator used in aeronautics (MP3D), an LU-decomposition program (LU), and a digital logic simulation program (PTHOR). The effect of prefetching is illustrated in Fig. 9.3 for running the MP3D code on a simulated Dash multiprocessor (Fig. 9.1).
[Figure omitted: bar chart of normalized execution time for five prefetching strategies (nopf, pf1, pf2, pf3, pf4), with each bar broken down into prefetch, synchronization, write-buffer, read, and busy components. Reported prefetch coverage for the five strategies is 0%, 37%, 91%, 91%, and 95%, obtained with 0, 1, 2, 6, and 16 extra source lines, respectively.]

Fig. 9.3 Effect of various prefetching strategies for running the MP3D benchmark on a simulated Dash multiprocessor (Courtesy of Anoop Gupta et al., 1991)
The simulation runs involved 10,000 particles in a 64 × 8 × 3 space array with five time steps. Five prefetching strategies were tested (nopf, pf1, pf2, pf3, and pf4 in Fig. 9.3). These strategies range from no prefetching (nopf) to prefetching of the particle record in the same iteration or pipelined across increasing numbers of iterations (pf1 through pf4). The bar diagrams in Fig. 9.3 show the execution times normalized with respect to the nopf strategy. Each bar shows a breakdown of the times required for prefetches, synchronization operations, using write buffers, reads, and busy time in computing.

The end result was that prefetches were issued for up to 95% of the misses that occurred in the case without prefetching (referred to as the coverage factor in Fig. 9.3). Prefetching yielded significant time reduction in synchronization operations, using write buffers, and performing read operations. The best speedup achieved in Fig. 9.3 is 1.86, when the pf4 prefetching strategy is compared with the nopf strategy. Still the prefetching benefits would be application-dependent. To introduce the prefetches in the MP3D code, only 16 lines of extra code were added to the source code.
Dash Experience  We evaluate the benefits when both private and shared read-write data are cacheable, as allowed by the Dash hardware coherent caches, versus the case where only private data are cacheable. Figure 9.4 presents a breakdown of the normalized execution times with and without caching of shared data for each of the applications. Private data are cached in both cases.
[Figure omitted: bar chart of normalized execution time for MP3D, LU, and PTHOR, each with and without caching of shared data; each bar is broken down into busy, read-miss, write-miss, and synchronization components.]

Fig. 9.4 Effect of caching shared data in simulated Dash benchmark experiments (Courtesy of Gupta et al., Proc. Int. Symp. Comput. Archit., Toronto, Canada, May 1991)
The execution time of each application is normalized to the execution time of the case where shared data is not cached. The bottom section of each bar represents the busy time or useful cycles executed by the processor. The section above it represents the time that the processor is stalled waiting for reads. The section above that is the amount of time the processor is stalled waiting for writes to be completed. The top section, labeled "synchronization," accounts for the time the processor is stalled due to locks and barriers.

Benefits of Caching  As expected, the caching of shared read-write data provided substantial gains in performance, with benefits ranging from 2.2- to 2.7-fold improvement for the three Stanford benchmark programs. The largest benefit came from a reduction in the number of cycles wasted due to read misses. The cycles wasted due to write misses were also reduced, although the magnitude of the benefits varied across the three programs due to different write-hit ratios.

The cache-hit ratios achieved by MP3D, LU, and PTHOR were 80, 66, and 77%, respectively, for shared-read references, and 75, 97, and 47% for shared-write references. It is interesting to note that these hit ratios are substantially lower than the usual uniprocessor hit ratios.

The low hit ratios arise from several factors: The data set size for engineering applications is large, parallelism decreases spatial locality in the application, and communication among processors results in invalidation misses. Still, hardware cache coherence is an effective technique for substantially increasing the performance with no assistance from the compiler or programmer.
9.1.4 Scalable Coherence Interface

A scalable coherence interconnect structure with low latency is needed to extend from conventional bused backplanes to a fully duplex, point-to-point interface specification. The scalable coherence interface (SCI), which was introduced in Chapter 5, is specified in IEEE Standard 1596-1992. SCI supports unidirectional point-to-point connections, with two such links between each pair of nodes; packet-based communication is used, with routing.

Up to 64K processors, memory modules, or I/O nodes can effectively interface with a shared SCI interconnect. The cache coherence protocols used in SCI are directory-based. A sharing list is used to chain the distributed directories together for reference purposes.
SCI Interconnect Models  SCI defines the interface between nodes and the external interconnect, using 16-bit links with a bandwidth of up to 1 Gbyte/s per link. As a result, backplane buses have been replaced by unidirectional point-to-point links. A typical SCI configuration is shown in Fig. 9.5a. Each SCI node can be a processor with attached memory and I/O devices. The SCI interconnect can assume a ring structure or a crossbar switch as depicted in Figs. 9.5b and 9.5c, respectively, among other configurations.
[Figure omitted: (a) a typical SCI configuration with nodes attached by input and output links and a converter bridging to a VME bus; (b) a ring of nodes for point-to-point transactions; (c) a crossbar multiprocessor configuration.]

Fig. 9.5 SCI interconnection configurations (Reprinted with permission from the IEEE Standard 1596-1992, copyright © 1992 by IEEE, Inc.)
Each node has an input link and an output link which are connected from or to the SCI ring or crossbar. The bandwidth of SCI links depends on the physical standard chosen to implement the links and interfaces.

In such an environment, the concept of broadcast bus-based transactions is abandoned. Coherence protocols are based on point-to-point transactions initiated by a requester and completed by a responder. A ring interconnect provides the simplest feedback connections among the nodes.

The converter in Fig. 9.5a is used to bridge the SCI ring to the VME bus as shown. A mesh of rings can also be considered using some bridging modules. The bandwidth, arbitration, and addressing mechanisms of an SCI ring significantly outperform backplane buses. By eliminating the snoopy cache controllers, the SCI is also less expensive per node, but the main advantage lies in its low latency and scalability.

Although SCI is scalable, the amount of memory used in the cache directories also scales up well. The performance of the SCI protocol does not scale, since when the sharing list is long, invalidations take a proportionately longer time.
Sharing-List Structures  Sharing lists are used in SCI to build chained directories for cache coherence use. The length of the sharing lists is effectively unbounded. Sharing lists are dynamically created, pruned, and destroyed. Each coherently cached block is entered onto a list of processors sharing the block.

Processors have the option of bypassing the coherence protocols for locally cached data. Cache blocks of 64 bytes are assumed. By distributing the directories among the sharing processors, SCI avoids scaling limitations imposed by using a central directory. Communications among sharing processors are supported by heavily shared memory controllers, as shown in Fig. 9.6.
[Figure omitted: processors with cached copies of a block are chained by forward and backward pointers into a sharing list anchored at the memory directory.]

Fig. 9.6 SCI cache coherence protocol with distributed directories (Courtesy of D. V. James et al., IEEE Computer, 1990)
Other blocks may be locally cached and are not visible to the coherence protocols. For every block address, the memory and cache entries have additional tag bits which are used to identify the first processor (head) in the sharing list and to link the previous and following nodes.

Doubly linked lists are maintained between processors in the sharing list, with forward and backward pointers as shown by the double arrows in each link. Noncoherent copies may also be made coherent by page-level control. However, such higher-level software coherence protocols are beyond the scope of the SCI standard.
Sharing-List Creation  The states of the sharing list are defined by the state of the memory and the states of the list entries. Normally, the shared memory is either in a home (uncached) or a cached (sharing-list) state. The sharing-list entries specify the location of the entry in a multiple-entry sharing list, identify the only entry in the list, or specify the entry's cache properties, such as clean, dirty, valid, or stale.

The head processor is always responsible for list management. The stable and legal combinations of the memory and entry states can specify uncached data, clean or dirty data at various locations, and cached writable or stale data.

The memory is initially in the home state (uncached), and all cache copies are invalid. Sharing-list creation begins at the cache where an entry is changed from an invalid to a pending state. When a read-cache transaction is directed from a processor to the memory controller, the memory state is changed from uncached to cached and the requested data is returned.

The requester's cache entry state is then changed from a pending state to an only-clean state. Sharing-list creation is illustrated in Fig. 9.7a. Multiple requests can be simultaneously generated, but they are processed sequentially by the memory controller.
[Figure omitted: (a) sharing-list creation, in which the first requester becomes the head of a one-entry list; (b) sharing-list update, in which a new requester is prepended and the old head becomes the second entry.]

Fig. 9.7 Sharing-list creation and update examples (Courtesy of D. V. James et al., IEEE Computer, 1990)
Sharing-List Updates  For subsequent memory accesses, the memory state is cached, and the cache head of the sharing list has possibly dirty data. As illustrated in Fig. 9.7b, a new requester (cache A) first directs its read-cache transaction to memory but receives a pointer to cache B instead of the requested data.

A second cache-to-cache transaction, called prepend, is directed from cache A to cache B. Cache B then sets its backward pointer to point to cache A and returns the requested data. The dashed lines correspond to transactions between a processor and memory or another processor. The solid lines are sharing-list pointers. After the transaction, the inserted cache A becomes the new head, and the old head, cache B, is in the middle as shown by the new sharing list on the right in Fig. 9.7b.

Any sharing-list entry may delete itself from the list. Details of entry deletions are left as an exercise for the reader. Simultaneous deletions never generate deadlocks or starvation. However, the addition of new sharing-list entries must be performed in first-in-first-out order in order to avoid potential deadlocking dependences.

The head of the sharing list has the authority to purge other entries from the list to obtain an exclusive entry. Others may reenter as a new list head. Purges are performed sequentially. The chained-directory coherence protocols are fault-tolerant in that dirty data is never lost when transactions are discarded.
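The prepend transaction of Fig. 9.7b is, at bottom, insertion of a new head into a doubly linked list whose entries live in different caches. The C sketch below models only the pointer manipulation on a single machine; the entry structure, the function names, and the purge loop are illustrative assumptions and ignore the pending states, memory-side tag updates, and the actual message exchanges of the SCI protocol.

#include <stdio.h>
#include <stdlib.h>

/* One cache entry participating in an SCI-style sharing list for a block. */
typedef struct entry {
    int           node_id;  /* processor holding the cached copy       */
    struct entry *fwd;      /* pointer toward the tail (older sharers) */
    struct entry *back;     /* pointer toward the head (newer sharers) */
} entry_t;

/* Memory-side state for the block: either uncached (NULL) or the head. */
static entry_t *head = NULL;

/* A read-cache request from 'node_id': the new entry is prepended and
 * becomes the head, as in the sharing-list update of Fig. 9.7b.        */
static entry_t *prepend_sharer(int node_id)
{
    entry_t *e = malloc(sizeof *e);
    e->node_id = node_id;
    e->fwd  = head;          /* old head (if any) now follows the new entry  */
    e->back = NULL;          /* the head has no predecessor                  */
    if (head)
        head->back = e;      /* old head sets its backward pointer to the new head */
    head = e;
    return e;
}

/* The head purges every other entry to obtain an exclusive copy (a write). */
static void purge_others(void)
{
    while (head && head->fwd) {
        entry_t *victim = head->fwd;
        head->fwd = victim->fwd;
        if (victim->fwd) victim->fwd->back = head;
        free(victim);        /* an invalidation message would be sent here */
    }
}

int main(void)
{
    prepend_sharer(2);       /* first reader: only-clean entry          */
    prepend_sharer(5);       /* second reader prepends and becomes head */
    purge_others();          /* head purges the list before writing     */
    printf("head of sharing list: node %d\n", head->node_id);
    return 0;
}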
Implementation Issues  SCI was developed to support multiprocessor systems with thousands of processors by providing a coherent distributed-cache image of distributed shared memory and bridges that interface with existing or future buses. It can support various multiprocessor topologies using Omega or crossbar networks.

Differential emitter-coupled logic (ECL) signaling works well at SCI clock rates. The original SCI implementation uses a 16-bit data path at 1 ns per word. The interface is synchronously clocked. Several models of clock distribution are supported. With distributed shared memory and distributed cache coherence protocols, the boundary between multiprocessors and multicomputers has become blurred in MIMD systems of this class.
Processor Consistency  Goodman (1989) introduced the processor consistency (PC) model in which writes issued by each individual processor are always in program order. However, the order of writes from two different processors can be out of program order. In other words, consistency in writes is observed in each processor, but the order of reads from each processor is not restricted as long as they do not involve other processors.

The PC model relaxes the SC model by removing some restrictions on writes from different processors. This opens up more opportunities for write buffering and pipelining. Two conditions related to other processors are required for ensuring processor consistency:

(1) Before a read is allowed to perform with respect to any other processor, all previous read accesses must be performed.
(2) Before a write is allowed to perform with respect to any other processor, all previous read or write accesses must be performed.

These conditions allow reads following a write to bypass the write. To avoid deadlock, the implementation should guarantee that a write that appears previously in program order will eventually be performed.
Release Consistency  One of the most relaxed memory models is the release consistency (RC) model introduced by Gharachorloo et al (1990). Release consistency requires that synchronization accesses in the program be identified and classified as either acquires (e.g. locks) or releases (e.g. unlocks). An acquire is a read operation (which can be part of a read-modify-write) that gains permission to access a set of data, while a release is a write operation that gives away such permission. This information is used to provide flexibility in buffering and pipelining of accesses between synchronization points.

The main advantage of the relaxed models is the potential for increased performance by hiding as much write latency as possible. The main disadvantage is increased hardware complexity and a more complex programming model. Three conditions ensure release consistency:

(1) Before an ordinary read or write access is allowed to perform with respect to any other processor, all previous acquire accesses must be performed.
(2) Before a release access is allowed to perform with respect to any other processor, all previous ordinary load and store accesses must be performed.
(3) Special accesses are processor-consistent with one another.

The ordering restrictions imposed by weak consistency are not present in release consistency. Instead, release consistency requires processor consistency and not sequential consistency.

Release consistency can be satisfied by (i) stalling the processor on an acquire access until it completes, and (ii) delaying the completion of a release access until all previous memory accesses complete. Intuitive definitions of the four memory consistency models, the SC, WC, PC, and RC, are summarized in Fig. 9.8.
[Figure omitted: the four consistency models ordered from strong to relaxed.]

Fig. 9.8 Intuitive definitions of four memory consistency models. The arrows point from strong to relaxed consistencies (Courtesy of Nitzberg and Lo, IEEE Computer, August 1991)
The cost of implementing RC over that for SC arises from the extra hardware cost of providing a lockup-free cache and keeping track of multiple outstanding requests. Although this cost is not negligible, the same hardware features are also required to support prefetching and multiple contexts.
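The acquire/release classification maps directly onto the memory-order annotations of modern languages. The C11 sketch below (using <stdatomic.h> and, where available, C11 threads) illustrates that correspondence rather than the Dash hardware: the producer's ordinary write may be buffered freely but must become visible before the release store, and the consumer's acquire load orders the ordinary read that follows it.

#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

static int        payload;      /* ordinary shared data      */
static atomic_int flag = 0;     /* synchronization variable  */

/* Producer: the ordinary write may be buffered or pipelined, but it must be
 * performed before the release store (condition (2) of release consistency). */
static int producer(void *arg)
{
    (void)arg;
    payload = 42;
    atomic_store_explicit(&flag, 1, memory_order_release);   /* release */
    return 0;
}

/* Consumer: the acquire load must perform before the ordinary read that
 * follows it (condition (1)), so it is guaranteed to observe payload == 42. */
static int consumer(void *arg)
{
    (void)arg;
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)  /* acquire */
        ;                                   /* spin until the release is seen */
    printf("payload = %d\n", payload);
    return 0;
}

int main(void)
{
    thrd_t p, c;
    thrd_create(&c, consumer, NULL);
    thrd_create(&p, producer, NULL);
    thrd_join(p, NULL);
    thrd_join(c, NULL);
    return 0;
}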
Effect of Release Consistency  Figure 9.9 presents the breakdown of execution times under SC and RC for the three applications. The execution times are normalized to those shown in Fig. 9.4 with shared data cached. As can be seen from the results, RC removes all idle time due to write-miss latency.
[Figure omitted: bar chart comparing normalized execution times under SC and RC for MP3D, LU, and PTHOR.]

Fig. 9.9 Effect of relaxing the shared-memory model from sequential consistency (SC) to release consistency (RC) (Courtesy of Gupta et al., Proc. Int. Symp. Comput. Archit., Toronto, Canada, May 1991)
The gains are large in MP3D and PTHOR since the write-miss time constitutes a large portion of the execution time under SC (35 and 20%, respectively), while the gain is small in LU due to the relatively small write-miss time under SC (7%).
Effect of Combining Mechanisms  The effect of combining various latency-hiding mechanisms is illustrated by Fig. 9.10, based on the MP3D benchmark results obtained at Stanford University. The idea of using multiple-context processors will be described in Section 9.2. However, the effect of integrating MC with other latency-hiding mechanisms is presented below.

The busy parts of the execution times in Fig. 9.10 are equal in all combinations. This is the CPU busy time for executing the MP3D program. The idle part in the bar diagram corresponds to memory latency and includes all cache-miss penalties. All the times are normalized with respect to the execution time (100 units) required in a cache-coherent system. The leftmost time bar (with 241 units) corresponds to the worst case of using a private cache exclusively without shared reads or writes. Long overhead is experienced in this case due to excessive cache misses. The use of a cache-coherent system shows a 2.41-fold improvement over the private case. All the remaining cases are assumed to use hardware coherent caches.

The use of release consistency shows a 35% further improvement over the coherent system. The adding of prefetching reduces the time further to 44 units. The best case is the combination of using coherent caches, RC, and multiple contexts (MC). The rightmost time bar is obtained from applying all four mechanisms. The combined results show an overall speedup of 4 to 7 over the case of using private caches.

The above and other uncited benchmark results reported at Stanford suggest that a coherent cache and relaxed consistency uniformly improve performance. The improvements due to prefetching and multiple
contexts are sizable but are much more application-dependent. Combinations of the various latency-hiding mechanisms generally attain a better performance than each one on its own.
[Figure omitted: bar chart of normalized execution times for the MP3D benchmark under successive combinations of mechanisms, from a private-cache system (241 units) down to coherent caches combined with release consistency (RC), prefetching, and multiple contexts (MC); each bar is split into busy and idle time.]

Fig. 9.10 Effect of combining various latency-hiding mechanisms for the MP3D benchmark on a simulated Dash multiprocessor (Courtesy of Gupta, 1991)
PRINCIPLES OF MULTITHREADING

This section considers multithreaded processors and multidimensional system architectures. Only control-flow approaches are described here. Fine-grain machines are studied in Section 9.3, von Neumann multithreading in Section 9.4, and dataflow multithreading in Section 9.5. Recent developments in multithreading support by processor hardware are discussed in Chapters 12 and 13.
Architecture Environment  One possible multithreaded MPP system is modeled by a network of processor (P) and memory (M) nodes as depicted in Fig. 9.11a. The distributed memories form a global address space. Four machine parameters are defined below to analyze the performance of this network:
[Figure omitted: (a) a multithreaded MPP modeled as processor (P) and memory (M) nodes on an interconnection network, annotated with the latency L, the number of interleaved threads N, the context-switching overhead C, and the run length R between switches; (b) the multithreaded computation model, showing a sequential thread, scheduling overhead, parallel threads of computation, intercomputer communication over the distributed memories, and thread synchronization overhead (Courtesy of Gordon Bell, Commun. ACM, August 1992).]

Fig. 9.11 Multithreaded architecture and its computation model for a massively parallel processing system
(1) The latency (L): This is the communication latency on a remote memory access. The value of L includes the network delays, cache-miss penalty, and delays caused by contention in split transactions.

(2) The number of threads (N): This is the number of threads that can be interleaved in each processor. A thread is represented by a context consisting of a program counter, a register set, and the required context status words.

(3) The context-switching overhead (C): This refers to the cycles lost in performing context switching in a processor. This time depends on the switch mechanism and the amount of processor state devoted to maintaining active threads.

(4) The interval between switches (R): This refers to the cycles between switches triggered by remote references. The inverse p = 1/R is called the rate of requests for remote accesses. This reflects a combination of program behavior and memory system design.
In order to increase efficiency, one approach is to reduce the rate of requests by using distributed coherent caches. Another is to eliminate processor waiting through multithreading. The basic concept of multithreading is described below.
Multithreaded Computations  Bell (1992) has described the structure of the multithreaded parallel computation model shown in Fig. 9.11b. The computation starts with a sequential thread (1), followed
by supervisory scheduling (2) where the processors begin threads of computation (3), by intercomputer messages that update variables among the nodes when the computer has a distributed memory (4), and finally by synchronization prior to beginning the next unit of parallel work (5).

The communication overhead period (4) inherent in distributed memory structures is usually distributed throughout the computation and is possibly completely overlapped. Message-passing overhead (send and receive calls) in multicomputers can be reduced by specialized hardware operating in parallel with computation.

Communication bandwidth limits granularity, since a certain amount of data has to be transferred with other nodes in order to complete a computational grain. Message-passing calls (4) and synchronization (5) are nonproductive. Fast mechanisms to reduce or to hide these delays are therefore needed. Multithreading is not capable of speedup in the execution of single threads, while weak ordering or relaxed consistency models are capable of doing this.
Example 9.1 Latency problems for remote loads or synchronizing loads (Rishiyur Nikhil, 1992)

The remote-load situation is illustrated in Fig. 9.12a. Variables A and B are located on nodes N2 and N3, respectively. They need to be brought to node N1 to compute the difference A - B in variable C. The basic computation demands the execution of two remote loads and then the subtraction.
[Figure omitted: (a) the remote-loads problem, in which node N1 issues rload pA to node N2 and rload pB to node N3 and then computes C = A - B; (b) the synchronizing-loads problem, in which A and B are computed concurrently and node N1 must be notified when A and B are ready.]

Fig. 9.12 Two common problems caused by asynchrony and communication latency in massively parallel processors (Courtesy of R. S. Nikhil, Digital Equipment Corporation, 1991)
Let pA and pB be the pointers to A and B, respectively. The two rloads can be issued from the same thread or from two different threads. The context of the computation on N1 is represented by the variable CTXT. It can be a stack pointer, a frame pointer, a current-object pointer, a process identifier, etc. In general, variable names like vA, vB, and C are interpreted relative to CTXT.

In Fig. 9.12b, the idling due to synchronizing loads is illustrated. In this case, A and B are computed by concurrent processes, and we are not sure exactly when they will be ready for node N1 to read. The ready signals (Ready1 and Ready2) may reach node N1 asynchronously. This is a typical situation in the producer-consumer problem. Busy-waiting may result.

The key issue involved in remote loads is how to avoid idling in node N1 during the load operations. The latency caused by remote loads is an architectural property. The latency caused by synchronizing loads also depends on scheduling and the time it takes to compute A and B, which may be much longer than the transit latency. The synchronization latency is often unpredictable, while the remote-load latencies are often predictable.
Multithreading Solutions  One solution to asynchrony problems is to multiplex among many threads: when one thread issues a remote-load request, the processor begins work on another thread, and so on (Fig. 9.13a). Clearly, the cost of thread switching should be much smaller than the latency of the remote load, or else the processor might as well wait for the remote load's response.

As the internode latency increases, more threads are needed to hide it effectively. Another concern is to make sure that messages carry continuations. Suppose, after issuing a remote load from thread T1 (Fig. 9.13a), we switch to thread T2, which also issues a remote load. The responses may not return in the same order. This may be caused by requests traveling different distances, through varying degrees of congestion, to destination nodes whose loads differ greatly, etc.

One way to cope with the problem is to associate each remote load and response with an identifier for the appropriate thread, so that it can be re-enabled on the arrival of a response. These thread identifiers are referred to as continuations on messages. A large continuation name space should be provided to name an adequate number of threads waiting for remote responses.

The size of the hardware-supported continuation name space varies greatly in different system designs: from 1 in the Dash, 4 in the Alewife, 64 in the HEP, and 1024 in the Tera (Section 9.4) to the local memory address space in the Monsoon, Hybrid Dataflow/von Neumann, MDP (Section 9.3), and *T (Section 9.5). Of course, if the hardware-supported name space is small, one can always virtualize it by multiplexing in software, but this has an associated overhead.
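A minimal sketch of the continuation idea follows: every outstanding remote load carries the identifier of the thread that issued it, and the response re-enables exactly that thread regardless of the order in which replies arrive. The thread table, the tag format, and the function names are invented for illustration and do not correspond to any particular machine.

#include <stdio.h>
#include <stdbool.h>

#define MAX_THREADS 8     /* size of the continuation name space (assumed) */

typedef struct {
    bool ready;           /* can this thread be scheduled again?  */
    int  value;           /* value delivered by the last response */
} thread_t;

static thread_t threads[MAX_THREADS];

/* Issue a split-phase remote load: mark the thread blocked and return the
 * tag (the continuation) that will travel with the outgoing request.     */
static int issue_rload(int tid)
{
    threads[tid].ready = false;
    printf("thread T%d issues a remote load and is switched out\n", tid);
    return tid;
}

/* A response arrives, possibly out of order: the continuation tag tells us
 * which thread to re-enable, independent of the order of the replies.    */
static void deliver_response(int tag, int value)
{
    threads[tag].value = value;
    threads[tag].ready = true;
    printf("response for T%d arrives (value %d): thread re-enabled\n", tag, value);
}

int main(void)
{
    int t1 = issue_rload(1);     /* T1 issues rload pA           */
    int t2 = issue_rload(2);     /* T2 issues rload pB           */
    deliver_response(t2, 7);     /* replies return out of order  */
    deliver_response(t1, 3);
    return 0;
}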
Distributed Caching  The concept of distributed caching is shown in Fig. 9.13b. Every memory location has an owner node. For example, N1 owns B and N2 owns A. The directories are used to contain import-export lists and state whether the data is shared (for reads, many caches may hold copies) or exclusive (for writes, one cache holds the current value).

The directories multiplex among a small number of contexts to cover the cache loading effects. The MIT Alewife, KSR-1, and Stanford Dash have implemented directory-based coherence protocols. It should be noted that distributed caching offers a solution for the remote-loads problem, but not for the synchronizing-
loads problem. Multithreading offers a solution for remote loads and possibly for synchronizing loads. However, the two approaches can be combined to solve both types of remote-access problems.
[Figure omitted: (a) multithreading, in which a thread on node N1 issues a remote load and the processor switches to another context until the response returns; (b) distributed caching, in which each location has an owner node and the directories record import/export lists with shared or exclusive states.]

Fig. 9.13 Two solutions for overcoming the asynchrony problem (Courtesy of R. S. Nikhil, Digital Equipment Corporation, 1991)
The Enhanced Processor Model  A conventional single-thread processor will wait during a remote reference, so we may say it is idle for a period of time L. A multithreaded processor, as modeled in Fig. 9.14a, will suspend the current context and switch to another, so after some fixed number of cycles it will again be busy doing useful work, even though the remote reference is outstanding. Only if all the contexts are suspended (blocked) will the processor be idle.

Clearly, the objective is to maximize the fraction of time that the processor is busy, so we will use the efficiency of the processor as our performance index, given by
Efficiency = busy / (busy + switching + idle)    (9.1)

where busy, switching, and idle represent the amount of time, measured over some large interval, that the processor is in the corresponding state. The basic idea behind a multithreaded machine is to interleave the execution of several contexts in order to dramatically reduce the value of idle, but without overly increasing the magnitude of switching.
The state of a processor is determined by the disposition of the various contexts on the processor. During its lifetime, a context cycles through the following states: ready, running, leaving, and blocked. There can be at most one context running or leaving. A processor is busy if there is a context in the running state; it is switching while making the transition from one context to another, i.e. when a context is leaving. Otherwise, all contexts are blocked and we say the processor is idle.

A running context keeps the processor busy until it issues an operation that requires a context switch. The context then spends C cycles in the leaving state, then goes into the blocked state for L cycles, and finally re-enters the ready state. Eventually the processor will choose it and the cycle will start again.
The abstract model shown in Fig. 9.14a assumes one thread per context, and each context is represented by its own program counter (PC), register set, and process status word (PSW). An example multithreaded processor in which three thread slots (N = 3) are provided is shown in Fig. 9.14b.
[Figure omitted: (a) an abstract multithreaded processor model with N contexts, each holding its own PC, register set, and PSW, and a context selector feeding the execution pipeline; (b) a three-thread processor example with per-thread instruction queue units, a shared instruction cache and fetch unit, multiple functional units (ALUs, barrel shifter, integer multiplier, FP adder, FP multiplier, FP converter, load/store units), a data cache, and queue registers.]

Fig. 9.14 A multithreaded processor model and a three-thread processor example (Courtesy of H. Hirata et al., Proc. 19th Int. Symp. Comput. Archit., Australia, May 1992)
An instruction queue unit has a buffer which saves some instructions succeeding the instruction indicated by the program counter. The buffer size needs to be at least B = N × C words, where N is the number of thread slots and C is the number of cycles required to access the instruction cache.

An instruction fetch unit fetches at most B instructions for one thread every C cycles from the instruction cache and attempts to fill the buffers in the instruction queue unit. This fetching operation is done in an interleaved fashion for multiple threads. So, on the average, the buffer in one instruction queue unit is filled once in B cycles.

When one of the threads encounters a branch instruction, however, that thread can preempt the prefetching operation. The instruction cache and fetch unit might become a bottleneck for a processor with many thread slots. In such cases, a bigger and/or faster cache and another fetch unit would be needed.
Processor Efficiencies  A single-thread processor executes a context until a remote reference is issued (R cycles) and then is idle until the reference completes (L cycles). There is no context switch and obviously no switch overhead. We can model this behavior as an alternating renewal process having a cycle of R + L. In terms of Eq. 9.1, R and L correspond to the amount of time during a cycle that the processor is busy and idle, respectively. Thus the efficiency of a single-threaded machine is given by

E1 = R / (R + L) = 1 / (1 + L/R)    (9.2)
This shows clearly the performance degradation of such a processor in a parallel system with a large memory latency.

With multiple contexts, memory latency can be hidden by switching to a new context, but we assume that the switch takes C cycles of overhead. Assuming the run length between switches is constant, with a sufficient number of contexts there is always a context ready to execute when a switch occurs, so the processor is never idle. The processor efficiency is analyzed below under two different conditions as illustrated in Fig. 9.15.
[Figure omitted: (a) a snapshot of context switching in the saturation region, where a ready context is always available after the C-cycle switch; (b) a snapshot in the linear region, where the processor idles between contexts; (c) processor efficiency plotted against the number of contexts, rising linearly and then flattening at saturation.]

Fig. 9.15 Context switching and processor efficiency as a function of the number of contexts (Courtesy of Rafael Saavedra, 1992)
(1) Saturation region. In this saturated region, the processor operates with maximum utilization. The cycle of the renewal process in this case is R + C, and the efficiency is simply

Esat = R / (R + C) = 1 / (1 + C/R)    (9.3)
Observe that the efficiency in saturation is independent of the latency and also does not change with a further increase in the number of contexts.

Saturation is achieved when the time the processor spends servicing the other threads exceeds the time required to process a request, i.e., when (N - 1)(R + C) > L. This gives the saturation point, under constant run length, as
Nsat = L / (R + C) + 1    (9.4)
(2) Linear region. When the number of contexts is below the saturation point, there may be no ready contexts after a context switch, so the processor will experience idle cycles. The time required to switch to a ready context, execute it until a remote reference is issued, and process the reference is equal to R + C + L. Assuming N is below the saturation point, during this time all the other contexts have a turn in the processor. Thus, the efficiency is given by

Elin = N R / (R + C + L)    (9.5)

Observe that the efficiency increases linearly with the number of contexts until the saturation point is reached and beyond that remains constant. The equation for Elin gives the fundamental limit on the efficiency of a multithreaded processor and underlines the importance of the ratio C/R. Unless the context switch is extremely cheap, the remote reference rate must be kept low.
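The two regimes of Eqs. 9.3 through 9.5 can be folded into one small routine. The C sketch below simply evaluates the formulas for assumed values of R, L, and C; it is not the Markov model cited below.

#include <stdio.h>

/* Multithreaded processor efficiency for N contexts, run length R, remote
 * latency L, and context-switch overhead C (all in cycles), following
 * Eqs. 9.3-9.5: linear below the saturation point and flat above it.    */
static double efficiency(double N, double R, double L, double C)
{
    double n_sat = L / (R + C) + 1.0;          /* Eq. 9.4: saturation point  */
    if (N >= n_sat)
        return R / (R + C);                    /* Eq. 9.3: saturation region */
    return N * R / (R + C + L);                /* Eq. 9.5: linear region     */
}

int main(void)
{
    double R = 16.0, L = 128.0, C = 4.0;       /* illustrative values only */
    for (int N = 1; N <= 12; N++)
        printf("N = %2d  efficiency = %.2f\n", N, efficiency(N, R, L, C));
    /* Single-threaded efficiency, Eq. 9.2, for comparison: */
    printf("single-threaded (Eq. 9.2): %.2f\n", R / (R + L));
    return 0;
}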
Figures 9.15a and 9.15b show snapshots of context switching in the saturation and linear regions, respectively. The processor efficiency is plotted as a function of the number of contexts in Fig. 9.15c.

In Fig. 9.16, the processor efficiency is plotted as a function of the memory latency L with an average run length R = 16 cycles. The C = 0 curve corresponds to zero switching overhead. With C = 16 cycles, about 50% efficiency can be achieved. These results are based on a Markov model of multithreaded architecture by Saavedra (1992). It should be noted that multithreading increases both processor efficiency and network traffic. Tradeoffs do exist between these two opposing goals, and this has been discussed in a paper by Agarwal (1992).
[Fig. 9.16 omitted: processor efficiency plotted against memory latency L for switching overheads C = 0, 1, 4, and 16 cycles, with (a) two contexts per processor and (b) six contexts per processor.]
[Figure omitted: examples of multiprocessor architectures organized along one and two dimensions; 1-D ring examples include the Maryland Zmob, the CDC Cyberplus, and the KSR-1, followed by 2-D mesh examples.]
Two-dimensional meshes were adopted in the Stanford Dash, the MIT Alewife, the Wisconsin Multicube, the Intel Paragon, and the Caltech Mosaic C. A three-dimensional mesh/torus was implemented in the MIT J-Machine, the Tera computer, and in the Cray/MPP architecture, called T3D. The USC orthogonal multiprocessor (OMP) could be extended to higher dimensions. However, it becomes more difficult to build higher-dimensional architectures with conventional circuit boards.

Instead of using hierarchical buses or switched network architectures in one dimension, multiprocessor architectures can be extended to a higher dimensionality or multiplicity along each dimension. The concepts are described below for the two- and three-dimensional meshes proposed for the Multicube and OMP architectures, respectively.
The Wisconsin Multicube  This architecture was proposed by Goodman and Woest (1988) at the University of Wisconsin. It employed a snooping cache system over a grid of buses, as shown in Fig. 9.18a. Each processor was connected to a multilevel cache.
[Figure omitted: (a) the Wisconsin Multicube, with processors and two-level (processor and snooping) caches on a grid of row and column buses and memory attached to the column buses; (b) the orthogonal multiprocessor (OMP) architecture, with n processors accessing an n × n mesh of interleaved memory modules over row and column buses under a memory controller; (c) an OMP(3,4) configuration, in which 16 processors orthogonally access 64 memory modules over spanning buses.]

Fig. 9.18 The Multicube and orthogonal multiprocessor architectures (Courtesy of Goodman and Woest, 1988, and of Hwang et al., 1989)
The first-level cache, called the processor cache, was a high-performance SRAM cache designed with the traditional goal of minimizing memory latency. A second-level cache, referred to as the snooping cache, was a very large cache designed to minimize bus traffic.

Each snooping cache monitored two buses, a row bus and a column bus, in order to maintain data consistency among the snooping caches. Consistency between the two cache levels was maintained by using a write-through strategy to ensure that the processor cache is always a strict subset of the snooping cache. The main memory was divided up among the column buses. All processors tied to the same column shared the same home memory. The row buses were used for intercolumn communication and cache coherence control.

The proposed architecture was an example of a new class of interconnection topologies, the multicube, consisting of N = n^k processors, where each processor was connected to k buses and each bus was connected to n processors. The hypercube is a special case where n = 2. The Wisconsin Multicube was a two-dimensional multicube (k = 2), where n scaled to about 32, resulting in a proposed system of over 1000 processors.
The Orthogonal Multiprocessor  In the proposed OMP architecture (Fig. 9.18b), n processors simultaneously access n rows or n columns of interleaved memory modules. The n × n memory mesh is interleaved in both dimensions. In other words, each row is n-way interleaved and so is each column of memory modules. There are 2n logical buses spanning in two orthogonal directions.

The synchronized row access or column access must be performed exclusively. In fact, the row bus Ri and the column bus Ci can be the same physical bus because only one of the two will be used at a time. The memory controller (MC) in Fig. 9.18b synchronizes the row access and column access of the shared memory.

The OMP architecture supports special-purpose computations in which data sets can be regularly arranged as matrices. Simulated performance results obtained at USC verified the effectiveness of using an OMP in matrix algebraic computations or in image processing operations.

In Fig. 9.18b, each of the memory modules Mij is shared by two processors Pi and Pj. In other words, the physical address space of processor Pi covers the ith row or the ith column of the memory mesh. The OMP is well suited for SPMD operations, in which n processors are synchronized at the memory-access level when data sets are vectorized in matrix format.
Multidimensional Extensions  The above OMP architecture can be generalized to higher dimensions. A generalized orthogonal multiprocessor is denoted as an OMP(n, k), where n is the dimension and k is the multiplicity. There are p = k^(n-1) processors and m = k^n memory modules in the system, where p ≥ n and p ≥ k. The system uses p memory buses, each spanning into n dimensions. But only one dimension is used in a given memory cycle. There are k memory modules attached to each spanning bus.

Each module is connected to n out of p buses through an n-way switch. It should be noted that the dimension n corresponds to the number of accessible ports that each memory module has. This implies that each module is shared by n out of p = k^(n-1) processors. For example, the architecture of an OMP(3,4) is shown in Fig. 9.18c, where the circles represent memory modules, the squares processor modules, and the circles inside squares computer modules.

The 16 processors orthogonally access 64 memory modules via 16 buses, each spanning into three directions, called the x-access, y-access, and z-access, respectively. Various sizes of OMP architecture for different values of n and k are given in Table 9.2. A five-dimensional OMP with multiplicity k = 16 has 64K processors.
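The processor and memory-module counts follow directly from the two formulas above; the short C program below evaluates p = k^(n-1) and m = k^n for a few assumed configurations and reproduces the 64K-processor figure quoted for n = 5, k = 16.

#include <stdio.h>

/* Integer power helper for computing OMP(n, k) sizes. */
static unsigned long ipow(unsigned long base, unsigned exp)
{
    unsigned long r = 1;
    while (exp--) r *= base;
    return r;
}

int main(void)
{
    /* Sample (n, k) pairs; any values could be substituted here. */
    struct { unsigned n, k; } cfg[] = { {2, 4}, {3, 4}, {5, 16} };
    for (unsigned i = 0; i < 3; i++) {
        unsigned long p = ipow(cfg[i].k, cfg[i].n - 1);   /* processors     */
        unsigned long m = ipow(cfg[i].k, cfg[i].n);       /* memory modules */
        printf("OMP(%u,%u): %lu processors, %lu memory modules\n",
               cfg[i].n, cfg[i].k, p, m);
    }
    /* OMP(3,4) gives 16 processors and 64 modules; OMP(5,16) gives 65536 (64K) processors. */
    return 0;
}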
FINE-GRAIN MULTICOMPUTERS

Traditionally, shared-memory multiprocessors like the Cray Y-MP were used to perform coarse-grain computations in which each processor executed programs having tasks of a few seconds or longer. Message-passing multicomputers are used to execute medium-grain programs with approximately 10-ms task size as in the iPSC/1. In order to build MPP systems, we may have to explore a higher degree of parallelism by making the task grain size even smaller.

Fine-grain parallelism was utilized in SIMD or data-parallel computers like the CM-2 or on the message-driven J-Machine and Mosaic C to be described below. We first characterize fine-grain parallelism and discuss the network architectures proposed for such systems. Special attention is paid to the efficient hardware or software mechanisms developed for achieving fine-grain MIMD computation.
Latency Analysis  The computing granularity and communication latency of leading early examples of multiprocessors, data-parallel computers, and medium- and fine-grain multicomputers are summarized in Table 9.3. These table entries summarize what we have learned in Chapters 7 and 8. Four attributes are identified to characterize these machines. Only typical values for a typical program mix are shown. The intention is to show the order of magnitude in these entries.

The communication latency Tc measures the data or message transfer time on a system interconnect. This corresponds to the shared-memory access time on the Cray Y-MP, the time required to send a 32-bit value across the hypercube network in the CM-2, and the network latency on the iPSC/1 or J-Machine. The synchronization overhead Ts is the processing time required on a processor, or on a PE, or on a processing node of a multicomputer for the purpose of synchronization.

The sum Tc + Ts gives the total time required for IPC. The shared-memory Cray Y-MP had a short Tc but a long Ts. The SIMD machine CM-2 had a short Ts but a long Tc. The long latency of the iPSC/1 made it unattractive based on fast-advancing standards. The MIT J-Machine was designed to make a major improvement in both of these communication delays.
Fine-Grain Parallelism  The grain size Tg is measured by the execution time of a typical program, including both computing time and communication time involved. Supercomputers handle large grains. Both the CM-2 and the J-Machine were designed as fine-grain machines. The iPSC/1 was a relatively medium-grain machine compared with the rest.

Large grain implies lower concurrency or a lower DOP (degree of parallelism). Fine grain leads to a much higher DOP and also to higher communication overhead. SIMD machines used hardwired synchronization and massive parallelism to overcome the problems of long network latency and slow processor speed. Fine-grain multicomputers, like the J-Machine and Caltech Mosaic, were designed to lower both the grain size and the communication overhead compared to those of traditional multicomputers.
Table 9.3 Fine-Grain, Medium-Grain, and Coarse-Grain Machine Characteristics of Some Example Systems

Cray Y-MP: communication latency Tc = 40 ns via shared memory; synchronization overhead Ts = 20 μs; grain size Tg = 20 s; concurrency (DOP) 2-16; a coarse-grain supercomputer.

Connection Machine CM-2: Tc = 600 μs per 32-bit transfer; Ts = 125 ns per bit-slice operation in lock step; Tg = 4 μs per 32-bit result per PE instruction; DOP 4K-64K; fine-grain data parallelism.

Intel iPSC/1: Tc = 5 ms; Ts = 500 μs; Tg = 10 ms; DOP 8-128; a medium-grain multicomputer.

MIT J-Machine: Tc = 2 μs; Ts = 1 μs; Tg = 5 μs; DOP 1K-64K; a fine-grain multicomputer.
The MDP Design  The MDP chip included a processor, a 4096-word by 36-bit memory, and a built-in router with network ports as shown in Fig. 9.19. An on-chip memory controller with error checking and correction (ECC) capability permitted local memory to be expanded to 1 million words by adding external DRAM chips. The processor was message-driven in the sense that it executed functions in response to messages, via the dispatch mechanism. No receive instruction was needed.
[Figure omitted: (a) the MDP as a component with a memory port, six two-way network ports, and a diagnostic port; (b) the MDP chip floor plan; (c) the internal blocks, including the microprocessor (prefetch, control, register file, ALU), the address arithmetic unit, the routers with network input and output interfaces, the external-memory interface, and the diagnostic interface.]

Fig. 9.19 The message-driven processor (MDP) architecture (Courtesy of W. Dally et al.; reprinted with permission from IEEE Micro, April 1992)
The MDP created a task to handle each arriving message. Messages carrying these tasks drove each computation. The MDP was a general-purpose multicomputer processing node that provided the communication, synchronization, and global naming mechanisms required to efficiently support fine-grain, concurrent programming models. The grain size was as small as 8-word objects or 20-instruction tasks. As we have seen, fine-grain programs typically execute from 10 to 100 instructions between communication and synchronization actions.

MDP chips provided inexpensive processing nodes with plentiful VLSI commodity parts to construct the Jellybean Machine (J-Machine) multicomputer. As shown in Fig. 9.19a, the MDP appeared as a component with a memory port, six two-way network ports, and a diagnostic port.

The memory port provided a direct interface to up to 1M words of ECC DRAM, consisting of 11 multiplexed address lines, a 12-bit data bus, and 3 control signals. Prototype J-Machines used three 1M × 4 static-column DRAMs to form a four-chip processing node with 262,144 words of memory. The DRAMs cycled three times to access a 36-bit data word and a fourth time to check or update the ECC check bits.

The network ports connected MDPs together in a three-dimensional mesh network. Each of the six ports corresponded to one of the six cardinal directions (+x, -x, +y, -y, +z, -z) and consisted of nine data and six control lines. Each port connected directly to the opposite port on an adjacent MDP.

The diagnostic port could issue supervisory commands and read and write MDP memory from a console processor (host). Using this port, a host could read or write at any location in the MDP's address space, as well as reset, interrupt, halt, or single-step the processor. The MDP chip floor plan is shown in Fig. 9.19b.
Figure 9.19c shows the components built inside the MDP chip. The chip included a conventional microprocessor with prefetch, control, register file and ALU (RALU), and memory blocks. The network communication subsystem comprised the routers and the network input and output interfaces. The address arithmetic unit (AAU) provided addressing functions. The MDP also included a DRAM interface, a control clock, and a diagnostic interface.
Instruction-Set Architecture  The MDP extended a conventional microprocessor instruction-set architecture with instructions to support parallel processing. The instruction set contained fixed-format, three-address instructions. Two 17-bit instructions fit into each 36-bit word, with 2 bits reserved for type checking.

Separate register sets were provided to support rapid switching among three execution levels: background, priority 0 (P0), and priority 1 (P1). The MDP executed at the background level while no message created a task, and initiated execution upon message arrival at the P0 or P1 level depending on the message priority.

The P1 level had higher priority than the P0 level. The register set at each priority level included four GPRs, four address registers, four ID registers, and one instruction pointer (IP). The ID registers were not used in the background register set.
Communication Support  The MDP provided hardware support for end-to-end message delivery including formatting, injection, delivery, buffer allocation, buffering, and task scheduling. An MDP transmitted a message using a series of SEND instructions, each of which injected one or two words into the network at either priority 0 or 1.

Consider the following MDP assembly code for sending a four-word message using three variants of the SEND instruction.
SEND Rtl,tl ; send net address (priority 0)
SEND2 R1,R2,U ; header and receiver [priority 0)
SENDZE R3-,[3 ,A3],0 ; selector and continuation end message [priority 0)
The first SEND instruction reads the absolute address of the destination node in <X, Y, Z> format from
R0 and forwards it to the network hardware. The SEND2 instruction reads the first two words of the message
out of registers R1 and R2 and enqueues them for transmission. The final instruction enqueues two additional
words of data, one from R3 and one from memory. The use of the SEND2E instruction marks the end of the
message and causes it to be transmitted into the network.
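As a rough illustration of these semantics, the following Python sketch (not MDP code) models how the three SEND variants accumulate words and inject the finished message into the network; the register contents and the queue model are invented purely for illustration.

    # Illustrative model of SEND / SEND2 / SEND2E message injection (assumptions,
    # not MDP hardware behavior).
    class SendUnit:
        def __init__(self):
            self.queue = []        # words enqueued for the message being built
            self.injected = []     # finished messages handed to the network

        def send(self, word):            # SEND: enqueue one word
            self.queue.append(word)

        def send2(self, w1, w2):         # SEND2: enqueue two words
            self.queue += [w1, w2]

        def send2e(self, w1, w2):        # SEND2E: two words, then end of message
            self.queue += [w1, w2]
            self.injected.append(self.queue)   # message is now transmitted
            self.queue = []

    regs = {"R0": (1, 5, 2), "R1": "header", "R2": "receiver", "R3": "selector"}
    mem_word = "continuation"            # stands in for the [3,A3] memory operand

    net = SendUnit()
    net.send(regs["R0"])                 # SEND   R0,0   (routing address <X,Y,Z>)
    net.send2(regs["R1"], regs["R2"])    # SEND2  R1,R2,0
    net.send2e(regs["R3"], mem_word)     # SEND2E R3,[3,A3],0
    print(net.injected)                  # one message: address word plus four data words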
Tl1e J-Machine was a three-dimensional mesh with two-way channels, dimension-order routing. and
blocking flow control (Fig. 9.20). The Faces of the network cube were open for use as l/D ports to the
machine. Each channel could sustain a data rate of 233 Mbps {million bits per second}. All three dimensions
could operate simultaneously for an aggregate data rate of 36-4 Mbps per node.
Fig. 9.20 E-cube routing from node (1, 5, 2) to node (5, 1, 3) on a 6-ary 3-cube
Message Format and Routing  The J-Machine used deterministic dimension-order E-cube routing. As
shown in Fig. 9.20, all messages routed first in the x-dimension, then in the y-dimension, and then in the
z-dimension. Since messages routed in dimension order and messages running in opposite directions along
the same dimension cannot block, resource cycles were thus avoided, making the network provably deadlock-
free.
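The following short Python sketch illustrates the dimension-order rule described above on a 3-D mesh; the coordinates used are example values, not J-Machine parameters.

    # Minimal sketch of dimension-order (E-cube) routing: correct x first, then y, then z.
    def dimension_order_route(src, dst):
        """Return the hop-by-hop path from src to dst, one dimension at a time."""
        path, cur = [src], list(src)
        for dim in range(3):                          # x, then y, then z
            step = 1 if dst[dim] > cur[dim] else -1
            while cur[dim] != dst[dim]:
                cur[dim] += step
                path.append(tuple(cur))
        return path

    print(dimension_order_route((1, 5, 2), (5, 1, 3)))
    # Because every message corrects x before y before z, two messages travelling in
    # opposite directions along a dimension can never wait on each other, so the
    # resource-dependence graph contains no cycles.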
The MDP supported a broad range of parallel programming models, including shared memory, data-
parallel, dataflow, actor, and explicit message passing, by providing a low-overhead primitive mechanism for
communication, synchronization, and naming.
Its communication mechanisms permitted a user-level task on one node to send a message to any other
node in a 4096-node machine in less than 2 μs. This process did not consume any processing resources on
intermediate nodes, and it automatically allocated buffer memory on the receiving node. On message arrival,
the receiving node created and dispatched a task in less than 1 μs.
Presence tags provided synchronization on all storage locations. Three separate register sets allowed fast
context switching. A translation mechanism maintained bindings between arbitrary names and values and
supported a global virtual address space. These mechanisms were selected to be general and amenable to
efficient hardware implementation. The J-Machine used wormhole routing and blocking flow control. A
combining-tree approach was used for synchronization.
The Router Design  The routers formed the switches in a J-Machine network and delivered messages
to their destinations. As shown in Fig. 9.21a, the MDP contained three independent routers, one for each
bidirectional dimension of the network.
Each router contained two separate virtual networks with different priorities that shared the same physical
channels. The priority-1 network could preempt the wires even if the priority-0 network was congested or
jammed. The priority levels supported multi-threaded operations.
Each of the router data paths contained buffers, comparators, and output arbitration (Fig. 9.21). On each
data path, a comparator compared the head flit, which contained the destination address in that dimension, to
the node coordinate. If the head flit did not match, the message continued in the current direction. Otherwise
the message was routed to the next dimension.
A message entering the dimension competed with messages continuing in the dimension at a two-to-
one switch. Once a message was granted this switch, all other input was locked out for the duration of the
message. Once the head flit of the message had set up the route, subsequent flits followed directly behind it.
Fig. 9.21 Priority control and dimension-order router design in the MDP chip (Courtesy of W. Dally et al.; reprinted with permission from IEEE Micro, April 1992)
Two priorities of messages shared the physical wires but used completely separate buffers and routing
logic. This allowed priority-1 messages to proceed through blockages at priority 0. Without this ability, the
system would not be able to redistribute data that caused hot spots in the network.
Synchronization  The MDP synchronized using message dispatch and presence tags on all storage. Because
each message arrival dispatched a process, messages could signal events on remote nodes. For example, in
the following combining-tree example, each COMBINE message signals its own arrival and initiates the
COMBINE routine.
In response to an arriving message, the processor may set presence tags for task synchronization. For
example, access to the value produced by the combining tree may be synchronized by initially tagging as
empty the location that will hold this value. An attempt to read this location before the combining tree has
written it will raise an exception and suspend the reading task until the root of the tree writes the value.
Example 9.4 Using a combining tree for synchronization of events (W. Dally et al., 1992)
A combining tree is shown in Fig. 9.22. This tree sums results produced by a distributed computation. Each
node sums the input values as they arrive and then passes a result message to its parent.
Fig. 9.22 A combining tree for internode communication or synchronization (Courtesy of W. Dally et al., 1992)
A pair of SEND instructions was used to send the COMBINE message to a node. Upon message arrival,
the MDP buffered the message and created a task to execute the following COMBINE routine written in
MDP assembly code:
COMBINE: MOVE [1,A3], COMB          ; get node pointer from message
         MOVE [2,A3], R1            ; get value from message
         ADD  R1, COMB.VALUE, R1
         MOVE R1, COMB.VALUE        ; store result
         MOVE COMB.COUNT, R2        ; get Count
         ADD  R2, -1, R2
         MOVE R2, COMB.COUNT        ; store decremented Count
         BNZ  R2, DONE
         MOVE HEADER, R0            ; get message header
         SEND2 COMB.PARENT_NODE, R0 ; send message to parent
         SEND2E COMB.PARENT, R1     ; with value
DONE:    SUSPEND
If the node was idle, execution of this routine began three cycles after message arrival. The routine loaded
the combining-node pointer and value from the message, performed the required add and decrement, and, if
Count reached zero, sent a message to its parent.
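A compact Python sketch of the same combining-tree logic follows; the tree shape and input values are invented for illustration, and message passing is modeled as a direct method call rather than a network message.

    # Illustrative combining tree: each node adds an arriving value to its partial
    # sum, decrements a count, and forwards the combined value to its parent only
    # when the count reaches zero (mirroring the COMBINE routine above).
    class CombineNode:
        def __init__(self, expected, parent=None):
            self.value, self.count, self.parent = 0, expected, parent

        def combine(self, v):                 # plays the role of a COMBINE message
            self.value += v
            self.count -= 1
            if self.count == 0 and self.parent is not None:
                self.parent.combine(self.value)   # "send message to parent"

    root = CombineNode(expected=2)
    left, right = CombineNode(2, root), CombineNode(2, root)
    for leaf_value in (3, 4):
        left.combine(leaf_value)
    for leaf_value in (5, 6):
        right.combine(leaf_value)
    print(root.value)   # 18: the sum produced by the distributed computation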
Research Issues  The J-Machine was an exploratory research project. Rather than being specialized for
a single model of computation, the MDP incorporated primitive mechanisms for efficient communication,
synchronization, and naming. The machine was used as a platform for software experiments in fine-grain
parallel programming.
Reducing the grain size of a program increases both the potential speedup due to parallel execution and
the potential overhead associated with parallelism. Special hardware mechanisms for reducing the overhead
due to communication, process switching, synchronization, and multi-threading were therefore central to
the design of the MDP. Software issues such as load balancing, scheduling, and locality also remained open
questions.
The MIT research group led by Dally implemented two languages on the J-Machine: the actor language
Concurrent Smalltalk and the dataflow language Id. The machine's mechanisms also supported dataflow
and object-oriented programming models using a global name space. The use of a few simple mechanisms
provided orders of magnitude lower communication and synchronization overhead than was possible with
multicomputers built from then available off-the-shelf microprocessors.
From Cosmic Cube to Mosaic C  The evolution from the Cosmic Cube to the Mosaic is an example of
one type of scaling track in which advances in technology are employed to reimplement nodes of a similar
logical complexity but which are faster and smaller, have lower power, and are less expensive. The progress
in microelectronics over the preceding decade was such that Mosaic nodes were about 60 times faster, used
about 20 times less power, were about 100 times smaller, and were (in constant dollars) about 25 times less
expensive to manufacture than Cosmic Cube nodes.
Fig. 9.23 The Caltech Mosaic architecture (Courtesy of C. Seitz, 1991)
Each Mosaic node included 64 Kbytes of memory, an 11-MIPS processor, a packet interface, and a
router. The nodes were tied together with a 60-Mbytes/s, two-dimensional routing-mesh network (Fig. 9.23).
The compilation-based programming system allowed fine-grain reactive-process message-passing programs
to be expressed in C+-, an extension of C++, and the run-time system performed automatic distributed
management of system resources.
Mosaic C Node  The Mosaic C multicomputer node was a single 9.25 mm × 10.00 mm chip fabricated in
a 1.2-μm-feature-size, two-level-metal CMOS process. At 5-V operation, the synchronous parts of the chip
operated with large margins at a 30-MHz clock rate, and the chip dissipated about 0.5 W.
The processor also included two program counters and two sets of general-purpose registers to allow
zero-time context switching between user programs and message handling. Thus, when the packet interface
received a complete packet, received the header of a packet, completed the sending of a packet, exhausted the
allocated space for receiving packets, or any of several other events that could be selected, it could interrupt
the processor by switching it instantly to the message-handling context.
Instead of several hundred instructions for handling a packet, the Mosaic typically required only about 10
instructions. The number of clock cycles for the message-handling routines could be reduced to insignificance
by placing them in hardware, but the Caltech group chose the more flexible software mechanism so that they
could experiment with different message-handling strategies.
Mosaic C 8 × 8 Mesh Boards  The choice of a two-dimensional mesh for the Mosaic was based on a 1989
engineering analysis; originally, a three-dimensional mesh network was planned. But the mutual fit of the
two-dimensional mesh network and the circuit board medium provided high packaging density and allowed
the high-speed signals between the routers to be conveyed on shorter wires.
Sixty-four Mosaic chips were packaged by tape-automated bonding (TAB) in an 8 × 8 array on a circuit
board. These boards allowed the construction of arbitrarily large, two-dimensional arrays of nodes using
stacking connectors. This style of packaging was meant to demonstrate some of the density, scaling, and
testing advantages of mesh-connected systems. Host-interface boards were also used to connect the Mosaic
arrays and workstations.
Applications and Future Trends  Charles Seitz determined that the most profitable niche and scaling
track for the multicomputer, a highly scalable and economical MIMD architecture, was the fine-grain
multicomputer. The Mosaic C demonstrated many of the advantages of this architecture, but the major part
of the Mosaic experiment was to explore the programmability and application span of this class of machine.
The Mosaic may be taken as the origin of two scaling tracks: (1) Single-chip nodes are a technologically
attractive point in the design space of multicomputers. Constant-node-size scaling results in single-chip
nodes of increasing memory size, processing capability, and communication bandwidth in larger systems
than centralized shared-memory multiprocessors. (2) It was also forecast that constant-node-complexity
scaling would allow a Mosaic 8 × 8 board to be implemented as a single chip, with about 20 times the
performance per node, within 10 years. In this context, see also the discussion in Chapter 13.
A 16K-node machine was constructed at Caltech to explore the programmability and application span
of the Mosaic C architecture for large-scale computing problems. For the loosely coupled computations in
which it excels, a multicomputer can be more economically implemented as a network of high-performance
workstations connected by a high-bandwidth local-area network. In fact, the Mosaic components and
programming tools were used by a USC Information Sciences Institute project (led by Danny Cohen, 1992) to
implement a 400-Mbit/s ATOMIC local-area network for this purpose.
The Prototype Architecture  A high-level organization of the Dash architecture was illustrated in Fig.
9.1 when we studied the various latency-hiding techniques. The Dash prototype is illustrated in Fig. 9.24.
It incorporated up to 64 MIPS R3000/R3010 microprocessors with 16 clusters of 4 PEs each. The cluster
hardware was modified from Silicon Graphics 4D/340 nodes with new directory and reply controller boards
as depicted in Fig. 9.24a.
The interconnection network among the 16 multiprocessor clusters was a pair of wormhole-routed mesh
networks. The channel width was 16 bits with a 50-ns fall-through time and a 35-ns cycle time. One mesh
network was used to request remote memory, and the other was a reply mesh as depicted in Fig. 9.24b, where
the small squares at mesh intersections are the 5 × 5 mesh routers.
The Dash designers claimed scalability for the Dash approach. Although the prototype was limited to
at most 16 clusters (a 4 × 4 mesh), due to the limited physical memory addressability (256 Mbytes) of the
4D/340 system, the system was scalable to support hundreds to thousands of processors.
To use the 4D/340 in the Dash, the Stanford team made minor modifications to the existing system boards
and designed a pair of new boards to support the directory memory and intercluster interface. The main
modification to the existing boards was to add a bus retry signal, to be used when a request required service
from a remote cluster.
The central bus arbiter was modified to accept a mask from the directory. The mask held off a processor's
retry until the remote request was serviced. This effectively created a split-transaction bus protocol for
requests requiring remote service.
The new directory controller boards contained the directory memory, the intercluster coherence state
machines and buffers, and a local section of the global interconnection network. The directory logic was
split between the two logic boards along the lines of the logic used for outbound and inbound portions of
intercluster transactions.
[Figure: (a) a node cluster built from a modified Silicon Graphics Power Station 4D/340 with four MIPS R3000 processors (33 MHz), a snoopy bus, and globally addressed memory; (b) two wormhole-routed 2D meshes (request and reply) with 120-MB/s links connecting the node clusters]
Fig. 9.24 The Stanford Dash prototype system (Courtesy of D. Lenoski et al., Proc. Int. Symp. Computer Architecture, Australia, May 1992)
The mesh networks supported a scalable local and global memory bandwidth. The single-address space
with coherent caches permitted incremental porting or tuning of applications, and exploited temporal and
spatial locality. Other factors contributing to improved performance included mechanisms for reducing and
tolerating latency, and well-designed I/O capabilities.
Dash Memory Hierarchy  Dash implemented an invalidation-based cache coherence protocol. A memory
location could be in one of three states: uncached-remote, shared-remote, or dirty-remote.
The directory kept the summary information for each memory block, specifying its state and the clusters
cacheing it. The Dash memory system could be logically broken into four levels of hierarchy, as illustrated
in Fig. 9.25c.
The first level was the processor cache, which was designed to match the processor speed and support
snooping from the bus. It took only one clock to access the processor cache. A request that could not be
serviced by the processor cache was sent to the local cluster. The prototype allowed 30 processor clocks to
access the local cluster. This level included the other processors' caches within the requesting processor's
cluster.
Otherwise, the request was sent to the home cluster level. The home level consisted of the cluster that
contained the directory and physical memory for a given memory address. It took 100 processor clocks to
access the directory at the home level. For many accesses (for instance, most private data references), the
local and home cluster were the same, and the hierarchy collapsed to three levels. In general, however, a
request would travel through the interconnection network to the home cluster.
The home cluster could usually satisfy the request immediately, but if the directory entry was in a dirty
state, or in a shared state when the requesting processor requested exclusive access, the fourth level had to
be accessed. The remote cluster level for a memory block consisted of the clusters marked by the directory
as holding a copy of the block. It took 135 processor clocks to access processor caches in remote clusters in
the prototype design.
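As a back-of-the-envelope illustration of what this hierarchy implies, the Python sketch below weights the four latencies quoted above by assumed hit fractions to estimate the average access time; the fractions are purely illustrative assumptions, not Dash measurements.

    # Weighted average memory access time across the four Dash hierarchy levels.
    latency = {"processor_cache": 1, "local_cluster": 30,
               "home_cluster": 100, "remote_cluster": 135}      # processor clocks
    hit_fraction = {"processor_cache": 0.90, "local_cluster": 0.05,
                    "home_cluster": 0.04, "remote_cluster": 0.01}  # assumed mix

    avg = sum(latency[level] * hit_fraction[level] for level in latency)
    print(f"average access time ~ {avg:.1f} processor clocks")     # ~ 7.8 clocks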
The Directory Protocol  The directory memory relieved the processor caches of snooping on memory
requests by keeping track of which caches held each memory block. In the home node, there was a directory
entry per block frame. Each entry contained one presence bit per processor cache. In addition, a state bit
indicated whether the block was uncached, shared in multiple caches, or held exclusively by one cache (i.e.
whether the block was dirty).
Using the state and presence bits, the memory could tell which caches needed to be invalidated when a
location was written. Likewise, the directory indicated whether the memory copy of the block was up-to-date
or which cache held the most recent copy.
By using the directory memory, a node writing a location could send point-to-point invalidation or update
messages to the processors actually cacheing that block. This is in contrast to the invalidating broadcast
required by the snoopy protocol. The scalability of the Dash depended on this ability to avoid broadcasts.
Another important attribute of a directory-based protocol is that it does not depend on any specific
interconnection network topology. As a result, the designer can readily use any of the low-latency scalable
networks, such as meshes or hypercubes, that were originally developed for message-passing machines.
Example 9.5 Cache coherence protocol using distributed directories in the Dash multiprocessor (Daniel Lenoski and John Hennessy et al., 1992)
Figure 9.25a illustrates the flow of a read request to remote memory with the directory in a dirty-remote
state. The read request is forwarded to the owning dirty cluster. The owning cluster sends out two messages
in response to the read. A message containing the data is sent directly to the requesting cluster, and a sharing
writeback request is sent to the home cluster. The sharing writeback request writes the cache block back to
memory and also updates the directory.
[Figure panels: (a) Read of dirty remote cache block; (b) Write to shared remote cache block]
Fig. 9.25 Two examples of a directory-based cache coherence protocol in the Dash (Courtesy of Lenoski and Hennessy, 1992)
This protocol reduces latency by permitting the dirty cluster to respond directly to the requesting cluster.
In addition, this forwarding strategy allows the directory controller to simultaneously process many requests
(i.e. to be multithreaded) without the added complexity of maintaining the state of outstanding requests.
Serialization is reduced to the time of a single intercluster bus transaction. The only resource held while
intercluster messages are being sent is a single entry in the originating cluster's remote-access cache.
Figure 9.25b shows the corresponding sequence for a write operation that requires remote service. The
invalidation-based protocol requires the processor (actually the write buffer) to acquire exclusive ownership
of the cache block before completing the store. Thus, if a write is made to a block that the processor does not
have cached, or only has cached in a shared state, the processor issues a read-exclusive request on the local
bus.
In this case, no other cache holds the block entry dirty in the local cluster, so a RdEx Request (message
1) is sent to the home cluster. As before, a remote-access cache entry is allocated in the local cluster. At the
home cluster, the pseudo-CPU issues the read-exclusive request to the bus. The directory indicates that the
line is in the shared state. This results in the directory controller sending a RdEx Reply (message 2a) to the
local cluster and invalidation requests (Inv-Req, message 2b) to the sharing cluster.
The home cluster owns the block, so it can immediately update the directory to the dirty state, indicating
that the local cluster now holds an exclusive copy of the memory line. The RdEx Reply message is received
in the local cluster by the reply controller, which can then satisfy the read-exclusive request.
To ensure consistency at release points, however, the remote-access cache entry is deallocated only when
it receives the number of invalidate acknowledgments (Inv-Ack, message 3) equal to an invalidation count
sent in the original reply message.
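A simplified Python sketch of the directory bookkeeping described in this example is given below; it tracks presence bits and a dirty flag per block and returns the invalidation count on a read-exclusive request. The class and cluster numbers are illustrative assumptions, and the actual Dash message traffic, caches, and pseudo-CPU are elided.

    # Toy directory entry: one presence bit per cluster plus a dirty indication.
    class DirectoryEntry:
        def __init__(self, n_clusters):
            self.presence = [False] * n_clusters   # which clusters cache the block
            self.dirty = False                     # held exclusively by one cluster

        def read(self, requester):
            if self.dirty:                 # forward to the owning dirty cluster, which
                self.dirty = False         # supplies the data and shares it back home
            self.presence[requester] = True

        def read_exclusive(self, requester):
            sharers = [c for c, p in enumerate(self.presence) if p and c != requester]
            self.presence = [False] * len(self.presence)
            self.presence[requester] = True
            self.dirty = True
            return len(sharers)            # invalidation count carried in the RdEx reply

    d = DirectoryEntry(n_clusters=16)
    d.read(3); d.read(7)                   # two clusters come to share the block
    print(d.read_exclusive(5))             # 2 Inv-Acks expected before the requesting
                                           # cluster's remote-access cache entry is freed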
The Dash prototype with 64 nodes was rather small in size. If each processor had a five-issue superscalar
operation with a 100-MHz clock, an extended machine with 2K nodes would have the potential to become a
system with 1 tera operations per second, with higher performance at higher clock rates.
This demands an integrated implementation with lower overhead in the scalable directory structure. A
three-dimensional torus network was considered with 16-bit data paths, a 20-ns fall-through delay, and a
4-ns cycle time. The access time ratio among the four levels of memory hierarchy was to be approximately
1 : 5 : 16 : 80 : 120, where 1 corresponds to one processor clock. The larger version of DASH was not implemented;
however, the concept of distributed directory-based cache coherence was validated.
[Figure: search engines and ALLCACHE routers/directories (ARD) connected by unidirectional slotted rings; each Ring:0 connects 8-32 nodes, and a higher-level search engine ring interconnects the Ring:0 groups]
Fig. 9.26 The KSR-1 architecture with a slotted ring for communication (Courtesy of Kendall Square Research Corporation, 1991)
[Figure: the two-level KSR ring hierarchy — a Ring:1 interconnecting multiple Ring:0 groups, with a local cache and local cache directory at each node; a request circulates the rings until a responding processor supplies the data]
Each node comprised a primary cache, acting as a 32-Mbyte primary memory, and a 64-bit superscalar
processor with roughly the same performance as an IBM RS/6000 operating at the same clock rate. The
superscalar processors containing 64 floating-point and 32 fixed-point registers of 64 bits were designed for
both scalar and vector operations.
For example, 16 elements could be prefetched at one time. A processor also had a 0.5-Mbyte subcache
supplying 20 million accesses per second to the processor (a computational efficiency of 0.5). A processor
operated at 20 MHz and was fabricated in 1.2-μm CMOS.
The processor, without caches, contained 3.9 million transistors on 15 types of 12 custom chips. Three-
quarters of each processor consisted of the search engine responsible for migrating data to and from other
nodes, for maintaining memory coherence throughout the system using distributed directories, and for ring
control.
The ALLCACHE Memory  The KSR-1 eliminated the memory hierarchy found in conventional computers
and the corresponding physical memory addressing overhead. Instead, it offered a single-level memory,
called ALLCACHE by KSR designers. This ALLCACHE design represented the confluence of cache and
shared virtual memory concepts that exploit locality required by scalable distributed computing. Each local
cache had a capacity of 32 Mbytes (2^25 bytes). The global virtual address space had 2^40 bytes.
Bell (1992) considered the KSR machine the most likely blueprint for future scalable MPP systems. This
was a revolutionary architecture and thus was more controversial when it was first introduced in 1991. The
architecture provided size (including I/O) and generation scalability in that every node was identical, and it
offered an efficient environment for both arbitrary workloads and sequential to parallel processing through a
large hardware-supported address space with an unlimited number of processors.
Programming Model  The KSR machine provided a strict sequentially consistent programming model
and dynamic management of memory through hardware migration and replication of data throughout the
distributed processor memory nodes using its ALLCACHE mechanism.
With sequential consistency, every processor returns the latest value of a written variable, and the results of
an execution on multiple processors appear as some interleaving of operations of individual nodes when executed
on a multithreaded machine. With ALLCACHE, an address became a name, and this name automatically
migrated throughout the system and was associated with a processor in a cache-like fashion as needed.
Copies of a given cell were made by the hardware and sent to other nodes to reduce access time. A
processor could prefetch data into a local cache and post-store data for other cells. The hardware was designed
to exploit spatial and temporal locality.
For example, in the SPMD programming model, copies of the program moved dynamically and were
cached in each of the operating nodes' primary and processor caches. Data such as elements of a matrix
moved to the nodes as required simply by accessing the data, and the processor had instructions to prefetch
data to the processor's registers. When a processor wrote to an address, all cells were updated and thus
memory coherence was maintained. Data movement occurred in subpages of 128 bytes of the 16K pages.
Environment and Performance  Every known form of parallelism was supported via the KSR's Mach-
based operating system. Multiple users could run multiple sessions comprising multiple applications or
multiple processes (each with an independent address space), each of which might consist of multiple threads
of control running and simultaneously sharing a common address space. Message passing was supported by
pointer passing in the shared memory to avoid data copying and enhance performance.
The KSR also provided a commercial programming environment for transaction processing that accessed
relational databases in parallel with unlimited scalability as an alternative to multicomputers formed from
multiprocessor mainframes. A 1K-node system provided almost two orders of magnitude more processing
power, primary memory, I/O bandwidth, and mass storage capacity than a multiprocessor mainframe available
at that time.
For example, unlike other contemporary candidates, a 1088-node system could be configured with
15.3 terabytes of disk memory, providing 500 times the capacity of its main memory. The 32- and 320-node
systems were designed to deliver over 1000 and 10,000 transactions per second, respectively, giving them
over 100 times the throughput of a multiprocessor mainframe available at the time.
With rapid advances in VLSI and interconnect technologies, the mid-1990s saw a major shakeout in
the supercomputer business. Kendall Square Research, the developers of the KSR-1 and its sequel KSR-2
systems, were forced to exit from the hardware business during that period. As in the case of other innovative
and pioneering attempts at the development of parallel computer architectures, knowledge gained from the
KSR development was also useful in the design and development of MPP computer systems of subsequent
generations. Our next case study on MPP systems will also bring out clearly this important point.
[Figure panels: (a) a 3D toroidal mesh (16 × 16 × 16) with X-, Y-, and Z-links; (b) a sparse 4 × 4 × 4 torus with X-links and Y-links missing on alternate Z-layers, respectively]
Fig. 9.28 The Tera multiprocessor and its three-dimensional sparse torus architecture shown with a 4 × 4 × 4 configuration (Courtesy of Tera Computer Company, 1991)
Second, it was important that the architecture be applicable to a wide spectrum of problems. Programs that
do not vectorize well, perhaps because of a preponderance of scalar operations or too frequent conditional
branches, will execute efficiently as long as there is sufficient parallelism to keep the processors busy.
Virtually any parallelism applicable in the total computational workload can be turned into speed, from
operation-level parallelism within program basic blocks to multiuser time and space sharing.
A third goal was ease of compiler implementation. Although the instruction set did have a few unusual
features, they did not pose unduly difficult problems for the code generator. There were no register or
memory addressing constraints and only three addressing modes. Condition code setting was consistent and
orthogonal.
Because the architecture permitted free exchange of spatial and temporal locality for parallelism, a highly
optimizing compiler could improve locality and trade the parallelism thereby saved for more speed. On the
other hand, if there was sufficient parallelism, the compiler could exploit it efficiently.
The Sparse Three-Dimensional Torus  The interconnection network was a three-dimensional sparsely
populated torus (Fig. 9.28b) of pipelined packet-switching nodes, each of which was linked to some of its
neighbors. Each link could transport a packet containing source and destination addresses, an operation, and
64 data bits in both directions simultaneously on every clock tick. Some of the nodes were also linked to
resources, i.e. processors, data memory units, I/O processors, and I/O cache units.
Instead of locating the processors on one side of the network and the memories on the other (a "dance hall"
configuration), the resources were distributed more-or-less uniformly throughout the network. This permitted
data to be placed in memory units near the appropriate processor when possible, and otherwise generally
maximized the distance between possibly interfering resources.
The interconnection network of one 256-processor Tera system contained 4096 nodes arranged in a 16 ×
16 × 16 toroidal mesh; i.e. the mesh "wrapped around" in all three dimensions. Of the 4096 nodes, 1280 were
attached to the resources, i.e. the 256 processors, 512 data memory units, 256 I/O processors, and 256 I/O
cache units. The 2816 remaining nodes did not have resources attached but still provided message bandwidth.
To increase node performance, some of the links were omitted. If the three directions are named x, y, and
z, then x-links and y-links were omitted on alternate z-layers (Fig. 9.28b). This reduces the node degree from
6 to 4, or from 7 to 5, counting the resource link. In spite of its missing links, the bandwidth of the network
was very large.
Any plane bisecting the network crossed at least 256 links, giving the network a data bisection bandwidth
of one 64-bit data word per processor per tick in each direction. This bandwidth was needed to support
shared-memory addressing in the event that all 256 processors addressed memory on the other side of some
bisecting plane simultaneously.
As the Tera architecture scaled to larger numbers of processors p, the number of network nodes grew as
p^(3/2) rather than as the p log p associated with the more commonly used multistage networks. To see this, we
first assume that memory latency is fully masked by parallelism only when the number of messages being
routed by the network is at least p × l, where l is the (round-trip) latency. Since messages occupy volume,
the network must have a volume proportional to p × l; since the speed of light is finite, the volume is also
proportional to l^3, and therefore l is proportional to p^(1/2) rather than log p.
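The algebra behind this claim can be written out in a few lines; the LaTeX fragment below is just a restatement of the argument above, not additional material from the Tera designers.

    % Volume must accommodate p*l in-flight messages, and volume grows as the
    % cube of the network's linear extent (hence of its latency l):
    \[
      \text{volume} \propto p\,l, \qquad \text{volume} \propto l^{3}
      \;\Longrightarrow\; l^{3} \propto p\,l
      \;\Longrightarrow\; l \propto p^{1/2}
      \;\Longrightarrow\; \text{nodes} \propto p\,l \propto p^{3/2}.
    \]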
Pipelined Support  Each processor in a Tera computer could execute multiple instruction streams (threads)
simultaneously. In the initial implementation, as few as 1 or as many as 128 program counters could be active
at once. On every tick of the clock, the processor logic selected a ready-to-execute thread and allowed it to
issue its next instruction. Since instruction interpretation was completely pipelined by the processor and
by the network and memories as well (Fig. 9.29), a new instruction from a different thread could be issued
during each tick without interfering with its predecessors.
When an instruction finished, the thread to which it belonged became ready to execute the next instruction.
As long as there were enough threads in the processor so that the average instruction latency was filled with
instructions from other threads, the processor was fully utilized. Thus, it was only necessary to have enough
threads to hide the expected latency (perhaps 70 ticks on average); once latency was hidden, the processor
would run at peak performance and additional threads would not speed the result.
If a thread were not allowed to issue its next instruction until the previous instruction completed, then
approximately 70 different threads would be required on each processor to hide the expected latency. The
lookahead described later allowed threads to issue multiple instructions in parallel, thereby reducing the
number of threads needed to achieve peak performance.
As seen in Fig. 9.29, three operations could be executed simultaneously per instruction per processor. The
M-pipeline was for memory-access operations, the A-pipeline for arithmetic operations, and the C-pipeline
for control or arithmetic operations. The instructions were 64 bits wide. If more than one operation in an
instruction specified the same register or setting of condition codes, the priority was M > A > C.
[Fig. 9.29 The Tera instruction pipeline: instruction fetch issuing to the M-, A-, and C-pipelines, with register write-back and the interconnection network/memory pipeline]
It was estimated that a peak speed of 1G operations per second could be achieved per processor if driven
by a 333-MHz clock. However, a particular thread would not exceed about 100M operations per second
because of interleaved execution. The processor pipeline was rather deep, about 70 ticks, as compared with
8 ticks in the earlier HEP pipeline.
Thread State and Management  Figure 9.30 shows that each thread had the following state associated
with it: one 64-bit stream status word (SSW), thirty-two 64-bit general-purpose registers (R0 through R31),
and eight 64-bit target registers (T0 through T7).
Fig. 9.30 The thread management scheme used in the Tera computer (Courtesy of Tera Computer Company, 1992)
Context switching was so rapid that the processor had no time to swap the processor-resident thread state.
Instead, it had 128 of everything, i.e. 128 SSWs, 4096 general-purpose registers, and 1024 target registers. It
is appropriate to compare these registers in both quantity and function to vector registers or words of caches
in other architectures. In all three cases, the objective is to improve locality and avoid reloading data.
Program addresses were 32 bits in length. Each thread's current program counter (PC) was located in
the lower half of its SSW. The upper half described various modes (e.g. floating-point rounding, lookahead
disable), the trap disable mask (e.g. data alignment, floating overflow), and the four most recently generated
condition codes.
Most operations had a TEST variant which emitted a condition code; and branch operations could
examine any subset of the last four condition codes emitted and branch appropriately. Also associated with
each thread were thirty-two 64-bit general-purpose registers. Register R0 was special in that it read as 0 and
output to it was discarded. Otherwise, all general-purpose registers were identical.
The target registers were used as branch targets. The format of the target registers was identical to that of
the SSW, though most control transfer operations used only the low 32 bits to determine a new PC. Separating
the determination of the branch target address from the decision to branch allowed the hardware to prefetch
instructions at the branch targets, thus avoiding delay when the branch decision was made. Using target
registers also made branch operations smaller, resulting in tighter loops. There were also skip operations
which obviated the need to set targets for short forward branches.
One target register (T0) pointed to the trap handler, which was nominally an unprivileged program. When
a trap occurred, the effect was as if a coroutine call to T0 had been executed. This made trap handling
extremely lightweight and independent of the operating system. Trap handlers could be changed by the user
to achieve specific trap capabilities and priorities without loss of efficiency.
Explicit-Dependence Lookahead  If there were enough threads executing on each processor to hide the
pipeline latency (about 70 ticks), then the machine would run at peak performance. However, if each thread
could execute some of its instructions in parallel (e.g. two successive loads), then fewer threads and parallel
activities would be required to achieve peak performance.
The obvious solution was to introduce instruction lookahead; the difficulty was that the traditional
register reservation approach requires far too much scoreboard bandwidth in this kind of architecture. Either
multithreading or horizontal instructions alone would preclude scoreboarding.
The Tera architecture used a new technique called explicit-dependence lookahead. Each instruction
contained a 3-bit lookahead field that explicitly specified how many instructions from this thread would be
issued before encountering an instruction that depended on the current one. Since seven was the maximum
possible lookahead value, at most 8 instructions and 24 operations could be concurrently executing from each
thread.
A thread was ready to issue a new instruction when all instructions with lookahead values referring to the
new instruction had completed. Thus, if each thread maintained a lookahead of seven, then nine threads were
needed to hide 72 ticks of latency.
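The arithmetic behind these thread counts is simple; the following Python snippet is just an illustrative check of the numbers quoted above, not Tera documentation.

    # With a lookahead of L, each thread keeps L+1 instructions in flight, so
    # hiding a latency of T ticks needs ceil(T / (L + 1)) threads.
    from math import ceil

    def threads_needed(latency_ticks, lookahead):
        return ceil(latency_ticks / (lookahead + 1))

    print(threads_needed(72, 7))   # 9 threads, as quoted in the text
    print(threads_needed(70, 0))   # about 70 threads with no lookahead at all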
Lookahead across one or more branch operations was handled by specifying the minimum of all distances
involved. The variant branch operations JUMP_OFTEN and JUMP_SELDOM, for high- and low-probability
branches, respectively, facilitated optimization by providing a barrier to lookahead along the less likely path.
There were also SKIP_OFTEN and SKIP_SELDOM operations. The overall approach was conceptually sim-
ilar to exposed-pipeline lookahead except that the quanta were instructions instead of ticks.
Advantages and Drawbacks  The Tera used multiple contexts to hide latency. The machine performed a
context switch every clock cycle. Both pipeline latency and memory latency were hidden in the HEP/Tera
approach. The major focus was on latency tolerance rather than latency reduction.
With 128 contexts per processor, a large number (2K) of registers must be shared finely between threads.
Thread creation must be very cheap (a few clock cycles). Tagged memory and registers with full/empty
bits were used for synchronization. As long as there was plenty of parallelism in user programs to hide
latency and plenty of compiler support, the performance was potentially very high.
However, these Tera advantages were embedded in a number of potential drawbacks. The performance
must be bad for limited parallelism, such as guaranteed low single-context performance. A large number of
contexts (threads) demanded lots of registers and other hardware resources, which in turn implied higher cost
and complexity. Finally, the limited focus on latency reduction and cacheing entailed lots of slack parallelism
to hide latency as well as lots of memory bandwidth; both required a higher cost for building the machine.
In the year 1996, the independent company Cray Research, Inc. founded by Seymour Cray merged with the
high-performance graphics workstation producer Silicon Graphics, Inc. (SGI); Cray Research then became a
business division of SGI. In the year 2000, Tera Computer Company, originators and developers of the Tera
MTA massively parallel system which we have studied in this section, took over Cray Research. The merged
company was named Cray, Inc., and it is in active operation today (see www.cray.com). Cray has continued
with the development of the MTA architecture, as we shall review in Chapter 13.
Dataflow Graphs  We have seen a dataflow graph in Fig. 2.13. Dataflow graphs can be used as a machine
language in dataflow computers. Another example of a dataflow graph (Fig. 9.31a) is given below.
[Figure panels: (a) a dataflow graph as a machine language — the cos x graph with constant divisors 2, 24, and 720; (b) the evolution of dataflow machine projects, from the MIT Tagged-Token Dataflow Architecture and the Manchester Dataflow machine, through the explicit token store machines (ETL Sigma-1, MIT/Motorola Monsoon, ETL EM-4), to the MIT/Motorola *T]
Fig. 9.31 An example dataflow graph and dataflow machine projects
Example 9.7 The dataflow graph for the calculation of cos x (Arvind, 1991)
This dataflow graph shows how to obtain an approximation of cos x by the following power series
computation:

cos x ≈ 1 − x²/2! + x⁴/4! − x⁶/6! = 1 − x²/2 + x⁴/24 − x⁶/720        (9.6)
The corresponding dataflow graph consists of nine operators (actors or nodes). The edges in the graph
interconnect the operator nodes. The successive powers of x are obtained by repeated multiplications. The
constants (divisors) are fed into the nodes directly. All intermediate results are forwarded among the nodes.
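An illustrative Python rendering of this nine-operator graph follows; each assignment plays the role of a node firing once its operands are available, with intermediate powers of x forwarded to the later nodes.

    # Nine "node firings" mirroring Eq. 9.6: three multiplies, three divides by
    # the constants 2, 24 and 720, and three add/subtract operations.
    import math

    def cos_approx(x):
        x2 = x * x          # multiply node: x^2
        x4 = x2 * x2        # multiply node: x^4
        x6 = x4 * x2        # multiply node: x^6
        t1 = x2 / 2         # divide nodes with constant divisors
        t2 = x4 / 24
        t3 = x6 / 720
        return 1 - t1 + t2 - t3   # final add/subtract nodes

    print(cos_approx(0.5), math.cos(0.5))   # 0.87758..., close to the true value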
Static versus Dynamic Dataflow  Static dataflow computers simply disallow more than one token to
reside on any one arc, which is enforced by the firing rule: A node is enabled as soon as tokens are present
on all input arcs and there is no token on any of its output arcs. Jack Dennis proposed the very first static
dataflow computer in 1974.
The static firing rule is difficult to implement in hardware. Special feedback acknowledge signals are
needed to secure the correct token passing between producing nodes and consuming nodes. Also, the static
rule makes it very inefficient to process arrays of data. The number of acknowledge signals can grow too fast
to be supported by hardware.
However, static dataflow inspired the development of dynamic dataflow computers, which were researched
vigorously at MIT and in Japan. In a dynamic architecture, each data token is tagged with a context descriptor,
called a tagged token. The firing rule of tagged-token dataflow is changed to: A node is enabled as soon as
tokens with identical tags are present at each of its input arcs.
With tagged tokens, tag matching becomes necessary. Special hardware mechanisms are needed to achieve
this. In the rest of this section, we discuss only dynamic dataflow computers. Arvind of MIT pioneered the
development of tagged-token architecture for dynamic dataflow computers.
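The tagged-token firing rule stated above can be mimicked in a few lines of Python; the node, tag, and arc names below are invented purely for illustration and stand in for the hardware tag-matching store.

    # A node fires only when tokens with identical tags are present on all inputs.
    from collections import defaultdict

    class Node:
        def __init__(self, n_inputs, op):
            self.n_inputs, self.op = n_inputs, op
            self.waiting = defaultdict(dict)          # tag -> {arc: value}

        def receive(self, tag, arc, value):
            self.waiting[tag][arc] = value
            if len(self.waiting[tag]) == self.n_inputs:   # all operands matched
                operands = self.waiting.pop(tag)
                return self.op(*(operands[a] for a in sorted(operands)))
            return None                                    # still waiting for a match

    add = Node(2, lambda a, b: a + b)
    print(add.receive(tag=("loop", 1), arc=0, value=10))   # None: one operand only
    print(add.receive(tag=("loop", 2), arc=0, value=99))   # None: different tag
    print(add.receive(tag=("loop", 1), arc=1, value=32))   # 42: tags matched, node fires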
Although data dependence does exist in dataflow graphs, it does not force unnecessary sequentialization,
and dataflow computers schedule instructions according to the availability of the operands. Conceptually,
"token"-carrying values flow along the edges of the graph. Values or tokens may be memory locations.
Each instruction waits for tokens on all inputs, consumes input tokens, computes output values based on
input values, and produces tokens on outputs. No further restriction on instruction ordering is imposed. No
side effects are produced by the execution of instructions in a dataflow computer. Both dataflow graphs and
machines implement only functional languages.
Pure Dataflow Machines  Figure 9.31b shows the evolution of dataflow computers. The MIT tagged-
token dataflow architecture (TTDA) (Arvind et al., 1983), the Manchester Dataflow Computer (Gurd and
Watson, 1982), and the ETL Sigma-1 (Hiraki and Shimada, 1987) were all pure dataflow computers. The
TTDA was simulated but never built. The Manchester machine was actually built and became operational in
mid-1982. It operated asynchronously using a separate clock for each processing element with a performance
comparable to that of the VAX/780.
The ETL Sigma-1 was developed at the Electrotechnical Laboratory, Tsukuba, Japan. It consisted of 128
PEs fully synchronous with a 10-MHz clock. It implemented the I-structure memory proposed by Arvind.
The full configuration became operational in 1987 and achieved a 170-Mflops performance. The major
problem in using the Sigma-1 was the lack of a high-level language for users.
Explicit Token Store Machines  These were successors to the pure dataflow machines. The basic idea is to
eliminate associative token matching. The waiting token memory is directly addressed, with the use of full/
empty bits. This idea was used in the MIT/Motorola Monsoon (Papadopoulos and Culler, 1988) and in the
ETL EM-4 system (Sakai et al., 1989).
Multithreading was supported in Monsoon using multiple register sets. Thread-based programming was
conceptually introduced in Monsoon. The maximum configuration built consisted of eight processors and
eight I-structure memory modules using an 8 × 8 crossbar network. It became operational in 1991.
EM-4 was an extension of the Sigma-1. It was designed for 1024 nodes, but only an 80-node prototype
became operational in 1990. The prototype achieved 815 MIPS in an 80 × 80 matrix multiplication benchmark.
We will study the details of EM-4 in Section 9.5.2.
Hybrid and Unified Architectures  These are architectures combining positive features from the von
Neumann and dataflow architectures. The best research examples include the MIT P-RISC (Nikhil and
Arvind, 1988), the IBM Empire (Iannucci et al., 1991), and the MIT/Motorola *T (Nikhil, Papadopoulos,
Arvind, and Greiner, 1991).
P-RISC was a "RISC-ified" dataflow architecture. It allowed tighter encodings of the dataflow graphs
and produced longer threads for better performance. This was achieved by splitting "complex" dataflow
instructions into separate "simple" component instructions that could be composed by the compiler. It
used traditional instruction sequencing. It performed all intraprocessor communication via memory and
implemented "joins" explicitly using memory locations.
P-RISC replaced some of the dataflow synchronization with conventional program counter-based
synchronization. IBM Empire was a von Neumann/dataflow hybrid architecture under development at IBM,
based on the thesis of Iannucci (1988). The *T was a later effort at MIT joining both the dataflow and von
Neumann ideas, to be discussed in Section 9.5.3.
The Node Architecture  The internal design of the processor chip and of the node memory are shown
in Fig. 9.32b. The processor chip communicated with the network through a 3 × 3 crossbar switch unit.
The processor and its memory were interfaced with a memory control unit. The memory was used to hold
programs (template segments) as well as tokens (operand segments, heaps, or frames) waiting to be fetched.
The processor consisted of six component units. The input buffer was used as a token store with a capacity
of 32 words. The fetch-match unit fetched tokens from the memory and performed tag-matching operations
among the tokens fetched in. Instructions were directly fetched from the memory through the memory
controller.
The heart of the processor was the execution unit, which fetched instructions until the end of a thread.
Instructions with matching tokens were executed. Instructions could emit tokens or write to registers.
Instructions were fetched continually using traditional sequencing (PC + 1 or branch) until a "stop" flag was
raised to indicate the end of a thread. Then another pair of tokens was accepted. Each instruction in a thread
specified the two sources for the next instruction in the thread.
[Figure: EM-4 nodes (EMC-R processor plus memory) interconnected by an Omega network; within each node, the memory holds programs (template segments) and tokens (operand segments, heaps/frames) with present bits, while the processor comprises an input buffer (token store/waiting queue), a fetch-match unit, an instruction fetch/control unit, an execution unit, and a 3 × 3 crossbar switching unit to the network]
Fig. 9.32 The ETL EM-4 dataflow architecture (Courtesy of S. Sakai, Y. Yamaguchi et al., Electrotechnical Laboratory, Tsukuba, Japan, 1991)
The same idea was used as in Monsoon for token matching, but with different encoding. All data tokens
were 32 bits, and instruction words were 33 bits. EM-4 supported remote loads and synchronizing loads. The
full/empty bits present in memory words were used to synchronize remote loads associated with different
threads.
[Figure panel (c): internal node architecture with a network interface unit, 64-MB node memory and memory controller, an MC 88110 data processor with a message coprocessor, and a synchronization coprocessor (sP)]
Fig. 9.33 The MIT/Motorola *T prototype multithreaded architecture (Courtesy of Nikhil, Papadopoulos, and Arvind, Proc. 19th Int. Symp. Computer Architecture, Australia, May 1992)
Research Experiments  The *T prototype was used to test the effectiveness of the unified architecture
in supporting multithreading operations. The development of *T was influenced by other multithreaded
architectures, including Tera, Alewife, and J-Machine.
The I-structure semantics was also implemented in *T. Full/empty bits were used on producer-
consumer variables. *T treated messages as virtual continuations. Thus busy-waiting was eliminated. Other
optimizations in *T included speculative avoidance of the extra loads and stores through multithreading and
coherent cacheing.
The *T designers wanted to provide a superset of the capabilities of Tera, J-Machine, and EM-4. Compiler
techniques developed for these machines were expected to be applicable to *T. To achieve these goals, a
promising approach was to start with declarative languages while the compiler could aim to extract a large
amount of fine-grain parallelism.
Multithreading in Perspective  The Dash, KSR-1, and Alewife leveraged existing processor technology.
The advantages of these directory-based cacheing systems include compatibility with existing hardware and
software. But they offer a less aggressive pursuit of parallelism and depend heavily on compilers to obtain
locality. The synchronizing loads are still problematic in these distributed cacheing solutions.
In von Neumann multithreading approaches, the HEP/Tera replicated the conventional instruction stream.
Synchronizing-load problems were solved by a hardware trap and software. Hybrid architectures, such as
Empire, replicated conventional instruction streams, but they did not preserve registers across threads. The
synchronizing loads were entirely supported in hardware. J-Machine supported three instruction streams
(priorities). It grew out of message-passing machines but added support for global addressing. Remote
synchronizing loads were supported by software convention.
In the dataflow approaches, the system-level view has stayed constant from the Tagged-Token Dataflow
Architecture to the *T. The various designs differ in internal node architecture, with trends toward the
removal of intra-node synchronization, using longer threads, high-speed registers, and compatibility with
existing machine codes. The *T designers claimed that the unification of dataflow and von Neumann ideas
would support a scalable shared-memory programming model using existing SIMD/SPMD codes.
Summary
Computer systems have always operated with processors having much faster cycle times than main
memories. With steady advances in VLSI technology over the years, both processors and main memories
have become faster, but the relative speed mismatch between them has in fact widened over the years.
Latency hiding techniques are therefore devised to allow processors to operate at high efficiency in spite of
having to access slower memories from time to time; use of cache memories is a common latency hiding
technique. In the context of Massively Parallel Processing (MPP) systems, other technical challenges also
confront system designers in minimizing the impact of memory access latencies.
In this chapter, we studied some basic latency hiding techniques applicable to such systems, namely:
shared virtual memory with some specific examples; prefetching techniques and their effectiveness; and
the use of distributed coherent caches. Scalable Coherent Interface (SCI) provides cache coherence with
distributed directories and sharing lists. We studied several relaxed memory consistency models which
can permit greater exploitation of parallelism in applications; the impact of relaxed consistency models
while running three specific applications was presented.
Principles of multi-threading were introduced, with specific attention paid to the technical factors
relevant to system design, namely: communication latency on remote access, number of threads, context-
switching overhead, and the interval between context switches. Multiple-context processors have been
designed to provide hardware support for single-cycle context switching. Possible context-switching
policies were studied, along with their impact on system efficiency. Multidimensional architectures were
reviewed as a possible platform for multi-threaded systems.
Fine-grain multicomputers are specially designed to provide efficient support for fine-grain
parallelism in applications. The MIT J-Machine was studied from the points of view of its overall
system design, its Message-Driven Processor (MDP) and instruction set architecture, and the message
format and routing employed in its 3-dimensional mesh. The design goal of the Caltech Mosaic C system
was to exploit the advances which had taken place in VLSI and packaging technologies; we studied
the basic node design with its two contexts (for user program and message handler), and the basic
8 × 8 mesh design employed in the system.
In the category of scalable multithreaded architectures, the Stanford Dash multiprocessor system
utilized directory-based cache coherence in a single address-space distributed memory system. The Kendall
Square Research KSR-1 system employed a cache-only memory design with a ring-based interconnect.
The Tera multiprocessor system relied for its performance on a large degree of multi-threading and
aggressive use of pipelining throughout the system, with a sparse 3-dimensional torus interconnect.
We also studied the basic concepts and evolution of dataflow and hybrid architectures, from the first
introduction of the concept in 1974 by Jack Dennis at MIT. Specific dataflow and hybrid systems studied in
this context were the ETL EM-4 system developed in Japan, and the MIT/Motorola *T prototype system.
(d) Repeat the above for a two-dimensional fast Fourier transform over N × N sample points on an n-processor OMP, where N = nk for some integer k ≥ 2. The idea of performing a two-dimensional FFT on an OMP is to perform a one-dimensional FFT along one dimension in a row-access mode. All n processors then synchronize, switch to a column-access mode, and perform another one-dimensional FFT along the second dimension. First try the case where N = 8, n = 4, and k = 2, and then work out the general case for large N >> n.

Problem 9.5 The following questions are related to shared virtual memory:
(a) Why has shared virtual memory (SVM) become a necessity in building a scalable system with memories physically distributed over a large number of processing nodes?
(b) What are the major differences in implementing SVM at the cache block level and the page level?

Problem 9.9 Why are hypercube networks (binary n-cube networks), which were very popular in first-generation multicomputers, being replaced by 2D or 3D meshes or tori in the second and third generations of multicomputers?

Problem 9.10 Answer the following questions on the SCI standard:
(a) Explain the sharing-list creation and update methods used in the IEEE Scalable Coherence Interface (SCI) standard.
(b) Comment on the advantages and disadvantages of chained directories for cache coherence control in large-scale multiprocessor systems.

Problem 9.11 Compare the four context-switching policies: switch on cache miss, switch on every load, switch on every instruction (cycle by cycle), and switch on block of instructions.
(a) What are the advantages and shortcomings of each policy?
(b) What additional research would be needed to make an optimal choice among these policies?