An Efficient Algorithm For Exploiting Multiple Arithmetic Units
R. M. Tomasulo
Introduction
After storage access time has been satisfactorily reduced through the use of buffering and overlap techniques, even after the instruction unit has been pipelined to operate at a rate approaching one instruction per cycle,¹ there remains the need to optimize the actual performance of arithmetic operations, especially floating-point. Two familiar problems confront the designer in his attempt to balance execution with issuing. First, individual operations are not fast enough* to allow simple serial execution. Second, it is difficult to achieve the fastest execution times in a universal execution unit. In other words, circuitry designed to do both multiply and add will do neither as fast as two units each limited to one kind of instruction.

* During the planning phase, floating-point multiply was taken to be six cycles, divide as eighteen cycles and add as two cycles. A subsequent paper² explains how times of 3, 12, and 2 were actually achieved. This permitted the use of only one, instead of two, multipliers and one adder, pipelined to start an add cycle.

The first step toward surmounting these obstacles has been presented,¹ i.e., the division of the execution function into two independent parts, a fixed-point execution area and a floating-point execution area. While this relieves the physical constraint and makes concurrent execution possible, there is another consideration. In order to secure a performance increase the program must contain an intimate mixture of fixed-point and floating-point instructions. Obviously, it is not always feasible for the programmer to arrange this and, indeed, many of the programs of greatest interest to the user consist almost wholly of floating-point instructions. The subject of this paper, then, is the method used to achieve concurrent execution of floating-point instructions in the IBM System/360 Model 91. Obviously, one begins with multiple execution units, in this case an adder and a multiplier/divider.²

It might appear that achieving the concurrent operation of these two units does not differ substantially from the attainment of fixed-floating overlap. However, in the latter case the architecture limits each of the instruction classes to its own set of accumulators and this guarantees independence.* In the former case there is only one set of accumulators, which implies program-specified sequences of dependent operations. Now it is no longer simply a matter of classifying each instruction as fixed-point or floating-point, a classification which is independent of previous instructions. Rather, it is a question of determining each instruction's relationship with all previous, incompleted instructions. Simply stated, the objective must be to preserve essential precedences while allowing the greatest possible overlap of independent operations.

* Such dependencies as exist are handled by the store-fetch sequencing of the storage bus and the condition code control described in the following paper.²

This objective is achieved in the Model 91 through a scheme called the common data bus (CDB). It makes possible maximum concurrency with minimal effort (usually none) by the programmer or, more importantly, by the compiler. At the same time, the hardware required is small and logically simple. The CDB can function with any number of accumulators and any number of execution units. In short, it provides a hardware algorithm for the automatic, efficient exploitation of multiple execution units.
[Figure 1: data registers and transfer paths. Labeled elements: storage bus, instruction unit, floating-point buffers (FLB), FLOS, floating-point registers (FLR), FLB bus, bus to storage, result bus.]
The next section of this paper will discuss the physical framework of registers, data paths and execution circuitry which is implied by the architecture and the overall CPU structure presented in a previous paper.¹ Within this framework one can subsequently discuss the problem of precedence, some possible solutions, and the selected solution, the CDB. In conclusion will be a summary of the results obtained.

Definitions and data paths

While the reader is assumed to be familiar with System/360 architecture and mnemonics, the terminology as modified by the context of the Model 91 organization will be reviewed here. The instruction unit, in preparing instructions for the floating-point operation stack (FLOS), maps both storage-to-register and register-to-register instructions into a pseudo-register-to-register format. In this format R1 is always one of the four floating-point registers (FLR) defined by the architecture. It is usually the sink of the instruction, i.e., it is the FLR whose contents are set equal to the result of the operation. Store operations are the sole exception,* wherein R1 specifies the source of the operand to be placed in storage. A word in storage is really the sink of a store. (R1 and R2 refer to fields as defined by System/360 architecture.)

* Compares do not, of course, alter the contents of R1.

In the pseudo-register-to-register format "seen" by the FLOS the R2 field can have three different meanings. It can be an FLR as in a normal register-to-register instruction. If the program contains a storage-to-register instruction, the R2 field designates the floating-point buffer (FLB) assigned by the instruction unit to receive the storage operand. Finally, R2 can designate a store data buffer (SDB) assigned by the instruction unit to store instructions. In the first two cases R2 is the source of an operand; in the last case it is a sink. Thus, the instruction unit maps all of storage into the 6 floating-point buffers and 3 store data buffers so that the FLOS sees only pseudo-register-to-register operations.

The distinction between source and sink will become quite important during the discussion of precedence and should be fixed firmly in mind. All of the instructions (except store and compare) have the following form:

    R1         op   R2                    →   R1
    Register        Register or buffer        Register
    (source)        (source)                  (sink)
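The pseudo-register-to-register format described above can be sketched as a small data type. This is only an illustration under stated assumptions: the names `R2Kind`, `PseudoOp`, `sink` and `r2_is_source` are hypothetical and do not come from the paper, and compares (which have no sink) are deliberately left out.

```python
from dataclasses import dataclass
from enum import Enum


class R2Kind(Enum):
    """The three meanings the R2 field can have as seen by the FLOS."""
    FLR = "floating-point register"   # register-to-register instruction
    FLB = "floating-point buffer"     # storage-to-register instruction
    SDB = "store data buffer"         # store instruction


@dataclass(frozen=True)
class PseudoOp:
    op: str            # mnemonic, e.g. "AD", "MD", "STD"
    r1: int            # always one of the four FLRs (0, 2, 4, 6)
    r2_kind: R2Kind    # which kind of element R2 designates
    r2: int            # register, buffer, or store-data-buffer number

    def __post_init__(self):
        if self.r1 not in (0, 2, 4, 6):
            raise ValueError("R1 must be one of the four floating-point registers")

    def sink(self):
        """Where the result goes: the SDB for a store, otherwise R1."""
        if self.r2_kind is R2Kind.SDB:
            return ("SDB", self.r2)
        return ("FLR", self.r1)

    def r2_is_source(self):
        """R2 is a source in the first two cases and a sink only for stores."""
        return self.r2_kind is not R2Kind.SDB


# Hypothetical usage: an add from buffer 1 and a store through store data buffer 2.
add_op = PseudoOp("AD", 0, R2Kind.FLB, 1)
store_op = PseudoOp("STD", 0, R2Kind.SDB, 2)
```

Modeling the R2 meaning as an explicit tag keeps the source/sink distinction, on which the later precedence discussion rests, visible in one place.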
Figure 2 Timing relationship between instruction unit and FLOS decode for the processing of one instruction.
For example, the instruction AD 0, 2 means "place the double-precision sum of registers 0 and 2 in register 0," i.e., R0 + R2 → R0. Note that R1 is really both a source and a sink.* Nevertheless, it will be called the sink and R2 the source in all subsequent discussion.

* This economy of specification compounds the difficulties of achieving concurrency while preserving precedence, as will be seen later.

This definition of operations and the machine organization taken together imply a set of data registers with transfer paths among them. These are shown in Fig. 1. The major sets of registers (FLR's, FLB's, FLOS and SDB's) have already been discussed, both above and in a preceding paper.¹ Two additional registers, one sink and one source, are shown feeding each execution circuit. Initially these registers were considered to be the internal working registers required by the execution circuits and put to multiple use in a way to be described below. Later, their function was generalized under the reservation station concept and they were dissociated from their "working" function.

In actually designing a machine the data paths evolve as the design progresses. Here, however, a complete, first-pass data path will be shown to facilitate discussion. To illustrate the operation let us consider, in turn, four kinds of instructions: load of a register from storage, storage-to-register arithmetic, register-to-register arithmetic, and store. Let us first see how each can be accomplished in vacuo; then what difficulties arise when each is embedded in the context of a program. For simplicity double-precision (64-bit operands) will be used throughout.

Figure 2 shows the timing relationship between the instruction unit's handling of an instruction and its processing by the FLOS decode. When the FLOS decodes a load, the buffer which will receive the operand has not yet been loaded from storage.† Rather than holding the decode until the operand arrives, the FLOS sets control bits associated with the buffer which cause its content to be transmitted to the adder when it "goes full." The adder receives control information which causes it to send data to floating-point register R1, when its source register is set full by the buffer.

† A FULL/EMPTY control bit indicates this. The bit is set FULL by the Main Storage Control Element and EMPTY when the buffer is used. LOAD uses the adder in order to minimize the buffer outgates and the FLR ingates.

If the instruction is a storage-to-register arithmetic function, the storage operand is handled as in load (control bits cause it to be forwarded to the proper unit) but the floating-point register, along with the operation, is sent by the decoder to the appropriate unit. After receiving the buffer the unit will execute the operation and send the result to register R1.

In register-to-register arithmetic instructions two floating-point registers are transmitted on successive cycles to the appropriate execution unit.

Stores are handled like storage-to-register arithmetic functions, except that the content of the floating-point register is sent to a store data buffer rather than to an execution unit.

Thus far, the handling of one instruction at a time has proven rather straightforward. Now consider the following "program":

Example 1

    LD  F0, FLB1    LOAD register F0 from buffer 1
    MD  F0, FLB2    MULTIPLY register F0 by buffer 2

The load can be handled as before, but what about the multiply? Certainly F0 and FLB2 cannot be sent to the multiplier as in the case of the isolated multiply, since FLB1 has not yet been set into F0.* This sequence illustrates the cardinal precedence principle: No floating-point register may participate in an operation if it is the sink of another, incompleted instruction. That is, a register cannot be used until its contents reflect the result of the most recent operation to use that register as its sink.

* Note that the program calls for the product of FLB1 and FLB2 to be placed in F0. This hints at the CDB concept.

The design presented thus far has not incorporated any mechanism for dealing with this situation. Three functions must be required of any such mechanism:

(1) It must recognize the existence of a dependency.
(2) It must cause the correct sequencing of the dependent instructions.
(3) It must distinguish between the given sequence and such sequences as

    LD  F0, FLB1
    MD  F2, FLB2

Here it must allow the independent MD to proceed regardless of the disposition of the LD.

The first two requirements are necessary to preserve the logical integrity of the program; the third is necessary to meet the performance goal. The next section will present several alternatives for accomplishing these objectives.

Figure 3 Timing for the instruction sequence required to perform the function A + B + C + D * E: (a) without reservation stations (31 cycles), (b) with reservation stations included in the register set (26 cycles).

Preservation of precedence

Perhaps the simplest scheme for preserving precedence is as follows. A "busy" bit is associated with each of the four floating-point registers. This bit is set when the FLOS decode issues an instruction designating the register as a sink; it is reset when the executing unit returns the result to the register. No instruction can be issued by the FLOS if the busy bit of its sink is on. If the source of a register-to-register instruction has its busy bit on, the FLOS sets
control bits associated with the source register. When a result is entered into the register, these control bits cause the register to be sent via the FLR bus to the unit waiting for it as a source.

This scheme easily meets the first two requirements. The third is met with the help of the programmer; he must use different registers to achieve overlap. For example, the expression A + B + C + D * E can be programmed as follows:

Example 2

    LD  F0, D     F0 = D
    LD  F2, C     F2 = C
    LD  F4, B     F4 = B
    MD  F0, E     F0 = D * E
    AD  F2, F0    F2 = C + D * E
    AD  F4, A     F4 = A + B
    AD  F2, F4    F2 = A + B + C + D * E

The busy bit scheme should allow the second add and the multiply to be executed simultaneously (really, in any order) since they use different sinks. Unfortunately, the timing chart of Fig. 3a shows not only that the expected overlap does not occur but also that many cycles are lost to transmission time. The overlap fails to materialize because the first add uses the result of the multiply, and the adder must wait for that result. Cycles are lost to control because so many of the instructions use the adder. The FLOS cannot decode an instruction unless a unit is available to execute it. When an assigned unit finishes execution, it takes one cycle to transmit the fact to the FLOS so that it can decode a waiting instruction. Similarly, when the FLOS is held up because of a busy sink register, it cannot begin to decode until the result has been entered into the register.

One solution that could be considered is the addition of one or more adders. If this were done and some programs timed, however, it would become apparent that the execution circuitry would be in use only a small part of the time. Most of the lost time would occur while the adder waited for operands which are the result of previous instructions. What is required is a device to collect operands (and control information) and then engage the execution circuitry when all conditions are satisfied. But this is precisely the function of the sink and source registers in Fig. 1. Therefore, the better solution is to associate more than one set of registers (control, sink, source) with each execution unit. Each such set is called a reservation station.* Now instruction issuing depends on the availability of the appropriate kind of reservation station. In the Model 91 there are three add and two multiply/divide reservation stations. For simplicity they are treated as if they were actual units. Thus, in the future, we will speak of Adder 1 (A1), Adder 2 (A2), etc., and M/D 1 and M/D 2.

* The fetch and store buffers can be considered as specialized, one-operand reservation stations. Previous systems, such as the IBM 7030, have in effect employed one "reservation station" ahead of each execution unit. The extension to several reservation stations adds to the effectiveness of the execution hardware.

Figure 3b shows the effect of the addition of reservation stations on the problem running time: five cycles have been eliminated. Note that the second AD now overlaps the MD and actually executes before the first AD. While the speed increase is gratifying and the busy bit method easy to implement, there remains a dependence on the programmer. Note that the expression could have been coded this way:

Example 3a

    LD  F0, E
    MD  F0, D
    AD  F0, C
    AD  F0, B
    AD  F0, A

Now overlap is impossible and the program will run six cycles longer despite having two fewer instructions. Suppose, however, that this program is part of a loop, as below:

Example 3b

    LOOP 1  LD   F0, Ei
            MD   F0, Di
            AD   F0, Ci
            AD   F0, Bi
            AD   F0, Ai
            STD  F0, Fi
            BXH  i, -1, 0, LOOP 1    (decrease i by 1, branch if i > 0)

    LOOP 2  LD   F0, Ei
            LD   F2, Ei + 1
            MD   F0, Di
            MD   F2, Di + 1
            AD   F0, Ci
            AD   F2, Ci + 1
            AD   F0, Bi
            AD   F2, Bi + 1
            AD   F0, Ai
            AD   F2, Ai + 1
            STD  F0, Fi
            STD  F2, Fi + 1
            BXH  i, -2, 0, LOOP 2

Iteration n + 1 of LOOP 1 will appear to the FLOS to depend on iteration n, since the instructions in both iterations have the same sink. But it is clear that the two iterations are, in fact, independent. This example illustrates a second way in which two instruction sequences can be independent. The first way, of course, is for the two strings to have different sink registers. The second way is for the second string to begin with a load. By its definition a
load launches a new, independent string because it instructs the computer to destroy the previous contents of the specified register. Unfortunately, the busy bit scheme does not recognize this possibility. If overlap is to be achieved with this scheme, the programmer must write LOOP 2. (This technique is called doubling or unravelling. It requires twice as much storage but it runs faster by enabling two iterations to be executed simultaneously.)

Attempts were made to improve the busy bit scheme so as to handle this case. The most tempting approach is the expansion of the bit into a counter. This would appear to allow more than one instruction with a given sink to be issued. As each is issued, the FLOS increments the counter; as each is executed the counter is decremented. However, major difficulty is caused by the fact that storage operands do not return in sequence. This can cause the result of instruction n + 1 to be placed in a register before that of n. When n completes, it erroneously destroys the register contents.

Some of the other proposals considered would, if implemented, have been of such logical complexity as to jeopardize the achievement of a fast cycle.

The Common Data Bus

The preceding sections were intended to portray the difficulties of achieving concurrency among floating-point instructions and to show some of the steps in the evolution of a design to overcome them. It is clear, in retrospect, that the previous algorithms failed for lack of a way to uniquely identify each instruction and to use this information to sequence execution and set results into the floating-point registers. As far as action by the FLOS is concerned, the only thing unique to a particular instruction is the unit which will execute it. This, then, must form the basis of the common data bus (CDB).

Figure 4 shows the data paths required for operation of the CDB.* When Fig. 4 is compared with Fig. 1 the following changes, in addition to the reservation stations, are evident: Another output port has been added to the buffers. This port has been combined with the results from the adder and multiplier/divider; the combination is the CDB. The CDB now goes not only to the registers but also to the sink and source registers of all reservation stations, including the store data buffers but excluding the floating-point buffers. This data path will enable loads to be executed without the adder and will make the result of any operation available to all units without first going through a floating-point register.

* The FLB and FLR busses are retained for performance reasons. Everything could be done by a slight extension of the CDB but time would be lost due to conflicts over the common facility.

Note that the CDB is fed by all units that can alter a register and that it feeds all units which can have a register as an operand. The control part of the CDB enumerates the units which feed the CDB. Thus the floating-point buffers 1 through 6 are assigned the numbers 1 through 6; the three adders (actually reservation stations) are numbered 10 through 12; the two multiplier/dividers are 8 and 9. Since there are eleven contributors to the CDB, a four-bit binary number suffices to enumerate them. This number is called a tag. A tag is associated with each of the four floating-point registers (in addition to the busy bit*), with both the source and sink registers of each of the five reservation stations and with each of the three Store Data Buffers. Thus a total of 17 four-bit tag registers has been added, as shown in Fig. 4.

* The busy bit is no longer necessary since its function can be performed by use of an unassigned tag number. However, it is convenient to retain it.

Tags also appear in another context. A tag is generated by the CDB priority controls to identify the unit whose result will next appear on the CDB. Its use will be made clear shortly.

Operation of this complex is as follows. In decoding each instruction the FLOS checks the busy bit of each of the specified floating-point registers. If that bit is zero, the content of the register(s) may be sent to the selected unit via the FLR bus, just as before. Upon issuing the instruction, which requires only that a unit be available to execute it, the FLOS not only sets the busy bit of the sink register but also sets its tag to the designation of the selected unit. The source register control bits remain unchanged. As an example, take the instruction AD F0, FLB1. After issuing this instruction to Adder 1 the control bits of F0 would be:

    BB    TAG
    1     1010 (A1)

So far the only change from previous methods is the setting of the tag. The significant difference occurs when the FLOS finds the busy bit on at decode time. Previously, this caused a suspension of decoding until the bit went off. Now the FLOS will issue the instruction and update the tag. In so doing it will not transmit the register contents to the selected unit but it will transmit the "old" tag. For example, suppose the previous AD was followed by a second AD. At the end of the decode of this second AD, F0's control bits would be:

    BB    TAG
    1     1011 (A2)

One cycle later the sink tag of the A2 reservation station would be 1010, i.e., the same as A1, the unit whose result will be required by A2.

Let us look ahead temporarily to the execution of the first AD. Some time after the start of execution but before the end,† A1 will request the CDB. Since the CDB is fed by many sources, its time-sharing is controlled by a central

† Since the required lead time is two cycles, the request is made at the start of execution for an add-type instruction.
Figure 4 Data registers and transfer paths, including CDB and reservation stations.
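The tag mechanism described in the section on the common data bus can be sketched in a few lines of Python. This is a minimal model under stated assumptions, not the Model 91 logic: the class and method names (`Operand`, `Station`, `Register`, `Machine`, `issue`, `broadcast`, `execute_ready`) are hypothetical, only storage-to-register AD and MD are modeled, a tag of `None` is used here to mean "operand present", and CDB priority and timing are ignored. The tag numbers (buffers 1 through 6, multiplier/dividers 8 and 9, adders 10 through 12) are the ones given in the text.

```python
from dataclasses import dataclass, field
from typing import Optional

ADDER_TAGS = (10, 11, 12)   # A1, A2, A3 (reservation stations), as numbered in the text
MD_TAGS = (8, 9)            # M/D 1 and M/D 2


@dataclass
class Operand:
    tag: Optional[int] = None      # unit whose result is awaited; None = value present (assumption)
    value: Optional[float] = None


@dataclass
class Station:
    tag: int
    op: Optional[str] = None       # None = station is free
    sink: Operand = field(default_factory=Operand)
    source: Operand = field(default_factory=Operand)

    def ready(self):
        return self.op is not None and self.sink.tag is None and self.source.tag is None


@dataclass
class Register:
    busy: bool = False
    tag: Optional[int] = None
    value: float = 0.0


class Machine:
    def __init__(self):
        self.flr = {n: Register() for n in (0, 2, 4, 6)}
        self.stations = [Station(t) for t in ADDER_TAGS + MD_TAGS]

    def issue(self, op, r1, flb):
        """Issue 'R1 op FLB -> R1'. A busy sink never stalls: its old tag is forwarded."""
        wanted = ADDER_TAGS if op == "AD" else MD_TAGS
        # Raises StopIteration if no station is free; real hardware would hold the decode.
        st = next(s for s in self.stations if s.tag in wanted and s.op is None)
        reg = self.flr[r1]
        st.op = op
        st.sink = Operand(tag=reg.tag) if reg.busy else Operand(value=reg.value)
        st.source = Operand(tag=flb)          # buffer n feeds the CDB under tag n
        reg.busy, reg.tag = True, st.tag      # the register now awaits this station
        return st.tag

    def broadcast(self, tag, value):
        """Put (tag, value) on the CDB: every matching waiter captures the value."""
        for st in self.stations:
            for operand in (st.sink, st.source):
                if operand.tag == tag:
                    operand.tag, operand.value = None, value
        for reg in self.flr.values():
            if reg.busy and reg.tag == tag:
                reg.busy, reg.tag, reg.value = False, None, value

    def execute_ready(self):
        """Run every station whose operands are both present and broadcast its result."""
        for st in self.stations:
            if st.ready():
                a, b = st.sink.value, st.source.value
                result = a + b if st.op == "AD" else a * b
                st.op = None
                self.broadcast(st.tag, result)


# The two successive adds from the text: AD F0, FLB1 followed by AD F0, FLB2.
m = Machine()
m.flr[0].value = 1.0
t1 = m.issue("AD", 0, flb=1)   # goes to A1; F0 becomes busy with tag 1010
t2 = m.issue("AD", 0, flb=2)   # goes to A2; F0 tag becomes 1011, A2 sink tag is 1010
m.broadcast(1, 2.0)            # buffer 1 arrives from storage
m.broadcast(2, 3.0)            # buffer 2 arrives from storage
m.execute_ready()              # A1 feeds A2 directly over the CDB; only A2's result sets F0
```

In this sketch the result of A1 never enters F0, because the register's tag was overwritten by the second issue; A2 picks it up from the bus instead. That is the property that lets dependent instructions issue without waiting for the busy bit to clear.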
Figure 6 Functional sequence for Example 6 (a) with busy bit controls only, (b) with CDB.

It might appear that the CDB adds one cycle to the execution time of each operation, but in fact it does not. In practice only 30 nsec of the 60-nsec CDB interval are required to perform all of the CDB functions. The remaining time could, in this case, be used by the execution unit to achieve a shorter effective cycle. For example, if an add requires 120 nsec, then add plus the CDB time required is 150 nsec. Therefore, as far as the add is concerned, the machine cycle could be 50 nsec. Besides, even without the CDB, a similar amount of time would be required to transmit results both to the floating-point registers and back as an input to the unit generating the result.

The following program, a typical partial differential equation inner loop, illustrates the possible performance increase.

    LOOP  MD   F0, Ai
          AD   F0, Bi
          LD   F2, Ci
          SDR  F2, F0
          MDR  F2, F6
          AD   F2, Ci
          STD  F2, Ci
          BXH  i, -1, 0, LOOP

Without the CDB one iteration of the loop would use 17 cycles, allowing 4 per MD, 3 per AD and nothing for LD or STD. With the CDB one iteration requires 11 cycles. For this kind of code the CDB improves performance by about one-third.

Conclusions

Two concepts of some significance to the design of high-performance computers have been presented. The first, reservation stations, is simply an expeditious method of buffering, in an environment where the transmission time between units is of consequence. Because of the disparity between storage access and circuit speeds and because of dependencies between successive operations, it is observed (given multiple execution units) that each unit spends much of its time waiting for operands. In effect, the reservation stations do the waiting for operands while the execution circuitry is free to be engaged by whichever reservation station fills first.

The second, and more important, innovation, the CDB, utilizes the reservation stations and a simple tagging scheme to preserve precedence while encouraging concurrency. In conjunction with the various kinds of buffering in the CPU, the CDB helps render the Model 91 less sensitive to programming. It should be evident, however, that the programmer still exercises substantial control over how much concurrency will occur. The two different programs for doing A + B + C + D * E illustrate this clearly.

Acknowledgments

The author wishes to acknowledge the contributions of Messrs. D. W. Anderson and D. M. Powers, who extended the original concept, and Mr. W. D. Silkman, who implemented all of the central control logic discussed in the paper.

References

1. D. W. Anderson, F. J. Sparacio and R. M. Tomasulo, "The System/360 Model 91: Machine Philosophy and Instruction Handling," IBM Journal 11, 8 (1967) (this issue).
2. S. F. Anderson, J. Earle, R. E. Goldschmidt and D. M. Powers, "The System/360 Model 91 Floating-point Execution Unit," IBM Journal 11, 34 (1967) (this issue).

Received September 16, 1965.