An Efficient Algorithm For Exploiting Multiple Arithmetic Units

This paper describes an algorithm called the Common Data Bus (CDB) that was used in IBM's System/360 Model 91 to efficiently exploit multiple floating-point execution units while preserving instruction dependencies. The CDB allows for maximum concurrency by tagging operands and registers to determine dependencies between instructions, and routing operands between execution units and registers via a shared data bus. This achieves better performance than either serial execution or requiring specially optimized code, instead letting the hardware optimize execution locally by looking ahead 8 instructions. The techniques described are broadly applicable to computers with multiple execution units and accumulators.


R. M. Tomasulo

Abstract: This paper describes the methods employed in the floating-point area of the System/360 Model 91 to exploit the existence of multiple execution units. Basic to these techniques is a simple common data busing and register tagging scheme which permits simultaneous execution of independent instructions while preserving the essential precedences inherent in the instruction stream. The common data bus improves performance by efficiently utilizing the execution units without requiring specially optimized code. Instead, the hardware, by "looking ahead" about eight instructions, automatically optimizes the program execution on a local basis.
The application of these techniques is not limited to floating-point arithmetic or System/360 architecture. They may be used in almost any computer having multiple execution units and one or more "accumulators." Both of the execution units, as well as the associated storage buffers, multiple accumulators and input/output buses, are extensively checked.

Introduction

After storage access time has been satisfactorily reduced through the use of buffering and overlap techniques, even after the instruction unit has been pipelined to operate at a rate approaching one instruction per cycle,¹ there remains the need to optimize the actual performance of arithmetic operations, especially floating-point. Two familiar problems confront the designer in his attempt to balance execution with issuing. First, individual operations are not fast enough* to allow simple serial execution. Second, it is difficult to achieve the fastest execution times in a universal execution unit. In other words, circuitry designed to do both multiply and add will do neither as fast as two units each limited to one kind of instruction.
* During the planning phase, floating-point multiply was taken to be six cycles, divide as eighteen cycles and add as two cycles. A subsequent paper explains how times of 3, 12, and 2 were actually achieved. This permitted the use of only one, instead of two, multipliers and one adder, pipelined to start an add cycle.

The first step toward surmounting these obstacles has been presented,¹ i.e., the division of the execution function into two independent parts, a fixed-point execution area and a floating-point execution area. While this relieves the physical constraint and makes concurrent execution possible, there is another consideration. In order to secure a performance increase the program must contain an intimate mixture of fixed-point and floating-point instructions. Obviously, it is not always feasible for the programmer to arrange this and, indeed, many of the programs of greatest interest to the user consist almost wholly of floating-point instructions. The subject of this paper, then, is the method used to achieve concurrent execution of floating-point instructions in the IBM System/360 Model 91. Obviously, one begins with multiple execution units, in this case an adder and a multiplier/divider.

It might appear that achieving the concurrent operation of these two units does not differ substantially from the attainment of fixed-floating overlap. However, in the latter case the architecture limits each of the instruction classes to its own set of accumulators and this guarantees independence.† In the former case there is only one set of accumulators, which implies program-specified sequences of dependent operations. Now it is no longer simply a matter of classifying each instruction as fixed-point or floating-point, a classification which is independent of previous instructions. Rather, it is a question of determining each instruction's relationship with all previous, incompleted instructions. Simply stated, the objective must be to preserve essential precedences while allowing the greatest possible overlap of independent operations.
† Such dependencies as exist are handled by the store-fetch sequencing of the storage bus and the condition code control described in the following paper.²

This objective is achieved in the Model 91 through a scheme called the common data bus (CDB). It makes possible maximum concurrency with minimal effort (usually none) by the programmer or, more importantly, by the compiler. At the same time, the hardware required is small and logically simple. The CDB can function with any number of accumulators and any number of execution units. In short, it provides a hardware algorithm for the automatic, efficient exploitation of multiple execution units.
Figure 1 Data registers and transfer paths without CDB.
[Block diagram not reproduced. Labeled elements: storage bus, instruction unit, floating-point buffers (FLB) 1 through 6, floating-point operand stack (FLOS), floating-point registers (FLR), floating-point buffer (FLB) bus, result bus, and the path to storage.]

The next section of this paper will discuss the physical framework of registers, data paths and execution circuitry which is implied by the architecture and the overall CPU structure presented in a previous paper.¹ Within this framework one can subsequently discuss the problem of precedence, some possible solutions, and the selected solution, the CDB. The paper concludes with a summary of the results obtained.

Definitions and data paths

While the reader is assumed to be familiar with System/360 architecture and mnemonics, the terminology as modified by the context of the Model 91 organization will be reviewed here. The instruction unit, in preparing instructions for the floating-point operation stack (FLOS), maps both storage-to-register and register-to-register instructions into a pseudo-register-to-register format. In this format R1 is always one of the four floating-point registers (FLR) defined by the architecture. It is usually the sink of the instruction, i.e., it is the FLR whose contents are set equal to the result of the operation. Store operations are the sole exception,* wherein R1 specifies the source of the operand to be placed in storage. A word in storage is really the sink of a store. (R1 and R2 refer to fields as defined by System/360 architecture.)
* Compares do not, of course, alter the contents of R1.

In the pseudo-register-to-register format "seen" by the FLOS the R2 field can have three different meanings. It can be an FLR as in a normal register-to-register instruction. If the program contains a storage-to-register instruction, the R2 field designates the floating-point buffer (FLB) assigned by the instruction unit to receive the storage operand. Finally, R2 can designate a store data buffer (SDB) assigned by the instruction unit to store instructions. In the first two cases R2 is the source of an operand; in the last case it is a sink. Thus, the instruction unit maps all of storage into the 6 floating-point buffers and 3 store data buffers so that the FLOS sees only pseudo-register-to-register operations.

The distinction between source and sink will become quite important during the discussion of precedence and should be fixed firmly in mind. All of the instructions (except store and compare) have the following form:

    R1          op    R2                    →    R1
    Register          Register or buffer         Register
    (source)          (source)                   (sink)

Figure 2 Timing relationship between instruction unit and FLOS decode for the processing of one instruction.
[Timing diagram not reproduced. It shows the instruction unit's decode and storage access cycles, transmission of the buffer and of the operation to the execution unit, the FLOS decode, and execution.]

For example, the instruction AD 0, 2 means "place the double-precision sum of registers 0 and 2 in register 0," i.e., R0 + R2 → R0. Note that R1 is really both a source and a sink.* Nevertheless, it will be called the sink and R2 the source in all subsequent discussion.
* This economy of specification compounds the difficulties of achieving concurrency while preserving precedence, as will be seen later.

This definition of operations and the machine organization taken together imply a set of data registers with transfer paths among them. These are shown in Fig. 1. The major sets of registers (FLR's, FLB's, FLOS and SDB's) have already been discussed, both above and in a preceding paper.¹ Two additional registers, one sink and one source, are shown feeding each execution circuit. Initially these registers were considered to be the internal working registers required by the execution circuits and put to multiple use in a way to be described below. Later, their function was generalized under the reservation station concept and they were dissociated from their "working" function.

In actually designing a machine the data paths evolve as the design progresses. Here, however, a complete, first-pass data path will be shown to facilitate discussion. To illustrate the operation let us consider, in turn, four kinds of instructions: load of a register from storage, storage-to-register arithmetic, register-to-register arithmetic, and store. Let us first see how each can be accomplished in vacuo; then what difficulties arise when each is embedded in the context of a program. For simplicity double-precision (64-bit operands) will be used throughout.

Figure 2 shows the timing relationship between the instruction unit's handling of an instruction and its processing by the FLOS decode. When the FLOS decodes a load, the buffer which will receive the operand has not yet been loaded from storage.† Rather than holding the decode until the operand arrives, the FLOS sets control bits associated with the buffer which cause its content to be transmitted to the adder when it "goes full." The adder receives control information which causes it to send data to floating-point register R1, when its source register is set full by the buffer.
† A FULL/EMPTY control bit indicates this. The bit is set FULL by the Main Storage Control Element and EMPTY when the buffer is used. LOAD uses the adder in order to minimize the buffer outgates and the FLR ingates.

If the instruction is a storage-to-register arithmetic function, the storage operand is handled as in load (control bits cause it to be forwarded to the proper unit) but the floating-point register, along with the operation, is sent by the decoder to the appropriate unit. After receiving the buffer the unit will execute the operation and send the result to register R1.

In register-to-register arithmetic instructions two floating-point registers are transmitted on successive cycles to the appropriate execution unit.

Stores are handled like storage-to-register arithmetic functions, except that the content of the floating-point register is sent to a store data buffer rather than to an execution unit.

Thus far, the handling of one instruction at a time has proven rather straightforward. Now consider the following "program":

Example 1
    LD  F0, FLB1    LOAD register F0 from buffer 1
    MD  F0, FLB2    MULTIPLY register F0 by buffer 2

The load can be handled as before, but what about the multiply? Certainly F0 and FLB2 cannot be sent to the multiplier as in the case of the isolated multiply, since FLB1 has not yet been set into F0.‡ This sequence illustrates the cardinal precedence principle: No floating-point register may participate in an operation if it is the sink of another, incompleted instruction. That is, a register cannot be used until its contents reflect the result of the most recent operation to use that register as its sink.
‡ Note that the program calls for the product of FLB1 and FLB2 to be placed in F0. This hints at the CDB concept.

The design presented thus far has not incorporated any mechanism for dealing with this situation. Three functions must be required of any such mechanism:
(1) It must recognize the existence of a dependency.
(2) It must cause the correct sequencing of the dependent instructions.
(3) It must distinguish between the given sequence and such sequences as

    LD  F0, FLB1
    MD  F2, FLB2

Here it must allow the independent MD to proceed regardless of the disposition of the LD.

The first two requirements are necessary to preserve the logical integrity of the program; the third is necessary to meet the performance goal. The next section will present several alternatives for accomplishing these objectives.

Figure 3 Timing for the instruction sequence required to perform the function A + B + C + D * E: (a) without reservation stations, (b) with reservation stations included in the register set.
[Timing chart not reproduced. Legend: D = decode, AG = address generate, X = transmission, plus storage access and execution intervals; alternate lines show FLOS activity. Panel (a) totals 31 cycles and marks a decode hold-up due to a busy sink register; panel (b) totals 26 cycles.]

Preservation of precedence

Perhaps the simplest scheme for preserving precedence is as follows. A "busy" bit is associated with each of the four floating-point registers. This bit is set when the FLOS decode issues an instruction designating the register as a sink; it is reset when the executing unit returns the result to the register. No instruction can be issued by the FLOS if the busy bit of its sink is on. If the source of a register-to-register instruction has its busy bit on, the FLOS sets
control bits associated with the source register. When a result is entered into the register, these control bits cause the register to be sent via the FLR bus to the unit waiting for it as a source.

This scheme easily meets the first two requirements. The third is met with the help of the programmer; he must use different registers to achieve overlap. For example, the expression A + B + C + D * E can be programmed as follows:

Example 2
    LD  F0, D     F0 = D
    LD  F2, C     F2 = C
    LD  F4, B     F4 = B
    MD  F0, E     F0 = D * E
    AD  F2, F0    F2 = C + D * E
    AD  F4, A     F4 = A + B
    AD  F2, F4    F2 = A + B + C + D * E

The busy bit scheme should allow the second add and the multiply to be executed simultaneously (really, in any order) since they use different sinks. Unfortunately, the timing chart of Fig. 3a shows not only that the expected overlap does not occur but also that many cycles are lost to transmission time. The overlap fails to materialize because the first add uses the result of the multiply, and the adder must wait for that result. Cycles are lost to control because so many of the instructions use the adder. The FLOS cannot decode an instruction unless a unit is available to execute it. When an assigned unit finishes execution, it takes one cycle to transmit the fact to the FLOS so that it can decode a waiting instruction. Similarly, when the FLOS is held up because of a busy sink register, it cannot begin to decode until the result has been entered into the register.

One solution that could be considered is the addition of one or more adders. If this were done and some programs timed, however, it would become apparent that the execution circuitry would be in use only a small part of the time. Most of the lost time would occur while the adder waited for operands which are the result of previous instructions. What is required is a device to collect operands (and control information) and then engage the execution circuitry when all conditions are satisfied. But this is precisely the function of the sink and source registers in Fig. 1. Therefore, the better solution is to associate more than one set of registers (control, sink, source) with each execution unit. Each such set is called a reservation station.* Now instruction issuing depends on the availability of the appropriate kind of reservation station. In the Model 91 there are three add and two multiply/divide reservation stations. For simplicity they are treated as if they were actual units. Thus, in the future, we will speak of Adder 1 (A1), Adder 2 (A2), etc., and M/D 1 and M/D 2.
* The fetch and store buffers can be considered as specialized, one-operand reservation stations. Previous systems, such as the IBM 7030, have in effect employed one "reservation station" ahead of each execution unit. The extension to several reservation stations adds to the effectiveness of the execution hardware.

Figure 3b shows the effect of the addition of reservation stations on the problem running time: five cycles have been eliminated. Note that the second AD now overlaps the MD and actually executes before the first AD. While the speed increase is gratifying and the busy bit method easy to implement, there remains a dependence on the programmer. Note that the expression could have been coded this way:

Example 3a
    LD  F0, E
    MD  F0, D
    AD  F0, C
    AD  F0, B
    AD  F0, A

Now overlap is impossible and the program will run six cycles longer despite having two fewer instructions. Suppose, however, that this program is part of a loop, as below:

Example 3b
    LOOP 1  LD   F0, Ei
            MD   F0, Di
            AD   F0, Ci
            AD   F0, Bi
            AD   F0, Ai
            STD  F0, Fi
            BXH  i, -1, 0, LOOP 1    (decrease i by 1, branch if i > 0)

    LOOP 2  LD   F0, Ei
            LD   F2, Ei + 1
            MD   F0, Di
            MD   F2, Di + 1
            AD   F0, Ci
            AD   F2, Ci + 1
            AD   F0, Bi
            AD   F2, Bi + 1
            AD   F0, Ai
            AD   F2, Ai + 1
            STD  F0, Fi
            STD  F2, Fi + 1
            BXH  i, -2, 0, LOOP 2

Iteration n + 1 of LOOP 1 will appear to the FLOS to depend on iteration n, since the instructions in both iterations have the same sink. But it is clear that the two iterations are, in fact, independent. This example illustrates a second way in which two instruction sequences can be independent. The first way, of course, is for the two strings to have different sink registers. The second way is for the second string to begin with a load. By its definition a
load launches a new, independent string because it instructs the computer to destroy the previous contents of the specified register. Unfortunately, the busy bit scheme does not recognize this possibility. If overlap is to be achieved with this scheme, the programmer must write LOOP 2. (This technique is called doubling or unravelling. It requires twice as much storage but it runs faster by enabling two iterations to be executed simultaneously.)

Attempts were made to improve the busy bit scheme so as to handle this case. The most tempting approach is the expansion of the bit into a counter. This would appear to allow more than one instruction with a given sink to be issued. As each is issued, the FLOS increments the counter; as each is executed the counter is decremented. However, major difficulty is caused by the fact that storage operands do not return in sequence. This can cause the result of instruction n + 1 to be placed in a register before that of n. When n completes, it erroneously destroys the register contents.

Some of the other proposals considered would, if implemented, have been of such logical complexity as to jeopardize the achievement of a fast cycle.

The Common Data Bus

The preceding sections were intended to portray the difficulties of achieving concurrency among floating-point instructions and to show some of the steps in the evolution of a design to overcome them. It is clear, in retrospect, that the previous algorithms failed for lack of a way to uniquely identify each instruction and to use this information to sequence execution and set results into the floating-point registers. As far as action by the FLOS is concerned, the only thing unique to a particular instruction is the unit which will execute it. This, then, must form the basis of the common data bus (CDB).

Figure 4 shows the data paths required for operation of the CDB.* When Fig. 4 is compared with Fig. 1 the following changes, in addition to the reservation stations, are evident: Another output port has been added to the buffers. This port has been combined with the results from the adder and multiplier/divider; the combination is the CDB. The CDB now goes not only to the registers but also to the sink and source registers of all reservation stations, including the store data buffers but excluding the floating-point buffers. This data path will enable loads to be executed without the adder and will make the result of any operation available to all units without first going through a floating-point register.
* The FLB and FLR busses are retained for performance reasons. Everything could be done by a slight extension of the CDB but time would be lost due to conflicts over the common facility.

Note that the CDB is fed by all units that can alter a register and that it feeds all units which can have a register as an operand. The control part of the CDB enumerates the units which feed the CDB. Thus the floating-point buffers 1 through 6 are assigned the numbers 1 through 6; the three adders (actually reservation stations) are numbered 10 through 12; the two multiplier/dividers are 8 and 9. Since there are eleven contributors to the CDB, a four-bit binary number suffices to enumerate them. This number is called a tag. A tag is associated with each of the four floating-point registers (in addition to the busy bit†), with both the source and sink registers of each of the five reservation stations and with each of the three Store Data Buffers. Thus a total of 17 four-bit tag registers has been added, as shown in Fig. 4.
† The busy bit is no longer necessary since its function can be performed by use of an unassigned tag number. However, it is convenient to retain it.

Tags also appear in another context. A tag is generated by the CDB priority controls to identify the unit whose result will next appear on the CDB. Its use will be made clear shortly.

Operation of this complex is as follows. In decoding each instruction the FLOS checks the busy bit of each of the specified floating-point registers. If that bit is zero, the content of the register(s) may be sent to the selected unit via the FLR bus, just as before. Upon issuing the instruction, which requires only that a unit be available to execute it, the FLOS not only sets the busy bit of the sink register but also sets its tag to the designation of the selected unit. The source register control bits remain unchanged. As an example, take the instruction AD F0, FLB1. After issuing this instruction to Adder 1 the control bits of F0 would be:

    BB    TAG
    1     1010 (A1)

So far the only change from previous methods is the setting of the tag. The significant difference occurs when the FLOS finds the busy bit on at decode time. Previously, this caused a suspension of decoding until the bit went off. Now the FLOS will issue the instruction and update the tag. In so doing it will not transmit the register contents to the selected unit but it will transmit the "old" tag. For example, suppose the previous AD was followed by a second AD. At the end of the decode of this second AD, F0's control bits would be:

    BB    TAG
    1     1011 (A2)

One cycle later the sink tag of the A2 reservation station would be 1010, i.e., the same as A1, the unit whose result will be required by A2.

Let us look ahead temporarily to the execution of the first AD. Some time after the start of execution but before the end,‡ A1 will request the CDB. Since the CDB is fed by many sources, its time-sharing is controlled by a central
‡ Since the required lead time is two cycles, the request is made at the start of execution for an add-type instruction.
priority circuit. If the CDB is free, the priority control signals the requesting adder, A1, to outgate its result and it broadcasts the tag of the requestor (1010 in this case) to all reservation stations. Each active reservation station (selected but awaiting a register operand) compares its sink and source tags to the CDB tag. If they match, the reservation station ingates the data from the CDB. In a similar manner, the CDB tag is compared with the tag of each busy floating-point register. All busy registers with matching tags ingate from the CDB and reset their busy bits.

Figure 4 Data registers and transfer paths, including CDB and reservation stations.
[Block diagram not reproduced. Labeled elements: storage bus, instruction unit, floating-point buffers, floating-point operand stack (FLOS), floating-point registers (FLR) with tags and control, store data buffers, and the multiply/divide unit.]

Two steps toward the goal of preserving precedence have been accomplished by the foregoing. First, the second AD cannot start until the first AD finishes because it cannot receive both its operands until the result of the first AD appears on the CDB. Secondly, the result of the first AD cannot change register F0 once the second AD is issued, since the tag in F0 will not match A1. These are precisely the desired effects.

Before proceeding with more detailed considerations let us recapitulate the essence of the method. The floating-point register tag identifies the last unit whose result is destined for the register. When an instruction is issued that requires a busy register the tag is sent to the selected unit in place of the register contents. The unit continuously compares this tag with that generated by the CDB priority control. When a match is detected, the unit ingates from the CDB. The unit begins executing as soon as it has both operands. It may receive one or both operands from either the CDB or the FLR bus; the source operand for storage-to-register instructions is transmitted via the FLB bus.

As each instruction is issued the existing tag(s) is (are) transmitted to the selected unit and then the sink tag is updated. By passing tags around in this fashion, all operations having the same sink are correctly sequenced while other operations are allowed to proceed independently. Finally, the floating-point register tag controls the changing of the register itself, thereby ensuring that only the most recent instruction will change the register. This has the interesting consequence that a loop of the following kind:

Example 5
    LOOP  LD   F0, Ai
          AD   F0, Bi
          STD  F0, Ci    STORE
          BXH  i, -1, 0, LOOP

may execute indefinitely without any change in the contents of F0. Under normal conditions only the final iteration will place its result in F0.

As mentioned previously, there are two ways of starting an independent instruction string. The first is to specify a different sink register and the second is to load a register. The CDB handles the former in essentially the same way as the busy bit scheme. The load, which had been a difficult problem previously, is now very simple. Regardless of the register tag or busy bit, a load turns the busy bit on and sets the tag equal to the floating-point buffer which the instruction unit had assigned to the load. This causes subsequent instructions to sequence on the buffer rather than on whatever unit may have identified the register as its sink prior to the load. The buffer controls are set to request the CDB when the storage operand arrives. The following example and Fig. 5 show this clearly.

Example 6
    LD   F0, FLB1
    DD   F0, FLB2    DIVIDE
    STD  F0, A
    LD   F0, FLB3
    AD   F0, FLB4

Note that the add finishes before the divide. The dashed line portion of Fig. 5 shows what would happen if the busy bit scheme alone were used. Figure 6 displays the sequences followed under the two schemes. This figure graphically illustrates the bottleneck caused by using a single sink register with a busy bit scheme. Because all data must pass through this register, the program is reduced to strictly sequential execution, steps 1 through 7. With the CDB, on the other hand, the sink register hardly appears and the program is broken into two independent, concurrent sequences. This facility of the CDB obviates the need for loop doubling.

Figure 5 Timing sequence for Example 6, showing effect of CDB.
[Timing chart not reproduced. FLOS decodes are not shown; CDB slots are marked; solid lines show the sequence with the CDB and dashed lines the sequence with the busy bit scheme only.]

The CDB makes it possible to execute some instructions in, effectively, no time at all. In the above example the store took place during the CDB cycle following the divide. In a similar fashion a register-to-register load of a busy register is accomplished by moving the tag of the source floating-point register to the tag of the sink floating-point register. For example, in the sequence

    AD   F0, FLB1
    LDR  F2, F0      move F0 to F2

the tag of F0 will be 1010 (A1) at the time the LDR is decoded. The decoder simply sets F2's tag to 1010. Now, when the result of the AD appears on the CDB both F0 and F2 will ingate since the CDB tag of 1010 will match the tag of each register. Thus, no unit or extra time was required for the execution of the LDR.

A number of details have been omitted from this discussion in order to clarify the concept, but really only two are of operational significance. First, every unit must request the CDB two cycles before it finishes execution. (These two cycles are required for propagation of the request to the CDB controls, the establishment of priority among competing units, and propagation of a "select" signal to the chosen unit.) This limits the execution time of any instruction to a two-cycle minimum. (Of course, the faster the execution the less the need for, or gain from, concurrency.) It also adds one* cycle to the access time for loads. Because of buffering and overlap, this does not usually cause an increase in problem running time.
* It does not add two cycles since storage gives one cycle prenotification of the arrival of data.

The second point is concerned with mixed precision. Because the architectural definition causes the low-order part of an FLR to be preserved during single-precision operation, an error can occur in the following kind of program:

    LD  F0, FLB1
    AD  F0, FLB2
    AE  F0, FLB3

Since only the last instruction, which is single-precision, will change F0, the low-order result of the double-precision AD will be lost. This is handled by associating a bit with each register to indicate whether a particular register is the sink of an outstanding single- or double-precision instruction. If this bit does not match the "length" of the instruction being decoded, the decode is suspended until the busy bit goes off. While this stratagem† solves the logic problem, it does so at the expense of performance. Unfortunately, no way has been found to avoid this. Note, however, that all-single- or all-double-precision programs run at the maximum possible speed. It is only the interface between single- and double-precision to the same sink register that suffers delay.
† Further complications arise from the fact that single-precision multiply produces a double-precision product. This is handled separately but with the same time penalty as above.
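The tag-passing rules described in this section can be exercised with a small Python model. It is a sketch under stated assumptions rather than a description of the Model 91 hardware: the class `TagMachine` and its method names are hypothetical, timing and CDB priority arbitration are ignored, the buffer source operand is assumed to be available when the instruction is issued, and tag 0 is used to mean "not waiting" (the text notes only that an unassigned tag number could serve in place of the busy bit). Unit numbers follow the text: buffers 1 through 6, multiplier/dividers 8 and 9, adders 10 through 12.

```python
class TagMachine:
    """Toy model of tag passing on the common data bus (CDB)."""

    def __init__(self, buffers):
        self.flb = dict(buffers)   # FLB number -> operand arriving from storage
        self.regs = {r: {"busy": False, "tag": 0, "value": 0.0}
                     for r in ("F0", "F2", "F4", "F6")}
        self.stations = {}         # unit tag -> reservation station state
        self.sdbs = []             # store data buffers still awaiting data
        self.storage = {}          # address -> value actually stored

    def load(self, reg, flb):
        # A load retags the register with the buffer number, whatever its state.
        self.regs[reg].update(busy=True, tag=flb)

    def issue(self, unit, op, reg, flb):
        # Storage-to-register arithmetic: sink is `reg`, source is buffer `flb`.
        r = self.regs[reg]
        self.stations[unit] = {
            "op": op,
            "src": self.flb[flb],                      # assumed already available
            "sink": None if r["busy"] else r["value"],
            "sink_tag": r["tag"] if r["busy"] else 0,  # the "old" tag; 0 = ready
        }
        r.update(busy=True, tag=unit)                  # register now awaits this unit

    def store(self, reg, addr):
        r = self.regs[reg]
        if r["busy"]:
            self.sdbs.append({"addr": addr, "tag": r["tag"]})
        else:
            self.storage[addr] = r["value"]

    def broadcast(self, tag, value):
        # One CDB cycle: every waiting holder of a matching tag ingates the value.
        for st in self.stations.values():
            if st["sink_tag"] == tag:
                st["sink"], st["sink_tag"] = value, 0
        for sdb in [s for s in self.sdbs if s["tag"] == tag]:
            self.storage[sdb["addr"]] = value
            self.sdbs.remove(sdb)
        for r in self.regs.values():
            if r["busy"] and r["tag"] == tag:
                r.update(busy=False, value=value)

    def operand_arrives(self, flb):
        # The buffer requests the CDB when its storage operand arrives.
        self.broadcast(flb, self.flb[flb])

    def execute(self, unit):
        st = self.stations[unit]
        if st["sink_tag"]:
            raise RuntimeError("sink operand has not yet appeared on the CDB")
        del self.stations[unit]
        result = st["sink"] + st["src"] if st["op"] == "add" else st["sink"] / st["src"]
        self.broadcast(unit, result)


def run_example6():
    """Example 6 with made-up operand values; FLB3 returns before FLB1."""
    m = TagMachine({1: 12.0, 2: 4.0, 3: 1.0, 4: 2.5})
    m.load("F0", 1)               # LD  F0, FLB1
    m.issue(8, "div", "F0", 2)    # DD  F0, FLB2  -> M/D 1 (tag 8)
    m.store("F0", "A")            # STD F0, A     -> waits on tag 8
    m.load("F0", 3)               # LD  F0, FLB3  -> F0 retagged to buffer 3
    m.issue(10, "add", "F0", 4)   # AD  F0, FLB4  -> Adder 1 (tag 10)
    m.operand_arrives(3)
    m.execute(10)                 # the add finishes before the divide
    m.operand_arrives(1)
    m.execute(8)                  # quotient goes to the store buffer only
    return m
```

Running `run_example6()` leaves the quotient in storage location A and the later sum in F0, even though the add completes first. In this model the divide result never enters F0, because by the time it is broadcast the register's tag no longer matches unit 8.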

R. M. TOMASULO
It might appear that the CDB adds one cycle to the execution time of each operation, but in fact it does not. In practice only 30 nsec of the 60-nsec CDB interval are required to perform all of the CDB functions. The remaining time could, in this case, be used by the execution unit to achieve a shorter effective cycle. For example, if an add requires 120 nsec, then add plus the CDB time required is 150 nsec. Therefore, as far as the add is concerned, the machine cycle could be 50 nsec. Besides, even without the CDB, a similar amount of time would be required to transmit results both to the floating-point registers and back as an input to the unit generating the result.
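The arithmetic behind the 50-nsec figure can be written out as a short check. The sketch assumes that the 120-nsec add occupies two nominal 60-nsec cycles and that the CDB transfer is counted as a third; the function and constant names are illustrative only.

```python
CYCLE_NS = 60      # nominal machine cycle (the CDB interval in the text)
CDB_USED_NS = 30   # portion of that interval the CDB functions actually need
ADD_NS = 120       # add execution time used in the text's example


def effective_cycle_ns(exec_ns: int, cdb_ns: int, cycle_ns: int) -> float:
    """Cycle length at which execution plus CDB time fills the nominal cycle count."""
    # Assumption: one nominal cycle is charged for the CDB transfer.
    nominal_cycles = exec_ns // cycle_ns + 1
    return (exec_ns + cdb_ns) / nominal_cycles


total_ns = ADD_NS + CDB_USED_NS                                   # add plus CDB time
add_cycle_ns = effective_cycle_ns(ADD_NS, CDB_USED_NS, CYCLE_NS)  # effective cycle for the add
```

Under these assumptions the add plus the CDB time comes to 150 nsec spread over three cycles, that is, 50 nsec per cycle.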
The following program, a typical partial differential equation inner loop, illustrates the possible performance increase.

    LOOP  MD   F0, Ai
          AD   F0, Bi
          LD   F2, Ci
          SDR  F2, F0
          MDR  F2, F6
          AD   F2, Ci
          STD  F2, Ci
          BXH  i, -1, 0, LOOP

Without the CDB one iteration of the loop would use 17 cycles, allowing 4 per MD, 3 per AD and nothing for LD or STD. With the CDB one iteration requires 11 cycles. For this kind of code the CDB improves performance by about one-third.

Figure 6 Functional sequence for Example 6 (a) with busy bit controls only, (b) with CDB.

Conclusions

Two concepts of some significance to the design of high-performance computers have been presented. The first, reservation stations, is simply an expeditious method of buffering, in an environment where the transmission time between units is of consequence. Because of the disparity between storage access and circuit speeds and because of dependencies between successive operations, it is observed (given multiple execution units) that each unit spends much of its time waiting for operands. In effect, the reservation stations do the waiting for operands while the execution circuitry is free to be engaged by whichever reservation station fills first.

The second, and more important, innovation, the CDB, utilizes the reservation stations and a simple tagging scheme to preserve precedence while encouraging concurrency. In conjunction with the various kinds of buffering in the CPU, the CDB helps render the Model 91 less sensitive to programming. It should be evident, however, that the programmer still exercises substantial control over how much concurrency will occur. The two different programs for doing A + B + C + D * E illustrate this clearly.

Acknowledgments

The author wishes to acknowledge the contributions of Messrs. D. W. Anderson and D. M. Powers, who extended the original concept, and Mr. W. D. Silkman, who implemented all of the central control logic discussed in the paper.

References

1. D. W. Anderson, F. J. Sparacio and R. M. Tomasulo, "The System/360 Model 91: Machine Philosophy and Instruction Handling," IBM Journal 11, 8 (1967) (this issue).
2. S. F. Anderson, J. Earle, R. E. Goldschmidt and D. M. Powers, "The System/360 Model 91 Floating-point Execution Unit," IBM Journal 11, 34 (1967) (this issue).

Received September 16, 1965.
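The cycle counts quoted for the partial differential equation inner loop can be tallied as follows. The sketch assumes that SDR is charged at the add rate and MDR at the multiply rate, that the second operation on F2 with Ci is an add, and that BXH is not charged. The 11-cycle figure with the CDB is taken from the text rather than derived; all names are illustrative.

```python
# Cycles charged per instruction without the CDB, following the text:
# 4 per multiply, 3 per add, nothing for loads and stores.
CYCLES = {"MD": 4, "MDR": 4, "AD": 3, "SDR": 3, "LD": 0, "STD": 0}

# Floating-point instructions of one loop iteration, in program order.
LOOP_BODY = ["MD", "AD", "LD", "SDR", "MDR", "AD", "STD"]

without_cdb = sum(CYCLES[m] for m in LOOP_BODY)   # strictly sequential execution
WITH_CDB = 11                                     # per-iteration figure quoted in the text

saving = (without_cdb - WITH_CDB) / without_cdb   # fraction of cycles saved
```

With these charges the sequential total is 4 + 3 + 3 + 4 + 3 = 17 cycles, and the saving of 6 cycles out of 17 is consistent with the "about one-third" stated in the text.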
