ACA - Pipelining
Suppose we want to perform the combined multiply and add operations with a
stream of numbers.
Ai * Bi + Ci for i = 1, 2, 3, …, 7
Each segment has one or two registers and a combinational circuit. R1 through R5
are registers that receive new data with every clock pulse. The sub-operations
performed in the three segments are:

R1 <- Ai, R2 <- Bi          input Ai and Bi
R3 <- R1 * R2, R4 <- Ci     multiply and input Ci
R5 <- R3 + R4               add Ci to the product

[Figure: three-segment pipeline — Ai and Bi enter R1 and R2 and feed the
multiplier, whose output is latched into R3 while Ci is latched into R4; the
adder places R3 + R4 into R5.]
Clock        Segment 1        Segment 2              Segment 3
pulse no.    R1      R2       R3          R4         R5
1            A1      B1       -           -          -
2            A2      B2       A1 * B1     C1         -
3            A3      B3       A2 * B2     C2         A1 * B1 + C1
4            A4      B4       A3 * B3     C3         A2 * B2 + C2
5            A5      B5       A4 * B4     C4         A3 * B3 + C3
6            A6      B6       A5 * B5     C5         A4 * B4 + C4
7            A7      B7       A6 * B6     C6         A5 * B5 + C5
8            -       -        A7 * B7     C7         A6 * B6 + C6
9            -       -        -           -          A7 * B7 + C7
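As a sketch (not part of the notes), the three-segment pipeline above can be simulated in Python. The register names R1–R5 follow the figure; since all registers are clocked simultaneously, every new register value is computed from the old values before any register is overwritten.

```python
def multiply_add_pipeline(A, B, C):
    """Return the list of results Ai*Bi + Ci, entering one new input per pulse."""
    n = len(A)
    R1 = R2 = R3 = R4 = R5 = None
    results = []
    for pulse in range(n + 2):                # k + n - 1 = 3 + n - 1 pulses
        # Segment 3: add the product to Ci.
        new_R5 = R3 + R4 if R3 is not None else None
        # Segment 2: multiply, and pass Ci along in R4.
        new_R3 = R1 * R2 if R1 is not None else None
        new_R4 = C[pulse - 1] if 1 <= pulse <= n else None
        # Segment 1: latch the next pair of inputs.
        R1, R2 = (A[pulse], B[pulse]) if pulse < n else (None, None)
        R3, R4, R5 = new_R3, new_R4, new_R5
        if R5 is not None:
            results.append(R5)
    return results

print(multiply_add_pipeline([1, 2, 3], [4, 5, 6], [7, 8, 9]))  # [11, 18, 27]
```

The first result appears after three pulses (the pipeline fill time); thereafter one result emerges on every pulse, as the table shows.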
A pipeline with k segments processing n tasks needs k + n - 1 clock cycles: here
k = 3 and n = 7, so 3 + 7 - 1 = 9 pulses, as the table shows. With a pipeline
cycle time tp, the speedup over a nonpipelined unit is

S = n * k * tp / ((k + n - 1) * tp) = n * k / (k + n - 1)

which approaches k as n grows large.
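The timing relation can be checked numerically; this sketch (not from the notes) just evaluates the speedup formula above:

```python
def pipeline_speedup(k, n):
    """Speedup of a k-segment pipeline over a nonpipelined unit for n tasks."""
    return (n * k) / (k + n - 1)

print(pipeline_speedup(3, 7))       # 21/9, about 2.33 for the multiply-add example
print(pipeline_speedup(3, 10_000))  # approaches k = 3 for large n
```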
PROBLEMS IN IMPLEMENTATION:
Synchronization: Each task in the pipeline should take the same time, so that a
job can flow smoothly through the pipeline without holdup.
Bubbles in Pipeline: If some tasks are absent in a job “BUBBLES” form in the
pipeline.
Fault Tolerance: The system does not tolerate faults. If one segment is out of
order, the entire pipeline suffers.
Inter-task Communication: The time taken to pass information from one segment
to another in the pipeline should be much smaller compared to the time taken to
perform a task.
Scalability: The number of segments working in the pipeline cannot be increased
beyond a certain limit; it depends on the number of independent tasks. The time
taken by each task should be equal, and the number of jobs should be much larger
than the number of tasks.
DATA PARALLELISM:
Advantages:
There is no synchronization problem.
The problem of bubbles is absent.
The method is more fault tolerant.
There is no inter-task communication delay.
Disadvantages:
The assignment of jobs is static.
The set of jobs must be partitionable into subsets of mutually independent
jobs, and each subset should take the same time to complete.
The time taken to divide a set of jobs into equal subsets should be small.
Further, the number of subsets should be small compared to the number of jobs.
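The static assignment described above can be sketched in a few lines (an illustration, not from the notes): the job list is split once into nearly equal, mutually independent subsets, one per processor, with no synchronization or inter-task communication afterwards.

```python
def partition(jobs, p):
    """Statically split jobs into p nearly equal subsets (round-robin)."""
    return [jobs[i::p] for i in range(p)]

jobs = list(range(10))
print(partition(jobs, 3))  # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```

Round-robin assignment is one simple choice; any split works as long as the subsets are independent and take roughly equal time.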
Four pipeline stages IF, ID, OF and EX are arranged into a linear cascade. An
instruction cycle consists of multiple pipeline cycles. A pipeline cycle can be set
equal to the delay of the slowest stage. The flow of data from stage to stage is
triggered by a common clock control. For the nonpipelined computer, it takes four
pipeline cycles to complete one instruction. Once a pipeline is filled up, an output
result is produced from the pipeline on each cycle. The instruction cycle is reduced
to one-fourth of the original cycle time.
[Figure: A pipelined processor — four stages S1–S4 (IF, ID, OF, EX) in a linear
cascade; the output is produced by stage S4.]
Space-Time diagram for a pipelined processor:

Stages   1    2    3    4    5    6    7    8    9    (pipeline cycles)
IF       I1   I2   I3   I4   I5   ...
ID            I1   I2   I3   I4   I5   ...
OF                 I1   I2   I3   I4   I5   ...
EX                      I1   I2   I3   I4   I5   ...  -> Output

Space-Time diagram for a nonpipelined processor:

Stages   1    2    3    4    5    6    7    8    9    10   11   12   13
IF       I1                  I2                  I3                  I4
ID            I1                  I2                  I3
OF                 I1                  I2                  I3
EX                      I1                  I2                  I3        -> Output
There are two areas of computer design where the pipeline organization is
applicable. An arithmetic pipeline divides an arithmetic operation into sub operations
for execution in the pipeline segments. An instruction pipeline operates on a stream
of instructions by overlapping the IF, ID, OF and EX phases of the instruction cycle.
ARITHMETIC PIPELINE:
Consider two floating-point numbers X = A x 2^a and Y = B x 2^b, where A and B
are mantissas and a and b are exponents.
The sub-operations performed in the four segments are:
1. Compare the exponents
2. Align the mantissa
3. Add or subtract the mantissa
4. Normalize the result
[Figure: Four-segment floating-point adder pipeline — exponents a, b and
mantissas A, B are latched into registers R at each stage. Segment 1 compares
the exponents, Segment 2 aligns the mantissas, Segment 3 adds or subtracts the
mantissas, and Segment 4 normalizes the result.]
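The four sub-operations can be sketched directly (an illustration, not from the notes; mantissas are assumed normalized to lie in [0.5, 1), and rounding is ignored):

```python
def fp_add(A, a, B, b):
    """Add X = A*2**a and Y = B*2**b through the four pipeline sub-operations."""
    # Segment 1: compare the exponents; make X the operand with the larger one.
    if a < b:
        A, a, B, b = B, b, A, a
    # Segment 2: align the smaller mantissa by shifting it right (a - b) places.
    B = B / 2 ** (a - b)
    # Segment 3: add the mantissas.
    M = A + B
    # Segment 4: normalize so the mantissa is back in [0.5, 1).
    while abs(M) >= 1:
        M, a = M / 2, a + 1
    while 0 < abs(M) < 0.5:
        M, a = M * 2, a - 1
    return M, a

M, e = fp_add(0.75, 3, 0.5, 2)   # X = 6, Y = 2
print(M, e, M * 2 ** e)          # 0.5 4 8.0
```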
Clock:  1    2    3    4    5    6    7
I1      IF   DE   EX   MEM  SR
I2           IF   DE   EX   MEM  SR
I3                IF   DE   EX   MEM  SR
SUPERPIPELINING: In practice some pipeline stages require less than one clock
interval; DE and SR require less time. So one clock cycle is allocated to
instruction fetch (IF), execution (EX) and data memory load/store (MEM),
whereas half a clock cycle is allocated to decode (DE) and store register (SR).
This method of pipelining is called superpipelining.
SUPERSCALAR PROCESSOR
Clock:  1    2    3    4    5    6    7
I1      IF   DE   EX   MEM  SR
I2      IF   DE   EX   MEM  SR
I3           IF   DE   EX   MEM  SR
I4           IF   DE   EX   MEM  SR
I5                IF   DE   EX   MEM  SR
I6                IF   DE   EX   MEM  SR

(Two instructions are issued in each clock cycle.)
It is possible to break up an instruction into a number of independent parts,
each taking nearly equal time to execute.
Instructions should be executed in the sequence in which they are written. If
there are many branch instructions, the pipeline is not effective.
Successive instructions are such that the work done by one instruction during
execution can be effectively used by the next and successive instructions.
Successive instructions are also independent of one another.
Sufficient resources are available in the processor so that, if a resource is
required by successive instructions in the pipeline, it is readily available.
Step:           1    2    3    4    5    6    7    8    9    10   11   12   13
Instruction 1   FI   DA   FO   EX
2                    FI   DA   FO   EX
3 (branch)                FI   DA   FO   EX
4                              FI   -    -    FI   DA   FO   EX
5                                   -    -    -    FI   DA   FO   EX
6                                                  FI   DA   FO   EX
7                                                       FI   DA   FO   EX
It is assumed that the processor has separate instruction and data memories so that
the operation in FI and FO can proceed at the same time. In the absence of a branch
instruction, each segment operates on different instructions.
STEP 4: Instruction 1 is being executed in segment EX,
The operand for instruction 2 is being fetched in segment FO,
Instruction 3 is being decoded in segment DA,
Instruction 4 is being fetched from memory in segment FI.
If the branch is taken, the instruction at the branch target is fetched in step
7. If the branch is not taken, the instruction fetched previously in step 4 can
be used.
Non-Ideal Conditions:
Available resources in a processor are limited.
Successive instructions are not independent of one another. The result
generated by an instruction may be required by the next instruction.
All programs have branches and loops. So execution of a program is not in
“Straight Line”. An ideal pipeline assumes a continuous flow of tasks.
Clock cycle:  1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16
I             FI  DE  EX  MEM SR
I+1               FI  DE  EX  EX  EX  MEM SR
I+2                   FI  DE  X   X   EX  MEM SR
I+3                       FI  DE  X   X   EX  MEM SR
I+4                           FI  DE  X   X   EX  EX  EX  MEM SR
I+5                               FI  DE  X   X   X   X   EX  MEM SR
I+6                                   FI  DE  X   X   X   X   EX  MEM SR
I+7                                       FI  DE  X   X   X   X   EX  MEM SR

(X — idle cycle)
DELAY IN PIPELINE WITH PIPELINE LOCKING: In some pipeline designs, whenever
work cannot continue in a particular cycle the pipeline is “LOCKED”: all the
pipeline stages except the one that is yet to finish are stalled. Though the
decode step (DE) of instruction I+3 could go on (as the decoding unit is free),
no work is done during the cycle because of the locking of the pipeline.
Locking ensures that successive instructions complete in the order they are
issued. Recent machines do not always lock the pipeline; they let instructions
continue if there are no resource constraints or other problems. In such a case
completion of instructions may be “OUT OF ORDER”, that is, a later instruction
may complete before an earlier one. If this is logically acceptable in a
program, the pipeline need not be locked.
Clock cycle:  1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16
I             FI  DE  EX  MEM SR
I+1               FI  DE  EX  EX  EX  MEM SR
I+2                   FI  DE  X   X   EX  MEM SR
I+3                       FI  X   X   DE  EX  MEM SR
I+4                           X   X   FI  DE  EX  EX  EX  MEM SR
I+5                               X   X   X   FI  DE  X   X   EX  MEM SR
I+6                                       FI  X   X   DE  EX  MEM SR
I+7                                           X   X   FI  DE  EX  MEM SR

(X — idle cycle)
DELAY DUE TO DATA DEPENDENCY: Pipeline execution is delayed due to
the fact that successive instructions are not independent of one another. The result
produced by an instruction may be needed by succeeding instructions and the
results may not be ready when needed.
Clock cycle:      1   2   3   4   5   6   7   8   9   10
ADD R1,R2,R3     FI  DE  EX  MEM SR
MUL R3,R4,R5         FI  DE  X   X   EX  MEM SR
SUB R7,R2,R6             FI  DE  EX  MEM SR
INC R3                       FI  DE  X   EX  MEM SR
The value of R3 will not be available till clock cycle 6. Thus the MUL
operation should not be executed in clock cycle 4, as the value of R3 it needs
will not yet have been stored in the register by the ADD operation. The EX step
of the MUL instruction is therefore delayed till clock cycle 6. This stall is
due to data dependency and is called a DATA HAZARD. The next instruction, SUB,
does not need R3, so it can proceed without waiting. The instruction INC R3
could be executed in clock period 6, as R3 is by then available in the
register; but the execution unit is busy in clock cycle 6, so INC has to wait
till clock cycle 7 to increment R3. This delay could be eliminated if the
computer had a separate unit to execute the INC operation. The SUB operation
completes before the previous (MUL) instruction. This is called OUT-OF-ORDER
COMPLETION. To ensure in-order completion the pipeline can be locked, but that
takes one more clock cycle.
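A RAW hazard like the ADD/MUL pair above can be detected mechanically. This sketch (not from the notes) assumes the convention used here, where the last operand is the destination (ADD R1,R2,R3 stores its result in R3), and only checks each instruction against its immediate predecessor:

```python
def raw_hazards(program):
    """Return (producer, consumer) pairs of adjacent instructions with a RAW hazard."""
    hazards = []
    for prev, cur in zip(program, program[1:]):
        dst = prev.replace(',', ' ').split()[-1]    # destination of prev
        srcs = cur.replace(',', ' ').split()[1:-1]  # source fields of cur
        if dst in srcs:
            hazards.append((prev, cur))
    return hazards

prog = ["ADD R1,R2,R3", "MUL R3,R4,R5", "SUB R7,R2,R6", "INC R3"]
print(raw_hazards(prog))  # [('ADD R1,R2,R3', 'MUL R3,R4,R5')]
```

The parsing is deliberately simplistic (e.g. single-operand INC is not fully analyzed); real hazard detection logic examines all in-flight instructions, not just adjacent pairs.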
Clock cycle:      1   2   3   4   5   6   7   8   9   10  11
ADD R1,R2,R3     FI  DE  EX  MEM SR
MUL R3,R4,R5         FI  DE  X   X   EX  MEM SR
SUB R7,R2,R6             FI  X   X   DE  EX  MEM SR
INC R3                       X   X   FI  DE  EX  MEM SR
Clock cycle:      1   2   3   4   5   6   7   8   9   10
I                FI  DE  EX  MEM SR
(I+1) branch         FI  DE  EX  MEM SR
I+2                      FI  X   X   FI  DE  EX  MEM SR
Consider instruction I+1 to be a branch instruction. This fact becomes known to
the hardware only at the end of the DE step; in the meanwhile the next
instruction would already have been fetched. Whether the fetched instruction
should be executed is known only at the end of the MEM cycle of the branch
instruction, so this instruction is delayed by two cycles. If the branch is not
taken, the fetched instruction can proceed to the DE step. If the branch is
taken, the instruction already fetched is discarded and the instruction from
the branch address is fetched, giving a delay of 3 cycles.
REDUCING PIPELINE DELAY [due to data dependency]:
ADD R1, R2, R3
MUL R3, R4, R5
SUB R3, R2, R6
Register Forwarding: This is a hardware technique to reduce pipeline delay.
The result of ADD R1, R2, R3 will be in the buffer register B3. Instead of
waiting till the SR cycle to store it in the register, a path is provided from
B3 to the ALU, bypassing the MEM and SR cycles. The forwarding is done while
the ADD instruction is still in the pipeline. Hardware must be there to detect
that the next instruction needs the output of the current instruction, and it
must also provide the facility to forward R3 to the SUB instruction.
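The effect of forwarding can be illustrated with a small latency model (a sketch, not from the notes; the cycle counts are chosen to match the tables above, where a dependent instruction one slot behind its producer stalls for two cycles without forwarding):

```python
# Cycles after the producer's EX before its result is readable:
LATENCY = {"no_forwarding": 3,  # must wait through MEM and SR
           "forwarding": 1}     # next cycle, straight from the ALU buffer

def stalls(dependent_distance, mode):
    """Stall cycles for a consumer issued `dependent_distance`
    instructions after the producer of a value it needs."""
    return max(0, LATENCY[mode] - dependent_distance)

print(stalls(1, "no_forwarding"))  # 2 stalls (the two X cycles in the tables)
print(stalls(1, "forwarding"))     # 0 stalls
```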
Software Scheduling:
LD R2, R3, Y       C(R2) <- C(Y + C(R3))
ADD R1, R2, R3     C(R3) <- C(R1) + C(R2)
MUL R4, R3, R1     C(R1) <- C(R4) * C(R3)
SUB R7, R8, R9     C(R9) <- C(R7) - C(R8)
Clock cycle:      1   2   3   4   5   6   7   8   9   10
LD R2,R3,Y       FI  DE  EX  MEM SR
ADD R1,R2,R3         FI  DE  X   X   EX  MEM SR
MUL R4,R3,R1             FI  DE  X   X   EX  MEM SR
SUB R7,R8,R9                 FI  DE  X   X   EX  MEM SR
Clock cycle:      1   2   3   4   5   6   7   8   9   10
LD R2,R3,Y       FI  DE  EX  MEM SR
SUB R7,R8,R9         FI  DE  EX  MEM SR
ADD R1,R2,R3             FI  DE  X   EX  MEM SR
MUL R4,R3,R1                 FI  DE  X   EX  MEM SR
In the first figure the LD instruction stores the value into R2 only at the end
of cycle 5, so the ADD instruction can perform its addition only at clock cycle
6. The MUL instruction is also delayed due to the non-availability of R3, and
in the next cycle the execution unit is busy. The SUB instruction, even though
not dependent on any of the previous instructions, is delayed due to the
non-availability of the execution unit. In the second figure the sequence of
instructions is reordered to reduce the pipeline delay; here the rescheduling
reduces the delay by one clock cycle.
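The compiler-style reordering above can be sketched as follows (an illustration, not from the notes). Dependence checking is deliberately conservative: any two instructions that share a register operand are treated as dependent, so only a fully independent instruction (SUB R7,R8,R9 here) is hoisted into the load's shadow.

```python
def regs(instr):
    """Set of operand tokens of a three-address instruction."""
    return set(instr.replace(',', ' ').split()[1:])

def independent(a, b):
    """Conservative: treat any shared operand as a dependency."""
    return not (regs(a) & regs(b))

def hoist_after_load(program):
    """Move the first instruction independent of everything before it
    into the slot right after the leading LD."""
    prog = list(program)
    for i in range(2, len(prog)):
        if all(independent(prog[j], prog[i]) for j in range(i)):
            prog.insert(1, prog.pop(i))   # place it in the load's shadow
            break
    return prog

prog = ["LD R2,R3,Y", "ADD R1,R2,R3", "MUL R4,R3,R1", "SUB R7,R8,R9"]
print(hoist_after_load(prog))
# ['LD R2,R3,Y', 'SUB R7,R8,R9', 'ADD R1,R2,R3', 'MUL R4,R3,R1']
```

This reproduces the rearranged schedule in the second table; a real compiler scheduler distinguishes source and destination operands rather than treating every shared register as a conflict.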
BRANCH TARGET BUFFER: A Branch Target Buffer (BTB) is used at
the Instruction Fetch step.
The left portion of the flowchart describes how the BTB is created and updated
dynamically, and the right portion shows how it is used.
is found during the DE step; this is possible in SMAC2P with added hardware. If
no hardware is added, then whether the predicted branch and the actual branch
are the same will be known only at the execution step of the pipeline. If they
do not match, the instruction fetched from the predicted target has to be
removed, and the next instruction in sequence should be taken for execution.
BTB cannot be very large. About 1000 entries are normally used in practice.
BRANCH TARGET BUFFER (BTB) CREATION AND USE

[Flowchart: At the FI step, an instruction is fetched and its address is looked
up in the BTB. If the address is found, the next instruction is fetched from
the predicted branch address; otherwise fetching continues in sequence. At the
DE step, the hardware checks whether the fetched instruction is a branch and,
if so, whether the predicted branch matches the actual branch; the BTB is
created and updated accordingly.]
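The flowchart's behaviour can be sketched with a simple lookup table (an illustration, not from the notes; integer addresses and a fall-through of pc + 1 are simplifying assumptions):

```python
class BTB:
    """Branch Target Buffer: branch address -> predicted target."""

    def __init__(self, capacity=1024):       # ~1000 entries, as noted above
        self.capacity = capacity
        self.table = {}

    def predict(self, pc):
        """FI step: predicted next fetch address for this PC."""
        return self.table.get(pc, pc + 1)    # miss: fetch in sequence

    def update(self, pc, taken, target):
        """DE/EX step: record the actual branch outcome once it resolves."""
        if taken and len(self.table) < self.capacity:
            self.table[pc] = target
        elif not taken:
            self.table.pop(pc, None)         # drop a stale prediction

btb = BTB()
print(btb.predict(100))                  # 101 (miss: sequential fetch)
btb.update(100, taken=True, target=40)
print(btb.predict(100))                  # 40 (hit: predicted target)
```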
SOFTWARE METHOD TO REDUCE DELAY DUE TO BRANCHES: The
primary idea is for the compiler to rearrange the statements of the program in such a
way that the statement following the branch statement (Delay slot) is always
executed once it is fetched without affecting the correctness of the program. This
may not always be possible but analysis of many programs shows that this
technique succeeds quite often.
In the rearranged program the branch statement JMP has been placed before ADD
R1, R2, R3. While the jump statement is being decoded, the ADD statement would
already have been fetched in the rearranged code, and it can be allowed to
complete without changing the meaning of the program. If no such statement is
found, then a No Operation (NOP) statement is used as a filler after the branch
so that when it is fetched it does not affect the meaning of the program.
ORIGINAL PROGRAM               REARRANGED PROGRAM
……………………                       ………………………
ADD R4, R5, R6                 ADD R4, R5, R6
ADD R1, R2, R3                 JMP X
JMP X                          ADD R1, R2, R3   <- delay slot
(DELAY SLOT)                   ……………….
……………….                        X ……………...
X ……………...
The compiler for a processor that uses delayed branches is designed to analyze the
instructions before and after the branch and rearrange the program sequence by
inserting useful instructions in the delayed steps. For example, the compiler can
determine that the program dependencies allow one or more instructions preceding
the branch to be moved into the delayed steps after the branch. These instructions
are then fetched from memory and executed through the pipeline while the branch
instruction is being executed in other segments. The effect is the same as if the
instructions were executed in their original order. It is up to the compiler to find useful
instructions to put after the branch instruction. Failing that, the compiler can insert
NO-OP instructions.
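The slot-filling step above can be sketched as follows (an illustration, not from the notes). The independence check is crude — it only refuses to move another branch; a real compiler also verifies that the moved instruction does not feed the branch condition.

```python
def fill_delay_slot(program, branch_index):
    """Move the instruction before the branch into its delay slot,
    or insert a NOP if that instruction cannot be moved."""
    prog = list(program)
    i = branch_index
    candidate = prog[i - 1]
    if not candidate.startswith(("JMP", "BR")):   # crude movability check
        prog.pop(i - 1)            # the branch slides up to position i - 1
        prog.insert(i, candidate)  # candidate lands in the delay slot
    else:
        prog.insert(i + 1, "NOP")  # no movable instruction: fill with NOP
    return prog

prog = ["ADD R4,R5,R6", "ADD R1,R2,R3", "JMP X"]
print(fill_delay_slot(prog, 2))
# ['ADD R4,R5,R6', 'JMP X', 'ADD R1,R2,R3']
```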
I — Instruction fetch
A — ALU operation
E — Execute instruction
Example:
LOAD FROM MEMORY TO R1
INCREMENT R2
ADD R3 TO R4
SUBTRACT R5 FROM R6
BRANCH TO ADDRESS X
Clock cycle:          1   2   3   4   5   6   7   8   9   10
1. Load              I   A   E
2. Increment             I   A   E
3. Add                       I   A   E
4. Subtract                      I   A   E
5. Branch to X                       I   A   E
6. No-operation                          I   A   E
7. No-operation                              I   A   E
8. Instruction in X                              I   A   E
Clock cycle:          1   2   3   4   5   6   7   8
1. Load              I   A   E
2. Increment             I   A   E
3. Branch to X               I   A   E
4. Add                           I   A   E
5. Subtract                          I   A   E
6. Instruction in X                      I   A   E
MEMORY INTERLEAVING: A pipeline often requires simultaneous access to
memory from two or more sources. An instruction pipeline may require the fetching
of an instruction and an operand at the same time from two different segments.
Similarly an arithmetic pipeline usually requires two or more operands to enter the
pipeline at the same time. Instead of using two memory buses for simultaneous
access, the memory can be partitioned into a number of modules connected to a
common memory address and data buses. A memory module is a memory array
together with its own address (AR) and data registers (DR). The address registers
receive information from a common address bus and the data registers communicate
with a bi-directional data bus. The modular system permits one module to initiate a
memory access while other modules are in the process of reading or writing a word
and each module can accept a memory request independent of the state of the other
modules.
[Figure: Four memory modules connected to a common address bus and a common
bidirectional data bus; each module has its own address register (AR) and data
register (DR).]
Advantage: It allows the use of a technique called interleaving. In an interleaved
memory, different sets of addresses are assigned to different memory modules. In a
two-module memory system, the even addresses may be in one module and the odd
addresses may be in another module.
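The two-module scheme above is low-order interleaving; a sketch (not from the notes) of the address mapping for m modules:

```python
def module_of(addr, m):
    """Low-order interleaving: (module number, word within that module)."""
    return addr % m, addr // m

for addr in range(6):
    print(addr, module_of(addr, 2))
# even addresses -> module 0, odd addresses -> module 1
```

Because consecutive addresses fall in different modules, a stream of sequential accesses keeps all modules busy at once.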
DIFFICULTIES IN PIPELINING: A major difficulty arises from the interruption of
the normal flow of a program by events such as an illegal instruction code,
page faults, etc. These are called Exception Conditions. A memory protection
violation, for example, can occur at the Instruction Fetch (IF) stage or the
Load/Store (MEM) step of the pipeline.
An important problem faced by the architect is to be able to attend to
exception conditions and resume computation in an orderly fashion whenever
feasible. The problem of restarting computation is complicated by the fact that
several instructions will be in various stages of completion in the pipeline. A
pipeline is said to have PRECISE EXCEPTIONS if, when an exception condition is
detected, processing can be stopped in such a way that all instructions which
occur before the one causing the exception are completed, and all instructions
which were in progress at the instant the exception occurred can be restarted
from the beginning (after attending to the exception).
Clock cycle:         1   2   3   4   5   6   7   8   9   10
I                   FI  DE  EX  MEM SR
I+1                     FI  DE  EX  MEM SR
I+2                         FI  DE  EX  MEM SR
I+3                             FI  DE  EX  MEM SR
I+4                                 FI  DE  EX  MEM SR
I+5 (trap instr.)                       FI  DE  EX  MEM SR
In this example, suppose an exception occurred during the EX step of
instruction I+2. If the system supports precise exceptions, then instructions I
and I+1 must be completed, while instructions I+2, I+3 and I+4 should be
stopped and resumed from scratch after attending to the exception. All the
actions carried out by I+2, I+3 and I+4 should be cancelled.
1. As soon as the exception is detected, turn off write operations for the
current and all subsequent instructions in the pipeline (instructions I+2, I+3
and I+4).
2. A trap instruction is fetched as the next instruction in the pipeline
(instruction I+5).
3. This instruction invokes the OS, which saves the address of the faulting
instruction to enable resumption of the program later, after attending to the
exception.
WAR, WAW HAZARDS:

Cycle:  1   2   3   4   5   6   7   8   9
I1     FI  DE  EXF EXF MEM SR
I2         FI  DE  X   X   EXI MEM SR
I3             FI  DE  EXI MEM SR
I4                 FI  DE  X   EXI MEM SR

(EXI — integer execute, EXF — floating-point execute, X — idle cycle)
Assuming hardware register forwarding, I2 can execute in cycle 5. I3 does not
apparently depend on previous instructions and could complete in cycle 6 if
allowed to proceed. This would, however, lead to an incorrect result in R3 when
the preceding instruction I2 completes, as I2 would take the value of R2
computed by I3 instead of the earlier value of R2. This mistake is due to
ANTI-DEPENDENCY between instructions I2 and I3: a value computed by a later
instruction is used by a previous instruction. This is also known as a Write
After Read (WAR) hazard and is not allowed. Such problems arise when we allow
out-of-order completion of instructions.
Cycle:  1   2   3   4   5   6   7   8   9   10  11  12
I1     FI  DE  EXF EXF MEM SR
I2         FI  DE  X   X   EXI MEM SR
I3             FI  DE  X   X   EXI MEM SR
I4                 FI  DE  X   EXI MEM SR
I5                     FI  DE  EXF EXF MEM SR
I6                         FI  DE  X   X   EXI MEM SR
I7                             FI  DE  X   X   EXI MEM SR
I8                                 FI  DE  X   EXF EXF MEM SR

(X — idle cycle)
We can delay the execution of I3 till I2 completes. The execution can be carried out
in cycle 5 simultaneously, if I2 reads R2 before I3 updates it. However we wait for
the completion of execution step of I2 before commencing the execution of I3. I4 has
to wait for I1 to complete, as it requires the value of R1. I5 has no dependency and
can complete without delay. In cycle 5 both instructions I2 and I4 need integer
arithmetic units, which are available. I5 needs a floating-point unit and it is available.
I6 needs the value of R1, which is available in cycle 5. It, however, cannot complete
before I3 which uses R5 (I3 and I6 have anti-dependency). Thus I6 starts after I3
completes execution. The next instruction I7 cannot start till I6 completes, as it is
anti-dependent on I6. The destination register of I1 and I7 is the same, so I1
must also complete before I7 begins. This dependency, when the destination
registers of two instructions are the same, is called OUTPUT DEPENDENCY. It is
also known as a Write After Write (WAW) hazard. So I7 can start execution in
cycle 8. The last instruction I8 has no dependency, so it can proceed without
delay. However, it gets delayed because of resource constraints: it needs the
floating-point execution unit in cycle 6, but that unit is used by I5. When the
execution of I5 completes in cycle 6, I8 can use the unit in cycle 7, so I8 is
delayed by one cycle. If we had 2 floating-point execution units, then I8 could
complete in cycle 9. However, I7 completes only in cycle 10.
To detect hazards, first determine which registers are used by prior
instructions and whether they are source or destination registers. If the
destination register of an instruction I is used as a source by a following
instruction K, then K is FLOW DEPENDENT on I. This is known as a Read After
Write (RAW) hazard, and also as a true dependency, as K cannot proceed until I
completes. Here I2, I4 and I6 are flow dependent on I1.
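The three hazard classes discussed in this section follow mechanically from the source and destination register sets of an earlier instruction I and a later instruction K. A sketch (not from the notes):

```python
def classify(src_i, dst_i, src_k, dst_k):
    """Hazards between an earlier instruction I and a later instruction K,
    given their source and destination register sets."""
    hazards = []
    if dst_i & src_k:
        hazards.append("RAW (flow dependence)")    # K reads what I writes
    if src_i & dst_k:
        hazards.append("WAR (anti-dependence)")    # K overwrites what I reads
    if dst_i & dst_k:
        hazards.append("WAW (output dependence)")  # both write the same register
    return hazards

# I writes R3, which K reads: true dependence.
print(classify({"R1", "R2"}, {"R3"}, {"R3", "R4"}, {"R5"}))
# ['RAW (flow dependence)']

# I reads R2, which K writes: anti-dependence (the I2/I3 case above).
print(classify({"R2", "R4"}, {"R3"}, {"R6"}, {"R2"}))
# ['WAR (anti-dependence)']
```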