ACA - Pipelining

PIPELINING

Pipelining is a technique of decomposing a sequential process into suboperations,
with each subprocess being executed in a special dedicated segment that operates
concurrently with all other segments. A pipeline can be visualized as a collection of
processing segments through which binary information flows. Each segment
performs partial processing dictated by the way the task is partitioned. The result
obtained from the computation in each segment is transferred to the next segment in
the pipeline. The final result is obtained after the data have passed through all
segments. The overlapping of computation is made possible by associating a
register with each segment in the pipeline. The registers provide isolation between
each segment so that each can operate on distinct data simultaneously.

Suppose we want to perform the combined multiply and add operations with a
stream of numbers.

Ai * Bi + Ci    for i = 1, 2, 3, …, 7
Each segment has one or two registers and a combinational circuit. R1 through R5
are registers that receive new data with every clock pulse.

R1 ← Ai, R2 ← Bi        input Ai and Bi

R3 ← R1 * R2, R4 ← Ci   multiply and input Ci

R5 ← R3 + R4            add Ci to product

[Figure: Ai and Bi are clocked into R1 and R2, which feed a multiplier; the product
is clocked into R3 while Ci is clocked into R4; an adder combines R3 and R4 and the
sum is clocked into R5.]
Clock Segment 1 Segment 2 Segment 3
pulse No. R1 R2 R3 R4 R5
1 A1 B1 - - -
2 A2 B2 A1 * B1 C1 -
3 A3 B3 A2 * B2 C2 A1 * B1 + C1
4 A4 B4 A3 * B3 C3 A2 * B2 + C2
5 A5 B5 A4 * B4 C4 A3 * B3 + C3
6 A6 B6 A5 * B5 C5 A4 * B4 + C4
7 A7 B7 A6 * B6 C6 A5 * B5 + C5
8 - - A7 * B7 C7 A6 * B6 + C6
9 - - - - A7 * B7 + C7
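The clock-by-clock behaviour in the table above can be sketched in Python. This is an illustrative simulation: the function name and the unit-delay-per-segment assumption are mine, not part of the text.

```python
def pipeline_multiply_add(a, b, c):
    """Simulate the three-segment pipeline R5 <- (R1 * R2) + R4, one clock
    pulse per loop iteration, for the input streams Ai, Bi, Ci."""
    n = len(a)
    r1 = r2 = r3 = r4 = r5 = None
    results = []
    for clock in range(n + 2):          # n inputs drain after n + 2 pulses
        # Segment 3: the adder combines the values latched in R3 and R4
        new_r5 = r3 + r4 if r3 is not None else None
        # Segment 2: the multiplier uses R1 and R2; Ci is latched into R4
        new_r3 = r1 * r2 if r1 is not None else None
        new_r4 = c[clock - 1] if 1 <= clock <= n else None
        # Segment 1: latch the next Ai and Bi from the input stream
        new_r1 = a[clock] if clock < n else None
        new_r2 = b[clock] if clock < n else None
        # All registers are clocked simultaneously
        r1, r2, r3, r4, r5 = new_r1, new_r2, new_r3, new_r4, new_r5
        if r5 is not None:
            results.append(r5)
    return results
```

For i = 1 to 7 the simulation takes n + 2 = 9 clock pulses, matching the table.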

TEMPORAL PARALLELISM: This method of parallel processing is appropriate if:
 The jobs to be carried out are identical.
 A job can be divided into many independent tasks and each can be performed
by a different segment.
 The time taken for each task is the same.
 The time taken to send a job from one module to the next is negligible
compared to the time needed to do a task.
 The number of tasks is much smaller compared to the total number of jobs to
be done.

Let the number of jobs = n


Let the time to do a job = p
Let each job be divided into k tasks and let each task be done by a different
segment.
Let the time for doing each task = p/k.
Time to complete n jobs with no pipeline processing = np.

Time to complete n jobs with a pipeline organization of k segments


= p + (n-1)p/k = p(k+n-1)/k

Speedup due to pipeline processing = np / (p(k+n-1)/k) = k / (1 + (k-1)/n)

If n >> k then (k-1)/n ≈ 0.


So the speedup is nearly equal to k, the number of segments.
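The derivation above collapses to a one-line formula; a small sketch (the function name is assumed) makes the limit easy to check:

```python
def pipeline_speedup(n, k):
    """Speedup of a k-segment pipeline over sequential processing of n jobs:
    n*p / (p + (n-1)*p/k) = k / (1 + (k-1)/n); the job time p cancels out."""
    return k / (1 + (k - 1) / n)
```

For n = 7 jobs and k = 4 segments the speedup is 2.8, and as n grows the speedup approaches k.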

PROBLEMS IN IMPLEMENTATION:

 Synchronization: Identical time should be taken for doing each task in the pipeline
so that a job can flow smoothly in the pipeline without holdup.
 Bubbles in Pipeline: If some tasks are absent in a job “BUBBLES” form in the
pipeline.
 Fault Tolerance: The system does not tolerate faults. If one segment is out of
order, the entire pipeline suffers.
 Inter-task Communication: The time taken to pass information from one segment
to another in the pipeline should be much smaller compared to the time taken to
perform a task.
 Scalability: The number of segments in the pipeline cannot be increased
beyond a certain limit; it depends on the number of independent tasks. The
time taken by each task should be equal, and the number of jobs should be
much larger than the number of tasks.

DATA PARALLELISM:

Advantages:
 There is no synchronization problem.
 The problem of bubbles is absent.
 The method is more fault tolerant.
 There is no inter-task communication delay.

Disadvantages:
 The assignment of jobs is static.
 The set of jobs must be partitionable into subsets of mutually independent
jobs. Each subset should take the same time to complete.
 The time taken to divide a set of jobs into equal subsets should be small.
Further, the number of subsets should be small compared to the number of jobs.

PIPELINE COMPUTERS: The process of executing an instruction in a digital
computer involves four major steps:
 Instruction Fetch (IF): Fetch instruction from main memory

 Instruction Decoding (ID): Identify the operation to be performed.

 Operand Fetch (OF): Fetch the operand.

 Execution (EX): Executing the decoded arithmetic logic operation.

Four pipeline stages IF, ID, OF and EX are arranged into a linear cascade. An
instruction cycle consists of multiple pipeline cycles. A pipeline cycle can be set
equal to the delay of the slowest stage. The flow of data from stage to stage is
triggered by a common clock control. For the nonpipelined computer, it takes four
pipeline cycles to complete one instruction. Once a pipeline is filled up, an output
result is produced from the pipeline on each cycle. The instruction cycle is reduced
to one-fourth of the original cycle time.
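The cycle counts behind the two space-time diagrams that follow can be written down directly. This helper is an illustrative sketch; its name and the default k = 4 stages are assumptions:

```python
def cycles_to_complete(n, k=4, pipelined=True):
    """Clock cycles to finish n instructions on a k-stage machine: the
    pipeline fills in k cycles and then completes one instruction per
    cycle; without pipelining each instruction takes all k cycles in turn."""
    return k + (n - 1) if pipelined else k * n
```

With n = 5 instructions and k = 4 stages, the pipelined machine finishes in 8 cycles versus 20 cycles without pipelining.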

[Figure: a pipelined processor — four stages S1–S4 (IF, ID, OF, EX) arranged
in a linear cascade, with output produced from the final stage.]
Pipeline stages
EX              I1  I2  I3  I4  I5 …
OF          I1  I2  I3  I4  I5 …
ID      I1  I2  I3  I4  I5 …
IF  I1  I2  I3  I4  I5 …
    1   2   3   4   5   6   7   8   9
Time (pipeline cycles)

Space-Time diagram for a pipelined processor

Stages
EX              I1              I2              I3 …
OF          I1              I2              I3 …
ID      I1              I2              I3 …
IF  I1              I2              I3              I4 …
    1   2   3   4   5   6   7   8   9  10  11  12  13
Time

Space-Time diagram for a nonpipelined processor

There are two areas of computer design where the pipeline organization is
applicable. An arithmetic pipeline divides an arithmetic operation into sub operations
for execution in the pipeline segments. An instruction pipeline operates on a stream
of instructions by overlapping the IF, ID, OF and EX phases of the instruction cycle.

ARITHMETIC PIPELINE: Consider the addition or subtraction of two normalized
floating-point numbers X = A × 2^a and Y = B × 2^b.
The suboperations performed in the four segments are:
1. Compare the exponents
2. Align the mantissas
3. Add or subtract the mantissas
4. Normalize the result
[Figure: a four-segment pipeline for floating-point addition/subtraction. The
exponents a, b and mantissas A, B enter through input registers. Segment 1
compares the exponents by subtraction; Segment 2 chooses the larger exponent
and aligns the mantissa; Segment 3 adds or subtracts the mantissas; Segment 4
adjusts the exponent and normalizes the result. Registers (R) separate the
segments.]

Pipeline for floating point addition and subtraction

Segment 1: The exponents are compared by subtracting them to determine their
difference. The larger exponent is chosen as the exponent of the result. The
exponent difference determines how many times the mantissa associated with the
smaller exponent must be shifted to the right.

Segment 2: The mantissa with the smaller exponent is shifted right by the exponent difference, aligning the two mantissas.

Segment 3: The two mantissas are added or subtracted.

Segment 4: The result is normalized. The sum is adjusted so that it has a
fraction with a non-zero first digit; here this is done by shifting the
mantissa once to the right and incrementing the exponent by one to obtain the
normalized sum.

0.9504 × 10^3 + 0.8200 × 10^2 = 1.0324 × 10^3 = 0.10324 × 10^4


Time delay for the 4 segments are t1=60 ns, t2 = 70 ns, t3= 100 ns, t4 = 80 ns and
the interface register delay tr= 10 ns.
The clock cycle tp= t3 + tr = (100+ 10) ns = 110 ns.
An equivalent non-pipelined floating-point adder-subtractor will have a delay time
tn= t1 + t2 + t3 + t4 + tr
= 320 ns
So speedup = 320/110 = 2.9
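The timing arithmetic above can be verified directly; the variable names in this sketch are mine:

```python
t = [60, 70, 100, 80]    # segment delays t1..t4 in ns, from the text
tr = 10                  # interface register delay in ns

tp = max(t) + tr         # pipeline clock cycle: slowest stage + register delay
tn = sum(t) + tr         # equivalent nonpipelined adder-subtractor delay
speedup = tn / tp        # roughly 2.9
```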
SUPERPIPELINED AND PIPELINED PROCESSING:

Superpipelined (each column is half a clock cycle; a new instruction starts every cycle):
IF1 IF2 DE  EX1 EX2 MEM MEM SR
        IF1 IF2 DE  EX1 EX2 MEM MEM SR
                IF1 IF2 DE  EX1 EX2 MEM MEM SR

Pipelined (clock cycles 1–7):
IF  DE  EX  MEM SR
    IF  DE  EX  MEM SR
        IF  DE  EX  MEM SR

Superpipelined and Pipelined processing

SUPERPIPELINING: In practice some pipeline stages require less than one clock
interval; DE and SR require less time. So one clock cycle is allocated to
instruction fetch (IF), execution (EX) and data memory load/store (MEM),
whereas half a clock cycle is allocated to decode (DE) and store register (SR).
This method of pipelining is called superpipelining.

SUPERSCALAR PROCESSOR

1   2   3   4   5   6   7
IF  DE  EX  MEM SR
IF  DE  EX  MEM SR
    IF  DE  EX  MEM SR
    IF  DE  EX  MEM SR
        IF  DE  EX  MEM SR
        IF  DE  EX  MEM SR

SUPERSCALAR PROCESSING: One way to improve the speed of the processor is to
combine the temporal parallelism used in pipeline processing with data
parallelism by issuing several instructions simultaneously in each cycle. This
is called superscalar processing. Six instructions are completed in 7 clock
cycles; in the steady state two instructions complete every clock cycle. The
hardware should permit the fetching of multiple instructions simultaneously
from the instruction memory, and the data cache must have several independent
ports for read/write which can be used simultaneously. A 64-bit data path from
instruction memory is required to fetch two 32-bit instructions. Other
requirements are two instruction registers and multiple execution units: a
floating-point arithmetic unit along with an integer arithmetic unit.

INSTRUCTION LEVEL PARALLELISM: Pipelining is a method to achieve
instruction-level parallelism. Pipelining is extensively used in Reduced
Instruction Set Computers (RISC). RISC processors have been succeeded by
superscalar processors, which execute multiple instructions in one clock
cycle. The idea in superscalar processor design is to use the parallelism
available at the instruction level by increasing the number of arithmetic and
functional units in a PE (Processing Element).
Ideal Conditions for Pipelining:

 It is possible to break up an instruction into a number of independent parts,
each taking nearly equal time to execute.
 Instructions should be executed in the sequence in which they are written.
With many branch instructions, the pipeline is not effective.
 Successive instructions are such that the work done by one instruction
during execution can be effectively used by the next and successive
instructions. Successive instructions are also independent of one another.
 Sufficient resources are available in the processor so that, if a resource is
required by successive instructions in the pipeline, it is readily available.

Step            1   2   3   4   5   6   7   8   9  10  11  12  13
Instruction: 1  FI  DA  FO  EX
             2      FI  DA  FO  EX
    (Branch) 3          FI  DA  FO  EX
             4              FI  -   -   FI  DA  FO  EX
             5                  -   -   -   FI  DA  FO  EX
             6                                  FI  DA  FO  EX
             7                                      FI  DA  FO  EX

TIMING OF INSTRUCTION PIPELINE

It is assumed that the processor has separate instruction and data memories so that
the operation in FI and FO can proceed at the same time. In the absence of a branch
instruction, each segment operates on different instructions.
STEP 4: Instruction 1 is being executed in segment EX,
The operand for instruction 2 is being fetched in segment FO,
Instruction 3 is being decoded in segment DA,
The instruction 4 is being fetched from memory in segment FI.

Assume that instruction 3 is a branch instruction. As soon as this instruction
is decoded in segment DA in step 4, the transfer from FI to DA of the other
instructions is halted until the branch instruction is executed in step 6. If
the branch is taken, a new instruction is fetched in step 7. If the branch is
not taken, the instruction fetched previously in step 4 can be used.

PIPELINE HAZARDS: Delays in pipeline execution of instructions due to
non-ideal conditions are called pipeline hazards.

Non-Ideal Conditions:
 Available resources in a processor are limited.
 Successive instructions are not independent of one another. The result
generated by an instruction may be required by the next instruction.
 All programs have branches and loops. So execution of a program is not in
“Straight Line”. An ideal pipeline assumes a continuous flow of tasks.

There are three types of Pipeline Hazards:


 Structural Hazard: The delay caused by the non-availability of a resource is
called a Structural Hazard.
 Data Hazard: The delay caused by data dependency between instructions is
called a Data Hazard.
 Control Hazard: The delay caused by branch instructions or control
dependency in a program is called a Control Hazard.

STRUCTURAL HAZARD: Pipeline execution may be delayed due to the
non-availability of resources when required during execution of an instruction.
During clock cycle 4, step 4 of instruction I requires a read/write in data
memory, while instruction (I+3) requires an instruction to be fetched from the
instruction memory. If one common memory is used for both data and
instructions, only one of these operations can be carried out and the other has
to wait. Forced waiting of an instruction in pipeline processing is called a
pipeline stall. So two memories are provided, one for instructions and the
other for data, to avoid this pipeline stall.

Clock 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Cycle/
Instruction
I FI DE EX MEM SR

I+1 FI DE EX EX EX MEM SR

I+2 FI DE X X EX MEM SR

I+3 FI DE X X EX MEM SR

I+4 FI DE X X EX EX EX MEM SR

I+5 FI DE X X X X EX MEM SR

I+6 FI DE X X X X EX MEM SR

I+7 FI DE X X X X EX MEM SR

DELAY IN PIPELINE DUE TO RESOURCE CONSTRAINTS

X indicates an idle period or stall of an instruction.

Pipeline execution may be delayed if one of the steps in execution of an
instruction takes longer than one clock cycle. Normally a floating-point
division takes longer than an integer addition. Instruction I+1 is a
floating-point operation, which takes 3 clock cycles. Thus the execution of
instruction I+2 cannot start at clock cycle 5 as the execution unit is busy; it
has to wait till clock cycle 7, when the execution unit becomes free. This is a
pipeline stall due to the non-availability of a required resource. It can be
avoided by introducing extra hardware to complete the floating-point operation
in one cycle.

DELAY IN PIPELINE WITH PIPELINE LOCKING: In some pipeline designs, whenever
work cannot continue in a particular cycle the pipeline is “LOCKED”: all the
pipeline stages except the one that is yet to finish are stalled. Though the
decode step (DE) of instruction (I+3) could go on (as the decoding unit is
free), because of the locking of the pipeline no work is done during the cycle.
By locking we ensure that successive instructions complete in the order they
are issued. Recent machines do not always lock the pipeline and let
instructions continue if there are no resource constraints or other problems.
In such a case completion of instructions may be “OUT OF ORDER”, that is, a
later instruction may complete before an earlier instruction. If this is
logically acceptable in a program, the pipeline need not be locked.

Clock 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Cycle/
Instruction
I FI DE EX MEM SR
I+1 FI DE EX EX EX MEM SR
I+2 FI DE X X EX MEM SR
I+3 FI X X DE EX MEM SR
I+4 X X FI DE EX EX EX MEM SR
I+5 X X X FI DE X X EX MEM SR
I+6 FI X X DE EX MEM SR
I+7 X X FI DE EX MEM SR

DELAY IN PIPELINE WITH PIPELINE LOCKING

DELAY DUE TO DATA DEPENDENCY: Pipeline execution is delayed due to
the fact that successive instructions are not independent of one another. The result
produced by an instruction may be needed by succeeding instructions and the
results may not be ready when needed.

ADD R1, R2, R3    C(R3) ← C(R1) + C(R2)

MUL R3, R4, R5    C(R5) ← C(R3) * C(R4)

SUB R7, R2, R6    C(R6) ← C(R7) - C(R2)

INC R3            C(R3) ← C(R3) + 1

Clock Cycle 1 2 3 4 5 6 7 8 9 10
ADD R1,R2,R3 FI DE EX MEM SR
MUL R3,R4,R5 FI DE X X EX MEM SR
SUB R7,R2,R6 FI DE EX MEM SR
INC R3 FI DE X EX MEM SR

DELAY IN PIPELINE DUE TO DATA DEPENDENCY

The value of R3 will not be available till clock cycle 6. Thus the MUL operation
should not be executed in clock cycle 4 as the value of R3 it needs will not be stored
in the register by the ADD operation. Thus EX step of MUL instruction is delayed till
clock cycle 6. This stall is due to data dependency and is called DATA HAZARD.
The next instruction SUB does not need R3. So it can proceed without waiting. The
next instruction INC R3 can be executed in clock period 6, as R3 is already available
in the register. But the execution unit is busy at clock cycle 6. So it has to wait till
clock cycle 7 to increment R3. This delay can be eliminated if the computer has a
separate unit to execute the INC operation. The SUB operation is completed
before the previous instruction; this is called “OUT-OF-ORDER COMPLETION”. To
ensure in-order completion the pipeline can be locked, but it will take one
more clock cycle.

Clock Cycle 1 2 3 4 5 6 7 8 9 10 11
ADD R1,R2,R3 FI DE EX MEM SR
MUL R3,R4,R5 FI DE X X EX MEM SR
SUB R7,R2,R6 FI X X DE EX MEM SR
INC R3 X X FI DE EX MEM SR

LOCKING PIPELINE DUE TO DATA DEPENDENCY

PIPELINE DELAY DUE TO BRANCH INSTRUCTIONS: A branch instruction disrupts the
normal flow of control. If an instruction is a branch instruction (which is
known only at the end of the instruction decode step), the next instruction may
be either the next sequential instruction (if the branch is not taken) or the
one specified by the branch instruction (if the branch is taken). Branch
instructions can be of two types: conditional jump and unconditional jump.

Clock Cycle 1 2 3 4 5 6 7 8 9 10
Instruction
I FI DE EX MEM SR
(I+1) branch FI DE EX MEM SR
I+2 FI X X FI DE EX MEM SR

DELAY IN PIPELINE EXECUTION DUE TO BRANCH INSTRUCTION

Suppose instruction I+1 is a branch instruction. This fact will be known to the
hardware only at the end of its DE step; in the meanwhile the next instruction
would have been fetched. Whether the fetched instruction should be executed or
not is known only at the end of the MEM cycle of the branch instruction, so
this instruction is delayed by two cycles. If the branch is not taken, the
fetched instruction can proceed to the DE step. If the branch is taken, the
instruction fetched is not used and the instruction from the branch address is
fetched, so there is a 3-cycle delay.

REDUCING PIPELINE DELAY [due to data dependency]:
ADD R1, R2, R3
MUL R3, R4, R5
SUB R3, R2, R6
 Register Forwarding: This is a hardware technique to reduce pipeline delay.
The result of ADD R1, R2, R3 will be in the buffer register B3. Instead of
waiting till the SR cycle to store it in the register, a path is provided
from B3 to the ALU, bypassing the MEM and SR cycles. The register forwarding
is done during the ADD instruction. Hardware must be there to detect that
the next instruction needs the output of the current instruction. Hardware
must also provide the facility to forward R3 to SUB.
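The detection step can be sketched as a comparison of register fields. The tuple encoding (opcode, src1, src2, dest) is an assumed illustration, not the processor's actual instruction format:

```python
def needs_forwarding(producer, consumer):
    """True if the consumer instruction reads the register that the producer
    instruction writes, so the result must be forwarded from the buffer
    register to the ALU instead of waiting for the SR cycle."""
    _, _, _, dest = producer          # (opcode, src1, src2, dest)
    _, src1, src2, _ = consumer
    return dest in (src1, src2)
```

For the sequence above, the R3 produced by ADD is needed by MUL (forwarding required) but not by SUB.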

 Software Scheduling:
LD R2, R3, Y      C(R2) ← C(Y + C(R3))
ADD R1, R2, R3    C(R3) ← C(R1) + C(R2)
MUL R4, R3, R1    C(R1) ← C(R4) * C(R3)
SUB R7, R8, R9    C(R9) ← C(R7) – C(R8)

Clock Cycle 1 2 3 4 5 6 7 8 9 10
LD R2, R3, Y FI DE EX MEM SR
ADD R1,R2,R3 FI DE X X EX MEM SR
MUL R4,R3,R1 FI DE X X EX MEM SR
SUB R7,R8,R9 FI DE X X EX MEM SR

Clock Cycle 1 2 3 4 5 6 7 8 9 10
LD R2, R3, Y FI DE EX MEM SR
SUB R7,R8,R9 FI DE EX MEM SR
ADD R1,R2,R3 FI DE X EX MEM SR
MUL R4,R3,R1 FI DE X EX MEM SR

REDUCING PIPELINE STALL BY SOFTWARE SCHEDULING

In the first figure the LD instruction stores the value into R2 only at the end
of cycle 5, so the ADD instruction can perform the addition only at clock cycle
6. The MUL instruction is also delayed due to the non-availability of R3, and
in the next cycle the execution unit is busy. The SUB instruction, even though
not dependent on any of the previous instructions, is delayed due to the
non-availability of the execution unit. The sequence of instructions is
reordered to reduce the pipeline delay; here the rescheduling reduces the delay
by one clock cycle.

HARDWARE MODIFICATION TO REDUCE DELAY DUE TO BRANCHES: The primary idea is to
find out the address of the next instruction to be executed as early as
possible.
 BRANCH PREDICTION BUFFER: The prediction is based on the execution
time behaviour of a program. It uses a small fast memory called Branch
Prediction Buffer to assist the hardware in selecting the instruction to be
executed immediately after a branch instruction. The low order bits of the
address of branch instructions in a program segment are used as addresses
of the branch prediction buffer memory. The content in each location of this
buffer memory is the address of the next instruction to be executed if branch
is taken. In addition, two bits count the number of times the branch has been
successfully taken. If it is a branch instruction, the low order bits of its address
are used to address the branch prediction buffer memory. Initially the
prediction bits are 00. The prediction bits are incremented by 1 every time the
branch is taken and decremented by 1 when it is not taken. If the prediction
bits are 11 and the branch is taken, they remain 11; if the branch is not
taken, they are decremented to 10. If the prediction bits are 10 or 11, control
jumps to the branch address found in the branch prediction buffer; otherwise
the next sequential instruction is executed. A single-bit predictor mispredicts
branches more often, particularly in loops, compared to a 2-bit predictor.

Address: low-order bits of the branch instruction address
Contents: address where the branch will jump
Prediction bits: 2 bits

FIELDS OF A BRANCH PREDICTION BUFFER MEMORY
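The two prediction bits behave as a saturating counter. A minimal sketch of that behaviour (the class and method names are assumed):

```python
class TwoBitPredictor:
    """2-bit branch prediction counter: 00 initially, incremented when the
    branch is taken and decremented when not, saturating at 11 and 00.
    Predict taken (jump to the buffered address) when the bits are 10 or 11."""

    def __init__(self):
        self.bits = 0               # prediction bits, initially 00

    def predict_taken(self):
        return self.bits >= 2       # 10 or 11

    def update(self, taken):
        if taken:
            self.bits = min(self.bits + 1, 3)   # saturate at 11
        else:
            self.bits = max(self.bits - 1, 0)   # saturate at 00
```

After many taken branches, a single not-taken outcome only moves the counter from 11 to 10, so the next loop iteration is still predicted taken; a 1-bit predictor would mispredict twice per loop instead of once.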

 BRANCH TARGET BUFFER: A Branch Target Buffer (BTB) is used at
the Instruction Fetch step.

Address: address of the branch instruction
Contents: address where the branch will jump
Prediction bits: 1 or 2 bits (optional)

FIELDS OF A BRANCH TARGET BUFFER


The address field contains the complete address of all branch instructions. The
contents of BTB are created dynamically. When a program is executed whenever a
branch statement is encountered its address and branch target address are placed
in BTB. At the end of execution step the target address of the branch will be known if
branch is taken. At this time target address is entered in BTB and the prediction bit is
set to 1. Once a BTB entry is made it can be accessed at instruction fetching phase
and the target address found. When a loop is executed for the first time, the
branch instruction governing the loop would not yet be found in the BTB; it is
entered in the BTB during this first execution. When the loop is executed the
second and subsequent times, the branch target is found at the instruction
fetch phase.

The left portion of the flowchart describes how the BTB is created and updated
dynamically, and the right portion shows how it is used. It is assumed that the
actual branch target is found during the DE step; this is possible in SMAC2P
with added hardware. If no hardware is added, then whether the predicted branch
and the actual branch are the same will be known only at the execution step of
the pipeline. If they do not match, the instruction fetched from the target has
to be removed and the next instruction in sequence taken for execution.

BTB cannot be very large. About 1000 entries are normally used in practice.

BRANCH TARGET BUFFER (BTB) CREATION AND USE

[Flowchart: FI step — the address of the fetched instruction is looked up in
the BTB; if found, the next instruction is fetched from the predicted branch
address. DE step — if the fetched instruction is a branch not yet in the BTB,
its address and branch address are entered in the BTB (creation of BTB
entries); if it is in the BTB and the predicted branch equals the actual
branch, execution continues normally. EX step — on a misprediction, the fetched
instruction is flushed, the right instruction is fetched, and the prediction
bit of the BTB is modified.]
SOFTWARE METHOD TO REDUCE DELAY DUE TO BRANCHES: The
primary idea is for the compiler to rearrange the statements of the program in such a
way that the statement following the branch statement (Delay slot) is always
executed once it is fetched without affecting the correctness of the program. This
may not always be possible but analysis of many programs shows that this
technique succeeds quite often.

In the rearranged program the branch statement JMP has been placed before ADD
R1, R2, R3. While the jump statement is being decoded, the ADD statement would
have been fetched in the rearranged code and can be allowed to complete without
changing the meaning of the program. If no such statement is found, then a No
Operation (NOP) statement is used as a filler after the branch so that when it
is fetched it does not affect the meaning of the program.
ORIGINAL PROGRAM          REARRANGED PROGRAM
……………………                  ………………………
ADD R4, R5, R6            ADD R4, R5, R6
ADD R1, R2, R3            JMP X
JMP X                     ADD R1, R2, R3   ← delay slot
……………….                   ……………….
X ……………...                X ……………...

REARRANGING COMPILED CODE TO REDUCE STALLS

The compiler for a processor that uses delayed branches is designed to analyze the
instructions before and after the branch and rearrange the program sequence by
inserting useful instructions in the delayed steps. For example, the compiler can
determine that the program dependencies allow one or more instructions preceding
the branch to be moved into the delayed steps after the branch. These instructions
are then fetched from memory and executed through the pipeline while the branch
instruction is being executed in other segments. The effect is the same as if the
instructions were executed in their original order. It is up to the compiler to
find useful instructions to put after the branch instruction. Failing that, the
compiler can insert NO-OP instructions.

I → Instruction Fetch
A → ALU operation
E → Execute instruction
Example:
LOAD FROM MEMORY TO R1
INCREMENT R2
ADD R3 TO R4
SUBTRACT R5 FROM R6
BRANCH TO ADDRESS X

Clock Cycle          1   2   3   4   5   6   7   8   9  10
1. Load              I   A   E
2. Increment             I   A   E
3. Add                       I   A   E
4. Subtract                      I   A   E
5. Branch to X                       I   A   E
6. No-operation                          I   A   E
7. No-operation                              I   A   E
8. Instruction in X                              I   A   E

USING NO-OPERATION INSTRUCTIONS

Clock Cycle          1   2   3   4   5   6   7   8
1. Load              I   A   E
2. Increment             I   A   E
3. Branch to X               I   A   E
4. Add                           I   A   E
5. Subtract                          I   A   E
6. Instruction in X                      I   A   E

REARRANGING THE INSTRUCTIONS

MEMORY INTERLEAVING: A pipeline often requires simultaneous access to
memory from two or more sources. An instruction pipeline may require the fetching
of an instruction and an operand at the same time from two different segments.
Similarly an arithmetic pipeline usually requires two or more operands to enter the
pipeline at the same time. Instead of using two memory buses for simultaneous
access, the memory can be partitioned into a number of modules connected to a
common memory address and data buses. A memory module is a memory array
together with its own address (AR) and data registers (DR). The address registers
receive information from a common address bus and the data registers communicate
with a bi-directional data bus. The modular system permits one module to initiate a
memory access while other modules are in the process of reading or writing a word
and each module can accept a memory request independent of the state of the other
modules.

[Figure: four memory modules on a common address bus and data bus — each module
consists of a memory array with its own address register (AR) receiving the
address bus and data register (DR) communicating with the bi-directional data
bus.]
Advantage: It allows the use of a technique called interleaving. In an interleaved
memory, different sets of addresses are assigned to different memory modules.
In a two-module memory system, the even addresses may be in one module and the
odd addresses in the other.
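Low-order interleaving assigns an address to a module by its remainder; the function name in this sketch is assumed:

```python
def module_of(address, num_modules=2):
    """Module holding a given address under low-order interleaving: in a
    two-module system, even addresses map to module 0 and odd addresses to
    module 1, so consecutive words can be accessed from different modules
    at the same time."""
    return address % num_modules
```

With four modules, consecutive addresses cycle through modules 0, 1, 2, 3, so a pipeline fetching an instruction and an operand from consecutive locations never contends for the same module.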
DIFFICULTIES IN PIPELINING: These difficulties are due to interruption of the
normal flow of a program by events such as illegal instruction codes, page
faults, etc. These are called Exception Conditions. A memory protection
violation can occur at the Instruction Fetch (IF) stage or the Load/Store (MEM)
step of the pipeline.
An important problem faced by the architect is to be able to attend to the exception
conditions and resume computation in an orderly fashion whenever it is feasible. The
problem of restarting computation is complicated by the fact that several instructions
will be in various stages of completion in the pipeline. If the pipeline processing can
be stopped when an exception condition is detected in such a way that all
instructions which occur before the one causing the exception are completed and all
instructions which were in progress at the instant exception occurred can be
restarted (after attending to the exception) from the beginning, the pipeline is said to
have PRECISE EXCEPTION.

Clock Cycle 1 2 3 4 5 6 7 8 9 10
I FI DE EX MEM SR
I+1 FI DE EX MEM SR
I+2 FI DE EX MEM SR
I+3 FI DE EX MEM SR
I+4 FI DE EX MEM SR
I+5 FI DE EX MEM SR

Trap instruction

In this example suppose an exception occurred during the EX step of instruction I+2.
If the system supports precise exceptions then instructions I and I+1 must be
completed and instructions I+2, I+3 and I+4 should be stopped and resumed from
scratch after attending to the exception. All the actions carried out by I+2,I+3 and I+4
should be cancelled.

When an exception is detected the following actions are carried out:

1. As soon as the exception is detected turn off write operations for the current
and all subsequent instructions in the pipeline (instructions I+2,I+3 and I+4).
2. A trap instruction is fetched as the next instruction in the pipeline (instruction
I+5).
3. This instruction invokes OS, which saves address of faulting instruction to
enable resumption of program later after attending to the exception.

SUPERSCALAR PROCESSOR HAZARDS: It is assumed that the processor has 2 integer
execution units and one floating-point execution unit, which can all work in
parallel.

Instruction Identity   Instruction       No. of Cycles   Arithmetic Unit Reqd.
I1                     R1 ← R1 / R5      2               Floating Point
I2                     R3 ← R1 + R2      1               Integer
I3                     R2 ← R5 + 3       1               Integer
I4                     R7 ← R1 – R9      1               Integer
I5                     R6 ← R4 * R8      2               Floating Point
I6                     R5 ← R1 + 6       1               Integer
I7                     R1 ← R2 + 1       1               Integer
I8                     R10 ← R9 * R8     2               Floating Point

WAR,WAW HAZARD:
Cycles 1 2 3 4 5 6 7
I1 FI DE EXF EXF MEM SR
I2 FI DE X X EXI MEM SR
I3 FI DE EXI MEM SR
I4 FI DE X EXI MEM SR



Assuming hardware register forwarding I2 can execute in cycle 5. I3 does not
apparently depend on previous instructions and can complete in cycle 6, if allowed to
proceed. This would however, lead to an incorrect result in R3 when the preceding
instruction I2 completes as I2 will take the value of R2 computed by I3 instead of the
earlier value of R2. This mistake is due to ANTI-DEPENDENCY between instruction
I2 and I3. In other words a value computed by a later instruction is used by a
previous instruction. This is also known as a Write After Read (WAR) hazard and
is not allowed. Such problems arise when we allow out-of-order completion of
instructions.

Cycle 1 2 3 4 5 6 7 8 9 10
I1 FI DE EXF EXF MEM SR
I2 FI DE X X EXI MEM SR
I3 FI DE X X EXI MEM SR
I4 FI DE X EXI MEM SR
I5 FI DE EXF EXF MEM SR
I6 FI DE X X EXI MEM SR
I7 FI DE X X EXI MEM SR
I8 FI DE X EXF EXF MEM SR

X  Idle Cycle

We can delay the execution of I3 till I2 completes. The execution can be carried out
in cycle 5 simultaneously, if I2 reads R2 before I3 updates it. However we wait for
the completion of execution step of I2 before commencing the execution of I3. I4 has
to wait for I1 to complete, as it requires the value of R1. I5 has no dependency and
can complete without delay. In cycle 5 both instructions I2 and I4 need integer
arithmetic units, which are available. I5 needs a floating-point unit and it is available.
I6 needs the value of R1, which is available in cycle 5. It, however, cannot complete
before I3 which uses R5 (I3 and I6 have anti-dependency). Thus I6 starts after I3
completes execution. The next instruction I7 cannot start till I6 completes, as it is
anti-dependent on I6. The destination register of I1 and I7 is same. So I1 must also
complete before I7 begins. This dependency, when the destination registers of
two instructions are the same, is called OUTPUT DEPENDENCY. It is also known as
a Write After Write (WAW) hazard. So I7 can start execution in cycle 8. The
last instruction I8 has
no dependency, so it can proceed without delay. However, it gets delayed because
of resource constraints as it needs the floating-point execution unit in cycle 6 but it is
used by I5. When the execution of I5 completes in cycle 6, I8 can use it in cycle 7.
So I8 is delayed by one cycle. If we have 2 floating-point execution units, then I8 can
complete in cycle 9. However, I7 completes only in cycle 10.

First detect which registers are used by prior instructions and whether they are
source or destination registers. If the destination register of an instruction I is used as
the source of the next or succeeding instruction K then K is FLOW DEPENDENT on
I. It is known as Read After Write (RAW) hazard. This is also known as True
dependency as K cannot proceed unless I is completed. Here I2, I4 and I6 are flow
dependent on I1.
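The three hazard classes above can be sketched as register-field comparisons. Instructions are encoded as a (dest, sources…) tuple — an assumed illustration, not the processor's format:

```python
def hazards(earlier, later):
    """Dependencies between two instructions, each given as (dest, src1, ...):
    RAW when the later instruction reads the earlier's destination (flow/true
    dependency), WAR when the later writes a register the earlier reads
    (anti-dependency), WAW when both write the same destination (output
    dependency)."""
    d1, *srcs1 = earlier
    d2, *srcs2 = later
    found = []
    if d1 in srcs2:
        found.append("RAW")
    if d2 in srcs1:
        found.append("WAR")
    if d1 == d2:
        found.append("WAW")
    return found
```

Checked against the table above: I2 (R3 ← R1 + R2) is flow-dependent (RAW) on I1 (R1 ← R1 / R5), and I3 (R2 ← R5 + 3) is anti-dependent (WAR) on I2, as discussed in the text.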
