CA 5: Pipelining
Pipelining
Pipelining arranges the hardware so that more than one
operation can be performed at the same time.
In this way, the number of operations performed per
second is increased, even though the elapsed time
needed to perform any one operation is unchanged.
Basic Idea of Pipelining
Execution of a program consists of a sequence of
fetch and execute steps.
Let Fi refer to the fetch step and Ei to the execute
step for instruction Ii.
The instruction fetched by the fetch unit is deposited in
an intermediate storage buffer, B1.
This buffer is needed so that the execution unit can
execute one instruction while the fetch unit is fetching
the next.
Assume that both the source and the destination of the
data operated on by the instructions are inside the
execution unit.
In the first clock cycle, the fetch unit fetches instruction I1 (step
F1) and stores it in buffer B1 at the end of the clock cycle.
In the second clock cycle, the execution unit performs E1 on the
instruction in B1 while the fetch unit fetches I2 (step F2).
In this manner, both the fetch and execute units are kept busy all
the time, and the completion rate of instruction execution is twice
that achievable by sequential operation.
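As a quick check of that claim, here is a minimal Python sketch (an illustration, not from the slides) comparing the time to complete n instructions sequentially and with the two-stage fetch/execute overlap:

def sequential_time(n):
    # Each instruction needs a fetch cycle and an execute cycle,
    # performed back to back: F1 E1 F2 E2 ...
    return 2 * n

def pipelined_time(n):
    # The fetch of each instruction overlaps the execution of the
    # previous one: one cycle to fill the pipeline, then one
    # instruction completes every cycle.
    return n + 1

for n in (3, 1000):
    print(n, sequential_time(n), pipelined_time(n))

# For large n the pipelined completion rate approaches one instruction
# per cycle, twice the sequential rate, while the latency of each
# individual instruction is still two cycles.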
Figure: Basic idea of instruction pipelining.

Sequential execution (one instruction at a time):

  I1        I2        I3
  F1  E1    F2  E2    F3  E3

Hardware organization: Instruction fetch unit -> Interstage buffer B1 -> Execution unit

Pipelined execution:

  Clock cycle   1    2    3    4
  I1            F1   E1
  I2                 F2   E2
  I3                      F3   E3
4-stage Pipeline
A pipelined processor may process each instruction in four
steps:
F (Fetch): read the instruction from the memory.
D (Decode): decode the instruction and fetch the source operand(s).
E (Execute): perform the operation specified by the instruction.
W (Write): store the result in the destination location.
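The resulting ideal schedule is easy to generate: instruction i simply enters stage s in clock cycle i + s. A small Python sketch (stage names as above; no hazards assumed) that prints the timing table shown after the buffer discussion:

STAGES = ["F", "D", "E", "W"]

def print_schedule(n):
    # In an ideal 4-stage pipeline, instruction i (0-based) occupies
    # stage s (0-based) during clock cycle i + s + 1.
    width = n + len(STAGES) - 1
    for i in range(n):
        row = ["  "] * width
        for s, stage in enumerate(STAGES):
            row[i + s] = stage + str(i + 1)
        print("I" + str(i + 1) + ": " + " ".join(row))

print_schedule(4)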
4-stage Pipeline: Buffer
Information is passed from one unit to the next through
a storage buffer.
During clock cycle 4, the buffers hold the following:
Buffer B1 holds instruction I3.
Buffer B2 holds the source operand(s) for I2, the
specification of the operation to be performed, and the
information needed for the write step of I2.
Buffer B3 holds the result produced by the execution unit and
the destination information for I1.
Figure: A 4-stage pipeline and its instruction flow.

  Clock cycle   1    2    3    4    5    6    7
  I1            F1   D1   E1   W1
  I2                 F2   D2   E2   W2
  I3                      F3   D3   E3   W3
  I4                           F4   D4   E4   W4

Hardware organization: F (fetch instruction) -> B1 -> D (decode
instruction and fetch operands) -> B2 -> E (execute operation) -> B3 ->
W (write results), where B1, B2, and B3 are the interstage buffers.
Role of Cache Memory
Each stage in a pipeline is expected to complete its operation in
one clock cycle.
The clock period must therefore be long enough for the slowest stage
to finish its task: if different units require different amounts of
time, the clock period must accommodate the longest of them.
Consider the fetch step:
The access time of the main memory may be much greater than the time
needed to perform basic pipeline stage operations, so pipelining would
be of little value.
Using a cache memory, which can typically be accessed in about one
pipeline cycle, keeps the fetch step in line with the other stages.
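The clock-period point can be made concrete with made-up stage delays (the numbers below are purely illustrative): the cycle time is the maximum stage delay, so a slow main-memory fetch drags the whole pipeline down.

# Hypothetical stage delays in nanoseconds, for illustration only.
with_cache    = {"F": 2,  "D": 2, "E": 2, "W": 2}
without_cache = {"F": 20, "D": 2, "E": 2, "W": 2}  # fetch from main memory

for label, delays in (("with cache", with_cache),
                      ("without cache", without_cache)):
    period = max(delays.values())   # the clock must fit the slowest stage
    print(label, "-> clock period:", period, "ns,",
          "steady-state rate:", round(1000 / period), "MIPS")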
Pipeline Performance
For a variety of reasons, one of the pipeline stages
may not be able to complete its processing task for a
given instruction in the time allotted.
Whenever one of the stages in the pipeline cannot
complete its operation in one clock cycle, the pipeline
stalls.
Any condition that causes the pipeline to stall is called
a hazard.
Three types of hazard:
Data hazard
Instruction or control hazard
Structural hazard
Data Hazard
This is a situation in which the pipeline is stalled
because the data to be operated on are delayed for
some reason.
Example:
Stage E is responsible for arithmetic and logic operations, and
one cycle is assigned for this task.
But some operations, such as divide, may require more time, and
the pipeline stalls.
  Clock cycle   1    2    3    4    5    6    7    8    9
  I1            F1   D1   E1   W1
  I2                 F2   D2   E2   E2   E2   W2
  I3                      F3   D3             E3   W3
  I4                           F4   D4   D4   D4   E4   W4
  I5                                F5   F5   F5   D5   E5

Figure 8.3. Effect of an execution operation taking more than one clock
cycle: E2 occupies cycles 4 through 6, so steps D4 and F5 are held up
until the interstage buffers become free again.
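The completion times in this figure can be reproduced with a small model. The sketch below is simplified (a stage is free again one cycle after it accepts an instruction, and a finished stage may hold its result while waiting), so the stall shows up as a gap before E rather than as the held-up D and F steps of the figure, but the completion cycle of each instruction matches.

def pipeline_times(exec_cycles):
    # exec_cycles[i] = number of Execute cycles instruction i needs.
    free = {"F": 1, "D": 1, "E": 1, "W": 1}  # earliest free cycle per stage
    for i, e in enumerate(exec_cycles):
        f = free["F"]
        d = max(f + 1, free["D"])
        x = max(d + 1, free["E"])            # start of the Execute step
        w = max(x + e, free["W"])
        free.update(F=f + 1, D=d + 1, E=x + e, W=w + 1)
        print("I%d: F=%d D=%d E=%d..%d W=%d" % (i + 1, f, d, x, x + e - 1, w))

pipeline_times([1, 3, 1, 1, 1])   # I2's Execute takes three cycles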
Instruction Hazard
The pipeline may be stalled because of a delay in the
availability of an instruction.
This may be the result of a miss in the cache, in which
case the instruction must be fetched from the main memory.
  Clock cycle   1    2    3    4    5    6    7    8    9
  I1            F1   D1   E1   W1
  I2                 F2   F2   F2   F2   D2   E2   W2
  I3                                     F3   D3   E3   W3

The same stall as seen by the fetch stage:

  Clock cycle   1    2    3    4    5    6
  Stage F       F1   F2   F2   F2   F2   F3

Fetching I2 takes four cycles because of a cache miss; the instruction
must be fetched from the main memory.
Structural Hazard
This is the situation in which two instructions require the use of a
given hardware resource at the same time.
For example, one instruction may need to access memory as part of its
execute or write stage while another instruction is being fetched.
Only one instruction can proceed; the other is delayed.
Solution: separate instruction and data caches.
  Clock cycle   1    2    3    4    5    6    7
  I1            F1   D1   E1   W1
  I2 (Load)          F2   D2   E2   M2   W2
  I3                      F3   D3   E3        W3
  I4                           F4   D4        E4
  I5                                F5        D5

The Load needs an extra memory-access step M2, so I2 and I3 both
require the register file in cycle 6; W3 and the steps behind it are
delayed by one cycle.
Data Hazard: Another Example
We must ensure that the results obtained when instructions are
executed in a pipelined processor are identical to those obtained
when the same instructions are executed sequentially.
Consider the following example (destination register first):
Mul R4, R2, R3
Add R6, R5, R4
The Add reads R4, which is produced by the Mul.
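With the destination written first, as above, detecting this read-after-write dependency is a one-line check. A Python sketch (the tuple encoding of instructions is an assumption of this example):

def raw_hazard(producer, consumer):
    # producer/consumer: (opcode, destination, source1, source2)
    _, dest, *_ = producer
    _, _, *sources = consumer
    return dest in sources

mul = ("Mul", "R4", "R2", "R3")   # R4 <- R2 * R3
add = ("Add", "R6", "R5", "R4")   # R6 <- R5 + R4: reads R4
print(raw_hazard(mul, add))       # True: the Add must wait for the Mul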
Data Hazard Example
  Clock cycle   1    2    3    4    5    6    7    8    9
  I1 (Mul)      F1   D1   E1   W1
  I2 (Add)           F2   D2   D2   D2A  E2   W2
  I3                      F3             D3   E3   W3
  I4                                     F4   D4   E4   W4

Step D2A, in which the Add reads its delayed operand R4, can take place
only after W1 has written the Mul result into the register file.
Operand Forwarding
The data hazard arises because I2 is waiting for the result to be
written into the register file, even though that result is already
available at the output of the execution unit after step E1.
What happens?
After decoding instruction I2 and detecting the data dependency, a
decision is made to use operand forwarding.
The operand not involved in the dependency, register R5, is read and
loaded into register SRC1 in clock cycle 3.
In the next clock cycle, the product produced by instruction I1 is
available in register RSLT, and because of the forwarding connection
it can be used in step E2.
Hence execution of I2 proceeds without interruption.
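A sketch of the forwarding idea in Python (the in-flight bookkeeping is a modeling device; the SRC1/SRC2/RSLT registers of the datapath are abstracted into a dictionary): an operand is taken from a not-yet-written-back result when its register matches, and from the register file otherwise.

class ForwardingPipeline:
    def __init__(self):
        self.regs = {"R%d" % i: 0 for i in range(8)}
        self.in_flight = {}   # results computed but not yet written back

    def read_operand(self, reg):
        # The forwarding path: bypass the register file when the
        # producing instruction has executed but not yet written back.
        return self.in_flight.get(reg, self.regs[reg])

    def execute(self, op, dest, src1, src2):
        a, b = self.read_operand(src1), self.read_operand(src2)
        self.in_flight[dest] = a * b if op == "Mul" else a + b

    def write_back(self, dest):
        self.regs[dest] = self.in_flight.pop(dest)

p = ForwardingPipeline()
p.regs.update(R2=3, R3=5, R5=10)
p.execute("Mul", "R4", "R2", "R3")  # E1: the product exists only in flight
p.execute("Add", "R6", "R5", "R4")  # E2: R4 is forwarded, no stall
p.write_back("R4")                  # W1
p.write_back("R6")                  # W2
print(p.regs["R4"], p.regs["R6"])   # 15 25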
Handling Data Hazards in Software
An alternative approach is to leave the task of detecting data
dependencies, and dealing with them, to the software.
In this case, the compiler can introduce the two-cycle delay needed
between instructions I1 and I2 by inserting NOP (no-operation)
instructions, as follows:
I1: Mul R4, R2, R3
NOP
NOP
I2: Add R6, R5, R4
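A sketch of what such a compiler pass could look like in Python (toy assumptions: destination-first tuples as before, a fixed two-instruction delay, and no reordering):

def insert_nops(program, delay=2):
    # program: list of (opcode, destination, source...) tuples.
    out = []
    for instr in program:
        _, _, *sources = instr
        # If a recent instruction produces one of our sources, pad with
        # NOPs until 'delay' instructions separate producer and consumer.
        for back, prev in enumerate(reversed(out), start=1):
            if back > delay:
                break
            if prev[0] != "NOP" and prev[1] in sources:
                out.extend([("NOP",)] * (delay - back + 1))
                break
        out.append(instr)
    return out

prog = [("Mul", "R4", "R2", "R3"), ("Add", "R6", "R5", "R4")]
for instr in insert_nops(prog):
    print(instr)   # Mul, NOP, NOP, Add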
Such dependencies can thus be detected in two ways:
by software (the compiler) or by hardware.
Leaving tasks such as inserting NOP instructions to the compiler
leads to simpler hardware.
Being aware of the need for a delay, the compiler can also attempt to
reorder instructions to perform useful tasks in the NOP slots and thus
achieve better performance.
On the other hand, inserting NOP instructions leads to larger
code size.
Instruction Hazard: Unconditional Branch
The time lost as a result of a branch instruction is
often referred to as the branch penalty.
For a longer pipeline, the branch penalty may be
higher.
Reducing the branch penalty requires the branch
address to be computed earlier in the pipeline.
The instruction fetch unit has dedicated hardware to
identify a branch instruction and to compute the branch
target address as early as possible after an instruction
is fetched.
  Clock cycle   1    2    3    4    5    6
  I1            F1   E1
  I2 (Branch)        F2   E2
  I3                      F3   X
  Ik                           Fk   Ek

X: instruction I3, fetched while the branch was executing, is
discarded. Fetching the branch target Ik is delayed by one cycle,
giving a branch penalty of one cycle in this two-stage pipeline.
(a) Branch target address computed in the Execute stage (penalty: two cycles):

  Clock cycle   1    2    3    4    5    6    7    8
  I1            F1   D1   E1   W1
  I2 (Branch)        F2   D2   E2
  I3                      F3   D3   X
  I4                           F4   X
  Ik                                Fk   Dk   Ek   Wk

(b) Branch target address computed in the Decode stage (penalty: one cycle):

  Clock cycle   1    2    3    4    5    6    7
  I1            F1   D1   E1   W1
  I2 (Branch)        F2   D2
  I3                      F3   X
  Ik                           Fk   Dk   Ek   Wk
Instruction Queue and Prefetching
Many processors employ sophisticated fetch units that
can fetch instructions before they are needed and put
them in an instruction queue.
The instruction queue can store several instructions.
A separate unit, called the dispatch unit, takes
instructions from the front of the queue, sends them to
the execution unit, and also performs the decoding
function.
To support this, the fetch unit must have sufficient decoding
and processing capability to recognize and execute branch
instructions.
When the pipeline stalls because of a data hazard, the dispatch
unit is not able to issue instructions from the instruction
queue; the fetch unit, however, continues to fetch instructions
and add them to the queue.
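A toy model of this interaction (one fetch per cycle, at most one dispatch per cycle, and a made-up stall pattern standing in for the data hazard):

from collections import deque

def run(n_cycles, stalled_cycles):
    # stalled_cycles: cycles in which the dispatch unit cannot issue.
    queue, next_i = deque(), 1
    for cycle in range(1, n_cycles + 1):
        queue.append("I%d" % next_i)     # the fetch unit keeps fetching
        next_i += 1
        issued = None
        if cycle not in stalled_cycles and queue:
            issued = queue.popleft()     # dispatch from the queue front
        print("cycle %d: issued %-4s queue %s" % (cycle, issued, list(queue)))

run(6, stalled_cycles={3, 4})            # the queue grows during the stall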
Instruction Queue

Figure: hardware organization with an instruction queue. The
instruction fetch unit (F: fetch instruction) places fetched
instructions in the instruction queue; the dispatch/decode unit (D)
takes them from the front of the queue and passes them on to the
execute (E: execute instruction) and write (W: write results) stages.
Instruction Queue

  Clock cycle   1    2    3    4    5    6    7    8    9    10
  Queue length  1    1    1    1    2    3    2    1    1    1
  I1            F1   D1   E1   E1   E1   W1
  I2                 F2   D2             E2   W2
  I3                      F3   D3             E3   W3
  I4                           F4   D4             E4   W4
  I5 (Branch)                       F5   D5
  I6                                     F6   X
  Ik                                          Fk   Dk   Ek   Wk

While I1's three-cycle Execute stalls the dispatch unit, the fetch
unit keeps filling the queue (length 2, then 3). The branch I5 is
resolved in its D step, I6 is discarded (X), and the target Ik is
fetched in cycle 7, so no overall time is lost.
Instruction Queue and Prefetching
The branch instruction does not increase the overall
execution time: the instruction fetch unit has executed the
branch instruction (by computing the branch target address)
concurrently with the execution of other instructions.
This technique is called branch folding.
Conditional Branch
A conditional branch instruction introduces an added
hazard caused by the dependency of the branch
condition on the result of a preceding instruction.
Branch Prediction
A technique for reducing the branch penalty associated
with conditional branches is to attempt to predict whether
or not a particular branch will be taken.
Speculative execution means that instructions are executed
before the processor is certain that they are in the correct
execution sequence.
If branch outcomes were random, then half the
branches would be taken.
Simple approach:
Assume that branches will not be taken.
This saves the time lost to conditional branches 50 percent of the time.
Static Branch Prediction
The branch prediction decision is always the same
every time a given instruction is executed
A useful static prediction can be made by observing whether the
target address of the branch is lower than or higher than the
address of the branch instruction: a backward branch (lower target)
usually closes a loop and is predicted taken, while a forward
branch is predicted not taken.
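This backward-taken/forward-not-taken heuristic fits in a couple of lines; a Python sketch with hypothetical addresses:

def predict_taken(branch_addr, target_addr):
    # A backward branch (target below the branch) usually closes a loop,
    # so predict it taken; predict forward branches not taken.
    return target_addr < branch_addr

print(predict_taken(0x1040, 0x1000))   # loop-closing branch -> True
print(predict_taken(0x1040, 0x1080))   # forward skip -> False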
Dynamic Branch Prediction
The objective of branch prediction algorithms is to
reduce the probability of making a wrong decision and
to avoid fetching instructions that eventually have to be
discarded.
In its simplest form, the execution history used in
predicting the outcome of a given branch instruction is the
result of the most recent execution of that instruction.
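A sketch of this simplest scheme in Python: a single bit per branch that just remembers the last outcome. (Tracking one branch in isolation is a simplification; a real processor would index a table by the branch address.)

def mispredictions_1bit(outcomes, prediction=False):
    # outcomes: successive actual results (True = taken) of one branch.
    misses = 0
    for taken in outcomes:
        if prediction != taken:
            misses += 1
        prediction = taken          # remember only the most recent outcome
    return misses

# A loop branch taken 9 times then falling through, executed 3 times:
history = ([True] * 9 + [False]) * 3
print(mispredictions_1bit(history))   # 6: two misses per pass through the loop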
Better performance can be achieved by keeping more
information about the execution history, for example two
state bits per branch, as sketched below.
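The widely used two-bit saturating-counter scheme keeps four states (strongly/weakly not taken, weakly/strongly taken), so a single anomalous outcome does not flip the prediction. A sketch under the same single-branch simplification as above (the 0-3 encoding is an assumption):

def mispredictions_2bit(outcomes, state=0):
    # state 0,1 -> predict not taken; state 2,3 -> predict taken.
    misses = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            misses += 1
        # Step toward the observed outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses

history = ([True] * 9 + [False]) * 3
print(mispredictions_2bit(history))   # 5, versus 6 for the one-bit scheme

After warm-up, the two-bit scheme mispredicts only once per execution of the loop (at the exit), whereas the one-bit scheme also mispredicts on the first iteration of every execution.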
THANK YOU