CA 5 Pipelining

Pipelining is a technique that allows multiple operations to be performed simultaneously, increasing the number of operations executed per second without changing the time for any individual operation. It involves stages such as fetch, decode, execute, and write, with buffers to manage data flow between stages. However, data, instruction, and structural hazards can cause stalls; techniques such as operand forwarding, instruction queues, and branch prediction are used to mitigate these issues.


Pipelining

Pipelining
Pipelining arranges the hardware so that more than one operation can be performed at the same time. In this way, the number of operations performed per second is increased, even though the elapsed time needed to perform any single operation is unchanged.

Basic Idea of Pipelining
Execution of a program consists of a sequence of fetch and execute steps.
- Let Fi refer to the fetch step for instruction Ii.
- Let Ei refer to the execution step for instruction Ii.

Consider a computer that has two hardware units:
- one for instruction fetch
- one for instruction execution

Basic Idea of Pipelining
The Instruction fetched by the fetch unit is deposited in
an intermediate storage buffer, B1.
This buffer is needed to enable the execution unit to
execute the instruction while the fetch unit is fetching
the next instruction.
Assume both the source and the destination of the
data operated on by the instructions are inside
Execution unit.

Basic Idea of Pipelining
In the first clock cycle, the fetch unit fetches an instruction I1 (Step
F1) and stores it in buffer B1 at the end of the clock cycle.

In the second clock cycle


 The instruction fetch unit proceeds with the fetch operation for
instruction I2 (Step F2)
 The execution unit performs the operation specified by instruction I1,
which is available in buffer B1 (Step E1)
 By the end of the second clock cycle
 The execution of I1 is completed
 I2 is stored in B1, replacing I1

Basic Idea of Pipelining
In this manner:
- both the fetch and execution units are kept busy all the time, and
- the completion rate of instruction execution will be twice that achievable by sequential operation.

This is a two-stage pipeline in which each stage performs one step in processing an instruction.

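To make the "twice as fast" claim concrete, here is a quick check (a Python sketch, not part of the original slides): n instructions take 2n cycles sequentially but only n + 1 cycles in the two-stage pipeline.

# Quick check of the doubling claim for the two-stage pipeline.
n = 100                          # number of instructions (arbitrary)
sequential_cycles = 2 * n        # F and E never overlap
pipelined_cycles = n + 1         # one instruction completes per cycle after startup
print(sequential_cycles / pipelined_cycles)   # ~1.98, approaching 2 for large n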
(a) Sequential execution: the steps of successive instructions follow one another in time:

    F1 E1 F2 E2 F3 E3

(b) Hardware organization:

    Instruction fetch unit --> interstage buffer B1 --> Execution unit

(c) Pipelined execution:

    Clock cycle   1    2    3    4
    I1            F1   E1
    I2                 F2   E2
    I3                      F3   E3

4-stage Pipeline
A pipelined processor may process each instruction in four steps:
- F (Fetch): read the instruction from memory.
- D (Decode): decode the instruction and fetch the source operand(s).
- E (Execute): perform the operation specified by the instruction.
- W (Write): store the result in the destination location.

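As an illustration (a Python sketch, not part of the original slides), the ideal overlap can be printed directly: with k stages and n instructions, and no hazards, execution takes n + k - 1 cycles instead of n * k.

# Minimal sketch: print the ideal timing chart of a hazard-free pipeline.
STAGES = ["F", "D", "E", "W"]

def timing_chart(n):
    k = len(STAGES)
    total = n + k - 1                       # cycles needed with full overlap
    for i in range(n):
        row = ["  "] * total
        for s, name in enumerate(STAGES):
            row[i + s] = f"{name}{i + 1}"   # instruction i is in stage s at cycle i+s
        print(f"I{i + 1}: " + " ".join(row))
    print(f"{n} instructions in {total} cycles (vs {n * k} sequentially)")

timing_chart(4)    # reproduces the pattern of Figure 8.2(a): 7 cycles, not 16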
4-stage Pipeline: Buffers
Information is passed from one unit to the next through a storage buffer. During clock cycle 4, the buffers hold the following:
- Buffer B1 holds instruction I3.
- Buffer B2 holds the source operand(s) for I2, the specification of the operation to be performed, and the information needed for the write step of I2.
- Buffer B3 holds the result produced by the execution unit and the destination information for I1.

(a) Instruction execution divided into four steps:

    Clock cycle   1    2    3    4    5    6    7
    I1            F1   D1   E1   W1
    I2                 F2   D2   E2   W2
    I3                      F3   D3   E3   W3
    I4                           F4   D4   E4   W4

(b) Hardware organization:

    F: Fetch instruction --> B1 --> D: Decode instruction and fetch operands --> B2 --> E: Execute operation --> B3 --> W: Write results

Figure 8.2. A 4-stage pipeline.

Role of Cache Memory
Each stage in a pipeline is expected to complete its operation in one clock cycle:
- The clock period must be long enough for any stage to complete its task.
- If different units require different amounts of time, the clock period must accommodate the longest of them.

Consider the fetch step:
- The access time of main memory may be much greater than the time needed to perform a basic pipeline stage operation.
- In that case, pipelining would be of little value.

The use of cache memory solves this problem.

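A small sketch with made-up stage delays (the numbers are illustrative assumptions, not from the slides) shows why the slowest stage sets the clock period, and why a slow fetch would dominate everything:

# Illustrative stage delays in nanoseconds.
delays = {"F": 2.0, "D": 1.5, "E": 2.5, "W": 1.0}

clock_period = max(delays.values())        # the slowest stage sets the cycle time
print(clock_period)                        # 2.5 ns per completed instruction

delays_no_cache = dict(delays, F=25.0)     # fetch from main memory instead of cache
print(max(delays_no_cache.values()))       # 25.0 ns -> pipelining gains little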
Pipeline Performance
For a variety of reasons, one of the pipeline stages may not be able to complete its processing task for a given instruction in the time allotted. Whenever one of the stages cannot complete its operation in one clock cycle, the pipeline stalls. Any condition that causes the pipeline to stall is called a hazard.

Pipeline Performance
Three types of hazard:
- Data hazard
- Instruction (or control) hazard
- Structural hazard

Hazards result in degraded performance. An important goal in designing processors is:
- to identify all hazards that cause the pipeline to stall, and
- to find ways to minimize their impact.

Data Hazard
A data hazard is a situation in which the pipeline is stalled because the data to be operated on are delayed for some reason.
Example:
- Stage E is responsible for arithmetic and logic operations, and one cycle is assigned for this task.
- Some operations, such as divide, may require more time, and the pipeline stalls.

Data Hazard

Figure 8.3. Effect of an execution operation taking more than one clock cycle: I2's Execute step takes three cycles (E2 in cycles 4 through 6), so W2 slips to cycle 7 and every step of I3, I4, and I5 behind it is delayed until the Execute stage is free again.

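The stall pattern can be reproduced with a simple model (a sketch under one assumption: each stage holds its instruction until the next stage can accept it; real hardware buffering may shift where the later fetches appear):

# Minimal stall model: print when each instruction occupies each stage.
def schedule(exec_cycles):
    """exec_cycles[i] = number of Execute cycles the (i+1)-th instruction needs."""
    free = [0, 0, 0, 0]                    # first free cycle of F, D, E, W
    for i, e_len in enumerate(exec_cycles, start=1):
        f = free[0]
        d = max(f + 1, free[1])
        e = max(d + 1, free[2])
        w = max(e + e_len, free[3])
        free = [d, e, e + e_len, w + 1]    # each stage stays busy until handoff
        e_span = f"E@{e + 1}" + (f"-{e + e_len}" if e_len > 1 else "")
        print(f"I{i}: F@{f + 1} D@{d + 1} {e_span} W@{w + 1}")

schedule([1, 3, 1, 1, 1])   # I2 needs three Execute cycles, as in Figure 8.3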
Instruction Hazard
The pipeline may be stalled because of a delay in the availability of an instruction:
- This may be the result of a miss in the cache.
- The instruction then has to be fetched from main memory.

Instruction Hazard

(a) Instruction execution steps in successive clock cycles:

    Clock cycle   1    2    3    4    5    6    7    8    9
    I1            F1   D1   E1   W1
    I2                 F2   F2   F2   F2   D2   E2   W2
    I3                                     F3   D3   E3   W3

(b) Function performed by each processor stage in successive clock cycles:

    Clock cycle   1    2    3    4    5    6    7    8    9
    F             F1   F2   F2   F2   F2   F3
    D                  D1   idle idle idle D2   D3
    E                       E1   idle idle idle E2   E3
    W                            W1   idle idle idle W2   W3

Figure 8.4. Pipeline stall caused by a cache miss in F2 (the fetch of I2 takes four cycles, 2 through 5).

Structural Hazard
This is the situation in which two instructions require the use of a given hardware resource at the same time. For example, one instruction may need to access memory as part of the Execute or Write stage while another instruction is being fetched:
- Only one instruction can proceed; the other is delayed.
- Solution: separate data and instruction caches.

Another example arises with the following instruction:

Load X(R1), R2

Structural Hazard

Figure 8.5. Effect of a Load instruction on pipeline timing: the Load (I2) needs an extra memory-access step, taking five cycles in all (F2 D2 E2 M2 W2, in cycles 2 through 6). The Write stage is therefore still busy with W2 in the cycle when I3 would normally write, so W3 and the instructions behind it are delayed.

Data Hazard: Another Example
We must ensure that the results obtained when instructions are executed in a pipelined processor are identical to those obtained when the same instructions are executed sequentially. Consider the following example:

Mul R4, R2, R3
Add R6, R5, R4

A data dependency arises when the destination of one instruction is used as a source in the next instruction. When two operations depend on each other, they must be performed sequentially in the correct order.

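A compiler or a hazard-detection unit can spot such a dependency mechanically. A minimal Python sketch (not from the slides), assuming the slide's destination-first operand order:

def parse(instr):
    """Split 'Op Rd, Rs1, Rs2' (destination first, as in the slides)."""
    op, operands = instr.split(None, 1)
    dest, *srcs = [r.strip() for r in operands.split(",")]
    return op, dest, srcs

def raw_dependency(first, second):
    """True if `second` reads the register that `first` writes."""
    _, dest, _ = parse(first)
    _, _, srcs = parse(second)
    return dest in srcs

print(raw_dependency("Mul R4, R2, R3", "Add R6, R5, R4"))   # True: R4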
Data Hazard Example

Figure 8.6. Pipeline stalled by data dependency between D2 and W1: the Add (I2) cannot finish decoding until the Mul's result is written into R4 by step W1, so its Decode step is stretched (D2 followed by D2A), and E2, W2, and all the instructions behind it are delayed.

Operand Forwarding
The data hazard arises because I2 is waiting for data to be written into the register file. However, these data are available at the output of the ALU once the Execute stage completes step E1. The delay can be reduced, or possibly eliminated, if we arrange for the result of I1 to be forwarded directly for use in step E2.

Operand Forwarding
What happens?
- After decoding instruction I2 and detecting the data dependency, a decision is made to use data forwarding.
- The operand not involved in the dependency, register R5, is read and loaded into register SRC1 in clock cycle 3.
- In the next clock cycle, the product produced by instruction I1 is available in register RSLT, and because of the forwarding connection it can be used in step E2.
- Hence the execution of I2 proceeds without interruption.

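The register names SRC1 and RSLT below follow the slides; the surrounding code is a minimal sketch of the selection logic, not the actual hardware. Each ALU input is taken either from the register file or from the forwarding path out of RSLT:

regfile = {"R2": 3, "R3": 5, "R4": 0, "R5": 10}     # R4 not yet written back

def alu_input(reg, rslt_reg, rslt_value):
    """Select one ALU input: forwarded result or register-file read."""
    if reg == rslt_reg:              # the operand is the value just computed
        return rslt_value            # take it from the forwarding connection
    return regfile[reg]              # otherwise read the register file

rslt_reg, rslt_value = "R4", regfile["R2"] * regfile["R3"]   # Mul R4, R2, R3
src1 = alu_input("R5", rslt_reg, rslt_value)   # 10, from the register file
src2 = alu_input("R4", rslt_reg, rslt_value)   # 15, forwarded (not the stale 0)
print(src1 + src2)                             # Add R6, R5, R4 -> 25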
Handling Data Hazards in Software
An alternative approach is to leave the task of detecting data dependencies, and of dealing with them, to the software. In this case, the compiler can introduce the two-cycle delay needed between instructions I1 and I2 by inserting NOP (no-operation) instructions, as follows:

I1: Mul R4, R2, R3
    NOP
    NOP
I2: Add R6, R5, R4

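The compiler pass can be sketched as follows (assumptions of this sketch: destination-first operand order as above, and the fixed two-cycle hazard window from the text):

def needs_delay(prev, cur):
    """RAW check; NOPs have no operands, so skip them."""
    if prev == "NOP" or cur == "NOP":
        return False
    prev_dest = prev.split(None, 1)[1].split(",")[0].strip()
    cur_srcs = [r.strip() for r in cur.split(None, 1)[1].split(",")[1:]]
    return prev_dest in cur_srcs

def insert_nops(program, delay=2):
    out = []
    for instr in program:
        if out and needs_delay(out[-1], instr):
            out += ["NOP"] * delay       # fill the two-cycle hazard window
        out.append(instr)
    return out

print(insert_nops(["Mul R4, R2, R3", "Add R6, R5, R4"]))
# ['Mul R4, R2, R3', 'NOP', 'NOP', 'Add R6, R5, R4']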
Handling Data Hazards in Software
Such dependencies can be detected in two ways: in software (by the compiler) or in hardware.
- Leaving tasks such as inserting NOP instructions to the compiler leads to simpler hardware.
- Being aware of the need for a delay, the compiler can attempt to reorder instructions to perform useful tasks in the NOP slots and thus achieve better performance.
- On the other hand, inserting NOP instructions leads to larger code size.

Instruction Hazard: Unconditional Branch
The time lost as a result of a branch instruction is often referred to as the branch penalty. For a longer pipeline, the branch penalty may be higher. Reducing the branch penalty requires the branch target address to be computed earlier in the pipeline. The instruction fetch unit has dedicated hardware:
- to identify a branch instruction, and
- to compute the branch target address as early as possible after an instruction is fetched.

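A back-of-the-envelope calculation shows why the penalty matters; the branch frequency and penalty below are illustrative assumptions, not figures from the slides:

branch_fraction = 0.20     # assume 1 in 5 instructions is a branch
penalty_cycles = 2         # assume 2 fetch slots are wasted per branch

effective_cpi = 1 + branch_fraction * penalty_cycles
print(effective_cpi)       # 1.4 -> 40% slower than the ideal CPI of 1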
    Clock cycle   1    2    3    4    5    6
    I1            F1   E1
    I2 (Branch)        F2   E2
    I3                      F3   X
    Ik                           Fk   Ek
    Ik+1                              Fk+1 Ek+1

In cycle 4 the Execution unit is idle: I3, fetched in cycle 3, is discarded (X) once the branch is executed, and the target instruction Ik is fetched instead.

Figure 8.8. An idle cycle caused by a branch instruction.

(a) Branch address computed in the Execute stage:

    Clock cycle   1    2    3    4    5    6    7    8
    I1            F1   D1   E1   W1
    I2 (Branch)        F2   D2   E2
    I3                      F3   D3   X
    I4                           F4   X
    Ik                                Fk   Dk   Ek   Wk
    Ik+1                                   Fk+1 Dk+1 Ek+1

(b) Branch address computed in the Decode stage:

    Clock cycle   1    2    3    4    5    6    7
    I1            F1   D1   E1   W1
    I2 (Branch)        F2   D2
    I3                      F3   X
    Ik                           Fk   Dk   Ek   Wk
    Ik+1                              Fk+1 Dk+1 Ek+1

Figure 8.9. Branch timing: computing the target address in the Decode stage reduces the penalty from two discarded instructions to one.

Instruction Queue and Prefetching
Many processors employ sophisticated fetch units that can fetch instructions before they are needed and put them in an instruction queue. The instruction queue can store several instructions. A separate unit, called the dispatch unit:
- takes instructions from the front of the queue,
- sends them to the execution unit, and
- also performs the decoding function.

Instruction Queue and Prefetching
The fetch unit has sufficient decoding and processing capability. It attempts to keep the instruction queue filled at all times, to reduce the impact of occasional delays in fetching instructions.

Instruction Queue and Prefetching
When the pipeline stalls because of a data hazard:
- the dispatch unit is not able to issue instructions from the instruction queue, but
- the fetch unit continues to fetch instructions and add them to the queue.

When there is a delay in fetching instructions:
- the dispatch unit continues to issue instructions from the instruction queue.

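The decoupling described above can be sketched with a simple queue (a minimal model, not a cycle-accurate one):

from collections import deque

queue = deque()

def fetch_cycle(instr, fetch_delayed):
    if not fetch_delayed:
        queue.append(instr)            # prefetch into the instruction queue

def dispatch_cycle(execute_stalled):
    if not execute_stalled and queue:
        return queue.popleft()         # issue from the front of the queue
    return None

fetch_cycle("I1", False); fetch_cycle("I2", False)   # fetch runs ahead
print(dispatch_cycle(False))    # I1 issued to the execution unit
print(dispatch_cycle(True))     # data hazard: None issued this cycle,
print(list(queue))              # but I2 waits in the queue: ['I2']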
Instruction Queue
Hardware organization: the instruction fetch unit (F: fetch instructions into the instruction queue) is followed by the D (dispatch/decode) unit, the E (execute instruction) stage, and the W (write results) stage.

Instruction Queue

Figure: Branch timing in the presence of an instruction queue (the branch target address is computed in the D stage).

    Queue length per clock cycle (cycles 1-10): 1 1 1 1 2 3 2 1 1 1

I1's Execute step takes three cycles (F1 D1 E1 E1 E1 W1), so dispatch stalls while the fetch unit keeps filling the queue. The branch I5 is resolved in D5, the sequentially fetched I6 is discarded (X), and the target instruction Ik proceeds through Fk Dk Ek Wk; no cycle is lost overall.

Instruction Queue and Prefetching
The branch instruction does not increase the overall execution time. The instruction fetch unit has executed the branch instruction (by computing the branch target address) concurrently with the execution of other instructions. This technique is called branch folding.

Branch folding occurs only if, at the time the branch instruction is encountered, at least one instruction other than the branch is available in the queue.

Conditional Branch
A conditional branch instruction introduces the added hazard caused by the dependency of the branch condition on the result of a preceding instruction. The decision to branch cannot be made until the execution of that instruction has been completed.

Branch Prediction
A technique for reducing the branch penalty associated with conditional branches is to attempt to predict whether or not a particular branch will be taken. The simplest form of branch prediction is to assume that the branch will not take place and to continue fetching instructions in sequential address order. Until the branch condition is evaluated, instruction execution along the predicted path must be done on a speculative basis.

Branch Prediction
Speculative execution means that instructions are executed before the processor is certain that they are in the correct execution sequence. Care must be taken that no processor registers or memory locations are updated until it is confirmed that these instructions should indeed be executed. If the branch decision indicates otherwise:
- the instructions and all their associated data in the execution units must be purged, and
- the correct instructions must be fetched and executed.

Branch Prediction
If branch outcomes were random, half of all branches would be taken. So the simple approach of assuming that branches will not be taken saves the time lost to conditional branches about 50 percent of the time. Better performance can be achieved if we arrange for some branch instructions to be predicted as taken and others as not taken, depending on the expected program behavior.

Branch Prediction
Static branch prediction:
- the prediction is the same every time a given instruction is executed.

Dynamic branch prediction:
- the prediction may change depending on execution history.

Static Branch Prediction
Two approaches:
- Observe whether the target address of the branch is lower or higher than the address of the branch instruction; a backward branch, as at the end of a loop, is likely to be taken.
- Have the compiler decide whether a given branch instruction should be predicted taken or not taken:
  - the instruction includes a branch prediction bit, which is set to 0 or 1 by the compiler to indicate the desired behavior;
  - the instruction fetch unit checks this bit to predict whether the branch will be taken or not taken.

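A sketch of the first heuristic, predicting taken for backward branches (the addresses are hypothetical):

def predict_taken(branch_addr, target_addr):
    """Backward branch (e.g. loop-closing) -> predict taken."""
    return target_addr < branch_addr

print(predict_taken(0x1040, 0x1000))   # True: backward, likely a loop
print(predict_taken(0x1000, 0x1040))   # False: forward skip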
Dynamic Branch Prediction
The objective of branch prediction algorithms is to reduce the probability of making a wrong decision, and so to avoid fetching instructions that eventually have to be discarded. In dynamic branch prediction schemes, the processor hardware assesses the likelihood of a given branch being taken by keeping track of branch decisions every time that instruction is executed.

Dynamic Branch Prediction
In its simplest form, the execution history used in predicting the outcome of a given branch instruction is the result of the most recent execution of that instruction. The processor assumes that the next time the instruction is executed, the result is likely to be the same. Hence, the algorithm may be described by a two-state machine. The two states are:
- LT: branch is likely to be taken
- LNT: branch is likely not to be taken

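A minimal sketch of the two-state scheme (LNT is chosen as the initial state here; the slides do not specify one):

class OneBitPredictor:
    """Two states: LT (likely taken) and LNT (likely not taken)."""
    def __init__(self):
        self.state = "LNT"

    def predict(self):
        return self.state == "LT"

    def update(self, taken):           # remember only the last outcome
        self.state = "LT" if taken else "LNT"

p = OneBitPredictor()
for outcome in (True, True, False, True):
    print(p.predict(), "actual:", outcome)
    p.update(outcome)

Note that this predictor mispredicts twice around every loop: once at the loop exit and once again on the next entry.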
Dynamic Branch Prediction
Better performance can be achieved by keeping more information about execution history. An algorithm that uses four states, and thus requires two bits of history information for each branch instruction, does better. The four states are:
- ST: strongly likely to be taken
- LT: likely to be taken
- LNT: likely not to be taken
- SNT: strongly likely not to be taken

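The four-state scheme is commonly implemented as a 2-bit saturating counter; a minimal sketch (the numeric state encoding is an assumption):

class TwoBitPredictor:
    """Saturating counter: 0=SNT, 1=LNT, 2=LT, 3=ST."""
    STATES = ("SNT", "LNT", "LT", "ST")

    def __init__(self):
        self.counter = 1               # start in LNT

    def predict(self):
        return self.counter >= 2       # LT or ST -> predict taken

    def update(self, taken):
        self.counter = min(3, self.counter + 1) if taken else max(0, self.counter - 1)

p = TwoBitPredictor()
for outcome in (True, True, False, True):   # one odd outcome doesn't flip it
    p.update(outcome)
print(p.STATES[p.counter], "predict taken:", p.predict())   # ST predict taken: True

Unlike the two-state scheme, a single anomalous outcome, such as the final not-taken execution of a loop-closing branch, does not flip the next prediction; that is the benefit of keeping two bits of history.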
THANK YOU
