CA Slides#3 Pipeline Introduction
(Pipeline : Introduction)
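[Figure: Sequential laundry. Loads A, B, C and D are processed one after another in task order; each load spends 30, 40 and 20 min in the three stages.]
[Figure: Pipelined laundry. The time axis is divided into intervals of 30, 40, 40, 40, 40 and 20 min. Pipelined laundry takes 3.5 hours for 4 loads.]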
Space-Time Utilization of Pipeline
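[Figure: space-time diagrams. Non-overlapped execution: time axis 90, 90, 90, 90, one interval per load A, B, C, D, with the note "Each of A, B, C and D are initiated as soon as prev. ...". Fig. (b): Pipeline Execution, with stages W (30), D (40) and F (20) on the space axis and A, B, C, D passing through each stage in turn. A further diagram with stages S2 and S3 is captioned "Full pipeline after 4 cycles".]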
A takes 30 min at stage W and is then moved to stage D, while B is initiated at W.
B takes 30 min at W, but it must wait 10 min more until A is finished at D.
After a further 40 min, A is moved to F, B is moved to D and C is inserted in W.
At stage F, A is completed in 20 minutes and F remains idle for another 20
min, because B is moved to stage F only after its 40 min at stage D.
As soon as stage D becomes empty, C is moved to stage D from stage W.
Total time for pipelined execution = 30 + 40 x 4 + 20 = 210 min (3.5 hrs)
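The timing walk-through above can be checked with a small sketch. The function name `completion_times` and the modelling choice are assumptions made for illustration: a load that finishes a stage early simply waits until the next stage is free, which differs slightly from the slide (where B waits inside stage W) but gives the same finish times.

```python
def completion_times(stage_minutes, loads):
    """Return the finish time (in minutes) of every load.

    stage_free[j] is the time at which stage j last became free.
    A load enters stage j once it has left stage j-1 and stage j is free.
    """
    stage_free = [0] * len(stage_minutes)
    done = []
    for _ in range(loads):
        t = 0  # time at which this load is ready for its next stage
        for j, minutes in enumerate(stage_minutes):
            start = max(t, stage_free[j])
            t = start + minutes
            stage_free[j] = t
        done.append(t)
    return done


stages = [30, 40, 20]            # W, D, F stage times from the slide
finish = completion_times(stages, 4)
sequential_total = 4 * sum(stages)

print("finish times of A, B, C, D:", finish)
print("pipelined total :", finish[-1], "min")
print("sequential total:", sequential_total, "min")
```

The pipelined total matches 30 + 40 x 4 + 20, while the sequential total is simply four times the 90 min needed for one load.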
Pipelining : Points to Note
• If execution is non-overlapped, the k functional units are
underutilized because each unit is used only once every k cycles.
• If the Instruction Set Architecture is carefully designed, the functional
units can be arranged so that they execute in parallel.
• Pipelining overlaps the stages of execution so that every stage has
something to do in each cycle.
• Pipelining doesn't help the latency of a single task; it helps the throughput
of the entire workload.
• The pipeline rate is limited by the slowest pipeline stage.
• Multiple tasks operate simultaneously, each in a different stage.
• Potential speedup = number of pipe stages.
• Unbalanced lengths of pipe stages reduce the speedup.
• The time to "fill" the pipeline and the time to "drain" it reduce the speedup.
Static vs Dynamic Pipeline
Uni-Function vs Multi-Function Pipeline
Static pipeline: It can perform only one function at a time. Static
pipelines are linear and data flows among the stages in a serial
manner. Static pipelines are either unifunctional or multifunctional.
Dynamic pipeline: It can perform more than one function at a time.
Dynamic ones are non-linear and data flows among the stages with
feedback and feed-forward control paths. Dynamic pipelines are
always multifunctional.
Uni-function Pipeline: A pipeline unit with fixed and dedicated
function is called uni-function pipeline.
Multi-function Pipeline: A pipeline unit which delivers two or
more simultaneous functions is called multi-function pipeline.
Note: The difference between static and dynamic pipelines is the same as that
between linear and non-linear pipelines.
Asynchronous and Synchronous Pipeline
Asynchronous pipeline: It operates in a distributed manner, and data
transfer from one stage to the next takes place when both
communicating stages are ready, which they signal by exchanging
handshake (Ready, Ack) signals. It is mainly used in multiprocessor
systems with message passing.
Asynchronous vs Synchronous Pipeline
Linear Pipeline
A linear pipeline is a static pipeline where data flows as a stream
from the first stage (S1) to the last stage (Sk) in a linear sequence, i.e. a
sequence of subtasks is processed with linear precedence.
S1 → S2 → ... → Sk
Non-linear Pipeline
A non-linear pipeline is a dynamic pipeline which is made of
different pipelines that are present at different stages, and data
flows in a non-linear fashion.
• The different pipelines are connected to perform multiple
functions.
• It has feedback and feed-forward connections.
• It is built so that it performs various functions at different
time intervals.
Linear vs Non-linear Pipeline
Linear Pipeline | Non-linear Pipeline
In a linear pipeline, a series of processing stages are connected together in a serial manner. | In a non-linear pipeline, different pipelines are present at different stages.
A linear pipeline is also called a static pipeline, as it performs fixed functions. | A non-linear pipeline is also called a dynamic pipeline, as it performs different functions.
The output is always produced from the last block. | The output is not necessarily produced from the last block.
A linear pipeline has linear connections. | A non-linear pipeline has feedback and feed-forward connections.
It generates a single reservation table. | It can generate more than one reservation table.
It allows easy functional partitioning. | Functional partitioning is difficult in a non-linear pipeline.
Arithmetic Pipeline
The complex arithmetic operations like multiplication, and floating
point operations consume much of the time of the ALU. These
operations can also be pipelined by segmenting the operations of
the ALU and as a consequence, high speed performance may be
achieved. Thus, the pipelines used for arithmetic operations are
known as arithmetic pipelines.
Arithmetic pipelines are constructed for simple fixed-point and
floating-point arithmetic operations.
For implementing arithmetic pipelines we generally use the
following two types of adders:
i) Carry propagation adder (CPA): It adds two numbers such that the
carries generated in successive digits are propagated.
ii) Carry save adder (CSA): It adds numbers such that the carries
generated are not propagated; rather, they are saved in a carry
vector.
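The difference between the two adder types can be made concrete with a short sketch. The function names and the 8-bit default width are illustrative assumptions, and the carry-save adder is shown in its common three-input form, which produces a sum vector and a carry vector whose ordinary sum equals the sum of the three inputs.

```python
def carry_propagate_add(a, b, width=8):
    """Ripple-carry addition: the carry out of each bit feeds the next bit."""
    result, carry = 0, 0
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        result |= (x ^ y ^ carry) << i
        carry = (x & y) | (x & carry) | (y & carry)
    return result | (carry << width)  # final carry becomes the top bit


def carry_save_add(a, b, c):
    """Add three numbers bit by bit with no carry propagation.

    Returns (sum_vector, carry_vector); the carries are saved, shifted
    one position left, instead of rippling through the word.
    """
    sum_vector = a ^ b ^ c
    carry_vector = ((a & b) | (b & c) | (a & c)) << 1
    return sum_vector, carry_vector


s, c = carry_save_add(13, 11, 6)
print("CPA: 13 + 11 =", carry_propagate_add(13, 11))
print("CSA: sum vector =", s, "carry vector =", c, "total =", s + c)
```

In the CPA every bit position waits for the carry from the position below it; in the CSA all bit positions are computed independently, which is why CSAs suit the inner stages of an arithmetic pipeline.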
Fixed Arithmetic Pipeline-1
Ex: Compute Ai*Bi + Ci for i = 1,2, ……, 7
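R1 ← Ai, R2 ← Bi : Load Ai and Bi
R3 ← R1 * R2, R4 ← Ci : Multiply and load Ci
R5 ← R3 + R4 : Add

[Figure: inputs Ai, Bi and Ci. Segment S1 holds R1 and R2; the multiplier feeds segment S2, which holds R3 and R4; the adder feeds segment S3, which holds R5. Output Z = Ai * Bi + Ci.]

Clk No | Segment-1 (R1, R2) | Segment-2 (R3, R4) | Segment-3 (R5)
1 | A1, B1 | - | -
2 | A2, B2 | A1*B1, C1 | -
3 | A3, B3 | A2*B2, C2 | A1*B1+C1
4 | A4, B4 | A3*B3, C3 | A2*B2+C2
5 | A5, B5 | A4*B4, C4 | A3*B3+C3
6 | A6, B6 | A5*B5, C5 | A4*B4+C4
7 | A7, B7 | A6*B6, C6 | A5*B5+C5
8 | - | A7*B7, C7 | A6*B6+C6
9 | - | - | A7*B7+C7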
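The register transfers of this three-segment pipeline can be simulated clock by clock. This is a minimal sketch: the function name `run_pipeline` and the sample input values are assumptions, and `None` stands for a register that holds no valid data yet.

```python
def run_pipeline(a, b, c):
    """Clock a 3-segment pipeline computing a[i] * b[i] + c[i].

    Returns a list of (clock, R5) pairs for every clock on which
    R5 holds a finished result.
    """
    n = len(a)
    r1 = r2 = r3 = r4 = r5 = None
    results = []
    for clk in range(1, n + 3):  # k + n - 1 = 3 + n - 1 clocks in total
        # All registers load on the same clock edge, so every new value
        # is computed from the register contents of the previous clock.
        new_r5 = r3 + r4 if r3 is not None else None
        if r1 is not None:
            new_r3, new_r4 = r1 * r2, c[clk - 2]
        else:
            new_r3 = new_r4 = None
        if clk <= n:
            new_r1, new_r2 = a[clk - 1], b[clk - 1]
        else:
            new_r1 = new_r2 = None
        r1, r2, r3, r4, r5 = new_r1, new_r2, new_r3, new_r4, new_r5
        if r5 is not None:
            results.append((clk, r5))
    return results


A = [1, 2, 3, 4, 5, 6, 7]
B = [2, 2, 2, 2, 2, 2, 2]
C = [10, 10, 10, 10, 10, 10, 10]
for clk, value in run_pipeline(A, B, C):
    print(f"clock {clk}: R5 = {value}")
```

The first result appears in R5 at clock 3 and one more result follows on every clock, so seven inputs need 3 + 7 - 1 = 9 clocks.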
Fixed Arithmetic Pipeline-2
Ex: Multiplication of fixed-point numbers.
Two fixed-point numbers are multiplied by the ALU
using add and shift operations.
Sequential execution makes the multiplication slow.
Instead, add multiple copies of shifted multiplicands.
[Figure: pipeline stages separated by latches (L). Inputs X and Y feed S1, the shifted multiplicand generator, which produces P1 to P6; S2 contains CSA-1 and CSA-2; S3 contains CSA-3.]
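A rough sketch of the idea follows: six shifted multiplicands are generated, then reduced three at a time by carry-save adders until only two operands remain. The stages after S3 are not visible in the figure, so the later reduction steps and the final carry-propagate addition (modelled here by Python's built-in addition) are assumptions, as are the function names.

```python
def carry_save_add(a, b, c):
    """Reduce three operands to a (sum_vector, carry_vector) pair."""
    return a ^ b ^ c, ((a & b) | (b & c) | (a & c)) << 1


def multiply(x, y, bits=6):
    """Multiply x by a `bits`-bit multiplier y using a tree of CSAs."""
    # S1: one shifted multiplicand per multiplier bit (0 if the bit is 0).
    operands = [(x << i) if (y >> i) & 1 else 0 for i in range(bits)]
    # CSA stages: every group of three operands becomes two operands.
    while len(operands) > 2:
        full = len(operands) - len(operands) % 3
        reduced = []
        for i in range(0, full, 3):
            reduced.extend(carry_save_add(*operands[i:i + 3]))
        reduced.extend(operands[full:])  # operands left over this round
        operands = reduced
    # Final stage: a single carry-propagate addition of the last operands.
    return sum(operands)


print("45 * 27 =", multiply(45, 27))
```

With six partial products the first round uses two CSAs in parallel, which is consistent with CSA-1 and CSA-2 sharing stage S2 in the figure.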
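[Figure: four-stage instruction pipeline with stages IF, D, EX and WB. Instr 0 to Instr 4 each pass through IF, D, EX, WB in staggered fashion. The accompanying space-time diagram shows instructions A, B, C, D advancing through Stage 2: D, Stage 3: Ex and Stage 4: WB until they join the completed instructions.]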
Five-Stage Instruction Pipeline
Some classic RISC processors (MIPS, SPARC, Motorola 88000, etc.) fetch and try to
execute one instruction per cycle. Each of them uses a 5-stage instruction execution
pipeline. During operation, each pipeline stage works on one instruction at a time.
Each stage consists of a set of flip-flops to hold state and combinational
logic that operates on the outputs of those flip-flops.
IF = instruction fetch
D = instruction decode + register read
Ex = execute + address calculation
Mem = access memory location
WB = "write-back" results to registers
[Figure: timing diagram over time steps 1 to 9. Instructions I0 to I4 each pass through IF, D, EX, M, WB in staggered fashion.]
[Figure: space-time diagram over clock cycles 1 to 10. Waiting instructions A, B, C, D enter Stage 1: IF and advance through Stage 2: D, Stage 3: Ex, Stage 4: Mem and Stage 5: WB until they join the completed instructions.]
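A timing diagram of this kind can be generated with a few lines of code. The sketch below assumes an ideal pipeline with no stalls, in which instruction i enters IF in cycle i + 1; the function name `space_time_rows` is illustrative.

```python
STAGES = ["IF", "D", "EX", "M", "WB"]


def space_time_rows(num_instructions, stages=STAGES):
    """Row i lists the stage occupied by instruction i in each clock cycle.

    An empty string means the instruction is not in the pipeline
    during that cycle.
    """
    total_cycles = len(stages) + num_instructions - 1
    rows = []
    for i in range(num_instructions):
        row = [""] * total_cycles
        for s, name in enumerate(stages):
            row[i + s] = name  # instruction i reaches stage s in cycle i + s + 1
        rows.append(row)
    return rows


rows = space_time_rows(5)
print("    " + " ".join(f"{cycle:<3}" for cycle in range(1, len(rows[0]) + 1)))
for i, row in enumerate(rows):
    print(f"I{i}  " + " ".join(f"{cell:<3}" for cell in row))
```

Five instructions through five stages occupy 5 + 5 - 1 = 9 cycles, matching the time axis of the diagram.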
Pipeline Speedup
Speedup (S) from pipeline:
$$S = \frac{\text{Average unpipelined instruction time}}{\text{Average pipelined instruction time}}$$
Consider a k-segment pipeline with a clock cycle time tp
that executes n tasks.
The first task requires a time equal to k*tp to complete its
operation, since there are k segments in the pipeline.
The remaining n-1 tasks emerge from the pipe at a rate of one
task per clock cycle, so the time to finish them = (n-1)*tp.
Thus, the total time (Tk) for n tasks in the pipeline = (k+n-1)*tp.
Next, consider a non-pipelined unit that performs the same
operation and takes a time equal to tn to complete each task.
The total time (T1) required for n tasks in the non-pipelined unit is n*tn.
Pipeline Speedup Contd …
Thus, the speedup of pipelined processing over equivalent
non-pipelined processing is
$$S_k = \frac{T_1}{T_k} = \frac{n\,t_n}{(k+n-1)\,t_p} \qquad (1)$$
As the number of tasks increases, n becomes much larger than k
(n >> k) and k+n-1 approaches the value of n, i.e. (k+n-1) ≈ n. So,
the speedup becomes
$$S_k = \frac{n\,t_n}{n\,t_p} = \frac{t_n}{t_p} \qquad (2)$$
If we assume that the time taken to process a task is the same in
the pipelined and non-pipelined circuits, i.e. tn = k*tp, we have
$$S_k = \frac{k\,t_p}{t_p} = k \qquad (3)$$
Thus, the maximum speedup equals the number of stages of the pipeline.
Conversely, the speedup attains its lowest value, Sk = 1, when n = 1.
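The two limiting cases of the speedup formula can be checked numerically. This is a small sketch; the function name `speedup` and the sample values k = 5, tp = 1 and tn = k*tp = 5 are assumptions chosen for illustration.

```python
def speedup(n, k, tp, tn):
    """Speedup of a k-stage pipeline over a non-pipelined unit, equation (1)."""
    return (n * tn) / ((k + n - 1) * tp)


k, tp = 5, 1.0
tn = k * tp  # assumption behind equation (3): tn = k * tp

for n in (1, 10, 100, 10_000, 1_000_000):
    print(f"n = {n:>9}: S_k = {speedup(n, k, tp, tn):.5f}")
```

For n = 1 the speedup is exactly 1, and as n grows it climbs towards k without ever reaching it.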
Efficiency and Throughput
Efficiency (speedup per stage) of k-stage pipeline (assume tn = ktp):
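$$E_k = \frac{S_k}{k} = \frac{1}{k}\cdot\frac{n\,t_n}{(k+n-1)\,t_p} = \frac{n}{k+n-1} \qquad (4)$$
E_k = 1/k, the lowest obtainable value, when n = 1
E_k → 1 (maximum) when n → ∞
Pipeline throughput is defined as the number of tasks completed per unit
time, or the number of instructions per cycle (IPC), and is given by
$$H_k = \frac{n}{T_p} = \frac{n}{(k+n-1)\,t_p} = \frac{n f}{k+n-1} = \frac{E_k}{t_p} \qquad (5)$$
where T_p = (k+n-1) t_p is the total pipelined time (Tk above) and f = 1/t_p is the pipeline clock frequency.
H_k → 1/t_p = f when n → ∞,
i.e. 1 task (maximum) per pipeline cycle.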
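Equations (4) and (5) translate directly into code. In this sketch the function names and the sample values (k = 4 stages, tp = 100 ns) are assumptions; throughput is returned in tasks per second when tp is given in seconds.

```python
def efficiency(n, k):
    """Efficiency of a k-stage pipeline for n tasks, assuming tn = k * tp."""
    return n / (k + n - 1)


def throughput(n, k, tp):
    """Tasks completed per unit time; tp is the pipeline cycle time."""
    return n / ((k + n - 1) * tp)


k, tp = 4, 100e-9  # 4 stages, 100 ns pipeline cycle (sample values)
for n in (1, 10, 1000):
    e = efficiency(n, k)
    h = throughput(n, k, tp)
    print(f"n = {n:>4}: E_k = {e:.4f}, H_k = {h:,.0f} tasks/s, H_k * tp = {h * tp:.4f}")
```

Multiplying the throughput by tp gives tasks per pipeline cycle, which equals the efficiency, as equation (5) states.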
Pipeline Performance Parameters
Latency: the time for an instruction to complete.
Throughput: the number of instructions completed per pipeline cycle.
Clock Cycle: everything in the CPU moves in lockstep, synchronized
by the clock.
Pipeline Cycle (tp): time required to move an instruction
one step down the pipeline;
= time required to complete a pipe stage;
= max(times for completing all stages);
= one or two clock cycles, but rarely more.
CPI (cycles per instruction): number of cycles to process 1 instruction;
$$\text{CPI} = \frac{\text{Total no. of pipeline cycles } (T_p)}{\text{Total no. of instructions } (n)}$$
Pipeline Performance: Example
A task is divided into 4 subtasks (4 stages) with time: t1=60 ns,
t2=50 ns, t3=90 ns, t4=80 ns and latch delay = 10 ns.
Find the (i) ideal speedup, (ii) actual speedup, (iii) efficiency and (iv)
throughput when it is run for 1000 tasks.
Answer:
The k(=4)-stage pipeline cycle time (tp) = slowest stage + latch delay = 90+10 = 100 ns and
Non-pipelined execution time (tn) = 60+50+90+80 = 280 ns
(i) Ideal speedup (when n → ∞) = tn/tp = 280/100 = 2.8
Now, pipeline time to process n=1000 tasks
Tp = (1000 + 4-1)*100 ns = 1003*100 ns
Non-pipeline (sequential) time for 1000 tasks, Tn = 1000*280 ns
(ii) Actual speedup = Tn/Tp = (1000*280)/(1003*100) = 2.79
(iii) Efficiency = Sk/k = 2.79/4 = 0.6975 = 69.75% (ideal 2.8/4 = 70%)
(iv) Throughput (tasks per cycle) = 1000/1003 = 0.997 (ideal value 1)
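The worked example can be reproduced from the raw stage times. This is a sketch under the same assumptions as the answer above (pipeline cycle = slowest stage plus latch delay, no latch delay in the non-pipelined unit); the function name `pipeline_metrics` is illustrative.

```python
def pipeline_metrics(stage_ns, latch_ns, n):
    """Compute the quantities asked for in the example from raw stage times."""
    k = len(stage_ns)
    tp = max(stage_ns) + latch_ns   # pipeline cycle time
    tn = sum(stage_ns)              # non-pipelined time per task
    total_pipelined = (k + n - 1) * tp
    total_sequential = n * tn
    actual = total_sequential / total_pipelined
    return {
        "tp_ns": tp,
        "tn_ns": tn,
        "ideal_speedup": tn / tp,
        "actual_speedup": actual,
        "efficiency": actual / k,
        "throughput_per_cycle": n / (k + n - 1),
    }


metrics = pipeline_metrics([60, 50, 90, 80], latch_ns=10, n=1000)
for name, value in metrics.items():
    print(f"{name:>22}: {value:.4f}")
```

The unrounded efficiency comes out near 0.698; the 0.6975 in the answer results from dividing the already rounded speedup 2.79 by 4.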
Latency & Throughput
Clock cycle: 1 2 3 4 5 6 7 8 9 10
inst 1: IF ID EX MEM WB
inst 2: IF ID EX MEM WB
Answer:
Latency = CPI = 1 instruction in one clock cycle = 1
Throughput = 1