CA Slides#3 Pipeline Introduction

This document discusses pipelining in computer architecture. It begins with an introduction to pipelining, explaining that pipelining is a technique used to speed up processing by allowing multiple instructions to overlap in execution. It then provides an example of pipelining using an assembly line for laundry, showing how pipelining can reduce the total time to complete multiple loads from 6 hours to 3.5 hours. Finally, it covers various types of pipelines including static vs dynamic, linear vs non-linear, asynchronous vs synchronous, and arithmetic pipelines. It explains key concepts such as pipeline stages, pipeline rate, and how pipelining improves throughput.

Computer Architecture:

(Pipeline : Introduction)

Dr. Tapas Kumar Maiti, Dept. of CSE, CEM, Kolaghat
What is Pipelining?
 One way to speed up a CPU is to increase the clock rate
(but there are limits on how fast a clock can run)
 Another way is to execute more than one instruction at a time
 Pipelining is an implementation technique to speed up
processing in which multiple instructions are overlapped in
execution.
Pipelining is used in:
 Assembly lines
 Bucket brigades
 Fast food restaurants
Pipelining is used in other CS disciplines:
 Networking
 Server software architecture
Pipelining : Laundry Example
Ann, Brian, Cathy, and Dave (A, B, C, D) each have one load of
clothes to wash, dry, and fold. The washer takes 30 minutes, the
dryer takes 40 minutes, and the "folder" takes 20 minutes.

[Figure: task-order diagram of sequential laundry; loads A, B, C, D
each pass through wash (30), dry (40), and fold (20) one after the
other, with no overlap between loads]
Sequential laundry takes 6 hours for 4 loads


If they learned pipelining, how long would laundry take?
Pipelining : Example Contd …
[Figure: pipelined laundry from 6 PM to 9:30 PM; successive loads
overlap in the washer, dryer, and folder. After the first 30-minute
wash, the 40-minute dryer sets the pace, so a new load finishes
every 40 minutes.]

Pipelined laundry takes 3.5 hours for 4 loads.
Space-Time Utilization of Pipeline
Time  90 90 90 90  Each of A , B, C and D are
F (20) A B C D initiated as soon as prev.
Space

D (40) A B C D task is finished.


W (30) A B C D  Each takes 90 min ==>
total 90x4=360 min (4hrs).
Fig. (a) : Sequential Execution
Time (pipeline cycle) 
Time  30 40 40 40 40 20 Stage 1 2 3 4 5 6
F (20) A B C D  S1 A B C D
Space

D (40) A B C D S2 A B C D
W (30) A B C D S3 A B C D
Fig. (b) : Pipeline Execution Full pipeline after 4 cycles
 A takes 30 min at stage W  then moved to stage D and B is initiated at W.
 B takes 30 min at W. But, it must wait for 10 min more till A is finished at D.
 After 40 min, A is moved to F, B is moved to D and C is inserted in W.
 At stage F, A is completed in 20 minutes and F remains idle for another 20
min, as B is moved here (stage F) after 40 min at stage D.
 As soon as stage D becomes empty, C is moved here (stage D) from stage W.
 Total time for pipelined execution = 30 + 40 x 4 + 20 = 210 min (3.5 hrs)
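The laundry timings above can be checked with a short simulation (a sketch; the stage names and times are taken from the slides, and each load moves to the next stage as soon as both the load and that stage are free):

```python
# Sketch: verify the sequential vs. pipelined laundry times.
# Stages (minutes): Washer 30, Dryer 40, Folder 20; four loads A-D.
STAGES = [("W", 30), ("D", 40), ("F", 20)]
LOADS = ["A", "B", "C", "D"]

def sequential_time():
    # Each load runs all three stages before the next load starts.
    return len(LOADS) * sum(t for _, t in STAGES)

def pipelined_time():
    free = [0] * len(STAGES)   # free[i] = when stage i is available again
    done = 0
    for _ in LOADS:
        t = 0                  # time this load is ready for its next stage
        for i, (_, dur) in enumerate(STAGES):
            start = max(t, free[i])
            t = start + dur
            free[i] = t
        done = t
    return done

print(sequential_time())  # 360 min (6 hours)
print(pipelined_time())   # 210 min (3.5 hours)
```

The pipelined total matches the slide's 30 + 40 x 4 + 20 = 210 min: the slowest stage (the 40-minute dryer) sets the steady-state rate.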
Pipelining : Points to Note
 If execution is non-overlapped, the k functional units are
underutilized, because each unit is used only once every k cycles.
 If the Instruction Set Architecture is carefully designed, the
functional units can be arranged so that they execute in parallel.
 Pipelining overlaps the stages of execution so that every stage
has something to do each cycle.
 Pipelining doesn't help the latency of a single task; it helps the
throughput of the entire workload.
 The pipeline rate is limited by the slowest pipeline stage.
 Multiple tasks operate simultaneously.
 Potential speedup = number of pipe stages.
 Unbalanced lengths of pipe stages reduce the speedup.
 Time to "fill" the pipeline and time to "drain" it reduce the
speedup.
Static vs Dynamic Pipeline
Uni-Function vs Multi-Function Pipeline
Static pipeline: It can perform only one function at a time. Static
pipelines are linear, and data flows among the stages in a serial
manner. Static pipelines are either unifunctional or multifunctional.
Dynamic pipeline: It can perform more than one function at a time.
Dynamic pipelines are non-linear, and data flows among the stages
with feedback and feed-forward control paths. Dynamic pipelines are
always multifunctional.
Uni-function pipeline: A pipeline unit with a fixed and dedicated
function is called a uni-function pipeline.
Multi-function pipeline: A pipeline unit that delivers two or more
simultaneous functions is called a multi-function pipeline.
***The difference between static and dynamic pipelines is the same
as that between linear and non-linear pipelines.
Asynchronous and Synchronous Pipeline
Asynchronous pipeline: It operates in a distributed manner; data
transfer from one stage to the next takes place when both
communicating stages become ready, by exchanging handshake
(Ready, Ack) signals. Mainly used in multiprocessor systems with
message passing.

Synchronous pipeline: It operates in a synchronous manner where the
outputs (transfers) of all stages are latched simultaneously under
the control of a common global clock. Here, only one task or
operation enters the pipeline per cycle.
Asynchronous vs Synchronous Pipeline

Asynchronous Model                     Synchronous Model
-----------------------------------    -----------------------------------
Data flow between adjacent stages      Clocked latches are used to
is controlled by a handshaking         interface between stages.
protocol.
Different amounts of delay may be      Approximately equal delay is
experienced in different stages.       experienced in all stages.
Ready and Acknowledge signals are      No such signals are used for
used for communication.                communication.
Transfer of data to the various        All latches transfer data to the
stages is not simultaneous.            next stage simultaneously.
No concept of combinational            Pipeline stages are combinational
circuits is used in the pipeline       logic circuits.
stages.
Linear Pipeline
A linear pipeline is a static pipeline in which data flows in a
stream from the first stage (S1) to the last stage (Sk) in linear
sequence, i.e. a sequence of subtasks is processed with linear
precedence.

 The processing of data is done in a linear and sequential manner
 Data flows in streams from stage S1 to the final stage Sk
 Control of data flow: synchronous or asynchronous

S1 → S2 → … → Sk
Non-linear Pipeline
A non-linear pipeline is a dynamic pipeline made of different
pipelines that are present at different stages, with data flowing
in a non-linear fashion.
• The different pipelines are connected to perform multiple
functions.
• It has feedback and feed-forward connections.
• It is built so that it performs various functions at different
time intervals.
Linear vs Non-linear Pipeline
Linear Pipeline                        Non-linear Pipeline
-----------------------------------    -----------------------------------
A series of processing stages is       Different pipelines are present
connected together in a serial         at different stages.
manner.
Also called a static pipeline, as      Also called a dynamic pipeline, as
it performs fixed functions.           it performs different functions.
The output is always produced from     The output is not necessarily
the last block.                        produced from the last block.
Has linear connections.                Has feedback and feed-forward
                                       connections.
Generates a single reservation         Can generate more than one
table.                                 reservation table.
Allows easy functional                 Functional partitioning is
partitioning.                          difficult.
Arithmetic Pipeline
 Complex arithmetic operations like multiplication and floating-
point operations consume much of the ALU's time. These operations
can also be pipelined by segmenting the operations of the ALU, and
as a consequence high-speed performance may be achieved. The
pipelines used for arithmetic operations are known as arithmetic
pipelines.
 Arithmetic pipelines are constructed for:
 simple fixed-point and
 floating-point arithmetic operations.
 For implementing arithmetic pipelines we generally use the
following two types of adder:
 i) Carry propagation adder (CPA): It adds numbers such that the
carries generated in successive digits are propagated.
 ii) Carry save adder (CSA): It adds numbers such that the carries
generated are not propagated but are instead saved in a carry
vector.
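The CSA/CPA distinction can be sketched in a few lines (an illustration, not hardware; note that a carry-save adder is commonly described as a 3:2 compressor that reduces three operands to a sum vector plus a saved carry vector, with no carry propagation between bit positions):

```python
def carry_save_add(a, b, c):
    """3:2 compressor: reduce three operands to a sum vector and a
    carry vector without propagating carries between bit positions."""
    s = a ^ b ^ c                                # per-bit sum, no carry
    carry = ((a & b) | (b & c) | (a & c)) << 1   # saved carries, shifted
    return s, carry

def carry_propagate_add(a, b):
    """CPA: an ordinary add, where carries ripple through the digits."""
    return a + b

# Reduce three numbers with one CSA step, then finish with one CPA.
s, c = carry_save_add(11, 25, 39)
print(carry_propagate_add(s, c))  # 75, the same as 11 + 25 + 39
```

This is exactly the pattern the multiply pipeline on the next slide uses: layers of CSAs compress the partial products, and a single final CPA produces the product.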
Fixed Arithmetic Pipeline-1
Ex: Compute Ai*Bi + Ci for i = 1, 2, ..., 7

 Segment 1: R1 ← Ai, R2 ← Bi        (load Ai and Bi)
 Segment 2: R3 ← R1 * R2, R4 ← Ci   (multiply, and load Ci)
 Segment 3: R5 ← R3 + R4            (add)

Clk  Segment-1      Segment-2         Segment-3
No   R1    R2       R3       R4       R5
1    A1    B1
2    A2    B2       A1*B1    C1
3    A3    B3       A2*B2    C2       A1*B1+C1
4    A4    B4       A3*B3    C3       A2*B2+C2
5    A5    B5       A4*B4    C4       A3*B3+C3
6    A6    B6       A5*B5    C5       A4*B4+C4
7    A7    B7       A6*B6    C6       A5*B5+C5
8                   A7*B7    C7       A6*B6+C6
9                                     A7*B7+C7

[Figure: three-segment pipeline datapath; S1 latches Ai, Bi into
R1, R2; S2 multiplies into R3 and latches Ci into R4; S3 adds into
R5, giving Z = Ai * Bi + Ci]
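The clock-by-clock table above can be reproduced with a small simulation (a sketch, not hardware; the register names follow the slide, and the concrete values of Ai, Bi, Ci are arbitrary):

```python
# Sketch: simulate the 3-segment pipeline computing Ai*Bi + Ci, i = 1..7.
A = [1, 2, 3, 4, 5, 6, 7]
B = [10, 20, 30, 40, 50, 60, 70]
C = [5, 5, 5, 5, 5, 5, 5]

r1 = r2 = r3 = r4 = r5 = None    # pipeline registers, empty at start
results = []
n, k = len(A), 3
for clk in range(n + k - 1):     # k + n - 1 = 9 cycles to drain
    # Latch all segments "simultaneously": evaluate right-to-left so
    # each segment consumes the values produced in the previous cycle.
    if r3 is not None:
        r5 = r3 + r4             # Segment 3: add
        results.append(r5)
    if r1 is not None:
        r3, r4 = r1 * r2, C[clk - 1]   # Segment 2: multiply, load Ci
    else:
        r3 = r4 = None
    if clk < n:
        r1, r2 = A[clk], B[clk]  # Segment 1: load Ai, Bi
    else:
        r1 = r2 = None           # no more tasks: let the pipe drain

print(results)  # [15, 45, 95, 165, 255, 365, 495]
```

All 7 results emerge in k + n - 1 = 9 cycles, matching the reservation table.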
Fixed Arithmetic Pipeline-2
Ex: Multiplication of fixed-point numbers, P = X * Y.
 Two fixed-point numbers are added by the ALU using add and shift
operations; sequential execution makes the multiplication slow.
 Instead, add multiple copies of shifted multiplicands:
 S1 generates the partial products P1 - P6.
 S2 merges all PPs (P1 - P6) into 4 numbers via 2 CSAs.
 S3 uses a single CSA to merge these 4 numbers into 3.
 S4 uses a single CSA to merge these 3 numbers into 2.
 Lastly, S5 adds these two numbers through a CPA to get the final
product P = X * Y.

[Figure: five-stage multiply pipeline; S1: shifted-multiplicand
generator (P1 - P6); S2: CSA-1 and CSA-2; S3: CSA-3; S4: CSA-4;
S5: CPA; latches (L) between stages]
Floating Pt Arithmetic Pipeline
Add/subtract two normalized FP numbers:
X = A x 2^a = 0.9504 x 10^3
Y = B x 2^b = 0.8200 x 10^2

Four segments of sub-operations:
1) Compare exponents by subtraction: 3 - 2 = 1
2) Align mantissas:
   X = 0.9504 x 10^3
   Y = 0.08200 x 10^3
3) Add mantissas:
   Z = 1.0324 x 10^3
4) Normalize result:
   Z = 0.10324 x 10^4

[Figure: four-segment FP add/subtract pipeline; Seg1: compare
exponents (subtract); Seg2: choose exponent, align mantissa; Seg3:
add/subtract mantissas; Seg4: adjust exponent (normalize); latches R
between segments]
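The four sub-operations can be sketched in software on (mantissa, exponent) pairs. This is an illustration only: a decimal base and the slide's example values are assumed, and real FP hardware works in binary with guard and rounding logic.

```python
# Sketch: the four pipeline segments of a decimal FP adder operating
# on (mantissa, exponent) pairs, base 10, as in the slide's example.
def fp_add(x, y):
    (ma, ea), (mb, eb) = x, y
    # Seg 1: compare exponents by subtraction
    diff = ea - eb
    # Seg 2: choose the larger exponent and align the smaller mantissa
    if diff >= 0:
        e, mb = ea, mb / (10 ** diff)
    else:
        e, ma = eb, ma / (10 ** -diff)
    # Seg 3: add the mantissas
    m = ma + mb
    # Seg 4: normalize the result (mantissa in [0.1, 1))
    while m >= 1.0:
        m, e = m / 10, e + 1
    while 0 < m < 0.1:
        m, e = m * 10, e - 1
    return round(m, 6), e

X = (0.9504, 3)   # 0.9504 x 10^3
Y = (0.8200, 2)   # 0.8200 x 10^2
print(fp_add(X, Y))  # (0.10324, 4), i.e. 0.10324 x 10^4
```

In the pipelined unit each segment handles a different pair of operands every cycle; here the segments simply run back to back on one pair.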
Four-Stage Instruction Pipeline
 Break execution of each instruction down into several smaller
steps.
 This enables a higher clock frequency, since only a simple, short
operation is done by each part of the pipeline each clock.

IF = instruction fetch
D  = instruction decode + register read
EX = execute
WB = "write-back" results to registers

Latency: 1 instruction takes 4 cycles.
Throughput: 1 instruction completed per cycle (once the pipeline is
full).
CPI: 1 cycle per instruction.

[Figure: four-stage pipeline timing over 8 clock cycles;
instructions Instr 0 .. Instr 4 each pass through IF, D, EX, WB in
successive cycles, overlapped so a new instruction enters the
pipeline every cycle while earlier ones complete]
Five-Stage Instruction Pipeline
Some classic RISC processors (MIPS, SPARC, Motorola 88000, etc.)
fetch and try to execute one instruction per cycle, using a 5-stage
instruction execution pipeline. During operation, each pipeline
stage works on one instruction at a time. Each stage consists of a
set of flip-flops to hold state, plus combinational logic that
operates on the outputs of those flip-flops.

IF  = instruction fetch
D   = instruction decode + register read
EX  = execute + address calculation
MEM = access memory location
WB  = "write-back" results to registers

[Figure: five-stage pipeline timing over 10 clock cycles;
instructions I0 .. I4 each pass through IF, D, EX, MEM, WB,
overlapped so a new instruction enters the pipeline every cycle]
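The overlapped timing of the classic five-stage pipeline can be generated as a cycle table (a sketch; one new instruction is assumed to enter the pipeline every cycle with no stalls):

```python
# Sketch: cycle-by-cycle stage occupancy for a stall-free 5-stage pipeline.
STAGES = ["IF", "D", "EX", "MEM", "WB"]

def schedule(n_instr):
    """Return {(instr, stage): cycle} for n_instr instructions,
    with instruction i entering IF in cycle i + 1 (cycles 1-based)."""
    return {(i, s): i + j + 1
            for i in range(n_instr)
            for j, s in enumerate(STAGES)}

sched = schedule(5)
# Instruction 0 occupies IF in cycle 1 and WB in cycle 5.
print(sched[(0, "IF")], sched[(0, "WB")])   # 1 5
# Total cycles for n instructions: k + n - 1 = 5 + 5 - 1 = 9.
print(max(sched.values()))                  # 9
```

The `k + n - 1` total is the same fill-plus-drain count used in the speedup derivation on the next slides.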
Pipeline Speedup
Speedup (S) from pipeline = (average unpipelined instruction time) /
(average pipelined instruction time)
 Consider a k-segment pipeline with clock cycle time tp executing n
tasks.
 The first task requires a time equal to k*tp to complete, since
there are k segments in the pipeline.
 The remaining n-1 tasks emerge from the pipe at a rate of one task
per clock cycle, so the time to finish them is (n-1)*tp.
 Thus, the total time (Tk) for n tasks in the pipeline is
(k+n-1)*tp.
 Next, consider a non-pipelined unit that performs the same
operation and takes a time tn to complete each task.
 The total time (T1) required for n tasks without pipelining is
n*tn.
Pipeline Speedup Contd …
 Thus, the speedup of pipelined processing over equivalent
non-pipelined processing is

    Sk = T1 / Tk = n*tn / ((k+n-1)*tp)            ... (1)

 As the number of tasks increases, n becomes much larger than k
(n >> k), and k+n-1 approaches n, i.e. (k+n-1) ≈ n. So the speedup
becomes

    Sk = n*tn / (n*tp) = tn / tp                  ... (2)

 If we assume the time taken to process a task is the same in the
pipelined and non-pipelined circuits, i.e. tn = k*tp, we get

    Sk = k*tp / tp = k                            ... (3)

 Thus, the maximum speedup equals the number of pipeline stages.
The speedup attains its lowest value, Sk = 1, when n = 1.
Efficiency and Throughput
 Efficiency (speedup per stage) of a k-stage pipeline (assuming
tn = k*tp):

    Ek = Sk / k = (1/k) * n*tn / ((k+n-1)*tp) = n / (k+n-1)   ... (4)

       = 1/k, the lowest obtainable value, when n = 1
       = 1 (maximum) when n → ∞

 Pipeline throughput is defined as the number of tasks completed
per unit cycle time, or the number of instructions per cycle (IPC):

    Hk = n / Tk = n / ((k+n-1)*tp) = n*f / (k+n-1) = Ek / tp  ... (5)

       = 1/tp = f when n → ∞
       = 1 task (maximum) per pipeline cycle
Pipeline Performance Parameters
 Latency: the time for an instruction to complete.
 Throughput: the number of instructions completed per unit pipeline
cycle.
 Clock cycle: everything in the CPU moves in lockstep, synchronized
by the clock.
 Pipeline cycle (tp): the time required to move an instruction one
step down the pipeline;
   = time required to complete a pipe stage;
   = max(times for completing all stages);
   = one or two clock cycles, but rarely more.
 CPI (cycles per instruction): the number of cycles to process one
instruction;

    CPI = (total no. of pipeline cycles) / (total no. of instructions, n)
Pipeline Performance: Example
A task is divided into 4 subtasks (4 stages) with time: t1=60 ns,
t2=50 ns, t3=90 ns, t4=80 ns and latch delay = 10 ns.
Find the (i) ideal speedup, (ii) actual speedup, (iii) efficiency and (iv)
throughput when it is run for 1000 tasks.
Answer:
The k(=4)-stage pipeline cycle time tp = 90 + 10 = 100 ns, and the
non-pipelined execution time per task tn = 60+50+90+80 = 280 ns.
(i) Ideal speedup (when n → ∞) = tn/tp = 280/100 = 2.8
Pipeline time to process n = 1000 tasks:
    Tk = (1000 + 4 - 1) * 100 ns = 1003 * 100 ns
Non-pipelined (sequential) time for 1000 tasks: Tn = 1000 * 280 ns
(ii) Actual speedup = Tn/Tk = (1000*280) / (1003*100) = 2.79
(iii) Efficiency = Sk/k = 2.79/4 = 0.6975 = 69.75% (ideal: 2.8/4 = 70%)
(iv) Throughput = 1000/1003 = 0.997 tasks per cycle (ideal value: 1)
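The numbers above can be reproduced directly from the formulas on the previous slides (a sketch; note that computing the efficiency from the unrounded speedup gives about 0.6979, whereas the slide divides the rounded 2.79 by 4 to get 0.6975):

```python
# Sketch: verify the worked example using the slide's formulas.
stage_times = [60, 50, 90, 80]       # ns
latch = 10                           # ns latch delay
n, k = 1000, len(stage_times)

tp = max(stage_times) + latch        # pipeline cycle = slowest stage + latch
tn = sum(stage_times)                # non-pipelined time per task

ideal_speedup = tn / tp              # limit as n -> infinity
Tk = (k + n - 1) * tp                # pipelined total time
Tn = n * tn                          # sequential total time
speedup = Tn / Tk
efficiency = speedup / k             # ~0.6979 unrounded
throughput = n / (k + n - 1)         # tasks per pipeline cycle

print(tp, tn)                        # 100 280
print(ideal_speedup)                 # 2.8
print(round(speedup, 2))             # 2.79
print(round(throughput, 3))          # 0.997
```

Because n >> k, the actual speedup (2.79) is already very close to the ideal tn/tp = 2.8.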
Latency & Throughput
Clock cycle:  1   2   3   4   5   6   7   8   9   10
inst 1:       IF  ID  EX  MEM WB
inst 2:                           IF  ID  EX  MEM WB

 Latency: the time it takes for an individual instruction to
execute.
   What's the latency of this (non-pipelined) implementation?
   One instruction takes 5 clock cycles.
   Cycles per instruction (CPI) = 5, so latency = 5 cycles.
 Throughput: the number of instructions that execute per unit
time.
   What's the throughput of this implementation?
   One instruction is completed every 5 clock cycles.
   Average CPI = 5 ==> throughput = 1/5 instruction per cycle.
Pipelined Latency & Throughput
Clock cycle:  1   2   3   4   5   6   7   8   9
inst 1:       IF  ID  EX  MEM WB
inst 2:           IF  ID  EX  MEM WB
inst 3:               IF  ID  EX  MEM WB
inst 4:                   IF  ID  EX  MEM WB
inst 5:                       IF  ID  EX  MEM WB

 What's the latency of this implementation?
 What's the throughput of this implementation?

Answer:
Latency: each instruction still takes 5 clock cycles; pipelining
does not reduce the latency of a single instruction.
Throughput: one instruction completes every cycle once the pipeline
is full, so average CPI = 1 and throughput = 1 instruction per
cycle.
THANK YOU

