Lec4 - ILP Pipelining Intro
Pipelining: Basic Idea
- More systematically:
  - Pipeline the execution of multiple instructions
  - Analogy: “Assembly line processing” of instructions
- Idea:
  - Divide the instruction processing cycle into distinct “stages” of processing
  - Ensure there are enough hardware resources to process one instruction in each stage
  - Process a different instruction in each stage
    - Instructions consecutive in program order are processed in consecutive stages
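The idea above can be sketched as a small simulation. This is a minimal sketch, not the lecture's own code: the five stage names, the instruction labels, and the function name `run_pipeline` are illustrative assumptions.

```python
from collections import deque

# Assumed stage names, for illustration only.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def run_pipeline(instructions, stages=STAGES):
    """Advance the pipeline one cycle at a time.

    Returns one snapshot per cycle: a dict mapping each stage name to the
    instruction it holds in that cycle (None if the stage is empty).
    """
    pending = deque(instructions)      # instructions not yet fetched, in program order
    slots = [None] * len(stages)       # slots[i] = instruction currently in stage i
    snapshots = []
    while pending or any(s is not None for s in slots):
        # Every instruction moves to the next stage; a new one enters the first stage.
        entering = pending.popleft() if pending else None
        slots = [entering] + slots[:-1]
        if any(s is not None for s in slots):
            snapshots.append(dict(zip(stages, slots)))
    return snapshots


for cycle, snap in enumerate(run_pipeline(["I0", "I1", "I2"])):
    print(cycle, snap)
```

Each stage holds a different instruction, and instructions that are consecutive in program order sit in consecutive stages. With 3 instructions and 5 stages the run takes 3 + 5 - 1 = 7 cycles.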
The Laundry Analogy
[Figure: laundry timelines; time axis 6 PM to 2 AM; task order A, B, C, D]
- “when folding is finished, ask your roommate (??) to put the clothes away”
- 4 loads of laundry in parallel
- no additional resources
- throughput increased by 4
- latency per load is the same
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
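The throughput and latency claims of the laundry analogy can be checked with a little arithmetic. This is a minimal sketch under stated assumptions: four steps (wash, dry, fold, put away) of 30 minutes each, which is not given on the slide itself; the function names are hypothetical.

```python
def sequential_minutes(n_loads, stage_minutes):
    """One load finishes all of its steps before the next load starts."""
    return n_loads * sum(stage_minutes)


def pipelined_minutes(n_loads, stage_minutes):
    """Loads overlap: after the first load, one more load finishes
    every max(stage_minutes) minutes."""
    return sum(stage_minutes) + (n_loads - 1) * max(stage_minutes)


# Assumption: wash, dry, fold, put away take 30 minutes each.
steps = [30, 30, 30, 30]

print(sequential_minutes(4, steps))  # 480
print(pipelined_minutes(4, steps))   # 210
print(sum(steps))                    # 120: latency of one load, the same in both cases
```

Under these assumptions the factor of 4 is a steady-state figure: one load completes every 30 minutes instead of every 120. For only four loads the total time drops from 480 to 210 minutes, because the pipeline takes time to fill.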
Pipelining Multiple Loads of Laundry: In Practice
[Figure: laundry timelines; time axis 6 PM to 2 AM; task order A, B, C, D]
- the slowest step decides throughput
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
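The point that the slowest step decides throughput can be sketched as follows. The step durations are assumptions chosen for illustration (a dryer that takes 60 minutes while the other steps take 30), and `loads_per_hour` is a hypothetical helper.

```python
def loads_per_hour(stage_minutes):
    """Steady-state throughput: a load can finish only as often as the
    slowest step can accept a new one."""
    return 60 / max(stage_minutes)


balanced = [30, 30, 30, 30]    # assumed: wash, dry, fold, put away
slow_dryer = [30, 60, 30, 30]  # assumed: drying takes twice as long

print(loads_per_hour(balanced))    # 2.0
print(loads_per_hour(slow_dryer))  # 1.0
```

Speeding up any step other than the slowest one leaves the result unchanged, which is why unbalanced stages waste the faster resources.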
Pipelining Multiple Loads of Laundry: In Practice
[Figure: laundry timelines; time axis 6 PM to 2 AM; task order A, B, C, D]
- throughput restored (2 loads per hour) using 2 dryers
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
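Replicating the slow resource can be modeled by dividing that step's time by the number of copies. This is a rough sketch assuming a 60-minute drying step, 30 minutes for the others, and loads alternating between the two dryers; `loads_per_hour_with_copies` is a hypothetical name.

```python
def loads_per_hour_with_copies(stage_minutes, copies):
    """Throughput when step i has copies[i] identical machines.

    With c machines taking t minutes each, the step can accept a new load
    every t / c minutes on average.
    """
    effective = [t / c for t, c in zip(stage_minutes, copies)]
    return 60 / max(effective)


steps = [30, 60, 30, 30]  # assumed: wash, dry, fold, put away

print(loads_per_hour_with_copies(steps, [1, 1, 1, 1]))  # 1.0
print(loads_per_hour_with_copies(steps, [1, 2, 1, 1]))  # 2.0
```

The second dryer costs extra hardware but does not shorten the latency of a single load, which still spends 60 minutes drying.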
An Ideal Pipeline
- Goal: Increase throughput with little increase in cost (hardware cost, in case of instruction processing)
More Realistic Pipeline: Throughput
- Nonpipelined version with delay T
  BW = 1/(T+S) where S = latch delay
[Figure: logic block with delay T ps, and the same logic split into stages of T/k ps each]
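The effect of the latch delay on throughput can be explored numerically. The nonpipelined formula is the one on the slide; the k-stage formula 1/(T/k + S) is an assumption that follows from the figure's T/k stage delay plus one latch per stage. The values of T and S are hypothetical.

```python
def bw_nonpipelined(T, S):
    """Throughput of the unpipelined logic: one result every T + S."""
    return 1 / (T + S)


def bw_pipelined(T, S, k):
    """Assumed k-stage throughput: each stage takes T/k plus the latch delay S."""
    return 1 / (T / k + S)


def speedup(T, S, k):
    return bw_pipelined(T, S, k) / bw_nonpipelined(T, S)


T, S = 800.0, 20.0  # hypothetical delays in ps
for k in (1, 2, 5, 10, 40):
    print(k, round(speedup(T, S, k), 2))
```

Under this model the speedup is always below k, and it is bounded by (T+S)/S no matter how many stages are used, because the latch delay does not shrink with k.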
More Realistic Pipeline: Cost
- Nonpipelined version with combinational cost G
  Cost = G+L where L = latch cost
[Figure: logic block of G gates, and the same logic split into stages of G/k gates each]
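The cost side can be sketched the same way. The nonpipelined cost is the slide's formula; the k-stage cost G + k*L is an assumption (one set of latches per stage, combinational logic unchanged). The gate counts are hypothetical.

```python
def cost_nonpipelined(G, L):
    """Combinational logic plus one set of latches."""
    return G + L


def cost_pipelined(G, L, k):
    """Assumed k-stage cost: same logic, one set of latches per stage."""
    return G + k * L


G, L = 10_000, 500  # hypothetical gate counts
print(cost_nonpipelined(G, L))   # 10500
print(cost_pipelined(G, L, 5))   # 12500
```

In this model the cost grows linearly with k, so deeper pipelines pay more and more for latches.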
Pipelining Instruction Processing
Remember: The Instruction Processing Cycle
Remember the Single-Cycle Uarch
[Figure: single-cycle datapath with control signals RegDst, Jump, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, RegWrite; PCSrc1 = Jump, PCSrc2 = Br Taken]
- Delay T, BW = ~(1/T)
Based on original figure from [P&H CO&D, COPYRIGHT 2004
Elsevier. ALL RIGHTS RESERVED.]
Dividing Into Stages
Stage                                           Delay
IF:  Instruction fetch                          200ps
ID:  Instruction decode/register file read      100ps
EX:  Execute/address calculation                200ps
MEM: Memory access                              200ps
WB:  Write back                                 100ps
[Figure: single-cycle datapath divided into the five stages above; one part is marked "ignore for now"]
[Figure: nonpipelined timing; lw $2, 200($0), then lw $3, 300($0), ...; each passes through Instruction fetch, Reg, ALU, Data access, Reg; a new instruction starts every 800ps]
[Figure: pipelined timing; the same lw sequence overlapped; a new instruction starts every 200ps]
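The 800ps and 200ps intervals follow from the stage delays: the single-cycle design needs the sum of all stage delays per instruction, while the pipelined clock is set by the slowest stage. A minimal sketch of that arithmetic, assuming no stalls and ignoring latch overhead (the function names are hypothetical):

```python
# Stage delays in ps, as listed for the five stages.
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}


def single_cycle_total_ps(n_instructions):
    """Each instruction takes one long cycle equal to the sum of all stage delays."""
    return n_instructions * sum(STAGE_PS.values())


def pipelined_total_ps(n_instructions):
    """Clock period = slowest stage; the pipeline needs (stages - 1) extra cycles to fill."""
    clock = max(STAGE_PS.values())
    return (len(STAGE_PS) + n_instructions - 1) * clock


print(single_cycle_total_ps(3))  # 2400
print(pipelined_total_ps(3))     # 1400
```

For long instruction sequences the ratio approaches 800/200 = 4 rather than 5, because the ID and WB stages are shorter than the clock period and sit idle for part of each cycle.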
[Figure: pipelined datapath; stages IF (Instruction fetch), ID (Instruction decode/register file read), EX (Execute/address calculation), MEM (Memory access), WB (Write back), separated by pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB; labeled values PC_F, IR_D, PC_D+4, PC_E+4, A_E, Imm_E, nPC_M, Aout_M, B_M, MDR_W, Aout_W; stage delay T/k ps versus T ps]
- No resource is used by more than 1 stage!
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
[Figure: pipelined datapath shown once per stage as a lw instruction moves through Instruction fetch, Instruction decode, Execution, Memory, Write back]
- All instruction classes must follow the same path and timing through the pipeline stages.
- Any performance impact?
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
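One way to approach the performance question is to compare the work each instruction class needs with the time it spends in the pipeline. This is a rough sketch: the stage delays are those given for the five stages, but the table of which class uses which stage is a textbook-style assumption, not something stated on the slide.

```python
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

# Assumption: stages in which each instruction class does useful work.
USES = {
    "lw":     ["IF", "ID", "EX", "MEM", "WB"],
    "sw":     ["IF", "ID", "EX", "MEM"],
    "R-type": ["IF", "ID", "EX", "WB"],
    "beq":    ["IF", "ID", "EX"],
}


def needed_ps(instr_class):
    """Time the class would need if it only paid for the stages it uses."""
    return sum(STAGE_PS[s] for s in USES[instr_class])


def pipelined_latency_ps():
    """Every class passes through every stage at the slowest stage's clock."""
    return len(STAGE_PS) * max(STAGE_PS.values())


for instr_class in USES:
    print(instr_class, needed_ps(instr_class), pipelined_latency_ps())
```

Under these assumptions every instruction has a latency of 1000ps in the pipeline, more than any class strictly needs, so pipelining trades per-instruction latency for throughput.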
[Figure: pipelined datapath over Clock 1 to Clock 6, executing lw $10, 20($1) followed by sub $11, $2, $3; each clock shows which of Instruction fetch, Instruction decode, Execution, Memory, Write back each instruction occupies]
        t0   t1   t2   t3   t4   t5
Inst0   IF   ID   EX   MEM  WB
Inst1        IF   ID   EX   MEM  WB
Inst2             IF   ID   EX   MEM  WB
Inst3                  IF   ID   EX   MEM  WB
Inst4                       IF   ID   EX   MEM
                                 IF   ID   EX
                                      IF   ID
                                           IF
steady state (full pipeline) begins at t4
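The instruction-by-instruction table above can be generated from one rule: instruction i is in stage j at time i + j. A minimal sketch, assuming one instruction enters per time step and nothing stalls; `operation_view` is a hypothetical helper.

```python
def operation_view(n_instructions, stages):
    """Return one row per instruction; row[t] is the stage that instruction
    occupies at time step t, or '' if it is not in the pipeline."""
    n_steps = n_instructions + len(stages) - 1
    rows = []
    for i in range(n_instructions):
        row = [""] * n_steps
        for j, stage in enumerate(stages):
            row[i + j] = stage  # instruction i reaches stage j at time i + j
        rows.append(row)
    return rows


stages = ["IF", "ID", "EX", "MEM", "WB"]
for i, row in enumerate(operation_view(5, stages)):
    print(f"Inst{i}", " ".join(f"{cell:<3}" for cell in row))
```

Each row is the previous row shifted right by one time step, which is the staircase pattern of the table.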
Illustrating Pipeline Operation: Resource View
      t0   t1   t2   t3   t4   t5   t6   t7   t8   t9   t10
IF    I0   I1   I2   I3   I4   I5   I6   I7   I8   I9   I10
ID         I0   I1   I2   I3   I4   I5   I6   I7   I8   I9
EX              I0   I1   I2   I3   I4   I5   I6   I7   I8
MEM                  I0   I1   I2   I3   I4   I5   I6   I7
WB                        I0   I1   I2   I3   I4   I5   I6
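The resource view is the same schedule read per stage: at time t, stage j holds instruction t - j once that index is non-negative. A minimal sketch, assuming an unbounded instruction stream and no stalls; both function names are hypothetical.

```python
def resource_view(n_steps, stages):
    """Map each stage to the instruction index it holds at t0..t(n_steps-1).
    None means the stage is still idle while the pipeline fills."""
    return {
        stage: [t - j if t >= j else None for t in range(n_steps)]
        for j, stage in enumerate(stages)
    }


def first_full_step(stages):
    """First time step at which every stage holds an instruction."""
    return len(stages) - 1


stages = ["IF", "ID", "EX", "MEM", "WB"]
view = resource_view(11, stages)
print(view["WB"])               # [None, None, None, None, 0, 1, 2, 3, 4, 5, 6]
print(first_full_step(stages))  # 4
```

From t4 onward every stage is busy in each time step, so one instruction completes per step.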
Remember: An Ideal Pipeline
- Goal: Increase throughput with little increase in cost (hardware cost, in case of instruction processing)