
Parallel Computing and Programming
Lecture 4: Instruction Level Parallelism: Pipelining Intro

Dr. Rony Kassam
IEF Tishreen Uni
S1 - 2021
Index
- Von Neumann vs Dataflow Models
- ISA vs Microarchitecture
- Single-cycle vs Multi-cycle Microarchitectures
- Instruction Level Parallelism: Pipelining Intro
- Instruction Level Parallelism: Issues in Pipeline Design
- Thread Level Parallelism: Data Dependence Solutions
- Thread Level Parallelism: Shared Memory and OpenMP
Pipelining: Basic Idea
- More systematically: pipeline the execution of multiple instructions
  - Analogy: "assembly line processing" of instructions
- Idea:
  - Divide the instruction processing cycle into distinct "stages" of processing
  - Ensure there are enough hardware resources to process one instruction in each stage
  - Process a different instruction in each stage
    - Instructions consecutive in program order are processed in consecutive stages
- Benefit: increases instruction processing throughput (1/CPI)
- Downside: start thinking about this...
Example: Execution of Four Independent ADDs
- Multi-cycle: 4 cycles per instruction

  F D E W
          F D E W
                  F D E W
                          F D E W
  ------------------------------> Time

- Pipelined: 4 cycles per 4 instructions (steady state)

  F D E W
    F D E W
      F D E W
        F D E W
  ----------> Time

Is life always this beautiful?
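The cycle counts in this example can be checked with a short sketch (the helper names are mine, not from the lecture):

```python
# Cycle counts for n independent instructions on a k-stage machine.

def multicycle_cycles(n_insts: int, k_stages: int) -> int:
    # Non-pipelined multi-cycle: each instruction finishes all k stages
    # before the next one starts.
    return n_insts * k_stages

def pipelined_cycles(n_insts: int, k_stages: int) -> int:
    # Pipelined: the first instruction takes k cycles to fill the pipe;
    # each later one completes exactly one cycle after its predecessor.
    return k_stages + (n_insts - 1)

# Four independent ADDs through F, D, E, W (k = 4):
print(multicycle_cycles(4, 4))  # 16 cycles
print(pipelined_cycles(4, 4))   # 7 cycles; ~1 instruction/cycle in steady state
```

In steady state the pipelined machine retires one instruction per cycle, which is where the throughput benefit comes from.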
The Laundry Analogy
[Figure: four laundry loads A-D done one after another on a 6 PM-2 AM timeline. Based on original figure from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

- "Place one dirty load of clothes in the washer."
- "When the washer is finished, place the wet load in the dryer."
- "When the dryer is finished, take out the dry load and fold."
- "When folding is finished, ask your roommate (??) to put the clothes away."

- Steps to do a load are sequentially dependent
- No dependence between different loads
- Different steps do not share resources
Pipelining Multiple Loads of Laundry
[Figure: loads A-D overlapped on the 6 PM-2 AM timeline, each load starting as soon as the washer frees up. Based on original figure from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

- 4 loads of laundry in parallel
- No additional resources
- Throughput increased by 4x
- Latency per load is the same
Pipelining Multiple Loads of Laundry: In Practice
[Figure: overlapped loads A-D where the dryer step takes longer than the other steps, delaying every following load. Based on original figure from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

The slowest step decides throughput.
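The "slowest step" rule can be made concrete with a toy model; the step times below are illustrative assumptions, not numbers from the slide:

```python
# Steady-state throughput of a synchronized pipeline is set by the
# slowest stage: every stage advances once per "cycle" of that length.
# The per-step minutes here are made up for illustration.

stage_minutes = {"wash": 30, "dry": 60, "fold": 15, "put away": 15}

cycle = max(stage_minutes.values())  # 60 min: the dryer bounds everything
print(60 / cycle)                    # 1.0 load per hour

# Replicating the dryer halves its effective occupancy per load,
# so the next-slowest stage (the washer) sets the pace:
stage_minutes["dry"] /= 2            # two dryers serve alternate loads
cycle = max(stage_minutes.values())  # 30 min
print(60 / cycle)                    # 2.0 loads per hour
```

This is exactly the "throughput restored using 2 dryers" fix on the following slide: replicate the bottleneck resource rather than speed it up.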
Pipelining Multiple Loads of Laundry: In Practice (cont.)
[Figure: the same timeline with the dryer replicated; consecutive loads alternate between the two dryers. Based on original figure from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

Throughput restored (2 loads per hour) using 2 dryers.
An Ideal Pipeline
- Goal: increase throughput with little increase in cost (hardware cost, in the case of instruction processing)
- Repetition of identical operations
  - The same operation is repeated on a large number of different inputs (e.g., all laundry loads go through the same steps)
- Repetition of independent operations
  - No dependencies between repeated operations
- Uniformly partitionable suboperations
  - Processing can be evenly divided into uniform-latency suboperations (that do not share resources)
- Fitting examples: automobile assembly line, doing laundry
  - What about the instruction processing "cycle"?
Ideal Pipelining
Splitting a block of combinational logic of delay T into k equal stages multiplies throughput (BW) by k:

  1 stage:  combinational logic (F,D,E,M,W), T ps          -> BW = ~(1/T)
  2 stages: T/2 ps (F,D,E) | T/2 ps (M,W)                  -> BW = ~(2/T)
  3 stages: T/3 ps (F,D) | T/3 ps (E,M) | T/3 ps (M,W)     -> BW = ~(3/T)
More Realistic Pipeline: Throughput
- Nonpipelined version with delay T:

  BW = 1 / (T + S), where S = latch delay

- k-stage pipelined version:

  BW_k-stage = 1 / (T/k + S)          (latch delay reduces throughput)
  BW_max = 1 / (1 gate delay + S)     (S is the switching overhead between stages)
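These formulas can be evaluated directly; the T and S values below are assumed example numbers, not from the slide:

```python
# Throughput (BW) model with latch/switching overhead S per stage.

def bw_nonpipelined(T: float, S: float) -> float:
    # One long block of logic with delay T, plus one latch.
    return 1.0 / (T + S)

def bw_k_stage(T: float, S: float, k: int) -> float:
    # k equal stages: each has T/k of logic plus a full latch delay S.
    return 1.0 / (T / k + S)

T, S = 800.0, 50.0  # picoseconds (illustrative)
print(1 / bw_nonpipelined(T, S))  # 850.0 ps per result
print(1 / bw_k_stage(T, S, 5))    # 210.0 ps per result: ~4x, not 5x, because of S
```

Note that as k grows, the S term dominates T/k, which is why BW_max is capped by one gate delay plus the latch overhead.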
More Realistic Pipeline: Cost
- Nonpipelined version with combinational cost G:

  Cost = G + L, where L = latch cost

- k-stage pipelined version (each stage has G/k gates plus a latch):

  Cost_k-stage = G + L*k              (latches increase hardware cost)
Pipelining Instruction Processing

Remember: The Instruction Processing Cycle
The conceptual steps (Fetch, Decode, Evaluate Address, Fetch Operands, Execute, Store Result) map onto five stages:

1. Instruction fetch (IF)
2. Instruction decode and register operand fetch (ID/RF)
3. Execute / evaluate memory address (EX/AG)
4. Memory operand fetch (MEM)
5. Store / writeback result (WB)
Remember the Single-Cycle Uarch
[Figure: MIPS single-cycle datapath - PC, instruction memory, register file, ALU, data memory, sign extension, and a control unit generating RegDst, Jump, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, and RegWrite; PCSrc1 = Jump, PCSrc2 = Branch taken. Based on original figure from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

The whole instruction executes in one long combinational delay T, so BW = ~(1/T).
Dividing Into Stages
[Figure: the single-cycle datapath divided into five stages with these latencies - IF: instruction fetch (200 ps), ID: instruction decode / register file read (100 ps), EX: execute / address calculation (200 ps), MEM: memory access (200 ps), WB: write back (100 ps). Based on original figure from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

Is this the correct partitioning?
Why not 4 or 6 stages? Why not different boundaries?
Instruction Pipeline Throughput
[Figure: three loads executed back to back. Nonpipelined, each instruction takes 800 ps (instruction fetch, register read, ALU, data access, register write), so one instruction completes every 800 ps. Pipelined, a new instruction starts every 200 ps.]

Example instructions:
lw $1, 100($0)
lw $2, 200($0)
lw $3, 300($0)

The 5-stage speedup is 4x (800 ps / 200 ps), not 5x as predicted by the ideal model. Why?
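The 4x-rather-than-5x result follows directly from the stage latencies chosen on the "Dividing Into Stages" slide:

```python
# Stage latencies (ps) from the 5-stage division: IF, ID, EX, MEM, WB.
stage_ps = [200, 100, 200, 200, 100]

single_cycle_ps = sum(stage_ps)  # 800 ps: one instruction per 800 ps
pipelined_ps = max(stage_ps)     # 200 ps: the clock must fit the SLOWEST stage

print(single_cycle_ps / pipelined_ps)  # 4.0: the actual steady-state speedup

# The ideal model instead assumes 5 uniform stages of 800/5 = 160 ps each:
print(single_cycle_ps / (single_cycle_ps / 5))  # 5.0: the ideal speedup
```

The gap between 4x and 5x is the internal fragmentation discussed at the end of the lecture: the 100 ps stages idle for half of every 200 ps cycle.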
Enabling Pipelined Processing: Pipeline Registers
No resource is used by more than 1 stage!

[Figure: the five-stage datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB inserted between stages. Latched values carried forward include PCF, IRD, PCD+4, PCE+4, AE, BE, ImmE, nPCM, AoutM, BM, AoutW, and MDRW. Each stage now takes roughly T/k ps. Based on original figure from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
Pipelined Operation Example
[Figure: a single lw instruction flowing through the pipelined datapath one stage per clock - instruction fetch, instruction decode, execution, memory, write back. Based on original figure from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

All instruction classes must follow the same path and timing through the pipeline stages. Any performance impact?
lw
0
0 M Instruction decode lw
M
u
u
x 19
Based on original figure from [P&H
x CO&D,
1 COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.] Write back
1
memory
memory Read 00x ALU Read u
Write
Write data 22
data result
result Addressmemory Read 11x
register
data
register M data
data
1M M
uu Data 0M
Write
Write xx
Write uu
data memory
memory xx
data
data 11

Pipelined Operation Example


16 32 Write 00
Write
Sign data
data
extend
16
16 32
32
Sign
Sign
extend
extend

Clock 1
Clock
Clock 5 3

lw $10,
sub $11,20($1)
$2, $3 lw $10,
sub $11,20($1)
$2, $3 lw $10, 20($1)
Instruction fetch Instruction decode Execution
sub $11, $2, $3 lw $10,
sub $11,20($1)
$2, $3 sub $11,20($1)
lw $10, $2, $3
00
M
MM
u
uu Execution Memory Write back
Write back
xx
11

IF/ID
IF/ID ID/EX
ID/EX EX/MEM
EX/MEM MEM/WB
MEM/WB

Add
Add
Add

Add AddAdd
Add
44 Add
Add result
result
result
Shift
Shift
left
left 22

Read
Read
Instruction
Instruction
Instruction

PC Address
Address register 11
register
register 1 Read
PC
PC Address Read
Read
Read data 11
data
data 1
Read
Read Zero
Instruction register 22
register Zero
Zero
Instruction
Instruction Registers Read
Registers Read
Read ALU ALU
ALU ALU
ALU
memory
memory Write 0
00 Address Read
Read
Read
Write
Write data 22
data
data result
result
result Address 1
11
register
register
register M
M
M data
data
u M
M
M
uu Data
Data u

Is life always this beautiful?


Write
Write xxx uu
memory
memory xx
data
data
data 1
11 0
Write 00
Write
Write
data
data
data
16
16
16 32
32
32
Sign
Sign
extend
extend
extend

Clock
Clock
Clock56 21 43
Clock
Clock

sub $11, $2, $3 lw $10, 20($1) 20


Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
Instruction00 fetch Instruction decode sub $11, $2, $3 lw $10, 20($1) sub $11, $2, $3
Illustrating Pipeline Operation: Operation View

        t0   t1   t2   t3   t4   t5
Inst0   IF   ID   EX   MEM  WB
Inst1        IF   ID   EX   MEM  WB
Inst2             IF   ID   EX   MEM
Inst3                  IF   ID   EX
Inst4                       IF   ID

Steady state (full pipeline) is reached once every stage is busy.
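The operation view can be generated programmatically: instruction i occupies stage t − i at tick t. The stage names follow the slide; the formatting helper is my own:

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def operation_view(n_insts: int, n_ticks: int) -> list[str]:
    # Row i shows instruction i, which is in stage (t - i) at tick t;
    # blank cells mean the instruction is not in the pipeline yet/anymore.
    rows = []
    for i in range(n_insts):
        cells = []
        for t in range(n_ticks):
            s = t - i
            cells.append(STAGES[s] if 0 <= s < len(STAGES) else "   ")
        rows.append(f"Inst{i}  " + " ".join(c.ljust(3) for c in cells).rstrip())
    return rows

for row in operation_view(5, 6):
    print(row)
```

The same indexing read column-wise gives the resource view on the next slide: at tick t, stage s holds instruction t − s.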
Illustrating Pipeline Operation: Resource View

      t0  t1  t2  t3  t4  t5  t6  t7  t8  t9  t10
IF    I0  I1  I2  I3  I4  I5  I6  I7  I8  I9  I10
ID        I0  I1  I2  I3  I4  I5  I6  I7  I8  I9
EX            I0  I1  I2  I3  I4  I5  I6  I7  I8
MEM               I0  I1  I2  I3  I4  I5  I6  I7
WB                    I0  I1  I2  I3  I4  I5  I6
Remember: An Ideal Pipeline
- Goal: increase throughput with little increase in cost (hardware cost, in the case of instruction processing)
- Repetition of identical operations
  - The same operation is repeated on a large number of different inputs (e.g., all laundry loads go through the same steps)
- Repetition of independent operations
  - No dependencies between repeated operations
- Uniformly partitionable suboperations
  - Processing can be evenly divided into uniform-latency suboperations (that do not share resources)
- Fitting examples: automobile assembly line, doing laundry
  - What about the instruction processing "cycle"?
Instruction Pipeline: Not An Ideal Pipeline
- Identical operations ... NOT!
  => different instructions -> not all need the same stages
  Forcing different instructions to go through the same pipe stages
  -> external fragmentation (some pipe stages idle for some instructions)
- Uniform suboperations ... NOT!
  => different pipeline stages -> not the same latency
  Need to force each stage to be controlled by the same clock
  -> internal fragmentation (some pipe stages are too fast but all take the same clock cycle time)
- Independent operations ... NOT!
  => instructions are not independent of each other
  Need to detect and resolve inter-instruction dependencies to ensure the pipeline provides correct results
  -> pipeline stalls (pipeline is not always moving)
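A toy model of the third violation, inter-instruction dependence, shows where stall cycles come from. The stall rule below (a consumer may not decode until after its producer writes back, i.e., no forwarding) is a simplifying assumption, not the full hazard-detection logic:

```python
def completion_cycles(insts, k=5):
    # insts: list of (dest_reg, source_regs) tuples in program order.
    # Without forwarding, an instruction that reads a register must not
    # reach decode (stage 2) until after its producer's writeback (stage k).
    start = []  # cycle in which each instruction enters IF
    for i, (dest, srcs) in enumerate(insts):
        s = 0 if i == 0 else start[-1] + 1
        for j in range(i):
            if insts[j][0] in srcs:
                s = max(s, start[j] + k - 1)  # ID lands after producer's WB
        start.append(s)
    return start[-1] + k  # cycles until the last instruction leaves WB

dependent = [("r1", {"r2", "r3"}), ("r4", {"r1", "r5"}), ("r6", {"r7", "r8"})]
independent = [("r1", {"r2"}), ("r4", {"r5"}), ("r6", {"r7"})]
print(completion_cycles(dependent))    # 10: three bubble cycles waiting on r1
print(completion_cycles(independent))  # 7: the ideal k + (n - 1)
```

Data dependence solutions that remove most of these bubbles (forwarding, scheduling) are the subject of the next lectures in the index.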
