Enhancing Performance With Pipelining
Pipelining
• Pipeline concepts
• Hazards
• Example
Pipelined vs. Single-Cycle Instruction Execution
[Figure: timing of three loads, lw $1, 100($0); lw $2, 200($0); lw $3, 300($0), in program execution order against time in ns. Each instruction passes through instruction fetch, Reg, ALU, data access, Reg. Single-cycle: a new instruction starts every 8 ns. Pipelined: each stage takes 2 ns and a new instruction starts every 2 ns.]
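The timing comparison can be checked with a little arithmetic. The sketch below assumes the figures quoted on this slide (an 8 ns single-cycle clock, five pipeline stages of 2 ns each) and an ideal pipeline with no hazards; the function names are illustrative only.

```python
def single_cycle_time(n_instructions, cycle_ns=8):
    """Total time when every instruction takes one long clock cycle."""
    return n_instructions * cycle_ns


def pipelined_time(n_instructions, stages=5, stage_ns=2):
    """Total time for an ideal pipeline: the first instruction needs
    `stages` cycles, each later one completes one cycle after the previous."""
    return (stages + n_instructions - 1) * stage_ns


if __name__ == "__main__":
    for n in (3, 1000):
        print(n, single_cycle_time(n), pipelined_time(n))
```

For the three loads in the figure this gives 24 ns against 14 ns; for long instruction sequences the ratio approaches 8 ns / 2 ns = 4.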
Pipeline Implementation
• Idea:
– Goal of MIPS: CPI <= 1
– Some instructions take longer to execute than others
– Don’t want cycle time to depend on slowest instruction
– Want 100% hardware utilization
– Split execution of each instruction into several, balanced “stages”
– Each stage is a block of combinational logic
– Latency of each stage fits within 1 clock cycle
– Insert registers between each pipeline stage to hold intermediate results
– Execute each of these steps in parallel for a sequence of instructions
– Structural hazards
• Two operations require the same piece of hardware in the same cycle, e.g., memory
• Structural hazards can be overcome by adding hardware
– Control hazards
• Branch instructions require subsequent instruction fetches to be predicted
– Flushed if the prediction does not hold (make sure no state has changed)
• Branch hazards can be handled with dynamic prediction/speculation or a branch delay slot
– Data hazards
• An instruction in one pipeline stage depends on data computed by an earlier instruction that is still in the pipeline
Structural Hazards
• Inadequate hardware to simultaneously support all instructions in
the pipeline in the same clock cycle
• E.g., suppose single – not separate – instruction and data memory in
pipeline below with one read port
– then a structural hazard between first and fourth lw instructions
[Figure: four pipelined loads, lw $1, 100($0); lw $2, 200($0); lw $3, 300($0); lw $4, 400($0), each started 2 ns after the previous one. Hazard if single memory: the data access of the first lw and the instruction fetch of the fourth lw fall in the same cycle.]
[Figure: pipeline stall on a branch. add $4, $5, $6 and beq $1, $2, 40 issue 2 ns apart; a bubble delays lw $3, 300($0) until 4 ns after the beq. Note that the branch outcome is computed in the ID stage with added hardware (later…).]
Control Hazards
• Solution 2 Predict branch outcome
– e.g., predict branch-not-taken :
[Figure: prediction success. add $4, $5, $6; beq $1, $2, 40; lw $3, 300($0) issue 2 ns apart with no stall.]
[Figure: prediction failure. After add $4, $5, $6 and beq $1, $2, 40, the lw is undone (flushed) and becomes a bubble; or $7, $8, $9 is fetched 4 ns after the beq.]
Control Hazards
• Solution 3 Delayed branch: always execute the sequentially next statement with
the branch executing after one instruction delay
– compiler’s job to find a statement that can be put in the slot that is independent
of branch outcome
[Figure: delayed-branch pipeline timing, including lw $3, 300($0), with instructions issued 2 ns apart.]
[Figure: instruction pipeline diagram for add $s0, $t0, $t1 (IF ID EX MEM WB). Shading indicates use of a unit: left half = write, right half = read.]
[Figure: add $s0, $t0, $t1 followed by sub $t2, $s0, $t3. Without forwarding (blue line) the data has to go back in time; with forwarding (red line) the data is available in time.]
Data Hazards
• Forwarding may not be enough
– e.g., if an R-type instruction following a load uses the result of the load –
called load-use data hazard
[Figure: lw $s0, 20($t1) followed by a sub that uses $s0. Without a stall it is impossible to provide input to the sub instruction in time.]
[Figure: the same sequence with a bubble inserted. With a one-stage stall, forwarding can get the data to the sub instruction in time.]
• Reordered code:
lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t0, 4($t1)    # interchanged
sw $t2, 0($t1)    # interchanged
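A small checker makes the benefit of the reordering concrete. This is a sketch under stated assumptions: the original order is taken to be the same code with the two sw instructions the other way round (as the "Interchanged" label suggests), only lw and sw are parsed, and any instruction that reads a register loaded by the immediately preceding lw is counted as one stall. The helper names are hypothetical.

```python
import re


def parse(instr):
    """Return (op, dest, sources) for the lw/sw forms used on this slide."""
    op, rest = instr.split(None, 1)
    regs = re.findall(r"\$\w+", rest)
    if op == "lw":   # lw rt, off(rs): writes rt, reads rs
        return op, regs[0], [regs[1]]
    if op == "sw":   # sw rt, off(rs): reads rt and rs, writes nothing
        return op, None, regs
    raise ValueError(f"unsupported instruction: {instr}")


def load_use_stalls(program):
    """Count instructions that use the result of the load just before them."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        op, dest, _ = parse(prev)
        _, _, sources = parse(cur)
        if op == "lw" and dest in sources:
            stalls += 1
    return stalls


# Assumed original order (not shown on the slide).
original = ["lw $t0, 0($t1)", "lw $t2, 4($t1)", "sw $t2, 0($t1)", "sw $t0, 4($t1)"]
# Reordered code from the slide.
reordered = ["lw $t0, 0($t1)", "lw $t2, 4($t1)", "sw $t0, 4($t1)", "sw $t2, 0($t1)"]

if __name__ == "__main__":
    print(load_use_stalls(original), load_use_stalls(reordered))
```

Under these assumptions the original order has one load-use stall and the reordered code has none.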
Pipelined Datapath
Recall the 5 steps in instruction execution
1. Instruction Fetch & PC Increment (IF)
2. Instruction Decode and Register Read (ID)
3. Execution or calculate address (EX)
4. Memory access (MEM)
5. Write result into register (WB)
Review - Single-Cycle Datapath “Steps”
[Figure: single-cycle datapath (PC, instruction memory, register file, sign extend, ALU, data memory, multiplexers, adders) divided into five steps: IF Instruction Fetch, ID Instruction Decode, EX Execute/Address Calc., MEM Memory Access, WB Write Back.]
Pipelined Datapath – Key Idea
• What happens if we break the execution into multiple cycles,
but keep the extra hardware?
– Answer: We may be able to start executing a new instruction at each
clock cycle - pipelining
• …but we shall need extra registers to hold data between
cycles – pipeline registers
Pipelined Datapath
Pipeline registers wide enough to hold data coming in
[Figure: single-cycle datapath with pipeline registers inserted between the stages. Register widths: IF/ID 64 bits, ID/EX 128 bits, EX/MEM 97 bits, MEM/WB 64 bits.]
[Figure: pipelined datapath in which the 5-bit write register number is carried along with the instruction. Register widths: IF/ID 64 bits, ID/EX 133 bits, EX/MEM 102 bits, MEM/WB 69 bits.]
Single-Clock-Cycle Diagrams
• Instruction sequence:
lw $t0, 10($t1)
sw $t3, 20($t4)
add $t5, $t6, $t7
sub $t8, $t9, $t10
[Figures: pipelined datapath in clock cycles 1 to 8. Instruction occupying each stage:]
Clock cycle | IF | ID | EX | MEM | WB
1 | LW | - | - | - | -
2 | SW | LW | - | - | -
3 | ADD | SW | LW | - | -
4 | SUB | ADD | SW | LW | -
5 | - | SUB | ADD | SW | LW
6 | - | - | SUB | ADD | SW
7 | - | - | - | SUB | ADD
8 | - | - | - | - | SUB
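The stage occupancy shown in the single-clock-cycle diagrams follows a simple rule: in clock cycle c, stage number s (counting IF as 0) holds instruction number c - 1 - s. The sketch below assumes an ideal pipeline with no stalls; `occupancy` is an illustrative name, not part of the slides.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def occupancy(program):
    """Return one dict per clock cycle mapping stage name -> instruction."""
    n_cycles = len(program) + len(STAGES) - 1
    table = []
    for cycle in range(1, n_cycles + 1):
        row = {}
        for s, stage in enumerate(STAGES):
            i = cycle - 1 - s          # index of the instruction in this stage
            if 0 <= i < len(program):
                row[stage] = program[i]
        table.append(row)
    return table


if __name__ == "__main__":
    for cycle, row in enumerate(occupancy(["LW", "SW", "ADD", "SUB"]), start=1):
        print(cycle, row)
```

Four instructions on a five-stage pipeline need 4 + 5 - 1 = 8 clock cycles, which matches the eight diagrams.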
Tutorial Question
• Write the significant difference in the execution of an R-type
instruction between multicycle and pipelined implementations.
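Recall Single-Cycle Control – the Datapath
[Figure: single-cycle datapath with the control unit. Control decodes Instruction [31–26] and drives RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, and RegWrite; Instruction [5–0] feeds the ALU control; PCSrc selects between PC + 4 and the branch target.]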
Recall Single-Cycle – ALU Control
Instruction opcode | ALUOp | Instruction operation | Funct field | Desired ALU action | ALU control input
LW | 00 | load word | xxxxxx | add | 010
SW | 00 | store word | xxxxxx | add | 010
Branch eq | 01 | branch eq | xxxxxx | subtract | 110
R-type | 10 | add | 100000 | add | 010
R-type | 10 | subtract | 100010 | subtract | 110
R-type | 10 | AND | 100100 | and | 000
R-type | 10 | OR | 100101 | or | 001
R-type | 10 | set on less | 101010 | set on less | 111
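The ALU control table can be read as a two-level decode: ALUOp alone decides the operation for loads, stores, and branches, and the funct field is consulted only for R-type instructions. A minimal sketch of that mapping follows; the function and table names are illustrative.

```python
# funct field -> 3-bit ALU control input, for R-type instructions (ALUOp = 10)
FUNCT_TO_ALU_CONTROL = {
    0b100000: 0b010,  # add
    0b100010: 0b110,  # subtract
    0b100100: 0b000,  # and
    0b100101: 0b001,  # or
    0b101010: 0b111,  # set on less than
}


def alu_control(alu_op, funct=0):
    """Return the 3-bit ALU control input for a 2-bit ALUOp and 6-bit funct."""
    if alu_op == 0b00:        # lw / sw: add to form the address
        return 0b010
    if alu_op == 0b01:        # beq: subtract to compare
        return 0b110
    if alu_op == 0b10:        # R-type: decode the funct field
        return FUNCT_TO_ALU_CONTROL[funct]
    raise ValueError(f"unsupported ALUOp: {alu_op:02b}")


if __name__ == "__main__":
    print(format(alu_control(0b10, 0b100010), "03b"))
```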
Effect of each control signal when deasserted (0) and when asserted (1):
RegDst
– Deasserted: the register destination number for the Write register comes from the rt field (bits 20-16)
– Asserted: the register destination number for the Write register comes from the rd field (bits 15-11)
RegWrite
– Deasserted: none
– Asserted: the register on the Write register input is written with the value on the Write data input
ALUSrc
– Deasserted: the second ALU operand comes from the second register file output (Read data 2)
– Asserted: the second ALU operand is the sign-extended lower 16 bits of the instruction
PCSrc
– Deasserted: the PC is replaced by the output of the adder that computes the value of PC + 4
– Asserted: the PC is replaced by the output of the adder that computes the branch target
MemRead
– Deasserted: none
– Asserted: data memory contents designated by the address input are put on the first Read data output
MemWrite
– Deasserted: none
– Asserted: data memory contents designated by the address input are replaced by the value of the Write data input
MemtoReg
– Deasserted: the value fed to the register Write data input comes from the ALU
– Asserted: the value fed to the register Write data input comes from the data memory
[Figure: pipelined datapath with the same control signals as the single-cycle datapath: PCSrc, Branch, RegWrite, MemWrite, MemRead, MemtoReg, ALUSrc, ALUOp, RegDst.]
Pipeline Control Signals
• There are five stages in the pipeline
– instruction fetch / PC increment: nothing to control, as instruction memory read and PC write are always enabled
– instruction decode / register fetch
– execution / address calculation
– memory access
– write back
• Control lines, grouped by the pipeline stage that uses them:
Instruction | Execution/Address Calculation stage: RegDst ALUOp1 ALUOp0 ALUSrc | Memory access stage: Branch MemRead MemWrite | Write-back stage: RegWrite MemtoReg
R-format | 1 1 0 0 | 0 0 0 | 1 0
lw | 0 0 0 1 | 0 1 0 | 1 1
sw | X 0 0 1 | 0 0 1 | 0 X
beq | X 0 1 0 | 1 0 0 | 0 X
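The grouping of control lines by stage determines how many control bits each pipeline register has to carry. The sketch below encodes the table as bit strings (with X for don't-care) and shows which bits are still in flight after each stage; the dictionary layout and the function name are assumptions made for illustration.

```python
# Control bits per instruction class, in the column order of the table:
# EX = RegDst ALUOp1 ALUOp0 ALUSrc, M = Branch MemRead MemWrite,
# WB = RegWrite MemtoReg. "X" marks a don't-care.
CONTROL = {
    "R-format": {"EX": "1100", "M": "000", "WB": "10"},
    "lw":       {"EX": "0001", "M": "010", "WB": "11"},
    "sw":       {"EX": "X001", "M": "001", "WB": "0X"},
    "beq":      {"EX": "X010", "M": "100", "WB": "0X"},
}


def control_in_flight(instr):
    """Control bits held by each pipeline register for one instruction."""
    c = CONTROL[instr]
    return {
        "ID/EX": c["EX"] + c["M"] + c["WB"],   # all 9 bits still needed
        "EX/MEM": c["M"] + c["WB"],            # EX bits consumed
        "MEM/WB": c["WB"],                     # only write-back bits remain
    }


if __name__ == "__main__":
    print(control_in_flight("lw"))
```

Each pipeline register therefore needs 9, 5, and 2 control bits respectively, on top of its data fields.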
Pipeline Control Implementation
• Pass control signals along just like the data – extend each pipeline register to hold needed control bits for succeeding stages
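[Figure: the control lines produced by the Control unit from the instruction are split into EX, M, and WB groups; ID/EX carries EX, M, and WB, EX/MEM carries M and WB, and MEM/WB carries WB.]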
• Note: The 6-bit funct field of the instruction required in the EX stage to generate ALU control can be retrieved as the 6 least
significant bits of the immediate field which is sign-extended and passed from the IF/ID register to the ID/EX register
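The note works because sign extension only changes the upper 16 bits, so the low 6 bits of the extended immediate are still the funct field. A quick check, assuming the standard MIPS R-type layout (funct in bits 5-0, rd in bits 15-11); the helper names are illustrative.

```python
def sign_extend16(imm16):
    """Sign-extend a 16-bit field to 32 bits (returned as an unsigned int)."""
    imm16 &= 0xFFFF
    return (imm16 | 0xFFFF0000) if (imm16 & 0x8000) else imm16


def funct_from_extended(imm32):
    """Recover the 6-bit funct field from the sign-extended immediate."""
    return imm32 & 0x3F


def low16_of_rtype(rd, shamt, funct):
    """Low 16 bits of an R-type instruction: rd | shamt | funct."""
    return (rd << 11) | (shamt << 6) | funct


if __name__ == "__main__":
    # sub $11, $2, $3 has rd = 11 and funct = 100010
    low = low16_of_rtype(rd=11, shamt=0, funct=0b100010)
    print(format(funct_from_extended(sign_extend16(low)), "06b"))
```

The recovery also holds when rd is 16 or higher, where bit 15 is set and the extension fills the upper half with ones.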
[Figure: pipelined datapath with the control signals connected to the control portions (EX, M, WB) of the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers.]
Pipelined Execution and Control
• Instruction sequence:
lw $10, 20($1)
sub $11, $2, $3
and $12, $4, $5
or $13, $6, $7
add $14, $8, $9
[Figures: pipelined datapath with control values in clock cycles 1 to 9. Instruction in each stage:]
Clock 1 | IF: lw $10, 20($1) | ID: before<1> | EX: before<2> | MEM: before<3> | WB: before<4>
Clock 2 | IF: sub $11, $2, $3 | ID: lw $10, 20($1) | EX: before<1> | MEM: before<2> | WB: before<3>
Clock 3 | IF: and $12, $4, $5 | ID: sub $11, $2, $3 | EX: lw $10, . . . | MEM: before<1> | WB: before<2>
Clock 4 | IF: or $13, $6, $7 | ID: and $12, $4, $5 | EX: sub $11, . . . | MEM: lw $10, . . . | WB: before<1>
Clock 5 | IF: add $14, $8, $9 | ID: or $13, $6, $7 | EX: and $12, . . . | MEM: sub $11, . . . | WB: lw $10, . . .
Clock 6 | IF: after<1> | ID: add $14, $8, $9 | EX: or $13, . . . | MEM: and $12, . . . | WB: sub $11, . . .
Clock 7 | IF: after<2> | ID: after<1> | EX: add $14, . . . | MEM: or $13, . . . | WB: and $12, . . .
Clock 8 | IF: after<3> | ID: after<2> | EX: after<1> | MEM: add $14, . . . | WB: or $13, . . .
Clock 9 | IF: after<4> | ID: after<3> | EX: after<2> | MEM: after<1> | WB: add $14, . . .
[Figure: pipelined dependences among instructions that use register $2, including or $13, $6, $2; add $14, $2, $2; sw $15, 100($2), each drawn with IM, Reg, DM, Reg stages.]
Hazard Detection
• Hazard conditions:
1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
2b. MEM/WB.RegisterRd = ID/EX.RegisterRt
• Example sequence:
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
– E.g., in the example, the first hazard, between sub $2, $1, $3 and and $12, $2, $5, is detected when the and is in the EX stage and the sub is in the MEM stage, because
• EX/MEM.RegisterRd = ID/EX.RegisterRs = $2 (1a)
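The four conditions can be applied mechanically to the example sequence. The sketch below is a simplification: when instruction i is in EX, instruction i-1 is assumed to be in MEM (EX/MEM) and instruction i-2 in WB (MEM/WB), with no stalls. Instructions are given as hand-decoded (text, rd, rs, rt) tuples, and rd is None for sw because it writes no register; that encoding is an assumption of this example, not something the slides define.

```python
def hazards(program):
    """Return (condition, instruction text) pairs for conditions 1a, 1b, 2a, 2b."""
    found = []
    checks = ((1, ("1a", "1b")),   # producer is 1 ahead: held in EX/MEM
              (2, ("2a", "2b")))   # producer is 2 ahead: held in MEM/WB
    for i, (text, _, rs, rt) in enumerate(program):
        for distance, (tag_rs, tag_rt) in checks:
            if i - distance < 0:
                continue
            rd = program[i - distance][1]
            if rd is None:         # earlier instruction writes no register
                continue
            if rd == rs:
                found.append((tag_rs, text))
            if rd == rt:
                found.append((tag_rt, text))
    return found


EXAMPLE = [
    ("sub $2, $1, $3",  2,    1, 3),
    ("and $12, $2, $5", 12,   2, 5),
    ("or $13, $6, $2",  13,   6, 2),
    ("add $14, $2, $2", 14,   2, 2),
    ("sw $15, 100($2)", None, 2, 15),
]

if __name__ == "__main__":
    for tag, text in hazards(EXAMPLE):
        print(tag, text)
```

On this sequence the sketch reports condition 1a for the and and condition 2b for the or. The add and the sw are far enough behind the sub that none of the four conditions fires.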
Hazard Detection
[Figure: pipelined dependences for the sequence sub $2, $1, $3; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2), each drawn with IM, Reg, DM, Reg stages against time in clock cycles.]
Clock cycle | CC 1 | CC 2 | CC 3 | CC 4 | CC 5 | CC 6 | CC 7 | CC 8 | CC 9
Value of register $2 | 10 | 10 | 10 | 10 | 10/–20 | –20 | –20 | –20 | –20
Forwarding Hardware
[Figure: ALU and pipeline registers without forwarding and (b.) with forwarding. With forwarding, multiplexers controlled by ForwardA and ForwardB select each ALU input; the forwarding unit takes Rs, Rt, EX/MEM.RegisterRd, and MEM/WB.RegisterRd as inputs.]
Datapath after adding forwarding hardware
Forwarding Hardware: Multiplexor Control
Mux control | Source | Explanation
ForwardA = 00 | ID/EX | The first ALU operand comes from the register file
ForwardA = 10 | EX/MEM | The first ALU operand is forwarded from prior ALU result
ForwardA = 01 | MEM/WB | The first ALU operand is forwarded from data memory or an earlier ALU result
ForwardB = 00 | ID/EX | The second ALU operand comes from the register file
ForwardB = 10 | EX/MEM | The second ALU operand is forwarded from prior ALU result
ForwardB = 01 | MEM/WB | The second ALU operand is forwarded from data memory or an earlier ALU result
1. EX hazard
if ( EX/MEM.RegWrite // if there is a write…
and ( EX/MEM.RegisterRd ≠ 0 ) // to a non-$0 register…
and ( EX/MEM.RegisterRd = ID/EX.RegisterRs ) ) // which matches, then…
ForwardA = 10
The MEM hazard test (forwarding from MEM/WB) must additionally check that there is no EX hazard for the same operand.
This check is necessary, e.g., for sequences such as add $1, $1, $2; add $1, $1, $3; add $1, $1, $4;
(array summing…), where an earlier pipeline (EX/MEM) register has more recent data
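The forwarding decision for one ALU operand can be sketched as a priority check: EX/MEM is tested first, so that when both pipeline registers hold a value for the same register the more recent one wins. The dictionary keys and the function name below are illustrative; the returned values follow the multiplexor control encoding (10 = EX/MEM, 01 = MEM/WB, 00 = register file).

```python
def forward(ex_mem, mem_wb, operand_reg):
    """Return the 2-bit mux control for one ALU operand.

    ex_mem and mem_wb are dicts with keys "RegWrite" (bool) and "Rd" (int);
    operand_reg is ID/EX.RegisterRs for ForwardA or ID/EX.RegisterRt for ForwardB.
    """
    # EX hazard: the result was computed by the instruction one ahead.
    if ex_mem["RegWrite"] and ex_mem["Rd"] != 0 and ex_mem["Rd"] == operand_reg:
        return 0b10
    # MEM hazard: only reached when there is no EX hazard for this operand.
    if mem_wb["RegWrite"] and mem_wb["Rd"] != 0 and mem_wb["Rd"] == operand_reg:
        return 0b01
    return 0b00


if __name__ == "__main__":
    # Third add of: add $1,$1,$2; add $1,$1,$3; add $1,$1,$4
    ex_mem = {"RegWrite": True, "Rd": 1}   # second add
    mem_wb = {"RegWrite": True, "Rd": 1}   # first add
    print(format(forward(ex_mem, mem_wb, 1), "02b"),   # ForwardA for $1
          format(forward(ex_mem, mem_wb, 4), "02b"))   # ForwardB for $4
```

Testing EX/MEM before MEM/WB is one way to express the extra "no EX hazard" condition; hardware would evaluate both comparisons in parallel and gate the second.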
Forwarding Hardware with Control
Called forwarding unit, not hazard detection unit,
because once data is forwarded there is no hazard!
[Figure: pipelined datapath with forwarding unit. IF/ID.RegisterRs, IF/ID.RegisterRt, and IF/ID.RegisterRd are carried into ID/EX as Rs, Rt, and Rd; the forwarding unit compares Rs and Rt with EX/MEM.RegisterRd and MEM/WB.RegisterRd and controls the multiplexers on the ALU inputs.]
Forwarding
• Execution example:
sub $2, $1, $3
and $4, $2, $5
or $4, $4, $2
add $9, $4, $2
[Figures: datapath with forwarding unit in clock cycles 3 to 6. Instruction in each stage:]
Clock 3 | IF: or $4, $4, $2 | ID: and $4, $2, $5 | EX: sub $2, $1, $3 | MEM: before<1> | WB: before<2>
Clock 4 | IF: add $9, $4, $2 | ID: or $4, $4, $2 | EX: and $4, $2, $5 | MEM: sub $2, . . . | WB: before<1>
Clock 5 | IF: after<1> | ID: add $9, $4, $2 | EX: or $4, $4, $2 | MEM: and $4, . . . | WB: sub $2, . . .
Clock 6 | IF: after<2> | ID: after<1> | EX: add $9, $4, $2 | MEM: or $4, . . . | WB: and $4, . . .
Data Hazards and Stalls
• Load word can still cause a hazard:
– an instruction tries to read a register following a load instruction that writes to the same register
[Figure: pipeline diagram of a load followed by dependent instructions, including or $8, $2, $6 and add $9, $4, $2, each drawn with IM, Reg, DM, Reg stages. Because the dependency from the load goes backward in time, forwarding alone will not resolve the hazard.]
[Figure: pipelined datapath with a hazard detection unit added.]
Hazard Detection Logic to Stall
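• The hazard detection unit checks whether to stall: it stalls when the instruction in EX is a load (ID/EX.MemRead asserted) and its destination, ID/EX.RegisterRt, matches IF/ID.RegisterRs or IF/ID.RegisterRt of the instruction in ID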
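A minimal sketch of the load-use stall check, using the signal names that label the hazard detection unit in the diagrams (ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs, IF/ID.RegisterRt, PCWrite, IF/IDWrite). The function names and the "zero_control" key are assumptions made for illustration.

```python
def must_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """True when the instruction in ID reads the register a load in EX will write."""
    return bool(id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt))


def stall_signals(stall):
    """Outputs of the hazard detection unit for one clock cycle."""
    return {
        "PCWrite": not stall,      # hold the PC so the same instruction is refetched
        "IF/IDWrite": not stall,   # hold IF/ID so the same instruction is decoded again
        "zero_control": stall,     # send all-zero control into ID/EX (the bubble)
    }


if __name__ == "__main__":
    # lw $2, 20($1) is in EX while and $4, $2, $5 is in ID
    stall = must_stall(True, 2, 2, 5)
    print(stall, stall_signals(stall))
```

Holding the PC and IF/ID while zeroing the control bits turns the instruction slot in EX into a bubble for one cycle, after which forwarding can supply the loaded value.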
[Figure: pipelined datapath with hazard detection unit and forwarding unit. The hazard detection unit takes ID/EX.RegisterRt, IF/ID.RegisterRs, and IF/ID.RegisterRt as inputs and controls PCWrite, IF/IDWrite, and a multiplexer that replaces the control signals with 0 to create a bubble.]
Stalling
• Execution example:
lw $2, 20($1)
and $4, $2, $5
or $4, $4, $2
add $9, $4, $2
[Figures: datapath with hazard detection unit and forwarding unit in clock cycles 2 to 7. Instruction in each stage:]
Clock 2 | IF: and $4, $2, $5 | ID: lw $2, 20($1) | EX: before<1> | MEM: before<2> | WB: before<3>
Clock 3 | IF: or $4, $4, $2 | ID: and $4, $2, $5 | EX: lw $2, 20($1) | MEM: before<1> | WB: before<2>
Clock 4 | IF: or $4, $4, $2 | ID: and $4, $2, $5 | EX: bubble | MEM: lw $2, . . . | WB: before<1>
Clock 5 | IF: add $9, $4, $2 | ID: or $4, $4, $2 | EX: and $4, $2, $5 | MEM: bubble | WB: lw $2, . . .
Clock 6 | IF: after<1> | ID: add $9, $4, $2 | EX: or $4, $4, $2 | MEM: and $4, . . . | WB: bubble
Clock 7 | IF: after<2> | ID: after<1> | EX: add $9, $4, $2 | MEM: or $4, . . . | WB: and $4, . . .
Pipelining Exercise
Consider the following MIPS assembly code:
add $3, $2, $3
lw $4, 100($3)
sub $7, $6, $2
xor $6, $4, $3
Assume there is no forwarding or stalling circuitry in a pipelined processor that uses the
standard 5 stages (IF, ID, EX, Mem, WB). Instead, we will require the compiler to add no-ops
to the code to ensure correct execution. (Assume that if the processor reads and
writes the same register in a given cycle, the value read out will be the new value that
is written in.)
1. Rewrite the code to include the no-ops that are needed. Do not change the order of the
four statements. Use as few no-ops as possible.
2. Suppose the compiler is allowed to change the order of the four statements, provided it
doesn’t change the final answer. Is it possible to reduce the number of no-ops needed?
Why or why not?
Another pipelining exercise
Consider (again) the following MIPS assembly code:
add $3, $2, $3
lw $4, 100($3)
sub $7, $6, $2
xor $6, $4, $3
Assume that there is forwarding and stalling hardware, as described in the text (the forwarding path is
shown in the diagram on the next page).
Draw an execution diagram that shows where forwarding and stalling would take place. Use
arrows to show forwarding and “bubbles” to show stalls, as in the following
(hypothetical) example.
Cycle: 1 2 3 4 5 6 7
lw: IF (1), ID (2), Ex (3), Mem (4), WB (5)
or: IM (2), Bubble (3), ID (4), Ex (5), Mem (6), WB (7)
and: IF, ID, Ex, Mem, WB
[Arrow from the lw row to the or row.] This arrow says the value stored in flip-flops between Mem and WB is forwarded to an ALU input in the Ex stage for the “or”.
[Figure: pipelined datapath with the branch comparison (=) and branch target adder moved into the ID stage, plus hazard detection unit and forwarding unit. The IF.Flush control zeros out the instruction in the IF/ID pipeline register (which follows the branch).]
Branch decision is moved from the MEM stage to the ID stage – simplified drawing
not showing enhancements to the forwarding and hazard detection units
• Execution example:
36 sub $10, $4, $8
40 beq $1, $3, 7
44 and $12, $2, $5
48 or $13, $2, $6
…
72 lw $4, 50($7)
[Figures: datapath with IF.Flush in clock cycles 3 and 4. Instruction in each stage:]
Clock 3 | IF: and $12, $2, $5 | ID: beq $1, $3, 7 | EX: sub $10, $4, $8 | MEM: before<1> | WB: before<2>
Clock 4 | IF: lw $4, 50($7) | ID: bubble (nop) | EX: beq $1, $3, 7 | MEM: sub $10, . . . | WB: before<1>
Simple Example: Comparing Performance
• Compare performance for single-cycle, multicycle, and pipelined
datapaths using the gcc instruction mix
– assume 2 ns for memory access, 2 ns for ALU operation, 1 ns for
register read or write
– assume gcc instruction mix 23% loads, 13% stores, 19% branches,
2% jumps, 43% ALU
– for pipelined execution assume
• 50% of the loads are followed immediately by an instruction that uses
the result of the load
• 25% of branches are mispredicted
• branch delay on misprediction is 1 clock cycle
• jumps always incur 1 clock cycle delay so their average time is 2 clock
cycles
Simple Example: Comparing Performance
• Single-cycle (p. 373): average instruction time 8 ns
• Multicycle (p. 397): average instruction time 8.04 ns
• Pipelined:
– loads use 1 cc (clock cycle) when no load-use dependency and 2 cc when
there is dependency – given 50% of loads are followed by dependency the
average cc per load is 1.5
– stores use 1 cc each
– branches use 1 cc when predicted correctly and 2 cc when not – given 25%
misprediction average cc per branch is 1.25
– jumps use 2 cc each
– ALU instructions use 1 cc each
– therefore, average CPI is
1.5 × 23% + 1 × 13% + 1.25 × 19% + 2 × 2% + 1 × 43% = 1.18
– therefore, average instruction time is 1.18 × 2 = 2.36 ns
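The pipelined CPI is a weighted average of the cycles per instruction class. The sketch below reproduces the arithmetic with the gcc mix and penalties stated on the slides and a 2 ns clock cycle; the variable names are illustrative.

```python
# gcc instruction mix from the slides
mix = {"load": 0.23, "store": 0.13, "branch": 0.19, "jump": 0.02, "alu": 0.43}

# average clock cycles per instruction class
cycles = {
    "load": 1 + 0.50 * 1,    # 50% of loads stall one cycle (load-use)
    "store": 1,
    "branch": 1 + 0.25 * 1,  # 25% mispredicted, one-cycle penalty
    "jump": 2,               # always one cycle of delay
    "alu": 1,
}

CLOCK_NS = 2

cpi = sum(mix[k] * cycles[k] for k in mix)
avg_time_ns = cpi * CLOCK_NS

if __name__ == "__main__":
    print(f"CPI = {cpi:.4f}, average instruction time = {avg_time_ns:.3f} ns")
```

The unrounded values are 1.1825 and 2.365 ns; the slides round the CPI to 1.18 before multiplying, which gives 2.36 ns.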
Pipelining Advantages
• Higher maximum throughput
• Higher utilization of CPU resources
– Branch speedup
= (cycles before enhancement) / (cycles after enhancement)
= 3 / [.15(3) + .85(1)] = 2.3
– Amdahl’s Law:
$$\text{Speedup} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$
– 6% improvement
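Both calculations are short enough to check numerically. In the sketch below the branch speedup uses the figures from the slide. The fraction passed to `amdahl` is a hypothetical value of 0.10, chosen only because it gives an overall speedup close to the 6% quoted above; the slides do not state the fraction.

```python
def amdahl(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only part of the execution time is sped up."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Branch speedup from the slide: 3 cycles before, 0.15 * 3 + 0.85 * 1 after.
branch_speedup = 3 / (0.15 * 3 + 0.85 * 1)

# Hypothetical: branches account for 10% of execution time (assumption).
overall = amdahl(0.10, branch_speedup)

if __name__ == "__main__":
    print(f"branch speedup = {branch_speedup:.2f}, overall = {overall:.3f}")
```

Because only a small fraction of the time is affected, a speedup of about 2.3 on branches translates into an overall improvement of only a few percent.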