
Modulo15 RiscV DDCArv Ch7

The document discusses pipelining a RISC-V processor to improve its throughput. It describes how the processor is divided into 5 stages - fetch, decode, execute, memory, and writeback - with pipeline registers between each stage. This allows multiple instructions to be in different stages of processing simultaneously, improving instruction throughput compared to a single-cycle processor where only one instruction progresses at a time. An example is given comparing the processing of instructions over time in a single-cycle versus pipelined processor.

Chapter 7: Microarchitecture

Pipelined RISC-V Processor
• Temporal parallelism
• Divide single-cycle processor into 5 stages:
– Fetch
– Decode
– Execute
– Memory
– Writeback
• Add pipeline registers between stages
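
The overlap that the pipeline registers make possible can be sketched in a few lines of Python. This is only an illustration, assuming an ideal pipeline with no hazards or stalls. The function name `pipeline_schedule` is a hypothetical choice for the sketch, and the program is the six-instruction sequence from the pipeline abstraction slide.

```python
STAGES = ["Fetch", "Decode", "Execute", "Memory", "Writeback"]


def pipeline_schedule(instructions):
    """Return {cycle: {stage: instruction}} for an ideal pipeline (no hazards)."""
    schedule = {}
    for i, instr in enumerate(instructions):
        for s, stage in enumerate(STAGES):
            cycle = i + s + 1  # instruction i is fetched in cycle i + 1
            schedule.setdefault(cycle, {})[stage] = instr
    return schedule


program = [
    "lw s2, 40(s0)",
    "add s3, s9, s10",
    "sub s4, t1, s8",
    "and s5, s11, t0",
    "sw s6, 20(t4)",
    "or s7, t2, t3",
]
schedule = pipeline_schedule(program)

if __name__ == "__main__":
    for cycle in sorted(schedule):
        busy = ", ".join(f"{stage}: {instr}" for stage, instr in schedule[cycle].items())
        print(f"cycle {cycle:2d} | {busy}")
```

With six instructions and five stages, the last instruction leaves Writeback in cycle 10, and in cycles 5 and 6 all five stages are busy at once.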

102 Digital Design & Computer Architecture Microarchitecture


Single-Cycle vs. Pipelined Processor
[Figure: timing diagram with a time axis from 0 to 1500 ps. Each instruction passes through Fetch Instruction, Decode / Read Reg, Execute ALU, Memory Read / Write, and Wr Reg. In the single-cycle processor, instruction 2 starts only after instruction 1 has finished; in the pipelined processor, instructions 1, 2, and 3 overlap.]


Pipelined Processor Abstraction
[Figure: pipeline diagram over 10 cycles. Each instruction passes through IM (instruction memory), RF (register file read), the ALU, DM (data memory), and RF (register file write):
lw s2, 40(s0)
add s3, s9, s10
sub s4, t1, s8
and s5, s11, t0
sw s6, 20(t4)
or s7, t2, t3]


Single-Cycle & Pipelined Datapaths
[Figure: the single-cycle datapath (top) and the pipelined datapath (bottom). The pipelined datapath is divided into Fetch, Decode, Execute, Memory, and Writeback stages.]
• Signals in the pipelined processor are appended with the first letter of their stage (e.g., PCF, PCD, PCE).


Corrected Pipelined Datapath
[Figure: pipelined datapath in which the destination register Rd (Instr 11:7) is carried through the pipeline registers as RdD, RdE, RdM, and RdW.]

• Rd must arrive at same time as Result


• Register file written on falling edge of CLK



Pipelined Processor with Control
[Figure: pipelined datapath with the Control Unit. In the Decode stage the Control Unit takes op (Instr 6:0), funct3 (Instr 14:12), and funct7 bit 5 (Instr 30) and produces RegWriteD, ResultSrcD1:0, MemWriteD, JumpD, BranchD, ALUControlD2:0, ALUSrcD, and ImmSrcD1:0. RegWrite and ResultSrc are carried through to the Writeback stage, MemWrite to the Memory stage, and Jump, Branch, ALUControl, and ALUSrc to the Execute stage, where ZeroE and PCSrcE appear.]

• Same control unit as single-cycle processor
• Control signals travel with the instruction (drop off when used)

Pipelined Processor Hazards
• A hazard occurs when an instruction depends on the result of an earlier instruction that has not yet completed
• Types:
– Data hazard: register value not yet written back to the register file
– Control hazard: next instruction not yet decided (caused by a branch)


Data Hazard
[Figure: pipeline diagram over 8 cycles. add writes s8, and the three instructions that follow all read s8:
add s8, s4, s5
sub s2, s8, s3
or s9, t6, s8
and s7, s8, t2]


Handling Data Hazards
• Insert nops in code at compile time
• Rearrange code at compile time
• Forward data at run time
• Stall the processor at run time



Handling Data Hazards
• Insert enough nops for result to be ready
• Or move independent useful instructions forward
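
As a rough sketch of the compile-time approach, the following Python inserts nops between a producer and a consumer. It assumes a very simplified instruction format (destination first, then sources, as in R-type instructions) and a required gap of two instructions, which matches the two nops in this slide's example and relies on the register file being written on the falling edge of CLK. `parse` and `insert_nops` are hypothetical helper names, not part of any real toolchain.

```python
def parse(instr):
    """Split 'add s8, s4, s5' into (dest, [sources]); 'nop' has neither."""
    parts = instr.split(maxsplit=1)
    if len(parts) == 1:  # e.g. "nop": no operands
        return None, []
    regs = [r.strip() for r in parts[1].split(",")]
    return regs[0], regs[1:]


def insert_nops(program, gap=2):
    """Pad with nops so at least `gap` instructions separate producer and consumer."""
    out = []
    for instr in program:
        _, sources = parse(instr)
        # Check the nearest earlier instructions first: the closest producer
        # determines how many nops are needed.
        for back in range(1, gap + 1):
            if back > len(out):
                break
            dest, _ = parse(out[-back])
            if dest is not None and dest in sources:
                out.extend(["nop"] * (gap - back + 1))
                break
        out.append(instr)
    return out


program = [
    "add s8, s4, s5",
    "sub s2, s8, s3",
    "or s9, t6, s8",
    "and s7, s8, t2",
]

if __name__ == "__main__":
    for line in insert_nops(program):
        print(line)
```

A real compiler would first try to move independent instructions into the gap, as the second bullet suggests, and fall back to nops only when none are available.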
[Figure: pipeline diagram over 10 cycles, with two nops inserted between add and the first instruction that reads s8:
add s8, s4, s5
nop
nop
sub s2, s8, s3
or s9, t6, s8
and s7, s8, t2]


Data Forwarding
• Data is available on internal busses before it is written
back to the register file (RF).
• Forward data from internal busses to Execute stage.

[Figure: pipeline diagram over 8 cycles for the sequence add s8, s4, s5; sub s2, s8, s3; or s9, t6, s8; and s7, s8, t2, executed back to back without nops.]


Data Forwarding
• Check if source register in Execute stage matches
destination register of instruction in Memory or
Writeback stage.
• If so, forward result.
[Figure: pipeline diagram of the same four-instruction sequence over 8 cycles.]


Data Forwarding: Hazard Unit
[Figure: pipelined datapath with a Hazard Unit. The source register numbers Rs1 (Instr 19:15) and Rs2 (Instr 24:20) are carried from Decode to Execute as Rs1D/Rs1E and Rs2D/Rs2E, alongside RdD, RdE, RdM, and RdW. The Hazard Unit drives ForwardAE and ForwardBE, which control the three-input (00, 01, 10) forwarding multiplexers in front of SrcAE and SrcBE.]


Data Forwarding
• Case 1: Execute stage Rs1 or Rs2 matches Memory stage Rd?
Forward from Memory stage
• Case 2: Execute stage Rs1 or Rs2 matches Writeback stage Rd?
Forward from Writeback stage
• Case 3: Otherwise use value read from register file (as usual)

Equations for Rs1:


if ((Rs1E == RdM) AND RegWriteM) // Case 1
ForwardAE = 10
else if ((Rs1E == RdW) AND RegWriteW) // Case 2
ForwardAE = 01
else ForwardAE = 00 // Case 3

ForwardBE equations are similar (replace Rs1E with Rs2E)


Equations for Rs1, refined so that x0 (which is hardwired to zero) is never forwarded:


if ((Rs1E == RdM) AND RegWriteM) AND (Rs1E != 0) // Case 1
ForwardAE = 10
else if ((Rs1E == RdW) AND RegWriteW) AND (Rs1E != 0) // Case 2
ForwardAE = 01
else ForwardAE = 00 // Case 3

ForwardBE equations are similar (replace Rs1E with Rs2E)
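
The refined ForwardAE equations translate directly into a small function. This is a sketch in Python rather than hardware description code; the function name `forward_select` and the use of integer register numbers (0 to 31) are assumptions made for illustration. The same function serves for ForwardBE when called with Rs2E.

```python
def forward_select(rs_e, rd_m, reg_write_m, rd_w, reg_write_w):
    """Return the 2-bit forwarding mux select for one Execute-stage source register."""
    if rs_e != 0 and rs_e == rd_m and reg_write_m:
        return 0b10  # Case 1: forward from the Memory stage
    if rs_e != 0 and rs_e == rd_w and reg_write_w:
        return 0b01  # Case 2: forward from the Writeback stage
    return 0b00      # Case 3: use the value read from the register file


if __name__ == "__main__":
    S8 = 24  # s8 is register x24 in the standard RISC-V ABI
    # "sub s2, s8, s3" in Execute while "add s8, s4, s5" is in Memory
    print(format(forward_select(S8, S8, True, 0, False), "02b"))
```

Checking the Memory stage first matters: if both later stages hold a write to the same register, the Memory stage holds the more recent value.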


Data Hazard due to lw Dependency
[Figure: pipeline diagram over 8 cycles. The and instruction, which immediately follows the lw and reads s7, is marked "Trouble!":
lw s7, 40(s5)
and s8, s7, t3
or t2, s6, s7
sub s3, s7, s2]


Stalling to solve lw Data Dependency
[Figure: pipeline diagram over 9 cycles for the sequence lw s7, 40(s5); and s8, s7, t3; or t2, s6, s7; sub s3, s7, s2. The and instruction repeats its Decode (RF) stage and the or instruction repeats its Fetch (IM) stage, marked "Stall", so every instruction after the lw finishes one cycle later.]


Stalling Logic
• Is either source register in the Decode stage the
same as the destination register in the Execute
stage?
AND
• Is the instruction in the Execute stage a lw?

lwStall = ((Rs1D == RdE) OR (Rs2D == RdE)) AND ResultSrcE0


StallF = StallD = FlushE = lwStall

(Stall the Fetch and Decode stages, and flush the Execute stage.)
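
The load-use check can be sketched as a Python predicate. This is illustrative only: it assumes register numbers are plain integers and that bit 0 of ResultSrcE is set when the Execute-stage instruction is a lw, which is how the hazard unit diagram labels its input (ResultSrcE0). The function name `lw_stall` is hypothetical.

```python
def lw_stall(rs1_d, rs2_d, rd_e, result_src_e0):
    """True when the Decode-stage instruction reads the register a lw in Execute will load."""
    uses_loaded_reg = (rs1_d == rd_e) or (rs2_d == rd_e)
    return bool(result_src_e0) and uses_loaded_reg


if __name__ == "__main__":
    S7, T3 = 23, 28  # ABI names s7 and t3 are registers x23 and x28
    # "and s8, s7, t3" in Decode while "lw s7, 40(s5)" is in Execute
    stall = lw_stall(rs1_d=S7, rs2_d=T3, rd_e=S7, result_src_e0=1)
    print("StallF = StallD = FlushE =", stall)
```

Forwarding cannot cover this case because the loaded value does not exist until the end of the Memory stage, one cycle after the dependent instruction would need it in Execute.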



Stalling Hardware
[Figure: pipelined datapath with stalling hardware. The Hazard Unit now also takes ResultSrcE0 as an input and drives StallF and StallD, which connect to the enable (EN) inputs of the PC register and the Decode pipeline register, and FlushE, which connects to the clear (CLR) input of the Execute pipeline register, in addition to ForwardAE and ForwardBE.]


Pipelined Processor Control Hazards
• beq:
– The branch is not determined until the Execute stage of the pipeline
– Instructions after the branch are fetched before the branch outcome is known
– These 2 instructions must be flushed if the branch is taken


Control Hazards
[Figure: pipeline diagram over 10 cycles, with instruction addresses at left:
20 beq s1, s2, L1
24 sub s8, t1, s3
28 or s9, t6, s5
2C ...
58 L1: add s7, s3, s4
The sub and or instructions are marked "Flush these instructions".]

Branch misprediction penalty: the number of instructions flushed when a branch is taken (in this case, 2 instructions).


Control Hazards: Flushing Logic
• If branch is taken in execute stage, need to
flush the instructions in the Fetch and
Decode stages
– Do this by clearing Decode and Execute Pipeline
registers using FlushD and FlushE
• Equations:
FlushD = PCSrcE
FlushE = lwStall OR PCSrcE
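
Combining these flush equations with the earlier stall equations (StallF = StallD = lwStall), the hazard unit's stall and flush outputs can be sketched as one function. This is a minimal Python illustration; the name `hazard_controls` and the dictionary return type are assumptions, not the slides' hardware.

```python
def hazard_controls(lw_stall, pc_src_e):
    """Compute stall and flush outputs from the load-use and taken-branch conditions."""
    return {
        "StallF": lw_stall,              # hold the PC
        "StallD": lw_stall,              # hold the Decode pipeline register
        "FlushD": pc_src_e,              # discard the instruction just fetched
        "FlushE": lw_stall or pc_src_e,  # bubble after a load-use stall or a taken branch
    }


if __name__ == "__main__":
    print(hazard_controls(lw_stall=False, pc_src_e=True))
    print(hazard_controls(lw_stall=True, pc_src_e=False))
```

FlushE is shared by both hazards: a load-use stall needs a bubble in Execute, and a taken branch needs the instruction that was in Decode to be discarded as it moves to Execute.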



Control Hazards: Flushing Hardware
[Figure: pipelined datapath with flushing hardware. The Hazard Unit additionally drives FlushD, which connects to the clear (CLR) input of the Decode pipeline register, alongside FlushE, StallD, StallF, ForwardAE, and ForwardBE.]


RISC-V Pipelined Processor with Hazard Unit
[Figure: complete pipelined RISC-V processor: datapath, Control Unit, and Hazard Unit with outputs ForwardAE, ForwardBE, FlushD, FlushE, StallD, and StallF.]


Pipelined Performance

Pipelined Processor Performance Example
• SPECINT2000 benchmark:
– 25% loads
– 10% stores
– 13% branches
– 52% R-type
• Suppose:
– 40% of loads used by next instruction
– 50% of branches mispredicted
• What is the average CPI? (Ideally it’s 1, but…)
– Load CPI = 1 when not stalling, 2 when stalling
So, CPI_lw = 1(0.6) + 2(0.4) = 1.4
– Branch CPI = 1 when predicted correctly, 3 when mispredicted (2 instructions flushed)
So, CPI_beq = 1(0.5) + 3(0.5) = 2

Average CPI = (0.25)(1.4) + (0.1)(1) + (0.13)(2) + (0.52)(1) = 1.23
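
The same weighted average can be reproduced in a few lines of Python, which makes it easy to try other instruction mixes or misprediction rates. The dictionary keys and the function name `average_cpi` are illustrative choices; the numbers are the ones given above.

```python
def average_cpi(mix, cpi):
    """Weighted average CPI: sum over instruction classes of fraction * class CPI."""
    return sum(fraction * cpi[kind] for kind, fraction in mix.items())


mix = {"load": 0.25, "store": 0.10, "branch": 0.13, "rtype": 0.52}

cpi_load = 1 * 0.6 + 2 * 0.4    # 40% of loads stall for one cycle
cpi_branch = 1 * 0.5 + 3 * 0.5  # 50% of branches flush two instructions
cpi = {"load": cpi_load, "store": 1, "branch": cpi_branch, "rtype": 1}

avg = average_cpi(mix, cpi)

if __name__ == "__main__":
    print(f"Average CPI = {avg:.2f}")
```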



Pipelined Processor Performance Example
Pipelined processor critical path:
Tc_pipelined = max of
  Fetch:     tpcq + tmem + tsetup
  Decode:    2(tRFread + tsetup)
  Execute:   tpcq + 4tmux + tALU + tAND-OR + tsetup
  Memory:    tpcq + tmem + tsetup
  Writeback: 2(tpcq + tmux + tRFsetup)

• Decode and Writeback stages both use the register file in each cycle
• So each stage gets half of the cycle time (Tc/2) to do their work
• Or, stated a different way, 2x of their work must fit in a cycle (Tc)



Pipelined Critical Path: Execute Stage
[Figure: complete pipelined processor datapath with Control Unit and Hazard Unit, illustrating the critical path through the Execute stage.]


Pipelined Performance Example
Element                   Parameter   Delay (ps)
Register clock-to-Q       tpcq        40
Register setup            tsetup      50
Multiplexer               tmux        30
AND-OR gate               tAND-OR     20
ALU                       tALU        120
Decoder (Control Unit)    tdec        25
Extend unit               t_ext       35
Memory read               tmem        200
Register file read        tRFread     100
Register file setup       tRFsetup    60
Tc_pipelined = tpcq + 4tmux + tALU + tAND-OR + tsetup
= (40 + 4*30 + 120 + 20 + 50) ps = 350 ps
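
To check that Execute really is the limiting stage, all five stage expressions can be evaluated with the delays from the table. This is a minimal Python sketch, assuming the register-file term in the Writeback expression is the table's register file setup time (60 ps); the dictionary keys are just illustrative names for the table's parameters.

```python
delays_ps = {
    "tpcq": 40, "tsetup": 50, "tmux": 30, "tAND_OR": 20, "tALU": 120,
    "tmem": 200, "tRFread": 100, "tRFsetup": 60,
}


def stage_delays(d):
    """Per-stage delay in ps; Decode and Writeback each get only half a cycle."""
    return {
        "Fetch": d["tpcq"] + d["tmem"] + d["tsetup"],
        "Decode": 2 * (d["tRFread"] + d["tsetup"]),
        "Execute": d["tpcq"] + 4 * d["tmux"] + d["tALU"] + d["tAND_OR"] + d["tsetup"],
        "Memory": d["tpcq"] + d["tmem"] + d["tsetup"],
        # Assumption: the register-file write term is tRFsetup from the table.
        "Writeback": 2 * (d["tpcq"] + d["tmux"] + d["tRFsetup"]),
    }


stages = stage_delays(delays_ps)
critical_stage = max(stages, key=stages.get)
tc_pipelined_ps = stages[critical_stage]

if __name__ == "__main__":
    for name, delay in stages.items():
        print(f"{name:10s} {delay} ps")
    print(f"Tc_pipelined = {tc_pipelined_ps} ps ({critical_stage})")
```

Under these delays Execute is the slowest stage at 350 ps, followed by Decode at 300 ps, so the cycle time is set by the Execute stage.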
Pipelined Performance Example
Program with 100 billion instructions
Execution Time = (# instructions) × CPI × Tc
= (100 × 10^9)(1.23)(350 × 10^-12)
= 43 seconds
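
The arithmetic can be checked with a short Python sketch; the function name `execution_time` is an illustrative choice, and the inputs are the instruction count, CPI, and cycle time used above.

```python
def execution_time(instructions, cpi, cycle_time_s):
    """Execution time in seconds = (# instructions) x CPI x Tc."""
    return instructions * cpi * cycle_time_s


t_pipelined = execution_time(instructions=100e9, cpi=1.23, cycle_time_s=350e-12)

if __name__ == "__main__":
    print(f"Execution time = {t_pipelined:.2f} s")
```

The unrounded product is 43.05 seconds, which the slide reports as 43 seconds.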



Processor Performance Comparison
Processor      Execution Time (seconds)   Speedup (single-cycle as baseline)
Single-cycle   75                         1
Multicycle     155                        0.5
Pipelined      43                         1.7
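
The speedup column is the baseline execution time divided by each processor's execution time. A small Python sketch of that calculation, using the times from the table (the function name `speedups` is hypothetical):

```python
times_s = {"Single-cycle": 75, "Multicycle": 155, "Pipelined": 43}


def speedups(times, baseline):
    """Speedup of each processor relative to the baseline processor."""
    base = times[baseline]
    return {name: base / t for name, t in times.items()}


result = speedups(times_s, baseline="Single-cycle")

if __name__ == "__main__":
    for name, s in result.items():
        print(f"{name:13s} {s:.1f}")
```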
