Lecture-10-pre

Uploaded by

John
CPEN 411: Computer Architecture

In-order commit
and multiple issue
The time is out of joint — O cursèd spite,


That ever I was born to set it right!
The Plan

• Exceptions and exception support

• Speculative execution

• Reorder buffer

• Multiple issue
Learning objectives
• Describe the scenarios where out-of-order instruction
commit is not desirable for CPUs

• Describe exception types and the in-order implementation

• Explain how a reorder buffer ensures in-order commit

• Emulate execution in a superscalar processor with register renaming and a reorder buffer

• Discuss relevant architectural tradeoffs


Why would we flush the pipeline(s)?

• Mispredicted branch

                     Clock Number
                     1  2  3  4  5  6  7  8
  TAKEN branch       F  D  X  M  W
  branch instr. +1      F  D  X  M  W
  branch instr. +2         F  D  X  M  W
  branch target               F  D  X  M  W

• External interrupt (key press)
• Program fault (div. by 0, segfault)
• Trap (debugger breakpoint)
• OS context switch (timer)
• Page fault (unmapped virt. addr.)

The last five events are all exceptions.
Exceptions

• Events detected in µarch at execution time

• Require transfer of control to an OS interrupt service routine

• May need to resume the original program as if nothing had happened
Precise exceptions
If architectural state is updated by all instructions before the faulting instruction and by none after (and including) the faulting instruction, then the pipeline supports precise exceptions (not all architectures do).

[Figure: instruction stream containing a div, with a legend marking each instruction as "has updated architectural state" or "has not updated architectural state".]
Exceptions in an in-order pipeline

when exception strikes:

1. switch instruction fetch to exception handler

2. convert instructions after and including the "faulting instruction" to nops (e.g., turn off write enable control bits)

3. remember PC of faulting instruction
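
The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a model of any real pipeline: the `Slot` class, the `take_exception` helper, and the handler address `0x80000180` are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    """One in-flight instruction in an in-order pipeline (hypothetical model)."""
    pc: int
    write_enable: bool = True  # cleared when the instruction is turned into a nop


def take_exception(pipeline, faulting_index, handler_pc):
    """pipeline is ordered oldest first. Returns (next_fetch_pc, saved_pc)."""
    saved_pc = pipeline[faulting_index].pc   # step 3: remember PC of faulting instruction
    for slot in pipeline[faulting_index:]:   # step 2: squash faulting instr. and younger ones
        slot.write_enable = False
    return handler_pc, saved_pc              # step 1: fetch continues at the handler


# Four instructions in flight; the second one (PC 0x04) faults.
pipe = [Slot(0x00), Slot(0x04), Slot(0x08), Slot(0x0C)]
next_pc, epc = take_exception(pipe, faulting_index=1, handler_pc=0x80000180)
enables = [s.write_enable for s in pipe]
```

Only the instruction older than the fault keeps its write enable, which is exactly the precise-exception condition: everything before the faulting instruction updates state, nothing from it onward does.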


Exception handling support

[Figure: pipeline datapath with exception support. Labels: add instruction causes overflow; save PC of instruction causing fault, plus cause of fault; exception handler address; "nop"/"squash" injection signals.]

[Figure: pipeline after the exception is taken. Labels: cancelled instructions; first instruction of handler.]
Which of these is not a precise exception?

A: division by 0
B: segmentation fault
C: context switch
D: file does not exist ✓
E: all are precise exceptions

So far

• Precise exceptions

• Implementation in in-order pipeline

• Exceptions and out-of-order commit

• Next: Speculative execution and the Reorder Buffer
Precise exceptions are hard with out-of-order (OoO) commit because...

A: it's unclear which instruction caused the exception
B: program has been reordered in instruction memory
C: RF may have values computed by later instructions ✓
D: another thread may have overwritten the registers
E: none of the above
Speculative Execution

• we know how to predict branches as not taken

                     Clock Number
                     1  2  3  4  5  6  7  8
  TAKEN branch       F  D  X  M  W
  branch instr. +1      F  D  X  M  W
  branch instr. +2         F  D  X  M  W
  branch target               F  D  X  M  W

• but most branches are taken (think loops)
• often compiler (or HW) can predict outcome
• would like a way to cancel mispredicted path (similar to a precise exception)
• capability known as speculative execution
Speculative Execution
DIVD R3,R1,R2 ; F:1, D:2, I:3, X:4, W:104
BEQZ R3,Label ; F:2, D:3, I:4, X:105 (“not taken”)
… branch predicted “taken” on cycle 2
Label: DMUL R4,R4,R2 ; F:3, D:4, I:5, X:6, W:?

Goal: Support execution of instructions fetched following a branch prediction before we know if the prediction is correct. Above: we want to execute and broadcast the result of DMUL "speculatively", long before the branch is resolved on cycle 105.

Problem: Such instructions should not update the register file, because the branch might have been predicted incorrectly. Above: we cannot let DMUL write to the register file until the control hazard is resolved.
Assume Tomasulo algorithm, branch predicted taken. This enables the DMUL to start execution before the correct outcome of the branch is known.
"S:N" means in stage S on cycle N; F=fetch, D=decode, I=issue, X=execute begin, W=write result.
BEQZ is resolved in "X". DMUL takes 10 cycles to execute.

Can DMUL write to the common data bus on cycle 16, or does it need to wait until the correct branch outcome is known?

A: Write on clock cycle 16
B: Wait until correct branch outcome known on clock cycle 105 ✓
C: Not sure
Completion and commit

• insight: problem is writing to RF, not finishing the computation early

• idea: compute & buffer the result but delay the RF write until prior instructions guaranteed to retire (not speculative, no exceptions)

• split writeback into completion and commit

– completion = buffer computed result
– commit = write result to RF
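
The split between completion and commit can be illustrated with a small Python sketch. It assumes a FIFO of in-flight instructions and ignores speculation and exceptions entirely; the `CommitBuffer` class and its method names are hypothetical, introduced only for this example.

```python
from collections import deque


class CommitBuffer:
    """Buffers results in program order and writes the RF only at commit."""

    def __init__(self, regs):
        self.regs = regs        # architectural register file (dict: name -> value)
        self.entries = deque()  # in-flight instructions, oldest on the left

    def issue(self, dst):
        entry = {"dst": dst, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value):
        # completion: buffer the computed result; the RF is not touched
        entry["value"] = value
        entry["done"] = True

    def commit(self):
        # commit: retire finished instructions, but only from the oldest end
        retired = 0
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            self.regs[entry["dst"]] = entry["value"]
            retired += 1
        return retired


regs = {"R3": 0, "R4": 7}
buf = CommitBuffer(regs)
div = buf.issue("R3")      # older, slow instruction
mul = buf.issue("R4")      # younger, fast instruction
buf.complete(mul, 99)      # finishes first
early = buf.commit()       # nothing retires: the older div is not done
r4_before = regs["R4"]     # RF still holds the old R4
buf.complete(div, 42)
late = buf.commit()        # both retire, in program order
```

The younger instruction finishes its computation early, but its result stays in the buffer until everything older has retired.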
The Reorder Buffer (ROB)

• stores instrs in program order (a FIFO)


• buffers computed results before RF write
• commits in program order

[Figure: Tomasulo datapath with a reorder buffer. Labels: Inst. Queue, Issue, Res Stations (Adder, Multiplier), execute, Complete, Reorder Buffer, Commit, Regs.]
The Reorder Buffer (ROB)

• tracks execution status for each instruction

  Fields per entry: busy? | sent to FU? | finished? | PC | dst reg | speculative? | valid? | value

• issue stage allocates RS and ROB entry

– may or may not use ROB entry to hold computed value
– may or may not use ROB entry ID as renaming tag
Tomasulo With Reorder Buffer

Cycle 3: DIVD issued, allocates ROB and Reservation Station entries, updates Register alias table

Reorder Buffer (ROB1 oldest, ROB7 newest; ROB2 to ROB7 empty):
  Entry  Dest  Value  Program Counter            Done?
  ROB1   R3           PC=0x00 (DIVD R3,R1,R2)    N

RAT+Registers:
  R4  Regs[R4]
  R3  ROB1
  R2  Regs[R2]
  R1  Regs[R1]

Reservation Stations (adder, multipliers, branch):
  multipliers:  Dest 1   DIVD Regs[R1],Regs[R2]

In the reservation station for DIVD, what does "1" represent?

A: R1
B: ROB1
Tomasulo With Reorder Buffer

Cycle 4: BEQZ issued, allocates ROB and Reservation Station entries

Reorder Buffer (ROB1 oldest, ROB7 newest; ROB3 to ROB7 empty):
  Entry  Dest  Value  Program Counter            Done?
  ROB2   -     -      PC=0x04 (BEQZ R3,Loop)     N
  ROB1   R3           PC=0x00 (DIVD R3,R1,R2)    N

RAT+Registers:
  R4  Regs[R4]
  R3  ROB1
  R2  Regs[R2]
  R1  Regs[R1]

Reservation Stations (adder, multipliers, branch):
  multipliers:  Dest 1   DIVD Regs[R1],Regs[R2]
  branch:                BEQZ ROB1,Loop
Tomasulo With Reorder Buffer

Cycle 5: DMUL issued, allocates ROB and Reservation Station entries, updates Register alias table

Reorder Buffer (ROB1 oldest, ROB7 newest; ROB4 to ROB7 empty):
  Entry  Dest  Value  Program Counter            Done?
  ROB3   R4           PC=0x20 (DMUL R4,R4,R2)    N
  ROB2   -     -      PC=0x04 (BEQZ R3,Loop)     N
  ROB1   R3           PC=0x00 (DIVD R3,R1,R2)    N

RAT+Registers:
  R4  ROB3
  R3  ROB1
  R2  Regs[R2]
  R1  Regs[R1]

Reservation Stations (adder, multipliers, branch):
  multipliers:  Dest 1   DIVD Regs[R1],Regs[R2]
  multipliers:  Dest 3   DMUL Regs[R4],Regs[R2]
  branch:                BEQZ ROB1,Loop
Tomasulo With Reorder Buffer

Cycle 6: DADD issued, allocates ROB and Reservation Station entries, updates Register alias table

Reorder Buffer (ROB1 oldest, ROB7 newest; ROB5 to ROB7 empty):
  Entry  Dest  Value  Program Counter            Done?
  ROB4   R4           PC=0x24 (DADD R4,R4,R1)    N
  ROB3   R4           PC=0x20 (DMUL R4,R4,R2)    N
  ROB2   -     -      PC=0x04 (BEQZ R3,Loop)     N
  ROB1   R3           PC=0x00 (DIVD R3,R1,R2)    N

RAT+Registers:
  R4  ROB4
  R3  ROB1
  R2  Regs[R2]
  R1  Regs[R1]

Reservation Stations (adder, multipliers, branch):
  adder:        Dest 4   DADD ROB3,Regs[R1]
  multipliers:  Dest 1   DIVD Regs[R1],Regs[R2]
  multipliers:  Dest 3   DMUL Regs[R4],Regs[R2]
  branch:                BEQZ ROB1,Loop
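
The issue steps of cycles 3 to 6 can be replayed with a short Python sketch. It assumes the ROB entry ID is used as the renaming tag (one of the two options the slides mention) and models only allocation and renaming, not the reservation stations or execution. `RobEntry` and `issue` are hypothetical names for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:
    pc: int
    dst: Optional[str]            # None for instructions that produce no register value
    value: Optional[int] = None
    finished: bool = False


def issue(rob, rat, pc, dst, srcs):
    """Allocate a ROB entry and rename; returns (tag, renamed source operands)."""
    # Read the source mappings BEFORE updating the RAT for the destination,
    # so that e.g. DMUL R4,R4,R2 reads the old R4.
    operands = [rat.get(s, f"Regs[{s}]") for s in srcs]
    rob.append(RobEntry(pc, dst))
    tag = f"ROB{len(rob)}"
    if dst is not None:
        rat[dst] = tag            # later readers of dst wait on this ROB entry
    return tag, operands


rob, rat = [], {}
t1, ops1 = issue(rob, rat, 0x00, "R3", ["R1", "R2"])  # cycle 3: DIVD R3,R1,R2
t2, ops2 = issue(rob, rat, 0x04, None, ["R3"])        # cycle 4: BEQZ R3,Loop
t3, ops3 = issue(rob, rat, 0x20, "R4", ["R4", "R2"])  # cycle 5: DMUL R4,R4,R2
t4, ops4 = issue(rob, rat, 0x24, "R4", ["R4", "R1"])  # cycle 6: DADD R4,R4,R1
```

After the fourth call the RAT maps R3 to ROB1 and R4 to ROB4, and the DADD waits on ROB3 for its first operand, matching the cycle 6 state shown above.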
Tomasulo With Reorder Buffer

Cycle 16: DMUL writes to CDB, DADD and ROB get value, but register file not updated

Common data bus: ROB3, X

Reorder Buffer (ROB1 oldest, ROB7 newest; ROB5 to ROB7 empty):
  Entry  Dest  Value  Program Counter            Done?
  ROB4   R4           PC=0x24 (DADD R4,R4,R1)    N
  ROB3   R4    X      PC=0x20 (DMUL R4,R4,R2)    Y
  ROB2   -     -      PC=0x04 (BEQZ R3,Loop)     N
  ROB1   R3           PC=0x00 (DIVD R3,R1,R2)    N

RAT+Registers:
  R4  ROB4
  R3  ROB1
  R2  Regs[R2]
  R1  Regs[R1]

Reservation Stations (adder, multipliers, branch):
  adder:        Dest 4   DADD X,Regs[R1]
  multipliers:  Dest 1   DIVD Regs[R1],Regs[R2]
  branch:                BEQZ ROB1,Loop
Tomasulo With Reorder Buffer

Cycle 17: DADD writes to CDB, ROB gets value, but register file not updated

Common data bus: ROB4, Y

Reorder Buffer (ROB1 oldest, ROB7 newest; ROB5 to ROB7 empty):
  Entry  Dest  Value  Program Counter            Done?
  ROB4   R4    Y      PC=0x24 (DADD R4,R4,R1)    Y
  ROB3   R4    X      PC=0x20 (DMUL R4,R4,R2)    Y
  ROB2   -     -      PC=0x04 (BEQZ R3,Loop)     N
  ROB1   R3           PC=0x00 (DIVD R3,R1,R2)    N

RAT+Registers:
  R4  ROB4
  R3  ROB1
  R2  Regs[R2]
  R1  Regs[R1]

Reservation Stations (adder, multipliers, branch):
  multipliers:  Dest 1   DIVD Regs[R1],Regs[R2]
  branch:                BEQZ ROB1,Loop
Tomasulo With Reorder Buffer

Cycle 104: DIVD writes to CDB, BEQZ and ROB get value

Common data bus: ROB1, "42"

Reorder Buffer (ROB1 oldest, ROB7 newest; ROB5 to ROB7 empty):
  Entry  Dest  Value  Program Counter            Done?
  ROB4   R4    Y      PC=0x24 (DADD R4,R4,R1)    Y
  ROB3   R4    X      PC=0x20 (DMUL R4,R4,R2)    Y
  ROB2   -     -      PC=0x04 (BEQZ R3,Loop)     N
  ROB1   R3    "42"   PC=0x00 (DIVD R3,R1,R2)    Y

RAT+Registers:
  R4  ROB4
  R3  ROB1
  R2  Regs[R2]
  R1  Regs[R1]

Reservation Stations (adder, multipliers, branch):
  branch:                BEQZ ROB1,Loop
Tomasulo With Reorder Buffer

Cycle 105: Register R3 updated (ROB1 released). BEQZ resolved as "not taken": flushes ROB and resets RAT

Reorder Buffer (ROB1 oldest, ROB7 newest; ROB5 to ROB7 empty):
  Entry  Dest  Value  Program Counter            Done?
  ROB4   R4    Y      PC=0x24 (DADD R4,R4,R1)    Y      (flush)
  ROB3   R4    X      PC=0x20 (DMUL R4,R4,R2)    Y      (flush)
  ROB2   -     -      PC=0x04 (BEQZ R3,Loop)     N
  ROB1   R3    "42"   PC=0x00 (DIVD R3,R1,R2)    Y      (released)

RAT+Registers:
  R4  Regs[R4]
  R3  "42"
  R2  Regs[R2]
  R1  Regs[R1]

Reservation Stations (adder, multipliers, branch): empty

Since the branch was mispredicted, should we flush it and execute it again too?

A: Yes
B: No ✓
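
The cycle 105 events (commit of the oldest entry, then a flush of everything younger than the mispredicted branch) can be sketched as follows. This is an illustrative model only: the dictionary layout, the helper names `commit_head` and `flush_after`, and the register values (R1=84, R2=2, R4=5) are assumptions made for the example. Rebuilding the RAT from the surviving entries is one simple recovery policy, not necessarily the one the lecture has in mind.

```python
def commit_head(rob, rat, regs):
    """Retire the oldest ROB entry if it has finished; write its value to the RF."""
    if not rob or not rob[0]["done"]:
        return False
    head = rob.pop(0)
    if head["dst"] is not None:
        regs[head["dst"]] = head["value"]
        if rat.get(head["dst"]) == head["tag"]:
            del rat[head["dst"]]      # the name refers to the register file again
    return True


def flush_after(rob, rat, branch_tag):
    """Squash every entry younger than the branch; the branch itself is kept."""
    idx = next(i for i, e in enumerate(rob) if e["tag"] == branch_tag)
    squashed = [e["tag"] for e in rob[idx + 1:]]
    del rob[idx + 1:]
    rat.clear()                       # reset the RAT ...
    for e in rob:                     # ... then re-map destinations of surviving entries
        if e["dst"] is not None:
            rat[e["dst"]] = e["tag"]
    return squashed


regs = {"R1": 84, "R2": 2, "R3": 0, "R4": 5}
rob = [
    {"tag": "ROB1", "dst": "R3", "value": 42, "done": True},   # DIVD R3,R1,R2
    {"tag": "ROB2", "dst": None, "value": None, "done": True}, # BEQZ, just resolved
    {"tag": "ROB3", "dst": "R4", "value": 10, "done": True},   # DMUL (speculative)
    {"tag": "ROB4", "dst": "R4", "value": 94, "done": True},   # DADD (speculative)
]
rat = {"R3": "ROB1", "R4": "ROB4"}

committed = commit_head(rob, rat, regs)   # R3 <- 42, ROB1 released
squashed = flush_after(rob, rat, "ROB2")  # wrong-path DMUL and DADD discarded
```

The speculative results for R4 never reach the register file, and the branch is not squashed: its outcome is now known, so there is nothing to re-execute.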
Some ROB challenges

• modern high-performance CPU: ~8-way issue


– max. how many deqs from ROB per cycle?
– max. how many enqs to ROB per cycle?
– max. how many new results in ROB per cycle?

• problem: multi-ported SRAMs expensive

• observation: not all instructions produce a value


– which instructions don’t?
– how frequent is this?
SUPERSCALAR EXECUTION
The 1 CPI barrier
Pipeline CPI = Ideal CPI
+ Structural stalls
+ Data hazard stalls
+ Control hazard stalls

We know how to remove/reduce stalls: pipelining, forwarding, out-of-order execution, speculative execution...

Can we reduce “Ideal CPI”?


• Ideal CPI limited by issue rate: 1 instruction per cycle
• Can achieve CPI < 1 by issuing more than one instruction per cycle
• Convenient to speak about average instructions per cycle (IPC = 1/CPI)
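
As a small worked example of the CPI equation, the Python snippet below compares a single-issue and a dual-issue machine. The stall contributions (0.25 cycles per instruction each for data and control hazards) are made-up numbers chosen purely for illustration, and the function names are hypothetical.

```python
def pipeline_cpi(ideal_cpi, structural, data, control):
    """Pipeline CPI = ideal CPI + stall cycles per instruction from each hazard type."""
    return ideal_cpi + structural + data + control


def ipc(cpi):
    """Instructions per cycle is the reciprocal of CPI."""
    return 1.0 / cpi


# Assumed stall contributions, identical for both machines.
scalar_cpi = pipeline_cpi(ideal_cpi=1.0, structural=0.0, data=0.25, control=0.25)
dual_cpi = pipeline_cpi(ideal_cpi=0.5, structural=0.0, data=0.25, control=0.25)

scalar_ipc = ipc(scalar_cpi)
dual_ipc = ipc(dual_cpi)
```

Under these assumptions the scalar machine has CPI 1.5 and the dual-issue machine has CPI 1.0. Halving the ideal CPI does not halve the total, because the stall terms are unchanged; in practice wider issue often makes them larger.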
In-order Superscalar Pipeline

• in-order superscalar used today in some low-power processors

• why might you prefer this to full Tomasulo?

[Figure: pipeline diagrams of the 486 and the P5 (Pentium).]
Statically Scheduled Superscalar
• issue up to N instructions per cycle
• at issue:
– check each instruction for hazards with co-issued instructions that are earlier in program order
– also against all earlier instructions still in execution

[Figure: three co-issued instructions in program order, each with a target and two source register fields; arrows mark the register comparisons.]

above: comparisons to avoid RAW, WAR, and WAW hazards in a 3-wide in-order superscalar.

The red and blue arrows check which hazard?

A: RAW
B: WAW
C: WAR
D: A and B
E: B and C ✓
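
The pairwise comparisons among co-issued instructions can be sketched in Python. This is a simplified illustration that assumes every instruction has one target and two source registers; the `Instr` tuple and both function names are hypothetical, and the check against older instructions still in execution is left out.

```python
from collections import namedtuple

Instr = namedtuple("Instr", "tgt src1 src2")


def hazards(earlier, later):
    """Register hazards of `later` with respect to `earlier` (program order)."""
    found = set()
    if earlier.tgt in (later.src1, later.src2):
        found.add("RAW")   # later reads what earlier writes
    if later.tgt in (earlier.src1, earlier.src2):
        found.add("WAR")   # later writes what earlier reads
    if later.tgt == earlier.tgt:
        found.add("WAW")   # both write the same register
    return found


def check_group(group):
    """Compare each instruction with every co-issued instruction earlier in program order."""
    report = {}
    for j, later in enumerate(group):
        for i in range(j):
            found = hazards(group[i], later)
            if found:
                report[(i, j)] = found
    return report


group = [
    Instr("R1", "R2", "R3"),   # instruction 0
    Instr("R4", "R1", "R5"),   # instruction 1 reads R1 written by 0
    Instr("R2", "R6", "R7"),   # instruction 2 writes R2 read by 0
]
report = check_group(group)
```

For an N-wide group this performs N(N-1)/2 instruction pairings, each involving several register comparisons, which is one reason the checking logic grows quickly with issue width.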
2-Issue In-order Superscalar

                       Clock Number
                       1   2   3   4   5   6   7   8
  integer instruction  IF  ID  EX  MEM WB
  FP instruction       IF  ID  EX  EX  EX  WB
  integer instruction      IF  ID  EX  MEM WB
  FP instruction           IF  ID  EX  EX  EX  WB
  integer instruction          IF  ID  EX  MEM WB
  FP instruction               IF  ID  EX  EX  EX  WB

• may have issue restrictions:

– e.g., one integer and one FP op (common, e.g., HP PA-7100)
– e.g., simple vs complex (e.g., P5 V pipe only for ALU, stack ops, and near jumps)

• challenge maintaining peak throughput: load delays, branch delays

Assuming the branch is resolved in execute, what is the branch penalty measured in number of instructions squashed/flushed on a branch misprediction?

A: 1 instruction
B: 2-3 instructions
C: 4-5 instructions ✓
D: 6-7 instructions
E: Not sure


Multiple Issue w/ Dynamic Scheduling

• extension of Tomasulo's algorithm

• key challenge: rename, assign RS, and update pipeline control tables for multiple instructions

• two approaches to overcome this challenge:

– run this at higher clock speed (e.g. 2× freq for 2-wide)
– add logic to rename multiple instructions at once
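
The second approach, renaming several instructions at once, needs extra logic because a younger instruction in the group may read a register that an older instruction in the same group writes. A minimal Python sketch of that idea follows. The tag format `T<n>`, the function `rename_group`, and the tuple layout are hypothetical; the loop models what hardware would do with parallel comparators in a single cycle.

```python
def rename_group(rat, group, next_tag):
    """Rename co-issued instructions as if in one cycle.

    group: list of (dst, [srcs]) in program order; dst may be None.
    Returns a list of (new_tag, renamed_sources).
    """
    old_rat = dict(rat)  # every instruction sees the RAT as it was at the start of the cycle
    renamed = []
    for k, (dst, srcs) in enumerate(group):
        ops = []
        for s in srcs:
            # Intra-group bypass: the youngest older group member writing s wins.
            producer = None
            for e in range(k - 1, -1, -1):
                if group[e][0] == s:
                    producer = e
                    break
            if producer is not None:
                ops.append(f"T{next_tag + producer}")
            else:
                ops.append(old_rat.get(s, s))   # fall back to the RAT, else the register itself
        renamed.append((f"T{next_tag + k}", ops))
    for k, (dst, _) in enumerate(group):        # RAT update: last writer in the group wins
        if dst is not None:
            rat[dst] = f"T{next_tag + k}"
    return renamed


rat = {}
group = [
    ("R4", ["R4", "R2"]),   # DMUL R4,R4,R2
    ("R4", ["R4", "R1"]),   # DADD R4,R4,R1 depends on the DMUL in the same group
]
renamed = rename_group(rat, group, next_tag=3)
```

Here the DADD's first operand is renamed to the DMUL's tag even though the RAT has not yet been updated, and after the cycle the RAT maps R4 to the DADD's tag only.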
Tomasulo Organization (with two-wide issue)

[Figure: Tomasulo organization extended for two-wide issue. Labels: From Mem; FP Op Queue; FP Registers; Load Buffers (Load1 to Load6); Store Buffers; To Mem; Reservation Stations (Add1 to Add3, Mult1 and Mult2); FP adders; FP multipliers; "2x / cycle" annotation; Common Data Bus 1; Common Data Bus 2.]
Example
Loop: L.D    F0,0(R1)    ; F0 = array element
      ADD.D  F4,F0,F2    ; add scalar in F2
      S.D    F4,0(R1)    ; store result
      DADDIU R1,R1,#-8   ; decrement pointer 8 bytes (per DW)
      BNE    R1,R2,Loop  ; branch R1 != R2

Assumptions:
(1) 2-wide issue
(2) Infinite number of reservation stations
(3) Perfect branch prediction (let’s do three iterations of loop)
(4) Issue instruction from target of taken branch 1 cycle after branch (due to fetch restrictions)
(5) One Integer Unit (handles load/store effective address calculation)
(6) Separate FP Function Unit for each type of FP operation
(7) Issue and Write Results take one cycle
(8) Latencies: One cycle for integer ALU; two cycles for loads; three cycles for FP adds
(9) Two CDBs
(10) Load/Store effective address calculation decoupled from memory access
(11) NO Reorder Buffer
Example: 2-issue w/ Tomasulo

Resource usage:

1. Mark relevant dependencies (true deps)


2. Each clock cycle: Start from oldest instruction, working toward younger
instructions and attempt to apply scheduling algorithm + assumptions
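
Step 1 (marking true dependencies) can be done mechanically. The Python sketch below finds the RAW dependencies in the example loop. It assumes the store's base register is R1, as in the load, and the tuple encoding of each instruction as (opcode, destination, sources) is a convention invented for this example.

```python
# (opcode, destination register or None, source registers)
loop = [
    ("L.D",    "F0", ["R1"]),
    ("ADD.D",  "F4", ["F0", "F2"]),
    ("S.D",    None, ["F4", "R1"]),   # assumed base register R1
    ("DADDIU", "R1", ["R1"]),
    ("BNE",    None, ["R1", "R2"]),
]


def true_deps(instrs):
    """Return (producer_index, consumer_index, register) for each RAW dependence."""
    last_writer = {}
    deps = []
    for i, (_, dst, srcs) in enumerate(instrs):
        for s in srcs:
            if s in last_writer:
                deps.append((last_writer[s], i, s))
        if dst is not None:
            last_writer[dst] = i      # record the writer after reading the sources
    return deps


deps_one_iteration = true_deps(loop)
deps_two_iterations = true_deps(loop * 2)   # exposes the loop-carried dependence on R1
```

Within one iteration the chain is L.D to ADD.D (F0), ADD.D to S.D (F4), and DADDIU to BNE (R1). Unrolling twice additionally shows that the second iteration's L.D depends on the first iteration's DADDIU through R1.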
Example: 2-issue w/ Tomasulo

Resource usage:

1. Mark relevant dependencies (true deps)


2. Each clock cycle: Start from oldest instruction, working toward younger
instructions and attempt to apply scheduling algorithm + assumptions
Example: 2-issue w/ Tomasulo

Resource usage:

1. Mark relevant dependencies (true deps)


2. Each clock cycle: Start from oldest instruction, working toward younger
instructions and attempt to apply scheduling algorithm + assumptions
Example: 2-issue w/ Tomasulo

Resource usage:

1. Mark relevant dependencies (true deps)


2. Each clock cycle: Start from oldest instruction, working toward younger
instructions and attempt to apply scheduling algorithm + assumptions
Example: 2-issue w/ Tomasulo

Resource usage:

1. Mark relevant dependencies (true deps)


2. Each clock cycle: Start from oldest instruction, working toward younger
instructions and attempt to apply scheduling algorithm + assumptions
Example: 2-issue w/ Tomasulo

Resource usage:

1. Mark relevant dependencies (true deps)


2. Each clock cycle: Start from oldest instruction, working toward younger
instructions and attempt to apply scheduling algorithm + assumptions
Example: 2-issue w/ Tomasulo

Resource usage:

1. Mark relevant dependencies (true deps)


2. Each clock cycle: Start from oldest instruction, working toward younger
instructions and attempt to apply scheduling algorithm + assumptions
Example: 2-issue w/ Tomasulo

Resource usage:

1. Mark relevant dependencies (true deps)


2. Each clock cycle: Start from oldest instruction, working toward younger
instructions and attempt to apply scheduling algorithm + assumptions
Example: 2-issue w/ Tomasulo

Resource usage:

1. Mark relevant dependencies (true deps)


2. Each clock cycle: Start from oldest instruction, working toward younger
instructions and attempt to apply scheduling algorithm + assumptions
Example: 2-issue w/ Tomasulo

Resource usage:

1. Mark relevant dependencies (true deps)


2. Each clock cycle: Start from oldest instruction, working toward younger
instructions and attempt to apply scheduling algorithm + assumptions
Example: 2-issue w/ Tomasulo

Resource usage:

1. Mark relevant dependencies (true deps)


2. Each clock cycle: Start from oldest instruction, working toward younger
instructions and attempt to apply scheduling algorithm + assumptions
Example: 2-issue w/ Tomasulo

Resource usage:

1. Mark relevant dependencies (true deps)


2. Each clock cycle: Start from oldest instruction, working toward younger
instructions and attempt to apply scheduling algorithm + assumptions
Example: 2-issue w/ Tomasulo

Resource usage:

1. Mark relevant dependencies (true deps)


2. Each clock cycle: Start from oldest instruction, working toward younger
instructions and attempt to apply scheduling algorithm + assumptions
Example: 2-issue w/ Tomasulo

Resource usage:

1. Mark relevant dependencies (true deps)


2. Each clock cycle: Start from oldest instruction, working toward younger
instructions and attempt to apply scheduling algorithm + assumptions
Example: 2-issue w/ Tomasulo

Resource usage:

1. Mark relevant dependencies (true deps)


2. Each clock cycle: Start from oldest instruction, working toward younger
instructions and attempt to apply scheduling algorithm + assumptions
Example: 2-issue w/ Tomasulo

Resource usage:

1. Mark relevant dependencies (true deps)


2. Each clock cycle: Start from oldest instruction, working toward younger
instructions and attempt to apply scheduling algorithm + assumptions
Example: 2-issue w/ Tomasulo

Resource usage:

1. Mark relevant dependencies (true deps)


2. Each clock cycle: Start from oldest instruction, working toward younger
instructions and attempt to apply scheduling algorithm + assumptions
Example: 2-issue w/ Tomasulo

Resource usage:

1. Mark relevant dependencies (true deps)


2. Each clock cycle: Start from oldest instruction, working toward younger
instructions and attempt to apply scheduling algorithm + assumptions
Example: 2-issue w/ Tomasulo

Resource usage:

1. Mark relevant dependencies (true deps)


2. Each clock cycle: Start from oldest instruction, working toward younger
instructions and attempt to apply scheduling algorithm + assumptions
Example: 2-issue w/ Tomasulo

Resource usage:

1. Mark relevant dependencies (true deps)


2. Each clock cycle: Start from oldest instruction, working toward younger
instructions and attempt to apply scheduling algorithm + assumptions
• Fetch/issue rate: 15 instructions / 9 cycles = 1.67 IPC
• Execution completion rate: 15 instructions / 16 cycles = 0.94 IPC (much less than the ideal of 2)
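As a worked check of the two rates above, using only the instruction and cycle counts given on the slide (the subscripts are labels introduced here for illustration):

```latex
\[
\text{IPC} = \frac{\text{instructions}}{\text{cycles}}, \qquad
\text{IPC}_{\text{issue}} = \frac{15}{9} \approx 1.67, \qquad
\text{IPC}_{\text{completion}} = \frac{15}{16} \approx 0.94
\]
```

The front end comes close to the 2-issue ideal, while completion runs at less than half of it, so the bottleneck in this example is in execution rather than in fetch or issue.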
Modified Example (extra Int ALU)
Loop: L.D F0,0(R1) ; F0 = array element
ADD.D F4,F0,F2 ; add scalar in F2
S.D F4,0(R1) ; store result
DADDIU R1,R1,#-8 ; decrement pointer 8 bytes (per DW)
BNE R1,R2,Loop ; branch R1 != R2
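For readers less used to MIPS assembly, the loop can be read as the following high-level sketch. The function name add_scalar and the use of a list index in place of the byte pointer R1 are assumptions made for illustration.

```python
def add_scalar(x, s):
    """Add s to every element of x in place, from the last element down.

    The index i stands in for the pointer R1, which the assembly decrements
    by one 8-byte doubleword per iteration. Unlike the assembly loop, which
    always runs its body at least once, this sketch also accepts an empty list.
    """
    i = len(x) - 1
    while i >= 0:        # BNE    R1,R2,Loop
        f0 = x[i]        # L.D    F0,0(R1)
        f4 = f0 + s      # ADD.D  F4,F0,F2
        x[i] = f4        # S.D    F4,0(R1)
        i -= 1           # DADDIU R1,R1,#-8
    return x


print(add_scalar([1.0, 2.0, 3.0], 0.5))
```

Each iteration touches a different array element, so iterations are independent apart from the pointer update and the branch; that is what lets a multiple-issue machine overlap them.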

Changed Assumptions:

(5) Address Adder & Integer Unit

Modified Example, cont’d...

• Execution completion rate: 3 cycles fewer (3/S.D vs 1/L.D: 16-4+1 = 13 vs. 16 cycles before); IPC = 15/13 = 1.15 (vs. 0.94)
• Performance limited by waiting for branch to resolve.
Example of Multiple Issue, Tomasulo & Reorder Buffer

[Timing table lost in extraction; surviving cell text: "Memory cycle: (for commit/mem write)" and "#8" (three times)]

1. Mark true dependencies

2. Each clock cycle: Start from oldest instruction, working toward younger instructions and
attempt to apply <whichever OoO> algorithm rules
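The in-order commit rule that the reorder buffer adds to this example can be sketched as a toy model. The class ReorderBuffer, its fields, and the commit width of 2 are assumptions made for illustration; real entries also hold destination registers, values, and exception flags.

```python
from collections import deque


class ReorderBuffer:
    """Toy reorder buffer: entries are allocated in program order, may be
    marked done in any order, and retire only from the head."""

    def __init__(self, commit_width=2):
        self.entries = deque()            # oldest instruction at the left
        self.commit_width = commit_width  # max commits per cycle (assumed)

    def issue(self, name):
        entry = {"name": name, "done": False}
        self.entries.append(entry)
        return entry

    def commit(self):
        """Retire up to commit_width finished instructions from the head."""
        retired = []
        while (self.entries and self.entries[0]["done"]
               and len(retired) < self.commit_width):
            retired.append(self.entries.popleft()["name"])
        return retired


rob = ReorderBuffer()
ld, add, sd, daddiu = [rob.issue(n) for n in ("L.D", "ADD.D", "S.D", "DADDIU")]

daddiu["done"] = True        # the youngest instruction finishes first
cycle_a = rob.commit()       # nothing retires: the head (L.D) is not done
ld["done"] = True
cycle_b = rob.commit()       # only L.D retires; ADD.D blocks the rest
add["done"] = sd["done"] = True
cycle_c = rob.commit()       # ADD.D and S.D retire together, in order
cycle_d = rob.commit()       # DADDIU finally retires
print(cycle_a, cycle_b, cycle_c, cycle_d)
```

DADDIU completes early but its result stays in the buffer until every older instruction has retired, which is what keeps architectural state precise if an older instruction faults or a branch turns out to be mispredicted.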
• Commit rate: 15 instructions in 14-5+1 = 10 cycles
• Speculative execution helped us commit 1.5 instructions per cycle
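To put the three examples side by side, the snippet below recomputes each rate from the instruction and cycle counts quoted on the slides. The helper ipc and the configuration labels are names introduced here for illustration.

```python
def ipc(instructions, cycles):
    """Instructions per cycle over a measured window."""
    return instructions / cycles


# (instructions, cycles) as quoted on the slides.
configs = {
    "completion, base 2-issue Tomasulo": (15, 16),
    "completion, extra integer ALU": (15, 16 - 4 + 1),
    "commit, with reorder buffer": (15, 14 - 5 + 1),
}

for name, (n, c) in configs.items():
    print(f"{name}: {n}/{c} = {ipc(n, c):.2f} IPC")
```

The first two cases are limited by waiting for the branch to resolve; speculating past the branch is what brings the rate closer to the 2-issue ideal.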
Summary
In this slide set:

• Precise exceptions

• Speculative execution

• Reorder buffer allows both in OoO CPUs

• Lowering the ideal CPI below 1

• Superscalar issue
