08 Speculation
Outline
• Speculation
– Re-order buffers
• Limits to ILP
Speculation
Control Dependence Ignored
• If CPU stalls on branches, how much would
CPI increase?
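A rough way to estimate the answer, assuming every branch stalls the pipeline for a fixed number of cycles. The numbers below (ideal CPI of 1, 20% branches, 3 stall cycles) are illustrative assumptions, not figures from these slides, and the function name is made up for this sketch.

```python
def cpi_with_branch_stalls(base_cpi, branch_fraction, stall_cycles):
    """Effective CPI if each branch adds stall_cycles of pipeline stall.

    CPI = base CPI + (fraction of instructions that are branches) * (stall per branch)
    """
    return base_cpi + branch_fraction * stall_cycles


if __name__ == "__main__":
    # Assumed workload: ideal CPI 1.0, 20% branches, 3-cycle stall per branch.
    base = 1.0
    stalled = cpi_with_branch_stalls(base, 0.20, 3)
    print(f"CPI grows from {base:.2f} to {stalled:.2f}")
```

Under these assumptions the stall term is as large as 60% of the ideal CPI, which is why stalling on every branch is not acceptable.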
Branch Prediction and
Speculative Execution
• Speculation is to run instructions on prediction – predictions could be wrong.
• Example:
for (i=0; i<1000; i++)
    C[i] = A[i]+B[i];
Exception Behavior
• Preserving exception behavior -- exceptions must be
raised exactly as in sequential execution
– Same sequence as sequential
– No “extra” exceptions
• Example:
DADDU R2,R3,R4
BEQZ R2,L1
LW R1,0(R2)
L1:
– Problem with moving LW before BEQZ?
• Again, a dynamic execution must look like a sequential
execution whenever it is stopped
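One way to see the problem with moving LW before BEQZ is a small simulation. This is only a sketch: the dictionary-based memory, the MemoryFault class, and the function names are assumptions made for illustration. When R2 is 0, the sequential code skips the load, but the hoisted load still accesses address 0 and raises an "extra" exception.

```python
class MemoryFault(Exception):
    """Stand-in for a hardware exception on a load from an unmapped address."""


def load_word(memory, address):
    if address not in memory:
        raise MemoryFault(f"bad address {address}")
    return memory[address]


def sequential(memory, r3, r4, r1):
    """Original order: the load runs only when the branch is not taken."""
    r2 = r3 + r4                      # DADDU R2,R3,R4
    if r2 != 0:                       # BEQZ R2,L1 skips the load when R2 == 0
        r1 = load_word(memory, r2)    # LW R1,0(R2)
    return r1                         # L1:


def load_hoisted(memory, r3, r4, r1):
    """LW moved above BEQZ: the load runs on both paths."""
    r2 = r3 + r4                      # DADDU R2,R3,R4
    tmp = load_word(memory, r2)       # LW executed before the branch resolves
    if r2 != 0:                       # BEQZ R2,L1
        r1 = tmp                      # keep the loaded value only if not taken
    return r1


if __name__ == "__main__":
    memory = {8: 42}                  # address 0 is deliberately unmapped
    print(sequential(memory, 0, 0, 7))        # branch taken, no load, no fault
    try:
        load_hoisted(memory, 0, 0, 7)
    except MemoryFault as fault:
        print("extra exception:", fault)
```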
Exceptions in Order
• Solutions:
exception is impossible
Precise Interrupts
• An interrupt is precise if the saved process
state corresponds with a sequential model of
program execution where one instruction
completes before the next begins.
• Tomasulo had:
In-order issue, out-of-order execution, and
out-of-order completion
Short Seminar – Precise
Exceptions
HW Support for More ILP
• Speculation: allow an instruction to issue that is
dependent on branch predicted to be taken without
any consequences (including exceptions) if branch is
not actually taken (“HW undo”);
• Combine branch prediction with dynamic scheduling
to execute before branches resolved
• Separate speculative bypassing of results from real
bypassing of results
– When instruction no longer speculative,
write boosted results (instruction commit)
or discard boosted results
– execute out-of-order but commit in-order
to prevent irrevocable action (update state or exception)
until instruction commits
HW support for More ILP
Reorder Buffer Implementation
Result Shift Register
• The "Result Shift Register" is used to control
the result bus
• N is the length of the longest functional
unit pipeline
• An instruction that takes i clock
periods reserves stage i
• If the stage already contains valid
control information, then issue is held
until the next clock period
• Issuing instruction places control
information in the result shift register.
– the functional unit that will be supplying the
result
– the destination register
– This control information is also marked
"valid"
• Each clock period, the control
information is shifted down one stage
toward stage one.
• When it reaches stage one, it is used
during the next clock period to control
the result bus
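The reservation and shifting rules above can be sketched as a small model. This is a simplified illustration, not a hardware description: the class and method names are invented here, an empty stage is represented by None in place of a cleared "valid" bit, and stages[0] stands for stage one.

```python
class ResultShiftRegister:
    def __init__(self, n_stages):
        # n_stages = N, the length of the longest functional unit pipeline.
        self.stages = [None] * n_stages   # stages[0] is stage one

    def try_issue(self, latency, functional_unit, dest_reg):
        """Reserve stage `latency`; return False if issue must be held."""
        if not 1 <= latency <= len(self.stages):
            raise ValueError("latency must be between 1 and N")
        slot = latency - 1
        if self.stages[slot] is not None:  # stage already holds valid control info
            return False                   # hold issue until the next clock period
        self.stages[slot] = (functional_unit, dest_reg)
        return True

    def clock(self):
        """Shift one stage toward stage one.

        Returns the control information that was in stage one, i.e. the entry
        that controls the result bus during the clock period now starting.
        """
        leaving = self.stages[0]
        self.stages = self.stages[1:] + [None]
        return leaving


if __name__ == "__main__":
    rsr = ResultShiftRegister(4)
    print(rsr.try_issue(2, "ADD", "R1"))   # reserves stage 2
    print(rsr.try_issue(2, "MUL", "R2"))   # stage 2 is taken, so issue is held
    rsr.clock()
    print(rsr.try_issue(2, "MUL", "R2"))   # one period later, stage 2 is free
```

Because two instructions can never hold the same stage, at most one result reaches the result bus in any clock period.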
The Hardware: Reorder Buffer
• If inst write results in program order,
reg/memory always get the correct values
• Reorder buffer (ROB) – reorder out-of-order
inst to program order at the time of
writing reg/memory (commit)
• If some inst goes wrong, handle it at the
time of commit – just flush inst afterwards
[Figure: pipeline block diagram with IM, Fetch Unit, Decode, Rename, Regfile, Reorder Buffer]
Four Steps of Speculative
Tomasulo Algorithm
1. Issue—get instruction from FP Op Queue
If reservation station and reorder buffer slot free, issue instr & send
operands & reorder buffer no. for destination (this stage sometimes
called “dispatch”)
2. Execution—operate on operands (EX)
When both operands ready then execute; if not ready, watch CDB for
result; when both in reservation station, execute; checks RAW
(sometimes called “issue”)
3. Write result—finish execution (WB)
Write on Common Data Bus to all awaiting FUs
& reorder buffer; mark reservation station available.
4. Commit—update register with reorder result
When instr. at head of reorder buffer & result present, update register
with result (or store to memory) and remove instr from reorder buffer
(this stage is sometimes called “graduation”). A mispredicted branch
flushes the reorder buffer.
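The commit step can be sketched as follows. This is a minimal model under stated assumptions: the RobEntry fields and the commit function are hypothetical names chosen for this sketch, registers and memory are plain dictionaries, and a mispredicted branch simply empties the buffer.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class RobEntry:
    kind: str                   # "alu", "store", or "branch"
    dest: object = None         # register name, or memory address for a store
    value: object = None        # result, valid once ready is True
    ready: bool = False         # set at write result
    mispredicted: bool = False  # only meaningful for branches


def commit(rob, regs, memory):
    """Retire finished instructions from the head of the ROB, in program order.

    Returns the number of instructions committed in this call.
    """
    committed = 0
    while rob and rob[0].ready:     # stop at the first unfinished instruction
        entry = rob.popleft()
        committed += 1
        if entry.kind == "branch":
            if entry.mispredicted:
                rob.clear()         # flush all younger, speculative instructions
                break
        elif entry.kind == "store":
            memory[entry.dest] = entry.value   # memory is updated only at commit
        else:
            regs[entry.dest] = entry.value     # registers are updated only at commit
    return committed


if __name__ == "__main__":
    regs, memory = {}, {}
    rob = deque([
        RobEntry("alu", "R1", 5, ready=True),
        RobEntry("alu", "R2"),                  # still executing
        RobEntry("alu", "R3", 7, ready=True),   # finished out of order
    ])
    print(commit(rob, regs, memory), regs)      # R3 must wait behind R2
```

The younger instruction that finished early stays in the buffer until everything ahead of it has committed, which is what keeps the architectural state precise.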
Reorder Buffer Details
• Holds Instruction type: branch, store, ALU
register operation
• Holds branch valid and exception bits
– Flush pipeline when any bit is set
• Holds dest, result and PC
– Write results to dest at the time of commit
– Which PC to hold?
• A ready bit indicates if the instruction has
completed execution and the value is ready
• Supplies operands between execution
complete and commit
[Figure: Reorder Buffer entry fields: Program Counter, Branch or L/W?, Exceptions?, Dest reg, Ready?, Result]
• ROB replaces the Store Buffer also
Speculative Execution
Recovery
• Flush the pipeline on misprediction
– MIPS 5-stage pipeline used flushing on taken
branches
• Where is the flush signal from?
• When to flush?
[Figure: pipeline block diagram with IM, Fetch Unit, Decode, Rename, Regfile, Reorder Buffer, S-buf, L-buf, RS, RS, DM, FU1, FU2]
Changes to Other Components
• Use ROB index as tag
– Why not RS index any more?
– Why is ROB index a valid choice?
• Renaming table maps architectural registers
to ROB index if the register is renamed
• Reservation stations now use ROB index for
tracking dependence and for wakeup
• Again tag (now ROB index) and data are
broadcast on CDB at writeback
• Inst may receive values from reg/mem, data
broadcasting, or ROB
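The three operand sources listed above can be sketched like this. The data layout is an assumption made for illustration (dictionaries for the register file, renaming table, and ROB; a list of ("tag", n) or ("value", v) pairs for reservation station operands), and the function names are hypothetical.

```python
def read_operand(reg, rename_table, rob, regfile):
    """Look up a source register at issue time.

    Returns ("value", v) if the value is available now,
    or ("tag", rob_index) if the instruction must wait for a CDB broadcast.
    """
    tag = rename_table.get(reg)
    if tag is None:
        return ("value", regfile[reg])        # not renamed: read the register file
    if rob[tag]["ready"]:
        return ("value", rob[tag]["value"])   # finished but not committed: read the ROB
    return ("tag", tag)                       # still executing: wait on the ROB index


def broadcast(tag, value, rob, reservation_stations):
    """Write result: send (ROB index, data) on the CDB."""
    rob[tag]["value"] = value
    rob[tag]["ready"] = True
    for rs in reservation_stations:
        for i, (kind, payload) in enumerate(rs["operands"]):
            if kind == "tag" and payload == tag:     # wakeup on matching tag
                rs["operands"][i] = ("value", value)


if __name__ == "__main__":
    regfile = {"R1": 10, "R2": 20}
    rename_table = {"R2": 0}                  # R2 is renamed to ROB entry 0
    rob = {0: {"ready": False, "value": None}}
    rs = {"operands": [read_operand("R2", rename_table, rob, regfile),
                       read_operand("R1", rename_table, rob, regfile)]}
    print(rs["operands"])                     # one tag to wait on, one value
    broadcast(0, 99, rob, [rs])
    print(rs["operands"])                     # both operands now hold values
```

A ROB index works as a tag because each in-flight instruction owns exactly one ROB entry from issue until commit.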
Code Example
Loop: LD R2, 0(R1)
DADDIU R2, R2, #1
SD R2, 0(R1)
DADDIU R1, R1, #4
BNE R2, R3, Loop
How would this code be executed?
Inst   Issue   Exec   Memory read   Write results   Commit
LD     1       2      3             4               5
…      …       …      …             …               …
…      …       …      …             …               …
Summary
• Reservation stations: implicit register renaming to larger
set of registers + buffering source operands
– Prevents registers from becoming a bottleneck
– Avoids WAR, WAW hazards of Scoreboard
• Not limited to basic blocks when compared to static
scheduling (integer unit gets ahead, beyond branches)
• Today, helps with cache misses as well
– Don’t stall for L1 Data cache miss
– Can support memory-level parallelism
• Lasting Contributions
– Dynamic scheduling
– Register renaming
– Load/store disambiguation
• 360/91 descendants are Pentium III; PowerPC 604; MIPS
R10000; HP-PA 8000; Alpha 21264
Dynamic Scheduling: The Only
Choice?
• Most high-performance processors today are dynamically
scheduled superscalar processors
– With deeper, n-way issue pipelines
• Other alternatives to exploit instruction-level parallelism
– Statically scheduled superscalar
– VLIW
• Mixed effort: EPIC – Explicit Parallel Instruction Computing
– Example: Intel Itanium processors