
Chapter 2

Instruction-Level Parallelism

Introduction
 Pipelining became a universal technique in 1985
 Overlaps execution of instructions
 Exploits "Instruction-Level Parallelism (ILP)"

 Two main approaches:
  Dynamic, hardware-based
    Used in server and desktop processors
    Not used as extensively in personal mobile processors (PMP)
  Static, compiler-based (software-based)
    Not as successful outside of scientific applications

Review of basic concepts
 Pipelining: each instruction is split up into a sequence of steps –
different steps can be executed concurrently by different circuitry.
 A basic pipeline in a RISC processor:
  IF – Instruction Fetch
  ID – Instruction Decode
  EX – Instruction Execution
  MEM – Memory Access
  WB – Register Write Back
 Two techniques:
  Superscalar – a superscalar processor executes more than one
instruction during a clock cycle.
  VLIW (very long instruction word) – the compiler packs multiple
independent operations into one instruction.
Review of basic concepts
IF – Instruction Fetch
  Send the PC to memory and fetch the current instruction from memory.
  Update the PC to the next instruction by adding 4.
ID – Instruction Decode
  Decode the instruction and read the registers corresponding to the
register source specifiers.
EX – Instruction Execution
  Memory reference: add the base register and the offset to form the
effective address.
  Register-register ALU instruction: perform the operation.
  Conditional branch: determine whether the condition is true.
MEM – Memory Access – load/store to memory.
WB – Register Write Back – for reg-reg ALU or load instructions.
Note: branches require three cycles, store instructions require four
cycles, and all other instructions require five cycles.
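To picture the overlap, here is a tiny C sketch (an illustration, not
code from the slides) that prints the stall-free pipeline diagram for
four instructions; instruction k occupies stage s during cycle k+s:

#include <stdio.h>

int main(void) {
    const char *stages[] = {"IF", "ID", "EX", "MEM", "WB"};
    for (int k = 0; k < 4; k++) {          /* four instructions */
        printf("I%d:", k);
        for (int c = 0; c < k; c++)
            printf("     ");               /* one-cycle shift per instruction */
        for (int s = 0; s < 5; s++)
            printf(" %-4s", stages[s]);    /* stage s occupies cycle k+s */
        printf("\n");
    }
    return 0;
}

Each printed row is shifted one cycle to the right, showing that in any
given cycle up to five instructions are in different stages at once.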
Basic superscalar 5-stage pipeline
Superscalar – a processor executes more than one instruction during a
clock cycle by simultaneously dispatching multiple instructions to
redundant functional units on the processor.
The hardware determines (statically/dynamically) which of a block of
n instructions will be executed next.

 A single-core superscalar processor  SISD
 A multi-core superscalar  MIMD
Registers prevent interference between two different instructions in
adjacent stages.
Hazards – situations where pipelining could lead to incorrect results.
 Data dependence "true dependence"  Read after Write hazard (RAW)
   i:   sub R1, R2, R3   % sub d,s,t  d = s - t
   i+1: add R4, R1, R3   % add d,s,t  d = s + t
   Instruction (i+1) reads operand (R1) before instruction (i) writes it.
 Name dependence "anti dependence"  two instructions use the same
register or memory location, but there is no data dependency between them.
   Write after Read hazard (WAR)  Example:
   i:   sub R4, R5, R3
   i+1: add R5, R2, R3
   i+2: mul R6, R5, R7
   Instruction (i+1) writes operand (R5) before instruction (i) reads it.
 Write after Write hazard (WAW) (output dependence)  Example:
   i:   sub R6, R5, R3
   i+1: add R6, R2, R3
   i+2: mul R1, R2, R7
   Instruction (i+1) writes operand (R6) before instruction (i) writes it.
Hazards
 Data hazards => RAW, WAR, WAW.
 Structural hazard – occurs when a part of the processor's hardware is
needed by two or more instructions at the same time. Example: a single
memory unit that is accessed both in the fetch stage, where an
instruction is retrieved from memory, and in the memory stage, where
data is written and/or read from memory. Structural hazards can often
be resolved by separating the component into orthogonal units (such as
separate caches) or by bubbling the pipeline.
 Control hazard (branch hazard) => due to branches. On many
instruction pipeline microarchitectures, the processor will not know
the outcome of the branch when it needs to insert a new instruction
into the pipeline (normally in the fetch stage).
Instruction-level parallelism
 When exploiting instruction-level parallelism (ILP), the goal is to
minimize CPI (cycles per instruction):
   Pipeline CPI = Ideal pipeline CPI + Structural stalls +
                  Data hazard stalls + Control stalls
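For example (illustrative numbers, not from the slides): with an ideal
pipeline CPI of 1.0 and average stalls per instruction of 0.10
structural, 0.20 data hazard, and 0.15 control, the pipeline
CPI = 1.0 + 0.10 + 0.20 + 0.15 = 1.45.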

 Parallelism within a basic block is limited
  Typical size of a basic block = 3-6 instructions
  Must optimize across branches
 For RISC programs, the average dynamic branch frequency is between
15% and 25%.
 The simplest and most common way to increase the ILP is to exploit
parallelism among iterations of a loop.
 This type of parallelism is often called loop-level parallelism.
Data dependence
 Loop-Level Parallelism
  Unroll loop statically or dynamically
  Use SIMD (vector processors and GPUs)
 Challenges: data dependency
 Instruction j is data dependent on instruction i if
  instruction i produces a result that may be used by instruction j, or
  instruction j is data dependent on instruction k and instruction k
is data dependent on instruction i
 Dependent instructions cannot be executed simultaneously
Data dependence
 Dependencies are a property of programs
 Pipeline organization determines if a dependence is detected and if
it causes a "stall."
 Data dependence conveys:
  Possibility of a hazard
  Order in which results must be calculated
  Upper bound on exploitable instruction-level parallelism
 Dependencies that flow through memory locations are difficult to
detect
Name dependence
 Two instructions use the same name but there is no flow of information
 Not a true data dependence, but a problem when reordering instructions
 Antidependence: instruction j writes a register or memory location
that instruction i reads
  Initial ordering (i before j) must be preserved
 Output dependence: instruction i and instruction j write the same
register or memory location
  Ordering must be preserved
 To resolve, use renaming techniques

Control dependence
 Every instruction is control dependent on some set of branches and,
in general, these control dependencies must be preserved to ensure
program correctness.
  An instruction control dependent on a branch cannot be moved before
the branch so that its execution is no longer controlled by the branch.
  An instruction not control dependent on a branch cannot be moved
after the branch so that its execution is controlled by the branch.
 Example:
   if C1 {
     S1;
   };
   if C2 {
     S2;
   };
S1 is control dependent on C1; S2 is control dependent on C2 but not on C1.
Properties essential to program correctness
 Preserving exception behavior  any change in instruction order must
not change the order in which exceptions are raised. Example:
       DADDU R1,R2,R3
       BEQZ  R1,L1
       LW    R4,0(R1)
   L1: ...
Can we move LW before BEQZ? No: if the branch is taken, the original
program never executes the load, so hoisting it could raise a memory
exception that would not otherwise occur.
 Preserving data flow  the flow of data between instructions that
produce results and those that consume them. Example:
       DADDU R1,R2,R3
       BEQZ  R4,L1
       DSUBU R1,R8,R9
   L1: LW    R5,0(R2)
Examples
• Example 1: the OR instruction is dependent on DADDU and DSUBU
       DADDU R1,R2,R3
       BEQZ  R4,L
       DSUBU R1,R1,R6
   L:  ...
       OR    R7,R1,R8
• Example 2: assume R4 isn't used after skip  possible to move DSUBU
before the branch
       DADDU R1,R2,R3
       BEQZ  R12,skip
       DSUBU R4,R5,R6
       DADDU R5,R4,R9
 skip: OR    R7,R8,R9
Compiler techniques for exposing ILP
 Loop transformation techniques to optimize a program's execution
speed:
  1. reduce or eliminate instructions that control the loop, e.g.,
pointer arithmetic and "end of loop" tests on each iteration
  2. hide latencies, e.g., the delay in reading data from memory
  3. re-write loops as a repeated sequence of similar independent
statements  a space-time tradeoff
  4. reduce branch penalties
 Methods:
  1. pipeline scheduling
  2. loop unrolling
  3. strip mining (sketched below)
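Strip mining is listed but not worked out in these slides; here is a
minimal C sketch of the idea, splitting a long loop into fixed-size
chunks (the chunk size SZ and the function name are illustrative
assumptions):

/* Strip mining sketch: process an n-element loop in chunks of SZ
   iterations; the last chunk may be shorter. */
#define SZ 64

void add_scalar(double x[], int n, double s) {
    for (int i = 0; i < n; i += SZ) {
        int limit = (i + SZ < n) ? i + SZ : n;   /* end of this chunk */
        for (int j = i; j < limit; j++)
            x[j] = x[j] + s;
    }
}

Each inner loop then works on a block that fits comfortably in a cache
or a vector register, which is why strip mining is the usual companion
of vectorization.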
1. Pipeline scheduling
 Pipeline stall – a delay in the execution of an instruction in an
instruction pipeline in order to resolve a hazard. The compiler can
reorder instructions to reduce the number of pipeline stalls.
 Pipeline scheduling – separate a dependent instruction from the
source instruction by the pipeline latency of the source instruction.
 Example:
   for (i=999; i>=0; i=i-1)
     x[i] = x[i] + s;
Pipeline stalls
Assumptions:
1. The scalar s  F2
2. The address of the array element with the highest address  R1
3. R2 is precomputed so that 8(R2) is the address of the element with
the lowest address (the last element to operate on)

Loop: L.D    F0,0(R1)   % load array element into F0
      stall             % next instruction needs the result in F0
      ADD.D  F4,F0,F2   % increment array element
      stall
      stall             % next instruction needs the result in F4
      S.D    F4,0(R1)   % store array element
      DADDUI R1,R1,#-8  % decrement pointer to array element
      BNE    R1,R2,Loop % loop if not the last element
Pipeline scheduling
Scheduled code:
Loop: L.D    F0,0(R1)
      DADDUI R1,R1,#-8
      ADD.D  F4,F0,F2
      stall
      stall
      S.D    F4,8(R1)   % offset becomes 8(R1): DADDUI now executes first
      BNE    R1,R2,Loop
2. Loop unrolling
 Given the same code:
Loop: L.D    F0,0(R1)
      DADDUI R1,R1,#-8
      ADD.D  F4,F0,F2
      S.D    F4,8(R1)
      BNE    R1,R2,Loop
 Assume the number of elements of the array whose starting address is
in R1 is divisible by 4.
 Unroll by a factor of 4.
 Eliminate unnecessary instructions: merge the DADDUI instructions and
drop the unnecessary BNE operations (a C-level sketch of the unrolled
loop follows).
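At the C level, unrolling by a factor of 4 corresponds to a sketch like
this (assuming, as the slide does, that the element count is divisible
by 4):

/* Unrolled by 4; the 1000-element count is assumed divisible by 4. */
void add_scalar_unrolled(double x[1000], double s) {
    for (int i = 999; i >= 0; i -= 4) {
        x[i]     += s;   /* four copies of the loop body */
        x[i - 1] += s;
        x[i - 2] += s;
        x[i - 3] += s;   /* loop overhead paid once per 4 elements */
    }
}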
Unrolled loop
Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)    % drop DADDUI & BNE
      L.D    F6,-8(R1)
      ADD.D  F8,F6,F2
      S.D    F8,-8(R1)   % drop DADDUI & BNE
      L.D    F10,-16(R1)
      ADD.D  F12,F10,F2
      S.D    F12,-16(R1) % drop DADDUI & BNE
      L.D    F14,-24(R1)
      ADD.D  F16,F14,F2
      S.D    F16,-24(R1)
      DADDUI R1,R1,#-32
      BNE    R1,R2,Loop

 Live registers: F0-1, F2-3, F4-5, F6-7, F8-9, F10-11, F12-13,
F14-15, and F16-17; also R1 and R2.
 In the original code: only F0-1, F2-3, and F4-5; also R1 and R2.
Clock cycles
 Without scheduling – 8 clock cycles per element.
 With scheduling – 7 clock cycles per element.
 The unrolled (but unscheduled) loop runs in 26 clock cycles per
iteration (4 elements):
  – each L.D has 1 stall,
  – each ADD.D has 2 stalls, plus
  – 14 instructions (4 × 1 + 4 × 2 + 14 = 26)
 or 6.5 clock cycles for each of the 4 elements.
 Unrolled plus scheduled drops to a total of 14 clock cycles, or 3.5
clock cycles per element, compared with 8 cycles before unrolling or
scheduling and 6.5 cycles when unrolled but not scheduled.
Pipeline schedule the unrolled loop
Loop: L.D    F0,0(R1)
      L.D    F6,-8(R1)
      L.D    F10,-16(R1)
      L.D    F14,-24(R1)
      ADD.D  F4,F0,F2
      ADD.D  F8,F6,F2
      ADD.D  F12,F10,F2
      ADD.D  F16,F14,F2
      S.D    F4,0(R1)
      S.D    F8,-8(R1)
      DADDUI R1,R1,#-32
      S.D    F12,16(R1)
      S.D    F16,8(R1)
      BNE    R1,R2,Loop

Pipeline scheduling reduces the number of stalls:
1. The L.D latency is only one cycle, so by the time the ADD.Ds issue,
F0, F6, F10, and F14 are already loaded.
2. The ADD.D latency is only two cycles, so the S.Ds can proceed
without stalling.
3. The array pointer is updated after the first two S.Ds, so the loop
control can proceed immediately after the last two S.Ds (their offsets
become 16(R1) and 8(R1) because R1 has already been decremented by 32).
Loop unrolling & scheduling summary
 Use different registers to avoid unnecessary constraints.
 Adjust the loop termination and iteration code.
 Determine whether the loop iterations are independent except for the
loop maintenance code  if so, unroll the loop.
 Analyze memory addresses to determine whether loads and stores from
different iterations are independent  if so, interchange loads and
stores in the unrolled loop.
 Schedule the code while ensuring correctness.
 Limitations of loop unrolling:
  The loop overhead eliminated shrinks with each additional unroll.
  Growth of the code size.
  Register pressure (shortage of registers)  scheduling to increase
ILP increases the number of live values and thus the number of
registers needed.
Branch prediction
 Branch prediction – guess whether a conditional jump will be taken
or not.
 Goal – improve the flow in the instruction pipeline.
 Speculative execution – the branch that is guessed to be the most
likely is fetched and speculatively executed.
 Penalty – if it is later detected that the guess was wrong, the
speculatively executed or partially executed instructions are discarded
and the pipeline starts over with the correct branch, incurring a
delay.
Branch prediction
• As pipelines get deeper and the potential penalty of branches
increases, using delayed branches is insufficient.
• Predicting branches:
  – low-cost static schemes that rely on information available at
compile time (e.g., always predict taken or not taken)
    • can use profile information collected from earlier runs
    • misprediction rates for profile-based prediction range from 3%
to 24%
  – predict branches dynamically based on program behavior.
Dynamic branch prediction
• How – the branch predictor keeps records of whether branches are
taken or not taken.
• When it encounters a conditional jump that has been seen several
times before, it can base the prediction on that history.
• The branch predictor may, for example, recognize that the
conditional jump is taken more often than not, or that it is taken
every second time.
Branch predictors
 Basic 2-bit predictor:
  For each branch, predict taken or not taken.
  If the prediction is wrong two consecutive times, change the
prediction.
 Correlating predictor:
  Multiple 2-bit predictors for each branch.
  One for each possible combination of outcomes of the preceding n
branches.
 Local predictor:
  Multiple 2-bit predictors for each branch.
  One for each possible combination of outcomes for the last n
occurrences of this branch.
 Tournament predictor:
  Combines a correlating predictor with a local predictor.
The states in a 2-bit prediction scheme
[Figure: state diagram of the 2-bit saturating counter]
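A 2-bit saturating counter is small enough to sketch directly; the
following C fragment (an illustration, not code from the slides; the
table size and indexing are assumptions) predicts taken when the
counter is in one of the two upper states:

#include <stdint.h>

#define TABLE_BITS 10
static uint8_t counters[1 << TABLE_BITS];  /* states 0..3, initially 0 */

int predict_2bit(uint32_t pc) {
    /* index by low PC bits; states 2 and 3 predict taken */
    return counters[(pc >> 2) & ((1 << TABLE_BITS) - 1)] >= 2;
}

void update_2bit(uint32_t pc, int taken) {
    uint8_t *c = &counters[(pc >> 2) & ((1 << TABLE_BITS) - 1)];
    if (taken) { if (*c < 3) (*c)++; }   /* saturate at "strongly taken" */
    else       { if (*c > 0) (*c)--; }   /* saturate at "strongly not taken" */
}

A single wrong guess moves the counter one step but does not flip the
prediction, matching the "wrong two consecutive times" rule above.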
Correlating predictor
• Branch predictors that use the behavior of other branches to make a
prediction.
• A (1,2) predictor uses the behavior of the last branch to choose
from among a pair of 2-bit branch predictors in predicting a
particular branch.
• An (m,n) predictor uses the behavior of the last m branches to
choose from 2^m branch predictors, each of which is an n-bit predictor
for a single branch.
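As a companion to the 2-bit sketch above, here is a minimal (2,2)
correlating predictor in the same illustrative C style (the entry
count, hashing, and names are assumptions): 2 bits of global history
select one of four 2-bit counters per table entry.

#include <stdint.h>

#define ENTRIES 1024
static uint8_t corr_table[ENTRIES][4]; /* [branch entry][history] -> counter */
static unsigned ghr;                   /* global history: last 2 outcomes */

int predict_corr(uint32_t pc) {
    return corr_table[(pc >> 2) % ENTRIES][ghr & 3] >= 2;
}

void update_corr(uint32_t pc, int taken) {
    uint8_t *c = &corr_table[(pc >> 2) % ENTRIES][ghr & 3];
    if (taken) { if (*c < 3) (*c)++; }
    else       { if (*c > 0) (*c)--; }
    ghr = ((ghr << 1) | (taken != 0)) & 3;  /* shift in the new outcome */
}

On the example that follows, the alternating history of B1 lets the
predictor keep separate counters for the "after taken" and "after not
taken" cases, which a single 2-bit counter cannot do.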
Correlating predictor
      a = 0;
      while (a < 1000) {
B1:      if (a % 2 != 0) { ... }
         a++;
         ...
B2:      if (b == 0) { ... }
      }
Branch history:
B1: TNTNTNTNT
B2: NNNNNNNN
[Figure sequence: the 2-bit predictor and the correlating predictor
stepped through the branch histories above]
Branch prediction performance
[Figure: branch prediction performance]
Dynamic scheduling
 Rearrange the order of instructions to reduce stalls while
maintaining data flow.
 Advantages:
  Compiler doesn't need to have knowledge of the microarchitecture.
  Handles cases where dependencies are unknown at compile time.
 Disadvantages:
  Substantial increase in hardware complexity.
  Complicates exceptions.
Dynamic scheduling
 Dynamic scheduling implies:
  Out-of-order execution
  Out-of-order completion
 Creates the possibility for WAR and WAW hazards
 Tomasulo's approach:
  Tracks when operands are available
  Introduces register renaming in hardware
  Minimizes WAW and WAR hazards
Register renaming
 Example:
   DIV.D F0,F2,F4
   ADD.D F6,F0,F8    % antidependence (WAR) on F8 with SUB.D
   S.D   F6,0(R1)    % name dependence on F6 with MUL.D
   SUB.D F8,F10,F14
   MUL.D F6,F10,F8   % output dependence (WAW) on F6 with ADD.D
Register renaming
 Example: add two temporary registers S and T
   i:   DIV.D F0,F2,F4
   i+1: ADD.D S,F0,F8    (instead of ADD.D F6,F0,F8)
   i+2: S.D   S,0(R1)    (instead of S.D F6,0(R1))
   i+3: SUB.D T,F10,F14  (instead of SUB.D F8,F10,F14)
   i+4: MUL.D F6,F10,T   (instead of MUL.D F6,F10,F8)
 Now only the RAW hazards remain, and they can be strictly ordered.
Register renaming
 Register renaming is provided by reservation stations (RS)
 Each contains:
  The instruction
  Buffered operand values (when available)
  The reservation station number of the instruction providing the
operand values
 An RS fetches and buffers an operand as soon as it becomes available
(not necessarily involving the register file)
 Pending instructions designate the RS to which they will send their
output
 Result values are broadcast on a result bus, called the common data
bus (CDB)
 Only the last output updates the register file
 As instructions are issued, the register specifiers are renamed with
the reservation station
 There may be more reservation stations than registers
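A reservation-station entry can be pictured as a small record; this C
struct is a hypothetical illustration of the fields listed above
(names and types are assumptions, not a real design):

typedef struct {
    int    busy;    /* is this entry in use? */
    int    op;      /* the operation (instruction) to perform */
    double vj, vk;  /* buffered operand values, when available */
    int    qj, qk;  /* RS numbers producing each operand; 0 = value ready */
    int    dest;    /* tag for the result (where to send the output) */
} ReservationStation;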
Tomasulo's algorithm
 Goal  high performance without special compilers when the hardware
has only a small number of floating-point (FP) registers.
 Designed in 1966 for the IBM 360/91, which had only 4 FP registers.
 Used in modern processors: Pentium 2/3/4, PowerPC 604, Nehalem, ...
 Additional hardware needed:
  Load and store buffers  contain data and addresses, act like
reservation stations.
  Reservation stations  feed data to the floating-point arithmetic
units.
Instruction execution steps
 Issue
  Get the next instruction from the FIFO queue.
  If there is an available RS, issue the instruction to the RS, with
operand values if they are available.
  If operand values are not available, stall the instruction.
 Execute
  When an operand becomes available, store it in any reservation
stations waiting for it.
  When all operands are ready, the instruction can execute.
  Loads and stores are maintained in program order through the
effective address.
  No instruction is allowed to initiate execution until all branches
that precede it in program order have completed.
 Write result
  Broadcast the result on the CDB into the reservation stations and
store buffers.
  (Stores must wait until both address and value are received.)
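The "write result" broadcast is the heart of the scheme; here is a C
sketch of it, reusing the illustrative ReservationStation struct above
(the tag encoding is an assumption, not a real design):

void broadcast_cdb(ReservationStation rs[], int n, int tag, double value) {
    /* every waiting RS snoops the CDB and compares its source tags */
    for (int i = 0; i < n; i++) {
        if (!rs[i].busy) continue;
        if (rs[i].qj == tag) { rs[i].vj = value; rs[i].qj = 0; }
        if (rs[i].qk == tag) { rs[i].vk = value; rs[i].qk = 0; }
    }
}

Once both qj and qk are 0, the entry has all operands buffered and can
request its functional unit.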


Example of the Tomasulo algorithm
[Figure: notation used in the example]
 3 load buffers (Load1, Load2, Load3)
 5 reservation stations (Add1, Add2, Add3, Mult1, Mult2)
 16 pairs of floating-point registers F0-F1, F2-F3, ..., F30-F31
 Clock cycles: addition  2; multiplication  10; division  40
Clock cycle 1
 A load from memory location 34+(R2) is issued; the data will later
be stored in the first load buffer (Load1).
Clock cycle 2
 The second load, from memory location 45+(R3), is issued; the data
will be stored in load buffer 2 (Load2).
 Multiple loads can be outstanding.
Clock cycle 3
 The first load completes  the SUBD instruction is waiting for the
data.
 MULTD is issued.
 Register names are removed/renamed in the reservation stations.
Clock cycle 4
 Load2 completes; MULTD waits for its result.
 The result of Load1 is available to the SUBD instruction.
Clock cycle 5
 The result of Load2 is available to MULTD (executed by Mult1) and to
SUBD (executed by Add1); both can now proceed as they have both
operands.
 Mult2 executes DIVD and cannot proceed yet, as it waits for the
result of Add1.
Clock cycle 6
 ADDD is issued.
Clock cycle 7
 The result of SUBD, produced by Add1, will be available in the next
cycle.
 The ADDD instruction executed by Add2 waits for it.
Clock cycle 8
 The result of SUBD is deposited by Add1 in F8-F9.
Clock cycle 9
Clock cycle 10
 ADDD executed by Add2 completes; it needed 2 cycles.
 There are 5 more cycles to go for MULTD executed by Mult1.
Clock cycle 11
 Only the MULTD and DIVD instructions have not completed. DIVD is
waiting for the result of MULTD before moving to the execute stage.
Clock cycle 12
Clock cycle 13
Clock cycle 14
Clock cycle 15
 The MULTD instruction executed by the Mult1 unit has completed
execution.
 The DIVD instruction executed by the Mult2 unit is waiting for it.
Clock cycle 16
Clock cycle 55
 DIVD will finish execution in cycle 56, and the result will be in
F6-F7 in cycle 57.
Hardware-based speculation
 Goal  overcome control dependences by speculating.
 Allow instructions to execute out of order, but force them to commit
in order, so that a speculative instruction neither (i) updates the
state nor (ii) takes an exception until it is known to be needed.
 Instruction commit  allow an instruction to update the register
file only when the instruction is no longer speculative.
 Key ideas:
1. Dynamic branch prediction.
2. Execute instructions along predicted execution paths, but only
commit the results if the prediction was correct.
3. Dynamic scheduling to deal with different combinations of basic
blocks.
How speculative execution is done
 Additional hardware is needed to prevent any irrevocable action
until an instruction commits:
  Reorder buffer (ROB).
  Modified functional units  the operand source is the ROB rather
than the functional units.
  Register values and memory values are not written until an
instruction commits.
 On misprediction:
  Speculated entries in the ROB are cleared.
 Exceptions:
  Not recognized until the instruction is ready to commit.
Extended floating-point unit
 The FP unit using Tomasulo's algorithm is extended to handle
speculation.
 The reorder buffer now holds the result of an instruction between
completion and commit. Each entry has 4 fields:
  Instruction type: branch/store/register operation
  Destination field: register number
  Value field: output value
  Ready field: has the instruction completed execution?
 The operand source is now the reorder buffer instead of the
functional unit.
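In the same illustrative C style as the reservation-station sketch, a
reorder-buffer entry with the four fields above might be declared as
(the type and field names are assumptions):

typedef enum { ROB_BRANCH, ROB_STORE, ROB_REGISTER } InstrType;

typedef struct {
    InstrType type;   /* instruction type: branch / store / register op */
    int       dest;   /* destination register number (or store address) */
    double    value;  /* output value, once computed */
    int       ready;  /* nonzero when execution has completed */
} ROBEntry;

At commit, the oldest ready entry writes value to dest; on a
misprediction, entries younger than the branch are simply discarded.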
Multiple issue and static scheduling
 To achieve CPI < 1, complete multiple instructions per clock cycle.
 Three flavors of multiple-issue processors:
1. Statically scheduled superscalar processors
   a. Issue a varying number of instructions per clock cycle
   b. Use in-order execution
2. VLIW (very long instruction word) processors
   a. Issue a fixed number of instructions as one large instruction
   b. Instructions are statically scheduled by the compiler
3. Dynamically scheduled superscalar processors
   a. Issue a varying number of instructions per clock cycle
   b. Use out-of-order execution
Multiple issue processors
[Figure: the primary approaches to multiple issue]
VLIW processors
 Package multiple operations into one instruction.
 There must be enough parallelism in the code to fill the available
slots.
 Disadvantages:
  Statically finding parallelism
  Code size
  No hazard detection hardware
  Binary code compatibility
Example
 Unroll the loop for x[i] = x[i] + s to eliminate any stalls; ignore
delayed branches.
 The code we had before:
Loop: L.D    F0,0(R1)
      L.D    F6,-8(R1)
      L.D    F10,-16(R1)
      L.D    F14,-24(R1)
      ADD.D  F4,F0,F2
      ADD.D  F8,F6,F2
      ADD.D  F12,F10,F2
      ADD.D  F16,F14,F2
      S.D    F4,0(R1)
      S.D    F8,-8(R1)
      DADDUI R1,R1,#-32
      S.D    F12,16(R1)
      S.D    F16,8(R1)
      BNE    R1,R2,Loop
 Package into one VLIW instruction:
  One integer instruction (or branch)
  Two independent floating-point operations
  Two independent memory references
[Figure: the unrolled loop packed into VLIW instructions]
Dynamic scheduling, multiple issue, and speculation
 Modern microarchitectures combine dynamic scheduling + multiple
issue + speculation.
 Two approaches:
  Assign reservation stations and update the pipeline control table
in half clock cycles
   Only supports 2 instructions/clock
  Design logic to handle any possible dependencies between the
instructions
  Hybrid approaches
 Issue logic can become the bottleneck
Multiple issue processor with speculation
 The organization should allow simultaneous execution, in one clock
cycle, of one of each of the following operations:
  FP multiplication
  FP addition
  Integer operations
  Load/store
 Several datapaths must be widened to support multiple issue.
 The instruction issue logic will be fairly complex.
Multiple issue processor with speculation
[Figure: organization of a multiple issue processor with speculation]
Basic strategy for updating the issue logic
 Assign a reservation station and a reorder buffer entry for every
instruction that may be issued in the next bundle.
 To pre-allocate reservation stations, limit the number of
instructions of a given class that can be issued in a "bundle", e.g.,
one FP, one integer, one load, one store.
 Examine all the dependencies among the instructions in the bundle.
 If dependencies exist within the bundle, use the assigned ROB number
to update the reservation table for the dependent instructions.
Otherwise, use the existing reservation table entries for the issuing
instruction.
 Multiple completion/commit is also needed.
Example
Loop: LD     R2,0(R1)   ; R2 = array element
      DADDIU R2,R2,#1   ; increment R2
      SD     R2,0(R1)   ; store result
      DADDIU R1,R1,#8   ; increment pointer
      BNE    R2,R3,LOOP ; branch if not last element
Dual issue without speculation
[Table: dual-issue timing without speculation]
 Time of issue, execution, and writing the result for a dual-issue
version of the pipeline.
 The LD following the BNE (cycles 3, 6) cannot start execution
earlier; it must wait until the branch outcome is determined, since
there is no speculation.
Dual issue with speculation
[Table: dual-issue timing with speculation]
 Time of issue, execution, writing the result, and commit.
 The LD following the BNE can start execution early, because
speculation is supported.