Chapter 2: Instruction-Level Parallelism
Introduction
Pipelining became a universal technique by 1985.
Overlaps the execution of instructions.
Exploits “Instruction-Level Parallelism (ILP)”.
Review of basic concepts
Pipelining: each instruction is split up into a sequence of steps
– different steps can be executed concurrently by different circuitry.
A basic pipeline in a RISC processor
IF – Instruction Fetch
ID – Instruction Decode
EX – Instruction Execution
MEM – Memory Access
WB – Register Write Back
Two techniques:
Superscalar - a superscalar processor executes more than one instruction per clock cycle (see the full definition below).
Review of basic concepts
IF – Instruction Fetch
  Send the PC to memory.
  Fetch the current instruction from memory.
  Update the PC to the next instruction by adding 4.
ID – Instruction Decode
  Decode the instruction.
  Read the registers corresponding to the register source specifiers.
EX – Instruction Execution
  Memory reference: add the base register and the offset to form the effective address.
  Register-register ALU instruction: perform the ALU operation.
  Conditional branch: determine whether the condition is true.
MEM – Memory Access: load/store to memory.
WB – Register Write Back: for register-register ALU instructions and load instructions.
Note: branches require three cycles, store instructions require four cycles, and all other instructions require five cycles.
Basic superscalar 5-stage pipeline
Superscalar - a processor that executes more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units on the processor.
The hardware determines (statically or dynamically) which one of a block of n instructions will be executed next.
Pipeline registers prevent interference between two different instructions in adjacent stages.
Hazards - situations in which pipelining could lead to incorrect results.
Data dependence “true dependence”: Read After Write hazard (RAW)
i:   sub R1, R2, R3   % sub d,s,t : d = s - t
i+1: add R4, R1, R3   % add d,s,t : d = s + t
Instruction (i+1) reads operand (R1) before instruction (i) writes it.
Name dependence “anti dependence”: two instructions use the same register or memory location, but there is no data dependence between them.
Write After Read hazard (WAR). Example:
i:   sub R4, R5, R3
i+1: add R5, R2, R3
i+2: mul R6, R5, R7
Instruction (i+1) writes operand (R5) before instruction (i) reads it.
Write After Write hazard (WAW) “output dependence”. Example:
i:   sub R6, R5, R3
i+1: add R6, R2, R3
i+2: mul R1, R2, R7
Instruction (i+1) writes operand (R6) before instruction (i) writes it.
Hazards
Data hazards => RAW, WAR, WAW.
Structural hazard - occurs when a part of the processor's
hardware is needed by two or more instructions at the same time.
Example: a single memory unit that is accessed both in the fetch
stage where an instruction is retrieved from memory, and the
memory stage where data is written and/or read from memory.
They can often be resolved by separating the component into
orthogonal units (such as separate caches) or bubbling the
pipeline.
Introduction
When exploiting instruction-level parallelism (ILP), the goal is to minimize CPI (cycles per instruction):
Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
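A worked example with assumed (purely illustrative) stall counts: if the ideal pipeline CPI is 1.0 and each instruction incurs on average 0.0 structural stalls, 0.4 data hazard stalls, and 0.2 control stalls, then
Pipeline CPI = 1.0 + 0.0 + 0.4 + 0.2 = 1.6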
Introduction
Loop-Level Parallelism
Unroll loop statically or dynamically
Use SIMD (vector processors and GPUs)
Challenges:
Data dependence: instruction j is data dependent on instruction i if
  instruction i produces a result that may be used by instruction j, or
  instruction j is data dependent on instruction k and instruction k is data dependent on instruction i (a chain of dependences).
Dependent instructions cannot be executed simultaneously.
Introduction
Dependencies are a property of programs
Pipeline organization determines if dependence is
detected and if it causes a “stall.”
Introduction
Every instruction is control dependent on some set of branches
and, in general, the control dependencies must be preserved to
ensure program correctness.
An instruction that is control dependent on a branch cannot be moved before the branch, so that its execution is no longer controlled by the branch.
An instruction that is not control dependent on a branch cannot be moved after the branch, so that its execution is controlled by the branch.
Example:
BEQZ R1, L1
…
L1: LW R4, 0(R1)
Can we move the LW before the BEQZ?
Preserving data flow: the flow of data between instructions that produce results and those that consume them. Example:
DADDU R1, R2, R3
BEQZ R4, L1
DSUBU R1, R8, R9
L1: LW R5, 0(R2)
Introduction
Examples
• Example 1: the OR instruction is data dependent on both DADDU and DSUBU (the value in R1 depends on which path reaches the OR).
DADDU R1,R2,R3
BEQZ R4,L
DSUBU R1,R1,R6
L: …
OR R7,R1,R8
Compiler Techniques
Pipeline stall - a delay in the execution of an instruction in an instruction pipeline in order to resolve a hazard. The compiler can reorder instructions to reduce the number of pipeline stalls.
Pipeline scheduling - separate a dependent instruction from the source instruction by the pipeline latency of the source instruction.
Example:
for (i=999; i>=0; i=i-1)
    x[i] = x[i] + s;
Pipeline stalls
Compiler Techniques
Assumptions:
1. The scalar s is in F2.
2. R1 initially holds the address of the array element with the highest address.
3. The array element with the lowest address is at 8(R2).

Loop: L.D F0,0(R1)      % load array element into F0
      stall             % next instruction needs the result in F0
      ADD.D F4,F0,F2    % add the scalar in F2 to the array element
      stall
      stall             % next instruction needs the result in F4
      S.D F4,0(R1)      % store array element
      DADDUI R1,R1,#-8  % decrement pointer to array element
      BNE R1,R2,Loop    % loop if not the last element
Compiler Techniques
Pipeline scheduling
Scheduled code:
Loop: L.D F0,0(R1)
      DADDUI R1,R1,#-8  % moved up to hide the load latency
      ADD.D F4,F0,F2
      stall
      stall
      S.D F4,8(R1)      % offset changed from 0 to 8: R1 has already been decremented
      BNE R1,R2,Loop
Compiler Techniques
2. Loop unrolling
Given the same code, unroll the loop body four times so that three of the four DADDUI and BNE operations are eliminated (a C-level sketch of the unrolling appears after the scheduled version below).
Unrolled loop:
Loop: L.D F0,0(R1)
ADD.D F4,F0,F2
S.D F4,0(R1) % drop DADDUI & BNE
L.D F6,-8(R1)
ADD.D F8,F6,F2
S.D F8,-8(R1) % drop DADDUI & BNE
L.D F10,-16(R1)
ADD.D F12,F10,F2
S.D F12,-16(R1) % drop DADDUI & BNE
L.D F14,-24(R1)
ADD.D F16,F14,F2
S.D F16,-24(R1)
DADDUI R1,R1,#-32
BNE R1,R2,Loop
Compiler Techniques
Unrolled and scheduled loop:
Loop: L.D F0,0(R1)
      L.D F6,-8(R1)
      L.D F10,-16(R1)
      L.D F14,-24(R1)
      ADD.D F4,F0,F2
      ADD.D F8,F6,F2
      ADD.D F12,F10,F2
      ADD.D F16,F14,F2
      S.D F4,0(R1)
      S.D F8,-8(R1)
      DADDUI R1,R1,#-32
      S.D F12,16(R1)
      S.D F16,8(R1)
      BNE R1,R2,Loop
Scheduling the unrolled loop eliminates the stalls:
1. The L.D latency is one cycle, so by the time the ADD.D instructions issue, F0, F6, F10, and F14 are already loaded.
2. The ADD.D-to-store latency is two cycles, so the stores can proceed immediately after the adds.
3. The array pointer is updated after the first two S.D instructions, so the last two stores use offsets 16(R1) and 8(R1), and the loop control can proceed immediately after the last two S.D.
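For comparison, a minimal C sketch of the same 4x unrolling, assuming the trip count (1000) is divisible by 4 so that no cleanup loop is needed; the function name is illustrative:

    /* 4x unrolled form of: for (i = 999; i >= 0; i = i - 1) x[i] = x[i] + s; */
    void add_scalar_unrolled(double *x, double s) {
        for (int i = 999; i >= 0; i -= 4) {
            x[i]     = x[i]     + s;   /* iteration i   */
            x[i - 1] = x[i - 1] + s;   /* iteration i-1 */
            x[i - 2] = x[i - 2] + s;   /* iteration i-2 */
            x[i - 3] = x[i - 3] + s;   /* one test/branch per four elements */
        }
    }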
Loop unrolling & scheduling summary
Use different registers to avoid unnecessary constraints.
Adjust the loop termination and iteration code.
Determine whether the loop iterations are independent (apart from the loop maintenance code); if so, unroll the loop.
Analyze memory addresses to determine whether loads and stores from different iterations are independent; if so, loads and stores may be interchanged in the unrolled loop.
Schedule the code while ensuring correctness.
Limitations of loop unrolling
Diminishing returns: each additional unrolling eliminates less overhead.
Growth of the code size.
Register pressure (shortage of registers): scheduling to increase ILP increases the number of simultaneously live values, and thus the number of registers needed.
Branch prediction
Branch prediction - guess whether a conditional
jump will be taken or not.
Goal - improve the flow in the instruction pipeline.
Speculative execution - the branch that is
guessed to be the most likely is then fetched and
speculatively executed.
Penalty - if it is later detected that the guess was
wrong then the speculatively executed or partially
executed instructions are discarded and the
pipeline starts over with the correct branch,
incurring a delay.
Branch prediction
• As pipelines get deeper and the potential penalty of
branches increases,
– using delayed branches is insufficient.
• Predicting branches
– low-cost static schemes that rely on information
available at compile time (always taken or not)
• use profile information collected from earlier runs.
• misprediction rates for such static prediction range from about 3% to 24%.
– predict branches dynamically based on program
behavior.
Dynamic Branch prediction
• How - the branch predictor keeps records of
whether branches are taken or not taken.
• When it encounters a conditional jump that has been seen several times before, it can base the prediction on that history.
• The branch predictor may, for example,
recognize that the conditional jump is taken
more often than not, or that it is taken every
second time
Branch Prediction
Predictors
Basic 2-bit predictor:
For each branch:
Predict taken or not taken.
If the prediction is wrong two consecutive times, change the prediction (see the C sketch after the state diagram below).
Correlating predictor:
Multiple 2-bit predictors for each branch
One for each possible combination of outcomes of preceding n
branches
Local predictor:
Multiple 2-bit predictors for each branch
One for each possible combination of outcomes for the last n
occurrences of this branch
Tournament predictor:
Combine correlating predictor with local predictor
[Figure: the states in a 2-bit prediction scheme]
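A minimal C sketch of a 2-bit saturating-counter predictor; the table size, the hashing of the PC, and the function names are illustrative assumptions, not from the original slides:

    #include <stdint.h>

    #define TABLE_SIZE 1024
    static uint8_t counters[TABLE_SIZE];   /* each entry is a 2-bit counter, 0..3 */

    /* Predict taken when the counter is in one of the two "taken" states (2 or 3). */
    int predict(uint32_t pc) {
        return counters[(pc >> 2) % TABLE_SIZE] >= 2;
    }

    /* Saturate toward 3 on taken, toward 0 on not taken: two consecutive
       mispredictions are needed to flip a strongly biased state. */
    void update(uint32_t pc, int taken) {
        uint8_t *c = &counters[(pc >> 2) % TABLE_SIZE];
        if (taken) { if (*c < 3) (*c)++; }
        else       { if (*c > 0) (*c)--; }
    }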
Correlating predictor
• Branch predictors that use the behavior of other
branches to make a prediction
• A (1,2) predictor uses the behavior of the last
branch to choose from among a pair of 2-bit
branch predictors in predicting a particular
branch.
• An (m,n) predictor uses the behavior of the last m branches to choose from 2^m branch predictors, each of which is an n-bit predictor for a single branch.
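A sketch of a (2,2) correlating predictor in the same style as the 2-bit predictor above: a 2-bit global history register selects one of 2^2 = 4 two-bit counters per table entry. Sizes, hashing, and names are again illustrative assumptions (reuses uint8_t/uint32_t from <stdint.h>):

    #define ENTRIES 256
    static uint8_t table2[ENTRIES][4];  /* four 2-bit counters per branch entry */
    static uint8_t history;             /* outcomes of the last 2 branches */

    int predict_corr(uint32_t pc) {
        return table2[(pc >> 2) % ENTRIES][history & 3] >= 2;
    }

    void update_corr(uint32_t pc, int taken) {
        uint8_t *c = &table2[(pc >> 2) % ENTRIES][history & 3];
        if (taken) { if (*c < 3) (*c)++; }
        else       { if (*c > 0) (*c)--; }
        history = (uint8_t)(((history << 1) | (taken ? 1 : 0)) & 3); /* shift in outcome */
    }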
Correlating predictor
    a = 0;
B1: while (a < 1000) {
        if (a % 2 != 0) { ... }
        a++;
        ...
B2:     if (b == 0) { ... }
    }
Branch history:
B1: TNTNTNTNT
B2: NNNNNNNN
[Figures: behavior of 2-bit prediction vs. a correlating predictor on this example]
Branch Prediction
[Figure: branch prediction performance]
Dynamic scheduling
Rearrange order of instructions to reduce stalls
while maintaining data flow
Advantages:
Compiler doesn’t need to have knowledge of
microarchitecture
Handles cases where dependencies are unknown at
compile time
Disadvantages:
Substantial increase in hardware complexity
Complicates exceptions
Dynamic scheduling
Dynamic scheduling implies:
Out-of-order execution
Out-of-order completion
Tomasulo’s Approach
Tracks when operands are available
Introduces register renaming in hardware
Minimizes WAW and WAR hazards
Register renaming
Example:
DIV.D F0,F2,F4
ADD.D F6,F0,F8
S.D F6,0(R1)
SUB.D F8,F10,F14  % antidependence (WAR) on F8: ADD.D reads F8, SUB.D writes it
MUL.D F6,F10,F8   % output dependence (WAW) on F6 with ADD.D, plus a name dependence on F6 with S.D
Register renaming
Example: add two temporary registers S and T
i: DIV.D F0,F2,F4
i+1: ADD.D S,F0,F8   (instead of ADD.D F6,F0,F8)
i+2: S.D S,0(R1)     (instead of S.D F6,0(R1))
i+3: SUB.D T,F10,F14 (instead of SUB.D F8,F10,F14)
i+4: MUL.D F6,F10,T  (instead of MUL.D F6,F10,F8)
Reservation stations (RS) buffer values:
An RS fetches and buffers an operand as soon as it becomes available (not necessarily involving the register file)
necessarily involving register file)
Pending instructions designate the RS to which they will send their
output
Result values broadcast on a result bus, called the common data bus (CDB)
Only the last output updates the register file
As instructions are issued, the register specifiers are renamed with
the reservation station
May be more reservation stations than registers
Tomasulo’s algorithm
Goal: high performance without special compilers, when the hardware has only a small number of floating-point (FP) registers.
Designed in 1966 for the IBM 360/91, which had only 4 FP registers.
Used in modern processors: Pentium 2/3/4, PowerPC 604, Nehalem, …
Additional hardware needed
Load and store buffers contain data and addresses and act like reservation stations.
Reservation stations feed data to the floating-point arithmetic units.
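A minimal C sketch of one reservation-station entry with the classic fields; the exact layout and types are illustrative assumptions:

    /* One reservation-station entry in Tomasulo's scheme. */
    typedef struct {
        int    busy;    /* is this entry in use? */
        int    op;      /* operation to perform on vj and vk */
        double vj, vk;  /* operand values, once available */
        int    qj, qk;  /* tags of the RS producing each operand (0 = value ready) */
        int    addr;    /* effective address, for load/store buffers */
    } ReservationStation;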
Instruction execution steps
Issue
  Get the next instruction from the FIFO instruction queue.
  If a reservation station is available, issue the instruction to the RS, with operand values if they are available.
  If no RS is available, stall the instruction (structural hazard); if operand values are not yet available, the instruction waits in the RS and records which stations will produce them.
Execute
  When an operand becomes available, store it in any reservation stations waiting for it.
  When all operands are ready, execute the instruction.
  Loads and stores are maintained in program order through the effective-address calculation.
  No instruction is allowed to initiate execution until all branches that precede it in program order have completed.
Write result
  Broadcast the result on the common data bus (CDB) to the waiting reservation stations and the register file.
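Building on the ReservationStation sketch above, an illustrative issue step; reg_status[] (0 = register value ready, otherwise the tag of the producing RS) and regs[] are assumed bookkeeping structures, not from the original slides:

    int issue(ReservationStation rs[], int nrs, int op,
              int src1, int src2, const int reg_status[], const double regs[]) {
        for (int i = 0; i < nrs; i++) {
            if (!rs[i].busy) {
                rs[i].busy = 1;
                rs[i].op   = op;
                /* read a ready operand now, or record the producer's tag */
                if (reg_status[src1] == 0) { rs[i].vj = regs[src1]; rs[i].qj = 0; }
                else                       { rs[i].qj = reg_status[src1]; }
                if (reg_status[src2] == 0) { rs[i].vk = regs[src2]; rs[i].qk = 0; }
                else                       { rs[i].qk = reg_status[src2]; }
                return i + 1;              /* 1-based RS tag */
            }
        }
        return 0;  /* no free RS: structural hazard, caller must stall */
    }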
Example walkthrough (the accompanying figures are omitted):
The second load, from memory location 45+(R3), is issued; its data will be stored in load buffer 2 (Load2). Multiple loads can be outstanding.
Clock cycle 3
Issue ADDD.
Clock cycle 7
The result of the SUBD produced by Add1 will be available in the next cycle; the ADDD instruction executing in Add2 waits for it.
Clock cycle 8
Only the MULTD and DIVD instructions have not completed. DIVD is waiting for the result of MULTD before moving to the execute stage.
Clock cycles 12-15
[Figures omitted]
DIVD will finish execution in cycle 56, and the result will be in F6-F7 in cycle 57.
Hardware-based speculation
Goal: overcome control dependence by speculating on the outcome of branches.
Allow instructions to execute out of order, but force them to commit in order, to prevent them from (i) updating the architectural state or (ii) taking an exception before commit.
Instruction commit: allow an instruction to update the register file only when the instruction is no longer speculative.
Key ideas:
1. Dynamic branch prediction.
2. Execute instructions along predicted execution paths, but only commit the results if the prediction was correct.
3. Dynamic scheduling to deal with different combinations of basic blocks.
How speculative execution is done
Need additional hardware to prevent any irrevocable action until an instruction commits:
Reorder buffer (ROB).
Modified functional units: the operand source is the ROB rather than the functional units.
Register values and memory values are not written until an instruction commits.
On misprediction:
  Speculated entries in the ROB are cleared.
Exceptions:
  Not recognized until the instruction is ready to commit.
Extended floating point unit
The FP unit using Tomasulo’s algorithm, extended to handle speculation.
The reorder buffer now holds the result of an instruction between completion and commit. Each ROB entry has 4 fields:
  Instruction type: branch / store / register operation
  Destination field: register number
  Value field: the output value
  Ready field: has the instruction completed execution?
The operand source is now the reorder buffer instead of the functional unit.
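A matching C sketch of a reorder-buffer entry with the four fields just listed; the types and names are illustrative:

    typedef enum { TYPE_BRANCH, TYPE_STORE, TYPE_REGISTER } InstrType;

    typedef struct {
        InstrType type;   /* instruction type: branch / store / register op */
        int       dest;   /* destination field: register number */
        double    value;  /* value field: the output value, once computed */
        int       ready;  /* ready field: has execution completed? */
    } ROBEntry;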
Multiple Issue and Static Scheduling
To achieve CPI < 1, complete multiple instructions per clock cycle.
Three flavors of multiple issue processors
1. Statically scheduled superscalar processors
a. Issue a varying number of instructions per clock
cycle
b. Use in-order execution
2. VLIW (very long instruction word) processors
a. Issue a fixed number of instructions as one large
instruction
b. Instructions are statically scheduled by the compiler
3. Dynamically scheduled superscalar processors
a. Issue a varying number of instructions per clock
cycle
b. Use out-of-order execution
[Figure: multiple issue processors]
VLIW Processors
Package multiple operations into one (very long) instruction; a sketch of such an instruction word follows the list below.
There must be enough parallelism in the code to fill the available slots.
Disadvantages:
Statically finding parallelism
Code size
No hazard detection hardware
Binary code compatibility
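An illustrative C struct for one VLIW instruction word; the slot mix (two memory, two FP, one integer/branch operation) is an assumption for illustration only:

    typedef struct { int opcode, dest, src1, src2; } Op;

    typedef struct {
        Op mem[2];   /* two memory-operation slots */
        Op fp[2];    /* two FP-operation slots */
        Op intop;    /* one integer/branch slot */
    } VLIWInstruction;  /* the compiler fills unused slots with NOPs */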
Modern microarchitectures:
Dynamic scheduling + multiple issue + speculation
Two approaches:
  Assign reservation stations and update the pipeline control table in half clock cycles (only supports 2 instructions per clock).
  Design the logic to handle any possible dependences between the instructions issued in the same cycle.
[Table: time of issue, execution, and result write in a dual-issue pipeline, covering integer operations, load/store, and FP addition]
The LD following the BNE (cycles 3, 6) cannot start execution earlier; it must wait until the branch outcome is determined, as there is no speculation.
Dual issue with speculation