
Chapter 3
Instruction-Level Parallelism and Its Exploitation

Instruction Parallelism Examples

● Loop-level parallelism
– Loop unrolling (compiler)
– Dynamic unrolling (superscalar scheduling)
● Data parallelism
– Vector computers
● Cray X1, X1E, X2; NEC SX-9
– SIMT
● GPUs
– SIMD
● Short SIMD (SSE, AVX, Intel Xeon Phi)

Introduction

● Instruction-level parallelism (ILP)
– (potential) overlap among instructions
● First universal ILP technique: pipelining (universal since 1985)
● Two approaches to ILP
– Discover and exploit parallelism in hardware
● Dominant in the server and desktop market segments
● Not used in the PMD segment due to energy constraints
– May be changing with the Cortex-A9
– Software-based discovery at compile time
● Technical markets, scientific computing, HPC
● Itanium is an example of aggressive software discovery
– But mostly abandoned by the majority of server makers

Types of Dependences

● Data dependences
● Name dependences
● Control dependences

Instruction-Level Parallelism Basics

● Goal: minimize CPI (maximize IPC)
● In a pipelined processor
– CPI = ideal CPI + stalls (overheads); a worked example follows this slide pair:
● Structural stalls
● Data hazard stalls
● Control stalls
● The fundamental unit for extracting parallelism is the
– Basic block (block of instructions between branches)
– Branches disrupt analysis and add runtime dependence
● But typical basic blocks are small
– 3-6 instructions (15%-25% branch frequency)
● Optimizing across branches is a must
– Examples: loop-level parallelism, data parallelism (SIMD)

Data Dependence: Basics

● Not all instructions can be executed in parallel
● Data-dependent instructions have to be executed “in order”
– Data dependence is a property of the code
● Instruction j is data dependent on instruction i if
– Instruction i produces a result used by instruction j, or
– Instruction j depends on instruction k, and k is data dependent on i
● Pipeline interlocks
– With interlocks, data dependence causes a hazard and a stall
– Without interlocks, data dependence prohibits the compiler from scheduling instructions with overlap
● Data dependence conveys:
– The possibility of a hazard (= a negative side effect if not “in order”)
– The required order of instructions
– An upper bound on achievable parallelism
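As a quick illustration of the CPI decomposition on the previous slide (the stall rates here are invented for the example, not taken from the slides):

    CPI = ideal CPI + structural stalls + data hazard stalls + control stalls
        = 1.00      + 0.05              + 0.10               + 0.02
        = 1.17,  so IPC = 1 / 1.17 ≈ 0.85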
Data Dependence Example

double F0, *R1

Loop: Load.D    F0, 0(R1)      ; F0 = array element       F0 = R1[0]
      Add.D     F4, F0, F2     ; add scalar in F2         F4 = F0 + F2
      Store.D   F4, 0(R1)      ; store result             R1[0] = F4
      Add.I     R1, R1, #-8    ; decrement by 8 bytes     R1 -= 1
      Branch.NE R1, R2, Loop   ; if (R1 != R2) goto Loop

(A C rendering of this loop follows the next slide.)

Data Hazards

● Hazards exist as a result of data or name dependences
– Overlap of dependent (and nearby) instructions could change the access order to the instructions' operands
– Avoiding hazards ensures program order
● Possible data hazards
– RAW (read after write) ← true data dependence
● Instruction i: write to x
● Instruction j: read from x
– WAW (write after write) ← output dependence
● Instruction i: write to x
● Instruction j: write to x
– WAR (write after read) ← antidependence
● Instruction i: read from x
● Instruction j: write to x
● RAR is not a hazard
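A C rendering of the Data Dependence Example loop above may make the dependence chains easier to see (a sketch; the function and variable names are illustrative, not from the slides):

    /* Adds the scalar s to each element, walking the array downward, */
    /* as the assembly does with the decrementing R1.                 */
    void add_scalar(double *x, long n, double s) {
        for (long i = n - 1; i >= 0; --i)   /* Add.I / Branch.NE      */
            x[i] = x[i] + s;   /* Load.D -> Add.D -> Store.D: RAW chain */
    }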

Data Dependence Details

● Overcoming data dependence
– Maintain the dependence but prevent the hazard
● By the compiler or by the hardware scheduler
– Eliminate the dependence by transforming the code
● A separate topic
– Dependences that flow through memory are harder to detect (see the C sketch after the next slide)
● Is R4[100] the same as R6[20]?
● Is R4[20] the same as R6[20]?

Control Dependence

● Control dependence determines the ordering of an instruction with respect to a branch instruction
– The order must be preserved
– The execution should be conditional
● Example:
– if p1 { S1 }
– if p2 { S2 }
– S1 is control dependent on p1
– S2 is control dependent on p2
– S2 is not control dependent on p1
● Branches create barriers to potential code motion
● It might be possible to violate control dependence but preserve correct execution with extra hardware
– Speculative execution
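The memory-disambiguation questions above are the pointer-aliasing problem; a minimal C sketch (the names are assumed, for illustration only):

    /* Without alias information, the compiler must assume the store */
    /* to a[100] may touch the same location as b[20], so the load   */
    /* cannot be scheduled above the store.                          */
    double may_alias(double *a, double *b) {
        a[100] = 1.0;      /* is &a[100] == &b[20]?                  */
        return b[20];      /* possible dependence through memory     */
    }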

Name Dependence

● Name dependence occurs when two instructions i and j use the same register (or memory location) but there is no flow of data between them
● Two types of name dependence
– Antidependence: instruction i reads what instruction j writes
● Store.D F4, 0(R1)   ; i reads R1
● Add.I R1, R1, #-8   ; j writes R1
– Output dependence: instructions i and j both write to the same name
● Renaming is the common technique for dealing with name dependence
– Register renaming
– Shadow registers

Control Dependence Examples

● Exception handling
– Add R2, R3, R4
– Branch.equal0 R2, L1
– Load R1, 0(R2)
– L1: NoOp
– No data dependence between the Branch and the Load
– But a load from a wrong R2 could cause an exception; in C terms:
● int *r2 = r3 + r4; y = r2 ? r2[0] : 0;
● Data flow
– Add R1, R2, R3
– Branch.equal0 R4, L
– Subtract R1, R5, R6
– L: NoOp
– Or R7, R1, R8
– The R1 consumed by the Or comes from either the Add or the Subtract, depending on the branch outcome
Control Dependence: Software Speculation

● Ignoring control dependence may be possible after code analysis (liveness property)
– Add R1, R2, R3
– Branch.eq0 R12, Skip
– Subtract R4, R5, R6
– Add R5, R4, R9
– Skip: Or R7, R8, R9
– ; R4 is not used again (it is dead), so the Subtract can be hoisted above the branch

Original vs. Unrolled Loop

● Original loop:
Loop: F0 = R1[0]
      F4 = F0 + F2
      R1[0] = F4
      R1 -= 1
      if (R1 != R2) goto Loop

● Unrolled four times (see the C sketch below):
Loop: F0 = R1[0]
      F4 = F0 + F2
      R1[0] = F4
      F6 = R1[-1]
      F8 = F6 + F2
      R1[-1] = F8
      F10 = R1[-2]
      F12 = F10 + F2
      R1[-2] = F12
      F14 = R1[-3]
      F16 = F14 + F2
      R1[-3] = F16
      R1 -= 4
      if (R1 != R2) goto Loop

● New registers: F6, F8, ...
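The same transformation in C (a sketch; it assumes the trip count is divisible by 4 and reuses the illustrative add_scalar loop from earlier):

    void add_scalar_unrolled4(double *x, long n, double s) {
        for (long i = n - 1; i >= 3; i -= 4) {  /* one branch per four elements */
            x[i]     = x[i]     + s;            /* each copy uses its own       */
            x[i - 1] = x[i - 1] + s;            /* register (F4, F8, F12, F16)  */
            x[i - 2] = x[i - 2] + s;
            x[i - 3] = x[i - 3] + s;
        }
    }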

Compiler Techniques for Exposing ILP

● Pipeline scheduling: the original loop vs. the compiler-scheduled loop, given the latencies below

● Unscheduled (with stalls):
Loop: Load F0, 0(R1)
      Stall
      Add F4, F0, F2
      Stall
      Stall
      Store F4, 0(R1)
      Add.I R1, R1, #-8
      Branch.NoEq R1, R2, Loop

● Scheduled by the compiler:
Loop: Load F0, 0(R1)
      Add.I R1, R1, #-8
      Add F4, F0, F2
      Stall
      Stall
      Store F4, 0(R1)
      Branch.NoEq R1, R2, Loop

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0

Loop Unrolling + Pipeline Scheduling

● Unrolled loop:
Loop: F0 = R1[0]
      F4 = F0 + F2
      R1[0] = F4
      F6 = R1[-1]
      F8 = F6 + F2
      R1[-1] = F8
      F10 = R1[-2]
      F12 = F10 + F2
      R1[-2] = F12
      F14 = R1[-3]
      F16 = F14 + F2
      R1[-3] = F16
      R1 -= 4
      if (R1 != R2) goto Loop

● Unrolled and scheduled loop:
Loop: F0 = R1[0]
      F6 = R1[-1]
      F10 = R1[-2]
      F14 = R1[-3]
      F4 = F0 + F2
      F8 = F6 + F2
      F12 = F10 + F2
      F16 = F14 + F2
      R1[0] = F4
      R1[-1] = F8
      R1 -= 4
      R1[2] = F12
      R1[1] = F16
      if (R1 != R2) goto Loop

– Note: after R1 -= 4, the last two stores use the adjusted offsets R1[2] and R1[1]

Loop Unrolling Overview

● Loop unrolling simply copies the body of the loop multiple times; each copy operates on a new loop index
● Benefits
– Fewer branch instructions
● Less pressure on the branch predictor
– Increased basic block size
● Potential for more parallelism
– Fewer instructions executed
● For example: fewer increments of the loop counter
● Downsides
– Greater register pressure
– Increased use of the instruction cache
● Could spill out of the instruction cache and cause cache thrashing

Unrolling with Generic Loops

● Given: for (k=0; k < N; ++k)
– Let's unroll 4 times
– But what if N is not divisible by 4?
● Solution (see the C sketch below):
– First, loop N%4 times: for (k=0; k < N%4; ++k)
– Then loop over groups of 4:
for (k = N%4; k < N; k += 4)
// unrolled 4 times for k, k+1, k+2, k+3
● Refer to Chapter 4 and the technique called
– Strip mining
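A C sketch of the remainder-first scheme above (the accumulating loop body is chosen only for illustration):

    /* Handle N % 4 iterations first, so the unrolled main loop below */
    /* always executes a whole number of groups of four.              */
    double sum(const double *x, int n) {
        double s = 0.0;
        int k = 0;
        for (; k < n % 4; ++k)                    /* remainder loop   */
            s += x[k];
        for (; k < n; k += 4)                     /* unrolled 4 times */
            s += x[k] + x[k+1] + x[k+2] + x[k+3];
        return s;
    }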
Branch Prediction

● Instead of waiting for the branch to finish executing
– Try to predict its behavior and act upon the prediction
● Requirements
– Prediction must be cheaper than executing the branch instruction
● Usually based on a few bits of information
– There has to be a way of dealing with wrong predictions
● Beware of exceptions, etc.
● Simple predictor (see the C sketch after this slide pair)
– Keep a bit (or two) for a (fixed) number of branches
– Every time the branch is taken, increase the count
● If N consecutive executions resulted in “branch taken”, then for the next one act as if the branch will be taken
– After N consecutive “branch not taken” outcomes, start predicting “not taken”

Dynamic Scheduling Basics

● Simple techniques can only eliminate some data dependence stalls
– Pipeline scheduling by the compiler
– Forwarding and bypassing
● Dynamic scheduling adds another level of parallelism while maintaining the data flow
– Some dependences are not known until runtime
– The same binaries can run efficiently without recompilation
– The compiler might not know the details of the microarchitecture
– There can be unpredictable delays: multi-level caches
● Disadvantages
– Substantial increase in hardware complexity
– Exception handling (imprecise exceptions)
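A minimal sketch of the simple predictor described above, using 2-bit saturating counters (the table size and PC indexing are assumptions):

    #include <stdint.h>

    #define BPB_SIZE 1024                  /* branch-prediction buffer size */
    static uint8_t counters[BPB_SIZE];     /* 0,1 = not taken; 2,3 = taken  */

    int predict(uint32_t pc) {             /* 1 = predict taken             */
        return counters[pc % BPB_SIZE] >= 2;
    }

    void update(uint32_t pc, int taken) {  /* called once the branch resolves */
        uint8_t *c = &counters[pc % BPB_SIZE];
        if (taken  && *c < 3) ++*c;        /* saturate at 3                 */
        if (!taken && *c > 0) --*c;        /* saturate at 0                 */
    }

With two bits, two consecutive mispredictions are needed to flip the prediction, matching the “keep a bit (or two)” scheme above.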

Correlating Branch Predictors

● Observation (based on existing codes)
– Branches are correlated with each other
● Application: correlating branch predictors
● Instead of keeping track of each branch individually, look also at the recent M branches
● (M, N) predictor (see the C sketch after this slide pair)
– Uses the behavior of the last M branches
● A total of 2^M branch histories
– Each predictor has N bits
● Advantages
– Better prediction yield (always test on your own code!)
– Little hardware required to implement it

Dynamic Scheduling Details

● Dynamic scheduling breaks “in order” execution
– Out-of-order execution
● Incoming instructions are rearranged in an order unknown until runtime
– Out-of-order completion
● The order in which instructions retire depends on the code, the execution, and delays
● New hazards to deal with
– WAR
● Possibility of overwriting a value that has not been read yet
– Load F0, 0(R1) // a load from memory may be stalled for many cycles
– Load R1, #1 // a load of a constant takes only a few cycles
– WAW
● Writing twice to the same location
– RAW hazards are still a problem
● They always remain, since they are “true data dependences”
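A sketch of an (M, N) = (2, 2) correlating predictor: a 2-bit global history selects one of 2^M 2-bit counters per branch (table size and indexing are assumptions):

    #include <stdint.h>

    #define M        2
    #define ENTRIES  1024
    static uint8_t table[ENTRIES][1 << M];  /* N = 2-bit counters            */
    static uint8_t history;                 /* outcomes of the last M branches */

    int predict_corr(uint32_t pc) {
        return table[pc % ENTRIES][history] >= 2;
    }

    void update_corr(uint32_t pc, int taken) {
        uint8_t *c = &table[pc % ENTRIES][history];
        if (taken  && *c < 3) ++*c;
        if (!taken && *c > 0) --*c;
        /* shift the newest outcome into the M-bit global history */
        history = (uint8_t)(((history << 1) | (taken & 1)) & ((1 << M) - 1));
    }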

Tournament Branch Predictors

● Problem
– Branches might be badly mispredicted when moving between program scopes
– The branch-prediction information from the inner scope is inadequate for the outer scope
● Observation
– There is locality in branching
● Inner and outer loops, etc.
● Solution
– Combine local and global information
● Typical predictors
– Size: 8K-32K bits
– Local predictors unchanged
● Examples: DEC Alpha, AMD Phenom and Opteron

Dynamic Scheduling and Hazards

● F0 = F2 / F4
● F6 = F0 + F8      ← true data dependence on F0 (RAW); antidependence on F8 with the subtract below (WAR)
● R1[0] = F6        ← true data dependence on F6 (RAW); antidependence on F6 with the multiply below (WAR)
● F8 = F10 – F14
● F6 = F10 * F8     ← output dependence on F6 (WAW); true data dependence on F8 (RAW)
Register Renaming Example

● Before renaming:
F0 = F2 / F4
F6 = F0 + F8
R1[0] = F6
F8 = F10 – F14
F6 = F10 * F8

● After renaming (with temporaries S and T):
F0 = F2 / F4
S = F0 + F8
R1[0] = S
T = F10 – F14
F6 = F10 * T

● Only RAW hazards remain

Pipelined Execution

Clock cycle:   1      2      3      4      5      6      7      8
Instr. I       fetch  decode exe    mem    write
Instr. I+1            fetch  decode exe    mem    write
Instr. I+2                   fetch  decode exe    mem    write
Instr. I+3                          fetch  decode exe    mem
Instr. I+4                                 fetch  decode exe
Instr. I+5                                        fetch  decode
Instr. I+6                                               fetch

With a one-cycle stall:
Instr. I       fetch  decode exe    mem    write
Instr. I+1            fetch  decode exe    mem    write
Instr. I+2                   fetch  decode exe    mem    write
Instr. I+3                          stall  fetch  decode exe
Instr. I+4                                        fetch  decode
Instr. I+5                                               fetch

Deeper pipeline (s = stall):
I1   F1 F2 R  X1 X2 X3 D1 D2 T  W
I2   F1 F2 R  X1 X2 X3 X4 D1 D2 T  W
I3   F1 F2 R  X1 s  D1 s  s  D2 T  W

Register Renaming Details

● Register renaming is provided by reservation stations (RS)
● Each RS entry contains (see the C sketch after this slide pair):
– The instruction
– Buffered operand values (when available)
– References to the instructions in RS that will provide the values
● Operation
– The RS fetches and buffers an operand as soon as it is available
● Might bypass the register
– Pending instructions indicate the RS entries where they send their output
– Results are broadcast on a result bus (the Common Data Bus)
– Only the last output updates the register file
– Upon instruction issue, registers are renamed with references to RS entries
● There may be more RS entries than registers!

Tomasulo's Algorithm

● Tomasulo's approach allows
– Out-of-order execution (as does scoreboarding, e.g., in the ARM A8)
– Unlike scoreboarding, handling of anti- and output dependences by renaming (e.g., in the four-issue Intel i7)
– An extension to handle speculation
● In Tomasulo's algorithm, each instruction goes through three steps
– Issue
● A FIFO queue maintains correct data flow
● Transfer the instruction to an RS entry if one is available; otherwise stall (structural hazard)
● Rename registers to eliminate WAR and WAW hazards (stall if no data)
– Execute
● Monitor the Common Data Bus for new data and distribute it to waiting RS entries (resolving RAW)
● Execute instructions in the functional units when their operands are available
– Write result (to other RS entries, registers, and store buffers)
● A store buffer waits for the address, the value, and the memory unit(s)
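The RS entry fields listed above map naturally onto a struct; a sketch following the textbook field names (the widths are assumptions):

    #include <stdint.h>

    typedef struct {
        int      busy;     /* is this entry in use?                           */
        uint8_t  op;       /* operation to perform on the operands            */
        uint64_t vj, vk;   /* operand values, once available                  */
        int      qj, qk;   /* RS ids producing Vj/Vk (0 = value already in V) */
        uint64_t a;        /* address field, used by loads and stores         */
    } ReservationStation;

These are exactly the Busy/Op/Vj/Vk/Qj/Qk/A columns of the execution snapshots on the next slides.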

Tomasulo Approach

● Introduced by Robert Tomasulo
– Implemented in the IBM 360/91 in its floating-point unit
– The IBM 360/91 had long memory and floating-point delays
● Only 4 floating-point registers
– Binary compatibility was important for IBM customers
● Modern processors use a variation of Tomasulo's approach
– Also in use is a simpler algorithm called scoreboarding

Example Dynamic Execution: All Issued

Instruction status:
                  Issue   Execute   Write result
Load f6, (r2)     x       x         x
Load f2, (r3)     x       x
Mult f0,f2,f4     x
Sub f8,f2,f6      x
Div f9,f0,f6      x
Add f6,f8,f2      x

Reservation stations:
        Busy   Op     Vj        Vk        Qj      Qk      A
Load1   no
Load2   yes    load                                       Reg[r3]
Add1    yes    sub              Mem[r2]   Load2
Add2    yes    add                        Add1    Load2
Add3    no
Mult1   yes    mul              Reg[f4]   Load2
Mult2   yes    div              Mem[r2]   Mult1

Register status:
f0: Mult1   f2: Load2   f6: Add2   f8: Add1   f9: Mult2
Example Dynamic Execution: Mult Ready

Instruction status (x = done earlier, + = just completed):
                  Issue   Execute   Write result
Load f6, (r2)     x       x         x
Load f2, (r3)     x       x         +
Mult f0,f2,f4     x       +
Sub f8,f2,f6      x       +         +
Div f9,f0,f6      x
Add f6,f8,f2      x       +         +

Reservation stations:
        Busy   Op     Vj        Vk        Qj      Qk      A
Load1   no
Load2   no
Add1    no
Add2    no
Add3    no
Mult1   yes    mul    Mem[r3]   Reg[f4]
Mult2   yes    div              Mem[r2]   Mult1

Register status:
f0: Mult1   f2:   f6:   f8:   f9: Mult2

Reorder Buffer

● Another set of registers (invisible to the programmer) for intermediate results
● ROB registers hold data after instruction completion but before instruction commit
● Each ROB entry (register) contains additional fields (see the C sketch below):
– Instruction type
● Branch (no destination), store (memory destination), register op (ALU or register destination)
– Destination
● Register number (for loads or ALU ops) or memory address (for stores)
– Value
– Ready
● Indicates whether the supplying instruction has completed its execution
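The ROB entry fields above, as a struct (a sketch; the enum and field widths are assumptions):

    #include <stdint.h>

    typedef enum { INSTR_BRANCH, INSTR_STORE, INSTR_REG_OP } InstrType;

    typedef struct {
        InstrType type;    /* branch, store, or register operation           */
        uint64_t  dest;    /* register number, or memory address for a store */
        uint64_t  value;   /* the result, held here until commit             */
        int       ready;   /* has the supplying instruction finished?        */
    } ROBEntry;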

Hardware Speculation Basics

● After dealing with data dependences, control dependences become the issue
– Branch prediction is not as effective with multiple instructions in flight
● If predicted “taken”, conditional instructions are fetched and issued
– Speculation allows execution to proceed almost as if the branch were not there
● Conditional instructions are fetched, issued, and executed
● Hardware speculation comprises
– Dynamic branch prediction
– (Speculative) execution: instructions are executed and possibly undone
– Dynamic scheduling
● More basic blocks become available after branches are speculated out of the instruction stream

Reorder Buffer in Action

● Issue
– Has to wait until a ROB entry is available (in addition to an RS entry)
● Execute
– Results from the Common Data Bus have to end up in the ROB
● Write result
– Results have to be copied into the ROB
● Commit (also called completion or graduation)
– Normal commit: store the result in the destination, mark the ROB entry as empty
– Store commit: the destination is a memory location
– Branch commit:
● If the prediction was correct: no action needed
● If the prediction was incorrect: the ROB result is thrown away and instructions are restarted at the correct branch target

Hardware Speculation Components

● An additional step in instruction execution
– Issue, Execute, Write result, Commit
● Reorder buffer (ROB)
● Handling of...
– Mispredictions
– Mis-speculations
– Exceptions

Reorder Buffer Exception Handling

● Exceptions are not recognized until the instruction is ready to commit
● The ROB records exceptions
– On a misprediction: flush the exception
– Upon reaching the head of the ROB: raise the exception
Speculation at Compile Time

● Original code:
if (x==0) {
    a += 1
    b += 1
    c += 1
}

● Speculated with compensation code (a += 1 is hoisted above the branch and undone on the other path):
a += 1
if (x==0) {
    b += 1
    c += 1
} else {
    a -= 1
}

● Speculated with renaming (updates are made speculatively into copies, committed only if the branch agrees):
a_copy = a + 1
b_copy = b + 1
c_copy = c + 1
if (x==0) {
    a = a_copy
    b = b_copy
    c = c_copy
}

VLIW Disadvantages

● Static parallelism
– Must be discovered and exploited early
● Preferably by the compiler
● Potential for intermediate representations, bytecodes
● Large code size
– Parallelism relies on large basic blocks
– Clever encoding or on-the-fly decompression may be needed
● Lack of hazard detection in lockstep execution
● Binary compatibility
– Take code from a 2-issue VLIW to a (next-generation) 3-issue VLIW
– Add a single ALU unit to the new processor, and old code will not take advantage of it
– New (wider, with more functional units) processors could change the instruction encoding
● The ISA must provide for future hardware expansion

Multiple Issue Execution

● All the techniques presented so far lead to, at best, an ideal CPI of 1
● For CPI to go below 1, multiple instructions need to be retired in most cycles
– Too many stalls can quickly push CPI back above 1
– See Amdahl's law
● The most common flavors of multiple-issue processors
– Statically scheduled processors
● In-order execution
● Examples: MIPS, ARM
– VLIW (very long instruction word) processors
● Each cycle issues a fixed number of multiple instructions
● Examples: DSPs, Itanium, some GPUs
– Dynamically scheduled superscalar processors
● Out-of-order execution
● Examples: Intel Core i3-i7, AMD Phenom, IBM POWER7

Tomasulo Recap

[Figure: Tomasulo datapath. The instruction stream (0x0: Load F2, 0(R1); 0x1: Mul F0, F2, F4; 0x2: ...) issues into reservation stations (e.g., "Load v1, v2" and "Mul v1, v2, v3", plus empty entries) in front of the multiplier/divider and adder/subtractor units; a memory buffer connects to the memory hierarchy; results are broadcast on the common data bus to the register file (F0-F4) and to the waiting reservation stations.]

VLIW Processor Basics

● How many instructions per cycle?
– Two-issue is commonplace
– Four-issue is manageable
● Scheduling techniques
– Local
● Basic blocks
– Global
● Across branches
– Trace
● VLIW-specific
● Extensive loop unrolling to generate large basic blocks
● Disadvantages
– Static parallelism, large code size, lack of hazard detection for lockstep execution, binary compatibility

Multiple Issue Taxonomy

● Superscalar (static): dynamic issue, hardware hazard detection, static scheduling; in-order execution; mostly in the embedded space: MIPS and ARM (Cortex-A8)
● Superscalar (dynamic): dynamic issue, hardware hazard detection, dynamic scheduling; some out-of-order execution but no speculation; no current examples
● Superscalar (speculative): dynamic issue, hardware hazard detection, dynamic scheduling with speculation; out-of-order execution with speculation; Intel Core i3-7, AMD Phenom, IBM POWER7
● VLIW/LIW: static issue, primarily software hazard detection, static scheduling; all hazards determined and indicated by the compiler (often implicitly); most examples in signal processing, such as the TI C6x, plus some GPUs
● EPIC (Explicitly Parallel Instruction Computing): primarily static issue, primarily software hazard detection, mostly static scheduling; all hazards determined and indicated explicitly by the compiler; Itanium
VLIW Processors: Basic Design

● Package multiple operations into one instruction
– Instruction bundles
● Example VLIW processor
– One integer instruction (or branch)
– Two independent floating-point operations
– Two independent memory references
– Notice: there are restrictions on the instructions
● There must be enough parallelism in the code to fill the available slots
– Compiler: aggressive loop unrolling
– Programmer: program restructuring

Return Address Predictor

● Branch prediction deals with conditional branches
● Most unconditional branches come from function returns
● But the same function can be called from multiple sites
– This may cause the branch-prediction buffer to forget the return address from previous calls
● Solution: create a return-address buffer organized as a stack (see the C sketch below)
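A sketch of the return-address stack described above (the depth and wrap-around policy are assumptions):

    #include <stdint.h>

    #define RAS_DEPTH 16
    static uint64_t ras[RAS_DEPTH];
    static int top;                          /* oldest entries get overwritten */

    void on_call(uint64_t return_addr) {     /* at every call instruction      */
        ras[top] = return_addr;
        top = (top + 1) % RAS_DEPTH;
    }

    uint64_t predict_return(void) {          /* at every return instruction    */
        top = (top + RAS_DEPTH - 1) % RAS_DEPTH;
        return ras[top];
    }

Because the buffer is a stack, returns from a function called at different sites pop the correct, most recent call site.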

Modern Microarchitectures

● Combine:
– Dynamic scheduling
– Multiple issue
– Speculation
● Two approaches to dealing with dependences
– Assign reservation stations and update the pipeline control table in half clock cycles
● Only supports 2 instructions per clock
– Design logic to handle any possible dependences between the instructions
● Notice the design complexity
– Hybrid approaches
● New bottleneck:
– Issue logic

Modern Multiple Issue

● Limit the complexity of a single instruction “bundle”
– Limit the bundle size
– Limit the classes of instructions in a bundle
● E.g., one integer, two floating-point
● With limited size, all dependences within a bundle can be examined
● Dependences from a small bundle can also be fully encoded in the reservation stations
● Another bottleneck:
– The completion/commit unit
– Multiple such units are needed to keep up with the incoming instructions
