Cosc530 Ch3all6up
Instruction-Level Parallelism and Its Exploitation

● Loop-level parallelism
  – Loop unrolling (compiler)
  – Dynamic unrolling (superscalar scheduling)
● Data parallelism
  – Vector computers
    ● Cray X1, X1E, X2; NEC SX-9
  – SIMT
    ● GPUs
  – SIMD
    ● Short SIMD (SSE, AVX, Intel Phi)
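
As a rough illustration of compiler loop unrolling, the sketch below writes the same dot-product loop twice: once rolled, and once unrolled by four with independent accumulators. The function names are hypothetical and the unroll factor of four is an assumption. Python is used only to make the transformation visible; an interpreter gains no speed from it, whereas on a pipelined processor the four independent multiply-adds per iteration give the scheduler more instructions to overlap.

```python
def dot_rolled(x, y):
    """Reference loop: one multiply-add per iteration."""
    total = 0.0
    for i in range(len(x)):
        total += x[i] * y[i]
    return total


def dot_unrolled4(x, y):
    """Same computation, unrolled by 4 (assumed factor)."""
    n = len(x)
    main = n - n % 4          # largest multiple of 4 not exceeding n
    s0 = s1 = s2 = s3 = 0.0   # independent accumulators: no chain between them
    for i in range(0, main, 4):
        s0 += x[i] * y[i]
        s1 += x[i + 1] * y[i + 1]
        s2 += x[i + 2] * y[i + 2]
        s3 += x[i + 3] * y[i + 3]
    total = (s0 + s1) + (s2 + s3)
    # Clean-up loop for the leftover 0..3 iterations.
    for i in range(main, n):
        total += x[i] * y[i]
    return total
```

Note that splitting the sum across accumulators reassociates the floating-point additions, so results can differ in the last bits for general inputs; this is one reason compilers apply the transformation to floating-point reductions only when permitted.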
Clock cycles
Instruction 1 2 3 4 5 6 7 8
Instr. I fetch decode exe mem write
Instr. I+1 fetch decode exe mem write
Instr. I+2 fetch decode exe mem write
Instr. I+3 fetch decode exe mem
Instr. I+4 fetch decode exe
Instr. I+5 fetch decode
Instr. I+6 fetch

Instr. I fetch decode exe mem write
Instr. I+1 fetch decode exe mem write
Instr. I+2 fetch decode exe mem write
Instr. I+3 stall fetch decode exe
Instr. I+4 fetch decode
Instr. I+5 fetch

Original code:
● F0 = F2 / F4
● F6 = F0 + F8
● R1[0] = F6
● F8 = F10 - F14
● F6 = F10 * F8

After register renaming:
● F0 = F2 / F4
● S = F0 + F8
● R1[0] = S
● T = F10 - F14
● F6 = F10 * T

Only RAW hazards remain
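
The renaming example above can be checked mechanically. The following is a minimal sketch, not the hardware algorithm: each instruction is modeled as a hypothetical `(dest, sources)` pair, the store `R1[0] = F6` is modeled as having no register destination, and `rename` gives every written register a fresh name (`P0`, `P1`, ...) rather than introducing only `S` and `T` as the slide does. The helper `hazards` does a simple pairwise comparison, which is enough for this five-instruction sequence.

```python
from itertools import count

# (destination register or None, source registers), in program order.
PROGRAM = [
    ("F0", ("F2", "F4")),    # F0 = F2 / F4
    ("F6", ("F0", "F8")),    # F6 = F0 + F8
    (None, ("F6", "R1")),    # R1[0] = F6  (store: no register written)
    ("F8", ("F10", "F14")),  # F8 = F10 - F14
    ("F6", ("F10", "F8")),   # F6 = F10 * F8
]


def hazards(program):
    """Return the set of (kind, earlier_index, later_index) hazards."""
    found = set()
    for j, (dest_j, srcs_j) in enumerate(program):
        for i in range(j):
            dest_i, srcs_i = program[i]
            if dest_i is not None and dest_i in srcs_j:
                found.add(("RAW", i, j))   # j reads what i wrote
            if dest_j is not None and dest_j in srcs_i:
                found.add(("WAR", i, j))   # j overwrites what i reads
            if dest_j is not None and dest_i == dest_j:
                found.add(("WAW", i, j))   # both write the same register
    return found


def rename(program):
    """Give every destination a fresh name; map sources to the latest name."""
    fresh = (f"P{k}" for k in count())
    current = {}                           # architectural name -> latest fresh name
    renamed = []
    for dest, srcs in program:
        new_srcs = tuple(current.get(s, s) for s in srcs)
        new_dest = None
        if dest is not None:
            new_dest = next(fresh)
            current[dest] = new_dest
        renamed.append((new_dest, new_srcs))
    return renamed
```

Comparing `hazards(PROGRAM)` with `hazards(rename(PROGRAM))` shows the WAR hazard on F8 and the WAR and WAW hazards on F6 disappearing, while the three true (RAW) dependences survive.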
I1 F1 F2 R X1 X2 X3 D1 D2 T W
I2 F1 F2 R X1 X2 X3 X4 D1 D2 T W
I3 F1 F2 R X1 s D1 s s D2 T W
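
For the simple fetch/decode/exe/mem/write diagram shown earlier, the stage occupied by each instruction follows directly from its position in program order. The sketch below assumes an ideal single-issue pipeline with no stalls, where instruction I+k enters fetch in cycle k+1; the function name and the list-of-rows layout are illustrative choices, not part of the slides.

```python
STAGES = ["fetch", "decode", "exe", "mem", "write"]


def pipeline_rows(num_instr, num_cycles):
    """Row k lists the stage of instruction I+k in cycles 1..num_cycles.

    Assumes an ideal pipeline: one instruction starts per cycle, no stalls.
    An empty string means the instruction is not in the pipeline that cycle.
    """
    rows = []
    for k in range(num_instr):
        row = []
        for cycle in range(1, num_cycles + 1):
            stage = cycle - 1 - k          # stage index occupied this cycle
            row.append(STAGES[stage] if 0 <= stage < len(STAGES) else "")
        rows.append(row)
    return rows


if __name__ == "__main__":
    for k, row in enumerate(pipeline_rows(7, 7)):
        label = "Instr. I" if k == 0 else f"Instr. I+{k}"
        print(f"{label:<11}" + "".join(f"{s:<8}" for s in row))
```

With seven instructions and seven cycles, the rows match the partial stage lists in the diagram (for example, Instr. I+3 has reached only `mem` by the last cycle shown).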
Speculation at Compile Time
● if (x==0) {
    a += 1
    b += 1
    c += 1
  }
● a += 1
  if (x==0) {
    b += 1
    c += 1
  } else {
    a -= 1
  }
● a_copy = a+1
  b_copy = b+1
  c_copy = c+1
  if (x==0) {
    a = a_copy
    b = b_copy
    c = c_copy
  }

VLIW Disadvantages
● Static parallelism
  – Must be discovered and exploited early
    ● Preferably by the compiler
    ● Potential for intermediate representations, bytecodes
● Large code size
  – Parallelism relies on large basic blocks
  – Clever encoding or on-the-fly decompression may be needed
● Lack of hazard detection in lockstep execution
● Binary compatibility
  – Take code from 2-issue VLIW to (next-gen) 3-issue VLIW
  – Add a single ALU unit to the new processor and the old code will not take advantage of it
  – New (wider, with more functions) processors could change instruction encoding
● ISA must provide for future hardware expansions
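
The three code fragments on the compile-time speculation slide compute the same result; they differ only in how much work is moved above the branch. A small sketch, assuming integer variables and hypothetical function names, makes the equivalence testable:

```python
def original(a, b, c, x):
    """All three updates are guarded by the branch."""
    if x == 0:
        a += 1
        b += 1
        c += 1
    return a, b, c


def hoist_with_fixup(a, b, c, x):
    """Speculate `a += 1` above the branch; undo it on the other path."""
    a += 1                # executed before the branch outcome is known
    if x == 0:
        b += 1
        c += 1
    else:
        a -= 1            # compensation code for the mis-speculated update
    return a, b, c


def compute_then_commit(a, b, c, x):
    """Compute into copies unconditionally; commit only if the branch is taken."""
    a_copy = a + 1
    b_copy = b + 1
    c_copy = c + 1
    if x == 0:
        a, b, c = a_copy, b_copy, c_copy
    return a, b, c
```

The second form costs extra work on the `x != 0` path, while the third form needs extra registers for the copies; in both cases the speculated operations must not raise exceptions or have other visible side effects.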
  – Global
    ● Across branches (speculative)
  – Trace
    ● VLIW-specific
    ● Extensive loop unrolling to generate large basic blocks
● Disadvantages
  – Static parallelism, large code size, lack of hazard detection for lockstep execution, binary compatibility

Common name | Issue structure | Hazard detection | Scheduling | Distinguishing characteristic | Examples
Superscalar (speculative) | Dynamic | Hardware | Dynamic with speculation | Out-of-order execution with speculation | Intel Core i3-7; AMD Phenom; IBM POWER7
VLIW/LIW & Static | Static | Primarily software | Static | All hazards determined and indicated by compiler (often implicitly) | Most examples are in signal processing, such as TI C6x. Also some GPUs
EPIC (Exp. Parallel Instruction Comp.) | Primarily static | Primarily software | Mostly static | All hazards determined and indicated explicitly by the compiler | Itanium
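
The binary-compatibility disadvantage of VLIW can be illustrated with a toy model. The sketch below is an assumption-laden simplification: every operation has unit latency, all functional units are identical, a bundle executes in one cycle, and `pack_bundles` is a hypothetical greedy in-order packer standing in for the compiler. Real VLIW encodings and schedulers are considerably more involved.

```python
def pack_bundles(deps, width):
    """Statically pack operations into bundles of at most `width` slots.

    deps[i] is the set of indices of earlier operations that op i depends on.
    Returns a list of bundles, each a list of operation indices.
    """
    bundle_of = []                        # bundle index chosen for each op
    bundles = []
    for i, d in enumerate(deps):
        # An op must come at least one bundle after every op it depends on.
        b = max((bundle_of[j] + 1 for j in d), default=0)
        while b < len(bundles) and len(bundles[b]) >= width:
            b += 1                        # bundle full, try the next one
        if b == len(bundles):
            bundles.append([])
        bundles[b].append(i)
        bundle_of.append(b)
    return bundles


def cycles_on(bundles, machine_width):
    """Cycles to run already-packed code in lockstep on a given machine."""
    if any(len(b) > machine_width for b in bundles):
        raise ValueError("bundle wider than the machine: binary will not run")
    return len(bundles)                   # one bundle per cycle, empty slots wasted
```

For six independent operations, code packed for a 2-issue machine needs three bundles and still takes three cycles on a 3-issue machine, because the extra slot in each bundle stays empty; only repacking for width 3 reduces it to two cycles.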
VLIW Processors Basic Design Return Address Predictor
Modern Microarchitectures
● Combine:
  – Dynamic scheduling
  – Multiple issue
  – Speculation
● Two approaches to dealing with dependences
  – Assign reservation stations and update pipeline control table in half clock cycles
    ● Only supports 2 instructions/clock
  – Design logic to handle any possible dependencies between the instructions
    ● Notice: design complexity
  – Hybrid approaches
● New bottleneck:
  – Issue logic
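
To see why issue logic becomes the bottleneck, consider what "handle any possible dependencies between the instructions" means for a group issued in the same cycle. The sketch below is a software model under stated assumptions: each instruction is a hypothetical `(dest, sources)` pair, and each source of each instruction is compared against the destination of every earlier instruction in the group. In hardware these comparisons are done in parallel, so the count corresponds to comparators rather than time.

```python
def intra_bundle_dependences(bundle):
    """Find dependences among instructions issued in the same cycle.

    bundle: list of (dest, sources) pairs in program order.
    Returns (deps, comparisons), where deps is a list of
    (producer_index, consumer_index, register) and comparisons is the
    number of register-name comparisons performed.
    """
    deps = []
    comparisons = 0
    for j, (_, srcs_j) in enumerate(bundle):
        for i in range(j):
            dest_i = bundle[i][0]
            for s in srcs_j:
                comparisons += 1
                if s == dest_i:
                    deps.append((i, j, s))
    return deps, comparisons
```

With two source operands per instruction, a group of n instructions needs n(n-1) comparisons: 2 for a 2-issue machine, 12 for 4-issue, 56 for 8-issue. This quadratic growth is the design complexity the slide warns about.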