Chapter 2: Instruction-Level Parallelism
Introduction
Pipelining became a universal technique by 1985.
Overlaps the execution of instructions.
Exploits “Instruction-Level Parallelism (ILP)”.
Review of basic concepts
Pipelining: each instruction is split up into a sequence of steps
– different steps can be executed concurrently by different circuitry.
A basic pipeline in a RISC processor
IF – Instruction Fetch
ID – Instruction Decode
EX – Instruction Execution
MEM – Memory Access
WB – Register Write Back
Two techniques:
Superscalar - a superscalar processor executes more than one instruction per clock cycle (see the full definition below).
Review of basic concepts
IF – Instruction Fetch
  Send the PC to memory.
  Fetch the current instruction from memory.
  Update the PC to the next instruction by adding 4.
ID – Instruction Decode
  Decode the instruction.
  Read the registers corresponding to the register source specifiers.
EX – Instruction Execution
  Memory reference: add the base register and the offset to form the effective address.
  Register-register ALU instruction: perform the ALU operation.
  Conditional branch: determine whether the condition is true.
MEM – Memory Access: load/store to memory.
WB – Register Write Back: for register-register ALU instructions and load instructions.
Note: branches require three cycles, store instructions require four cycles, and all other instructions require five cycles.
Basic superscalar 5-stage pipeline
Superscalar - a processor that executes more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units on the processor.
The hardware determines (statically or dynamically) which one of a block of n instructions will be executed next.
Pipeline registers prevent interference between two different instructions in adjacent stages.
Hazards - situations in which pipelining could lead to incorrect results.
Data dependence “true dependence”: Read After Write hazard (RAW)
i:   sub R1, R2, R3   % sub d,s,t : d = s - t
i+1: add R4, R1, R3   % add d,s,t : d = s + t
Instruction (i+1) reads operand (R1) before instruction (i) writes it.
Name dependence “anti dependence”: two instructions use the same register or memory location, but there is no data dependence between them.
Write After Read hazard (WAR). Example:
i:   sub R4, R5, R3
i+1: add R5, R2, R3
i+2: mul R6, R5, R7
Instruction (i+1) writes operand (R5) before instruction (i) reads it.
Write After Write hazard (WAW) “output dependence”. Example:
i:   sub R6, R5, R3
i+1: add R6, R2, R3
i+2: mul R1, R2, R7
Instruction (i+1) writes operand (R6) before instruction (i) writes it.
Hazards
Data hazards => RAW, WAR, WAW.
Structural hazard - occurs when a part of the processor's
hardware is needed by two or more instructions at the same time.
Example: a single memory unit that is accessed both in the fetch
stage where an instruction is retrieved from memory, and the
memory stage where data is written and/or read from memory.
They can often be resolved by separating the component into
orthogonal units (such as separate caches) or bubbling the
pipeline.
Introduction
When exploiting instruction-level parallelism (ILP), the goal is to minimize CPI (cycles per instruction):
Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
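A worked example with assumed (purely illustrative) stall counts: if the ideal pipeline CPI is 1.0 and each instruction incurs on average 0.0 structural stalls, 0.4 data hazard stalls, and 0.2 control stalls, then
Pipeline CPI = 1.0 + 0.0 + 0.4 + 0.2 = 1.6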
Introduction
Loop-Level Parallelism
Unroll loop statically or dynamically
Use SIMD (vector processors and GPUs)
Challenges:
Data dependence: instruction j is data dependent on instruction i if
  instruction i produces a result that may be used by instruction j, or
  instruction j is data dependent on instruction k and instruction k is data dependent on instruction i (a chain of dependences).
Dependent instructions cannot be executed simultaneously.
Introduction
Dependencies are a property of programs
Pipeline organization determines if dependence is
detected and if it causes a “stall.”
Introduction
Every instruction is control dependent on some set of branches
and, in general, the control dependencies must be preserved to
ensure program correctness.
An instruction that is control dependent on a branch cannot be moved before the branch, so that its execution is no longer controlled by the branch.
An instruction that is not control dependent on a branch cannot be moved after the branch, so that its execution is controlled by the branch.
Example:
BEQZ R1, L1
…
L1: LW R4, 0(R1)
Can we move the LW before the BEQZ?
Preserving data flow: the flow of data between instructions that produce results and those that consume them. Example:
DADDU R1, R2, R3
BEQZ R4, L1
DSUBU R1, R8, R9
L1: LW R5, 0(R2)
Introduction
Examples
• Example 1: the OR instruction is data dependent on both DADDU and DSUBU (the value in R1 depends on which path reaches the OR).
DADDU R1,R2,R3
BEQZ R4,L
DSUBU R1,R1,R6
L: …
OR R7,R1,R8
Compiler Techniques
Pipeline stall - a delay in the execution of an instruction in an instruction pipeline in order to resolve a hazard. The compiler can reorder instructions to reduce the number of pipeline stalls.
Pipeline scheduling - separate a dependent instruction from the source instruction by the pipeline latency of the source instruction.
Example:
for (i=999; i>=0; i=i-1)
    x[i] = x[i] + s;
Pipeline stalls
Compiler Techniques
Assumptions:
1. The scalar s is in F2.
2. R1 initially holds the address of the array element with the highest address.
3. The array element with the lowest address is at 8(R2).

Loop: L.D F0,0(R1)      % load array element into F0
      stall             % next instruction needs the result in F0
      ADD.D F4,F0,F2    % add the scalar in F2 to the array element
      stall
      stall             % next instruction needs the result in F4
      S.D F4,0(R1)      % store array element
      DADDUI R1,R1,#-8  % decrement pointer to array element
      BNE R1,R2,Loop    % loop if not the last element
Compiler Techniques
Pipeline scheduling
Scheduled code:
Loop: L.D F0,0(R1)
      DADDUI R1,R1,#-8  % moved up to hide the load latency
      ADD.D F4,F0,F2
      stall
      stall
      S.D F4,8(R1)      % offset changed from 0 to 8: R1 has already been decremented
      BNE R1,R2,Loop
Compiler Techniques
2. Loop unrolling
Given the same code, unroll the loop body four times so that three of the four DADDUI and BNE operations are eliminated (a C-level sketch of the unrolling appears after the scheduled version below).
Unrolled loop:
Loop: L.D F0,0(R1)
ADD.D F4,F0,F2
S.D F4,0(R1) % drop DADDUI & BNE
L.D F6,-8(R1)
ADD.D F8,F6,F2
S.D F8,-8(R1) % drop DADDUI & BNE
L.D F10,-16(R1)
ADD.D F12,F10,F2
S.D F12,-16(R1) % drop DADDUI & BNE
L.D F14,-24(R1)
ADD.D F16,F14,F2
S.D F16,-24(R1)
DADDUI R1,R1,#-32
BNE R1,R2,Loop
Compiler Techniques
Unrolled and scheduled loop:
Loop: L.D F0,0(R1)
      L.D F6,-8(R1)
      L.D F10,-16(R1)
      L.D F14,-24(R1)
      ADD.D F4,F0,F2
      ADD.D F8,F6,F2
      ADD.D F12,F10,F2
      ADD.D F16,F14,F2
      S.D F4,0(R1)
      S.D F8,-8(R1)
      DADDUI R1,R1,#-32
      S.D F12,16(R1)
      S.D F16,8(R1)
      BNE R1,R2,Loop
Scheduling the unrolled loop eliminates the stalls:
1. The L.D latency is one cycle, so by the time the ADD.D instructions issue, F0, F6, F10, and F14 are already loaded.
2. The ADD.D-to-store latency is two cycles, so the stores can proceed immediately after the adds.
3. The array pointer is updated after the first two S.D instructions, so the last two stores use offsets 16(R1) and 8(R1), and the loop control can proceed immediately after the last two S.D.
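For comparison, a minimal C sketch of the same 4x unrolling, assuming the trip count (1000) is divisible by 4 so that no cleanup loop is needed; the function name is illustrative:

    /* 4x unrolled form of: for (i = 999; i >= 0; i = i - 1) x[i] = x[i] + s; */
    void add_scalar_unrolled(double *x, double s) {
        for (int i = 999; i >= 0; i -= 4) {
            x[i]     = x[i]     + s;   /* iteration i   */
            x[i - 1] = x[i - 1] + s;   /* iteration i-1 */
            x[i - 2] = x[i - 2] + s;   /* iteration i-2 */
            x[i - 3] = x[i - 3] + s;   /* one test/branch per four elements */
        }
    }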
Loop unrolling & scheduling summary
Use different registers to avoid unnecessary constraints.
Adjust the loop termination and iteration code.
Determine whether the loop iterations are independent (apart from the loop maintenance code); if so, unroll the loop.
Analyze memory addresses to determine whether loads and stores from different iterations are independent; if so, loads and stores may be interchanged in the unrolled loop.
Schedule the code while ensuring correctness.
Limitations of loop unrolling
Diminishing returns: each additional unrolling eliminates less overhead.
Growth of the code size.
Register pressure (shortage of registers): scheduling to increase ILP increases the number of simultaneously live values, and thus the number of registers needed.
Branch prediction
Branch prediction - guess whether a conditional
jump will be taken or not.
Goal - improve the flow in the instruction pipeline.
Speculative execution - the branch that is
guessed to be the most likely is then fetched and
speculatively executed.
Penalty - if it is later detected that the guess was
wrong then the speculatively executed or partially
executed instructions are discarded and the
pipeline starts over with the correct branch,
incurring a delay.
Branch prediction
• As pipelines get deeper and the potential penalty of
branches increases,
– using delayed branches is insufficient.
• Predicting branches
– low-cost static schemes that rely on information
available at compile time (always taken or not)
• use profile information collected from earlier runs.
• misprediction rates for such static prediction range from about 3% to 24%.
– predict branches dynamically based on program
behavior.
Dynamic Branch prediction
• How - the branch predictor keeps records of
whether branches are taken or not taken.
• When it encounters a conditional jump that has been seen several times before, it can base the prediction on that history.
• The branch predictor may, for example,
recognize that the conditional jump is taken
more often than not, or that it is taken every
second time
Branch Prediction
Predictors
Basic 2-bit predictor:
For each branch:
Predict taken or not taken.
If the prediction is wrong two consecutive times, change the prediction (see the C sketch after the state diagram below).
Correlating predictor:
Multiple 2-bit predictors for each branch
One for each possible combination of outcomes of preceding n
branches
Local predictor:
Multiple 2-bit predictors for each branch
One for each possible combination of outcomes for the last n
occurrences of this branch
Tournament predictor:
Combine correlating predictor with local predictor
[Figure: the states in a 2-bit prediction scheme]
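A minimal C sketch of a 2-bit saturating-counter predictor; the table size, the hashing of the PC, and the function names are illustrative assumptions, not from the original slides:

    #include <stdint.h>

    #define TABLE_SIZE 1024
    static uint8_t counters[TABLE_SIZE];   /* each entry is a 2-bit counter, 0..3 */

    /* Predict taken when the counter is in one of the two "taken" states (2 or 3). */
    int predict(uint32_t pc) {
        return counters[(pc >> 2) % TABLE_SIZE] >= 2;
    }

    /* Saturate toward 3 on taken, toward 0 on not taken: two consecutive
       mispredictions are needed to flip a strongly biased state. */
    void update(uint32_t pc, int taken) {
        uint8_t *c = &counters[(pc >> 2) % TABLE_SIZE];
        if (taken) { if (*c < 3) (*c)++; }
        else       { if (*c > 0) (*c)--; }
    }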
Correlating predictor
• Branch predictors that use the behavior of other
branches to make a prediction
• A (1,2) predictor uses the behavior of the last
branch to choose from among a pair of 2-bit
branch predictors in predicting a particular
branch.
• An (m,n) predictor uses the behavior of the last m branches to choose from 2^m branch predictors, each of which is an n-bit predictor for a single branch.
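A sketch of a (2,2) correlating predictor in the same style as the 2-bit predictor above: a 2-bit global history register selects one of 2^2 = 4 two-bit counters per table entry. Sizes, hashing, and names are again illustrative assumptions (reuses uint8_t/uint32_t from <stdint.h>):

    #define ENTRIES 256
    static uint8_t table2[ENTRIES][4];  /* four 2-bit counters per branch entry */
    static uint8_t history;             /* outcomes of the last 2 branches */

    int predict_corr(uint32_t pc) {
        return table2[(pc >> 2) % ENTRIES][history & 3] >= 2;
    }

    void update_corr(uint32_t pc, int taken) {
        uint8_t *c = &table2[(pc >> 2) % ENTRIES][history & 3];
        if (taken) { if (*c < 3) (*c)++; }
        else       { if (*c > 0) (*c)--; }
        history = (uint8_t)(((history << 1) | (taken ? 1 : 0)) & 3); /* shift in outcome */
    }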
Correlating predictor
    a = 0;
B1: while (a < 1000) {
        if (a % 2 != 0) { ... }
        a++;
        ...
B2:     if (b == 0) { ... }
    }
Branch history:
B1: TNTNTNTNT
B2: NNNNNNNN
[Figures: behavior of 2-bit prediction vs. a correlating predictor on this example]
Branch Prediction
[Figure: branch prediction performance]
Dynamic scheduling
Rearrange order of instructions to reduce stalls
while maintaining data flow
Advantages:
Compiler doesn’t need to have knowledge of
microarchitecture
Handles cases where dependencies are unknown at
compile time
Disadvantages:
Substantial increase in hardware complexity
Complicates exceptions
Dynamic scheduling
Dynamic scheduling implies:
Out-of-order execution
Out-of-order completion
Tomasulo’s Approach
Tracks when operands are available
Introduces register renaming in hardware
Minimizes WAW and WAR hazards
Register renaming
Example:
DIV.D F0,F2,F4
ADD.D F6,F0,F8
S.D F6,0(R1)
SUB.D F8,F10,F14  % antidependence (WAR) on F8: ADD.D reads F8, SUB.D writes it
MUL.D F6,F10,F8   % output dependence (WAW) on F6 with ADD.D, plus a name dependence on F6 with S.D
Register renaming
Example: add two temporary registers S and T
i: DIV.D F0,F2,F4
i+1: ADD.D S,F0,F8   (instead of ADD.D F6,F0,F8)
i+2: S.D S,0(R1)     (instead of S.D F6,0(R1))
i+3: SUB.D T,F10,F14 (instead of SUB.D F8,F10,F14)
i+4: MUL.D F6,F10,T  (instead of MUL.D F6,F10,F8)
Reservation stations (RS) buffer values:
An RS fetches and buffers an operand as soon as it becomes available (not necessarily involving the register file)
necessarily involving register file)
Pending instructions designate the RS to which they will send their
output
Result values broadcast on a result bus, called the common data bus (CDB)
Only the last output updates the register file
As instructions are issued, the register specifiers are renamed with
the reservation station
May be more reservation stations than registers
Tomasulo’s algorithm
Goal: high performance without special compilers, when the hardware has only a small number of floating-point (FP) registers.
Designed in 1966 for the IBM 360/91, which had only 4 FP registers.
Used in modern processors: Pentium 2/3/4, PowerPC 604, Nehalem, …
Additional hardware needed
Load and store buffers contain data and addresses and act like reservation stations.
Reservation stations feed data to the floating-point arithmetic units.
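A minimal C sketch of one reservation-station entry with the classic fields; the exact layout and types are illustrative assumptions:

    /* One reservation-station entry in Tomasulo's scheme. */
    typedef struct {
        int    busy;    /* is this entry in use? */
        int    op;      /* operation to perform on vj and vk */
        double vj, vk;  /* operand values, once available */
        int    qj, qk;  /* tags of the RS producing each operand (0 = value ready) */
        int    addr;    /* effective address, for load/store buffers */
    } ReservationStation;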
Instruction execution steps
Issue
  Get the next instruction from the FIFO instruction queue.
  If a reservation station is available, issue the instruction to the RS, with operand values if they are available.
  If no RS is available, stall the instruction (structural hazard); if operand values are not yet available, the instruction waits in the RS and records which stations will produce them.
Execute
  When an operand becomes available, store it in any reservation stations waiting for it.
  When all operands are ready, execute the instruction.
  Loads and stores are maintained in program order through the effective-address calculation.
  No instruction is allowed to initiate execution until all branches that precede it in program order have completed.
Write result
  Broadcast the result on the common data bus (CDB) to the waiting reservation stations and the register file.
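Building on the ReservationStation sketch above, an illustrative issue step; reg_status[] (0 = register value ready, otherwise the tag of the producing RS) and regs[] are assumed bookkeeping structures, not from the original slides:

    int issue(ReservationStation rs[], int nrs, int op,
              int src1, int src2, const int reg_status[], const double regs[]) {
        for (int i = 0; i < nrs; i++) {
            if (!rs[i].busy) {
                rs[i].busy = 1;
                rs[i].op   = op;
                /* read a ready operand now, or record the producer's tag */
                if (reg_status[src1] == 0) { rs[i].vj = regs[src1]; rs[i].qj = 0; }
                else                       { rs[i].qj = reg_status[src1]; }
                if (reg_status[src2] == 0) { rs[i].vk = regs[src2]; rs[i].qk = 0; }
                else                       { rs[i].qk = reg_status[src2]; }
                return i + 1;              /* 1-based RS tag */
            }
        }
        return 0;  /* no free RS: structural hazard, caller must stall */
    }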
Example walkthrough (the accompanying figures are omitted):
The second load, from memory location 45+(R3), is issued; its data will be stored in load buffer 2 (Load2). Multiple loads can be outstanding.
Clock cycle 3
Issue ADDD.
Clock cycle 7
The result of the SUBD produced by Add1 will be available in the next cycle; the ADDD instruction executing in Add2 waits for it.
Clock cycle 8
Only the MULTD and DIVD instructions have not completed. DIVD is waiting for the result of MULTD before moving to the execute stage.
Clock cycles 12-15
[Figures omitted]
DIVD will finish execution in cycle 56, and the result will be in F6-F7 in cycle 57.
Hardware-based speculation
Goal: overcome control dependence by speculating on the outcome of branches.
Allow instructions to execute out of order, but force them to commit in order, to prevent them from (i) updating the architectural state or (ii) taking an exception before commit.
Instruction commit: allow an instruction to update the register file only when the instruction is no longer speculative.
Key ideas:
1. Dynamic branch prediction.
2. Execute instructions along predicted execution paths, but only commit the results if the prediction was correct.
3. Dynamic scheduling to deal with different combinations of basic blocks.
How speculative execution is done
Need additional hardware to prevent any irrevocable action until an instruction commits:
Reorder buffer (ROB).
Modified functional units: the operand source is the ROB rather than the functional units.
Register values and memory values are not written until an instruction commits.
On misprediction:
  Speculated entries in the ROB are cleared.
Exceptions:
  Not recognized until the instruction is ready to commit.
Extended floating point unit
The FP unit using Tomasulo’s algorithm, extended to handle speculation.
The reorder buffer now holds the result of an instruction between completion and commit. Each ROB entry has 4 fields:
  Instruction type: branch / store / register operation
  Destination field: register number
  Value field: the output value
  Ready field: has the instruction completed execution?
The operand source is now the reorder buffer instead of the functional unit.
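A matching C sketch of a reorder-buffer entry with the four fields just listed; the types and names are illustrative:

    typedef enum { TYPE_BRANCH, TYPE_STORE, TYPE_REGISTER } InstrType;

    typedef struct {
        InstrType type;   /* instruction type: branch / store / register op */
        int       dest;   /* destination field: register number */
        double    value;  /* value field: the output value, once computed */
        int       ready;  /* ready field: has execution completed? */
    } ROBEntry;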
Multiple Issue and Static Scheduling
To achieve CPI < 1, complete multiple instructions per clock cycle.
Three flavors of multiple issue processors
1. Statically scheduled superscalar processors
a. Issue a varying number of instructions per clock
cycle
b. Use in-order execution
2. VLIW (very long instruction word) processors
a. Issue a fixed number of instructions as one large
instruction
b. Instructions are statically scheduled by the compiler
3. Dynamically scheduled superscalar processors
a. Issue a varying number of instructions per clock
cycle
b. Use out-of-order execution
[Figure: multiple issue processors]
VLIW Processors
Package multiple operations into one (very long) instruction; a sketch of such an instruction word follows the list below.
There must be enough parallelism in the code to fill the available slots.
Disadvantages:
Statically finding parallelism
Code size
No hazard detection hardware
Binary code compatibility
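An illustrative C struct for one VLIW instruction word; the slot mix (two memory, two FP, one integer/branch operation) is an assumption for illustration only:

    typedef struct { int opcode, dest, src1, src2; } Op;

    typedef struct {
        Op mem[2];   /* two memory-operation slots */
        Op fp[2];    /* two FP-operation slots */
        Op intop;    /* one integer/branch slot */
    } VLIWInstruction;  /* the compiler fills unused slots with NOPs */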
Modern microarchitectures:
Dynamic scheduling + multiple issue + speculation
Two approaches:
  Assign reservation stations and update the pipeline control table in half clock cycles (only supports 2 instructions per clock).
  Design the logic to handle any possible dependences between the instructions issued in the same cycle.
[Table: time of issue, execution, and result write in a dual-issue pipeline, covering integer operations, load/store, and FP addition]
The LD following the BNE (cycles 3, 6) cannot start execution earlier; it must wait until the branch outcome is determined, as there is no speculation.
Dual issue with speculation