
Chapter 3
Instruction-Level Parallelism and Its Exploitation

Instruction Parallelism Examples

● Loop-level parallelism
– Loop unrolling (compiler)
– Dynamic unrolling (superscalar scheduling)
● Data parallelism
– Vector computers
● Cray X1, X1E, X2; NEC SX-9
– SIMT
● GPUs
– SIMD
● Short SIMD (SSE, AVX, Intel Xeon Phi)

Introduction

● Instruction-level parallelism (ILP)
– (potential) overlap among instructions
● First universal ILP technique: pipelining (universal since 1985)
● Two approaches to ILP
– Discover and exploit parallelism in hardware
● Dominant in the server and desktop market segments
● Not used in the PMD segment due to energy constraints
– May be changing with the Cortex-A9
– Software-based discovery at compile time
● Technical markets, scientific computing, HPC
● Itanium is an example of aggressive software discovery
– But mostly abandoned by the majority of server makers

Types of Dependences

● Data dependences
● Name dependences
● Control dependences

Instruction-Level Parallelism Basics

● Goal: minimize CPI (maximize IPC)
● In a pipelined processor
– CPI = ideal CPI + stalls (overheads); a worked example follows this slide pair:
● Structural stalls
● Data hazard stalls
● Control stalls
● The fundamental unit for extracting parallelism is the
– Basic block (block of instructions between branches)
– Branches disrupt analysis and add runtime dependence
● But typical basic blocks are small
– 3-6 instructions (15%-25% branch frequency)
● Optimizing across branches is a must
– Examples: loop-level parallelism, data parallelism (SIMD)

Data Dependence: Basics

● Not all instructions can be executed in parallel
● Data-dependent instructions have to be executed “in order”
– Data dependence is a property of the code
● Instruction j is data dependent on instruction i if
– Instruction i produces a result used by instruction j, or
– Instruction j depends on instruction k, and k is data dependent on i
● Pipeline interlocks
– With interlocks, data dependence causes a hazard and a stall
– Without interlocks, data dependence prohibits the compiler from scheduling instructions with overlap
● Data dependence conveys:
– The possibility of a hazard (= a negative side effect if not “in order”)
– The required order of instructions
– An upper bound on achievable parallelism
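As a quick illustration of the CPI decomposition on the previous slide (the stall rates here are invented for the example, not taken from the slides):

    CPI = ideal CPI + structural stalls + data hazard stalls + control stalls
        = 1.00      + 0.05              + 0.10               + 0.02
        = 1.17,  so IPC = 1 / 1.17 ≈ 0.85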
Data Dependence Example

double F0, *R1

Loop: Load.D    F0, 0(R1)      ; F0 = array element       F0 = R1[0]
      Add.D     F4, F0, F2     ; add scalar in F2         F4 = F0 + F2
      Store.D   F4, 0(R1)      ; store result             R1[0] = F4
      Add.I     R1, R1, #-8    ; decrement by 8 bytes     R1 -= 1
      Branch.NE R1, R2, Loop   ; if (R1 != R2) goto Loop

(A C rendering of this loop follows the next slide.)

Data Hazards

● Hazards exist as a result of data or name dependences
– Overlap of dependent (and nearby) instructions could change the access order to the instructions' operands
– Avoiding hazards ensures program order
● Possible data hazards
– RAW (read after write) ← true data dependence
● Instruction i: write to x
● Instruction j: read from x
– WAW (write after write) ← output dependence
● Instruction i: write to x
● Instruction j: write to x
– WAR (write after read) ← antidependence
● Instruction i: read from x
● Instruction j: write to x
● RAR is not a hazard
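A C rendering of the Data Dependence Example loop above may make the dependence chains easier to see (a sketch; the function and variable names are illustrative, not from the slides):

    /* Adds the scalar s to each element, walking the array downward, */
    /* as the assembly does with the decrementing R1.                 */
    void add_scalar(double *x, long n, double s) {
        for (long i = n - 1; i >= 0; --i)   /* Add.I / Branch.NE      */
            x[i] = x[i] + s;   /* Load.D -> Add.D -> Store.D: RAW chain */
    }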

Data Dependence Details

● Overcoming data dependence
– Maintain the dependence but prevent the hazard
● By the compiler or by the hardware scheduler
– Eliminate the dependence by transforming the code
● A separate topic
– Dependences that flow through memory are harder to detect (see the C sketch after the next slide)
● Is R4[100] the same as R6[20]?
● Is R4[20] the same as R6[20]?

Control Dependence

● Control dependence determines the ordering of an instruction with respect to a branch instruction
– The order must be preserved
– The execution should be conditional
● Example:
– if p1 { S1 }
– if p2 { S2 }
– S1 is control dependent on p1
– S2 is control dependent on p2
– S2 is not control dependent on p1
● Branches create barriers to potential code motion
● It might be possible to violate control dependence but preserve correct execution with extra hardware
– Speculative execution
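The memory-disambiguation questions above are the pointer-aliasing problem; a minimal C sketch (the names are assumed, for illustration only):

    /* Without alias information, the compiler must assume the store */
    /* to a[100] may touch the same location as b[20], so the load   */
    /* cannot be scheduled above the store.                          */
    double may_alias(double *a, double *b) {
        a[100] = 1.0;      /* is &a[100] == &b[20]?                  */
        return b[20];      /* possible dependence through memory     */
    }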

Name Dependence

● Name dependence occurs when two instructions i and j use the same register (or memory location) but there is no flow of data between them
● Two types of name dependence
– Antidependence: instruction i reads what instruction j writes
● Store.D F4, 0(R1)   ; i reads R1
● Add.I R1, R1, #-8   ; j writes R1
– Output dependence: instructions i and j both write to the same name
● Renaming is the common technique for dealing with name dependence
– Register renaming
– Shadow registers

Control Dependence Examples

● Exception handling
– Add R2, R3, R4
– Branch.equal0 R2, L1
– Load R1, 0(R2)
– L1: NoOp
– No data dependence between the Branch and the Load
– But a load from a wrong R2 could cause an exception; in C terms:
● int *r2 = r3 + r4; y = r2 ? r2[0] : 0;
● Data flow
– Add R1, R2, R3
– Branch.equal0 R4, L
– Subtract R1, R5, R6
– L: NoOp
– Or R7, R1, R8
– The R1 consumed by the Or comes from either the Add or the Subtract, depending on the branch outcome
Control Dependence: Software Speculation

● Ignoring control dependence may be possible after code analysis (liveness property)
– Add R1, R2, R3
– Branch.eq0 R12, Skip
– Subtract R4, R5, R6
– Add R5, R4, R9
– Skip: Or R7, R8, R9
– ; R4 is not used again (it is dead), so the Subtract can be hoisted above the branch

Original vs. Unrolled Loop

● Original loop:
Loop: F0 = R1[0]
      F4 = F0 + F2
      R1[0] = F4
      R1 -= 1
      if (R1 != R2) goto Loop

● Unrolled four times (see the C sketch below):
Loop: F0 = R1[0]
      F4 = F0 + F2
      R1[0] = F4
      F6 = R1[-1]
      F8 = F6 + F2
      R1[-1] = F8
      F10 = R1[-2]
      F12 = F10 + F2
      R1[-2] = F12
      F14 = R1[-3]
      F16 = F14 + F2
      R1[-3] = F16
      R1 -= 4
      if (R1 != R2) goto Loop

● New registers: F6, F8, ...
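The same transformation in C (a sketch; it assumes the trip count is divisible by 4 and reuses the illustrative add_scalar loop from earlier):

    void add_scalar_unrolled4(double *x, long n, double s) {
        for (long i = n - 1; i >= 3; i -= 4) {  /* one branch per four elements */
            x[i]     = x[i]     + s;            /* each copy uses its own       */
            x[i - 1] = x[i - 1] + s;            /* register (F4, F8, F12, F16)  */
            x[i - 2] = x[i - 2] + s;
            x[i - 3] = x[i - 3] + s;
        }
    }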

Compiler Techniques for Exposing ILP

● Pipeline scheduling: the original loop vs. the compiler-scheduled loop, given the latencies below

● Unscheduled (with stalls):
Loop: Load F0, 0(R1)
      Stall
      Add F4, F0, F2
      Stall
      Stall
      Store F4, 0(R1)
      Add.I R1, R1, #-8
      Branch.NoEq R1, R2, Loop

● Scheduled by the compiler:
Loop: Load F0, 0(R1)
      Add.I R1, R1, #-8
      Add F4, F0, F2
      Stall
      Stall
      Store F4, 0(R1)
      Branch.NoEq R1, R2, Loop

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0

Loop Unrolling + Pipeline Scheduling

● Unrolled loop:
Loop: F0 = R1[0]
      F4 = F0 + F2
      R1[0] = F4
      F6 = R1[-1]
      F8 = F6 + F2
      R1[-1] = F8
      F10 = R1[-2]
      F12 = F10 + F2
      R1[-2] = F12
      F14 = R1[-3]
      F16 = F14 + F2
      R1[-3] = F16
      R1 -= 4
      if (R1 != R2) goto Loop

● Unrolled and scheduled loop:
Loop: F0 = R1[0]
      F6 = R1[-1]
      F10 = R1[-2]
      F14 = R1[-3]
      F4 = F0 + F2
      F8 = F6 + F2
      F12 = F10 + F2
      F16 = F14 + F2
      R1[0] = F4
      R1[-1] = F8
      R1 -= 4
      R1[2] = F12
      R1[1] = F16
      if (R1 != R2) goto Loop

– Note: after R1 -= 4, the last two stores use the adjusted offsets R1[2] and R1[1]

Loop Unrolling Overview

● Loop unrolling simply copies the body of the loop multiple times; each copy operates on a new loop index
● Benefits
– Fewer branch instructions
● Less pressure on the branch predictor
– Increased basic block size
● Potential for more parallelism
– Fewer instructions executed
● For example: fewer increments of the loop counter
● Downsides
– Greater register pressure
– Increased use of the instruction cache
● Could spill out of the instruction cache and cause cache thrashing

Unrolling with Generic Loops

● Given: for (k=0; k < N; ++k)
– Let's unroll 4 times
– But what if N is not divisible by 4?
● Solution (see the C sketch below):
– First, loop N%4 times: for (k=0; k < N%4; ++k)
– Then loop over groups of 4:
for (k = N%4; k < N; k += 4)
// unrolled 4 times for k, k+1, k+2, k+3
● Refer to Chapter 4 and the technique called
– Strip mining
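A C sketch of the remainder-first scheme above (the accumulating loop body is chosen only for illustration):

    /* Handle N % 4 iterations first, so the unrolled main loop below */
    /* always executes a whole number of groups of four.              */
    double sum(const double *x, int n) {
        double s = 0.0;
        int k = 0;
        for (; k < n % 4; ++k)                    /* remainder loop   */
            s += x[k];
        for (; k < n; k += 4)                     /* unrolled 4 times */
            s += x[k] + x[k+1] + x[k+2] + x[k+3];
        return s;
    }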
Branch Prediction

● Instead of waiting for the branch to finish executing
– Try to predict its behavior and act upon the prediction
● Requirements
– Prediction must be cheaper than executing the branch instruction
● Usually based on a few bits of information
– There has to be a way of dealing with wrong predictions
● Beware of exceptions, etc.
● Simple predictor (see the C sketch after this slide pair)
– Keep a bit (or two) for a (fixed) number of branches
– Every time the branch is taken, increase the count
● If N consecutive executions resulted in “branch taken”, then for the next one act as if the branch will be taken
– After N consecutive “branch not taken” outcomes, start predicting “not taken”

Dynamic Scheduling Basics

● Simple techniques can only eliminate some data dependence stalls
– Pipeline scheduling by the compiler
– Forwarding and bypassing
● Dynamic scheduling adds another level of parallelism while maintaining the data flow
– Some dependences are not known until runtime
– The same binaries can run efficiently without recompilation
– The compiler might not know the details of the microarchitecture
– There can be unpredictable delays: multi-level caches
● Disadvantages
– Substantial increase in hardware complexity
– Exception handling (imprecise exceptions)
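A minimal sketch of the simple predictor described above, using 2-bit saturating counters (the table size and PC indexing are assumptions):

    #include <stdint.h>

    #define BPB_SIZE 1024                  /* branch-prediction buffer size */
    static uint8_t counters[BPB_SIZE];     /* 0,1 = not taken; 2,3 = taken  */

    int predict(uint32_t pc) {             /* 1 = predict taken             */
        return counters[pc % BPB_SIZE] >= 2;
    }

    void update(uint32_t pc, int taken) {  /* called once the branch resolves */
        uint8_t *c = &counters[pc % BPB_SIZE];
        if (taken  && *c < 3) ++*c;        /* saturate at 3                 */
        if (!taken && *c > 0) --*c;        /* saturate at 0                 */
    }

With two bits, two consecutive mispredictions are needed to flip the prediction, matching the “keep a bit (or two)” scheme above.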

Correlating Branch Predictors

● Observation (based on existing codes)
– Branches are correlated with each other
● Application: correlating branch predictors
● Instead of keeping track of each branch individually, look also at the recent M branches
● (M, N) predictor (see the C sketch after this slide pair)
– Uses the behavior of the last M branches
● A total of 2^M branch histories
– Each predictor has N bits
● Advantages
– Better prediction yield (always test on your own code!)
– Little hardware required to implement it

Dynamic Scheduling Details

● Dynamic scheduling breaks “in order” execution
– Out-of-order execution
● Incoming instructions are rearranged in an order unknown until runtime
– Out-of-order completion
● The order in which instructions retire depends on the code, the execution, and delays
● New hazards to deal with
– WAR
● Possibility of overwriting a value that has not been read yet
– Load F0, 0(R1) // a load from memory may be stalled for many cycles
– Load R1, #1 // a load of a constant takes only a few cycles
– WAW
● Writing twice to the same location
– RAW hazards are still a problem
● They always remain, since they are “true data dependences”
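A sketch of an (M, N) = (2, 2) correlating predictor: a 2-bit global history selects one of 2^M 2-bit counters per branch (table size and indexing are assumptions):

    #include <stdint.h>

    #define M        2
    #define ENTRIES  1024
    static uint8_t table[ENTRIES][1 << M];  /* N = 2-bit counters            */
    static uint8_t history;                 /* outcomes of the last M branches */

    int predict_corr(uint32_t pc) {
        return table[pc % ENTRIES][history] >= 2;
    }

    void update_corr(uint32_t pc, int taken) {
        uint8_t *c = &table[pc % ENTRIES][history];
        if (taken  && *c < 3) ++*c;
        if (!taken && *c > 0) --*c;
        /* shift the newest outcome into the M-bit global history */
        history = (uint8_t)(((history << 1) | (taken & 1)) & ((1 << M) - 1));
    }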

Tournament Branch Predictors

● Problem
– Branches might be badly mispredicted when moving between program scopes
– The branch-prediction information from the inner scope is inadequate for the outer scope
● Observation
– There is locality in branching
● Inner and outer loops, etc.
● Solution
– Combine local and global information
● Typical predictors
– Size: 8K-32K bits
– Local predictors unchanged
● Examples: DEC Alpha, AMD Phenom and Opteron

Dynamic Scheduling and Hazards

● F0 = F2 / F4
● F6 = F0 + F8      ← true data dependence on F0 (RAW); antidependence on F8 with the subtract below (WAR)
● R1[0] = F6        ← true data dependence on F6 (RAW); antidependence on F6 with the multiply below (WAR)
● F8 = F10 – F14
● F6 = F10 * F8     ← output dependence on F6 (WAW); true data dependence on F8 (RAW)
Register Renaming Example

● Before renaming:
F0 = F2 / F4
F6 = F0 + F8
R1[0] = F6
F8 = F10 – F14
F6 = F10 * F8

● After renaming (with temporaries S and T):
F0 = F2 / F4
S = F0 + F8
R1[0] = S
T = F10 – F14
F6 = F10 * T

● Only RAW hazards remain

Pipelined Execution

Clock cycle:   1      2      3      4      5      6      7      8
Instr. I       fetch  decode exe    mem    write
Instr. I+1            fetch  decode exe    mem    write
Instr. I+2                   fetch  decode exe    mem    write
Instr. I+3                          fetch  decode exe    mem
Instr. I+4                                 fetch  decode exe
Instr. I+5                                        fetch  decode
Instr. I+6                                               fetch

With a one-cycle stall:
Instr. I       fetch  decode exe    mem    write
Instr. I+1            fetch  decode exe    mem    write
Instr. I+2                   fetch  decode exe    mem    write
Instr. I+3                          stall  fetch  decode exe
Instr. I+4                                        fetch  decode
Instr. I+5                                               fetch

Deeper pipeline (s = stall):
I1   F1 F2 R  X1 X2 X3 D1 D2 T  W
I2   F1 F2 R  X1 X2 X3 X4 D1 D2 T  W
I3   F1 F2 R  X1 s  D1 s  s  D2 T  W

Register Renaming Details

● Register renaming is provided by reservation stations (RS)
● Each RS entry contains (see the C sketch after this slide pair):
– The instruction
– Buffered operand values (when available)
– References to the instructions in RS that will provide the values
● Operation
– The RS fetches and buffers an operand as soon as it is available
● Might bypass the register
– Pending instructions indicate the RS entries where they send their output
– Results are broadcast on a result bus (the Common Data Bus)
– Only the last output updates the register file
– Upon instruction issue, registers are renamed with references to RS entries
● There may be more RS entries than registers!

Tomasulo's Algorithm

● Tomasulo's approach allows
– Out-of-order execution (as does scoreboarding, e.g., in the ARM A8)
– Unlike scoreboarding, handling of anti- and output dependences by renaming (e.g., in the four-issue Intel i7)
– An extension to handle speculation
● In Tomasulo's algorithm, each instruction goes through three steps
– Issue
● A FIFO queue maintains correct data flow
● Transfer the instruction to an RS entry if one is available; otherwise stall (structural hazard)
● Rename registers to eliminate WAR and WAW hazards (stall if no data)
– Execute
● Monitor the Common Data Bus for new data and distribute it to waiting RS entries (resolving RAW)
● Execute instructions in the functional units when their operands are available
– Write result (to other RS entries, registers, and store buffers)
● A store buffer waits for the address, the value, and the memory unit(s)
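The RS entry fields listed above map naturally onto a struct; a sketch following the textbook field names (the widths are assumptions):

    #include <stdint.h>

    typedef struct {
        int      busy;     /* is this entry in use?                           */
        uint8_t  op;       /* operation to perform on the operands            */
        uint64_t vj, vk;   /* operand values, once available                  */
        int      qj, qk;   /* RS ids producing Vj/Vk (0 = value already in V) */
        uint64_t a;        /* address field, used by loads and stores         */
    } ReservationStation;

These are exactly the Busy/Op/Vj/Vk/Qj/Qk/A columns of the execution snapshots on the next slides.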

Tomasulo Approach

● Introduced by Robert Tomasulo
– Implemented in the IBM 360/91 in its floating-point unit
– The IBM 360/91 had long memory and floating-point delays
● Only 4 floating-point registers
– Binary compatibility was important for IBM customers
● Modern processors use a variation of Tomasulo's approach
– Also in use is a simpler algorithm called scoreboarding

Example Dynamic Execution: All Issued

Instruction status:
                  Issue   Execute   Write result
Load f6, (r2)     x       x         x
Load f2, (r3)     x       x
Mult f0,f2,f4     x
Sub f8,f2,f6      x
Div f9,f0,f6      x
Add f6,f8,f2      x

Reservation stations:
        Busy   Op     Vj        Vk        Qj      Qk      A
Load1   no
Load2   yes    load                                       Reg[r3]
Add1    yes    sub              Mem[r2]   Load2
Add2    yes    add                        Add1    Load2
Add3    no
Mult1   yes    mul              Reg[f4]   Load2
Mult2   yes    div              Mem[r2]   Mult1

Register status:
f0: Mult1   f2: Load2   f6: Add2   f8: Add1   f9: Mult2
Example Dynamic Execution: Mult Ready

Instruction status (x = done earlier, + = just completed):
                  Issue   Execute   Write result
Load f6, (r2)     x       x         x
Load f2, (r3)     x       x         +
Mult f0,f2,f4     x       +
Sub f8,f2,f6      x       +         +
Div f9,f0,f6      x
Add f6,f8,f2      x       +         +

Reservation stations:
        Busy   Op     Vj        Vk        Qj      Qk      A
Load1   no
Load2   no
Add1    no
Add2    no
Add3    no
Mult1   yes    mul    Mem[r3]   Reg[f4]
Mult2   yes    div              Mem[r2]   Mult1

Register status:
f0: Mult1   f2:   f6:   f8:   f9: Mult2

Reorder Buffer

● Another set of registers (invisible to the programmer) for intermediate results
● ROB registers hold data after instruction completion but before instruction commit
● Each ROB entry (register) contains additional fields (see the C sketch below):
– Instruction type
● Branch (no destination), store (memory destination), register op (ALU or register destination)
– Destination
● Register number (for loads or ALU ops) or memory address (for stores)
– Value
– Ready
● Indicates whether the supplying instruction has completed its execution
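The ROB entry fields above, as a struct (a sketch; the enum and field widths are assumptions):

    #include <stdint.h>

    typedef enum { INSTR_BRANCH, INSTR_STORE, INSTR_REG_OP } InstrType;

    typedef struct {
        InstrType type;    /* branch, store, or register operation           */
        uint64_t  dest;    /* register number, or memory address for a store */
        uint64_t  value;   /* the result, held here until commit             */
        int       ready;   /* has the supplying instruction finished?        */
    } ROBEntry;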

Hardware Speculation Basics

● After dealing with data dependences, control dependences become the issue
– Branch prediction is not as effective with multiple instructions in flight
● If predicted “taken”, conditional instructions are fetched and issued
– Speculation allows execution to proceed almost as if the branch were not there
● Conditional instructions are fetched, issued, and executed
● Hardware speculation comprises
– Dynamic branch prediction
– (Speculative) execution: instructions are executed and possibly undone
– Dynamic scheduling
● More basic blocks become available after branches are speculated out of the instruction stream

Reorder Buffer in Action

● Issue
– Has to wait until a ROB entry is available (in addition to an RS entry)
● Execute
– Results from the Common Data Bus have to end up in the ROB
● Write result
– Results have to be copied into the ROB
● Commit (also called completion or graduation)
– Normal commit: store the result in the destination, mark the ROB entry as empty
– Store commit: the destination is a memory location
– Branch commit:
● If the prediction was correct: no action needed
● If the prediction was incorrect: the ROB result is thrown away and instructions are restarted at the correct branch target

Hardware Speculation Components

● An additional step in instruction execution
– Issue, Execute, Write result, Commit
● Reorder buffer (ROB)
● Handling of...
– Mispredictions
– Mis-speculations
– Exceptions

Reorder Buffer Exception Handling

● Exceptions are not recognized until the instruction is ready to commit
● The ROB records exceptions
– On a misprediction: flush the exception
– Upon reaching the head of the ROB: raise the exception
Speculation at Compile Time

● Original code:
if (x==0) {
    a += 1
    b += 1
    c += 1
}

● Speculated with compensation code (a += 1 is hoisted above the branch and undone on the other path):
a += 1
if (x==0) {
    b += 1
    c += 1
} else {
    a -= 1
}

● Speculated with renaming (updates are made speculatively into copies, committed only if the branch agrees):
a_copy = a + 1
b_copy = b + 1
c_copy = c + 1
if (x==0) {
    a = a_copy
    b = b_copy
    c = c_copy
}

VLIW Disadvantages

● Static parallelism
– Must be discovered and exploited early
● Preferably by the compiler
● Potential for intermediate representations, bytecodes
● Large code size
– Parallelism relies on large basic blocks
– Clever encoding or on-the-fly decompression may be needed
● Lack of hazard detection in lockstep execution
● Binary compatibility
– Take code from a 2-issue VLIW to a (next-generation) 3-issue VLIW
– Add a single ALU unit to the new processor, and old code will not take advantage of it
– New (wider, with more functional units) processors could change the instruction encoding
● The ISA must provide for future hardware expansion

Multiple Issue Execution

● All the techniques presented so far lead to, at best, an ideal CPI of 1
● For CPI to go below 1, multiple instructions need to be retired in most cycles
– Too many stalls can quickly push CPI back above 1
– See Amdahl's law
● The most common flavors of multiple-issue processors
– Statically scheduled processors
● In-order execution
● Examples: MIPS, ARM
– VLIW (very long instruction word) processors
● Each cycle issues a fixed number of multiple instructions
● Examples: DSPs, Itanium, some GPUs
– Dynamically scheduled superscalar processors
● Out-of-order execution
● Examples: Intel Core i3-i7, AMD Phenom, IBM POWER7

Tomasulo Recap

[Figure: Tomasulo datapath. The instruction stream (0x0: Load F2, 0(R1); 0x1: Mul F0, F2, F4; 0x2: ...) issues into reservation stations (e.g., "Load v1, v2" and "Mul v1, v2, v3", plus empty entries) in front of the multiplier/divider and adder/subtractor units; a memory buffer connects to the memory hierarchy; results are broadcast on the common data bus to the register file (F0-F4) and to the waiting reservation stations.]

VLIW Processor Basics

● How many instructions per cycle?
– Two-issue is commonplace
– Four-issue is manageable
● Scheduling techniques
– Local
● Basic blocks
– Global
● Across branches
– Trace
● VLIW-specific
● Extensive loop unrolling to generate large basic blocks
● Disadvantages
– Static parallelism, large code size, lack of hazard detection for lockstep execution, binary compatibility

Multiple Issue Taxonomy

● Superscalar (static): dynamic issue, hardware hazard detection, static scheduling; in-order execution; mostly in the embedded space: MIPS and ARM (Cortex-A8)
● Superscalar (dynamic): dynamic issue, hardware hazard detection, dynamic scheduling; some out-of-order execution but no speculation; no current examples
● Superscalar (speculative): dynamic issue, hardware hazard detection, dynamic scheduling with speculation; out-of-order execution with speculation; Intel Core i3-7, AMD Phenom, IBM POWER7
● VLIW/LIW: static issue, primarily software hazard detection, static scheduling; all hazards determined and indicated by the compiler (often implicitly); most examples in signal processing, such as the TI C6x, plus some GPUs
● EPIC (Explicitly Parallel Instruction Computing): primarily static issue, primarily software hazard detection, mostly static scheduling; all hazards determined and indicated explicitly by the compiler; Itanium
VLIW Processors: Basic Design

● Package multiple operations into one instruction
– Instruction bundles
● Example VLIW processor
– One integer instruction (or branch)
– Two independent floating-point operations
– Two independent memory references
– Notice: there are restrictions on the instructions
● There must be enough parallelism in the code to fill the available slots
– Compiler: aggressive loop unrolling
– Programmer: program restructuring

Return Address Predictor

● Branch prediction deals with conditional branches
● Most unconditional branches come from function returns
● But the same function can be called from multiple sites
– This may cause the branch-prediction buffer to forget the return address from previous calls
● Solution: create a return-address buffer organized as a stack (see the C sketch below)
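A sketch of the return-address stack described above (the depth and wrap-around policy are assumptions):

    #include <stdint.h>

    #define RAS_DEPTH 16
    static uint64_t ras[RAS_DEPTH];
    static int top;                          /* oldest entries get overwritten */

    void on_call(uint64_t return_addr) {     /* at every call instruction      */
        ras[top] = return_addr;
        top = (top + 1) % RAS_DEPTH;
    }

    uint64_t predict_return(void) {          /* at every return instruction    */
        top = (top + RAS_DEPTH - 1) % RAS_DEPTH;
        return ras[top];
    }

Because the buffer is a stack, returns from a function called at different sites pop the correct, most recent call site.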

Modern Microarchitectures

● Combine:
– Dynamic scheduling
– Multiple issue
– Speculation
● Two approaches to dealing with dependences
– Assign reservation stations and update the pipeline control table in half clock cycles
● Only supports 2 instructions per clock
– Design logic to handle any possible dependences between the instructions
● Notice the design complexity
– Hybrid approaches
● New bottleneck:
– Issue logic

Modern Multiple Issue

● Limit the complexity of a single instruction “bundle”
– Limit the bundle size
– Limit the classes of instructions in a bundle
● E.g., one integer, two floating-point
● With limited size, all dependences within a bundle can be examined
● Dependences from a small bundle can also be fully encoded in the reservation stations
● Another bottleneck:
– The completion/commit unit
– Multiple such units are needed to keep up with the incoming instructions
