onur-447-spring15-lecture9-branch-prediction-afterlecture
Computer Architecture
Lecture 9: Branch Prediction I
Pipelining
Out-of-Order Execution
4
Reminder: Relevant Seminar Tomorrow
Practical Data Value Speculation for Future High-End
Processors
Arthur Perais, INRIA (France)
Thursday, Feb 5, 4:30-5:30pm, CIC Panther Hollow Room
Summary:
Value prediction (VP) was proposed to enhance the
performance of superscalar processors by breaking RAW
dependencies. However, it has generally been considered too
complex to implement. During this presentation, we will
review different sources of additional complexity and propose
solutions to address them.
http://www.ece.cmu.edu/~calcm/doku.php?id=seminars:seminars
5
Recap of Last Lecture
Data Dependence Handling
Data Forwarding/Bypassing
In-depth Implementation
Register dependence analysis
Stalling
Performance analysis with and without forwarding
LC-3b Pipelining
Questions to Ponder
HW vs. SW handling of data dependences
Static versus dynamic scheduling
What makes compiler based instruction scheduling difficult?
Profiling (representative input sets needed; dynamic adaptation difficult)
Introduction to static instruction scheduling (e.g., fix-up code)
Control Dependence Handling
Six ways of handling control dependences
Stalling until next fetch address is available: Bad idea
Predicting the next-sequential instruction as next fetch address
6
Tentative Plan for Friday and Monday
I will be out of town
Attending the HPCA Conference
Tentative Plan:
Friday: Recitation session. Come with questions on Lab 2,
HW 2, lectures, concepts, etc.
Monday: Finish branch prediction (Rachata)
7
Sample Papers from HPCA
Donghyuk Lee+, “Adaptive Latency DRAM: Optimizing
DRAM Timing for the Common Case,” HPCA 2015.
http://users.ece.cmu.edu/~omutlu/pub/adaptive-latency-dram_hpca15.pdf
9
Review: Control Dependence
Question: What should the fetch PC be in the next cycle?
10
How to Handle Control Dependences
Critical to keep the pipeline full with correct sequence of
dynamic instructions.
How?
1. Get rid of unnecessary control flow instructions → combine predicates (predicate combining)
2. Convert control dependences into data dependences → predicated execution
13
Review: Predicate Combining (not Predicated Execution)
CMPEQ condition, a, 5;
CMOV condition, b, 4;
CMOV !condition, b, 3;
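The CMOV sequence above computes "if (a == 5) b = 4; else b = 3;" without any branch: both conditional moves are always fetched, and the predicate decides which write takes effect. A minimal Python sketch of the same select-based idea (function names are illustrative, not part of any ISA):

```python
def cmov(cond, dest, src):
    # Conditional move: return src if the predicate holds,
    # otherwise leave dest unchanged.
    return src if cond else dest

def predicated(a, b):
    # CMPEQ condition, a, 5
    condition = (a == 5)
    # CMOV condition, b, 4
    b = cmov(condition, b, 4)
    # CMOV !condition, b, 3
    b = cmov(not condition, b, 3)
    return b
```

Note that both CMOVs execute on every path; only the data dependence on the predicate remains.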
15
Conditional Execution in ARM
Same as predicated execution
16
Predicated Execution
Eliminates branches → enables straight-line code (i.e.,
larger basic blocks in code)
Advantages
Always-not-taken prediction works better (no branches)
Compiler has more freedom to optimize code (no branches)
control flow does not hinder inst. reordering optimizations
code optimizations hindered only by data dependencies
Disadvantages
Useless work: some instructions fetched/executed but
discarded (especially bad for easy-to-predict branches)
Requires additional ISA support
18
How to Handle Control Dependences
Critical to keep the pipeline full with correct sequence of
dynamic instructions.
[Figure: two execution timelines for the instruction sequence A, B, C, BC X, D, E, F, X: G. With the branch BC taken, the baseline pipeline needs 6 cycles; filling the slot after the branch with a useful instruction reduces this to 5 cycles.]
21
Fancy Delayed Branching (III)
Delayed branch with squashing
In SPARC
Semantics: If the branch falls through (i.e., it is not taken),
the delay slot instruction is not executed
Why could this help?
Normal code:      Delayed branch code:   Delayed branch w/ squashing:
X: A              X: A                      A
   B                 B                   X: B
   C                 C                      C
   BC X              BC X                   BC X
   D                 NOP                    A
   E                 D                      D
                     E                      E
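The squashing semantics can be sketched as a toy model (function and argument names are illustrative): the delay-slot instruction, here a copy of the branch target's first instruction, executes only when the branch is taken and is squashed on fall-through.

```python
def run_delay_slot(branch_taken, delay_slot_op, state):
    # Delayed branch with squashing (as in SPARC annulled branches):
    # the delay-slot instruction executes only if the branch is taken;
    # if the branch falls through, the slot is squashed (acts as a NOP).
    if branch_taken:
        delay_slot_op(state)
    return state
```

This lets the compiler fill the slot from the taken path without corrupting the fall-through path.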
22
Delayed Branching (IV)
Advantages:
+ Keeps the pipeline full with useful instructions in a simple way assuming
1. Number of delay slots == number of instructions to keep the pipeline
full before the branch resolves
2. All delay slots can be filled with useful instructions
Disadvantages:
-- Not easy to fill the delay slots (even with a 2-stage pipeline)
1. Number of delay slots increases with pipeline depth, superscalar
execution width
2. Number of delay slots should be variable with variable latency
operations. Why?
-- Ties ISA semantics to hardware implementation
-- SPARC, MIPS, HP-PA: 1 delay slot
-- What if pipeline implementation changes with the next design?
23
An Aside: Filling the Delay Slot
[Figure, based on P&H CO&D: three ways to fill the delay slot.
a. From before: a data-independent instruction ahead of the branch (e.g., add $s1, $s2, $s3 before if $s2 = 0 then) moves into the slot; reordering data-independent instructions does not change program semantics.
b. From target: the instruction at the branch target (e.g., sub $t4, $t5, $t6) is copied into the slot.
c. From fall through: the first fall-through instruction (e.g., add $s1, $s2, $s3 after if $s1 = 0 then) fills the slot.]
27
Fine-grained Multithreading: History
CDC 6600’s peripheral processing unit is fine-grained
multithreaded
Thornton, “Parallel Operation in the Control Data 6600,” AFIPS 1964.
Processor executes a different I/O thread every cycle
An operation from the same thread is executed every 10 cycles
28
Fine-grained Multithreading in HEP
Cycle time: 100 ns
8 stages → 800 ns to complete an instruction (assuming no memory access)
29
Multithreaded Pipeline Example
Kongetira et al., “Niagara: A 32-Way Multithreaded Sparc Processor,” IEEE Micro 2005.
31
Fine-grained Multithreading
Advantages
+ No need for dependency checking between instructions
(only one instruction in pipeline from a single thread)
+ No need for branch prediction logic
+ Otherwise-bubble cycles used for executing useful instructions from
different threads
+ Improved system throughput, latency tolerance, utilization
Disadvantages
- Extra hardware complexity: multiple hardware contexts (PCs, register
files, …), thread selection logic
- Reduced single thread performance (one instruction fetched every N
cycles from the same thread)
- Resource contention between threads in caches and memory
- Some dependency checking logic between threads remains (load/store)
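The thread selection described above can be sketched as a round-robin scheduler (a simplified model; with 10 hardware threads, as in the CDC 6600 PPU, each thread issues once every 10 cycles):

```python
from collections import deque

def fgmt_schedule(num_threads, cycles):
    # Fine-grained multithreading: fetch from a different hardware
    # thread each cycle, rotating round-robin through the contexts.
    order = deque(range(num_threads))
    schedule = []
    for _ in range(cycles):
        t = order[0]
        order.rotate(-1)   # move the selected thread to the back
        schedule.append(t)
    return schedule
```

Because consecutive pipeline stages hold instructions from different threads, no intra-thread dependency checking or branch prediction is needed, at the cost of single-thread latency.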
32
How to Handle Control Dependences
Critical to keep the pipeline full with correct sequence of
dynamic instructions.
34
Branch Prediction: Guess the Next Instruction to Fetch
PC → I-$ → DEC → RF → [ALU / D-$] → WB

[Figure: the pipeline above fetching the program below. Stalling until each BRZERO resolves takes 12 cycles; with branch prediction the same sequence takes 8 cycles.]

0x0001 LD R1, MEM[R0]
0x0002 ADD R2, R2, #1
0x0003 BRZERO 0x0001
0x0004 ADD R3, R2, #1
0x0005 MUL R1, R2, R3
0x0006 LD R2, MEM[R2]
0x0007 LD R0, MEM[R2]
Misprediction Penalty
[Figure: the same program flowing through the pipeline (PC → I-$ → DEC → RF → D-$ → WB), with 0x0003 BRZERO 0x0001 in flight and 0x0004-0x0007 fetched behind it. If the branch is mispredicted, these wrong-path instructions must be flushed: this is the misprediction penalty.]
Branch Prediction
Processors are pipelined to increase concurrency
How do we keep the pipeline full in the presence of branches?
Guess the next instruction when a branch is fetched
[Figure: a multi-stage pipeline (Fetch, Decode, Rename, Schedule, Register Read, Execute) filling with instructions fetched past predicted branches B1 and B3. When B1 reaches Execute, the prediction is verified; on a misprediction, the pipeline is flushed and fetch is redirected to the correct target.]
37
Branch Prediction: Always PC+4
        t0          t1          t2          t3          t4   t5
Insth   IF(PC)      ID          ALU         MEM
Insti               IF(PC+4)    ID          ALU
Instj                           IF(PC+8)    ID
Instk                                       IF(target)
Insth is a branch: its condition and target are evaluated in the ALU.
When the branch resolves:
- the branch target (Instk) is fetched
- all instructions fetched since Insth (so-called "wrong-path" instructions) must be flushed
38
Pipeline Flush on a Misprediction
        t0          t1          t2          t3          t4   t5
Insth   IF(PC)      ID          ALU         MEM         WB
Insti               IF(PC+4)    ID          killed
Instj                           IF(PC+8)    killed
Instk                                       IF(target)  ID   ALU  WB
Instl                                                   IF   ID   ALU
                                                             IF   ID
                                                                  IF
Insth is a branch.
39
Performance Analysis
Correct guess → no penalty (~86% of the time)
Incorrect guess → 2 bubbles
Assume
no data dependency related stalls
20% control flow instructions
70% of control flow instructions are taken
CPI = [ 1 + (0.20*0.7) * 2 ] =
= [ 1 + 0.14 * 2 ] = 1.28
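The arithmetic above can be written as a small helper (illustrative): base CPI of 1, plus penalty bubbles for each mispredicted branch.

```python
def branch_cpi(frac_branch, frac_taken, penalty):
    # With always-not-taken prediction, every taken branch is a
    # misprediction costing `penalty` bubble cycles.
    return 1 + frac_branch * frac_taken * penalty
```

With 20% branches, 70% taken, and a 2-bubble penalty this gives 1.28; resolving the branch one stage earlier (1 bubble) gives 1.14, matching the figures in this section.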
[Figure: pipelined MIPS datapath with hazard detection and forwarding units; the branch comparison (=) is moved into the decode stage so the branch resolves earlier. Is this a good idea?
Based on original figure from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
With the branch resolved one stage earlier (1 bubble): CPI = [ 1 + (0.2*0.7) * 1 ] = 1.14
41
Branch Prediction (Enhanced)
Idea: Predict the next fetch address (to be used in the next
cycle)
[Figure: the Program Counter (address of the current branch) indexes a table of target addresses; on a hit, the taken? prediction selects, via a mux, between the stored target address and PC + inst size to form the Next Fetch Address. A refinement XORs the global branch history with the PC to index the table.]
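The XOR of global branch history and PC mentioned above can be sketched as a gshare-style index computation (table size and bit positions are hypothetical):

```python
def gshare_index(pc, global_history, table_bits=12):
    # Gshare-style indexing: XOR the (word-aligned) branch PC with the
    # global branch history register, keeping table_bits low-order bits.
    mask = (1 << table_bits) - 1
    return ((pc >> 2) ^ global_history) & mask
```

Mixing in the history lets two dynamic instances of the same branch map to different table entries when they occur in different control-flow contexts.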
44
Three Things to Be Predicted
Requires three things to be predicted at the fetch stage:
1. Whether the fetched instruction is a branch
2. (Conditional) branch direction
3. Branch target address (if taken)
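These three predictions can be sketched with a BTB-like lookup (a simplified model; the table contents and 4-byte instruction size are hypothetical): a hit in the table means the PC was a branch last time, the entry supplies the target, and a direction predictor chooses between the target and the fall-through address.

```python
def predict_next_fetch(pc, btb, predict_taken, inst_size=4):
    # 1. Is this a branch?  -> a BTB hit means "seen as a branch before".
    # 2. Direction          -> predict_taken(pc), from a direction predictor.
    # 3. Target             -> stored in the BTB entry.
    if pc in btb and predict_taken(pc):
        return btb[pc]           # predicted-taken branch: fetch the target
    return pc + inst_size        # not a branch, or predicted not-taken

btb = {0x000C: 0x0004}           # hypothetical entry: branch at 0xC -> 0x4
```

All three lookups happen in the fetch stage, before the instruction is even decoded.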
46
More Sophisticated Direction Prediction
Compile time (static)
Always not taken
Always taken
BTFN (Backward taken, forward not taken)
Profile based (likely direction)
Program analysis based (likely direction)
47
Static Branch Prediction (I)
Always not-taken
Simple to implement: no need for BTB, no direction prediction
Low accuracy: ~30-40% (for conditional branches)
Remember: the compiler can lay out code such that the likely path
is the "not-taken" path → more effective prediction
Always taken
No direction prediction
Better accuracy: ~60-70% (for conditional branches)
Backward branches (i.e. loop branches) are usually taken
Backward branch: target address lower than branch PC
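BTFN can be sketched directly from the definition above:

```python
def btfn_predict(branch_pc, target_pc):
    # Backward taken, forward not taken: a backward branch (target
    # address lower than the branch PC) is usually a loop branch,
    # so predict it taken; predict forward branches not-taken.
    return target_pc < branch_pc   # True = predict taken
```

This static rule captures loop behavior without any run-time state.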
51
Pragmas
Idea: Keywords that enable a programmer to convey hints
to lower levels of the transformation hierarchy
if (likely(x)) { ... }
if (unlikely(error)) { … }
52
Static Branch Prediction
All previous techniques can be combined
Profile based
Program based
Programmer based
Dynamic Branch Prediction
Advantages
+ Prediction based on history of the execution of branches
+ It can adapt to dynamic changes in branch behavior
+ No need for static profiling: input set representativeness
problem goes away
Disadvantages
-- More complex (requires additional hardware)
54
Last Time Predictor
Last time predictor
Single bit per branch (stored in BTB)
Indicates which direction branch went last time it executed
Example: for the repeating pattern TTTTTTTTTTNNNNNNNNNN → 90% accuracy (2 mispredictions per 20 branches)
[Figure: a BHT with one bit per entry, alongside a tagged BTB holding one target address per entry. The taken? bit drives a mux that selects between the BTB target address and PC+4 to form the next PC.]
The 1-bit BHT (Branch History Table) entry is updated with
the correct outcome after each execution of a branch
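The 90% figure for the loop-like pattern above can be reproduced with a small simulation (the initial prediction is an assumption):

```python
def last_time_accuracy(outcomes, init='T'):
    # 1-bit last-time predictor: predict whatever the branch did on
    # its previous execution; update the bit after every outcome.
    pred, correct = init, 0
    for actual in outcomes:
        if pred == actual:
            correct += 1
        pred = actual              # remember only the last outcome
    return correct / len(outcomes)

# Loop-like pattern from the slide: 10 taken, 10 not-taken, repeated.
pattern = ('T' * 10 + 'N' * 10) * 10
```

Each 20-branch period costs two mispredictions (one at each direction change), giving roughly 90% accuracy in steady state.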
56
State Machine for Last-Time Prediction
[Figure: two states, "predict taken" and "predict not taken". An actually-taken outcome moves to (or stays in) "predict taken"; an actually-not-taken outcome moves to (or stays in) "predict not taken".]
57
Improving the Last Time Predictor
Problem: a last-time predictor changes its prediction from T to N
or from N to T too quickly,
even though the branch may be mostly taken or mostly not
taken
59
State Machine for 2-bit Saturating Counter
Counter using saturating arithmetic
Arithmetic with maximum and minimum values
[Figure: four counter states. 11 and 10 predict taken; 01 and 00 predict not-taken. An actually-taken outcome increments the counter (saturating at 11); an actually-not-taken outcome decrements it (saturating at 00).]
60
Hysteresis Using a 2-bit Counter
[Figure: the same four states labeled "strongly taken" (11, pred taken), "weakly taken" (10, pred taken), "weakly !taken" (01, pred !taken), and "strongly !taken" (00, pred !taken). A single misprediction moves a strong state to the neighboring weak state without changing the prediction; two consecutive mispredictions are needed to flip it.]
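The counter update and its hysteresis can be sketched as (state names are illustrative):

```python
STRONG_NT, WEAK_NT, WEAK_T, STRONG_T = 0, 1, 2, 3   # 00, 01, 10, 11

def update(counter, taken):
    # 2-bit saturating counter: increment on taken, decrement on
    # not-taken, saturating at 00 (strongly !taken) and 11 (strongly taken).
    if taken:
        return min(counter + 1, STRONG_T)
    return max(counter - 1, STRONG_NT)

def predict(counter):
    # States 10 and 11 predict taken; 01 and 00 predict not-taken.
    return counter >= WEAK_T
```

A single anomalous outcome moves a strong state to the weak state of the same direction, so the prediction survives one misprediction; only two in a row flip it.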
62
Rethinking the Branch Problem
Control flow instructions (branches) are frequent
15-25% of all instructions
63
Importance of The Branch Problem
Assume N = 20 (20 pipe stages), W = 5 (5 wide fetch)
Assume: 1 out of 5 instructions is a branch
Assume: Each 5 instruction-block ends with a branch
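A rough sketch of the cost implied by these assumptions: if a misprediction is detected only at the end of the pipeline, about N x W instruction slots of wrong-path work are lost.

```python
def wasted_slots(pipe_depth, fetch_width):
    # Upper bound on wrong-path work: the whole pipeline (depth x width
    # instruction slots) is full of wrong-path instructions when a
    # misprediction is detected at the last stage.
    return pipe_depth * fetch_width

# N = 20 stages, W = 5-wide fetch
```

With one branch every five instructions, deep and wide machines make accurate branch prediction critical.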
65