Lect07 Instruction Flow

The document discusses instruction flow techniques in high-performance processors, including how branches disrupt sequential instruction fetching by changing the control flow, and various techniques for branch prediction and speculation to minimize delays from branches and improve instruction throughput, such as predicting branch direction and target early to avoid stalls. It also covers limits on instruction-level parallelism from control and data dependencies.

Instruction Flow Techniques

ECE/CS 752 Fall 2017

Prof. Mikko H. Lipasti


University of Wisconsin-Madison
High-IPC Processor

[Diagram: high-IPC processor pipeline. The I-cache and branch predictor feed the FETCH stage (instruction flow); an instruction buffer feeds DECODE; integer, floating-point, media, and memory units EXECUTE (register data flow); a reorder buffer (ROB), store queue, and D-cache handle COMMIT (memory data flow).]
Mikko Lipasti-University of Wisconsin 2


Instruction Flow Techniques
• Instruction Flow and its Impediments
• Control Dependences
• Control Flow Speculation
– Branch Speculation
– Mis-speculation Recovery
• Branch Direction Prediction
– Static Prediction
– A brief history of dynamic branch prediction
• Branch Target Prediction
• High-bandwidth Fetch
• High-Frequency Fetch



Goal and Impediments
• Goal of Instruction Flow
– Supply processor with maximum number of useful
instructions every clock cycle
• Impediments
– Branches and jumps
– Finite I-Cache
• Capacity
• Bandwidth restrictions



Branch Types and Implementation
1. Types of Branches
A. Conditional or Unconditional
B. Save PC?
C. How is target computed?
• Single target (immediate, PC+immediate)
• Multiple targets (register)
2. Branch Architectures
A. Condition code or condition registers
B. Register



What’s So Bad About Branches?
• Effects of Branches
– Fragmentation of I-Cache lines
– Need to determine branch direction
– Need to determine branch target
– Use up execution resources
• Pipeline drain/fill



What’s So Bad About Branches?
Problem: Fetch stalls until direction is determined
Solutions:
• Minimize delay
– Move instructions determining branch condition away
from branch (CC architecture)
• Make use of delay
– Non-speculative:
• Fill delay slots with useful safe instructions
• Execute both paths (eager execution)
– Speculative:
• Predict branch direction



What’s So Bad About Branches?
Problem: Fetch stalls until branch target is determined
Solutions:
• Minimize delay
– Generate branch target early
• Make use of delay: Predict branch target
– Single target
– Multiple targets



Control Dependences

    main:
        addi r2, r0, A      ; BB 1
        addi r3, r0, B
        addi r4, r0, C
        addi r5, r0, N
        add  r10, r0, r0
        bge  r10, r5, end
    loop:
        lw   r20, 0(r2)     ; BB 2
        lw   r21, 0(r3)
        bge  r20, r21, T1
        sw   r21, 0(r4)     ; BB 3
        b    T2
    T1:
        sw   r20, 0(r4)     ; BB 4
    T2:
        addi r10, r10, 1    ; BB 5
        addi r2, r2, 4
        addi r3, r3, 4
        addi r4, r4, 4
        blt  r10, r5, loop
    end:

• Control Flow Graph
  – Shows possible paths of control flow through basic blocks (BB 1 through BB 5 above)


Control Dependences

(Same code and control flow graph as the previous slide, basic blocks BB 1 through BB 5.)

• Control Dependence
  – Node B is control-dependent on Node A if A determines whether B executes
  – If one path from A to the exit includes B, and another path does not, then B is control-dependent on A
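The definition can be seen directly in a small C fragment; the function name and values here are illustrative only:

```c
#include <assert.h>

// Which statements are control-dependent on the branch (c != 0)?
int example(int c) {
    int x = 0;      // not control-dependent: executes on every path
    if (c) {
        x += 1;     // control-dependent: one path from the branch to
    }               // the exit includes it, the other does not
    x += 10;        // not control-dependent: both paths reach it
    return x;
}
```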
Program Control Flow
• Implicit Sequential Control Flow
– Static Program Representation
• Control Flow Graph (CFG)
• Nodes = basic blocks
• Edges = Control flow transfers
– Physical Program Layout
• Mapping of CFG to linear program memory
• Implied sequential control flow
– Dynamic Program Execution
• Traversal of the CFG nodes and edges (e.g. loops)
• Traversal dictated by branch conditions
– Dynamic Control Flow
• Deviates from sequential control flow
• Disrupts sequential fetching
• Can stall IF stage and reduce I-fetch bandwidth
Program Control Flow
• Dynamic traversal of static CFG
• Mapping CFG to linear memory
Limits on Instruction-Level Parallelism

1970: Flynn (ILP)

Weiss and Smith [1984]       1.58
Sohi and Vajapeyam [1987]    1.81
Tjaden and Flynn [1970]      1.86  (Flynn's bottleneck)
Tjaden and Flynn [1973]      1.96
Uht [1986]                   2.00
Smith et al. [1989]          2.00
Jouppi and Wall [1988]       2.40
Johnson [1991]               2.50
Acosta et al. [1986]         2.79
Wedig [1982]                 3.00
Butler et al. [1991]         5.8
Melvin and Patt [1991]       6
Wall [1991]                  7     (Jouppi disagreed)
Kuck et al. [1972]           8
Riseman and Foster [1972]    51    (no control dependences)
Nicolau and Fisher [1984]    90    (Fisher's optimism)
Riseman and Foster's Study

1970: Flynn
1972: Riseman/Foster

• 7 benchmark programs on CDC-3600
• Assume infinite machines
  – Infinite memory and instruction stack
  – Infinite register file
  – Infinite functional units
  – True dependencies only, at the dataflow limit
• If bounded to a single basic block, speedup is 1.72 (Flynn's bottleneck)
• If one can bypass n branches (hypothetically), then:

Branches bypassed   0     1     2     8     32    128   ∞
Speedup             1.72  2.72  3.62  7.21  14.8  24.4  51.2
Branch Prediction

1970: Flynn
1972: Riseman/Foster
1979: Smith Predictor

• Riseman & Foster showed the potential
  – But no idea how to reap the benefit
• 1979: Jim Smith patents branch prediction at Control Data
  – Predict current branch based on past history
• Today: virtually all processors use branch prediction


Disruption of Sequential Control Flow

[Diagram: superscalar pipeline with Fetch, instruction/decode buffer, Decode, dispatch buffer, Dispatch, reservation stations, Issue, Execute (including the branch unit), Finish, reorder/completion buffer, Complete, store buffer, Retire. A branch resolving deep in this pipeline disrupts sequential fetching at the top.]
Branch Prediction

• Target address generation → target speculation
  – Access register:
    • PC, general-purpose register, link register
  – Perform calculation:
    • +/- offset, autoincrement, autodecrement
• Condition resolution → condition speculation
  – Access register:
    • Condition code register, general-purpose register
  – Perform calculation:
    • Comparison of data register(s)


Target Address Generation

[Diagram: the pipeline from the previous slide, annotated with where each branch target becomes available. PC-relative, register-indirect, and register-indirect-with-offset target computations resolve at successively later pipeline stages.]
Condition Resolution

[Diagram: the same pipeline, annotated with where branch conditions can be resolved. Condition-code register access, general-purpose register access, and full data-register comparison resolve at successively later stages.]
Branch Instruction Speculation

[Diagram: a branch predictor (using a BTB) sits beside the Fetch stage. An FA-mux selects the next fetch address (FA) from the sequential PC or the speculative target; the predictor supplies a speculative target and condition, and its BTB is updated with the target address and history when the branch resolves in Execute.]
Branch/Jump Target Prediction

Example BTB entry:  branch instr. address 0x0348 | prediction info 0101 (NTNT) | most recent target 0x0612

• Branch Target Buffer: small cache accessed in the fetch stage
  – Holds previously executed branches: address, taken history, target(s)
• Fetch stage compares the current FA against the BTB
  – If match, use prediction
  – If predicted taken, use the BTB target
• When the branch executes, the BTB is updated
• Optimizations:
  – Larger BTB: increases hit rate
  – Better prediction algorithm: increases accuracy
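The lookup/update protocol above can be sketched in C. This is a minimal direct-mapped sketch with assumed parameters (64 entries, full tags, 4-byte instructions); real BTBs use partial tags and set associativity:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BTB_ENTRIES 64   // assumed size

typedef struct {
    bool     valid;
    uint32_t tag;      // branch instruction address
    uint32_t target;   // most recent taken target
} btb_entry;

static btb_entry btb[BTB_ENTRIES];

// Fetch stage: compare the fetch address (FA) against the BTB.
// On a hit, redirect fetch to the stored target; else fetch sequentially.
bool btb_lookup(uint32_t fetch_addr, uint32_t *next_fetch) {
    btb_entry *e = &btb[(fetch_addr >> 2) % BTB_ENTRIES];
    if (e->valid && e->tag == fetch_addr) {
        *next_fetch = e->target;   // predicted-taken target from the BTB
        return true;
    }
    *next_fetch = fetch_addr + 4;  // sequential fetch
    return false;
}

// Execute stage: update the BTB when a branch resolves taken.
void btb_update(uint32_t branch_addr, uint32_t target) {
    btb_entry *e = &btb[(branch_addr >> 2) % BTB_ENTRIES];
    e->valid  = true;
    e->tag    = branch_addr;
    e->target = target;
}
```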


Branch Prediction: Condition Speculation
1. Biased Not Taken
– Hardware prediction
– Does not affect ISA
– Not effective for loops
2. Software Prediction
– Extra bit in each branch instruction
• Set to 0 for not taken
• Set to 1 for taken
– Bit set by compiler or user; can use profiling
– Static prediction, same behavior every time
3. Prediction based on branch offset
– Positive offset: predict not taken
– Negative offset: predict taken
4. Prediction based on dynamic history
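Heuristic 3 above (offset-based prediction) is the classic backward-taken/forward-not-taken rule; a one-line sketch, with the function name my own:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Backward taken / forward not taken: a negative branch displacement
// usually closes a loop, so predict it taken.
bool btfn_predict(int32_t branch_offset) {
    return branch_offset < 0;
}
```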



UCB Study [Lee and Smith, 1984]

• Benchmarks used
  – 26 programs (IBM 370, DEC PDP-11, CDC 6400)
  – 6 workloads (4 IBM, 1 DEC, 1 CDC)
  – Used trace-driven simulation
• Branch types
  – Unconditional: always taken or always not taken
  – Subroutine call: always taken
  – Loop control: usually taken
  – Decision: either way (if-then-else)
  – Computed goto: always taken, with changing target
  – Supervisor call: always taken
  – Execute: always taken (IBM 370)
• Workloads: IBM1 = compiler, IBM2 = cobol (business app), IBM3 = scientific, IBM4 = supervisor (OS)

Fraction    IBM1   IBM2   IBM3   IBM4   DEC    CDC    Avg
Taken       0.640  0.657  0.704  0.540  0.738  0.778  0.676
Not taken   0.360  0.343  0.296  0.460  0.262  0.222  0.324
Branch Prediction Function

• Prediction function F(X1, X2, ...)
  – X1 – opcode type
  – X2 – history
• Prediction accuracy (%) based on opcode only, or on history depth:

              IBM1  IBM2  IBM3  IBM4  DEC  CDC
Opcode only    66    69    71    55    80   78
History 0      64    64    70    54    74   78
History 1      92    95    87    80    97   82
History 2      93    97    91    83    98   91
History 3      94    97    91    84    98   94
History 4      95    97    92    84    98   95
History 5      95    97    92    84    98   96


Example Prediction Algorithm

[Diagram: a table entry holding branch address, prediction information, and target, plus a finite state machine whose states encode the last two branch outcomes (TT, TN, NT, NN) with taken/not-taken transitions between them.]

• Hardware table remembers the last 2 branch outcomes
  – History of past several branches encoded by FSM
  – Current state used to generate prediction
• Results (accuracy %):

Workload   IBM1  IBM2  IBM3  IBM4  DEC  CDC
Accuracy    93    97    91    83    98   91
IBM Study [Nair, 1992]
• Branch processing on the IBM RS/6000
– Separate branch functional unit
– Overlap of branch instructions with other
instructions
• Zero cycle branches
– Two causes for branch stalls
• Unresolved conditions
• Branches downstream too close to unresolved
branches
• Investigated optimal FSM design for
branch prediction



Exhaustive Search for Optimal 2-bit Predictor

• There are 2^20 possible state machines for a 2-bit predictor
  – Some machines are uninteresting; pruning them out reduces the number of state machines to 5248
• For each benchmark, determine prediction accuracy for all the predictor state machines
  – Find the optimal 2-bit predictor for each application


Branch Prediction Implementation (PPC 604)

[Diagram: the PPC 604 front end. An FA-mux chooses between the sequential fetch address (FA) and the speculative target from the branch predictor; the predictor is updated from Decode and again when branches resolve. Dispatch feeds reservation stations for the BRN, SFX, SFX, CFX, FPU, and LS units, with a completion buffer at the back end.]
BTAC and BHT Design (PPC 604)

[Diagram: the fetch address (FA) indexes the Branch History Table (BHT) and the Branch Target Address Cache (BTAC) in parallel with the I-cache; their predictions steer the FA-mux, and both structures are updated as branches decode and resolve.]

BTAC:
- 64 entries
- fully associative
- hit => predict taken

BHT:
- 512 entries
- direct mapped
- 2-bit saturating counter, history-based prediction
- overrides the BTAC prediction
Branch Speculation

[Diagram: a tree of speculated branch paths; each level speculates another branch, and tags 1-3 mark the speculated path through the nested taken/not-taken choices.]

• Start new correct path
  – Must remember the alternate (non-predicted) path
• Eliminate incorrect path
  – Must ensure that the mis-speculated instructions produce no side effects


Mis-speculation Recovery
• Start new correct path
1. Update PC with computed branch target (if predicted
NT)
2. Update PC with sequential instruction address (if
predicted T)
3. Can begin speculation again at next branch
• Eliminate incorrect path
1. Use tag(s) to deallocate ROB entries occupied by
speculative instructions
2. Invalidate all instructions in the decode and dispatch
buffers, as well as those in reservation stations



Tracking Instructions
• Assign branch tags
– Allocated in circular order
– Instruction carries this tag throughout
processor
• Often: track instruction groups
– Instructions managed in groups, max. one
branch per group
– ROB structured as groups
• Leads to some inefficiency
• Simpler tracking of speculative instructions



BTAC and BHT Design (PPC 604)

[Full pipeline diagram of the PowerPC 604.]

• Fairly simple, 5-stage machine from 1994
• Many sources for PC redirect
• Lots of complexity


Instruction Flow Techniques
• Instruction Flow and its Impediments
• Control Dependences
• Control Flow Speculation
– Branch Speculation
– Mis-speculation Recovery
• Branch Direction Prediction
– Static Prediction
– A brief history of dynamic branch prediction
• Branch Target Prediction
• High-bandwidth Fetch
• High-Frequency Fetch



Static Branch Prediction
• Single-direction
– Always not-taken: Intel i486
• Backwards Taken/Forward Not Taken
– Loop-closing branches
– Used as backup in Pentium Pro, II, III, 4
• Heuristic-based:
    void *p = malloc(numBytes);
    if (p == NULL)
        errorHandlingFunction();


Static Branch Prediction

Heuristic name  Description
Loop Branch     If the branch target is back to the head of a loop, predict taken.
Pointer         If a branch compares a pointer with NULL, or if two pointers are compared, predict in the direction that corresponds to the pointer being not NULL, or the two pointers not being equal.
Opcode          If a branch is testing that an integer is less than zero, less than or equal to zero, or equal to a constant, predict in the direction that corresponds to the test evaluating to false.
Guard           If the operand of the branch instruction is a register that gets used before being redefined in the successor block, predict that the branch goes to the successor block.
Loop Exit       If a branch occurs inside a loop, and neither of the targets is the loop head, then predict that the branch does not go to the successor that is the loop exit.
Loop Header     Predict that the successor block of a branch that is a loop header or a loop pre-header is taken.
Call            If a successor block contains a subroutine call, predict that the branch goes to that successor block.
Store           If a successor block contains a store instruction, predict that the branch does not go to that successor block.
Return          If a successor block contains a return from subroutine instruction, predict that the branch does not go to that successor block.

• Heuristic-based: Ball/Larus
  – Thomas Ball and James R. Larus. Branch Prediction for Free. ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 300-313, May 1993.
Static Branch Prediction
• Profile-based
1. Instrument program binary
2. Run with representative (?) input set
3. Recompile program
a. Annotate branches with hint bits, or
b. Restructure code to match predict not-taken
• Best performance: 75-80% accuracy



Dynamic Branch Prediction
• Main advantages:
– Learn branch behavior autonomously
• No compiler analysis, heuristics, or profiling
– Adapt to changing branch behavior
• Program phase changes branch behavior
• First proposed in 1979-1980
  – US Patent #4,370,711, Branch predictor using random access memory, James E. Smith
• Continually refined since then
Smith Predictor Hardware

[Diagram: the branch address selects one of 2^m k-bit saturating counters; the counter's most significant bit is the branch prediction. When the branch outcome is known, the counter is incremented on taken, decremented on not taken, and the updated value is written back.]

• Jim E. Smith. A Study of Branch Prediction Strategies. International Symposium on Computer Architecture, pages 135-148, May 1981
• Widely employed: Intel Pentium, PowerPC 604, PowerPC 620, etc.
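A minimal C sketch of the Smith predictor above, with assumed parameters m = 10 and k = 2 (2-bit counters); only the most significant bit of the selected counter is used as the prediction:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define M 10                          // assumed: 2^10 counters
#define TABLE_SIZE (1u << M)

static uint8_t counters[TABLE_SIZE];  // 2-bit values 0..3, init 0 (strong NT)

// Prediction is the MSB of the selected saturating counter.
bool smith_predict(uint32_t pc) {
    return (counters[(pc >> 2) % TABLE_SIZE] >> 1) != 0;
}

// Saturating increment on taken, decrement on not taken.
void smith_update(uint32_t pc, bool taken) {
    uint8_t *c = &counters[(pc >> 2) % TABLE_SIZE];
    if (taken && *c < 3) (*c)++;
    else if (!taken && *c > 0) (*c)--;
}
```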


Two-level Branch Prediction

[Diagram: the PC and a branch history register (BHR) together index a pattern history table (PHT) of 2-bit counters; the selected counter supplies the branch prediction.]

• BHR adds global branch history
  – Provides more context
  – Can differentiate multiple instances of the same static branch
  – Can correlate behavior across multiple static branches
Two-level Prediction: Local History

[Diagram: the PC selects a per-branch history entry in a branch history table (BHT); that local history then indexes the PHT to produce the prediction.]

• Detailed local history can be useful


Local History Predictor Example

• Loop-closing branches
  – Must identify the last instance
  – Example history for a loop-closing branch: 1110 1110 1110 1110 ...
• Local history dedicates a PHT entry to each instance
  – The '0111' entry trains toward not taken (00) and predicts the loop exit
  – The '1011', '1101', and '1110' entries train toward taken (11)


Two-level Taxonomy

1970: Flynn
1972: Riseman/Foster
1979: Smith Predictor
1991: Two-level prediction

• Based on indices for branch history and pattern history
  – BHR: {G,P,S}: {Global, Per-address, Set}
  – PHT: {g,p,s}: {global, per-address, set}
  – 9 combinations: GAg, GAp, GAs, PAg, PAp, PAs, SAg, SAp, and SAs
• Tse-Yu Yeh and Yale N. Patt. Two-Level Adaptive Branch Prediction. International Symposium on Microarchitecture, pages 51-61, November 1991.


Index Sharing in Two-level Predictors

1970: Flynn
1972: Riseman/Foster
1979: Smith Predictor
1991: Two-level prediction
1993: gshare, tournament

[Diagram: GAp concatenates BHR bits with PC bits to index the PHT, so distinct (PC, history) pairs can collide; gshare instead XORs the BHR with PC bits (e.g. BHR 1101 XOR PC 0110 = 1011), spreading the pairs across the whole PHT.]

• Use the XOR function to achieve better utilization of the PHT
• Scott McFarling. Combining Branch Predictors. TN-36, Digital Equipment Corporation Western Research Laboratory, June 1993.
• Used in e.g. IBM Power 4, Alpha 21264
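The XOR indexing can be sketched as follows; the 12-bit history length and the names are illustrative, not from McFarling's report:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HIST_BITS 12                 // assumed history length
#define PHT_SIZE (1u << HIST_BITS)

static uint16_t bhr;                 // global branch history register
static uint8_t  pht[PHT_SIZE];       // 2-bit counters, init 0

// gshare: XOR the global history with branch-address bits so that
// different (PC, history) pairs spread across the whole PHT.
uint32_t gshare_index(uint32_t pc) {
    return ((pc >> 2) ^ bhr) & (PHT_SIZE - 1);
}

bool gshare_predict(uint32_t pc) {
    return (pht[gshare_index(pc)] >> 1) != 0;
}

void gshare_update(uint32_t pc, bool taken) {
    uint8_t *c = &pht[gshare_index(pc)];
    if (taken && *c < 3) (*c)++;
    else if (!taken && *c > 0) (*c)--;
    bhr = ((bhr << 1) | taken) & (PHT_SIZE - 1);  // shift in the outcome
}
```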
Sources of Mispredictions
• Lack of history (training time)
• Randomized behavior
– Usually due to randomized input data (benchmarks)
– Surprisingly few branches depend on input data
values
• BHR capacity
– Correlate to branch that already shifted out
– E.g. loop count > BHR width
• PHT capacity
– Aliasing/interference
• Positive
• Negative



Reducing Interference
• Compulsory aliasing (cold miss)
– Not important (less than 1%)
– Only remedy is to set appropriate initial value
– Also: beware indexing schemes with high training
cost (e.g. very long branch history)
• Capacity aliasing (capacity miss)
– Increase PHT size
• Conflict aliasing (conflict miss)
– Change indexing scheme or partition PHT in a
clever fashion
Bi-Mode Predictor

[Diagram: the branch address and global BHR are XORed to index two pattern history tables, PHT0 and PHT1; a choice predictor indexed by the branch address selects which PHT supplies the final prediction.]

• PHT partitioned into T/NT halves
  – Selector chooses source
• Reduces negative interference, since most entries in PHT0 tend towards NT, and most entries in PHT1 tend towards T
• Used by ARM Cortex-A15
gskewed Predictor

[Diagram: the branch address and global BHR feed three different hash functions f0, f1, f2, each indexing its own PHT bank (PHT0, PHT1, PHT2); a majority vote over the three counters gives the final prediction.]

• Multiple PHT banks indexed by different hash functions
  – Conflicting branch pair unlikely to conflict in more than one PHT
• Majority vote determines prediction
• Used in Alpha EV8 (ultimately cancelled)
• P. Michaud, A. Seznec, and R. Uhlig. Trading Conflict and Capacity Aliasing in Conditional Branch Predictors. ISCA-24, June 1997
Agree Predictor

[Diagram: the branch address and global BHR are XORed to index the PHT; each entry records agreement with the branch's biasing bit (1 = agree with the bias bit, 0 = disagree), and the prediction is the bias bit, possibly inverted by the PHT.]

• Same principle as bi-mode
• PHT records whether the branch bias matches the outcome
  – Exploits 70-80% static predictability
• Used in HP PA-8700
• E. Sprangle, R. S. Chappell, M. Alsup, and Y. N. Patt. The Agree Predictor: A Mechanism for Reducing Negative Branch History Interference. ISCA-24, June 1997.
YAGS Predictor

1970: Flynn
1972: Riseman/Foster
1979: Smith Predictor
1991: Two-level prediction
1993: gshare, tournament
1998: Cache exceptions

[Diagram: a choice PHT indexed by the branch address picks a direction; small tagged T- and NT-caches (partial tag plus 2-bit counter, indexed by branch address XOR global BHR) override the choice PHT on a tag hit.]

• Based on bi-mode
  – T/NT PHTs cache only the exceptions
• A. N. Eden and T. N. Mudge. The YAGS Branch Prediction Scheme. MICRO, Dec 1998.


Branch Confidence Estimation

1970: Flynn
1972: Riseman/Foster
1979: Smith Predictor
1991: Two-level prediction
1993: gshare, tournament
1996: Confidence estimation
1998: Cache exceptions

[Diagram: the branch address and global BHR are XORed to index a table of correct/incorrect registers (CIRs); a reduction function over the selected CIR yields a confidence prediction.]

• Uses: limit speculation (energy), reverse predictions, guide fetch for multithreaded processors, choose best prediction
• Q Jacobson, E Rotenberg, and JE Smith. Assigning Confidence to Conditional Branch Predictions. MICRO, December 1996.
Dynamic History Length

1970: Flynn
1972: Riseman/Foster
1979: Smith Predictor
1991: Two-level prediction
1993: gshare, tournament
1996: Confidence estimation
1996: Vary history length
1998: Cache exceptions

• Branch history length:
  – Some branches prefer short history (less training time)
  – Some require longer history (complex behavior)
• Vary history length
  – Choose through profile/compile-time hints
  – Or learn dynamically
• References
  – Maria-Dana Tarlescu, Kevin B. Theobald, and Guang R. Gao. Elastic History Buffer: A Low-Cost Method to Improve Branch Prediction Accuracy. ICCD, October 1996.
  – Toni Juan, Sanji Sanjeevan, and Juan J. Navarro. Dynamic History-Length Fitting: A Third Level of Adaptivity for Branch Prediction. ISCA, June 1998.
  – Jared Stark, Marius Evers, and Yale N. Patt. Variable Path Branch Prediction. ACM SIGPLAN Notices, 33(11):170-179, 1998


Loop Count Predictors

[Diagram: each BHR entry gains a mode bit H/C that selects between history mode (H) and count mode (C).]

• To predict the last loop iteration's NT branch:
  – Must have length(BHR) > loop count
  – Not feasible for large loop counts
• Instead, the BHR has a mode bit
  – Once history == '111...11' or '000...00', switch to count mode
  – Now the nth entry in the PHT trains to NT and predicts the nth iteration as the last one
  – Now length(BHR) > log2(loop count) is sufficient
• Used in Intel Pentium M / Core Duo / Core 2 Duo
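The count-mode idea can be illustrated for a single branch. This sketches the underlying trip-count behavior only, not the slide's mode-bit BHR encoding; all names are mine:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t trip_count;  // learned taken-run length (0 = untrained)
    uint32_t current;     // taken outcomes seen in this loop instance
} loop_pred;

// Predict taken until the learned trip count is reached, then the
// final iteration is predicted not taken.
bool loop_predict(const loop_pred *p) {
    if (p->trip_count == 0) return true;   // untrained: assume taken
    return p->current < p->trip_count;
}

void loop_update(loop_pred *p, bool taken) {
    if (taken) {
        p->current++;
    } else {
        p->trip_count = p->current;  // learn the run length at loop exit
        p->current = 0;
    }
}
```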
Path History

[Diagram: branch A (if (y == 0) goto C;) and branch B (if (y == 5) goto C;) both jump to C (if (y < 12) goto D;), which jumps to D (if (y % 2) goto E;). Along path ACD and along path BCD, branch D has the same branch address X and the same outcome history TT, yet its outcome differs: not taken after ACD, taken after BCD.]

• Sometimes T/NT history is not enough
• Path history (PC values) can help
Understanding Advanced Predictors
• Four types of history
– Local (bimodal) history (Smith predictor)
• Table of counters summarizes local history
• Simple, but only effective for biased branches
– Local outcome history (correlate with self)
• Shift register of individual branch outcomes
• Separate counter for each outcome history (M-F vs Sat/Sun)
– Global outcome history (correlate with others)
• Shift register of recent branch outcomes
• Separate counter for each outcome history
– Path history (overcomes CFG convergence aliasing)
• Shift register of recent (partial) block addresses
• Can differentiate similar global outcome histories
• Can combine histories in many ways



Understanding Advanced Predictors

• History length
– Short history—lower training cost
– Long history—captures macro-level behavior
– Variable history length predictors
• Really long history (long loops)
– Loop count predictors
– Fourier transform into frequency domain
• Kampe et al., “The FAB Predictor…”, HPCA 2002
• Limited capacity & interference
– Constructive vs. destructive
– Bi-mode, gskewed, agree, YAGS
– Sec. 9.3.2 provides good overview



Perceptron Branch Prediction

1970: Flynn
1972: Riseman/Foster
1979: Smith Predictor
1991: Two-level prediction
1993: gshare, tournament
1996: Confidence estimation
1996: Vary history length
1998: Cache exceptions
2001: Neural predictor

[Jimenez, Lin HPCA 2001]

    y = w_0 + sum_{i=1..n} (x_i * w_i)

• Perceptron
  – Basis in AI concept [1962]
  – Computes boolean result based on multiple weighted inputs
• Adapted for branch prediction
  – x_i from branch history (1 = T, -1 = NT)
  – w_i incremented whenever the branch outcome matches x_i
  – Finds correlation between the current branch and any subset of prior branches
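A single-perceptron sketch of the formula above. HIST and the training threshold THETA are assumed values; weight saturation and the per-branch table of perceptrons are omitted:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HIST  8    // assumed history length
#define THETA 16   // assumed training threshold

static int8_t weights[HIST + 1];  // weights[0] is the bias w_0
static int8_t history[HIST];      // +1 = taken, -1 = not taken

// y = w_0 + sum(x_i * w_i)
int perceptron_output(void) {
    int y = weights[0];
    for (int i = 0; i < HIST; i++)
        y += history[i] * weights[i + 1];
    return y;
}

bool perceptron_predict(void) { return perceptron_output() >= 0; }

void perceptron_train(bool taken) {
    int y = perceptron_output();
    int t = taken ? 1 : -1;
    // Train on a misprediction, or while |y| is below the threshold.
    if ((y >= 0) != taken || (y < THETA && y > -THETA)) {
        weights[0] += t;
        for (int i = 0; i < HIST; i++)
            weights[i + 1] += t * history[i];  // reward agreement with x_i
    }
    for (int i = HIST - 1; i > 0; i--)         // shift in the outcome
        history[i] = history[i - 1];
    history[0] = (int8_t)t;
}
```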


Perceptrons - Implementation

• Complex dot product must be computed for every prediction
  – Too slow
• Arithmetic tricks, pipelining:
  – Daniel A. Jimenez and Calvin Lin. Neural methods for dynamic branch prediction. ACM Transactions on Computer Systems, 20(4):369–397, November 2002.
  – Analog circuit implementation also possible
    • Amant, Jimenez, Burger, MICRO 2008
• Key insights:
  – Not all branches in history are important; the weights learn this
  – Long histories are useful
Practical Neural Predictors

[Diagram: the PC and BHR form a "feature" that indexes a table of precomputed responses; the selected "response" is compared against a threshold t to produce the prediction.]

• Approximate the dot product function with precomputed responses
• Update (inc/dec) responses based on outcomes
Practical Neural Predictors

[Diagram: many features each index a response table; the selected responses are summed and compared against a threshold t.]

• Many possible features (local, global, path, ...)
• Responses updated based on a neuron-like model
• Threshold tuned and/or updated
• Recent designs from AMD, Samsung claim a “neural predictor”
  – This slide is my best guess as to what that means
• Some hints: “Multiperspective Perceptron Predictor,” Daniel Jimenez, CBP-5, ISCA 2016.
Overriding Predictors
• Different types of history
– E.g. Bimodal, Local, Global (BLG)
• Different history lengths
• How to choose?
– Metapredictor/selector? Expensive, slow to train
• Tag match with most sophisticated predictor entry
– Parallel tag check with B, L, G, long-history G
– Choose most sophisticated prediction
– Fancy predictors only updated when simple ones fail



Prediction by Partial Matching

1970: Flynn
1972: Riseman/Foster
1979: Smith Predictor
1991: Two-level prediction
1993: gshare, tournament
1996: Confidence estimation
1996: Vary history length
1998: Cache exceptions
2001: Neural predictor
2004: PPM

[P. Michaud, CBP-1 2004, JILP 2005]

• Elegant approach for choosing from several predictors, based on PPM data compression
• Partial tags like YAGS, varying history lengths
Current State of the Art

1970: Flynn
1972: Riseman/Foster
1979: Smith Predictor
1991: Two-level prediction
1993: gshare, tournament
1996: Confidence estimation
1996: Vary history length
1998: Cache exceptions
2001: Neural predictor
2004: PPM
2006: TAGE

• Key concepts
  – Different history types (B, L, G)
  – Geometric series of history lengths
    • Some branches prefer short, others long
    • Use a geometric series [Seznec, CBP-1, O-GEHL]
  – Cache only exceptions (YAGS/PPM)
  – Confidence estimation [Jacobson et al., MICRO 1996]
• Tagged Geometric History Length (TAGE)
  – A. Seznec, P. Michaud, “A case for (partially) tagged Geometric History Length Branch Prediction”, Journal of Instruction Level Parallelism, Feb. 2006
TAGE Predictor

• Multiple tagged tables use different global history lengths
• The set of history lengths forms a geometric series
  {0, 2, 4, 8, 16, 32, 64, 128, 256, ..., 2048}
• Most of the storage is for short history!


Tagged Geometric History Length (TAGE)

[Diagram: a base predictor indexed by the PC sits beside several tagged tables; table i is indexed by a hash of the PC with the first L_i bits of the global history h (L1 < L2 < L3). Each tagged entry holds a prediction counter and a partial tag; all tags are compared in parallel, and the longest-history table that hits provides the prediction, falling back to the base predictor on a miss.]

• The longest matching table provides the prediction, subject to branch confidence
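A heavily simplified sketch of the longest-match selection. Table sizes, the hash, the history lengths, and the allocation stand-in are all my own assumptions; real TAGE adds valid/useful bits, confidence, and careful allocation and update rules:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NTABLES 3
#define TSIZE 256
static const int hist_len[NTABLES] = {4, 8, 16};  // increasing lengths

typedef struct { uint8_t ctr; uint8_t tag; } tage_entry;
static tage_entry tables[NTABLES][TSIZE];
static uint8_t  base_pht[TSIZE];   // bimodal base predictor (2-bit counters)
static uint32_t ghist;             // global history, newest outcome in bit 0

// Toy index hash folding the first hist_len[t] history bits with the PC.
static uint32_t tage_index(uint32_t pc, int t) {
    uint32_t h = ghist & ((1u << hist_len[t]) - 1);
    return ((pc >> 2) ^ h ^ (h >> 3)) % TSIZE;
}

static uint8_t tage_tag(uint32_t pc) {
    return (uint8_t)((pc >> 2) ^ (pc >> 10));  // toy partial tag
}

bool tage_predict(uint32_t pc) {
    // Default prediction comes from the base predictor...
    bool pred = (base_pht[(pc >> 2) % TSIZE] >> 1) != 0;
    // ...and each tagged table that hits overrides it; since later
    // tables use longer history, the longest match wins.
    for (int t = 0; t < NTABLES; t++) {
        tage_entry *e = &tables[t][tage_index(pc, t)];
        if (e->tag == tage_tag(pc))
            pred = (e->ctr >> 1) != 0;
    }
    return pred;
}

// Stand-in for TAGE allocation: install an entry in table t directly.
void tage_install(uint32_t pc, int t, bool taken) {
    tage_entry *e = &tables[t][tage_index(pc, t)];
    e->tag = tage_tag(pc);
    e->ctr = taken ? 2 : 1;  // weakly taken / weakly not taken
}
```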
TAGE

1970: Flynn
1972: Riseman/Foster
1979: Smith Predictor
1991: Two-level prediction
1993: gshare, tournament
1996: Confidence estimation
1996: Vary history length
1998: Cache exceptions
2001: Neural predictor
2004: PPM
2006: TAGE
2016: Still TAGE vs Neural

• Tweaks to the basic concept still win CBP-6
  – 1st place: TAGE-SC-L
  – 2nd place: Perceptron+TAGE hybrid
• State of the art, but...
  – Rumors of real implementation
  – Very energy-intensive (parallel lookups)
  – Complex update rules
• Real opportunity exists for improvement


TAGE vs. Neural

• Neural: ARM, AMD, Samsung
• TAGE: Intel, ???
• Similarity
  – Many sources or “features”
• Key difference: how to combine them
  – TAGE: override via partial match
  – Neural: integrate + threshold
• Every CBP is a cage match
  – Seznec vs. Jimenez


Instruction Flow Techniques
• Instruction Flow and its Impediments
• Control Dependences
• Control Flow Speculation
– Branch Speculation
– Mis-speculation Recovery
• Branch Direction Prediction
– Static Prediction
– A brief history of dynamic branch prediction
• Branch Target Prediction
• High-bandwidth Fetch
• High-Frequency Fetch



Branch Target Prediction

[Diagram: the branch address indexes a set-associative branch target buffer (target + tag per way); the tags are compared against the branch address. On a BTB hit the taken-target is selected; otherwise the not-taken target (branch address plus instruction size) is used. A separate direction predictor chooses between the two.]

• Partial tags are sufficient in the BTB
Return Address Stack

[Diagram: (a) a BTB alone mispredicts return targets; (b) the BTB detects “is this a return?”, and a return address stack, pushed with branch address plus instruction size on each call, supplies the target prediction.]

• Speculative update causes headaches
  – On each predicted branch, checkpoint head/tail
  – Further, checkpoint stack contents, since a speculative pop/push sequence is destructive
  – Conditional call/return causes more headaches
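A minimal return address stack sketch; the depth and names are assumed, and the checkpoint/recovery machinery discussed above is omitted:

```c
#include <assert.h>
#include <stdint.h>

#define RAS_DEPTH 16   // assumed depth

static uint32_t ras[RAS_DEPTH];
static int ras_top;    // count of pushed entries (wraps over the oldest)

// On a predicted call: push the fall-through (return) address.
void ras_push(uint32_t call_pc, uint32_t inst_size) {
    ras[ras_top % RAS_DEPTH] = call_pc + inst_size;
    ras_top++;
}

// On a predicted return: pop the predicted target.
uint32_t ras_pop(void) {
    if (ras_top > 0) ras_top--;
    return ras[ras_top % RAS_DEPTH];
}
```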
Indirect Branches
• Tagged target cache
  – Chang et al., “Target Prediction for Indirect Jumps,” ISCA 1997

Mikko Lipasti-University of Wisconsin 71
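One way to model a tagged target cache in the spirit of Chang et al.: index with the branch PC hashed against recent target history, so the same indirect jump can predict different targets in different contexts. The hash, sizes, and full-PC tag here are illustrative assumptions, not the paper’s exact design:

```python
class TaggedTargetCache:
    """Indirect-branch target cache indexed by PC XOR target history."""
    def __init__(self, entries=512):
        self.entries = entries
        self.table = {}     # (index, tag) -> target; a dict stands in for SRAM
        self.history = 0    # path history of recent indirect targets

    def _key(self, pc):
        index = (pc ^ self.history) % self.entries
        return (index, pc)  # full PC used as the tag for simplicity

    def predict(self, pc):
        return self.table.get(self._key(pc))   # None on a miss

    def update(self, pc, target):
        self.table[self._key(pc)] = target

    def advance_history(self, target):
        # fold a few bits of the resolved target into the history register
        self.history = ((self.history << 4) ^ target) & 0xFFFF

ttc = TaggedTargetCache()
ttc.update(0xABC0, 0xDEF0)
assert ttc.predict(0xABC0) == 0xDEF0
ttc.advance_history(0xDEF0)
ttc.update(0xABC0, 0x1111)   # same branch, different history context
assert ttc.predict(0xABC0) == 0x1111
```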


Indirect Branches
• ITTAGE proposed in same 2006 paper as TAGE
  – A. Seznec and P. Michaud, “A case for (partially) tagged geometric history length branch prediction,” Journal of Instruction Level Parallelism, Feb. 2006

Mikko Lipasti-University of Wisconsin 72


Indirect Branches
• CBP-3 had an indirect prediction track
• #1: A. Seznec, A 64-Kbytes ITTAGE indirect branch predictor, MPPKI 34.1
• #2: Y. Ishii, T. Sawada, K. Kuroyanagi, M. Inaba, K. Hiraki, Bimode Cascading: Adaptive Rehashing for ITTAGE Indirect Branch Predictor, MPPKI 37.0
• #3: N. Bhansali, C. Panirwala, H. Zhou, Exploring Correlation for Indirect Branch Prediction, MPPKI 51.6
• #4: Daniel A. Jimenez, SNIP: Scaled Neural Indirect Predictor, MPPKI 52.9

Mikko Lipasti-University of Wisconsin 73


High-Bandwidth Fetch: Collapsing Buffer
[Figure: Collapsing buffer. The branch address indexes an interleaved BTB that produces two cache line addresses; cache banks 1 and 2 supply lines A B C D and E F G H along with per-instruction valid bits. An interchange switch restores program order (A B C D E F G H), and a collapsing circuit squeezes out instructions beyond taken branches (leaving, e.g., A B C E G) before the group is sent to the decode stage.]

• Fetch from two cache blocks, rotate, collapse past taken branches
• Thomas M. Conte, Kishore N. Menezes, Patrick M. Mills and Burzin A. Patel.
Optimization of Instruction Fetch Mechanisms for High Issue Rates.
International Symposium on Computer Architecture, June 1995.
Mikko Lipasti-University of Wisconsin 74
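The collapsing step amounts to a masked compaction. A toy sketch, assuming the interleaved BTB supplies one valid bit per instruction (names and data are illustrative):

```python
def collapse(fetch_group, valid_bits):
    """Keep only on-path instructions and pack them contiguously.

    fetch_group: instructions from the two cache banks, already passed
                 through the interchange switch into program order.
    valid_bits:  1 = instruction is on the predicted path.
    """
    return [inst for inst, v in zip(fetch_group, valid_bits) if v]

group = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
valid = [1, 1, 1, 0, 1, 0, 1, 0]   # taken branches skip D, F, and H
assert collapse(group, valid) == ['A', 'B', 'C', 'E', 'G']
```

In hardware the same compaction is done by a crossbar/shifter network rather than a loop, but the input/output behavior is the same.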
High-Bandwidth Fetch: Trace Cache
[Figure: (a) the instruction cache holds the dynamic sequence across three separate lines: A B C D, E F G, H I J; (b) the trace cache holds the entire sequence A B C D E F G H I J in a single line.]
• Fold out taken branches by tracing instructions as they commit into a fill buffer
• Eric Rotenberg, S. Bennett, and James E. Smith. Trace
Cache: A Low Latency Approach to High Bandwidth
Instruction Fetching. MICRO, December 1996.
Mikko Lipasti-University of Wisconsin 75
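The fill-buffer mechanism can be sketched as follows; the trace-length and branches-per-trace limits are illustrative assumptions, not parameters from the Rotenberg et al. design:

```python
class TraceCache:
    """Builds traces from the commit stream; fetch hits return whole traces."""
    MAX_LEN = 16        # instructions per trace (assumed limit)
    MAX_BRANCHES = 3    # embedded branches per trace (assumed limit)

    def __init__(self):
        self.traces = {}        # trace start PC -> list of instruction PCs
        self.fill = []          # fill buffer accumulating committed instructions
        self.fill_branches = 0

    def commit(self, pc, is_branch=False):
        self.fill.append(pc)
        if is_branch:
            self.fill_branches += 1
            if self.fill_branches == self.MAX_BRANCHES:
                self.finish_trace()
        if len(self.fill) >= self.MAX_LEN:
            self.finish_trace()

    def finish_trace(self):
        if self.fill:
            self.traces[self.fill[0]] = list(self.fill)
        self.fill = []
        self.fill_branches = 0

    def fetch(self, pc):
        return self.traces.get(pc)   # None -> fall back to the I-cache

tc = TraceCache()
# commit A B C D E, where D is a taken branch to a non-sequential address
for pc, br in [(0x100, False), (0x104, False), (0x108, False),
               (0x10C, True), (0x200, False)]:
    tc.commit(pc, br)
tc.finish_trace()
# one fetch now supplies the sequence across the taken branch
assert tc.fetch(0x100) == [0x100, 0x104, 0x108, 0x10C, 0x200]
```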
Intel Pentium 4 Trace Cache
[Figure: Pentium 4 front end. The front-end BTB, instruction TLB, and prefetcher fetch from the level-two unified data and instruction cache into the instruction decoder; decoded uops fill the trace cache, which is indexed by its own trace cache BTB and drains into the instruction fetch queue, then on to the renamer, execute stages, etc.]

• No first-level instruction cache: trace cache only


• Trace cache BTB identifies next trace
• Miss leads to fetch from level two cache
• Trace cache instructions are decoded (uops)
• Cache capacity 12k uops
– Overwhelmed for database applications
– Serial decoder becomes performance bottleneck
Mikko Lipasti-University of Wisconsin 76
High-Bandwidth Fetch: Loop Buffers
[Figure: A short loop body, ending in a conditional branch (bc) back to the top, held entirely in the fetch/decode/dispatch buffer.]

• History: AMD29K Branch Target Cache


– Don’t cache the target address; cache 4 instructions from the target itself
– Avoid accessing I$ for first fetch group following a taken branch
– If loop body is <= 4 instructions, effectively a loop cache
– Room for 32/64 branch targets
• Also common in DSP designs, under s/w control (e.g. Lucent)
• Introduced in Intel Merom (Core 2 Duo)
– Fetch buffer detects short backward branches, inhibits refetch from I$
• Intel Nehalem (Core i7)
– Moved loop buffer after decoders: contains uops
• Intel Sandybridge
– General-purpose uop cache (not just loops)
– 1.5K capacity
Mikko Lipasti-University of Wisconsin 77
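A sketch of the loop-detection idea in the style of the AMD29K branch target cache / Merom loop detector: on a taken short backward branch, buffer the loop body so later iterations skip the I-cache. The 4-instruction capacity and fixed instruction size are illustrative assumptions:

```python
class LoopBuffer:
    """Caches short loop bodies keyed by the taken branch's target."""
    CAPACITY = 4   # instructions buffered per target (assumed, per AMD29K)

    def __init__(self):
        self.buffers = {}   # target PC -> buffered instruction PCs

    def on_taken_branch(self, branch_pc, target_pc, inst_size=4):
        # a short backward branch marks a loop small enough to buffer
        if 0 < branch_pc - target_pc <= self.CAPACITY * inst_size:
            n = (branch_pc - target_pc) // inst_size + 1
            self.buffers[target_pc] = [target_pc + i * inst_size
                                       for i in range(n)]

    def fetch(self, pc):
        return self.buffers.get(pc)   # hit: serve the body, skip the I-cache

lb = LoopBuffer()
lb.on_taken_branch(branch_pc=0x10C, target_pc=0x100)  # 4-instruction loop
assert lb.fetch(0x100) == [0x100, 0x104, 0x108, 0x10C]
assert lb.fetch(0x200) is None                        # not a buffered loop
```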
High Frequency: Next-line Prediction
[Figure: Pipelined fetch across three cycles. Each cache line (A B C D; E F G H; I J K L) stores a next-line prediction alongside its tag, so the following fetch can begin immediately; the tag check for cycle 1’s lookup completes in cycle 2 (and cycle 2’s in cycle 3), and a mismatch signals a next-line misprediction.]
• Embed next fetch address in instruction cache
  – Enables high-frequency back-to-back fetch
• Brad Calder and Dirk Grunwald. Next Cache Line and Set
Prediction. International Symposium on Computer
Architecture, pages 287-296, June 1995.
Mikko Lipasti-University of Wisconsin 78
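A minimal model of the embedded next-fetch-address field: each line carries a pointer that is followed blindly, then corrected when the delayed tag check exposes a misprediction. Line size and table organization are illustrative assumptions:

```python
class NextLinePredictor:
    """Per-cache-line next-fetch-address field, trained on mispredictions."""
    def __init__(self, lines=256, line_bytes=32):
        self.line_bytes = line_bytes
        self.next_line = [None] * lines   # embedded next-fetch pointers

    def _line(self, pc):
        return (pc // self.line_bytes) % len(self.next_line)

    def predict(self, pc):
        guess = self.next_line[self._line(pc)]
        if guess is not None:
            return guess
        # untrained default: the sequential next line
        return (pc // self.line_bytes + 1) * self.line_bytes

    def train(self, pc, actual_next_fetch):
        # on a next-line misprediction (caught by the delayed tag check),
        # rewrite the embedded pointer
        self.next_line[self._line(pc)] = actual_next_fetch

nlp = NextLinePredictor()
assert nlp.predict(0x1000) == 0x1020   # sequential by default
nlp.train(0x1000, 0x2000)              # a taken branch leaves the line
assert nlp.predict(0x1000) == 0x2000
```

Because the pointer is read in the same access as the line itself, the next fetch needs no separate predictor lookup, which is what enables back-to-back fetch at high frequency.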
High Frequency: Overriding Predictors
[Figure: Three-stage fetch pipeline. Stage 1: a small, fast predictor produces a prediction every cycle (predict A, then B, then C); stage 2: the instruction cache fetches the predicted address; stage 3: fetched instructions enter the fetch queue. The slow overriding predictor trails by two cycles: if the slow prediction agrees with the fast prediction, do nothing; if the predictions do not match, flush A, B, and C and restart fetch at the new predicted target.]
• Simple, fast predictor turns around every cycle


• Smarter, slower predictor can override
• Widely used: PowerPC 604, 620, Alpha 21264
Mikko Lipasti-University of Wisconsin 79
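The override protocol can be sketched cycle by cycle; this toy model (a 2-cycle slow-predictor latency and stand-in predictor callables are assumptions) shows the fast guess streaming fetches while the slow predictor confirms or flushes them:

```python
def fetch_with_override(fast_predict, slow_predict, start_pc, cycles=8):
    """Simulate a fast next-fetch guess checked 2 cycles later by a
    slower, more accurate predictor; disagreement flushes younger fetches."""
    inflight = []    # (pc, issue_cycle) fetches awaiting confirmation
    fetched = []     # confirmed fetch addresses, in order
    pc = start_pc
    for cycle in range(cycles):
        inflight.append((pc, cycle))
        pc = fast_predict(pc)                  # 1-cycle fast guess
        # the slow result for a fetch arrives 2 cycles after it issued
        if inflight and cycle - inflight[0][1] >= 2:
            confirmed_pc, _ = inflight.pop(0)
            fetched.append(confirmed_pc)
            better = slow_predict(confirmed_pc)
            if better != fast_predict(confirmed_pc):
                inflight = []                  # flush the younger fetches
                pc = better                    # restart at overriding target
    return fetched

fast = lambda pc: pc + 4                       # always-sequential fast guess
slow = lambda pc: 0x400 if pc == 0x108 else pc + 4
assert fetch_with_override(fast, slow, 0x100, cycles=6) == [0x100, 0x104, 0x108]
```

When the two predictors agree (the common case), the override machinery costs nothing; the flush penalty is paid only on disagreement, which is far cheaper than running the slow predictor in the critical fetch loop every cycle.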
Instruction Flow Summary
• Instruction Flow and its Impediments
• Control Dependences
• Control Flow Speculation
– Branch Speculation
– Mis-speculation Recovery
• Branch Direction Prediction
– Static Prediction
– A brief history of dynamic branch prediction
• Branch Target Prediction
• High-bandwidth Fetch
• High-Frequency Fetch

Mikko Lipasti-University of Wisconsin 80
