Questions That I Encountered
The BTB is a cache that stores branch instruction addresses and their predicted
targets.
The fetch unit checks the PC (Program Counter) in the BTB:
o If a match is found, the instruction is likely a branch.
o If no match, it is treated as a non-branch (until decode confirms).
BTB lookup happens in parallel with instruction fetching, enabling early prediction.
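As a rough sketch of this lookup (not modeled on any particular processor; the sizes, bit slicing, and names below are assumptions), a direct-mapped BTB can be thought of as a small tag/target table indexed by low PC bits:

```c
#include <stdint.h>
#include <stdbool.h>

#define BTB_ENTRIES 256   /* hypothetical size, power of two */

typedef struct {
    bool     valid;
    uint64_t tag;      /* upper PC bits identifying the branch */
    uint64_t target;   /* predicted branch target address */
} btb_entry_t;

static btb_entry_t btb[BTB_ENTRIES];

/* Lookup done in the fetch stage, in parallel with the I-cache access.
 * A hit means this PC is likely a branch and supplies a predicted target;
 * a miss means fetch continues sequentially until decode says otherwise. */
static bool btb_lookup(uint64_t pc, uint64_t *predicted_target)
{
    uint64_t index = (pc >> 2) & (BTB_ENTRIES - 1);   /* low PC bits */
    uint64_t tag   = pc >> 10;                         /* remaining bits */

    if (btb[index].valid && btb[index].tag == tag) {
        *predicted_target = btb[index].target;
        return true;
    }
    return false;
}

/* Called when a branch actually resolves, to install or refresh its entry. */
static void btb_update(uint64_t pc, uint64_t actual_target)
{
    uint64_t index = (pc >> 2) & (BTB_ENTRIES - 1);
    btb[index].valid  = true;
    btb[index].tag    = pc >> 10;
    btb[index].target = actual_target;
}
```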
Some processors use pre-decode bits to classify instructions in the fetch unit.
As an instruction is fetched from memory, a pre-decoder examines the opcode to
check if it belongs to a branch category.
This avoids waiting for full decoding in later pipeline stages.
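A minimal sketch of such a pre-decode check, loosely patterned after RISC-V-style 7-bit major opcodes; the function name and the idea of doing this at cache-fill time are illustrative assumptions, not a specific design:

```c
#include <stdint.h>
#include <stdbool.h>

/* Pre-decode check for a fixed-width 32-bit encoding. A real pre-decoder
 * would match the actual ISA and typically store the resulting "is_branch"
 * bit alongside the instruction in the I-cache or fetch buffer. */
static bool predecode_is_branch(uint32_t insn)
{
    uint32_t opcode = insn & 0x7F;   /* low 7 bits = major opcode (assumed) */

    switch (opcode) {
    case 0x63:   /* conditional branch */
    case 0x6F:   /* direct jump        */
    case 0x67:   /* indirect jump      */
        return true;
    default:
        return false;
    }
}
```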
Some architectures hash past branch history together with the PC to make an early
prediction for the fetched instruction.
Example: the gshare predictor XORs the PC with global branch-history bits to index its
prediction table; combined with a BTB hit, this gives an early guess that the current
instruction is a (taken) branch.
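A minimal gshare sketch, assuming a 4096-entry table of 2-bit counters and a 12-bit global history register (all sizes and names invented for illustration):

```c
#include <stdint.h>
#include <stdbool.h>

#define PHT_ENTRIES 4096   /* 2^12 two-bit counters (size assumed) */

static uint8_t  pht[PHT_ENTRIES];   /* 2-bit saturating counters, 0..3 */
static uint16_t ghr;                /* global history register, 12 bits used */

/* gshare index: XOR low PC bits with the global branch history. */
static uint32_t gshare_index(uint64_t pc)
{
    return ((uint32_t)(pc >> 2) ^ ghr) & (PHT_ENTRIES - 1);
}

/* Predict taken when the counter is in one of its two "taken" states. */
static bool gshare_predict(uint64_t pc)
{
    return pht[gshare_index(pc)] >= 2;
}

/* After the branch resolves: train the counter and shift the outcome
 * into the global history register. */
static void gshare_update(uint64_t pc, bool taken)
{
    uint32_t i = gshare_index(pc);
    if (taken  && pht[i] < 3) pht[i]++;
    if (!taken && pht[i] > 0) pht[i]--;
    ghr = (uint16_t)(((ghr << 1) | (taken ? 1 : 0)) & (PHT_ENTRIES - 1));
}
```

XORing the history into the index (rather than concatenating it) is what reduces aliasing between different branches that share the same low PC bits.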
Some architectures mark branch instructions in the instruction buffer (e.g., ARM
Thumb branch instructions are recognizable from their bit patterns).
When fetching, the instruction buffer signals a branch hint to the predictor before
decode.
Some ISAs use compiler-generated hints embedded in the instruction itself (e.g.,
static branch prediction in MIPS).
The fetch unit can read these hints and prepare for a branch even before decoding.
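At the source-code level, the closest analogue is a compiler hint such as GCC/Clang's __builtin_expect, which tells the compiler which outcome is expected so it can lay out code (and, on ISAs with hint bits, encode them) accordingly. The function below is a made-up example, not MIPS-specific:

```c
#include <stddef.h>

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* The error path is marked unlikely, so the compiler arranges the code
 * assuming the branch falls through on the common path. */
int process(const char *buf, size_t len)
{
    if (unlikely(buf == NULL || len == 0))
        return -1;          /* rare error path */

    /* ... common-case processing ... */
    return 0;
}
```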
========================================
2. Where Are the BTB and Branch Predictor Located? & Why?
BTB & Branch Predictor:
In most modern high-performance processors, the BTB (Branch Target Buffer) and the
branch predictor sit in the Instruction Fetch stage to allow early prediction. However, in
some architectures, the BTB is located in the Decode stage, for example:
1. MIPS R4000
o Early MIPS processors had no BTB in fetch; instead, branch targets were
computed in Decode/Execute.
o The architecture relied more on delayed branches than aggressive branch
prediction.
2. ARM Cortex-M Series (Microcontrollers)
o These processors prioritize low power over aggressive branch prediction.
o The BTB is integrated with decode logic to minimize fetch-stage complexity.
3. Older x86 Processors (Before Pentium Pro)
o Some early x86 CPUs placed branch handling logic in decode because
instruction lengths were variable, making it harder to predict branches in fetch.
Different processors also use hybrid BTB designs where branch prediction is distributed
across multiple pipeline stages (a small two-level lookup sketch follows the table below).
Reason                                                      | Impact
Reduces unnecessary BTB lookups for non-branch instructions | Lower power consumption
Simplifies pipeline design                                  | Works well in processors with shorter pipelines
More efficient for low-power CPUs                           | Used in embedded architectures like ARM Cortex-M
Useful in architectures with delayed branches               | Example: MIPS, where branches resolve predictably
Fetch-stage complexity reduction                            | Helps CPUs with variable-length instructions (e.g., old x86)
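As a concrete sketch of such a distributed, multi-level arrangement (hypothetical, not any specific CPU): a tiny L0 BTB answers immediately in fetch, while a larger backing BTB is consulted on an L0 miss at the cost of a short bubble. The sizes, names, and the 1-cycle bubble are assumptions for illustration.

```c
#include <stdint.h>
#include <stdbool.h>

#define L0_ENTRIES 16     /* tiny, zero-bubble BTB in fetch (size assumed) */
#define L1_ENTRIES 1024   /* larger, slower backing BTB (size assumed) */

typedef struct { bool valid; uint64_t tag, target; } btb_entry_t;

static btb_entry_t l0[L0_ENTRIES];
static btb_entry_t l1[L1_ENTRIES];

/* Returns the predicted target if either level hits; *bubble_cycles reports
 * the extra fetch delay the slower level would cost (illustrative numbers). */
static bool hybrid_btb_lookup(uint64_t pc, uint64_t *target, int *bubble_cycles)
{
    uint64_t i0 = (pc >> 2) % L0_ENTRIES;
    if (l0[i0].valid && l0[i0].tag == pc) {
        *target = l0[i0].target;
        *bubble_cycles = 0;            /* hit in fetch, no penalty */
        return true;
    }
    uint64_t i1 = (pc >> 2) % L1_ENTRIES;
    if (l1[i1].valid && l1[i1].tag == pc) {
        *target = l1[i1].target;
        *bubble_cycles = 1;            /* assumed 1-cycle redirect bubble */
        return true;
    }
    return false;                      /* fall back to sequential fetch */
}
```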
1. Static Prediction:
o Simple rule-based, e.g., assume backward branches are taken (they usually close
loops) and forward branches are not taken (see the sketch after this list).
o Used in early architectures but ineffective for complex branches.
2. Dynamic Prediction:
o Uses runtime history to predict branch behavior.
o More accurate than static prediction.
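A minimal sketch of the static "backward taken, forward not taken" rule from item 1 above; the function name is made up and no real ISA encoding is implied.

```c
#include <stdint.h>
#include <stdbool.h>

/* Static BTFN heuristic: backward branches (negative displacement, typically
 * loop back-edges) are predicted taken; forward branches are predicted not
 * taken. No runtime state is needed, which is why it is cheap but inflexible. */
static bool static_predict_taken(uint64_t branch_pc, uint64_t target_pc)
{
    return target_pc < branch_pc;   /* backward branch => predict taken */
}
```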
1. 1-bit Predictor:
o Stores a single bit for each branch (0 = not taken, 1 = taken).
o High misprediction in cases where a branch changes frequently.
2. 2-bit Saturating Counter Predictor:
o Uses a 2-bit counter per branch address to track history (a small state-machine
sketch follows this list).
o More resistant to fluctuations.
o States: Strongly Taken, Weakly Taken, Weakly Not Taken, Strongly Not Taken.
3. Gshare Predictor:
o XORs the branch history register (BHR) with the program counter (PC)
before indexing into a table.
o Reduces aliasing in branch prediction.
4. TAGE (Tagged Geometric Predictor):
o Uses multiple branch history lengths to adapt to short and long-term branch
behaviors.
o Best-in-class for high-performance CPUs.
5. Neural Branch Prediction:
o Uses perceptrons (machine learning models) to improve prediction accuracy
for hard-to-predict branches.
o IBM’s POWER10 and Intel’s newer chips have experimented with this.
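A small sketch of the 2-bit saturating counter from item 2 as an explicit four-state machine; the same counters serve as the table entries in predictors like gshare. State and function names are illustrative.

```c
#include <stdbool.h>

/* The four states of a 2-bit saturating counter. */
typedef enum {
    STRONG_NOT_TAKEN = 0,
    WEAK_NOT_TAKEN   = 1,
    WEAK_TAKEN       = 2,
    STRONG_TAKEN     = 3
} counter_state_t;

static bool predict_taken(counter_state_t s)
{
    return s >= WEAK_TAKEN;
}

/* A single surprise outcome only moves a "strong" state to the neighboring
 * "weak" state, so one-off fluctuations (e.g. a loop exit) do not
 * immediately flip the prediction. */
static counter_state_t train(counter_state_t s, bool taken)
{
    if (taken  && s < STRONG_TAKEN)     return s + 1;
    if (!taken && s > STRONG_NOT_TAKEN) return s - 1;
    return s;
}
```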
When a branch misprediction occurs, the processor must discard all speculative
instructions that were executed under the incorrect assumption. This process is known as a
pipeline flush. Let’s go step by step into how this works.
🔹 Example: what gets cleaned up during a flush
Pipeline Registers – Every stage of the pipeline has registers that hold instructions as
they progress. On a misprediction, these registers are marked invalid, preventing the
incorrect instructions from completing execution.
1. Reorder Buffer (ROB) – Discards Speculative Results
The ROB stores speculative results until instructions reach the retirement (commit)
stage. If a misprediction occurs, all speculative instructions in the ROB are discarded
before they commit to the register file or memory.
2. Store Buffer – Prevents Wrong Memory Writes
Speculative stores are held in the store buffer until commit; on a flush they are simply
dropped, so wrong-path stores never reach memory.
Once the wrong-path work is discarded, execution recovers as follows (a small recovery
sketch follows these steps):
1. Fetch from the Correct Target Address – The instruction fetch unit starts fetching
the correct instructions.
2. Resume Normal Execution – The pipeline is filled with instructions from the correct
path.
3. Minimize Stalls – The CPU may use checkpointing techniques to quickly recover
from mispredictions.
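Tying the pieces together, the toy model below (not any real microarchitecture; all structure names and sizes are invented) shows the ROB squash and PC redirect described above:

```c
#include <stdint.h>
#include <stdbool.h>

#define ROB_SIZE 64   /* assumed */

typedef struct {
    bool     valid;
    bool     speculative;   /* not yet allowed to commit */
    uint64_t pc;
} rob_entry_t;

typedef struct {
    rob_entry_t rob[ROB_SIZE];
    unsigned    head, tail;   /* head = oldest (next to commit), tail = youngest */
    uint64_t    fetch_pc;     /* where the front end is currently fetching */
} core_t;

/* Called when a branch resolves in execute and its prediction was wrong.
 * 'branch_idx' is the ROB slot of the mispredicted branch. */
static void handle_mispredict(core_t *c, unsigned branch_idx, uint64_t correct_target)
{
    /* 1. Squash every entry younger than the branch (the wrong path).
     *    Speculative stores waiting in the store buffer are dropped the
     *    same way, so they never reach memory. */
    for (unsigned i = (branch_idx + 1) % ROB_SIZE; i != c->tail; i = (i + 1) % ROB_SIZE)
        c->rob[i].valid = false;
    c->tail = (branch_idx + 1) % ROB_SIZE;

    /* 2. Redirect the front end: reset the fetch PC to the correct target
     *    and restart fetching (pipeline registers / fetch queue are cleared). */
    c->fetch_pc = correct_target;

    /* 3. Older instructions (up to and including the branch) are unaffected
     *    and still commit in order from the head of the ROB. */
}
```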
Key Takeaways
✅ Misprediction Detection happens in the Execution Stage, when the actual branch
outcome is computed.
✅ Pipeline Flush removes all speculative instructions from the pipeline.
✅ Reorder Buffer (ROB) ensures that speculative results do not commit to
registers/memory.
✅ PC Reset & Instruction Queue Flush restart execution from the correct branch target.
✅ Modern CPUs reduce flush penalties with better branch predictors, checkpointing,
and early resolution.
Penalty Factors (a rough cost calculation follows this list):
Pipeline Depth:
o Deeper pipelines (15+ stages) suffer higher penalties (e.g., Intel P4 NetBurst
had ~20 cycle penalty).
Fetch-to-Execute Latency:
o The longer it takes to detect a misprediction, the worse the penalty.
Branch Resolution Stage:
o If the branch is resolved later in the pipeline (e.g., in execution), the penalty is
higher.
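As a back-of-the-envelope model of these factors: cycles lost per instruction is roughly branch frequency x misprediction rate x flush penalty. The numbers below are purely illustrative, not measurements of any particular CPU.

```c
#include <stdio.h>

int main(void)
{
    /* Illustrative parameters, not measurements. */
    double branch_fraction  = 0.20;   /* ~1 in 5 instructions is a branch */
    double mispredict_rate  = 0.05;   /* 95% prediction accuracy */
    double penalty_cycles   = 15.0;   /* deep pipeline, branch resolves late */
    double base_cpi         = 1.0;    /* ideal CPI without branch penalties */

    double penalty_per_insn = branch_fraction * mispredict_rate * penalty_cycles;
    double effective_cpi    = base_cpi + penalty_per_insn;

    printf("penalty per instruction: %.3f cycles\n", penalty_per_insn);  /* 0.150 */
    printf("effective CPI:           %.3f\n", effective_cpi);            /* 1.150 */
    return 0;
}
```

With these assumed numbers, mispredictions alone add about 15% to the CPI, which is why deeper pipelines and later branch resolution hurt so much.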
Typical Penalty Values: