Super Scalar 2

Uploaded by

Tharun Chitipolu
Computer Architecture

ELE 475 / COS 475


Slide Deck 5: Superscalar 2 and
Exceptions
David Wentzlaff
Department of Electrical Engineering
Princeton University

Agenda
• Interrupts
• Out-of-Order Processors

Interrupts:
altering the normal flow of control

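[Figure: a running program (instructions Ii-1, Ii, Ii+1) is interrupted at Ii; control transfers to the interrupt handler (instructions HI1, HI2, ..., HIn).]

An external or internal event that needs to be processed by another (system) program. The event is usually unexpected or rare from the program's point of view.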
Causes of Exceptions
Interrupt: an event that requests the attention of the processor
• Asynchronous: an external event
– input/output device service request
– timer expiration
– power disruptions, hardware failure
• Synchronous: an internal event (a.k.a. exception or trap)
– undefined opcode, privileged instruction
– arithmetic overflow, FPU exception
– misaligned memory access
– virtual memory exceptions: page faults,
TLB misses, protection violations
– software exceptions: system calls, e.g., jumps into kernel
Asynchronous Interrupts:
invoking the interrupt handler

• An I/O device requests attention by asserting one of the prioritized interrupt request lines
• When the processor decides to process the interrupt
– It stops the current program at instruction Ii, completing all the
instructions up to Ii-1 (a precise interrupt)
– It saves the PC of instruction Ii in a special register (EPC)
– It disables interrupts and transfers control to a designated interrupt
handler running in the kernel mode

Interrupt Handler
• Saves EPC before re-enabling interrupts to allow nested
interrupts 
– need an instruction to move EPC into GPRs
– need a way to mask further interrupts at least until EPC can be saved
• Needs to read a status register that indicates the cause
of the interrupt
• Uses a special indirect jump instruction, RFE (return-from-exception), to resume user code; this:
– enables interrupts
– restores the processor to the user mode
– restores hardware status and control state
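
The entry and return steps described on the last two slides can be sketched as a small state model. This is only an illustrative sketch: the `CPU` fields, the `HANDLER_PC` value, and the function names are assumptions made for the example, not part of any real ISA definition.

```python
from dataclasses import dataclass

HANDLER_PC = 0x80000180  # hypothetical handler address, chosen for illustration


@dataclass
class CPU:
    pc: int = 0                      # PC of the next instruction to execute (Ii)
    epc: int = 0                     # exception PC register
    cause: int = 0                   # status register holding the interrupt cause
    interrupts_enabled: bool = True
    kernel_mode: bool = False


def take_interrupt(cpu: CPU, cause: int) -> bool:
    """Enter the handler precisely: everything before Ii is done, Ii has not started."""
    if not cpu.interrupts_enabled:
        return False                 # masked, the request stays pending
    cpu.epc = cpu.pc                 # save PC of Ii so it can be restarted
    cpu.cause = cause                # handler reads this to find the cause
    cpu.interrupts_enabled = False   # mask further interrupts until EPC is saved
    cpu.kernel_mode = True
    cpu.pc = HANDLER_PC
    return True


def rfe(cpu: CPU, saved_epc: int) -> None:
    """Return from exception: resume user code at the saved PC."""
    cpu.pc = saved_epc
    cpu.interrupts_enabled = True
    cpu.kernel_mode = False


cpu = CPU(pc=0x100)
take_interrupt(cpu, cause=3)
saved = cpu.epc                      # handler copies EPC into a GPR first
rfe(cpu, saved)
```

A nested interrupt would overwrite `epc`, which is why the handler copies it out before re-enabling interrupts.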

Synchronous Interrupts
• A synchronous interrupt (exception) is caused by a
particular instruction

• In general, the instruction cannot be completed and needs to be restarted after the exception has been handled
– requires undoing the effect of one or more partially executed instructions

• In the case of a system call trap, the instruction is considered to have been completed
– syscall is a special jump instruction involving a change to privileged kernel mode
– Handler resumes at instruction after system call

Exception Handling 5-Stage Pipeline
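[Figure: 5-stage pipeline: PC, Inst. Mem, D (Decode), E (+), M (Data Mem), W. Exception sources by stage: PC address exception (fetch), illegal opcode (decode), overflow (execute), data address exceptions (memory). Asynchronous interrupts arrive from outside the pipeline.]

• How to handle multiple simultaneous exceptions in different pipeline stages?
• How and where to handle external asynchronous interrupts?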
Exception Handling 5-Stage Pipeline
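[Figure: the same 5-stage pipeline with the commit point at the M stage. Exception flags (Exc D, Exc E, Exc M) and PCs (PC D, PC E, PC M) travel down the pipeline with each instruction and feed the Cause and EPC registers at the commit point. Asynchronous interrupts are injected at the commit point. Control signals: Select Handler PC, Kill F Stage, Kill D Stage, Kill E Stage, Kill Writeback.]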
Exception Handling 5-Stage Pipeline
• Hold exception flags in pipeline until commit point (M stage)

• Exceptions in earlier pipe stages override later exceptions for a given instruction

• Inject external interrupts at commit point (override others)

• If exception at commit: update Cause and EPC registers, kill all stages, inject handler PC into fetch stage
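
The four rules above can be sketched as a small function evaluated at the commit point. This is a minimal sketch under assumed data structures: the dictionary layout, the stage keys, and the names `select_cause` and `commit_point` are hypothetical and chosen only for illustration.

```python
STAGE_ORDER = ("F", "D", "E", "M")  # earliest pipeline stage first


def select_cause(exc_flags, interrupt_pending=False):
    """Pick the cause to report for the instruction at the commit point.

    exc_flags maps a stage name to the exception that stage raised for this
    instruction. An external interrupt overrides everything; otherwise the
    exception from the earliest stage wins.
    """
    if interrupt_pending:
        return "interrupt"
    for stage in STAGE_ORDER:
        if exc_flags.get(stage) is not None:
            return exc_flags[stage]
    return None


def commit_point(pipeline, regs, handler_pc, interrupt_pending=False):
    """Apply the commit-point rule to the instruction currently in M.

    pipeline maps stage name -> {"pc": int, "exc": dict} or None (bubble).
    Returns True if an exception or interrupt was taken.
    """
    instr = pipeline.get("M")
    if instr is None:
        return False  # assumption: nothing is taken on a bubble
    cause = select_cause(instr["exc"], interrupt_pending)
    if cause is None:
        return False
    regs["Cause"] = cause
    regs["EPC"] = instr["pc"]
    for stage in list(pipeline):
        pipeline[stage] = None        # kill all stages
    regs["fetch_pc"] = handler_pc     # inject handler PC into fetch
    return True


pipeline = {
    "F": {"pc": 108, "exc": {}},
    "D": {"pc": 104, "exc": {}},
    "E": {"pc": 100, "exc": {}},
    "M": {"pc": 96, "exc": {"E": "overflow"}},
}
regs = {}
taken = commit_point(pipeline, regs, handler_pc=0x80)
```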

Speculating on Exceptions
• Prediction mechanism
– Exceptions are rare, so simply predicting no exceptions is very
accurate!
• Check prediction mechanism
– Exceptions detected at end of instruction execution pipeline, special
hardware for various exception types
• Recovery mechanism
– Only write architectural state at commit point, so can throw away
partially executed instructions after exception
– Launch exception handler after flushing pipeline

• Bypassing allows use of uncommitted instruction results by following instructions
Exception Pipeline Diagram
time                      t0   t1   t2   t3   t4   t5   t6   t7   . . .
(I1) 096: ADD             IF1  ID1  EX1  MA1  nop                       overflow!
(I2) 100: XOR                  IF2  ID2  EX2  nop  nop
(I3) 104: SUB                       IF3  ID3  nop  nop  nop
(I4) 108: ADD                            IF4  nop  nop  nop  nop
(I5) Exc. Handler code                        IF5  ID5  EX5  MA5  WB5

Resource usage            t0   t1   t2   t3   t4   t5   t6   t7   . . .
IF                        I1   I2   I3   I4   I5
ID                             I1   I2   I3   nop  I5
EX                                  I1   I2   nop  nop  I5
MA                                       I1   nop  nop  nop  I5
WB                                            nop  nop  nop  nop  I5
Agenda
• Interrupts
• Out-of-Order Processors

Out-Of-Order (OOO) Introduction
Name   Frontend   Issue   Writeback   Commit
I4     IO         IO      IO          IO       Fixed Length Pipelines, Scoreboard
I2O2   IO         IO      OOO         OOO      Scoreboard
I2OI   IO         IO      OOO         IO       Scoreboard, Reorder Buffer, and Store Buffer
IO3    IO         OOO     OOO         OOO      Scoreboard and Issue Queue
IO2I   IO         OOO     OOO         IO       Scoreboard, Issue Queue, Reorder Buffer, and Store Buffer

(IO = in-order, OOO = out-of-order)
OOO Motivating Code Sequence
0 MUL R1, R2, R3
1 ADDIU R11,R10,1
2 MUL R5, R1, R4
3 MUL R7, R5, R6
4 ADDIU R12,R11,1
5 ADDIU R13,R12,1
6 ADDIU R14,R12,2

[Figure: dependence graph of the sequence: 0 → 2 → 3 and 1 → 4 → {5, 6}.]

• Two independent sequences of instructions enable flexibility in terms of how instructions are scheduled in total order
• We can schedule statically in software or dynamically in hardware
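
The two independent chains can be recovered mechanically from the register names. The sketch below computes read-after-write dependences for the sequence above; the tuple encoding of the program and the function name `raw_dependences` are assumptions made for this example.

```python
# (opcode, destination register, source registers) for instructions 0..6
PROGRAM = [
    ("MUL",   "R1",  ["R2", "R3"]),
    ("ADDIU", "R11", ["R10"]),
    ("MUL",   "R5",  ["R1", "R4"]),
    ("MUL",   "R7",  ["R5", "R6"]),
    ("ADDIU", "R12", ["R11"]),
    ("ADDIU", "R13", ["R12"]),
    ("ADDIU", "R14", ["R12"]),
]


def raw_dependences(program):
    """Map each instruction index to the earlier instructions it reads from."""
    last_writer = {}   # register name -> index of most recent writer
    deps = {}
    for i, (_, dest, srcs) in enumerate(program):
        deps[i] = sorted({last_writer[s] for s in srcs if s in last_writer})
        last_writer[dest] = i
    return deps


deps = raw_dependences(PROGRAM)
for i, producers in deps.items():
    print(i, "depends on", producers)
```

Instructions 0, 2, 3 form the multiply chain and 1, 4, 5, 6 form the add chain; no edge crosses between them, which is what gives a scheduler freedom to interleave.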

I4: In-Order Front-End, Issue,
Writeback, Commit

F D X M W

I4: In-Order Front-End, Issue,
Writeback, Commit

        X0  X1
F   D           W
        M0  M1
I4: In-Order Front-End, Issue,
Writeback, Commit (4-stage MUL)
        X0  X1  X2  X3
F   D   M0  M1  X2  X3   W
        Y0  Y1  Y2  Y3

To avoid increasing CPI, this design needs full bypassing, which can be expensive. To help cycle time, add an Issue stage where the register file is read and the instruction is “issued” to a functional unit.
I4: In-Order Front-End, Issue,
Writeback, Commit (4-stage MUL)
            X0  X1  X2  X3
F   D   I   M0  M1  X2  X3   W
            Y0  Y1  Y2  Y3

Accesses to each structure, in pipeline order (R = read, W = write):
ARF: R, W
SB: R/W, W
Basic Scoreboard
      P   F   Data Avail.
              4 3 2 1 0
R1
R2
R3
...
R31

P: Pending, write to destination in flight
F: Which functional unit is writing the register
Data Avail.: Where the write data is in the functional unit pipeline

• A one in column ‘i’ of Data Avail. means that the result data is in stage ‘i’ of functional unit F
• Can use F and Data Avail. fields to determine when to bypass and where to bypass from
• A one in column zero means that, in that cycle, the functional unit is in the Writeback stage
• Bits in Data Avail. field shift right every cycle.
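
The shifting behaviour of the Data Avail. field can be sketched as one small integer per register. This is an illustrative model only: the class layout, the method names, and the choice to clear P when the bit shifts out of column zero are assumptions, not details given on the slide.

```python
class Scoreboard:
    """One (P, F, Data Avail.) entry per architectural register."""

    def __init__(self, num_regs=32, depth=5):
        self.depth = depth                 # columns 4..0 on the slide
        self.pending = [False] * num_regs  # P
        self.unit = [None] * num_regs      # F
        self.avail = [0] * num_regs        # bit i set <=> result is in column i

    def issue(self, reg, unit, column):
        """Record that `unit` will produce `reg`, currently in `column`."""
        self.pending[reg] = True
        self.unit[reg] = unit
        self.avail[reg] = 1 << column

    def columns(self, reg):
        """Columns (high to low) whose bit is set for `reg`."""
        return [i for i in range(self.depth - 1, -1, -1)
                if (self.avail[reg] >> i) & 1]

    def bypass_source(self, reg):
        """Return (unit, column) to bypass from, or None if the value is in the ARF."""
        cols = self.columns(reg)
        if not self.pending[reg] or not cols:
            return None
        return (self.unit[reg], cols[0])

    def tick(self):
        """Advance one cycle: every Data Avail. field shifts right."""
        for r in range(len(self.avail)):
            if self.avail[r] & 1:          # was in Writeback this cycle
                self.pending[r] = False    # assumption: P clears after writeback
                self.unit[r] = None
            self.avail[r] >>= 1


sb = Scoreboard()
sb.issue(reg=1, unit="Y", column=4)        # MUL writing R1 enters Y0
for _ in range(4):
    sb.tick()
# After four shifts the bit sits in column 0: the multiplier is in Writeback.
```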
Cycle               0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18
0 MUL R1, R2, R3    F  D  I  Y0 Y1 Y2 Y3 W
1 ADDIU R11,R10,1      F  D  I  X0 X1 X2 X3 W
2 MUL R5, R1, R4          F  D  I  I  I  Y0 Y1 Y2 Y3 W
3 MUL R7, R5, R6             F  D  D  D  I  I  I  I  Y0 Y1 Y2 Y3 W
4 ADDIU R12,R11,1               F  F  F  D  D  D  D  I  X0 X1 X2 X3 W
5 ADDIU R13,R12,1                        F  F  F  F  D  I  X0 X1 X2 X3 W
6 ADDIU R14,R12,2                                    F  D  I  X0 X1 X2 X3 W

Cyc  D  I   4  3  2  1  0   Dest Regs
1    0
2    1  0
3    2  1   1               R1
4           1  1            R11
5              1  1
6    3  2         1  1
7           1        1  1   R5
8              1        1
9                 1
10   4  3            1
11   5  4   1           1   R7
12   6  5   1  1            R12
13      6   1  1  1         R13
14          1  1  1  1      R14
15             1  1  1  1
16                1  1  1
17                   1  1
18                      1

Red (in the original slide) indicates that, if we look at the F field, we can bypass on this cycle.
I2O2: In-order Frontend/Issue, Out-of-
order Writeback/Commit
            X0
F   D   I   M0  M1           W
            Y0  Y1  Y2  Y3

Accesses to each structure, in pipeline order (R = read, W = write):
ARF: R, W
SB: R, R/W, W
I2O2 Scoreboard
• Similar to I4, but we can now use it to track structural hazards on the Writeback port
• Set bit in Data Avail. according to the length of the pipeline
• The architecture conservatively avoids WAW hazards by stalling in Decode, so the current scoreboard is sufficient. A more complicated scoreboard is needed for processing WAW hazards.
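
The Writeback-port check can be sketched as a single bit test against the Data Avail. fields. The timing convention here is an assumption: an instruction that issues this cycle into a pipeline with `pipe_len` execute stages is taken to reach Writeback `pipe_len + 1` cycles later, so it collides with any in-flight result whose bit is currently in column `pipe_len + 1`. The function name is hypothetical.

```python
def writeback_conflict(avail_fields, pipe_len):
    """True if issuing now would need the Writeback port in an occupied cycle.

    avail_fields: iterable of Data Avail. bit vectors (bit i = column i),
                  one per register with a write in flight.
    pipe_len:     number of execute stages of the target functional unit
                  (1 for the single-stage ALU, 4 for the multiplier).
    """
    wanted_column = pipe_len + 1
    return any((field >> wanted_column) & 1 for field in avail_fields)


# In-flight results with bits in columns 2, 1, and 0 (an assumed snapshot).
snapshot = [1 << 2, 1 << 1, 1 << 0]
print(writeback_conflict(snapshot, pipe_len=1))   # ALU op would collide: stall in I

# One cycle later every bit has shifted right by one.
shifted = [field >> 1 for field in snapshot]
print(writeback_conflict(shifted, pipe_len=1))    # port is free: issue
```

With five columns (4 down to 0), a four-stage multiply would test column 5, which never holds a bit, so under this convention only shorter pipelines can be delayed by the check.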

Cycle               0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16
0 MUL R1, R2, R3    F  D  I  Y0 Y1 Y2 Y3 W
1 ADDIU R11,R10,1      F  D  I  X0 W
2 MUL R5, R1, R4          F  D  I  I  I  Y0 Y1 Y2 Y3 W
3 MUL R7, R5, R6             F  D  D  D  I  I  I  I  Y0 Y1 Y2 Y3 W
4 ADDIU R12,R11,1               F  F  F  D  D  D  D  I  X0 W
5 ADDIU R13,R12,1                        F  F  F  F  D  I  X0 W
6 ADDIU R14,R12,2                                    F  D  I  I  X0 W

Cyc  D  I   4  3  2  1  0   Dest Regs
1    0
2    1  0
3    2  1   1               R1
4              1     1      R11
5                 1     1
6    3  2            1
7           1           1   R5
8              1
9                 1
10   4  3            1
11   5  4   1           1   R7
12   6  5      1     1      R12
13                1  1  1   R13
14      6            1  1
15                   1  1   R14
16                      1
17
18

Red (in the original slide) indicates that, if we look at the F field, we can bypass on this cycle.

Structural hazard: instruction 6 writes with two-cycle latency, so issuing it in cycle 13 would put it in Writeback in cycle 15 together with instruction 3. It therefore waits one extra cycle in I.
Early Commit Point?
0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 /
1 ADDIU R11,R10,1 F D I X0 W /
2 MUL R5, R1, R4 F D I I I /
3 MUL R7, R5, R6 F D D D /
4 ADDIU R12,R11,1 F F F /
5 ADDIU R13,R12,1 /
6 ADDIU R14,R12,2

• Limits certain types of exceptions.

I2OI: In-order Frontend/Issue, Out-of-
order Writeback, In-order Commit
            X0
F   D   I   L0  L1           W   C
            S0
            Y0  Y1  Y2  Y3

Structures: SB, PRF, ROB, FSB, ARF

Accesses to each structure, in pipeline order (R = read, W = write):
ARF: W
SB: R/W, W
PRF: R, W
ROB: R/W, W, R/W
FSB: W, R/W

PRF = Physical Register File (Future File), ROB = Reorder Buffer, FSB = Finished Store Buffer (1 entry)
Reorder Buffer (ROB)
State  S  ST  V  Preg
--                       Next instruction allocates here in D
P      1                 Tail of ROB
F      1                 Speculative because branch is in flight
P      1
P
F                        Instruction wrote ROB out of order
P
P                        Head of ROB
--
--

State: {Free, Pending, Finished}
S: Speculative
ST: Store bit
V: Physical Register File Specifier Valid
Preg: Physical Register File Specifier

Commit stage is waiting for Head of ROB to be finished
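
The head and tail behaviour can be sketched as a circular buffer. This is a minimal sketch: the class name, the four-entry default size, and the method names are assumptions for illustration, and the S, ST, and V bits are omitted to keep it short.

```python
class ReorderBuffer:
    """Circular buffer: allocate at the tail in program order, commit at the head."""

    def __init__(self, size=4):
        self.state = ["Free"] * size   # Free, Pending, or Finished
        self.preg = [None] * size      # physical register specifier, if any
        self.head = 0                  # oldest in-flight instruction
        self.tail = 0                  # next entry to allocate

    def allocate(self, preg=None):
        """Called in Decode. Returns the entry index, or None if the ROB is full."""
        if self.state[self.tail] != "Free":
            return None                # no free entry: Decode must stall
        idx = self.tail
        self.state[idx] = "Pending"
        self.preg[idx] = preg
        self.tail = (self.tail + 1) % len(self.state)
        return idx

    def finish(self, idx):
        """Called at Writeback, possibly out of program order."""
        self.state[idx] = "Finished"

    def commit(self):
        """Called in Commit. Frees the head only once it is Finished."""
        if self.state[self.head] != "Finished":
            return None                # Commit waits for the head
        idx = self.head
        self.state[idx] = "Free"
        self.preg[idx] = None
        self.head = (self.head + 1) % len(self.state)
        return idx


rob = ReorderBuffer()
a = rob.allocate("p1")
b = rob.allocate("p2")
rob.finish(b)             # younger instruction writes back first
print(rob.commit())       # head is still Pending, nothing commits
rob.finish(a)
print(rob.commit(), rob.commit())
```

Out-of-order writeback only changes entry state; architectural state is updated strictly from the head, which is what keeps exceptions precise.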
Finished Store Buffer (FSB)
V Op Addr Data
--

• Only need one entry if we only support one memory instruction in flight at a time.
• Single-entry FSB makes allocation trivial.
• If we support more than one memory instruction, we need to worry about Load/Store address aliasing.
0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 W C
1 ADDIU R11,R10,1 F D I X0 W r C
2 MUL R5, R1, R4 F D I I I Y0 Y1 Y2 Y3 W C
3 MUL R7, R5, R6 F D D D I I I I Y0 Y1 Y2 Y3 W C
4 ADDIU R12,R11,1 F F F D D D D I X0 W r C
5 ADDIU R13,R12,1 F F F F D I X0 W r C
6 ADDIU R14,R12,2 F D I I X0 W r C

Cyc D I ROB 0 1 2 3
0 Empty = free entry in ROB
1 0
2 1 0 R1 State of ROB at beginning of cycle
3 2 1 R11
4 R5 Pending entry in ROB
5
6 3 2 R11 Circle=Finished (Cycle after W)
7 R7
8 R1
9 Last cycle before entry is freed from ROB
10 4 3
(Cycle in C stage)
11 5 4 R12
12 6 5 R13 R5
13 R14
14 6 R12
15 R13
16 R7 Entry becomes free and is freed
17 R14 on next cycle
18
19
What if First Instruction Causes an
Exception?
0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 W /
1 ADDIU R11,R10,1 F D I X0 W r -- /
2 MUL R5, R1, R4 F D I I I Y0 /
3 MUL R7, R5, R6 F D D D I /
4 ADDIU R12,R11,1 F F F D /
F D I. . .

What About Branches?
Option 1: Squash instructions earlier. Has more complexity. ROB needs many ports.
0 BEQZ R1, target F D I X0 W C
1 ADDIU R11,R10,1 F D I -
2 ADDIU R5, R1, R4 F D -
3 ADDIU R7, R5, R6 F -
T ADDIU R12,R11,1 F D I . . .

Option 2: Squash instructions in ROB when Branch commits.
0 BEQZ R1, target F D I X0 W C
1 ADDIU R11,R10,1 F D I X0 /
2 ADDIU R5, R1, R4 F D I /
3 ADDIU R7, R5, R6 F D /
T ADDIU R12,R11,1 F D I . . .

Option 3: Wait for speculative instructions to reach the Commit stage and squash in Commit stage.
0 BEQZ R1, target F D I X0 W C
1 ADDIU R11,R10,1 F D I X0 W /
2 ADDIU R5, R1, R4 F D I X0 W /
3 ADDIU R7, R5, R6 F D I X0 W /
T ADDIU R12,R11,1 F D I X0 W C
What About Branches?
• Three possible designs with decreasing
complexity based on when to squash speculative
instructions and de-allocate ROB entry:
1. As soon as branch resolves
2. When branch commits
3. When speculative instructions reach commit

• Base design only allows one branch at a time. Second branch stalls in decode. Can add more bits to track multiple in-flight branches.
Avoiding Stalling Commit on Store
Miss
[Figure: back end of the pipeline with stages W, C, and R and structures PRF, ROB, FSB, ARF, and CSB.]
CSB = Committed Store Buffer

0 OpA F D I X0 W C
1 SW F D I S0 W C C C C
2 OpB F D I X0 W W W W C
3 OpC F D I X X X X W C
4 OpD F D I I I I X W C

With Retire Stage


0 OpA F D I X0 W C
1 SW F D I S0 W C R R R
2 OpB F D I X0 W C
3 OpC F D I X W C
4 OpD F D I X W C
IO3: In-order Frontend, Out-of-order
Issue/Writeback/Commit
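[Figure: IO3 pipeline: F, D, then the issue stage (I) with its issue queue (IQ), feeding functional units X0, M0 M1, and Y0 Y1 Y2 Y3, then W. SB and ARF are accessed as before.]

Accesses to each structure, in pipeline order (R = read, W = write):
ARF: R, W
SB: R, R/W, W
IQ: W, R/W, W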
Issue Queue (IQ)
Fields: Op | Imm | S | V Dest | V P Src0 | V P Src1

Op: Opcode
Imm.: Immediate
S: Speculative Bit
V: Valid (Instruction has corresponding Src/Dest)
P: Pending (Waiting on operands to be produced)

Instruction Ready = (!Vsrc0 || !Psrc0) && (!Vsrc1 || !Psrc1) && no structural hazards

• For high performance, factor in bypassing
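
The ready equation translates directly into code. In the sketch below, the `IQEntry` layout and the oldest-first selection policy in `select` are assumptions made for illustration; the slide only defines the ready condition itself.

```python
from dataclasses import dataclass


@dataclass
class IQEntry:
    op: str
    v_src0: bool   # V: instruction has a Src0 operand
    p_src0: bool   # P: Src0 is still waiting to be produced
    v_src1: bool
    p_src1: bool


def is_ready(entry, structural_hazard=False):
    """Instruction Ready = (!Vsrc0 || !Psrc0) && (!Vsrc1 || !Psrc1) && no structural hazards."""
    src0_ok = (not entry.v_src0) or (not entry.p_src0)
    src1_ok = (not entry.v_src1) or (not entry.p_src1)
    return src0_ok and src1_ok and not structural_hazard


def select(queue):
    """Pick the oldest ready entry (index 0 is oldest); an assumed policy."""
    for index, entry in enumerate(queue):
        if is_ready(entry):
            return index
    return None


queue = [
    IQEntry("MUL",   v_src0=True, p_src0=True,  v_src1=True,  p_src1=False),  # waits on Src0
    IQEntry("ADDIU", v_src0=True, p_src0=False, v_src1=False, p_src1=False),  # ready
]
print(select(queue))   # the younger ADDIU issues ahead of the stalled MUL
```

An operand that is not valid (for example the missing second source of `ADDIU`) never blocks issue, which is why each term tests `!V` before `!P`.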


Centralized vs. Distributed Issue Queue
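[Figure: two pipelines side by side. Centralized: F, D, then a single issue stage with one issue queue (IQ) feeding X0, M0, and Y0. Distributed: F, D, then separate issue stages with their own issue queues (IQ A, IQ B) in front of the functional units X0, M0, and Y0.]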
Advanced Scoreboard
      P   Data Avail.
          4 3 2 1 0
R1
R2
R3
...
R31

P: Pending, write to destination in flight
Data Avail.: Where the write data is in the pipeline and which functional unit

• Data Avail. now contains the functional unit identifier
• A non-empty value in column zero means that, in that cycle, the functional unit is in the Writeback stage
• Bits in Data Avail. field shift right every cycle.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 W
1 ADDIU R11,R10,1 F D I X0 W
2 MUL R5, R1, R4 F D i I Y0 Y1 Y2 Y3 W
3 MUL R7, R5, R6 F D i I Y0 Y1 Y2 Y3 W
4 ADDIU R12,R11,1 F D i I X0 W
5 ADDIU R13,R12,1 F D i I X0 W
6 ADDIU R14,R12,2 F D i I X0 W

Cyc D I IQ 0 1 2
0
1 0 Dest/Src0/Src1, Circle denotes value
2 1 0 R1/R2/R3 present in ARF
3 2 1 R11/R10
4 3 R5/R1/R4
5 4 R7/R5/R6 Value bypassed so no circle, present
6 5 2 R12/R11 bit
7 6 4 R13/R12 Value set present by
8 5 R14/R12 Instruction 1 in cycle 5, W
9 Stage
10 3
11 6 R14/R12
12
13
14
Assume All Instructions in Issue Queue
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 MUL R1, R2, R3 F D i I Y0 Y1 Y2 Y3 W
1 ADDIU R11,R10,1 F D i I X0 W
2 MUL R5, R1, R4 F D i I Y0 Y1 Y2 Y3 W
3 MUL R7, R5, R6 F D i I Y0 Y1 Y2 Y3 W
4 ADDIU R12,R11,1 F D i I X0 W
5 ADDIU R13,R12,1 F D i I X0 W
6 ADDIU R14,R12,2 F D i I X0 W

• Better performance than previous?

IO2I: In-order Frontend, Out-of-order
Issue/Writeback, In-order Commit
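[Figure: IO2I pipeline: F, D, then the issue stage (I) with its issue queue (IQ), feeding functional units X0, L0 L1, S0, and Y0 Y1 Y2 Y3, then W and C. Structures: SB, PRF, ROB, FSB, ARF.]

Accesses to each structure, in pipeline order (R = read, W = write):
ARF: W
SB: R/W, W
PRF: R, W
ROB: R/W, W, R/W
FSB: W, R/W
IQ: W, R/W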
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 W C
1 ADDIU R11,R10,1 F D I X0 W r C
2 MUL R5, R1, R4 F D i I Y0 Y1 Y2 Y3 W C
3 MUL R7, R5, R6 F D i I Y0 Y1 Y2 Y3 W C
4 ADDIU R12,R11,1 F D i I X0 W r C
5 ADDIU R13,R12,1 F D i I X0 W r C
6 ADDIU R14,R12,2 F D i I X0 W r C

Out-of-order 2-Wide Superscalar
with 1 ALU
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 W C
1 ADDIU R11,R10,1 F D I X0 W r C
2 MUL R5, R1, R4 F D i I Y0 Y1 Y2 Y3 W C
3 MUL R7, R5, R6 F D i I Y0 Y1 Y2 Y3 W C
4 ADDIU R12,R11,1 F D I X0 W r C
5 ADDIU R13,R12,1 F D i I X0 W r C
6 ADDIU R14,R12,2 F D i I X0 W r C

Acknowledgements
• These slides contain material developed and copyrighted by:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)
– Christopher Batten (Cornell)

• MIT material derived from course 6.823


• UCB material derived from course CS252 & CS152
• Cornell material derived from course ECE 4750

Copyright © 2013 David Wentzlaff
