L05-PipeliningII

The document discusses advanced topics in computer architecture, specifically focusing on pipelining, exception handling, and the architecture of Intel's Core i9-13900KS CPU. It covers various aspects of pipelining, including hazards, asynchronous interrupts, and the R4000 architecture, while also introducing concepts like loop unrolling and decoupling for performance optimization. Additionally, it includes administrative details for a course, CS152/252A, and a brief history of supercomputers, highlighting the CDC 6600 as a significant early example.

http://inst.eecs.berkeley.edu/~cs152

CS 152/252A Computer Architecture and Engineering
Sophia Shao
Lecture 5 – Pipelining II
Intel Launches $699 Core i9-13900KS, the World's First 6 GHz 320W CPU: Available Now

It may not be quite the 30 GHz that Intel CEO Pat Gelsinger predicted back in 2002, but Intel's Core i9-13900KS Special Edition processor, which is available on shelves today for $699, is the world's first consumer CPU to run at 6 GHz without overclocking. With a whopping 250W base power specification, it's also now officially the most power-hungry desktop CPU in history — it peaks at 320W in a new Extreme Power Delivery Profile.

https://www.tomshardware.com/news/intel-launches-dollar699-core-i9-13900ks-the-worlds-first-6-ghz-cpu-available-now
https://www.anandtech.com/show/18728/the-intel-core-i9-13900ks-review-taking-intel-s-raptor-lake-to-6-ghz
Last Time in Lecture 3
§ Iron law of performance:
– time/program = insts/program * cycles/inst * time/cycle
§ Classic 5-stage RISC pipeline
§ Structural, data, and control hazards
§ Structural hazards handled with interlock or more hardware
§ Data hazards include RAW, WAR, WAW
– Handle data hazards with interlock, bypass, or speculation
§ Control hazards (branches, interrupts) are the most difficult, as they change which instruction is next
– Branch prediction commonly used
§ Precise traps: stop cleanly on one instruction, all previous
instructions completed, no following instructions have
changed architectural state
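The iron law lends itself to a quick back-of-the-envelope check. A minimal Python sketch, with made-up numbers for instruction count, CPI, and clock period:

```python
def execution_time(insts, cpi, cycle_time_s):
    # Iron law: time/program = insts/program * cycles/inst * time/cycle
    return insts * cpi * cycle_time_s

# Hypothetical program: 1e9 instructions, CPI of 1.5, 1 GHz clock (1 ns/cycle)
t = execution_time(1_000_000_000, 1.5, 1e-9)
print(t)  # ~1.5 seconds
```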

2
Asynchronous Interrupts
§ An I/O device requests attention by asserting one
of the prioritized interrupt request lines

§ When the processor decides to process the interrupt
– It stops the current program at instruction Ii, completing all the instructions up to Ii-1 (precise interrupt)
– It saves the PC of instruction Ii in a special register (EPC)
– It disables interrupts and transfers control to a designated interrupt handler running in supervisor mode
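As a concrete illustration, here is a toy Python model of those three steps; the register names (EPC, supervisor-mode flag) follow the slide, but the class itself is a hedged sketch, not real hardware behavior:

```python
class CPU:
    def __init__(self):
        self.pc = 0
        self.epc = None
        self.interrupts_enabled = True
        self.supervisor_mode = False

    def take_interrupt(self, handler_pc):
        # Instructions up to Ii-1 have completed; Ii has not (precise interrupt).
        # Save the PC of the interrupted instruction Ii in EPC.
        self.epc = self.pc
        # Disable interrupts and jump to the handler in supervisor mode.
        self.interrupts_enabled = False
        self.supervisor_mode = True
        self.pc = handler_pc

cpu = CPU()
cpu.pc = 0x400                    # PC of interrupted instruction Ii
cpu.take_interrupt(0x80000000)    # hypothetical handler address
print(hex(cpu.epc), cpu.supervisor_mode)  # 0x400 True
```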

3
Synchronous Trap
§ A synchronous trap is caused by an exception on a particular instruction
§ In general, the instruction cannot be completed and needs to be restarted after the exception has been handled
– requires undoing the effect of one or more partially executed instructions
§ In the case of a system call trap, the instruction is considered to have been completed
– a special jump instruction involving a change to a privileged mode
4
Trap:
altering the normal flow of control

[Diagram: the program executes Ii-1, Ii, Ii+1; a trap at Ii transfers control to handler instructions HI1, HI2, ..., HIn.]

An external or internal event that needs to be processed by another (system) program. The event is usually unexpected or rare from the program's point of view.
5
Exception Handling 5-Stage Pipeline

[Diagram: five-stage pipeline PC → Inst. Mem → Decode → Execute (+) → Data Mem → Writeback, annotated with exception sources: PC address exception, illegal opcode, overflow, data address exceptions, and asynchronous interrupts.]

§ How to handle multiple simultaneous exceptions in different pipeline stages?
§ How and where to handle external asynchronous interrupts?
6
Exception Handling 5-Stage Pipeline

[Diagram: the same five-stage pipeline with the commit point at the Data Mem (M) stage. Exception flags (Exc) and PCs are carried in latches through the D, E, and M stages; at commit, the Cause and EPC registers are written, the handler PC is selected for fetch, kill signals flush the F, D, and E stages and the writeback, and asynchronous interrupts are injected.]
7
Exception Handling 5-Stage Pipeline
§ Hold exception flags in pipeline until commit point (M stage)
§ Exceptions in earlier pipe stages override later exceptions for a given instruction
§ Inject external interrupts at commit point (override others)
§ If trap at commit: update Cause and EPC registers, kill all stages, inject handler PC into fetch stage
8
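The priority rules above can be sketched in a few lines of Python; the stage names and cause strings are illustrative, not an actual simulator:

```python
# Earlier pipe stages take priority for a given instruction's exceptions.
STAGE_ORDER = ["F", "D", "E", "M"]

def select_exception(flags, external_interrupt=False):
    """flags: dict mapping pipeline stage -> exception cause (or absent)."""
    if external_interrupt:
        return "external-interrupt"   # injected at commit, overrides others
    for stage in STAGE_ORDER:
        if flags.get(stage):
            return flags[stage]       # earliest-stage exception wins
    return None

# Instruction with both a decode-stage and a memory-stage exception:
print(select_exception({"D": "illegal-opcode", "M": "data-address"}))  # illegal-opcode
```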
Speculating on Exceptions
§ Prediction mechanism
– Exceptions are rare, so simply predicting no exceptions is very
accurate!
§ Check prediction mechanism
– Exceptions detected at end of instruction execution pipeline,
special hardware for various exception types
§ Recovery mechanism
– Only write architectural state at commit point, so can throw away
partially executed instructions after exception
– Launch exception handler after flushing pipeline

§ Bypassing allows use of uncommitted instruction results by following instructions

9
Deeper Pipelines: MIPS R4000
Commit Point

Direct-mapped I$ allows use of instruction before tag check complete

Figure C.36 The eight-stage pipeline structure of the R4000 uses pipelined
instruction and data caches. The pipe stages are labeled and their detailed
function is described in the text. The vertical dashed lines represent the stage
boundaries as well as the location of pipeline latches. The instruction is actually
available at the end of IS, but the tag check is done in RF, while the registers are
fetched. Thus, we show the instruction memory as operating through RF. The TC
stage is needed for data memory access, because we cannot write the data into
the register until we know whether the cache access was a hit or not.

© 2018 Elsevier Inc. All rights reserved.


10
R4000 Load-Use Delay
Direct-mapped D$ allows use of
data before tag check complete

Figure C.37 The structure of the R4000 integer pipeline leads to a two-cycle load delay. A two-cycle delay is possible because the data value is available at the end of DS and can be bypassed. If the tag check in TC indicates a miss, the pipeline is backed up a cycle, when the correct data are available.
© 2018 Elsevier Inc. All rights reserved.
11
R4000 Branches

Figure C.39 The basic branch delay is three cycles, because the
condition evaluation is performed during EX.

© 2018 Elsevier Inc. All rights reserved.


12
Simple vector-vector add code example

# for(i=0; i<N; i++)
#   A[i] = B[i]+C[i];

loop: fld    f0, 0(x2)    // x2 points to B
      fld    f1, 0(x3)    // x3 points to C
      fadd.d f2, f0, f1
      fsd    f2, 0(x1)    // x1 points to A
      addi   x1, x1, 8    // Bump pointer
      addi   x2, x2, 8    // Bump pointer
      addi   x3, x3, 8    // Bump pointer
      bne    x1, x4, loop // x4 holds end

13
Simple Pipeline Scheduling

Can reschedule code to try to reduce pipeline hazards

loop: fld    f0, 0(x2)    // x2 points to B
      fld    f1, 0(x3)    // x3 points to C
      addi   x3, x3, 8    // Bump pointer
      addi   x2, x2, 8    // Bump pointer
      fadd.d f2, f0, f1
      addi   x1, x1, 8    // Bump pointer
      fsd    f2, -8(x1)   // x1 points to A
      bne    x1, x4, loop // x4 holds end

Long-latency loads and floating-point operations limit parallelism within a single loop iteration
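To see why rescheduling helps, the following sketch counts stall cycles for both orderings under a toy in-order pipeline model; the latencies are made-up illustrative numbers, not the real machine's:

```python
def stall_cycles(program, latency):
    """Count stalls for a single-issue in-order pipeline with full bypassing."""
    ready = {}      # register -> cycle its value is available
    cycle = 0
    stalls = 0
    for op, dest, srcs in program:
        start = max([cycle] + [ready.get(r, 0) for r in srcs])
        stalls += start - cycle
        cycle = start + 1                 # one issue per cycle
        if dest:
            ready[dest] = start + latency[op]
    return stalls

# Made-up use latencies: loads 2 cycles, FP add 3, everything else 1.
latency = {"fld": 2, "fadd.d": 3, "fsd": 1, "addi": 1, "bne": 1}
original = [("fld", "f0", []), ("fld", "f1", []),
            ("fadd.d", "f2", ["f0", "f1"]), ("fsd", None, ["f2"]),
            ("addi", "x1", []), ("addi", "x2", []), ("addi", "x3", []),
            ("bne", None, [])]
scheduled = [("fld", "f0", []), ("fld", "f1", []),
             ("addi", "x3", []), ("addi", "x2", []),
             ("fadd.d", "f2", ["f0", "f1"]), ("addi", "x1", []),
             ("fsd", None, ["f2"]), ("bne", None, [])]
print(stall_cycles(original, latency), stall_cycles(scheduled, latency))  # 3 1
```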
14
One way to reduce hazards: Loop Unrolling
Can unroll to expose more parallelism, reduce dynamic instruction count

loop: fld    f0, 0(x2)     // x2 points to B
      fld    f1, 0(x3)     // x3 points to C
      fld    f10, 8(x2)
      fld    f11, 8(x3)
      addi   x3, x3, 16    // Bump pointer
      addi   x2, x2, 16    // Bump pointer
      fadd.d f2, f0, f1
      fadd.d f12, f10, f11
      addi   x1, x1, 16    // Bump pointer
      fsd    f2, -16(x1)   // x1 points to A
      fsd    f12, -8(x1)
      bne    x1, x4, loop  // x4 holds end

§ Unrolling limited by number of architectural registers
§ Unrolling increases instruction cache footprint (more static instructions)
§ More complex code generation for compiler, which has to understand pointers
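The same transformation in a high-level language, as a hedged sketch (the slide's assembly implicitly assumes the trip count is a multiple of two; the cleanup code below handles the general case):

```python
def vadd_unrolled(A, B, C):
    """A[i] = B[i] + C[i], unrolled by two: fewer branches executed per element."""
    n = len(A)
    i = 0
    while i + 1 < n:                    # two elements per loop trip
        A[i] = B[i] + C[i]
        A[i + 1] = B[i + 1] + C[i + 1]
        i += 2
    if i < n:                           # cleanup iteration for odd n
        A[i] = B[i] + C[i]
    return A

print(vadd_unrolled([0, 0, 0], [1, 2, 3], [10, 20, 30]))  # [11, 22, 33]
```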
15
Alternative Approach: Decoupling
(lookahead, runahead) in µarchitecture
Can separate control and memory address operations from
data computations:

loop: fld    f0, 0(x2)    // x2 points to B
      fld    f1, 0(x3)    // x3 points to C
      fadd.d f2, f0, f1
      fsd    f2, 0(x1)    // x1 points to A
      addi   x1, x1, 8    // Bump pointer
      addi   x2, x2, 8    // Bump pointer
      addi   x3, x3, 8    // Bump pointer
      bne    x1, x4, loop // x4 holds end

The control and address operations do not depend on the data computations, so they can be computed early relative to the data computations, which can be delayed until later.
CS252 16
Simple Decoupled Access/Execute Machine
[Diagram: an integer pipeline (F D X M W) runs ahead, placing load-data-writeback, compute, and store-data-read µops into a µop queue for the floating-point pipeline (R X1 X2 X3 W). Load addresses are checked against queued pending store addresses; a load data queue, store address queue, and store data queue decouple the two pipelines.]
CS252 17
Decoupled Execution

fld f0    Send load to memory, queue up write to f0
fld f1    Send load to memory, queue up write to f1
fadd.d    Queue up fadd.d
fsd f2    Queue up store address, wait for store data
addi x1   Bump pointer
addi x2   Bump pointer
addi x3   Bump pointer
bne       Take branch
fld f0    Send load to memory, queue up write to f0
fld f1    Send load to memory, queue up write to f1
fadd.d    Queue up fadd.d
fsd f2    Queue up store address, wait for store data

Load addresses are checked against queued pending store addresses; many writes to f0 can be in the queue at the same time.
CS252 18
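The queueing behavior traced above can be mimicked with a toy Python model; the queue names follow the previous slide, but the code is an illustrative sketch, not the actual machine:

```python
from collections import deque

def run_decoupled(B, C):
    """Toy decoupled access/execute: access side runs ahead, FP side drains queues."""
    load_data = deque()   # load data queue
    fp_ops = deque()      # µop queue for the floating-point pipeline
    results = []
    # Access side: issue all loads and queue compute µops first (runahead).
    for b, c in zip(B, C):
        load_data.append(b)
        load_data.append(c)
        fp_ops.append("fadd.d")
    # Execute side: drain the queues in program order.
    while fp_ops:
        fp_ops.popleft()
        x, y = load_data.popleft(), load_data.popleft()
        results.append(x + y)
    return results

print(run_decoupled([1, 2], [10, 20]))  # [11, 22]
```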
CS152 Administrivia
§ HW1 released
– Due Feb 02
§ Lab1 released
– Due Feb 09
§ Lab reports must be readable English summaries – not dumps of log files!
– We will reward good reports, and penalize undecipherable reports
– Page limit (check lab spec/Ed)
§ Lecture Ed thread
– One thread per lecture
– Post your questions following the format:
• [Slide #] Your question
– The staff team will address and clarify the questions asynchronously.
§ Guest lecture next Tuesday
– Prefetching
19
CS252 Administrivia
§ CS252 Readings on
– https://ucb-cs252-sp23.hotcrp.com/u/0/
– Use hotcrp to upload reviews before Wednesday:
• Write one paragraph on main content of paper including good/bad
points of paper
• Also, answer/ask 1-3 questions about paper for discussion
• First two “360 Architecture”, “VAX11-780”
– 2-3pm Wednesday, Soda 606/Zoom
§ CS252 Project Timeline
– Proposal Wed Feb 22
– Use 252A GSIs (Abe and Prashanth) and my OHs to get feedback.

CS252 20
Supercomputers
Definitions of a supercomputer:
§ Fastest machine in world at given task
§ A device to turn a compute-bound problem into an I/O
bound problem
§ Any machine costing $30M+
§ Any machine designed by Seymour Cray

§ CDC6600 (Cray, 1964) regarded as first supercomputer

21
CDC 6600 Seymour Cray, 1964
§ A fast pipelined machine with 60-bit words
– 128 Kword main memory capacity, 32 banks
§ Ten functional units (parallel, unpipelined)
– Floating Point: adder, 2 multipliers, divider
– Integer: adder, 2 incrementers, ...
§ Hardwired control (no microcoding)
§ Scoreboard for dynamic scheduling of instructions
§ Ten Peripheral Processors for Input/Output
– a fast multi-threaded 12-bit integer ALU
§ Very fast clock, 10 MHz (FP add in 4 clocks)
§ >400,000 transistors, 750 sq. ft., 5 tons, 150 kW,
novel freon-based technology for cooling
§ Fastest machine in world for 5 years (until 7600)
– over 100 sold ($7-10M each)

22
CDC 6600:
A Load/Store Architecture

• Separate instructions to manipulate three types of registers:
  • 8 x 60-bit data registers (X)
  • 8 x 18-bit address registers (A)
  • 8 x 18-bit index registers (B)

• All arithmetic and logic instructions are register-to-register (15-bit):
  opcode(6) i(3) j(3) k(3)          Ri ← Rj op Rk

• Only Load and Store instructions (30-bit) refer to memory!
  opcode(6) i(3) j(3) disp(18)      Ri ← M[Rj + disp]

Touching address registers 1 to 5 initiates a load; 6 to 7 initiates a store
- very useful for vector operations
23
CDC 6600: Datapath

[Diagram: central memory (128K words, 32 banks, 1 µs cycle) exchanges operands and results with the operand registers X (8 x 60-bit), address registers A (8 x 18-bit), and index registers B (8 x 18-bit); the registers feed 10 functional units, and an 8 x 60-bit instruction stack supplies the IR.]
24
CDC6600: Vector Addition
      B0 ← -n
loop: JZE B0, exit
      A0 ← B0 + a0    // load X0
      A1 ← B0 + b0    // load X1
      X6 ← X0 + X1
      A6 ← B0 + c0    // store X6
      B0 ← B0 + 1
      jump loop

Ai = address register
Bi = index register
Xi = data register

25
CDC6600 ISA designed to simplify
high-performance implementation
§ Use of three-address, register-register ALU instructions
simplifies pipelined implementation
– Only 3-bit register-specifier fields checked for dependencies
– No implicit dependencies between inputs and outputs
§ Decoupling setting of address register (Ar) from retrieving
value from data register (Xr) simplifies providing multiple
outstanding memory accesses
– Address update instruction also issues implicit memory operation
– Software can schedule load of address register before use of value
– Can interleave independent instructions in between
§ CDC6600 has multiple parallel unpipelined functional units
– E.g., 2 separate multipliers
§ Follow-on machine CDC7600 used pipelined functional units
– Foreshadows later RISC designs
26
[Image of internal IBM memo, © IBM]
27
IBM Memo on CDC6600
Thomas Watson Jr., IBM CEO, August 1963:
“Last week, Control Data ... announced the 6600 system. I understand
that in the laboratory developing the system there are only 34 people
including the janitor. Of these, 14 are engineers and 4 are programmers...
Contrasting this modest effort with our vast development activities, I fail
to understand why we have lost our industry leadership position by
letting someone else offer the world's most powerful computer.”

To which Cray replied: “It seems like Mr. Watson has answered his own question.”

28
Computer Architecture Terminology
Latency (in seconds or cycles): Time taken for a single
operation from start to finish (initiation to useable result)
Bandwidth (in operations/second or operations/cycle): Rate at which operations can be performed
Occupancy (in seconds or cycles): Time during which the
unit is blocked on an operation (structural hazard)
Note, for a single functional unit:
§ Occupancy can be much less than latency (how?)
§ Occupancy can be greater than latency (how?)
§ Bandwidth can be greater than 1/latency (how?)
§ Bandwidth can be less than 1/latency (how?)
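One answer to the bandwidth questions can be made concrete with a sketch using made-up numbers: a pipelined unit with a 4-cycle latency but 1-cycle occupancy accepts a new operation every cycle, so sustained bandwidth approaches 1 op/cycle, well above 1/latency = 0.25:

```python
def pipelined_throughput(n_ops, latency, occupancy):
    # First result appears after `latency` cycles; each later op starts
    # `occupancy` cycles after the previous one (structural hazard window).
    total_cycles = latency + (n_ops - 1) * occupancy
    return n_ops / total_cycles

bw = pipelined_throughput(1000, latency=4, occupancy=1)
print(round(bw, 3), 1 / 4)  # ~0.997 ops/cycle vs 1/latency = 0.25
```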

29
Issues in Complex Pipeline Control
• Structural conflicts at the execution stage if some FPU or memory unit is not
pipelined and takes more than one cycle
• Structural conflicts at the write-back stage due to variable latencies of different
functional units -> many writes to reg file
• Out-of-order write hazards due to variable latencies of different functional
units
• How to handle exceptions?
[Diagram: IF → ID → Issue reads GPRs/FPRs and dispatches to parallel units (ALU + Mem, Fadd, Fmul, unpipelined Fdiv) that all share a single WB stage.]

30
CDC6600 Scoreboard
§ Instructions dispatched in-order to functional units
provided no structural hazard or WAW
– Stall on structural hazard, no functional units available
– Only one pending write to any register
§ Instructions wait for input operands (RAW hazards) before
execution
– Can execute out-of-order
§ Instructions wait for output register to be read by
preceding instructions (WAR)
– Result held in functional unit until register free

31
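The scoreboard's dispatch and execute checks can be sketched in a few lines of Python; the tuple format and helper names are illustrative assumptions (the WAR release rule is omitted for brevity):

```python
def can_dispatch(inst, busy_units, pending_writes):
    """Dispatch in order unless a structural hazard or WAW hazard exists."""
    unit, dest, _srcs = inst
    if unit in busy_units:        # structural hazard: no functional unit free
        return False
    if dest in pending_writes:    # WAW: only one pending write per register
        return False
    return True

def can_execute(inst, pending_writes):
    """Execute (possibly out of order) once all source operands are ready (RAW)."""
    _, _, srcs = inst
    return not any(s in pending_writes for s in srcs)

add = ("adder", "X6", ["X0", "X1"])
print(can_dispatch(add, busy_units=set(), pending_writes={"X6"}))  # False (WAW)
print(can_execute(add, pending_writes={"X1"}))                     # False (RAW)
```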
More Complex In-Order Pipeline
[Diagram: integer pipeline (Inst. Mem, Decode/GPRs, X1, address add + Data Mem in X2, X3, W) alongside floating-point pipelines that share the same writeback: FAdd occupies X1-X3, FMul X2-X3, and an unpipelined divider (FDiv) X2-X3; the commit point precedes W.]

§ Delay writeback so all operations have same latency to W stage
– Write ports never oversubscribed (one inst. in & one inst. out every cycle)
– Stall pipeline on long-latency operations, e.g., divides, cache misses
– Handle exceptions in-order at commit point

How to prevent increased writeback latency from slowing down single-cycle integer operations? Bypassing
32
In-Order Superscalar Pipeline
[Diagram: a dual-fetch front end feeds one integer/memory pipeline and one floating-point pipeline, as on the previous slide.]

§ Fetch two instructions per cycle; issue both simultaneously if one is integer/memory and the other is floating point
§ Inexpensive way of increasing throughput, examples include Alpha 21064 (1992) & MIPS R5000 series (1996)
§ Same idea can be extended to wider issue by duplicating functional units (e.g. 4-issue UltraSPARC & Alpha 21164) but regfile ports and bypassing costs grow quickly
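The pairing rule can be sketched as a simple predicate; the opcode classification below is a simplifying assumption (FP loads and stores are treated as integer/memory-pipeline ops):

```python
# Simplified opcode classes for the dual-issue check (an assumption, not the
# exact classification of any real machine).
INT_MEM_OPS = {"add", "addi", "lw", "sw", "bne", "fld", "fsd"}
FP_OPS = {"fadd.d", "fmul.d", "fdiv.d"}

def can_dual_issue(op1, op2):
    # Issue both only if one goes to the integer/memory pipeline
    # and the other to the floating-point pipeline.
    return (op1 in INT_MEM_OPS and op2 in FP_OPS) or \
           (op1 in FP_OPS and op2 in INT_MEM_OPS)

print(can_dual_issue("fld", "fadd.d"))  # True: one to each pipeline
print(can_dual_issue("add", "addi"))    # False: both need the integer pipeline
```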

33
In-Order Pipeline with two ALU stages

§ Address calculate before memory access
§ Integer ALU after memory access
[ © Motorola 1994 ] 34
MC68060 Dynamic ALU Scheduling
Using RISC-V style assembly code for MC68060

EA → MEM → ALU    add x1, x1, 24(x2)   // EA: x2+24; ALU: x1+M[x2+24] (not a real RISC-V instruction!)
EA → MEM → ALU    add x3, x1, x6       // ALU: x1+x6
EA → MEM → ALU    addi x5, x2, 12      // EA: x2+12
EA → MEM → ALU    lw x4, 16(x5)        // EA: x5+16
EA → MEM → ALU    lw x8, 16(x3)        // EA: x3+16

Common trick used in modern in-order RISC pipeline designs, even without reg-mem operations
35
Acknowledgements
§ This course is partly inspired by previous MIT 6.823 and
Berkeley CS252 computer architecture courses created by
my collaborators and colleagues:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)

36
