L05-PipeliningII
edu/~cs152
CS 152/252A Computer Architecture and Engineering
Sophia Shao
Lecture 5 – Pipelining II
Intel Launches $699 Core i9-13900KS, the World's First 6 GHz 320W CPU: Available Now
It may not be quite the 30 GHz that Intel CEO Pat Gelsinger predicted back in 2002,
but Intel's Core i9-13900KS Special Edition
processor, which is available on shelves
today for $699, is the world's first consumer
CPU to run at 6 GHz without overclocking.
With a whopping 250W base power
specification, it's also now officially the
most power-hungry desktop CPU in history
— it peaks at 320W in a new Extreme
Power Delivery Profile.
https://www.tomshardware.com/news/intel-launches-dollar699-core-i9-13900ks-the-worlds-first-6-ghz-cpu-available-now
https://www.anandtech.com/show/18728/the-intel-core-i9-13900ks-review-taking-intel-s-raptor-lake-to-6-ghz
Last Time in Lecture 3
§ Iron law of performance:
– time/program = insts/program * cycles/inst * time/cycle
§ Classic 5-stage RISC pipeline
§ Structural, data, and control hazards
§ Structural hazards handled with interlock or more hardware
§ Data hazards include RAW, WAR, WAW
– Handle data hazards with interlock, bypass, or speculation
§ Control hazards (branches, interrupts) are the most difficult, as
they change which instruction comes next
– Branch prediction commonly used
§ Precise traps: stop cleanly on one instruction, all previous
instructions completed, no following instructions have
changed architectural state
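As a quick illustration of the iron law above, the sketch below multiplies the three factors for two hypothetical designs. The instruction count, CPI values, and clock rates are made-up numbers chosen only to show the arithmetic; they do not describe any real machine.

```python
def execution_time(insts, cpi, clock_hz):
    """Iron law: time/program = insts/program * cycles/inst * time/cycle."""
    return insts * cpi * (1.0 / clock_hz)

# Hypothetical designs: the deeper pipeline doubles the clock rate,
# but hazards push its CPI up from 1.2 to 1.5.
base = execution_time(insts=1_000_000, cpi=1.2, clock_hz=1e9)
deep = execution_time(insts=1_000_000, cpi=1.5, clock_hz=2e9)

# Speedup of the deeper design over the base design.
speedup = base / deep
```

Under these assumed numbers the clock doubles but the speedup is less than 2x, because the higher CPI eats part of the gain.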
2
Asynchronous Interrupts
§ An I/O device requests attention by asserting one
of the prioritized interrupt request lines
3
Synchronous Trap
§ A synchronous trap is caused by an exception on
a particular instruction
[Figure: the program's instruction stream (I_{i-1}, I_i, I_{i+1}) traps at I_i and transfers to the trap handler (HI_1, HI_2, ..., HI_n).]
5
Exception Handling 5-Stage Pipeline
[Figure: 5-stage pipeline datapath with stages PC, Inst. Mem, D (Decode), E (+), M (Data Mem), W, and an Asynchronous Interrupts input.]
6
Exception Handling 5-Stage Pipeline
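[Figure: 5-stage pipeline datapath (PC, Inst. Mem, D (Decode), E (+), M (Data Mem), W) with exception-handling logic. Labels: Commit Point; per-stage D, E, M signals and PC registers feeding Cause and EPC; Select Handler PC; Kill F Stage, Kill D Stage, Kill E Stage, Kill Writeback; Asynchronous Interrupts.]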
7
Exception Handling 5-Stage Pipeline
§ Hold exception flags in pipeline until commit
point (M stage)
8
Speculating on Exceptions
§ Prediction mechanism
– Exceptions are rare, so simply predicting no exceptions is very
accurate!
§ Check prediction mechanism
– Exceptions detected at end of instruction execution pipeline,
special hardware for various exception types
§ Recovery mechanism
– Only write architectural state at commit point, so can throw away
partially executed instructions after exception
– Launch exception handler after flushing pipeline
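The recovery mechanism above can be sketched in a few lines. This is a simplified software model, not the hardware from the slides: the `Inst` record, the `run_to_commit` function, and the `HANDLER_PC` address are all hypothetical names introduced for illustration. Each instruction carries its exception flag to the commit point, and architectural state is written only there.

```python
from dataclasses import dataclass
from typing import Optional

HANDLER_PC = 0x8000_0000  # hypothetical handler address (an assumption)

@dataclass
class Inst:
    pc: int
    dest: Optional[str] = None        # register written at commit, if any
    value: int = 0                    # result to write
    exc_cause: Optional[str] = None   # exception flag held until commit

def run_to_commit(program, regs):
    """Commit in program order; stop cleanly at the first flagged instruction."""
    for i, inst in enumerate(program):
        if inst.exc_cause is not None:
            # The "no exception" prediction was wrong: record EPC and cause,
            # kill all younger instructions, and redirect to the handler.
            killed = [younger.pc for younger in program[i + 1:]]
            return {"epc": inst.pc, "cause": inst.exc_cause,
                    "next_pc": HANDLER_PC, "killed": killed}
        if inst.dest is not None:
            regs[inst.dest] = inst.value  # architectural state changes only here
    return {"epc": None, "cause": None, "next_pc": None, "killed": []}

regs = {}
program = [
    Inst(pc=0x100, dest="x1", value=5),
    Inst(pc=0x104, dest="x2", value=7, exc_cause="illegal instruction"),
    Inst(pc=0x108, dest="x3", value=9),
]
result = run_to_commit(program, regs)
```

In this run only `x1` is written: the faulting instruction and everything after it leave no trace in `regs`, which is the precise-trap property.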
9
Deeper Pipelines: MIPS R4000
Commit Point
Figure C.36 The eight-stage pipeline structure of the R4000 uses pipelined
instruction and data caches. The pipe stages are labeled and their detailed
function is described in the text. The vertical dashed lines represent the stage
boundaries as well as the location of pipeline latches. The instruction is actually
available at the end of IS, but the tag check is done in RF, while the registers are
fetched. Thus, we show the instruction memory as operating through RF. The TC
stage is needed for data memory access, because we cannot write the data into
the register until we know whether the cache access was a hit or not.
Figure C.37 The structure of the R4000 integer pipeline leads to a two-cycle load
delay. A two-cycle delay is possible because the data value is available at the end of
DS and can be bypassed. If the tag check in TC indicates a miss, the pipeline is
backed up a cycle, when the correct data are available.
© 2018 Elsevier Inc. All rights reserved.
11
R4000 Branches
Figure C.39 The basic branch delay is three cycles, because the
condition evaluation is performed during EX.
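To get a feel for what a three-cycle branch delay costs, the sketch below adds branch stall cycles to a base CPI. It is a pessimistic back-of-the-envelope model: it assumes every branch pays the full delay (no useful delay-slot fill, no prediction), and the 20% branch frequency is an assumed figure, not one from the slides.

```python
from fractions import Fraction

def cpi_with_branch_stalls(base_cpi, branch_fraction, delay_cycles):
    """CPI = base CPI + (fraction of instructions that are branches) * delay."""
    return base_cpi + branch_fraction * delay_cycles

# Assumed workload: 1 in 5 instructions is a branch; delay is 3 cycles.
cpi = cpi_with_branch_stalls(Fraction(1), Fraction(1, 5), 3)
```

Exact fractions are used so the result is not blurred by floating-point rounding.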
13
Simple Pipeline Scheduling
[Figure: floating-point pipeline with stages R, X1, X2, X3, W; label: Check.]
CS252 17
Decoupled Execution
CS252 18
CS152 Administrivia
§ HW1 released
– Due Feb 02
§ Lab1 released
– Due Feb 09
§ Lab reports must be readable English summaries – not
dumps of log files!!!!!!
– We will reward good reports, and penalize undecipherable reports
– Page limit (check lab spec/Ed)
§ Lecture Ed thread
– One thread per lecture
– Post your questions following the format:
• [Slide #] Your question
– The staff team will address and clarify the questions asynchronously.
§ Guest lecture next Tuesday
– Prefetching
19
CS252 Administrivia
§ CS252 Readings on
– https://ucb-cs252-sp23.hotcrp.com/u/0/
– Use hotcrp to upload reviews before Wednesday:
• Write one paragraph on main content of paper including good/bad
points of paper
• Also, answer/ask 1-3 questions about paper for discussion
• First two readings: “360 Architecture”, “VAX11-780”
– 2-3pm Wednesday, Soda 606/Zoom
§ CS252 Project Timeline
– Proposal Wed Feb 22
– Use 252A GSIs (Abe and Prashanth) and my OHs to get feedback.
CS252 20
Supercomputers
Definitions of a supercomputer:
§ Fastest machine in world at given task
§ A device to turn a compute-bound problem into an I/O
bound problem
§ Any machine costing $30M+
§ Any machine designed by Seymour Cray
21
CDC 6600 Seymour Cray, 1964
§ A fast pipelined machine with 60-bit words
– 128 Kword main memory capacity, 32 banks
§ Ten functional units (parallel, unpipelined)
– Floating Point: adder, 2 multipliers, divider
– Integer: adder, 2 incrementers, ...
§ Hardwired control (no microcoding)
§ Scoreboard for dynamic scheduling of instructions
§ Ten Peripheral Processors for Input/Output
– a fast multi-threaded 12-bit integer ALU
§ Very fast clock, 10 MHz (FP add in 4 clocks)
§ >400,000 transistors, 750 sq. ft., 5 tons, 150 kW,
novel freon-based technology for cooling
§ Fastest machine in world for 5 years (until 7600)
– over 100 sold ($7-10M each)
22
CDC 6600:
A Load/Store Architecture
23
CDC 6600: Datapath
[Figure: CDC 6600 datapath. Central Memory (128K words, 32 banks, 1µs cycle); Operand Regs (X), 8 x 60-bit; Address Regs (A), 8 x 18-bit; Index Regs (B), 8 x 18-bit; Inst. Stack, 8 x 60-bit; IR; 10 Functional Units. Paths are labeled operand, result, operand addr, and result addr.]
24
CDC6600: Vector Addition
B0 ← - n
loop: JZE B0, exit
A0 ← B0 + a0 load X0
A1 ← B0 + b0 load X1
X6 ← X0 + X1
A6 ← B0 + c0 store X6
B0 ← B0 + 1
jump loop
Ai = address register
Bi = index register
Xi = data register
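The loop above can be emulated in software to make the implicit memory operations visible. This sketch follows the slide's convention that writing A0 or A1 loads X0 or X1, and writing A6 stores X6. It assumes that `a0`, `b0`, and `c0` point one word past the end of each array, so that B0 can count up from -n to 0; the slide does not state this, and the function name and argument names are hypothetical.

```python
def vector_add(mem, a_base, b_base, c_base, n):
    """Emulate the CDC 6600 loop: c[i] = a[i] + b[i] for i in 0..n-1.

    mem is a list of words; the *_base arguments are word addresses.
    """
    A = [0] * 8  # address registers
    B = [0] * 8  # index registers
    X = [0] * 8  # data (operand) registers

    # Assumed: the constants address the end of each array.
    a0, b0, c0 = a_base + n, b_base + n, c_base + n

    B[0] = -n
    while B[0] != 0:           # JZE B0, exit
        A[0] = B[0] + a0       # setting A0 ...
        X[0] = mem[A[0]]       # ... implicitly loads X0
        A[1] = B[0] + b0       # setting A1 ...
        X[1] = mem[A[1]]       # ... implicitly loads X1
        X[6] = X[0] + X[1]
        A[6] = B[0] + c0       # setting A6 ...
        mem[A[6]] = X[6]       # ... implicitly stores X6
        B[0] = B[0] + 1
    return mem

memory = [1, 2, 3, 10, 20, 30, 0, 0, 0]
vector_add(memory, a_base=0, b_base=3, c_base=6, n=3)
```

Counting the index up toward zero lets a single register serve as both the loop counter and the array offset.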
25
CDC6600 ISA designed to simplify
high-performance implementation
§ Use of three-address, register-register ALU instructions
simplifies pipelined implementation
– Only 3-bit register-specifier fields checked for dependencies
– No implicit dependencies between inputs and outputs
§ Decoupling setting of address register (Ar) from retrieving
value from data register (Xr) simplifies providing multiple
outstanding memory accesses
– Address update instruction also issues implicit memory operation
– Software can schedule load of address register before use of value
– Can interleave independent instructions in between
§ CDC6600 has multiple parallel unpipelined functional units
– E.g., 2 separate multipliers
§ Follow-on machine CDC7600 used pipelined functional units
– Foreshadows later RISC designs
26
[© IBM]
27
IBM Memo on CDC6600
Thomas Watson Jr., IBM CEO, August 1963:
“Last week, Control Data ... announced the 6600 system. I understand
that in the laboratory developing the system there are only 34 people
including the janitor. Of these, 14 are engineers and 4 are programmers...
Contrasting this modest effort with our vast development activities, I fail
to understand why we have lost our industry leadership position by
letting someone else offer the world's most powerful computer.”
28
Computer Architecture Terminology
Latency (in seconds or cycles): Time taken for a single
operation from start to finish (initiation to useable result)
Bandwidth (in operations/second or operations/cycle): Rate
at which operations can be performed
Occupancy (in seconds or cycles): Time during which the
unit is blocked on an operation (structural hazard)
Note, for a single functional unit:
§ Occupancy can be much less than latency (how?)
§ Occupancy can be greater than latency (how?)
§ Bandwidth can be greater than 1/latency (how?)
§ Bandwidth can be less than 1/latency (how?)
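One way to explore the questions above is a small model of a single functional unit fed independent operations back to back. The sketch assumes a new operation can start every `occupancy` cycles, so steady-state bandwidth is 1/occupancy; the function names and the three example configurations are illustrative assumptions, and other answers to the "how?" questions exist.

```python
from fractions import Fraction

def finish_times(num_ops, latency, occupancy):
    """Cycle at which each of num_ops independent operations delivers its result."""
    return [k * occupancy + latency for k in range(num_ops)]

def steady_state_bandwidth(occupancy):
    """Operations per cycle when the unit is kept busy."""
    return Fraction(1, occupancy)

LATENCY = 4  # cycles from initiation to usable result (assumed)

# Fully pipelined: blocked for 1 cycle per operation.
pipelined = steady_state_bandwidth(occupancy=1)
# Unpipelined: blocked for the whole latency.
unpipelined = steady_state_bandwidth(occupancy=4)
# Unpipelined with 2 extra recovery cycles before the next operation can start.
recovering = steady_state_bandwidth(occupancy=6)
```

With these numbers the pipelined unit has bandwidth above 1/latency, the unpipelined unit matches 1/latency, and the unit with recovery time falls below it.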
29
Issues in Complex Pipeline Control
• Structural conflicts at the execution stage if some FPU or memory unit is not
pipelined and takes more than one cycle
• Structural conflicts at the write-back stage due to variable latencies of different
functional units -> many writes to reg file
• Out-of-order write hazards due to variable latencies of different functional
units
• How to handle exceptions?
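[Figure: complex pipeline with stages IF, ID, Issue, and WB; functional units ALU, Mem, Fadd, Fmul, and Fdiv sit between Issue and WB; register files GPRs and FPRs.]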
30
CDC6600 Scoreboard
§ Instructions dispatched in-order to functional units
provided no structural hazard or WAW
– Stall on structural hazard, no functional units available
– Only one pending write to any register
§ Instructions wait for input operands (RAW hazards) before
execution
– Can execute out-of-order
§ Instructions wait for output register to be read by
preceding instructions (WAR)
– Result held in functional unit until register free
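The dispatch rules above (stall on a structural hazard, allow only one pending write per register) can be sketched as a small bookkeeping class. This is a simplified model rather than the actual CDC 6600 logic: the `Scoreboard` class and its method names are hypothetical, and the RAW and WAR waits that happen after dispatch are left out.

```python
from dataclasses import dataclass, field

@dataclass
class Scoreboard:
    free_units: dict                                   # unit type -> free unit count
    pending_writes: set = field(default_factory=set)   # registers with a write in flight

    def can_dispatch(self, unit, dest):
        if self.free_units.get(unit, 0) == 0:
            return False   # structural hazard: no functional unit available
        if dest in self.pending_writes:
            return False   # WAW: only one pending write to any register
        return True

    def dispatch(self, unit, dest):
        """Try to dispatch; return False if the instruction must stall."""
        if not self.can_dispatch(unit, dest):
            return False
        self.free_units[unit] -= 1
        self.pending_writes.add(dest)
        return True

    def writeback(self, unit, dest):
        """Result written: release the unit and the destination register."""
        self.free_units[unit] += 1
        self.pending_writes.discard(dest)

sb = Scoreboard(free_units={"fmul": 2, "fadd": 1})
r1 = sb.dispatch("fmul", "X1")   # unit free, no pending write to X1
r2 = sb.dispatch("fadd", "X1")   # WAW on X1
r3 = sb.dispatch("fadd", "X2")   # takes the only adder
r4 = sb.dispatch("fadd", "X3")   # structural hazard: adder busy
sb.writeback("fadd", "X2")
r5 = sb.dispatch("fadd", "X3")   # adder free again
```

Because dispatch is in order, a `False` result here would stall every later instruction as well.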
31
More Complex In-Order Pipeline
[Figure: in-order pipeline with stages PC, Inst. Mem, D (Decode), GPRs, X1 (+), X2 (Data Mem), X3, W.]
32
In-Order Superscalar Pipeline
[Figure: in-order superscalar pipeline. PC, Inst. Mem (2), D (Dual Decode), then two paths: GPRs, X1 (+), X2 (Data Mem), X3, W and FPRs, X1, X2 (FAdd), X3, W.]
33
In-Order Pipeline with two ALU stages
Address calculate
before memory
access
[ © Motorola 1994 ]
34
MC68060 Dynamic ALU Scheduling
Using RISC-V style assembly code for MC68060
[Figure: each instruction passes through the EA, MEM, and ALU stages. Annotated operations: addi x5,x2,12 computes x2+12; lw x4, 16(x5) computes x5+16; lw x8, 16(x3) computes x3+16.]
Common trick used in modern in-order RISC pipeline designs, even without
reg-mem operations
35
Acknowledgements
§ This course is partly inspired by previous MIT 6.823 and
Berkeley CS252 computer architecture courses created by
my collaborators and colleagues:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)
36