L05-PipeliningII

The document discusses advanced topics in computer architecture, specifically focusing on pipelining, exception handling, and the architecture of Intel's Core i9-13900KS CPU. It covers various aspects of pipelining, including hazards, asynchronous interrupts, and the R4000 architecture, while also introducing concepts like loop unrolling and decoupling for performance optimization. Additionally, it includes administrative details for a course, CS152/252A, and a brief history of supercomputers, highlighting the CDC 6600 as a significant early example.

http://inst.eecs.berkeley.edu/~cs152

CS 152/252A Computer Architecture and Engineering
Sophia Shao
Lecture 5 – Pipelining II
Intel Launches $699 Core i9-13900KS, the World's First 6 GHz 320W CPU: Available Now

It may not be quite the 30 GHz that Intel CEO Pat Gelsinger predicted back in 2002, but Intel's Core i9-13900KS Special Edition processor, which is available on shelves today for $699, is the world's first consumer CPU to run at 6 GHz without overclocking. With a whopping 250W base power specification, it's also now officially the most power-hungry desktop CPU in history — it peaks at 320W in a new Extreme Power Delivery Profile.

https://www.tomshardware.com/news/intel-launches-dollar699-core-i9-13900ks-the-worlds-first-6-ghz-cpu-available-now
https://www.anandtech.com/show/18728/the-intel-core-i9-13900ks-review-taking-intel-s-raptor-lake-to-6-ghz
Last Time in Lecture 3
§ Iron law of performance:
– time/program = insts/program * cycles/inst * time/cycle
§ Classic 5-stage RISC pipeline
§ Structural, data, and control hazards
§ Structural hazards handled with interlock or more hardware
§ Data hazards include RAW, WAR, WAW
– Handle data hazards with interlock, bypass, or speculation
§ Control hazards (branches, interrupts) are the most difficult, as they change which instruction is next
– Branch prediction commonly used
§ Precise traps: stop cleanly on one instruction, all previous
instructions completed, no following instructions have
changed architectural state
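The iron law lends itself to a quick back-of-the-envelope check. A minimal Python sketch, with made-up numbers for instruction count, CPI, and clock period:

```python
def execution_time(insts, cpi, cycle_time_s):
    # Iron law: time/program = insts/program * cycles/inst * time/cycle
    return insts * cpi * cycle_time_s

# Hypothetical program: 1e9 instructions, CPI of 1.5, 1 GHz clock (1 ns/cycle)
t = execution_time(1_000_000_000, 1.5, 1e-9)
print(t)  # ~1.5 seconds
```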

2
Asynchronous Interrupts
§ An I/O device requests attention by asserting one
of the prioritized interrupt request lines

§ When the processor decides to process the interrupt
– It stops the current program at instruction Ii, completing all the instructions up to Ii-1 (precise interrupt)
– It saves the PC of instruction Ii in a special register (EPC)
– It disables interrupts and transfers control to a designated interrupt handler running in supervisor mode
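As a concrete illustration, here is a toy Python model of those three steps; the register names (EPC, supervisor-mode flag) follow the slide, but the class itself is a hedged sketch, not real hardware behavior:

```python
class CPU:
    def __init__(self):
        self.pc = 0
        self.epc = None
        self.interrupts_enabled = True
        self.supervisor_mode = False

    def take_interrupt(self, handler_pc):
        # Instructions up to Ii-1 have completed; Ii has not (precise interrupt).
        # Save the PC of the interrupted instruction Ii in EPC.
        self.epc = self.pc
        # Disable interrupts and jump to the handler in supervisor mode.
        self.interrupts_enabled = False
        self.supervisor_mode = True
        self.pc = handler_pc

cpu = CPU()
cpu.pc = 0x400                    # PC of interrupted instruction Ii
cpu.take_interrupt(0x80000000)    # hypothetical handler address
print(hex(cpu.epc), cpu.supervisor_mode)  # 0x400 True
```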

3
Synchronous Trap
§ A synchronous trap is caused by an exception on a particular instruction
§ In general, the instruction cannot be completed and needs to be restarted after the exception has been handled
– requires undoing the effect of one or more partially executed instructions
§ In the case of a system call trap, the instruction is considered to have been completed
– a special jump instruction involving a change to a privileged mode
4
Trap:
altering the normal flow of control

[Diagram: the program executes Ii-1, Ii, Ii+1; a trap at Ii transfers control to handler instructions HI1, HI2, ..., HIn.]

An external or internal event that needs to be processed by another (system) program. The event is usually unexpected or rare from the program's point of view.
5
Exception Handling 5-Stage Pipeline

[Diagram: five-stage pipeline PC → Inst. Mem → Decode → Execute (+) → Data Mem → Writeback, annotated with exception sources: PC address exception, illegal opcode, overflow, data address exceptions, and asynchronous interrupts.]

§ How to handle multiple simultaneous exceptions in different pipeline stages?
§ How and where to handle external asynchronous interrupts?
6
Exception Handling 5-Stage Pipeline

[Diagram: the same five-stage pipeline with the commit point at the Data Mem (M) stage. Exception flags (Exc) and PCs are carried in latches through the D, E, and M stages; at commit, the Cause and EPC registers are written, the handler PC is selected for fetch, kill signals flush the F, D, and E stages and the writeback, and asynchronous interrupts are injected.]
7
Exception Handling 5-Stage Pipeline
§ Hold exception flags in pipeline until commit point (M stage)
§ Exceptions in earlier pipe stages override later exceptions for a given instruction
§ Inject external interrupts at commit point (override others)
§ If trap at commit: update Cause and EPC registers, kill all stages, inject handler PC into fetch stage
8
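The priority rules above can be sketched in a few lines of Python; the stage names and cause strings are illustrative, not an actual simulator:

```python
# Earlier pipe stages take priority for a given instruction's exceptions.
STAGE_ORDER = ["F", "D", "E", "M"]

def select_exception(flags, external_interrupt=False):
    """flags: dict mapping pipeline stage -> exception cause (or absent)."""
    if external_interrupt:
        return "external-interrupt"   # injected at commit, overrides others
    for stage in STAGE_ORDER:
        if flags.get(stage):
            return flags[stage]       # earliest-stage exception wins
    return None

# Instruction with both a decode-stage and a memory-stage exception:
print(select_exception({"D": "illegal-opcode", "M": "data-address"}))  # illegal-opcode
```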
Speculating on Exceptions
§ Prediction mechanism
– Exceptions are rare, so simply predicting no exceptions is very
accurate!
§ Check prediction mechanism
– Exceptions detected at end of instruction execution pipeline,
special hardware for various exception types
§ Recovery mechanism
– Only write architectural state at commit point, so can throw away
partially executed instructions after exception
– Launch exception handler after flushing pipeline

§ Bypassing allows use of uncommitted instruction results by following instructions

9
Deeper Pipelines: MIPS R4000
Commit Point

Direct-mapped I$ allows use of instruction before tag check complete

Figure C.36 The eight-stage pipeline structure of the R4000 uses pipelined
instruction and data caches. The pipe stages are labeled and their detailed
function is described in the text. The vertical dashed lines represent the stage
boundaries as well as the location of pipeline latches. The instruction is actually
available at the end of IS, but the tag check is done in RF, while the registers are
fetched. Thus, we show the instruction memory as operating through RF. The TC
stage is needed for data memory access, because we cannot write the data into
the register until we know whether the cache access was a hit or not.

© 2018 Elsevier Inc. All rights reserved.


10
R4000 Load-Use Delay
Direct-mapped D$ allows use of
data before tag check complete

Figure C.37 The structure of the R4000 integer pipeline leads to a two-cycle load delay. A two-cycle delay is possible because the data value is available at the end of DS and can be bypassed. If the tag check in TC indicates a miss, the pipeline is backed up a cycle, when the correct data are available.
© 2018 Elsevier Inc. All rights reserved.
11
R4000 Branches

Figure C.39 The basic branch delay is three cycles, because the
condition evaluation is performed during EX.

© 2018 Elsevier Inc. All rights reserved.


12
Simple vector-vector add code example

# for(i=0; i<N; i++)
#   A[i] = B[i]+C[i];

loop: fld    f0, 0(x2)    // x2 points to B
      fld    f1, 0(x3)    // x3 points to C
      fadd.d f2, f0, f1
      fsd    f2, 0(x1)    // x1 points to A
      addi   x1, x1, 8    // Bump pointer
      addi   x2, x2, 8    // Bump pointer
      addi   x3, x3, 8    // Bump pointer
      bne    x1, x4, loop // x4 holds end

13
Simple Pipeline Scheduling

Can reschedule code to try to reduce pipeline hazards

loop: fld    f0, 0(x2)    // x2 points to B
      fld    f1, 0(x3)    // x3 points to C
      addi   x3, x3, 8    // Bump pointer
      addi   x2, x2, 8    // Bump pointer
      fadd.d f2, f0, f1
      addi   x1, x1, 8    // Bump pointer
      fsd    f2, -8(x1)   // x1 points to A
      bne    x1, x4, loop // x4 holds end

Long-latency loads and floating-point operations limit parallelism within a single loop iteration
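To see why rescheduling helps, the following sketch counts stall cycles for both orderings under a toy in-order pipeline model; the latencies are made-up illustrative numbers, not the real machine's:

```python
def stall_cycles(program, latency):
    """Count stalls for a single-issue in-order pipeline with full bypassing."""
    ready = {}      # register -> cycle its value is available
    cycle = 0
    stalls = 0
    for op, dest, srcs in program:
        start = max([cycle] + [ready.get(r, 0) for r in srcs])
        stalls += start - cycle
        cycle = start + 1                 # one issue per cycle
        if dest:
            ready[dest] = start + latency[op]
    return stalls

# Made-up use latencies: loads 2 cycles, FP add 3, everything else 1.
latency = {"fld": 2, "fadd.d": 3, "fsd": 1, "addi": 1, "bne": 1}
original = [("fld", "f0", []), ("fld", "f1", []),
            ("fadd.d", "f2", ["f0", "f1"]), ("fsd", None, ["f2"]),
            ("addi", "x1", []), ("addi", "x2", []), ("addi", "x3", []),
            ("bne", None, [])]
scheduled = [("fld", "f0", []), ("fld", "f1", []),
             ("addi", "x3", []), ("addi", "x2", []),
             ("fadd.d", "f2", ["f0", "f1"]), ("addi", "x1", []),
             ("fsd", None, ["f2"]), ("bne", None, [])]
print(stall_cycles(original, latency), stall_cycles(scheduled, latency))  # 3 1
```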
14
One way to reduce hazards: Loop Unrolling
Can unroll to expose more parallelism, reduce dynamic instruction count

loop: fld    f0, 0(x2)     // x2 points to B
      fld    f1, 0(x3)     // x3 points to C
      fld    f10, 8(x2)
      fld    f11, 8(x3)
      addi   x3, x3, 16    // Bump pointer
      addi   x2, x2, 16    // Bump pointer
      fadd.d f2, f0, f1
      fadd.d f12, f10, f11
      addi   x1, x1, 16    // Bump pointer
      fsd    f2, -16(x1)   // x1 points to A
      fsd    f12, -8(x1)
      bne    x1, x4, loop  // x4 holds end

§ Unrolling limited by number of architectural registers
§ Unrolling increases instruction cache footprint (more static instructions)
§ More complex code generation for compiler, which has to understand pointers
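The same transformation in a high-level language, as a hedged sketch (the slide's assembly implicitly assumes the trip count is a multiple of two; the cleanup code below handles the general case):

```python
def vadd_unrolled(A, B, C):
    """A[i] = B[i] + C[i], unrolled by two: fewer branches executed per element."""
    n = len(A)
    i = 0
    while i + 1 < n:                    # two elements per loop trip
        A[i] = B[i] + C[i]
        A[i + 1] = B[i + 1] + C[i + 1]
        i += 2
    if i < n:                           # cleanup iteration for odd n
        A[i] = B[i] + C[i]
    return A

print(vadd_unrolled([0, 0, 0], [1, 2, 3], [10, 20, 30]))  # [11, 22, 33]
```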
15
Alternative Approach: Decoupling
(lookahead, runahead) in µarchitecture
Can separate control and memory address operations from
data computations:

loop: fld    f0, 0(x2)    // x2 points to B
      fld    f1, 0(x3)    // x3 points to C
      fadd.d f2, f0, f1
      fsd    f2, 0(x1)    // x1 points to A
      addi   x1, x1, 8    // Bump pointer
      addi   x2, x2, 8    // Bump pointer
      addi   x3, x3, 8    // Bump pointer
      bne    x1, x4, loop // x4 holds end

The control and address operations do not depend on the data computations, so they can be computed early relative to the data computations, which can be delayed until later.
CS252 16
Simple Decoupled Access/Execute Machine
[Diagram: an integer pipeline (F D X M W) runs ahead, placing load-data-writeback, compute, and store-data-read µops into a µop queue for the floating-point pipeline (R X1 X2 X3 W). Load addresses are checked against queued pending store addresses; a load data queue, store address queue, and store data queue decouple the two pipelines.]
CS252 17
Decoupled Execution

fld f0    Send load to memory, queue up write to f0
fld f1    Send load to memory, queue up write to f1
fadd.d    Queue up fadd.d
fsd f2    Queue up store address, wait for store data
addi x1   Bump pointer
addi x2   Bump pointer
addi x3   Bump pointer
bne       Take branch
fld f0    Send load to memory, queue up write to f0
fld f1    Send load to memory, queue up write to f1
fadd.d    Queue up fadd.d
fsd f2    Queue up store address, wait for store data

Load addresses are checked against queued pending store addresses; many writes to f0 can be in the queue at the same time.
CS252 18
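The queueing behavior traced above can be mimicked with a toy Python model; the queue names follow the previous slide, but the code is an illustrative sketch, not the actual machine:

```python
from collections import deque

def run_decoupled(B, C):
    """Toy decoupled access/execute: access side runs ahead, FP side drains queues."""
    load_data = deque()   # load data queue
    fp_ops = deque()      # µop queue for the floating-point pipeline
    results = []
    # Access side: issue all loads and queue compute µops first (runahead).
    for b, c in zip(B, C):
        load_data.append(b)
        load_data.append(c)
        fp_ops.append("fadd.d")
    # Execute side: drain the queues in program order.
    while fp_ops:
        fp_ops.popleft()
        x, y = load_data.popleft(), load_data.popleft()
        results.append(x + y)
    return results

print(run_decoupled([1, 2], [10, 20]))  # [11, 22]
```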
CS152 Administrivia
§ HW1 released
– Due Feb 02
§ Lab1 released
– Due Feb 09
§ Lab reports must be readable English summaries – not dumps of log files!
– We will reward good reports, and penalize undecipherable reports
– Page limit (check lab spec/Ed)
§ Lecture Ed thread
– One thread per lecture
– Post your questions following the format:
• [Slide #] Your question
– The staff team will address and clarify the questions asynchronously.
§ Guest lecture next Tuesday
– Prefetching
19
CS252 Administrivia
§ CS252 Readings on
– https://ucb-cs252-sp23.hotcrp.com/u/0/
– Use hotcrp to upload reviews before Wednesday:
• Write one paragraph on main content of paper including good/bad
points of paper
• Also, answer/ask 1-3 questions about paper for discussion
• First two “360 Architecture”, “VAX11-780”
– 2-3pm Wednesday, Soda 606/Zoom
§ CS252 Project Timeline
– Proposal Wed Feb 22
– Use 252A GSIs (Abe and Prashanth) and my OHs to get feedback.

CS252 20
Supercomputers
Definitions of a supercomputer:
§ Fastest machine in world at given task
§ A device to turn a compute-bound problem into an I/O
bound problem
§ Any machine costing $30M+
§ Any machine designed by Seymour Cray

§ CDC6600 (Cray, 1964) regarded as first supercomputer

21
CDC 6600 Seymour Cray, 1964
§ A fast pipelined machine with 60-bit words
– 128 Kword main memory capacity, 32 banks
§ Ten functional units (parallel, unpipelined)
– Floating Point: adder, 2 multipliers, divider
– Integer: adder, 2 incrementers, ...
§ Hardwired control (no microcoding)
§ Scoreboard for dynamic scheduling of instructions
§ Ten Peripheral Processors for Input/Output
– a fast multi-threaded 12-bit integer ALU
§ Very fast clock, 10 MHz (FP add in 4 clocks)
§ >400,000 transistors, 750 sq. ft., 5 tons, 150 kW,
novel freon-based technology for cooling
§ Fastest machine in world for 5 years (until 7600)
– over 100 sold ($7-10M each)

22
CDC 6600:
A Load/Store Architecture

• Separate instructions to manipulate three types of registers:
  • 8 x 60-bit data registers (X)
  • 8 x 18-bit address registers (A)
  • 8 x 18-bit index registers (B)

• All arithmetic and logic instructions are register-to-register (15-bit):
  opcode(6) i(3) j(3) k(3)          Ri ← Rj op Rk

• Only Load and Store instructions (30-bit) refer to memory!
  opcode(6) i(3) j(3) disp(18)      Ri ← M[Rj + disp]

Touching address registers 1 to 5 initiates a load; 6 to 7 initiates a store
- very useful for vector operations
23
CDC 6600: Datapath

[Diagram: central memory (128K words, 32 banks, 1 µs cycle) exchanges operands and results with the operand registers X (8 x 60-bit), address registers A (8 x 18-bit), and index registers B (8 x 18-bit); the registers feed 10 functional units, and an 8 x 60-bit instruction stack supplies the IR.]
24
CDC6600: Vector Addition
      B0 ← -n
loop: JZE B0, exit
      A0 ← B0 + a0    // load X0
      A1 ← B0 + b0    // load X1
      X6 ← X0 + X1
      A6 ← B0 + c0    // store X6
      B0 ← B0 + 1
      jump loop

Ai = address register
Bi = index register
Xi = data register

25
CDC6600 ISA designed to simplify
high-performance implementation
§ Use of three-address, register-register ALU instructions
simplifies pipelined implementation
– Only 3-bit register-specifier fields checked for dependencies
– No implicit dependencies between inputs and outputs
§ Decoupling setting of address register (Ar) from retrieving
value from data register (Xr) simplifies providing multiple
outstanding memory accesses
– Address update instruction also issues implicit memory operation
– Software can schedule load of address register before use of value
– Can interleave independent instructions in between
§ CDC6600 has multiple parallel unpipelined functional units
– E.g., 2 separate multipliers
§ Follow-on machine CDC7600 used pipelined functional units
– Foreshadows later RISC designs
26
[Image of internal IBM memo, © IBM]
27
IBM Memo on CDC6600
Thomas Watson Jr., IBM CEO, August 1963:
“Last week, Control Data ... announced the 6600 system. I understand
that in the laboratory developing the system there are only 34 people
including the janitor. Of these, 14 are engineers and 4 are programmers...
Contrasting this modest effort with our vast development activities, I fail
to understand why we have lost our industry leadership position by
letting someone else offer the world's most powerful computer.”

To which Cray replied: “It seems like Mr. Watson has answered his own question.”

28
Computer Architecture Terminology
Latency (in seconds or cycles): Time taken for a single
operation from start to finish (initiation to useable result)
Bandwidth (in operations/second or operations/cycle): Rate at which operations can be performed
Occupancy (in seconds or cycles): Time during which the
unit is blocked on an operation (structural hazard)
Note, for a single functional unit:
§ Occupancy can be much less than latency (how?)
§ Occupancy can be greater than latency (how?)
§ Bandwidth can be greater than 1/latency (how?)
§ Bandwidth can be less than 1/latency (how?)
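One answer to the bandwidth questions can be made concrete with a sketch using made-up numbers: a pipelined unit with a 4-cycle latency but 1-cycle occupancy accepts a new operation every cycle, so sustained bandwidth approaches 1 op/cycle, well above 1/latency = 0.25:

```python
def pipelined_throughput(n_ops, latency, occupancy):
    # First result appears after `latency` cycles; each later op starts
    # `occupancy` cycles after the previous one (structural hazard window).
    total_cycles = latency + (n_ops - 1) * occupancy
    return n_ops / total_cycles

bw = pipelined_throughput(1000, latency=4, occupancy=1)
print(round(bw, 3), 1 / 4)  # ~0.997 ops/cycle vs 1/latency = 0.25
```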

29
Issues in Complex Pipeline Control
• Structural conflicts at the execution stage if some FPU or memory unit is not
pipelined and takes more than one cycle
• Structural conflicts at the write-back stage due to variable latencies of different
functional units -> many writes to reg file
• Out-of-order write hazards due to variable latencies of different functional
units
• How to handle exceptions?
[Diagram: IF → ID → Issue reads GPRs/FPRs and dispatches to parallel units (ALU + Mem, Fadd, Fmul, unpipelined Fdiv) that all share a single WB stage.]

30
CDC6600 Scoreboard
§ Instructions dispatched in-order to functional units
provided no structural hazard or WAW
– Stall on structural hazard, no functional units available
– Only one pending write to any register
§ Instructions wait for input operands (RAW hazards) before
execution
– Can execute out-of-order
§ Instructions wait for output register to be read by
preceding instructions (WAR)
– Result held in functional unit until register free

31
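The scoreboard's dispatch and execute checks can be sketched in a few lines of Python; the tuple format and helper names are illustrative assumptions (the WAR release rule is omitted for brevity):

```python
def can_dispatch(inst, busy_units, pending_writes):
    """Dispatch in order unless a structural hazard or WAW hazard exists."""
    unit, dest, _srcs = inst
    if unit in busy_units:        # structural hazard: no functional unit free
        return False
    if dest in pending_writes:    # WAW: only one pending write per register
        return False
    return True

def can_execute(inst, pending_writes):
    """Execute (possibly out of order) once all source operands are ready (RAW)."""
    _, _, srcs = inst
    return not any(s in pending_writes for s in srcs)

add = ("adder", "X6", ["X0", "X1"])
print(can_dispatch(add, busy_units=set(), pending_writes={"X6"}))  # False (WAW)
print(can_execute(add, pending_writes={"X1"}))                     # False (RAW)
```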
More Complex In-Order Pipeline
[Diagram: integer pipeline (Inst. Mem, Decode/GPRs, X1, address add + Data Mem in X2, X3, W) alongside floating-point pipelines that share the same writeback: FAdd occupies X1-X3, FMul X2-X3, and an unpipelined divider (FDiv) X2-X3; the commit point precedes W.]

§ Delay writeback so all operations have same latency to W stage
– Write ports never oversubscribed (one inst. in & one inst. out every cycle)
– Stall pipeline on long-latency operations, e.g., divides, cache misses
– Handle exceptions in-order at commit point

How to prevent increased writeback latency from slowing down single-cycle integer operations? Bypassing
32
In-Order Superscalar Pipeline
[Diagram: a dual-fetch front end feeds one integer/memory pipeline and one floating-point pipeline, as on the previous slide.]

§ Fetch two instructions per cycle; issue both simultaneously if one is integer/memory and the other is floating point
§ Inexpensive way of increasing throughput, examples include Alpha 21064 (1992) & MIPS R5000 series (1996)
§ Same idea can be extended to wider issue by duplicating functional units (e.g. 4-issue UltraSPARC & Alpha 21164) but regfile ports and bypassing costs grow quickly
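The pairing rule can be sketched as a simple predicate; the opcode classification below is a simplifying assumption (FP loads and stores are treated as integer/memory-pipeline ops):

```python
# Simplified opcode classes for the dual-issue check (an assumption, not the
# exact classification of any real machine).
INT_MEM_OPS = {"add", "addi", "lw", "sw", "bne", "fld", "fsd"}
FP_OPS = {"fadd.d", "fmul.d", "fdiv.d"}

def can_dual_issue(op1, op2):
    # Issue both only if one goes to the integer/memory pipeline
    # and the other to the floating-point pipeline.
    return (op1 in INT_MEM_OPS and op2 in FP_OPS) or \
           (op1 in FP_OPS and op2 in INT_MEM_OPS)

print(can_dual_issue("fld", "fadd.d"))  # True: one to each pipeline
print(can_dual_issue("add", "addi"))    # False: both need the integer pipeline
```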

33
In-Order Pipeline with two ALU stages

§ Address calculate before memory access
§ Integer ALU after memory access
[ © Motorola 1994 ] 34
MC68060 Dynamic ALU Scheduling
Using RISC-V style assembly code for MC68060

EA → MEM → ALU    add x1, x1, 24(x2)   // EA: x2+24; ALU: x1+M[x2+24] (not a real RISC-V instruction!)
EA → MEM → ALU    add x3, x1, x6       // ALU: x1+x6
EA → MEM → ALU    addi x5, x2, 12      // EA: x2+12
EA → MEM → ALU    lw x4, 16(x5)        // EA: x5+16
EA → MEM → ALU    lw x8, 16(x3)        // EA: x3+16

Common trick used in modern in-order RISC pipeline designs, even without reg-mem operations
35
Acknowledgements
§ This course is partly inspired by previous MIT 6.823 and
Berkeley CS252 computer architecture courses created by
my collaborators and colleagues:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)

36
