
Computer Architecture

Faculty of Computer Science & Engineering - HCMUT

Chapter 1
Computer Abstractions and
Technology
Binh Tran-Thanh
[email protected]
The Computer Revolution
▪ Progress in computer technology
▪ Underpinned by Moore’s Law
▪ Makes novel applications feasible
▪ Computers in automobiles
▪ Cell phones
▪ Human genome project
▪ World Wide Web
▪ Search Engines
▪ Computers are pervasive

15-Aug-23 Faculty of Computer Science and Engineering 2


History of Computer Development
▪ First generation 1945 - 1955
▪ vacuum tubes, plug boards
▪ Second generation 1955 - 1965
▪ transistors, batch systems
▪ Third generation 1965 – 1980
▪ ICs and multiprogramming
▪ Fourth generation 1980 – present
▪ personal computers (Desk, Lap)
▪ SuperComp.,
▪ DataCenter, Clusters, etc.
Moore's Law


The History: at the very beginning

ENIAC, 1943: 30 tons, 200 kW, ~1,000 ops/sec


The History: Now
Typical 2023 laptop
~1kg, 10W, 10 billion ops/sec



Classes of Computers

Classes of Computers
▪ Personal computers
▪ General purpose, variety of software
▪ Subject to cost/performance trade-off

▪ Embedded computers
▪ Hidden as components of systems
▪ Stringent power/performance/cost constraints
Classes of Computers
▪ Server computers
▪ Network based
▪ High capacity, performance, reliability
▪ Range from small servers to building sized

▪ Supercomputers
▪ High end scientific and engineering calculations
▪ Highest capability but represent a small fraction of
the overall computer market



The PostPC Era has arrived
▪ "Your next computer is not a computer" (Apple)

Source: IDC


The PostPC Era
▪ Cloud computing
▪ Warehouse Scale Computers (WSC)
▪ Software as a Service (SaaS)
▪ A portion of the software runs on a PMD and a portion runs in the Cloud
▪ Amazon and Google

▪ Personal Mobile Device (PMD)


▪ Battery operated
▪ Connects to the Internet
▪ Hundreds of dollars
▪ Smart phones, tablets, electronic glasses



What You Will Learn
▪ How programs are translated into the machine
language
▪ And how the hardware executes them
▪ The hardware/software interface
▪ What determines program performance
▪ And how it can be improved
▪ How hardware designers improve performance
▪ What is parallel processing
Understanding Performance
▪ Algorithm
▪ Determines number of operations executed
▪ Programming language, compiler, architecture
▪ Determine number of machine instructions
executed per operation
▪ Processor and memory system
▪ Determine how fast instructions are executed
▪ I/O system (including OS)
▪ Determines how fast I/O operations are executed



Below Your Program
▪ Application software
▪ Written in high-level language
▪ System software
▪ Compiler: translates HLL code to machine code
▪ Operating System: service code
▪ Handling input/output
▪ Managing memory and storage
▪ Scheduling tasks & sharing resources
▪ Hardware
▪ Processor, memory, I/O controllers
Levels of Program Code
▪ High-level language program (in C)
▪ Level of abstraction closer to problem domain
▪ Provides for productivity and portability

swap(int v[], int k){
    int temp;
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
}

    | Compiler
    v

▪ Assembly language program (for MIPS)
▪ Textual representation of instructions

swap: multi $2, $5, 4
      add   $2, $4, $2
      lw    $15, 0($2)
      lw    $16, 4($2)
      sw    $16, 0($2)
      sw    $15, 4($2)
      jr    $31

    | Assembler
    v

▪ Binary machine language program (for MIPS)
▪ Hardware representation
▪ Binary digits (bits)
▪ Encoded instructions and data

00000000101000100000000100011000
00000000100000100001000000100001
10001101111000100000000000000000
10001110000100100000000000000100
10101110000100100000000000000000
10101101111000100000000000000100
00000011111000000000000000001000

▪ Which layer corresponds to program.exe / .asm / .c?


Components of a Computer
▪ Same components for
all kinds of computer
▪ Desktop, server, embedded
▪ Input/output includes
▪ User-interface devices
▪ Display, keyboard, mouse
▪ Storage devices
▪ Hard disk, CD/DVD, flash
▪ Network adapters
▪ For communicating with
other computers



Inside the Processor (CPU)
▪ Datapath: performs operations on data
▪ Control: sequences Datapath, memory, …
▪ Cache memory
▪ Small fast SRAM memory for immediate access
to data



Eight Great Ideas
▪ Design for Moore’s Law
▪ Use abstraction to simplify design
▪ Make the common case fast
▪ Performance via parallelism
▪ Performance via pipelining
▪ Performance via prediction
▪ Hierarchy of memories
▪ Dependability via redundancy
Opening the Box
Capacitive multitouch LCD screen

3.8 V, 25 Watt-hour battery

Computer board



Through the Looking Glass
▪ LCD screen: picture elements (pixels)
▪ Mirrors content of frame buffer memory



Touchscreen
▪ PostPC device
▪ Supersedes keyboard and mouse
▪ Resistive and Capacitive types
▪ Most tablets, smart phones use capacitive
▪ Capacitive allows multiple touches
simultaneously



Inside the Processor
▪ Apple A14



Abstractions
▪ Abstraction helps us deal with complexity
▪ Hide lower-level detail
▪ Instruction set architecture (ISA)
▪ The hardware/software interface
▪ Application binary interface
▪ The ISA plus system software interface
▪ Implementation
▪ The details underlying the interface
A Safe Place for Data
▪ Volatile main memory
▪ Loses instructions and data when power off
▪ Non-volatile secondary memory
▪ SSD, Magnetic disk
▪ Flash memory
▪ Optical disk (CDROM, DVD)
Networks
▪ Communication, resource sharing, nonlocal
access
▪ Local area network (LAN): Ethernet
▪ Wide area network (WAN): the Internet
▪ Wireless network: WiFi, Bluetooth



Technology Trends
▪ Electronics technology continues to evolve
▪ Increased capacity and performance
▪ Reduced cost

Year  Technology                  Relative performance/cost
1951  Vacuum tube                 1
1965  Transistor                  35
1975  Integrated circuit (IC)     900
1995  Very large scale IC (VLSI)  2,400,000
2013  Ultra large scale IC        250,000,000,000
Semiconductor Technology
▪ Silicon: semiconductor
▪ Add materials to transform properties:
▪ Conductors
▪ Insulators
▪ Switch



Manufacturing ICs
▪ Yield: proportion of working dies per wafer



Intel Core i7 Wafer
▪ 300mm wafer, 280 chips, 32nm technology
▪ Each chip is 20.7 x 10.5 mm



Integrated Circuit Cost
▪ Nonlinear relation to area and defect rate
▪ Wafer cost and area are fixed
▪ Defect rate determined by manufacturing process
▪ Die area determined by architecture and circuit design

Cost per die = Cost per wafer / (Dies per wafer × Yield)
Dies per wafer ≈ Wafer area / Die area
Yield = 1 / (1 + (Defects per area × Die area / 2))²
Defining Performance
▪ Which airplane has the best performance?

Airplane          Passenger capacity  Cruising range (miles)  Cruising speed (mph)  Passengers × mph
Boeing 777        375                 4630                    610                   228,750
Boeing 747        470                 4150                    610                   286,700
BAC/Sud Concorde  132                 4000                    1350                  178,200
Douglas DC-8-50   146                 8720                    544                   79,424
Response Time and Throughput
▪ Response time
▪ How long it takes to do a task
▪ Throughput
▪ Total work done per unit time
▪ e.g., tasks/transactions/… per hour
▪ How are response time and throughput affected by
▪ Replacing the processor with a faster version?
▪ Adding more processors?
▪ We’ll focus on response time for now…
Relative Performance
▪ Define: Performance = 1/Execution Time
▪ “X is n times faster than Y”
Performance_X / Performance_Y = Execution Time_Y / Execution Time_X = n
▪ Example: time taken to run a program
▪ 10s on A, 15s on B
▪ Execution TimeB / Execution TimeA
= 15s / 10s = 1.5
▪ So, A is 1.5 times faster than B
Measuring Execution Time
▪ Elapsed time
▪ Total response time, including all aspects
▪ Processing, I/O, OS overhead, idle time
▪ Determines system performance
▪ CPU time
▪ Time spent processing a given job
▪ Discounts I/O time, other jobs’ shares
▪ Comprises user CPU time and system CPU time
▪ Different programs are affected differently by CPU and
system performance



CPU Clocking
▪ Operation of digital hardware governed by a constant-rate clock
▪ Within each clock cycle: data transfer and computation, then state update
▪ Clock period: duration of a clock cycle
▪ e.g., 250 ps = 0.25 ns = 250×10⁻¹² s
▪ Clock frequency (rate): cycles per second
▪ e.g., 4.0 GHz = 4000 MHz = 4.0×10⁹ Hz



CPU Time
▪ Performance improved by
▪ Reducing number of clock cycles
▪ Increasing clock rate
▪ Hardware designer must often trade off clock
rate against cycle count
CPU Time = CPU Clock Cycles × Clock Cycle Time
         = CPU Clock Cycles / Clock Rate
CPU Time Example
▪ Computer A: 2GHz clock, 10s CPU time
▪ Designing Computer B
▪ Aim for 6s CPU time
▪ Can do faster clock, but causes 1.2 × clock cycles
▪ How fast must Computer B clock be?
Clock Cycles_A = CPU Time_A × Clock Rate_A = 10s × 2 GHz = 20×10⁹
Clock Rate_B = Clock Cycles_B / CPU Time_B = (1.2 × Clock Cycles_A) / 6s
             = (1.2 × 20×10⁹) / 6s = (24×10⁹) / 6s = 4 GHz
Instruction Count and CPI
▪ Instruction Count for a program
▪ Determined by program, ISA and compiler
▪ Average cycles per instruction
▪ Determined by CPU hardware
▪ If different instructions have different CPI
▪ Average CPI affected by instruction mix
Clock Cycles = Instruction Count × Cycles per Instruction
CPU Time = Instruction Count × CPI × Clock Cycle Time
         = (Instruction Count × CPI) / Clock Rate
CPI Example
▪ Computer A: Cycle Time = 250ps, CPI = 2.0
▪ Computer B: Cycle Time = 500ps, CPI = 1.2
▪ Same ISA
▪ Which is faster, and by how much?
CPU Time_A = Instruction Count × CPI_A × Cycle Time_A
           = I × 2.0 × 250 ps = I × 500 ps
CPU Time_B = Instruction Count × CPI_B × Cycle Time_B
           = I × 1.2 × 500 ps = I × 600 ps
A is faster, by CPU Time_B / CPU Time_A = (I × 600 ps) / (I × 500 ps) = 1.2
CPI in More Detail
▪ If different instruction classes take different numbers of cycles:

Clock Cycles = Σᵢ₌₁ⁿ (CPIᵢ × Instruction Countᵢ)

▪ Weighted average CPI:

CPI = Clock Cycles / Instruction Count
    = Σᵢ₌₁ⁿ (CPIᵢ × (Instruction Countᵢ / Instruction Count))

where Instruction Countᵢ / Instruction Count is the relative frequency of class i.


CPI Example
▪ Alternative compiled code sequences using instructions in classes A, B, C

Class             A  B  C
CPI for class     1  2  3
IC in sequence 1  2  1  2
IC in sequence 2  4  1  1

▪ Sequence 1: IC = 5
▪ Clock Cycles = 2×1 + 1×2 + 2×3 = 10
▪ Avg. CPI = 10/5 = 2.0
▪ Sequence 2: IC = 6
▪ Clock Cycles = 4×1 + 1×2 + 1×3 = 9
▪ Avg. CPI = 9/6 = 1.5
Performance Summary
▪ Performance depends on
▪ Algorithm: affects IC, possibly CPI
▪ Programming language: affects IC, CPI
▪ Compiler: affects IC, CPI
▪ Instruction set architecture: affects IC, CPI, Tc
CPU Time = (Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle)


Power Trends
▪ In CMOS IC technology:

Power = Capacitive load × Voltage² × Frequency

▪ Over this period, clock frequency grew ~1000× while supply voltage fell from 5V to 1V, holding power growth to ~30×
Multiprocessors
▪ Multicore microprocessors
▪ More than one processor per chip
▪ Requires explicitly parallel programming
▪ Compare with instruction level parallelism
▪ Hardware executes multiple instructions at once
▪ Hidden from the programmer
▪ Hard to do
▪ Programming for performance
▪ Load balancing
▪ Optimizing communication and synchronization



SPEC CPU Benchmark
▪ Programs used to measure performance
▪ Supposedly typical of actual workload
▪ Standard Performance Evaluation Corp (SPEC)
▪ Develops benchmarks for CPU, I/O, Web, …
▪ SPEC CPU2006
▪ Elapsed time to execute a selection of programs
▪ Negligible I/O, so focuses on CPU performance
▪ Normalize relative to reference machine
▪ Summarize as geometric mean of performance ratios
▪ CINT2006 (integer) and CFP2006 (floating-point)
ⁿ√( ∏ᵢ₌₁ⁿ Execution time ratioᵢ )


SPEC Power Benchmark
▪ Power consumption of server at different
workload levels
▪ Performance: ssj_ops/sec
▪ Power: Watts (Joules/sec)
Overall ssj_ops per Watt = ( Σᵢ₌₀¹⁰ ssj_opsᵢ ) / ( Σᵢ₌₀¹⁰ powerᵢ )


Pitfall: Amdahl’s Law
▪ Improving an aspect of a computer and expecting a
proportional improvement in overall performance
T_improved = T_affected / improvement factor + T_unaffected

▪ Example: multiply accounts for 80s of a 100s total
▪ How much improvement in multiply performance to get 5× overall?

20 = 80/n + 20 → no value of n satisfies this. Can't be done!

▪ Corollary: make the common case fast
Pitfall: MIPS as a Performance Metric
▪ MIPS: Millions of Instructions Per Second
▪ Doesn’t account for
▪ Differences in ISAs between computers
▪ Differences in complexity between instructions
MIPS = Instruction count / (Execution time × 10⁶)
     = Instruction count / ((Instruction count × CPI / Clock rate) × 10⁶)
     = Clock rate / (CPI × 10⁶)

▪ CPI varies between programs on a given CPU


Reducing Power
▪ Suppose a new CPU has
▪ 85% of capacitive load of old CPU
▪ 15% voltage and 15% frequency reduction
P_new / P_old = (C_old × 0.85 × (V_old × 0.85)² × (F_old × 0.85)) / (C_old × V_old² × F_old)
             = 0.85⁴ ≈ 0.52

▪ The power wall


▪ We can’t reduce voltage further
▪ We can’t remove more heat
▪ How else can we improve performance?
Concluding Remarks
▪ Cost/performance is improving
▪ Due to underlying technology development
▪ Hierarchical layers of abstraction
▪ In both hardware and software
▪ Instruction set architecture
▪ The hardware/software interface
▪ Execution time: the best performance measure
▪ Power is a limiting factor
▪ Use parallelism to improve performance



Exercise
▪ Given a program X in detail as below.
Instruction class A B C D
CPI 2 4 3 2.5
# instruction 1000 2000 3000 4000
▪ What is the CPU time of X when it runs at 2 GHz?
▪ Improve performance by halving the instruction count of class B. What is the speedup?
▪ Improve performance by changing only one class of instructions. What is the limit on the speedup?
Exercise
▪ Given a program X in detail as below.
Instruction class A B C D
CPI 2 4 3 2.5
# instruction 1000 2000 3000 4000
▪ What is the CPU time of X when it runs at 2 GHz? (14.5 µs)
▪ Improve performance by halving the instruction count of class B. What is the speedup? (1.16)
▪ Improve performance by changing only one class of instructions. What is the limit on the speedup? (1.53)
Question?

Computer Architecture
Faculty of Computer Science & Engineering - HCMUT

Chapter 2
Instructions: Language of the
Computer
Binh Tran-Thanh
[email protected]
Objectives

C code:

swap(int v[], int k){
    int temp;
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
}

    | Compiler
    v

MIPS assembly:

swap: multi $2, $5, 4
      add   $2, $4, $2
      lw    $15, 0($2)
      lw    $16, 4($2)
      sw    $16, 0($2)
      sw    $15, 4($2)
      jr    $31

    | Assembler
    v

Machine code:

00000000101000100000000100011000
00000000100000100001000000100001
10001101111000100000000000000000
10001110000100100000000000000100
10101110000100100000000000000000
10101101111000100000000000000100
00000011111000000000000000001000
8/13/2021 Faculty of Computer Science and Engineering 2


Abstract layer of ISA
▪ Coordination of many levels (layers) of abstraction
Application (ex: browser)
Software: Compiler, Assembler, Operating System (Mac OSX)
Instruction Set Architecture
Hardware: Processor, Memory, I/O system
    Datapath & Control
    Digital Design
    Circuit Design
    Transistors
Von Neumann Architecture
▪ Stored program concept
▪ Instruction categories:
▪ Arithmetic
▪ Data transfer
▪ Logical
▪ Conditional branch
▪ Unconditional jump

(Figure: input and output connected to a Central Processing Unit, containing a control unit and ALU, and a memory unit)
Harvard architecture
(Figure: separate instruction memory and data memory, each connected to a control unit and ALU, with input/output)


Computer Components
(Figure: a CPU containing PC, IR, MAR, MBR, I/O AR, I/O BR registers and an ALU; main memory holding instructions and data; an I/O module with buffers)

PC: Program Counter
IR: Instruction Register
MAR: Memory Address Register
MBR: Memory Buffer Register
I/O AR: Input/Output Address Register
I/O BR: Input/Output Buffer Register
Instruction execution process
Basic instruction cycle: start → Fetch Stage (fetch next instruction) → Execute Stage (execute instruction) → halt

▪ Fetch: from memory
▪ PC holds the address of the next instruction
▪ PC increases after the fetch
▪ Execution: decode & execute
Instruction Set
▪ The repertoire of instructions of a computer
▪ Different computers have different instruction sets
▪ But with many aspects in common
▪ Early computers had very simple instruction sets
▪ Simplified implementation
▪ Many modern computers also have simple
instruction sets



RISC vs. CISC Architectures
RISC CISC
▪ Reduced Instruction Set ▪ Complex Instruction Set
Computers Computers
▪ Emphasis on software ▪ Emphasis on hardware
▪ Single-clock, ▪ Includes multi-clock,
reduced instruction only complex instructions
▪ Low cycles per second, ▪ Small code sizes,
large code sizes high cycles per second
▪ Spends more transistors ▪ Transistors used for storing
on memory registers complex instructions



The MIPS Instruction Set
▪ Used as the example throughout the book
▪ Stanford MIPS commercialized by MIPS
Technologies (www.mips.com)
▪ Large share of embedded core market
▪ Applications in consumer electronics,
network/storage equipment, cameras, printers, …
▪ Typical of many modern ISAs
▪ See MIPS Reference Data tear-out card, and
Appendixes B and E



Design Principles of ISA
1. Simplicity favors regularity
2. Smaller is faster
3. Make the common case fast
4. Good design demands good compromises

Arithmetic Operations
▪ Add and subtract, three operands
▪ Two sources and one destination
add a, b, c # a gets b + c
▪ All arithmetic operations have this form
▪ Design Principle 1: Simplicity favors regularity
▪ Regularity makes implementation simpler
▪ Simplicity enables higher performance at lower
cost



Arithmetic Example
C code:
f = (g + h) - (i + j);
Compiled MIPS code:
add $t0, $s1, $s2 # t0 = g + h
add $t1, $s3, $s4 # t1 = i + j
sub $s0, $t0, $t1 # f = t0 - t1
Register Operands
▪ Arithmetic instructions use register operands
▪ MIPS has a 32 × 32-bit register file
▪ Use for frequently accessed data
▪ Numbered 0 to 31
▪ 32-bit data called a “word”
▪ Assembler names
▪ $t0, $t1, …, $t9 for temporary values
▪ $s0, $s1, …, $s7 for saved variables
▪ Design Principle 2: Smaller is faster
▪ c.f. main memory: millions of locations
Register Operand Example
C code:
f = (g + h) - (i + j);
f, g, h, i, j in $s0, $s1, $s2,
$s3, $s4, respectively
Compiled MIPS code:
add $t0, $s1, $s2 # t0 = g + h
add $t1, $s3, $s4 # t1 = i + j
sub $s0, $t0, $t1 # f = t0 - t1



Memory Operands
▪ Main memory used for composite data
▪ Arrays, structures, dynamic data
▪ To apply arithmetic operations
▪ Load values from memory into registers
▪ Store result from register to memory
▪ Memory is byte addressed
▪ Each address identifies an 8-bit byte
▪ Words are aligned in memory
▪ Address must be a multiple of 4
▪ MIPS is Big Endian
▪ Most-significant byte at least address of a word
▪ c.f. Little Endian: least-significant byte at least address



Memory Operand Example 1
C code:
g = h + A[8];
g in $s1, h in $s2, base address of A in $s3

Compiled MIPS code:
lw  $t0, 32($s3)   # load word: index 8 requires offset of 32 (4 bytes per word)
add $s1, $s2, $t0

In 32($s3), $s3 is the base register and 32 is the offset.
Memory Operand Example 2
C code:
A[12] = h + A[8];
h in $s2, base address of A in $s3

Compiled MIPS code:
lw  $t0, 32($s3)   # load word: index 8 requires offset of 32
add $t0, $s2, $t0
sw  $t0, 48($s3)   # store word: index 12 requires offset of 48
Your turn
▪ Given 3 arrays in C as follow:
int arrayA[10];
short arrayB[10];
char arrayC[10];
▪ What is “sizeof” of each above array?
▪ Assume $a0, $a1, $a2 are base address of
ArrayA, arrayB, and arrayC, respectively.
▪ Write a piece of MIPS code to load the value of
arrayA[3], arrayB[3], and arrayC[3] to
$t0, $t1, and $t2 respectively.



Your turn
▪ Given a structure in C as ▪ Given a structure in C as
follow: follow:
struct Person_A{ struct Person_B{
char name[5]; int age;
int age; char name[5];
char gender[3]; char gender[3];
}; };
▪ What is “sizeof”of the ▪ How about “sizeof”of
struct Person_A? the struct Person_B?
Registers vs. Memory
▪ Registers are faster to access than memory
▪ Operating on memory data requires loads and
stores
▪ More instructions to be executed
▪ Compiler must use registers for variables as
much as possible
▪ Only spill to memory for less frequently used
variables
▪ Register optimization is important!



Immediate Operands
▪ Constant data specified in an instruction
▪ addi $s3, $s3, 4
▪ No subtract immediate instruction
▪ Just use a negative constant
▪ addi $s2, $s1, -1;
▪ Design Principle 3: Make the common case fast
▪ Small constants are common
▪ Immediate operand avoids a load instruction



The Constant Zero
▪ MIPS register 0 ($zero) is the constant 0
▪ Cannot be overwritten
▪ Useful for common operations
▪ Move between registers
▪ add $a0, $t0, $zero # move $t0 to $a0
▪ Assign immediate to registers
▪ addi $a0, $zero, 100 # $a0 = 100



Unsigned Binary Integers
▪ Given an n-bit number
x = x_(n−1)·2^(n−1) + x_(n−2)·2^(n−2) + … + x_1·2^1 + x_0·2^0

▪ Range: 0 to +2^n − 1
▪ Example
▪ 0000 0000 0000 0000 0000 0000 0000 1011₂
= 0 + … + 1×2³ + 0×2² + 1×2¹ + 1×2⁰
= 0 + … + 8 + 0 + 2 + 1 = 11₁₀
▪ Using 32 bits
▪ 0 to +4,294,967,295



2s-Complement Signed Integers
▪ Given an n-bit number

x = −x_(n−1)·2^(n−1) + x_(n−2)·2^(n−2) + … + x_1·2^1 + x_0·2^0

▪ Range: −2^(n−1) to +2^(n−1) − 1
▪ Example
▪ 1111 1111 1111 1111 1111 1111 1111 1100₂
= −1×2³¹ + 1×2³⁰ + … + 1×2² + 0×2¹ + 0×2⁰
= −2,147,483,648 + 2,147,483,644 = −4₁₀
▪ Using 32 bits
▪ −2,147,483,648 to +2,147,483,647



2s-Complement Signed Integers
▪ Bit 31 is sign bit
▪ 1 for negative numbers
▪ 0 for non-negative numbers
▪ −(−2^(n−1)) can't be represented
▪ Non-negative numbers have the same unsigned and 2s-
complement representation
▪ Some specific numbers
▪ 0: 0000 0000 … 0000
▪ –1: 1111 1111 … 1111
▪ Most-negative: 1000 0000 … 0000
▪ Most-positive: 0111 1111 … 1111
Signed Negation
▪ Complement and add 1
▪ Complement means 1 → 0, 0 → 1

x + x̄ = 1111…1111₂ = −1, so x̄ + 1 = −x

▪ Example: negate +2
▪ +2 = 0000 0000 … 0010₂
▪ −2 = 1111 1111 … 1101₂ + 1 = 1111 1111 … 1110₂
Sign Extension
▪ Representing a number using more bits
▪ Preserve the numeric value
▪ In MIPS instruction set
▪ addi: extend immediate value
▪ lb, lh: extend loaded byte/halfword
▪ beq, bne: extend the displacement
▪ Replicate the sign bit to the left
▪ c.f. unsigned values: extend with 0s (ZERO extend)
▪ Examples: extend 8-bit to 16-bit for signed number
▪ +2: 0000 0010 => 0000 0000 0000 0010
▪ –2: 1111 1110 => 1111 1111 1111 1110
Exercise(1/2)
Given a piece of MIPS code as below:
.data
int_a: .word 0xCA002021
.text
la $s0, int_a # load address
lb $t1, 0($s0)
lbu $t2, 0($s0)
lb $t3, 3($s0)
lbu $t4, 3($s0)
What are values of t1, t2, t3, t4?
How about little endian?
Exercise
Given a piece of MIPS code as below:
.data
var_A: .byte 0xCA
var_B: .half 0xBEEF
var_C: .word 0xBAD0BABE
.text
la $s0, var_A
la $s1, var_B
la $s2, var_C
Assume that .data segment begins at 0x40000000 address.
What is value of $s0, $s1, $s2



Representing Instructions
▪ Instructions are encoded in binary
▪ Called machine code
▪ MIPS instructions
▪ Encoded as 32-bit instruction words
▪ Small number of formats encoding operation code
(opcode), register numbers, …
▪ Regularity!
▪ Register numbers
▪ $t0 – $t7 are reg’s 8 – 15
▪ $t8 – $t9 are reg’s 24 – 25
▪ $s0 – $s7 are reg’s 16 – 23



MIPS R-format Instructions
op rs rt rd shamt funct
6 bits 5 bits 5 bits 5 bits 5 bits 6 bits

▪ Instruction fields
▪ op: operation code (opcode)
▪ rs: first source register number
▪ rt: second source register number
▪ rd: destination register number
▪ shamt: shift amount (00000 for now)
▪ funct: function code (extends opcode)
R-format Example
op      rs      rt      rd      shamt   funct
6 bits  5 bits  5 bits  5 bits  5 bits  6 bits

add $t0, $s1, $s2

special  $s1    $s2    $t0    0      add
0        17     18     8      0      32
000000   10001  10010  01000  00000  100000

0000 0010 0011 0010 0100 0000 0010 0000₂ = 02324020₁₆


Example (MIPS to machine code)
▪ What is the machine code of nor $s0, $a0, $t1?

op      rs     rt     rd     shamt  funct
000000  00100  01001  10000  00000  100111

0000 0000 1000 1001 1000 0000 0010 0111₂ = 00898027₁₆

Function codes (hex): Add 20, Addu 21, And 24, Div 1A, Divu 1B,
Jump register 08, Mfc0 00, Mfhi 10, Mflo 12, Mult 18, Multu 19,
Nor 27, Or 25, Slt 2A, Sltu 2B, Sra 03, Srl 02, Sub 22, Subu 23
Your turn
▪ What is machine code (in Hex) of
instruction: sub $s3, $t2, $a1



Hexadecimal
▪ Base 16
▪ Compact representation of bit strings
▪ 4 bits per hex digit
0 0000 4 0100 8 1000 C 1100
1 0001 5 0101 9 1001 D 1101
2 0010 6 0110 A 1010 E 1110
3 0011 7 0111 B 1011 F 1111
▪ Example: 0xCAFE FACE
▪ 1100 1010 1111 1110 1111 1010 1100 1110



MIPS I-format Instructions
op rs rt constant or address
6 bits 5 bits 5 bits 16 bits

▪ Immediate arithmetic and load/store instructions


▪ rt: destination or source register number
▪ Constant: −2^15 to +2^15 − 1
▪ Address: offset added to base address in $rs
▪ Design Principle 4: Good design demands good
compromises
▪ Different formats complicate decoding, but allow 32-bit
instructions uniformly
▪ Keep formats as similar as possible



Exercise
▪ Given a MIPS instruction:
addi $s3, $s2, X (12345)
▪ What is the maximum value of X?
▪ What is the machine code of the above
instruction?
▪ How do we assign $s0 = 0x1234CA00 (=
305,449,472)
Stored Program Computers
▪ Instructions represented in binary, just like data
▪ Instructions and data stored in memory
▪ Programs can operate on programs
▪ e.g., compilers, linkers, …
▪ Binary compatibility allows compiled programs to work on different computers
▪ Standardized ISAs

(Figure: memory holding an accounting program, an editor program, and a C compiler, all as machine code, plus payroll data, book text, and the editor's C source code, connected to the processor)


Logical Operations
▪ Instructions for bitwise manipulation
Operation C Java MIPS
Shift left << << sll
Shift right >> >>> srl
Bitwise AND & & and, andi
Bitwise OR | | or, ori
Bitwise NOT ~ ~ nor

▪ Useful for extracting and inserting groups of bits in a word
Shift Operations
op rs rt rd shamt funct
6 bits 5 bits 5 bits 5 bits 5 bits 6 bits

▪ shamt: how many positions to shift


▪ Shift left logical
▪ Shift left and fill with 0 bits
▪ sll by i bits multiplies by 2^i
▪ Shift right logical
▪ Shift right and fill with 0 bits
▪ srl by i bits divides by 2^i (unsigned only)
AND Operations
▪ Useful to mask bits in a word
▪ Select some bits, clear others to 0
and $t0, $t1, $t2
$t2 0000 0000 0000 0000 0000 1101 1100 0000

$t1 0000 0000 0000 0000 0011 1100 0000 0000

$t0 0000 0000 0000 0000 0000 1100 0000 0000



OR Operations
▪ Useful to include bits in a word
▪ Set some bits to 1, leave others unchanged
or $t0, $t1, $t2
$t2 0000 0000 0000 0000 0000 1101 1100 0000

$t1 0000 0000 0000 0000 0011 1100 0000 0000

$t0 0000 0000 0000 0000 0011 1101 1100 0000



NOT Operations
▪ Useful to invert bits in a word
▪ Change 0 to 1, and 1 to 0
▪ MIPS has NOR 3-operand instruction
▪ a NOR b == NOT (a OR b)
▪ Register 0 ($zero): always reads as zero
nor $t0, $t1, $zero

$zero 0000 0000 0000 0000 0000 0000 0000 0000

$t1 0000 0000 0000 0000 0011 1100 0000 0000

$t0 1111 1111 1111 1111 1100 0011 1111 1111


Conditional Operations
▪ Branch to a labeled instruction if a condition is
true. Otherwise, continue sequentially
▪ beq rs, rt, Label
▪ if (rs == rt) branch to instruction labeled;
▪ bne rs, rt, Label
▪ if (rs != rt) branch to instruction labeled;
▪ j Label
▪ unconditional jump to instruction labeled;



Compiling If Statement
C code: MIPS assembly:
int x, y; slt $t0, $a0, $a1
if (x < y) { beqz $t0, endif
y = y – x; sub $a1, $a1, $a0
} endif:
x in $a0, y: $a1



Compiling If-else Statement
C code: MIPS assembly:
int x, y; slt $t0, $a0, $a1
if (x < y) { beqz $t0, else
y = y – x; sub $a1, $a1, $a0
}else{ j end_if
else:
y = y * 4;
sll $a1, $a1, 2
}
end_if:
x in $a0, y in $a1
Compiling Loop Statement
Example
C code: MIPS assembly:
while (save[i] == k){ while: sll $t1, $s3, 2
i += 1; add $t1, $t1, $s6
lw $t0, 0($t1)
}
bne $t0, $s5, endwhile
i in $s3, k in $s5, address
addi $s3, $s3, 1
of save in $s6
j while
endwhile:



Basic Blocks
▪ A basic block is a sequence of
instructions with
▪ No embedded branches
(except at end)
▪ No branch targets (except at
beginning)
▪ A compiler identifies basic
blocks for optimization
▪ An advanced processor can
accelerate execution of basic
blocks
More Conditional Operations
▪ Set result to 1 if a condition is true. Otherwise, set to 0
▪ slt rd, rs, rt
▪ if rs < rt then rd = 1
▪ else rd = 0
▪ slti rt, rs, constant
▪ if rs < constant then rt = 1
▪ else rt = 0;
▪ Use in combination with beq, bne
▪ slt $t0, $s1, $s2 # if ($s1 < $s2)
▪ bne $t0, $zero, L # branch to L



Branch Instruction Design
▪ Why not blt, bge, etc?
▪ Hardware for <, ≥, … slower than =, ≠
▪ Combining with branch involves more work
per instruction, requiring a slower clock
▪ All instructions penalized!
▪ beq and bne are the common case
▪ This is a good design compromise
▪ (Design Principle 4)
Signed vs. Unsigned
▪ Signed comparison: slt, slti
▪ Unsigned comparison: sltu, sltiu
▪ Example:
$s0 = 1111 1111 1111 1111 1111 1111 1111 1111
$s1 = 0000 0000 0000 0000 0000 0000 0000 0001
▪ slt $t0, $s0, $s1 # signed
▪ –1 < +1 ⇒ $t0 = 1
▪ sltu $t0, $s0, $s1 # unsigned
▪ +4,294,967,295 > +1 ⇒ $t0 = 0
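The slt/sltu distinction is exactly the signed/unsigned distinction in C. A small sketch (the helper names are invented for the example):

```c
#include <stdint.h>

/* Mirrors slt: the same 32-bit patterns compared as signed values. */
static int slt_like(int32_t a, int32_t b) {
    return a < b;
}

/* Mirrors sltu: the same patterns compared as unsigned values. */
static int sltu_like(uint32_t a, uint32_t b) {
    return a < b;
}
```

With the slide's operands, 0xFFFFFFFF is –1 (less than +1) when signed, but 4,294,967,295 (greater than +1) when unsigned.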
Exercise
▪ Assume $s0 = 0xCA002021.
▪ Given MIPS instruction:
andi $t0, $s0, 0xFFFF
addi $t1, $s0, 0xFFFF
addiu $t2, $s0, 0xFFFF
▪ Which instruction types do above instructions
belong to?
▪ What are value of $t0, $t1, $t2?



Procedure Calling
▪ Steps required
▪ Place parameters in registers
▪ Transfer control to procedure
▪ Acquire storage for procedure
▪ Perform procedure’s operations
▪ Place result in register for caller
▪ Return to place of call



Register Usage
$a0 – $a3: arguments (reg’s 4 – 7)
$v0 - $v1: result values (reg’s 2 and 3)
$t0 – $t9: Temporaries (Can be overwritten by callee)
$s0 – $s7: Saved (Must be saved/restored by callee)
$gp: global pointer for static data (reg 28)
$sp: stack pointer (reg 29)
$fp: frame pointer (reg 30)
$ra: return address (reg 31)



Procedure Call Instructions
▪ Procedure call: jump and link
jal ProcedureLabel
▪ Address of following instruction put in $ra
▪ Jumps to target address
▪ Procedure return: jump register
jr $ra
▪ Copies $ra to program counter
▪ Can also be used for computed jumps
▪ e.g., for case/switch statements



Leaf Procedure Example
C code:
int leaf_example (int g, int h, int i, int j){
int f;
f = (g + h) - (i + j);
return f;
}
▪ Arguments g, …, j in $a0, …, $a3
▪ f in $s0 (hence, need to save $s0 on
stack)
▪ Result in $v0
Leaf Procedure Example
MIPS code:
leaf_example:
addi $sp, $sp, -4
sw $s0, 0($sp) # Save $s0 on stack
add $t0, $a0, $a1
add $t1, $a2, $a3 #Procedure body
sub $s0, $t0, $t1
add $v0, $s0, $zero # Result
lw $s0, 0($sp) # Restore $s0
addi $sp, $sp, 4
jr $ra # Return
Non-Leaf Procedures
▪ Procedures that call other procedures
▪ For nested call, caller needs to save on the
stack:
▪ Its return address
▪ Any arguments and temporaries needed after
the call
▪ Restore from the stack after the call
Non-Leaf Procedure Example
C code:
int fact(int n){
if (n < 1){
return 1;
}else{
return n * fact(n - 1);
}
}
▪ Argument n in $a0
▪ Result in $v0
Non-Leaf Procedure Example
MIPS code:
fact:
  addi $sp, $sp, -8 # adjust stack for 2 items
sw $ra, 4($sp) # save return address
sw $a0, 0($sp) # save argument
slti $t0, $a0, 1 # test for n < 1
beq $t0, $zero, L1
addi $v0, $zero, 1 # if so, result is 1
addi $sp, $sp, 8 # pop 2 items from stack
jr $ra
L1:
addi $a0, $a0, -1 # else decrement n
jal fact # recursive call
lw $a0, 0($sp) # restore original n
lw $ra, 4($sp) # and return address
addi $sp, $sp, 8 # pop 2 items from stack
mul $v0, $a0, $v0 # multiply to get result
  jr $ra            # and return
Local Data on the Stack
High address
  $fp → Saved argument registers (if any)
        Saved return address
        Saved saved registers (if any)
  $sp → Local arrays and structures (if any)
Low address

(The stack grows from high to low addresses; $fp marks the top of the procedure frame and $sp its bottom.)
▪ Local data allocated by callee


▪ e.g., C automatic variables
▪ Procedure frame (activation record)
▪ Used by some compilers to manage stack storage
Memory Layout
Main memory (high to low addresses):
  0x7fff fffc  Stack        ← $sp (grows downward)
               ...
               Dynamic data (heap, grows upward)
  0x1000 8000               ← $gp
  0x1000 0000  Static data
  0x0040 0000  Text (code)  ← pc
  0x0000 0000  Reserved

▪ Text: program code
▪ Static data: global variables
  ▪ e.g., static variables in C, constant arrays and strings
  ▪ $gp initialized to address allowing ±offsets into this segment
▪ Dynamic data: heap
  ▪ e.g., malloc in C, new in Java
▪ Stack: automatic storage



Character Data
▪ Byte-encoded character sets
▪ ASCII: 128 characters
▪ 95 graphic, 33 control
▪ Latin-1: 256 characters
▪ ASCII, +96 more graphic characters
▪ Unicode: 32-bit character set
▪ Used in Java, C++ wide characters, …
▪ Most of the world’s alphabets, plus symbols
▪ UTF-8, UTF-16: variable-length encodings
Byte/Halfword Operations
▪ Could use bitwise operations
▪ MIPS byte/halfword load/store
▪ String processing is a common case
▪ lb rt, offset(rs) ; lh rt, offset(rs)
▪ Sign extend to 32 bits in rt
▪ lbu rt, offset(rs); lhu rt, offset(rs)
▪ Zero extend to 32 bits in rt
▪ sb rt, offset(rs); sh rt, offset(rs)
▪ Store just rightmost byte/halfword
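Sign extension versus zero extension is easy to demonstrate in C with casts. A sketch with invented helper names mirroring the load instructions:

```c
#include <stdint.h>

/* lb sign-extends the loaded byte; lbu zero-extends it. */
static int32_t lb_like(uint8_t byte)   { return (int32_t)(int8_t)byte; }
static int32_t lbu_like(uint8_t byte)  { return (int32_t)byte; }

/* Same idea for halfwords: lh sign-extends, lhu zero-extends. */
static int32_t lh_like(uint16_t half)  { return (int32_t)(int16_t)half; }
static int32_t lhu_like(uint16_t half) { return (int32_t)half; }
```

Loading the byte 0x80 gives –128 with lb but +128 with lbu; the store instructions simply keep the rightmost byte/halfword, so no extension question arises.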
String Copy Example
C code (naïve):
▪ Null-terminated string
void strcpy (char x[], char y[]){
int i;
i = 0;
while ((x[i] = y[i]) != '\0')
i += 1;
}
▪ Addresses of x, y in $a0, $a1
▪ i in $s0
String Copy Example
MIPS code:
strcpy:
addi $sp, $sp, -4 # adjust stack for item
sw $s0, 0($sp) # save $s0
add $s0, $zero, $zero # i = 0
L1: add $t1, $s0, $a1 # addr of y[i] in $t1
lbu $t2, 0($t1) # $t2 = y[i]
add $t3, $s0, $a0 # addr of x[i] in $t3
sb $t2, 0($t3) # x[i] = y[i]
beq $t2, $zero, L2 # exit loop if y[i]== 0
addi $s0, $s0, 1 # i = i + 1
j L1 # next iteration of loop
L2: lw $s0, 0($sp) # restore saved $s0
addi $sp, $sp, 4 # pop 1 item from stack
jr $ra # and return
32-bit Constants
▪ Most constants are small
▪ 16-bit immediate is sufficient
▪ For the occasional 32-bit constant
lui rt, constant
▪ Copies 16-bit constant to left 16 bits of rt
▪ Clears right 16 bits of rt to 0
Below shows how to assign the 32-bit constant 4,000,000 (DEC) to register $s0.
4,000,000 DEC = 0x003D 0900;
0x3D = 61 DEC, 0x0900 = 2304 DEC
lui $s0, 61         # $s0 = 0000 0000 0011 1101 0000 0000 0000 0000
ori $s0, $s0, 2304  # $s0 = 0000 0000 0011 1101 0000 1001 0000 0000
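The lui/ori pair can be checked directly in C. A sketch (helper names are invented; they model only the effect on the destination register):

```c
#include <stdint.h>

/* lui: place the 16-bit constant in the upper half, clear the lower half. */
static uint32_t lui_like(uint16_t c) {
    return (uint32_t)c << 16;
}

/* ori: OR the 16-bit constant into the lower half, upper half unchanged. */
static uint32_t ori_like(uint32_t reg, uint16_t c) {
    return reg | c;
}
```

With the slide's values, (61 << 16) | 2304 = 3,997,696 + 2,304 = 4,000,000.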
Branch Addressing
▪ Branch instructions specify
▪ Opcode, two registers, target address
▪ Most branch targets are near branch
▪ Forward or backward
op rs rt constant or address
6 bits 5 bits 5 bits 16 bits
▪ PC-relative addressing
▪ Target address = (PC + 4) + offset × 4
▪ PC already incremented by 4 by this time

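The target computation can be written out as a one-line C sketch (the function name is invented for the example):

```c
#include <stdint.h>

/* Branch target = (PC + 4) + sign-extended 16-bit offset × 4.
   The offset counts words relative to the instruction after the branch. */
static uint32_t branch_target(uint32_t pc, int16_t offset) {
    return pc + 4 + ((int32_t)offset << 2);
}
```

For instance, a branch at address 80012 with offset 2 targets 80012 + 4 + 8 = 80024, and a negative offset branches backward.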


Jump Addressing
▪ Jump (j and jal) targets could be anywhere
in text segment
▪ Encode full address in instruction
op address
6 bits 26 bits

▪ (Pseudo)Direct jump addressing


▪ Target address = PC[31…28] : (address × 4)
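Pseudo-direct addressing concatenates bits rather than adding, which a C sketch makes concrete (function name invented; it follows the slide's PC[31..28] : address × 4 formula):

```c
#include <stdint.h>

/* Jump target = top 4 bits of PC concatenated with (26-bit address × 4). */
static uint32_t jump_target(uint32_t pc, uint32_t addr26) {
    return (pc & 0xF0000000u) | (addr26 << 2);
}
```

With the loop example on the next slide, a j instruction encoding address 20000 at PC 80020 jumps to (0 | 20000 × 4) = 80000.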
Target Addressing Example
▪ Loop code from earlier example
▪ Assume Loop at location 80000
MIPS code                 Address  Instruction memory (fields)
Loop: sll  $t1, $s3, 2     80000    0   0  19   9   4   0
      add  $t1, $t1, $s6   80004    0   9  22   9   0  32
      lw   $t0, 0($t1)     80008   35   9   8   0
      bne  $t0, $s5, Exit  80012    5   8  21   2
      addi $s3, $s3, 1     80016    8  19  19   1
      j    Loop            80020    2  20000
Exit: …                    80024
Branching Far Away
▪ If branch target is too far to encode with 16-bit
offset, assembler rewrites the code
▪ Example
beq $s0,$s1, L1

bne $s0,$s1, L2
j L1
L2:



Addressing Mode Summary
1. Immediate addressing
   op | rs | rt | constant or address
2. Register addressing
   op | rs | rt | rd | shamt | funct → operand in Register
3. Base (displacement) addressing
   op | rs | rt | constant or address → Memory[Register + constant] (byte, halfword, or word)
4. PC-relative addressing
   op | rs | rt | constant or address → Memory[PC + constant × 4] (word)
5. Pseudo-direct addressing
   op | address → Memory[PC[31:28] : address × 4] (word)
Synchronization
▪ Two processors sharing an area of memory
▪ P1 writes, then P2 reads
▪ Data race if P1 and P2 don’t synchronize
▪ Result depends of order of accesses
▪ Hardware support required
▪ Atomic read/write memory operation
▪ No other access to the location allowed between the read
and write
▪ Could be a single instruction
▪ E.g., atomic swap of register ↔ memory
▪ Or an atomic pair of instructions



Synchronization in MIPS
▪ Load linked: ll rt, offset(rs)
▪ Store conditional: sc rt, offset(rs)
▪ Succeeds if location not changed since the ll
▪ Returns 1 in rt
▪ Fails if location is changed
▪ Returns 0 in rt
▪ Example: atomic swap (to test/set lock variable)
try: add $t0, $zero, $s4 # copy exchange value
ll $t1, 0($s1) # load linked
sc $t0, 0($s1) # store conditional
beq $t0, $zero, try # branch store fails
add $s4, $zero, $t1 # put load value in $s4
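MIPS ll/sc is not expressible in portable C, but C11's compare-exchange gives the same retry-until-unchanged pattern. A sketch of the atomic-swap loop above (an analogy, not the hardware mechanism; the function name is invented):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Atomically swap *loc with new_val and return the old value.
   The CAS retry loop plays the role of ll / sc / beq ... try. */
static uint32_t atomic_swap(_Atomic uint32_t *loc, uint32_t new_val) {
    uint32_t old = atomic_load(loc);                   /* like ll          */
    while (!atomic_compare_exchange_weak(loc, &old, new_val))
        ;                                              /* store failed: retry */
    return old;                                        /* loaded value     */
}
```

Testing-and-setting a lock variable with this routine behaves like the slide's sequence: the returned value tells the caller whether the lock was free.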
Translation and Startup
(Figure: translation hierarchy — C program → compiler → assembly → assembler → object module → linker, together with object libraries → executable → loader → memory.)
▪ Many compilers produce object modules directly
▪ Static linking: libraries are combined into the executable before it runs


Assembler Pseudo-instructions
▪ Most assembler instructions represent machine
instructions one-to-one
▪ Pseudo-instructions: figments of the assembler’s
imagination
move $t0, $t1 → add $t0, $zero, $t1
blt $t0, $t1, L → slt $at, $t0, $t1
bne $at, $zero, L
$at (register 1): assembler temporary



Producing an Object Module
▪ Assembler (or compiler) translates program into machine
instructions
▪ Provides information for building a complete program
from the pieces
▪ Header: described contents of object module
▪ Text segment: translated instructions
▪ Static data segment: data allocated for the life of the
program
▪ Relocation info: for contents that depend on absolute
location of loaded program
▪ Symbol table: global definitions and external refs
▪ Debug info: for associating with source code
Linking Object Modules
▪ Produces an executable image
▪ Merges segments
▪ Resolve labels (determine their addresses)
▪ Patch location-dependent and external refs
▪ Could leave location dependencies for fixing by
a relocating loader
▪ But with virtual memory, no need to do this
▪ Program can be loaded into absolute location in
virtual memory space
Loading a Program
▪ Load from image file on disk into memory
▪ 1.Read header to determine segment sizes
▪ 2.Create virtual address space
▪ 3.Copy text and initialized data into memory
▪ Or set page table entries so they can be faulted in
▪ 4.Set up arguments on stack
▪ 5.Initialize registers (including $sp, $fp, $gp)
▪ 6.Jump to startup routine
▪ Copies arguments to $a0, … and calls main
▪ When main returns, do exit syscall



Dynamic Linking
▪ Only link/load library procedure when it is
called
▪ Requires procedure code to be relocatable
▪ Avoids image bloat caused by static linking of
all (transitively) referenced libraries
▪ Automatically picks up new library versions



Lazy Linkage
(Figure: lazy dynamic linkage — a call first goes through an indirection table to a stub that loads the routine ID and jumps to the linker/loader; the linker/loader code maps the routine and patches the table, so later calls jump directly to the dynamically mapped code.)


Starting Java Applications
(Figure: a Java program is compiled to bytecodes — a simple, portable instruction set for the JVM; the JVM interprets the bytecodes, and a Just-In-Time (JIT) compiler compiles bytecodes of “hot” methods into native code for the host machine.)


C Sort Example
▪ Illustrates use of assembly instructions for a C bubble sort
function
▪ Swap procedure (leaf)
void swap(int v[], int k){
int temp;
temp = v[k];
v[k] = v[k+1];
v[k+1] = temp;
}
▪ v, k, temp in $a0, $a1, and $t0, respectively



The Procedure Swap
swap:
sll $t1, $a1, 2 # $t1 = k * 4
add $t1, $a0, $t1 # $t1 = v+(k*4)
# (address of v[k])
lw $t0, 0($t1) # $t0 (temp) = v[k]
lw $t2, 4($t1) # $t2 = v[k+1]
sw $t2, 0($t1) # v[k] = $t2 (v[k+1])
sw $t0, 4($t1) # v[k+1] = $t0 (temp)
jr $ra #return to calling routine
Effect of Compiler Optimization
▪ Bubblesort compiled with gcc for Pentium 4 under Linux:

Optimization  Relative performance  Instruction count  Clock cycles  CPI
none                 1.00               114,938          158,615     1.38
O1                   2.37                37,470           66,990     1.79
O2                   2.38                39,993           66,521     1.66
O3                   2.41                44,993           65,747     1.46


Effect of Language and Algorithm
▪ Relative performance (each sort normalized to its own unoptimized-C version):

                 C/none  C/O1  C/O2  C/O3  Java/int  Java/JIT
Bubblesort        1.00   2.37  2.38  2.41    0.12      2.13
Quicksort         1.00   1.50  1.50  1.91    0.05      0.29

▪ Speedup of quicksort vs. bubblesort:

                 C/none  C/O1  C/O2  C/O3  Java/int  Java/JIT
Speedup           2468   1562  1555  1955    1050       338


Lessons Learnt
▪ Instruction count and CPI are not good
performance indicators in isolation
▪ Compiler optimizations are sensitive to the
algorithm
▪ Java/JIT compiled code is significantly faster
than JVM interpreted
▪ Comparable to optimized C in some cases
▪ Nothing can fix a dumb algorithm!
Arrays vs. Pointers
▪ Array indexing involves
▪ Multiplying index by element size
▪ Adding to array base address
▪ Pointers correspond directly to memory
addresses
▪ Can avoid indexing complexity



Example: Clearing an Array

C version using indices:
clear1(int array[], int size) {
  int i;
  for (i = 0; i < size; i += 1)
    array[i] = 0;
}
       move $t0,$zero       # i = 0
loop1: sll  $t1,$t0,2       # $t1 = i * 4
       add  $t2,$a0,$t1     # $t2 = &array[i]
       sw   $zero,0($t2)    # array[i] = 0
       addi $t0,$t0,1       # i = i + 1
       slt  $t3,$t0,$a1     # $t3 = (i < size)
       bne  $t3,$zero,loop1 # if (…) goto loop1

C version using pointers:
clear2(int *array, int size) {
  int *p;
  for (p = &array[0]; p < &array[size]; p = p + 1)
    *p = 0;
}
       move $t0,$a0         # p = &array[0]
       sll  $t1,$a1,2       # $t1 = size * 4
       add  $t2,$a0,$t1     # $t2 = &array[size]
loop2: sw   $zero,0($t0)    # Memory[p] = 0
       addi $t0,$t0,4       # p = p + 4
       slt  $t3,$t0,$t2     # $t3 = (p < &array[size])
       bne  $t3,$zero,loop2 # if (…) goto loop2


Comparison of Array vs. Ptr
▪ Multiply “strength reduced” to shift
▪ Array version requires shift to be inside loop
▪ Part of index calculation for incremented i
▪ c.f. incrementing pointer
▪ Compiler can achieve same effect as manual
use of pointers
▪ Induction variable elimination
▪ Better to make program clearer and safer
ARM & MIPS Similarities
▪ ARM: the most popular embedded core
▪ Similar basic set of instructions to MIPS
ARM MIPS
Date announced 1985 1985
Instruction size 32 bits 32 bits
Address space 32-bit flat 32-bit flat
Data alignment Aligned Aligned
Data addressing modes 9 3
Registers 15 × 32-bit 31 × 32-bit
Input/output Memory mapped Memory mapped
Compare and Branch in ARM
▪ Uses condition codes for result of an
arithmetic/logical instruction
▪ Negative, zero, carry, overflow
▪ Compare instructions to set condition codes
without keeping the result
▪ Each instruction can be conditional
▪ Top 4 bits of instruction word: condition value
▪ Can avoid branches over single instructions
Instruction Encoding
(Figure: ARM and MIPS instruction formats compared — both fixed 32-bit instructions, with opcode, register, and immediate fields in broadly similar positions.)


The Intel x86 ISA
▪ Evolution with backward compatibility
▪ 8080 (1974): 8-bit microprocessor
▪ Accumulator, plus 3 index-register pairs
▪ 8086 (1978): 16-bit extension to 8080
▪ Complex instruction set (CISC)
▪ 8087 (1980): floating-point coprocessor
▪ Adds FP instructions and register stack
▪ 80286 (1982): 24-bit addresses, MMU
▪ Segmented memory mapping and protection
▪ 80386 (1985): 32-bit extension (now IA-32)
▪ Additional addressing modes and operations
▪ Paged memory mapping as well as segments



The Intel x86 ISA
▪ Further evolution…
▪ i486 (1989): pipelined, on-chip caches and FPU
▪ Compatible competitors: AMD, Cyrix, …
▪ Pentium (1993): superscalar, 64-bit datapath
▪ Later versions added MMX (Multi-Media eXtension) instructions
▪ The infamous FDIV bug
▪ Pentium Pro (1995), Pentium II (1997)
▪ New microarchitecture (see Colwell, The Pentium Chronicles)
▪ Pentium III (1999)
▪ Added SSE (Streaming SIMD Extensions) and associated registers
▪ Pentium 4 (2001)
▪ New microarchitecture
▪ Added SSE2 instructions



The Intel x86 ISA
▪ And further…
▪ AMD64 (2003): extended architecture to 64 bits
▪ EM64T – Extended Memory 64 Technology (2004)
▪ AMD64 adopted by Intel (with refinements)
▪ Added SSE3 instructions
▪ Intel Core (2006)
▪ Added SSE4 instructions, virtual machine support
▪ AMD64 (announced 2007): SSE5 instructions
▪ Intel declined to follow, instead…
▪ Advanced Vector Extension (announced 2008)
▪ Longer SSE registers, more instructions
▪ If Intel didn’t extend with compatibility, its competitors would!
▪ Technical elegance ≠ market success



Basic x86 Addressing Modes
▪ Two operands per instruction
Source/dest operand Second source operand
Register Register
Register Immediate
Register Memory
Memory Register
Memory Immediate

▪ Memory addressing modes


▪ Address in register
▪ Address = Rbase + displacement
▪ Address = Rbase + 2^scale × Rindex (scale = 0, 1, 2, or 3)
▪ Address = Rbase + 2^scale × Rindex + displacement
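The general base + scaled-index + displacement form can be computed with one shift and two adds, as a C sketch shows (function name invented; scale holds the 2-bit field, i.e. a multiplier of 1, 2, 4, or 8):

```c
#include <stdint.h>

/* Effective address = Rbase + 2^scale * Rindex + displacement,
   the most general x86 memory addressing mode on the slide. */
static uint32_t x86_effective_address(uint32_t base, int scale,
                                      uint32_t index, int32_t disp) {
    return base + (index << scale) + (uint32_t)disp;
}
```

For example, base 0x1000, scale 2, index 5, displacement 8 gives 0x1000 + 20 + 8 = 0x101C — the natural shape for indexing an array of 4-byte elements inside a struct.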
x86 Instruction Encoding
▪ Variable length encoding
▪ Postfix bytes specify
addressing mode
▪ Prefix bytes modify
operation
▪ Operand length, repetition,
locking, …



Implementing IA-32
▪ Complex instruction set makes implementation
difficult
▪ Hardware translates instructions to simpler
microoperations
▪ Simple instructions: 1–1
▪ Complex instructions: 1–many
▪ Microengine similar to RISC
▪ Market share makes this economically viable
▪ Comparable performance to RISC
▪ Compilers avoid complex instructions
ARM v8 Instructions
▪ In moving to 64-bit, ARM did a complete overhaul
▪ ARM v8 resembles MIPS
▪ Changes from v7:
▪ No conditional execution field
▪ Immediate field is 12-bit constant
▪ Dropped load/store multiple
▪ PC is no longer a GPR
▪ GPR set expanded to 32
▪ Addressing modes work for all word sizes
▪ Divide instruction
▪ Branch if equal/branch if not equal instructions



Fallacies
▪ Powerful instruction ⇒ higher performance
▪ Fewer instructions required
▪ But complex instructions are hard to implement
▪ May slow down all instructions, including simple ones
▪ Compilers are good at making fast code from simple
instructions
▪ Use assembly code for high performance
▪ But modern compilers are better at dealing with
modern processors
▪ More lines of code ⇒ more errors and less productivity



Fallacies
▪ Backward compatibility ⇒ instruction set
doesn’t change
▪ But they do accrete more instructions

(Figure: growth in the number of x86 instructions over time.)



Pitfalls
▪ Sequential words are not at sequential
addresses
▪ Increment by 4, not by 1!
▪ Keeping a pointer to an automatic variable
after procedure returns
▪ e.g., passing pointer back via an argument
▪ Pointer becomes invalid when stack popped
Concluding Remarks
▪ Design principles
▪ 1.Simplicity favors regularity
▪ 2.Smaller is faster
▪ 3.Make the common case fast
▪ 4.Good design demands good compromises
▪ Layers of software/hardware
▪ Compiler, assembler, hardware
▪ MIPS: typical of RISC ISAs
▪ c.f. x86



Concluding Remarks
▪ Measure MIPS instruction executions in
benchmark programs
▪ Consider making the common case fast
▪ Consider compromises
Instruction class MIPS examples SPEC2006 Int SPEC2006 FP
Arithmetic add, sub, addi 16% 48%
Data transfer lw, sw, lb, lbu, lh, lhu, sb 35% 36%
Logical and, or, nor, andi, ori, sll, srl, sra 12% 4%
Cond. Branch beq, bne, slt, slti, sltiu 34% 8%
Jump j, jr, jal 2% 0%
Computer Architecture
Faculty of Computer Science & Engineering - HCMUT

Chapter 3
Arithmetic for Computers
Binh Tran-Thanh
[email protected]
Arithmetic for Computers
▪ Operations on integers
▪ Addition and subtraction
▪ Multiplication and division
▪ Dealing with overflow
▪ Floating-point real numbers
▪ Representation and operations

8/15/2023 Faculty of Computer Science and Engineering 2


Integer Addition
▪ Example: 7 + 6

▪ Overflow if result out of range


▪ Adding +ve and –ve operands, no overflow
▪ Adding two +ve operands
▪ Overflow if result sign is 1
▪ Adding two –ve operands
▪ Overflow if result sign is 0

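The sign rules above can be checked in C. A sketch that tests the rule itself (it is not how MIPS hardware raises the exception; the function name is invented):

```c
#include <stdint.h>

/* Signed addition overflows iff both operands have the same sign
   and the result's sign differs — exactly the rule on the slide. */
static int add_overflows(int32_t a, int32_t b) {
    uint32_t sum = (uint32_t)a + (uint32_t)b;   /* unsigned add wraps, no UB */
    int same_sign   = (a >= 0) == (b >= 0);
    int result_sign = (int32_t)sum >= 0;
    return same_sign && (result_sign != (a >= 0));
}
```

Adding operands of opposite signs can never overflow, which is why the mixed-sign case needs no check.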


Integer Subtraction
▪ Add negation of second operand
▪ Example: 7 – 6 = 7 + (–6)
+7: 0000 0000 … 0000 0111
–6: 1111 1111 … 1111 1010
------------------------------------------------------
+1: 0000 0000 … 0000 0001
▪ Overflow if result out of range
▪ Subtracting two +ve or two –ve operands, no overflow
▪ Subtracting +ve from –ve operand
▪ Overflow if result sign is 0
▪ Subtracting –ve from +ve operand
▪ Overflow if result sign is 1



Dealing with Overflow
▪ Some languages (e.g., C) ignore overflow
▪ Use MIPS addu, addui, subu instructions
▪ Other languages (e.g., Ada, Fortran) require
raising an exception
▪ Use MIPS add, addi, sub instructions
▪ On overflow, invoke exception handler
▪ Save PC in exception program counter (EPC) register
▪ Jump to predefined handler address
▪ mfc0 (move from coprocessor reg) instruction can
retrieve EPC value, to return after corrective action



Multiplication
▪ Start with long-multiplication approach

    multiplicand      1000
    multiplier      × 1001
                      1000
                     00000
                    000000
                   1000000
    product        1001000

▪ Length of product is the sum of operand lengths
(Datapath: 64-bit Multiplicand register with shift left; 64-bit ALU; 32-bit Multiplier register with shift right; 64-bit Product register with write, under control test.)


Multiplication Hardware
(Flowchart:)
Start
1. Test Multiplier0 (the least significant bit of the Multiplier register)
   ▪ Multiplier0 = 1 → 1a. Add multiplicand to product and place the result in the Product register
   ▪ Multiplier0 = 0 → skip the add
2. Shift the Multiplicand register left 1 bit
3. Shift the Multiplier register right 1 bit
32nd repetition? No (< 32 repetitions): repeat from step 1. Yes: Done
(Datapath: 64-bit Multiplicand register with shift left; 64-bit ALU; 32-bit Multiplier register with shift right; 64-bit Product register, initially 0, with write; control tests Multiplier0.)
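The loop above is easy to mirror in software. A C sketch of the shift-and-add algorithm (a software model of the datapath, not the optimized hardware; the function name is invented):

```c
#include <stdint.h>

/* Shift-and-add multiply: test the multiplier LSB, conditionally add,
   shift multiplicand left and multiplier right, 32 iterations. */
static uint64_t shift_add_multiply(uint32_t multiplicand, uint32_t multiplier) {
    uint64_t product = 0;
    uint64_t mcand = multiplicand;      /* held in a 64-bit "register" */
    for (int i = 0; i < 32; i++) {
        if (multiplier & 1)             /* 1.  test Multiplier0        */
            product += mcand;           /* 1a. add to product          */
        mcand <<= 1;                    /* 2.  shift multiplicand left */
        multiplier >>= 1;               /* 3.  shift multiplier right  */
    }
    return product;
}
```

The 64-bit result matches ordinary multiplication, including the worst case where both operands are 0xFFFFFFFF.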


Optimized Multiplier
▪ Perform steps in parallel: add/shift
▪ One cycle per partial-product addition
▪ That’s ok, if frequency of multiplications is low
Multiplicand
32 bits

32-bit ALU

Shift right Control


Product
Write test
64 bits
Faster Multiplier
▪ Uses multiple adders
▪ Cost/performance tradeoff
▪ Can be pipelined
▪ Several multiplication performed in parallel
(Figure: a faster multiplier — the 32 partial products Mplier31 • Mcand … Mplier0 • Mcand feed a tree of 32-bit adders; pairs of partial sums are combined at each level, each adder passing one low-order product bit down, until the final level yields Product63 … Product0.)
MIPS Multiplication
▪ Two 32-bit registers for product
▪ HI: most-significant 32 bits
▪ LO: least-significant 32-bits
▪ Instructions
▪ mult rs, rt / multu rs, rt
▪ 64-bit product in HI/LO
▪ mfhi rd / mflo rd
▪ Move from HI/LO to rd
▪ Can test HI value to see if product overflows 32 bits
▪ mul rd, rs, rt
▪ Least-significant 32 bits of product → rd
Division
▪ Check for 0 divisor
▪ Long-division approach

                  1001      ← quotient
    divisor 1000 ) 1001010  ← dividend
                  -1000
                     10
                     101
                     1010
                    -1000
                       10   ← remainder

  ▪ If divisor ≤ dividend bits: 1 bit in quotient, subtract
  ▪ Otherwise: 0 bit in quotient, bring down next dividend bit
▪ Restoring division
  ▪ Do the subtract, and if remainder goes < 0, add divisor back
▪ Signed division
  ▪ Divide using absolute values
  ▪ Adjust sign of quotient and remainder as required
▪ n-bit operands yield n-bit quotient and remainder
8/15/2023 Facutly of Computer Science and Engineering 11
Division Hardware
(Flowchart — initially: divisor in the left half of the 64-bit Divisor register, dividend in the Remainder register:)
Start
1. Subtract the Divisor register from the Remainder register and place the result in the Remainder register
   ▪ Remainder ≥ 0 → 2a. Shift the Quotient register to the left, setting the new rightmost bit to 1
   ▪ Remainder < 0 → 2b. Restore the original value by adding the Divisor register to the Remainder register and placing the sum in the Remainder register; also shift the Quotient register to the left, setting the new least significant bit to 0
3. Shift the Divisor register right 1 bit
33rd repetition? No (< 33 repetitions): repeat from step 1. Yes: Done
(Datapath: 64-bit Divisor register with shift right; 64-bit ALU; 32-bit Quotient register with shift left; 64-bit Remainder register with write; control tests the sign of the remainder.)
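The restoring loop above can be modeled in C. A sketch (a software model, not the hardware; for clarity it assumes divisor < 2^31 so the 64-bit sign test is valid, and the function name is invented):

```c
#include <stdint.h>

/* Restoring division: subtract; if the remainder went negative,
   add the divisor back (restore) and shift a 0 into the quotient,
   otherwise shift in a 1. 33 iterations, divisor starts in the
   left half of a 64-bit value and shifts right each step. */
static void restoring_divide(uint32_t dividend, uint32_t divisor,
                             uint32_t *quotient, uint32_t *remainder) {
    uint64_t rem = dividend;                 /* dividend in right half */
    uint64_t div = (uint64_t)divisor << 32;  /* divisor in left half   */
    uint32_t q = 0;
    for (int i = 0; i < 33; i++) {
        rem -= div;                          /* 1. subtract            */
        if ((int64_t)rem >= 0) {
            q = (q << 1) | 1;                /* 2a. shift in 1         */
        } else {
            rem += div;                      /* 2b. restore            */
            q = q << 1;                      /*     shift in 0         */
        }
        div >>= 1;                           /* 3. shift divisor right */
    }
    *quotient = q;
    *remainder = (uint32_t)rem;
}
```

Running it on the slide's example, 1001010₂ / 1000₂ (74 / 8), yields quotient 1001₂ (9) and remainder 10₂ (2).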
Optimized Divider
▪ One cycle per partial-remainder subtraction
▪ Looks a lot like a multiplier!
▪ Same hardware can be used for both
Divisor
32 bits

32-bit ALU

Shift right Control


Remainder Shift left
Write test
64 bits
Faster Division
▪ Can’t use parallel hardware as in multiplier
▪ Subtraction is conditional on sign of remainder
▪ Faster dividers (e.g. SRT division) generate
multiple quotient bits per step
▪ Still require multiple steps



MIPS Division
▪ Use HI/LO registers for result
▪ HI: 32-bit remainder
▪ LO: 32-bit quotient
▪ Instructions
▪ div rs, rt / divu rs, rt
▪ No overflow or divide-by-0 checking
▪ Software must perform checks if required
▪ Use mfhi, mflo to access result
Floating Point
▪ Representation for non-integral numbers
▪ Including very small and very large numbers
▪ Like scientific notation
  ▪ –2.34 × 10^56   (normalized)
  ▪ +0.002 × 10^–4  (not normalized)
  ▪ +987.02 × 10^9  (not normalized)
▪ In binary
  ▪ ±1.xxxxxxx₂ × 2^yyyy
▪ Types float and double in C
Floating Point Standard
▪ Defined by IEEE Std 754 -1985
▪ Developed in response to divergence of
representations
▪ Portability issues for scientific code
▪ Now almost universally adopted
▪ Two representations
▪ Single precision (32-bit)
▪ Double precision (64-bit)



IEEE Floating-Point Format
Exponent: 8 bits (single), 11 bits (double); Fraction: 23 bits (single), 52 bits (double)
| S | Exponent | Fraction |

x = (−1)^S × (1 + Fraction) × 2^(Exponent − Bias)

▪ S: sign bit (0 ⇒ non-negative, 1 ⇒ negative)
▪ Normalize significand: 1.0 ≤ |significand| < 2.0
  ▪ Always has a leading pre-binary-point 1 bit, so no need to represent it explicitly (hidden bit)
  ▪ Significand is Fraction with the “1.” restored
▪ Exponent: excess representation: actual exponent + Bias
  ▪ Ensures exponent is unsigned
  ▪ Single: Bias = 127; Double: Bias = 1023



Single-Precision Range
▪ Exponents 00000000 and 11111111 reserved
▪ Smallest value
  ▪ Exponent: 00000001 ⇒ actual exponent = 1 – 127 = –126
  ▪ Fraction: 000…00 ⇒ significand = 1.0
  ▪ ±1.0 × 2^–126 ≈ ±1.2 × 10^–38
▪ Largest value
  ▪ Exponent: 11111110 ⇒ actual exponent = 254 – 127 = +127
  ▪ Fraction: 111…11 ⇒ significand ≈ 2.0
  ▪ ±2.0 × 2^+127 ≈ ±3.4 × 10^+38



Double-Precision Range
▪ Exponents 0000…00 and 1111…11 reserved
▪ Smallest value
  ▪ Exponent: 00000000001 ⇒ actual exponent = 1 – 1023 = –1022
  ▪ Fraction: 000…00 ⇒ significand = 1.0
  ▪ ±1.0 × 2^–1022 ≈ ±2.2 × 10^–308
▪ Largest value
  ▪ Exponent: 11111111110 ⇒ actual exponent = 2046 – 1023 = +1023
  ▪ Fraction: 111…11 ⇒ significand ≈ 2.0
  ▪ ±2.0 × 2^+1023 ≈ ±1.8 × 10^+308



Floating-Point Precision
▪ Relative precision
▪ all fraction bits are significant
▪ Single: approximately 2^–23
▪ Equivalent to 23 × log₁₀2 ≈ 23 × 0.3 ≈ 6 decimal digits of precision
▪ Double: approximately 2^–52
▪ Equivalent to 52 × log₁₀2 ≈ 52 × 0.3 ≈ 16 decimal digits of precision
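The precision limit shows up concretely at 2^24: a single-precision significand has 24 bits (23 stored plus the hidden bit), so it cannot distinguish 16777217 (2^24 + 1) from 16777216 (2^24), while a double can. A small C sketch (helper names are illustrative):

```c
/* Round-tripping through float loses the low-order bit above 2^24;
   double (53-bit significand) keeps it. */
static float  as_float(double x)  { return (float)x; }
static double as_double(double x) { return x; }
```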
Floating-Point Example
▪ Represent –0.75
▪ –0.75 = (–1)^1 × 1.1₂ × 2^–1
▪ S = 1
▪ Fraction = 1000…00₂
▪ Exponent = –1 + Bias
▪ Single: –1 + 127 = 126 = 01111110₂
▪ Double: –1 + 1023 = 1022 = 01111111110₂
▪ Single: 1 01111110 1000…00
▪ Double: 1 01111111110 1000…00
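On an IEEE 754 machine, the single-precision pattern above packs to the hexadecimal word 0xBF400000, which can be verified in C (the `bits_of` helper is illustrative):

```c
#include <stdint.h>
#include <string.h>

/* -0.75 = (-1)^1 x 1.1_2 x 2^-1:
   sign 1, exponent -1 + 127 = 126 = 01111110_2, fraction 100...0 */
static uint32_t bits_of(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```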
Floating-Point Example
▪ What number is represented by the single-
precision float
▪ 1 10000001 01000…00
▪ S = 1
▪ Fraction = 01000…00₂
▪ Exponent = 10000001₂ = 129
▪ x = (–1)^1 × (1 + 0.01₂) × 2^(129 – 127)
▪ = (–1) × 1.25 × 2^2
▪ = –5.0
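Going the other direction, the same bit pattern (0xC0A00000 in hex) decodes back to –5.0; a minimal C check (the `float_of` helper is illustrative):

```c
#include <stdint.h>
#include <string.h>

/* Rebuild a float from the 32-bit pattern 1 10000001 01000...00. */
static float float_of(uint32_t u) {
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}
```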
Your turn
▪ Represent –23.0625 in the IEEE 754 floating
point format
▪ Convert the value 0x44580000 (IEEE 754) to its floating-point value.
Denormal Numbers
▪ Exponent = 000...0 ⇒ hidden bit is 0
x = (–1)^S × (0 + Fraction) × 2^(1 – Bias)
▪ Smaller than normal numbers
▪ allow for gradual underflow, with diminishing precision
▪ Denormal with Fraction = 000...0
x = (–1)^S × (0 + 0) × 2^(1 – Bias) = 0.0
⇒ Two representations of 0.0!
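Both consequences can be observed in C on IEEE hardware with subnormal support (the default on mainstream CPUs): +0.0 and –0.0 have different bit patterns yet compare equal, and halving the smallest normal (FLT_MIN) underflows gradually to a nonzero denormal. The `bits_of` helper name is illustrative:

```c
#include <float.h>
#include <stdint.h>
#include <string.h>

/* View a float's raw bit pattern to distinguish the two zeros. */
static uint32_t bits_of(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```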
Infinities and NaNs
▪ Exponent = 111...1, Fraction = 000...0
▪ ±Infinity
▪ Can be used in subsequent calculations, avoiding
need for overflow check
▪ Exponent = 111...1, Fraction ≠ 000...0
▪ Not-a-Number (NaN)
▪ Indicates illegal or undefined result
▪ e.g., 0.0 / 0.0
▪ Can be used in subsequent calculations
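These special encodings are produced by ordinary arithmetic on IEEE hardware: dividing a nonzero by zero yields ±infinity, 0.0/0.0 yields NaN, and a NaN compares unequal even to itself. A small C sketch (the `quotient` wrapper is illustrative; assumes IEEE 754 semantics, not `-ffast-math`):

```c
#include <math.h>

/* Exponent all-ones encodes +/-infinity (fraction 0) or NaN (fraction != 0);
   both propagate through later operations instead of trapping. */
static double quotient(double a, double b) { return a / b; }
```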
Floating-Point Addition
▪ Consider a 4-digit decimal example
▪ 9.999 × 10^1 + 1.610 × 10^–1
▪ 1. Align decimal points
▪ Shift number with smaller exponent
▪ 9.999 × 10^1 + 0.016 × 10^1
▪ 2. Add significands
▪ 9.999 × 10^1 + 0.016 × 10^1 = 10.015 × 10^1
▪ 3. Normalize result & check for over/underflow
▪ 1.0015 × 10^2
▪ 4. Round and renormalize if necessary
▪ 1.002 × 10^2
Floating-Point Addition
▪ Now consider a 4-digit binary example
▪ 1.000₂ × 2^–1 + –1.110₂ × 2^–2 (0.5 + –0.4375)
▪ 1. Align binary points
▪ Shift number with smaller exponent
▪ 1.000₂ × 2^–1 + –0.111₂ × 2^–1
▪ 2. Add significands
▪ 1.000₂ × 2^–1 + –0.111₂ × 2^–1 = 0.001₂ × 2^–1
▪ 3. Normalize result & check for over/underflow
▪ 1.000₂ × 2^–4, with no over/underflow
▪ 4. Round and renormalize if necessary
▪ 1.000₂ × 2^–4 (no change) = 0.0625
FP Adder Hardware
▪ Much more complex than integer adder
▪ Doing it in one clock cycle would take too
long
▪ Much longer than integer operations
▪ Slower clock would penalize all instructions
▪ FP adder usually takes several cycles
▪ Can be pipelined
FP Adder Hardware
[Figure: FP adder datapath — Step 1: a small ALU compares the two exponents and the significand of the smaller number is shifted right; Step 2: the big ALU adds the significands; Step 3: the result is normalized (shift left or right, increment or decrement the exponent); Step 4: rounding hardware rounds the result, renormalizing if needed.]
Floating-Point Multiplication
▪ Consider a 4-digit decimal example
▪ 1.110 × 10^10 × 9.200 × 10^–5
▪ 1. Add exponents
▪ For biased exponents, subtract bias from sum
▪ New exponent = 10 + –5 = 5
▪ 2. Multiply significands
▪ 1.110 × 9.200 = 10.212 ⇒ 10.212 × 10^5
▪ 3. Normalize result & check for over/underflow
▪ 1.0212 × 10^6
▪ 4. Round and renormalize if necessary
▪ 1.021 × 10^6
▪ 5. Determine sign of result from signs of operands
▪ +1.021 × 10^6
Floating-Point Multiplication
▪ Now consider a 4-digit binary example
▪ 1.000₂ × 2^–1 × –1.110₂ × 2^–2 (0.5 × –0.4375)
▪ 1. Add exponents
▪ Unbiased: –1 + –2 = –3
▪ Biased: (–1 + 127) + (–2 + 127) = –3 + 254 – 127 = –3 + 127
▪ 2. Multiply significands
▪ 1.000₂ × 1.110₂ = 1.110₂ ⇒ 1.110₂ × 2^–3
▪ 3. Normalize result & check for over/underflow
▪ 1.110₂ × 2^–3 (no change) with no over/underflow
▪ 4. Round and renormalize if necessary
▪ 1.110₂ × 2^–3 (no change)
▪ 5. Determine sign: +ve × –ve ⇒ –ve
▪ –1.110₂ × 2^–3 = –0.21875
FP Arithmetic Hardware
▪ FP multiplier is of similar complexity to FP adder
▪ But uses a multiplier for significands instead of an
adder
▪ FP arithmetic hardware usually does
▪ Addition, subtraction, multiplication, division,
reciprocal, square-root
▪ FP  integer conversion
▪ Operations usually takes several cycles
▪ Can be pipelined
FP Instructions in MIPS
▪ FP hardware is coprocessor 1
▪ Adjunct processor that extends the ISA
▪ Separate FP registers
▪ 32 single-precision: $f0, $f1, … $f31
▪ Paired for double-precision: $f0/$f1, $f2/$f3, …
▪ Release 2 of MIPS ISA supports 32 × 64-bit FP reg’s
▪ FP instructions operate only on FP registers
▪ Programs generally don’t do integer ops on FP data, or vice versa
▪ More registers with minimal code-size impact
▪ FP load and store instructions
▪ lwc1, ldc1, swc1, sdc1
▪ e.g., ldc1 $f8, 32($sp)
FP Instructions in MIPS
▪ Single-precision arithmetic
▪ add.s, sub.s, mul.s, div.s
▪ e.g., add.s $f0, $f1, $f6
▪ Double-precision arithmetic
▪ add.d, sub.d, mul.d, div.d
▪ e.g., mul.d $f4, $f4, $f6
▪ Single- and double-precision comparison
▪ c.xx.s, c.xx.d (xx is eq, lt, le, …)
▪ Sets or clears FP condition-code bit
▪ e.g. c.lt.s $f3, $f4
▪ Branch on FP condition code true or false
▪ bc1t, bc1f
▪ e.g., bc1t TargetLabel
FP Example: °F to °C
▪ C code:
float f2c (float fahr) {
return ((5.0/9.0)*(fahr - 32.0));
}
▪ fahr in $f12, result in $f0, literals in global memory space
▪ Compiled MIPS code:
f2c: lwc1 $f16, const5($gp)
lwc1 $f18, const9($gp)
div.s $f16, $f16, $f18
lwc1 $f18, const32($gp)
sub.s $f18, $f12, $f18
mul.s $f0, $f16, $f18
jr $ra
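The compiled routine above performs the same computation as the C source; a host-side check of a few known points (0 °C at freezing, 100 °C at boiling) confirms the formula:

```c
/* The same conversion the MIPS routine computes:
   celsius = (5/9) * (fahr - 32), all in single precision. */
static float f2c(float fahr) {
    return (5.0f / 9.0f) * (fahr - 32.0f);
}
```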
FP Example: Array Multiplication
▪ X=X+Y×Z
▪ All 32 × 32 matrices, 64-bit double-precision elements
▪ C code:
void mm ( double x[][], double y[][],
double z[][]){
int i, j, k;
for (i = 0; i != 32; i = i + 1)
for (j = 0; j != 32; j = j + 1)
for (k = 0; k != 32; k = k + 1)
x[i][j] = x[i][j]
+ y[i][k] * z[k][j];
}
▪ Addresses of x, y, z in $a0, $a1, $a2, and
i, j, k in $s0, $s1, $s2
FP Example: Array Multiplication
MIPS code:
li $t1, 32 # $t1 = 32 (row size/loop end)
li $s0, 0 # i = 0; initialize 1st for loop
L1: li $s1, 0 # j = 0; restart 2nd for loop
L2: li $s2, 0 # k = 0; restart 3rd for loop
sll $t2, $s0, 5 # $t2 = i * 32 (size of row of x)
addu $t2, $t2, $s1 # $t2 = i * size(row) + j
sll $t2, $t2, 3 # $t2 = byte offset of [i][j]
addu $t2, $a0, $t2 # $t2 = byte address of x[i][j]
l.d $f4, 0($t2) # $f4 = 8 bytes of x[i][j]
L3: sll $t0, $s2, 5 # $t0 = k * 32 (size of row of z)
addu $t0, $t0, $s1 # $t0 = k * size(row) + j
sll $t0, $t0, 3 # $t0 = byte offset of [k][j]
addu $t0, $a2, $t0 # $t0 = byte address of z[k][j]
l.d $f16, 0($t0) # $f16 = 8 bytes of z[k][j] ...
FP Example: Array Multiplication
...
sll $t0, $s0, 5 # $t0 = i*32 (size of row of y)
addu $t0, $t0, $s2 # $t0 = i*size(row) + k
sll $t0, $t0, 3 # $t0 = byte offset of [i][k]
addu $t0, $a1, $t0 # $t0 = byte address of y[i][k]
l.d $f18, 0($t0) # $f18 = 8 bytes of y[i][k]
mul.d $f16, $f18, $f16 # $f16 = y[i][k] * z[k][j]
add.d $f4, $f4, $f16 # f4=x[i][j] + y[i][k]*z[k][j]
addiu $s2, $s2, 1 # k = k + 1
bne $s2, $t1, L3 # if (k != 32) go to L3
s.d $f4, 0($t2) # x[i][j] = $f4
addiu $s1, $s1, 1 # $j = j + 1
bne $s1, $t1, L2 # if (j != 32) go to L2
addiu $s0, $s0, 1 # $i = i + 1
bne $s0, $t1, L1 # if (i != 32) go to L1
Accurate Arithmetic
▪ IEEE Std 754 specifies additional rounding control
▪ Extra bits of precision (guard, round, sticky)
▪ Choice of rounding modes
▪ Allows programmer to fine-tune numerical behavior
of a computation
▪ Not all FP units implement all options
▪ Most programming languages and FP libraries just use
defaults
▪ Trade-off between hardware complexity,
performance, and market requirements
Subword Parallelism
▪ Graphics and audio applications can take
advantage of performing simultaneous operations
on short vectors
▪ Example: 128-bit adder:
▪ Sixteen 8-bit adds
▪ Eight 16-bit adds
▪ Four 32-bit adds
▪ Also called data-level parallelism, vector
parallelism, or Single Instruction, Multiple Data
(SIMD)
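The idea can be sketched even without vector hardware: pack four 8-bit lanes into one 32-bit word and add all lanes with a single integer addition, masking so carries cannot cross lane boundaries (a classic SWAR trick; the function name `add_4x8` is illustrative):

```c
#include <stdint.h>

/* Four independent 8-bit lane additions in one 32-bit word.
   Mask off each lane's top bit, add the low 7 bits of every lane
   at once, then patch the top bits back in with XOR. */
static uint32_t add_4x8(uint32_t a, uint32_t b) {
    uint32_t sum = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu); /* carry-safe low bits */
    return sum ^ ((a ^ b) & 0x80808080u);                 /* restore lane top bits */
}
```

Each lane wraps modulo 256 independently, exactly as a hardware 4×8-bit SIMD add would.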
x86 FP Architecture
▪ Originally based on 8087 FP coprocessor
▪ 8 × 80-bit extended-precision registers
▪ Used as a push-down stack
▪ Registers indexed from TOS: ST(0), ST(1), …
▪ FP values are 32-bit or 64-bit in memory
▪ Converted on load/store of memory operand
▪ Integer operands can also be converted
on load/store
▪ Very difficult to generate and optimize code
▪ Result: poor FP performance
x86 FP Instructions
▪ Optional variations
▪ I: integer operand
▪ P: pop operand from stack
▪ R: reverse operand order
▪ But not all combinations allowed
Data transfer    | Arithmetic        | Compare      | Transcendental
FILD mem/ST(i)   | FIADDP mem/ST(i)  | FICOMP       | FPATAN
FISTP mem/ST(i)  | FISUBRP mem/ST(i) | FIUCOMP      | F2XM1
FLDPI            | FIMULP mem/ST(i)  | FSTSW AX/mem | FCOS
FLD1             | FIDIVRP mem/ST(i) |              | FPTAN
FLDZ             | FSQRT             |              | FPREM
                 | FABS              |              | FSIN
                 | FRNDINT           |              | FYL2X
Streaming SIMD Extension 2 (SSE2)
▪ Adds 4 × 128-bit registers
▪ Extended to 8 registers in AMD64/EM64T
▪ Can be used for multiple FP operands
▪ 2 × 64-bit double precision
▪ 4 × 32-bit single precision
▪ Instructions operate on them simultaneously
▪ Single-Instruction Multiple-Data
Matrix Multiply
▪ Unoptimized code:
1. void dgemm (int n, double* A, double* B, double* C){
2. for (int i = 0; i < n; ++i){
3. for (int j = 0; j < n; ++j){
4. /* cij = C[i][j] */
5. double cij = C[i+j*n];
6. for(int k = 0; k < n; k++ )
7. cij += A[i+k*n] * B[k+j*n];
8. /* cij += A[i][k]*B[k][j] */
9. C[i+j*n] = cij; /* C[i][j] = cij */
10. }
11. }
12. }
Matrix Multiply
x86 assembly code:
1. vmovsd (%r10),%xmm0 # Load 1 element of C into %xmm0
2. mov %rsi,%rcx # register %rcx = %rsi
3. xor %eax,%eax # register %eax = 0
4. vmovsd (%rcx),%xmm1 # Load 1 element of B into %xmm1
5. add %r9,%rcx # register %rcx = %rcx + %r9
6. vmulsd (%r8,%rax,8),%xmm1,%xmm1 # Multiply %xmm1, element of A
7. add $0x1,%rax # register %rax = %rax + 1
8. cmp %eax,%edi # compare %eax to %edi
9. vaddsd %xmm1,%xmm0,%xmm0 # add %xmm1, %xmm0
10. jg 30 <dgemm+0x30> # jump if %edi > %eax
11. add $0x1,%r11d # register %r11 = %r11 + 1
12. vmovsd %xmm0,(%r10) # Store %xmm0 into C element
Matrix Multiply
Optimized C code:
1. #include <x86intrin.h>
2. void dgemm (int n, double* A, double* B, double* C)
3. {
4. for ( int i = 0; i < n; i+=4 )
5. for ( int j = 0; j < n; j++ ) {
6. __m256d c0 = _mm256_load_pd(C+i+j*n);
/* c0 = C[i][j] */
7. for( int k = 0; k < n; k++ )
8. c0 = _mm256_add_pd(c0,
/* c0 += A[i][k]*B[k][j] */
9. _mm256_mul_pd(_mm256_load_pd(A+i+k*n),
10. _mm256_broadcast_sd(B+k+j*n)));
11. _mm256_store_pd(C+i+j*n, c0); /* C[i][j] = c0 */
12. }
13. }
Matrix Multiply
Optimized x86 assembly code:
1. vmovapd (%r11),%ymm0 # Load 4 elements of C into %ymm0
2. mov %rbx,%rcx # register %rcx = %rbx
3. xor %eax,%eax # register %eax = 0
4. vbroadcastsd (%rax,%r8,1),%ymm1 # Make 4 copies of B element
5. add $0x8,%rax # register %rax = %rax + 8
6. vmulpd (%rcx),%ymm1,%ymm1 # Parallel mul %ymm1,4 A elements
7. add %r9,%rcx # register %rcx = %rcx + %r9
8. cmp %r10,%rax # compare %r10 to %rax
9. vaddpd %ymm1,%ymm0,%ymm0 # Parallel add %ymm1, %ymm0
10. jne 50 <dgemm+0x50> # jump if %r10 != %rax
11. add $0x1,%esi # register %esi = %esi + 1
12. vmovapd %ymm0,(%r11) # Store %ymm0 into 4 C elements
Right Shift and Division
▪ Left shift by i places multiplies an integer by 2i
▪ Right shift divides by 2i?
▪ Only for unsigned integers
▪ For signed integers
▪ Arithmetic right shift: replicate the sign bit
▪ e.g., –5 / 4
▪ 11111011₂ >> 2 = 11111110₂ = –2
▪ Rounds toward –∞
▪ c.f. 11111011₂ >>> 2 = 00111110₂ = +62
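The rounding difference is visible in C: integer division truncates toward zero, while an arithmetic right shift rounds toward –∞, so –5/4 and –5 >> 2 disagree. (Right-shifting a negative value is implementation-defined in C; mainstream compilers implement it as an arithmetic shift. The helper names are illustrative.)

```c
/* Contrast the two ways of "dividing by 2^k" for signed ints. */
static int shift_div(int x, int k) { return x >> k; }   /* rounds toward -infinity */
static int trunc_div(int x, int d) { return x / d; }    /* truncates toward zero   */
```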
Associativity
▪ Parallel programs may interleave operations in
unexpected orders
▪ Assumptions of associativity may fail
x = –1.50E+38, y = 1.50E+38, z = 1.0
(x + y) + z = (–1.50E+38 + 1.50E+38) + 1.0 = 1.00E+00
x + (y + z) = –1.50E+38 + (1.50E+38 + 1.0) = 0.00E+00
▪ Need to validate parallel programs under varying degrees of parallelism
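The non-associativity example can be reproduced directly in single precision: 1.0 is far below the precision of 1.5 × 10^38, so it vanishes when added first (the helper names are illustrative):

```c
/* Floating-point addition is not associative: the grouping decides
   whether the small operand z survives the large cancelling pair. */
static float sum_left(float x, float y, float z)  { return (x + y) + z; }
static float sum_right(float x, float y, float z) { return x + (y + z); }
```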
Who Cares About FP Accuracy?
▪ Important for scientific code
▪ But for everyday consumer use?
▪ “My bank balance is out by 0.0002¢!” 
▪ The Intel Pentium FDIV bug
▪ The market expects accuracy
▪ See Colwell, The Pentium Chronicles
Concluding Remarks
▪ Bits have no inherent meaning
▪ Interpretation depends on the instructions applied
▪ Computer representations of numbers
▪ Finite range and precision
▪ Need to account for this in programs
▪ ISAs support arithmetic
▪ Signed and unsigned integers
▪ Floating-point approximation to reals
▪ Bounded range and precision
▪ Operations can overflow and underflow
▪ MIPS ISA
▪ Core instructions: 54 most frequently used
▪ 100% of SPECINT, 97% of SPECFP
▪ Other instructions: less frequent
Computer Architecture
Faculty of Computer Science & Engineering - HCMUT

Chapter 4: The Processor


Binh Tran-Thanh
[email protected]
This chapter's contents
▪ The basic units in the CPU
▪ Functions of the major components in the CPU
▪ Instruction execution at hardware level
▪ Performance and trade-offs among CPUs
This chapter's outcomes
Students who complete this course will be
able to:
▪ Explain the structure of a computer system
and deeply understand how it works at the
hardware level.
Introduction
▪ CPU performance factors
▪ Instruction count
▪ Determined by ISA and compiler
▪ CPI and Cycle time
▪ Determined by CPU hardware
▪ We will examine two MIPS implementations
▪ A simplified version (single clock cycle)
▪ A more realistic pipelined version
▪ Simple subset, shows most aspects
▪ Memory reference: lw, sw
▪ Arithmetic/logical: add, sub, and, or, slt
▪ Control transfer: beq, j
The simplified processor.
[Figure: single-cycle MIPS datapath with control — PC and instruction memory, control unit decoding Instruction[31–26], register file, 16→32-bit sign-extend unit, ALU with ALU control (from Instruction[5–0]), data memory, adders for PC+4 and the branch target, and multiplexers driven by RegDst, ALUSrc, MemtoReg, and Branch.]
MIPS Instruction Execution Cycle
1. Instruction Fetch: get the instruction from memory; its address is in the Program Counter (PC)
2. Instruction Decode: find out the operation required and its control signals; get the operand(s) needed for the operation
3. Execution: perform the required operation
4. Memory: access memory (load/store)
5. Write Back: store the result of the operation, then continue with the next instruction
[Figure: the five-stage cycle — Fetch → Decode → Execution → Memory → Write Back → next instruction.]
Instruction Execution
▪ PC → instruction memory, fetch instruction
▪ Register numbers → register file, read registers
▪ Depending on instruction class
▪ Use ALU to calculate
▪ Arithmetic result
▪ Memory address for load/store
▪ Branch target address
▪ Access data memory for load/store
▪ PC ← target address or PC + 4
CPU Overview
[Figure: CPU overview — the PC supplies the address to instruction memory; register numbers from the instruction select registers in the register file; the ALU computes results and addresses; data memory is read or written; adders compute PC+4 and the branch target.]
Multiplexers
▪ Can’t just join wires together
▪ Use multiplexers
[Figure: CPU overview with multiplexers selecting the next PC (PC+4 vs. branch target), the second ALU input, and the data written back to the register file.]
Control
[Figure: CPU overview with control added — the control unit generates Branch, ALU operation, MemWrite, MemRead, and RegWrite signals and drives the multiplexers.]
Logic Design Basics
▪ Information encoded in binary
▪ Low voltage = 0, High voltage = 1
▪ One wire per bit
▪ Multi-bit data encoded on multi-wire buses
▪ Combinational element
▪ Operate on data
▪ Output is a function of input
▪ State (sequential) elements
▪ Store information
Combinational Elements
▪ AND-gate: Y = A & B
▪ Adder: Y = A + B
▪ Multiplexer: Y = S ? I1 : I0
▪ Arithmetic/Logic Unit: Y = F(A, B)
[Figure: schematic symbols for the AND gate, adder, multiplexer, and ALU.]
Sequential Elements
▪ Register: stores data in a circuit
▪ Uses a clock signal to determine when to update the stored value
▪ Edge-triggered: update when Clk changes from 0 to 1
[Figure: D flip-flop timing — Q takes the value of D on each rising clock edge.]
Sequential Elements
▪ Register with write control
▪ Only updates on clock edge when write control input is 1
▪ Used when stored value is required later
[Figure: D flip-flop with Write enable — Q updates on the rising edge only when Write = 1.]
Clocking Methodology
▪ Combinational logic transforms data during
clock cycles
▪ Between clock edges
▪ Input from state elements, output to state element
▪ Longest delay determines clock period
[Figure: state element → combinational logic → state element; inputs come from state elements, outputs go to state elements, and the longest combinational delay sets the clock cycle.]
Building a Datapath
▪ Datapath
▪ Elements that process data and addresses in the
CPU
▪ Registers, ALUs, mux’s, memories, …
▪ We will build a MIPS data-path incrementally
▪ Refining the overview design
Instruction type (review)
op rs rt rd shamt funct
6 bits 5 bits 5 bits 5 bits 5 bits 6 bits

op rs rt constant or address
6 bits 5 bits 5 bits 16 bits

op address
6 bits 26 bits
▪ Note:
▪ All MIPS instructions are 32-bit wise
▪ All MIPS instructions contain 6-bit OP
(most significant)
Instruction Fetch
[Figure: instruction fetch — the 32-bit PC register supplies the read address to the instruction memory (which stores the program's instructions); an adder increments the PC by 4 for the next instruction.]
Instruction Fetch (cont.)
[Figure: the same fetch datapath with a sample program in instruction memory:]
add $t1, $s0, $t0
lbu $t2, 0($t1)
add $t3, $s0, $a0
sb $t2, 0($t3)
beq $t2, $zero, exit
addi $s0, $s0, 1
j loop
...
R-Format Instructions
▪ Read two register operands (from the register file, a collection of 32 registers)
▪ Perform arithmetic/logical operation
▪ Write register result
[Figure: the register file (three 5-bit register numbers, two read-data outputs, one write-data input, RegWrite control) feeding the ALU (4-bit ALU operation input, Zero output).]
R-Format Instructions (example)
add $s0, $a0, $t0
R-type encoding: op = 000000 (31:26), rs = $a0 = 4 (25:21), rt = $t0 = 8 (20:16), rd = $s0 = 16 (15:11), shamt = 00000 (10:6), funct = 100000 (5:0)
[Figure: register numbers 4 and 8 read the contents of $a0 and $t0 into the ALU; Write register 16 ($s0) receives the ALU result.]
Your turn
▪ Assume: $4 = 104, $5 = 105, …, $31 = 131
▪ What is the value of Read data 1 if we assign 6 to Read register 1?
▪ What is the value of Read data 2 if Read register 2 = 12?
▪ If Write register = 10 and Write data = 12:
▪ which register is written, and
▪ what value does it hold (in case RegWrite = 0/1)?
Load/Store Instructions
▪ Read register operands
▪ Calculate address using 16-bit offset
▪ Use ALU, but sign-extend offset
▪ Load: Read memory and update register
▪ Store: Write register value to memory
[Figure: the data memory unit (Address and Write data inputs, Read data output, MemRead/MemWrite controls) and the sign-extension unit (16-bit offset → 32 bits).]
Branch Instructions
▪ Read register operands
▪ Compare operands
▪ Use ALU, subtract and check Zero output
▪ Calculate target address
▪ Sign-extend displacement
▪ Shift left 2 places (word displacement)
▪ Add to PC + 4
▪ Already calculated by instruction fetch
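The target calculation above can be captured in a few lines of C (the helper name `branch_target` is illustrative):

```c
#include <stdint.h>

/* Branch target for beq/bne: sign-extend the 16-bit displacement,
   shift left 2 (word displacement), and add to PC+4. */
static uint32_t branch_target(uint32_t pc_plus4, int16_t offset16) {
    int32_t disp = (int32_t)offset16 * 4;   /* sign-extend, then x4 */
    return pc_plus4 + (uint32_t)disp;
}
```

A displacement of +3 skips three words forward; –1 branches back one word, with the unsigned addition wrapping exactly as the hardware adder does.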
Branch Instructions
[Figure: branch datapath — the 16-bit displacement is sign-extended (sign bit replicated), shifted left 2, and added to PC+4 to form the branch target; the ALU subtracts the two register operands and its Zero output goes to the branch control logic.]
Composing the Elements
▪ First-cut data path does an instruction in
one clock cycle
▪ Each datapath element can only do one
function at a time
▪ Hence, we need separate instruction and data
memories
▪ Use multiplexers where alternate data
sources are used for different instructions
R-Type/Load/Store Datapath
▪ add $s0, $a0, $t0
[Figure: combined R-type/load/store datapath — register file, ALUSrc mux selecting the second register or the sign-extended immediate, ALU, data memory, and MemtoReg mux selecting the ALU result or the memory read data for the register write.]
R-Type/Load/Store Datapath
▪ lw $s0, 4($a0)
[Figure: same datapath; the active path reads $a0, adds the sign-extended offset 4 in the ALU, reads data memory at that address, and writes the result into $s0.]
R-Type/Load/Store Datapath
▪ sw $s0, 4($a0)
[Figure: same datapath; the active path computes the address $a0 + 4 in the ALU and writes the value of $s0 into data memory.]
Full Datapath
[Figure: full single-cycle datapath — instruction fetch with the PC+4 adder, branch adder (sign-extend and shift-left-2) with PCSrc mux, register file, ALUSrc mux, ALU, data memory, and MemtoReg mux, controlled by ALU operation, MemWrite, MemRead, and RegWrite.]
ALU Control
▪ ALU used for
▪ Load/Store: F = add
▪ Branch: F = subtract
▪ R-type: F depends on funct field

ALU control | Function
0000 | AND
0001 | OR
0010 | add
0110 | subtract
0111 | set-on-less-than
1100 | NOR
Closer look at a 1-bit ALU
ALU control (Ainvert, Binvert, Operation) | Function
0 0 00 | AND
0 0 01 | OR
0 0 10 | ADD
0 1 10 | SUB
0 1 11 | SLT
1 1 00 | NOR
[Figure: 1-bit ALU without SLT — Ainvert/Binvert inversion muxes, AND gate, OR gate, full adder with CarryIn/CarryOut, and an Operation mux selecting the Result.]
Closer look at a 1-bit ALU
[Figure: 1-bit ALU with SLT — the same control table applies; a Less input feeds Result position 3, bit 0's Less input takes the Set output of bit 31, and the most significant ALU adds overflow detection.]
ALU Control
▪ Assume 2-bit ALUOp derived from opcode
▪ Combinational logic derives ALU control
opcode ALUOp Operation funct ALU function ALU control
lw 00 load word XXXXXX add 0010
sw 00 store word XXXXXX add 0010
beq 01 branch equal XXXXXX subtract 0110
R-type 10 add 100000 add 0010
subtract 100010 subtract 0110
AND 100100 AND 0000
OR 100101 OR 0001
set-on-less-than 101010 set-on-less-than 0111
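The table is pure combinational logic, so it maps directly onto a small C function (a sketch; the enum and function names are illustrative, and the ALU-control encodings are the 4-bit values from the table):

```c
/* 4-bit ALU control values from the table above. */
enum { ALU_AND = 0x0, ALU_OR = 0x1, ALU_ADD = 0x2,
       ALU_SUB = 0x6, ALU_SLT = 0x7 };

/* 2-bit ALUOp (00 = lw/sw, 01 = beq, 10 = R-type) plus the 6-bit
   funct field select the ALU control value. */
static int alu_control(int aluop, int funct) {
    if (aluop == 0) return ALU_ADD;  /* lw/sw: address = base + offset */
    if (aluop == 1) return ALU_SUB;  /* beq: subtract and test Zero */
    switch (funct) {                 /* R-type: decode funct */
    case 0x20: return ALU_ADD;       /* add, funct 100000 */
    case 0x22: return ALU_SUB;       /* sub, funct 100010 */
    case 0x24: return ALU_AND;       /* and, funct 100100 */
    case 0x25: return ALU_OR;        /* or,  funct 100101 */
    case 0x2A: return ALU_SLT;       /* slt, funct 101010 */
    }
    return -1;                       /* undefined combination */
}
```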
The Main Control Unit
▪ Control signals derived from instruction
R-type: op (31:26) | rs (25:21) | rt (20:16) | rd (15:11) | shamt (10:6) | funct (5:0)
Load/Store: op = 35 or 43 (31:26) | rs (25:21) | rt (20:16) | address (15:0)
Branch: op = 4 (31:26) | rs (25:21) | rt (20:16) | address (15:0)
▪ opcode: always read
▪ rs: always read
▪ rt: read, except for load
▪ rd (R-type) or rt (load): write destination
▪ address: sign-extend and add
Datapath With Control
[Figure: complete single-cycle datapath with the main control unit — RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, and RegWrite are decoded from Instruction[31–26]; the ALU control decodes Instruction[5–0].]
R-Type Instruction
[Figure: the same datapath with the active path for an R-type instruction highlighted — two register reads, ALU operation, result written to rd.]
Load Instruction
[Figure: active path for lw — base register plus sign-extended offset through the ALU, data memory read, result written to rt.]
Branch-on-Equal Instruction
[Figure: active path for beq — the ALU subtracts the two registers; its Zero output, ANDed with Branch, selects PC+4 + (sign-extended offset × 4) as the next PC.]
Exercise
▪ What are the values of control signal of
following instructions?
bne $s1, $s2, exit
sw $s1, 4($a0)
Implementing Jumps
Jump: op = 2 (31:26) | address (25:0)
▪ Jump uses word address
▪ Update PC with concatenation of
▪ Top 4 bits of old PC
▪ 26-bit jump address
▪ 00
▪ Need an extra control signal decoded from
opcode
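The concatenation can be written out explicitly in C (the helper name `jump_target` is illustrative):

```c
#include <stdint.h>

/* Next PC for a MIPS j instruction: top 4 bits of PC+4,
   then the 26-bit jump field, then 00 (word alignment). */
static uint32_t jump_target(uint32_t pc_plus4, uint32_t addr26) {
    return (pc_plus4 & 0xF0000000u) | (addr26 << 2);
}
```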
Datapath With Jumps Added
[Figure: datapath with jump added — Instruction[25–0] is shifted left 2 and concatenated with PC+4[31–28] to form the jump address; a Jump control signal, decoded from Instruction[31–26], selects it as the next PC.]
Performance Issues
▪ Longest delay determines clock period
▪ Critical path: load instruction
▪ Instruction memory → register file → ALU → data
memory → MUX.
▪ Not feasible to vary period for different
instructions
▪ Violates design principle
▪ Making the common case fast
▪ We will improve performance by pipelining
Your turn
▪ What is critical path of following
instructions:
bne, sw
Pipelining Analogy
▪ Pipelined laundry: overlapping execution
▪ Parallelism improves performance
▪ Four loads:
▪ Speedup = 8/3.5 = 2.3

▪ Non-stop:
▪ Speedup = 2n / (0.5n + 1.5) ≈ 4 = number of stages (for large n)
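The formula is easy to evaluate: for the four loads above it gives 8/3.5 ≈ 2.3, and it approaches the number of stages (4) as the number of loads grows (the function name is illustrative):

```c
/* Speedup of the 4-stage pipelined laundry over sequential execution
   for n loads: sequential takes 2n time units, pipelined 0.5n + 1.5
   (pipeline fill and drain), so speedup = 2n / (0.5n + 1.5). */
static double pipeline_speedup(double n) {
    return 2.0 * n / (0.5 * n + 1.5);
}
```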
MIPS Pipeline
▪ Five stages, one step per stage
▪ IF: Instruction fetch from memory
▪ ID: Instruction decode & register read
▪ EX: Execute operation or calculate address
▪ MEM: Access memory operand
▪ WB: Write result back to register
Pipeline Performance
▪ Assume time for stages is
▪ 100ps for register read or write
▪ 200ps for other stages
▪ Compare pipelined datapath with single-cycle datapath
Instr | Instr fetch | Register read | ALU op | Memory access | Register write | Total time
lw | 200ps | 100ps | 200ps | 200ps | 100ps | 800ps
sw | 200ps | 100ps | 200ps | 200ps | — | 700ps
R-format | 200ps | 100ps | 200ps | — | 100ps | 600ps
beq | 200ps | 100ps | 200ps | — | — | 500ps
Pipeline Performance
Single-cycle (Tc = 800ps): each instruction runs fetch, register read, ALU, data access, and register write to completion — a new lw starts every 800ps.
Pipelined (Tc = 200ps): a new instruction starts every 200ps, its five stages overlapping with its neighbors'.
[Figure: execution timelines of three lw instructions under each organization.]
Pipeline Speedup
▪ If all stages are balanced
▪ i.e., all take the same time
▪ Time between instructions (pipelined) = Time between instructions (nonpipelined) / Number of stages
▪ If not balanced, speedup is less
▪ Speedup due to increased throughput
▪ Latency (time for each instruction) does not
decrease
▪ What is the value of pipeline CPI?
Pipelining and ISA Design
▪ MIPS ISA designed for pipelining
▪ All instructions are 32-bits
▪ Easier to fetch and decode in one cycle
▪ c.f. x86: 1- to 15-byte instructions
▪ Few and regular instruction formats
▪ Can decode and read registers in one step
▪ Load/store addressing
▪ Can calculate address in 3rd stage, access memory in 4th
stage
▪ Alignment of memory operands
▪ Memory access takes only one cycle

Hazards
▪ Situations that prevent starting the next instruction
in the next cycle
▪ Structure hazards
▪ A required resource is busy
▪ Data hazard
▪ Need to wait for previous instruction to complete its
data read/write
▪ Control hazard
▪ Deciding on control action depends on previous
instruction

Structure Hazards
▪ Assume Instruction Memory and Data Memory are the same single memory
  → memory conflict: an instruction fetch (IM) and a data access (DM)
    would need the memory in the same cycle

IM  REG  ALU  DM   REG
    IM   REG  ALU  DM   REG
         IM   REG  ALU  DM   REG
              IM   REG  ALU  DM   REG   ← this IM conflicts with the first DM


Structure Hazards
▪ Conflict for use of a resource
▪ In MIPS pipeline with a single memory
▪ Load/store requires data access
▪ Instruction fetch would have to stall for that cycle
▪ Would cause a pipeline “bubble”
▪ Hence, pipelined datapaths require separate
instruction/data memories
▪ Or separate instruction/data caches

Data Hazards
▪ An instruction depends on completion of data access
by a previous instruction
add $s0, $t0, $t1
sub $t2, $s0, $t3
                     Time →
add $s0, $t0, $t1    IF  ID  EX  MEM  WB  ← new $s0 written back here
sub $t2, $s0, $t3        IF  ID  EX  MEM  WB
                             ↑ decodes (reads) the old $s0 → wrong result
Data Hazards (bubble, stall, delay)
▪ An instruction depends on completion of data access
by a previous instruction
add $s0, $t0, $t1
sub $t2, $s0, $t3
                     Time →
add $s0, $t0, $t1    IF  ID  EX  MEM  WB
                         bubble bubble bubble bubble bubble
                             bubble bubble bubble bubble bubble
sub $t2, $s0, $t3            IF  ID  EX  MEM  WB
Forwarding (aka Bypassing)
▪ Use result when it is computed
▪ Don’t wait for it to be stored in a register
▪ Requires extra connections in the datapath
                     Time →
add $s0, $t0, $t1    IF  ID  EX  MEM  WB
                             └── result forwarded from EX output
sub $t2, $s0, $t3        IF  ID  EX  MEM  WB


Pipeline visualization
Time (in clock cycles)
                 CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8  CC9
Instruction 1    IM   REG  ALU  DM   REG
Instruction 2         IM   REG  ALU  DM   REG
Instruction 3              IM   REG  ALU  DM   REG
Instruction 4                   IM   REG  ALU  DM   REG
Instruction 5                        IM   REG  ALU  DM   REG


How many stalls?
▪ Example 1
lw $2, 20($1)
and $12, $2, $5
or $13, $6, $2
add $14, $12, $2
sw $14, 100($2)
▪ Example 2
addi $2, $0, 10
Loop: addi $2, $2, -1
bne $2, $0, Loop

Load-Use Data Hazard
▪ Can’t always avoid stalls by forwarding
▪ If value not computed when needed
▪ Can’t forward backward in time!
                     Time →
lw  $s0, 20($t1)     IF  ID  EX  MEM  WB
                         bubble bubble bubble bubble bubble
sub $t2, $s0, $t3            IF  ID  EX  MEM  WB


Code Scheduling to Avoid Stalls

▪ Reorder code to avoid use of load result in the next instruction
▪ C code for A = B + E; C = B + F;

  Original (13 cycles)        Scheduled (11 cycles)
  lw  $t1, 0($t0)             lw  $t1, 0($t0)
  lw  $t2, 4($t0)             lw  $t2, 4($t0)
  (stall)                     lw  $t4, 8($t0)
  add $t3, $t1, $t2           add $t3, $t1, $t2
  sw  $t3, 12($t0)            sw  $t3, 12($t0)
  lw  $t4, 8($t0)             add $t5, $t1, $t4
  (stall)                     sw  $t5, 16($t0)
  add $t5, $t1, $t4
  sw  $t5, 16($t0)
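The two cycle counts above can be checked mechanically. A Python sketch (illustrative, not from the slides) assuming a 5-stage pipeline with full forwarding, so only a load immediately followed by a user of its result costs one stall:

```python
def cycles(prog):
    """prog: list of (op, dest, sources). 5-stage pipeline with forwarding:
    only a back-to-back load-use pair inserts one bubble."""
    stalls = sum(1 for prev, cur in zip(prog, prog[1:])
                 if prev[0] == "lw" and prev[1] in cur[2])
    return len(prog) + 4 + stalls          # +4 cycles to fill the pipeline

original = [("lw",  "$t1", ["$t0"]),
            ("lw",  "$t2", ["$t0"]),
            ("add", "$t3", ["$t1", "$t2"]),
            ("sw",  None,  ["$t3", "$t0"]),
            ("lw",  "$t4", ["$t0"]),
            ("add", "$t5", ["$t1", "$t4"]),
            ("sw",  None,  ["$t5", "$t0"])]

scheduled = [original[0], original[1], original[4],   # hoist the third lw
             original[2], original[3], original[5], original[6]]

print(cycles(original), cycles(scheduled))   # 13 11, as on the slide
```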
How many stalls? And …
lw $10, 20($1)
bne $10, $9, else
sub $11, $2, $3
add $12, $11, $4
j exit
else: lw $13, 24($12)
add $14, $5, $12
exit:
Control Hazards
▪ Branch determines flow of control
▪ Fetching next instruction depends on branch
outcome
▪ Pipeline can’t always fetch correct instruction
▪ Still working on ID stage of branch
▪ In MIPS pipeline
▪ Need to compare registers and compute target
early in the pipeline
▪ Add hardware to do it in ID stage
Stall on Branch
▪ Wait until branch outcome determined
before fetching next instruction
                     Time →
add $4, $5, $6       IF  ID  EX  MEM  WB
beq $1, $2, 40           IF  ID  EX  MEM  WB          (200 ps after add)
                             bubble bubble bubble bubble bubble
or  $7, $8, $9                   IF  ID  EX  MEM  WB  (400 ps after beq)


Branch Prediction
▪ Longer pipelines can’t readily determine branch
outcome early
▪ Stall penalty becomes unacceptable
▪ Predict outcome of branch
▪ Only stall if prediction is wrong
▪ In MIPS pipeline
▪ Can predict branches not taken
▪ Fetch instruction after branch, with no delay

MIPS with Predict Not Taken
Prediction correct (branch not taken):
add $4, $5, $6       IF  ID  EX  MEM  WB
beq $1, $2, 40           IF  ID  EX  MEM  WB          (200 ps)
lw  $3, 300($0)              IF  ID  EX  MEM  WB      (200 ps — no delay)

Prediction incorrect (branch taken):
add $4, $5, $6       IF  ID  EX  MEM  WB
beq $1, $2, 40           IF  ID  EX  MEM  WB          (200 ps)
                             bubble bubble bubble bubble bubble
or  $7, $8, $9                   IF  ID  EX  MEM  WB  (400 ps)


More-Realistic Branch Prediction
▪ Static branch prediction
▪ Based on typical branch behavior
▪ Example: loop and if-statement branches
▪ Predict backward branches taken
▪ Predict forward branches not taken
▪ Dynamic branch prediction
▪ Hardware measures actual branch behavior
▪ e.g., record recent history of each branch
▪ Assume future behavior will continue the trend
▪ When wrong, stall while re-fetching, and update history
8/15/2023 Faculty of Computer Science and Engineering 66
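The cost of mispredictions can be folded into CPI. A Python sketch using the standard back-of-the-envelope formula (the workload numbers below are made-up assumptions, not from the slides):

```python
def effective_cpi(base_cpi, branch_frac, mispredict_rate, penalty_cycles):
    # Each mispredicted branch adds penalty_cycles of bubbles on average.
    return base_cpi + branch_frac * mispredict_rate * penalty_cycles

# Assumed workload: 20% branches, 10% of them mispredicted, 1-cycle penalty
# (a MIPS-style pipeline that resolves branches in ID).
print(effective_cpi(1.0, 0.20, 0.10, 1))   # ≈ 1.02
```

With a deeper pipeline (say a 4-cycle penalty) the same workload would give ≈ 1.08, which is why prediction accuracy matters more as pipelines lengthen.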
Pipeline Summary
▪ Pipelining improves performance by increasing
instruction throughput
▪ Executes multiple instructions in parallel
▪ Each instruction has the same latency
▪ Subject to hazards
▪ Structure, data, control
▪ Instruction set design affects complexity of
pipeline implementation

Single clock cycle vs Pipeline vs
Multiple clock cycle
time    1  2  3  4  5  6  7  8  9  10 11 12 13 14 15

Single-cycle: 3 cycles, cycle time = 5 secs
  lw   [----- cycle -----]
  sw                      [----- cycle -----]
  add                                        [----- cycle -----]

Multi-cycle: 5 + 4 + 4 = 13 cycles, cycle time = 1 sec
  lw   IF ID E  M  WB
  sw                  IF ID E  M
  add                            IF ID E  WB

Pipelined: 7 cycles, cycle time = 1 sec
  lw   IF ID E  M  WB
  sw      IF ID E  M  WB
  add        IF ID E  M  WB
Multiple clock cycle
Instruction          #cycles   Stages
Load                 5         IF ID EXE MEM WB
Store                4         IF ID EXE MEM
Branch               3         IF ID EXE
Arithmetic/logical   4         IF ID EXE WB
Jump                 2         IF ID
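The three totals from the comparison can be recomputed from this table. A Python sketch (illustrative, not from the slides; per-instruction cycle counts from the table above, ideal 5-stage pipeline with no stalls):

```python
CYCLES = {"lw": 5, "sw": 4, "beq": 3, "add": 4, "j": 2}   # multi-cycle counts

def single_cycle(prog, tc=5):          # one long 5-sec cycle per instruction
    return len(prog) * tc

def multi_cycle(prog, tc=1):           # variable cycles per instruction
    return sum(CYCLES[op] for op in prog) * tc

def pipelined(prog, stages=5, tc=1):   # fill the pipeline, then 1/cycle
    return (stages + len(prog) - 1) * tc

prog = ["lw", "sw", "add"]
print(single_cycle(prog), multi_cycle(prog), pipelined(prog))   # 15 13 7
```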


MIPS Pipelined Datapath

[Datapath figure — five stages, left to right:
  IF:  instruction fetch
  ID:  instruction decode / register file read
  EX:  execute / address calculation
  MEM: memory access
  WB:  write back
Right-to-left flow (the WB write into the register file, and the branch
target back into the PC) is what leads to hazards.]


Pipeline registers
▪ Need registers between stages
▪ To hold information produced in previous cycle

[Datapath figure with pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB
inserted between consecutive stages.]


Pipeline Operation
▪ Cycle-by-cycle flow of instructions through the
pipelined datapath
▪ “Single-clock-cycle” pipeline diagram
▪ Shows pipeline usage in a single cycle
▪ Highlight resources used
▪ c.f. “multi-clock-cycle” diagram
▪ Graph of operation over time
▪ We’ll look at “single-clock-cycle” diagrams for
load & store
8/15/2023 Faculty of Computer Science and Engineering 72
IF for Load, Store, …

[Single-clock-cycle diagram, lw in IF: the instruction is fetched from
instruction memory at the PC and written into IF/ID; PC + 4 is computed.]


ID for Load, Store, …

[Single-clock-cycle diagram, lw in ID: the source registers are read, the
16-bit immediate is sign-extended to 32 bits, and the values are written
into ID/EX.]


EX for Load

[Single-clock-cycle diagram, lw in EX: the ALU adds the base register and
the sign-extended offset; the result goes into EX/MEM.]


MEM for Load

[Single-clock-cycle diagram, lw in MEM: data memory is read at the ALU
result address; the loaded data goes into MEM/WB.]


WB for Load

[Single-clock-cycle diagram, lw in WB: the loaded data is written back to
the register file — but the write-register number is taken from the
instruction currently in IF/ID, i.e., the WRONG register number.]


Corrected Datapath for Load

[Figure: the write-register number is carried along with the instruction
through the ID/EX, EX/MEM and MEM/WB pipeline registers, so WB writes the
correct register.]


EX for Store

[Single-clock-cycle diagram, sw in EX: the ALU computes the effective
address; the register value to be stored is also passed on into EX/MEM.]


MEM for Store

[Single-clock-cycle diagram, sw in MEM: the register value is written to
data memory at the computed address.]


WB for Store

[Single-clock-cycle diagram, sw in WB: nothing is written back; the stage
is idle for stores.]


Multi-Cycle Pipeline Diagram
▪ Form showing resource usage
Program execution order    CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8  CC9
lw  $10, 20($1)            IM   REG  ALU  DM   REG
sub $11, $2, $3                 IM   REG  ALU  DM   REG
add $12, $3, $4                      IM   REG  ALU  DM   REG
lw  $13, 24($1)                           IM   REG  ALU  DM   REG
add $14, $5, $6                                IM   REG  ALU  DM   REG
Single-Cycle Pipeline Diagram
▪ State of pipeline in a given cycle
add $14, $5, $6 lw $13, 24 ($1) add $12, $3, $4 sub $11, $2, $3 lw $10, 20($1)
Instruction fetch Instruction decode Execution Memory Write-back

IF/ID ID/EX EX/MEM MEM/WB

Add

4 Add Add
result
Shift
left 2

Instruction
0
M
u PC Address Read
x register 1 Read
1
data 1
Read Zero
Instruction
register 2 ALU
Read ALU Read
memory result Address data 0
Write data 2 0
M
register M
Data u
Registers u
Write memory x
data x 1
1

Write
data

16 Sign- 32
extend

8/15/2023 Faculty of Computer Science and Engineering 83


Pipelined Control (Simplified)

[Pipelined datapath figure annotated with the control signals of the
single-cycle design: PCSrc, RegWrite, MemRead, MemWrite, MemtoReg, ALUSrc,
ALUOp, RegDst, Branch.]


Pipelined Control
▪ Control signals derived from instruction
▪ As in single-cycle implementation

[Figure: control is computed in ID, and the EX, M and WB signal groups are
carried forward with the instruction in the pipeline registers.]


Pipelined Control

[Full pipelined datapath with control: the control unit in ID produces the
EX group (ALUSrc, ALUOp, RegDst), the M group (Branch, MemRead, MemWrite)
and the WB group (RegWrite, MemtoReg); each group travels with its
instruction through the ID/EX, EX/MEM and MEM/WB pipeline registers.]


Data Hazards in ALU Instructions

▪ Consider this sequence:


sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
▪ We can resolve hazards with forwarding
▪ How do we detect when to forward?
Dependencies & Forwarding
Time (in clock cycles)      CC1  CC2  CC3  CC4    CC5  CC6  CC7  CC8  CC9
Value of register $2:       10   10   10   10/20  20   20   20   20   20

1. sub $2, $1, $3           IM   REG  ALU  DM     REG
2. and $12, $2, $5               IM   REG  ALU    DM   REG
3. or  $13, $6, $2                    IM   REG    ALU  DM   REG
4. add $14, $2, $2                         IM     REG  ALU  DM   REG
5. sw  $15, 100($2)                               IM   REG  ALU  DM   REG


Detecting the Need to Forward
▪ Pass register numbers along pipeline
▪ e.g., ID/EX.RegisterRs = register number for Rs sitting in
ID/EX pipeline register
▪ ALU operand register numbers in EX stage are given by
▪ ID/EX.RegisterRs, ID/EX.RegisterRt
▪ Data hazards when
  ▪ 1a. EX/MEM.RegisterRd = ID/EX.RegisterRs  ┐ Fwd from EX/MEM
  ▪ 1b. EX/MEM.RegisterRd = ID/EX.RegisterRt  ┘ pipeline reg
  ▪ 2a. MEM/WB.RegisterRd = ID/EX.RegisterRs  ┐ Fwd from MEM/WB
  ▪ 2b. MEM/WB.RegisterRd = ID/EX.RegisterRt  ┘ pipeline reg


Detecting the Need to Forward
▪ But only if forwarding instruction will write
to a register!
▪ EX/MEM.RegWrite, MEM/WB.RegWrite
▪ And only if Rd for that instruction is not
$zero
▪ EX/MEM.RegisterRd ≠ 0,
MEM/WB.RegisterRd ≠ 0

8/15/2023 Faculty of Computer Science and Engineering 90


No Forwarding

[Figure (a): EX-stage datapath without forwarding — the ALU inputs come
only from the register file values held in ID/EX.]


Forwarding Path

[Figure (b): multiplexers (ForwardA, ForwardB) in front of the ALU select
between the ID/EX register values, the EX/MEM ALU result and the MEM/WB
write-back value; a forwarding unit compares Rs/Rt in ID/EX against
EX/MEM.RegisterRd and MEM/WB.RegisterRd.]
Forwarding Conditions
▪ EX hazard
▪ if (EX/MEM.RegWrite
and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
ForwardA = 10
▪ if (EX/MEM.RegWrite
and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
ForwardB = 10

Forwarding Conditions
▪ MEM hazard
▪ if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd ≠ 0)
and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
ForwardA = 01
▪ if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd ≠ 0)
and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
ForwardB = 01

8/15/2023 Faculty of Computer Science and Engineering 94


Double Data Hazard
▪ Consider the sequence:
add $1,$1,$2
add $1,$1,$3
add $1,$1,$4
▪ Both hazards occur
▪ Want to use the most recent
▪ Revise MEM hazard condition
▪ Only fwd if EX hazard condition isn’t true

8/15/2023 Faculty of Computer Science and Engineering 95


Revised Forwarding Condition
▪ MEM hazard
if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd ≠ 0)
and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
ForwardA = 01
if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd ≠ 0)
and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
ForwardB = 01

8/15/2023 Faculty of Computer Science and Engineering 96
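The conditions above — with the EX hazard taking priority over the MEM hazard, as the double-data-hazard case requires — can be sketched as a plain function. A Python sketch (illustrative, not hardware; field names follow the slides):

```python
def forward(ex_mem, mem_wb, src):
    """Return the 2-bit ForwardA/ForwardB code for one ALU source register.
    ex_mem / mem_wb: dicts with 'RegWrite' and 'Rd'; src: ID/EX.RegisterRs or Rt."""
    if ex_mem["RegWrite"] and ex_mem["Rd"] != 0 and ex_mem["Rd"] == src:
        return 0b10                      # EX hazard: forward from EX/MEM
    if mem_wb["RegWrite"] and mem_wb["Rd"] != 0 and mem_wb["Rd"] == src:
        return 0b01                      # MEM hazard: forward from MEM/WB
    return 0b00                          # no hazard: use the register-file value

# Double data hazard: add $1,$1,$2; add $1,$1,$3; add $1,$1,$4
# While the third add is in EX, both older adds wrote $1 — the EX/MEM
# (most recent) copy must win:
ex_mem = {"RegWrite": True, "Rd": 1}
mem_wb = {"RegWrite": True, "Rd": 1}
print(forward(ex_mem, mem_wb, 1))        # 2 (0b10): take the EX/MEM result
```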


Datapath with Forwarding

[Figure: the full pipeline with the forwarding unit; the Rs/Rt register
numbers are carried into ID/EX and compared against EX/MEM.RegisterRd and
MEM/WB.RegisterRd to steer the ALU-input multiplexers.]


Load-Use Data Hazard
Program execution order    CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8  CC9
lw  $2, 20($1)             IM   REG  ALU  DM   REG   ← need to stall
and $4, $2, $5                  IM   REG  ALU  DM   REG    for one cycle
or  $8, $2, $6                       IM   REG  ALU  DM   REG
add $9, $4, $2                            IM   REG  ALU  DM   REG
slt $1, $6, $7                                 IM   REG  ALU  DM   REG


Load-Use Hazard Detection
▪ Check when using instruction is decoded in ID
stage
▪ ALU operand register numbers in ID stage are given
by
▪ IF/ID.RegisterRs, IF/ID.RegisterRt
▪ Load-use hazard when
▪ ID/EX.MemRead and
((ID/EX.RegisterRt = IF/ID.RegisterRs) or
(ID/EX.RegisterRt = IF/ID.RegisterRt))
▪ If detected, stall and insert bubble

8/15/2023 Faculty of Computer Science and Engineering 99
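The detection rule above can be written down directly. A Python sketch (illustrative, not hardware; field names follow the slides):

```python
def load_use_hazard(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
    """True → stall: the instruction in EX is a load whose destination (Rt)
    is a source of the instruction being decoded in ID."""
    return id_ex_memread and id_ex_rt in (if_id_rs, if_id_rt)

# lw $2, 20($1) in EX while and $4, $2, $5 is in ID → must stall
print(load_use_hazard(True, 2, 2, 5))    # True
# After the one-cycle bubble the value can be forwarded from MEM/WB,
# so no further stall is needed:
print(load_use_hazard(False, 2, 2, 5))   # False
```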


How to Stall the Pipeline
▪ Force control values in ID/EX register
to 0
▪ EX, MEM and WB do nop (no-operation)
▪ Prevent update of PC and IF/ID register
▪ Using instruction is decoded again
▪ Following instruction is fetched again
▪ 1-cycle stall allows MEM to read data for lw
▪ Can subsequently forward to EX stage

8/15/2023 Faculty of Computer Science and Engineering 100


Load-Use Data Hazard
Program execution order    CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8  CC9
lw  $2, 20($1)             IM   REG  ALU  DM   REG
(stall — and becomes nop)       IM   REG  ALU  DM   REG   ← bubble inserted here
and $4, $2, $5                       IM   REG  ALU  DM   REG
or  $8, $2, $6                            IM   REG  ALU  DM   REG
add $9, $4, $2                                 IM   REG  ALU  DM   REG


Datapath with Hazard Detection

[Figure: a hazard detection unit in ID checks ID/EX.MemRead and
ID/EX.RegisterRt against IF/ID.RegisterRs/Rt; on a load-use hazard it
de-asserts PCWrite and IF/ID.Write and zeroes the control signals,
inserting a bubble.]


Stalls and Performance
▪ Stalls reduce performance
▪ But are required to get correct results
▪ Compiler can arrange code to avoid hazards
and stalls
▪ Requires knowledge of the pipeline structure

8/15/2023 Faculty of Computer Science and Engineering 103


Branch Hazards
Program execution order    CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8  CC9
40 beq $1, $3, 28          IM   REG  ALU  DM   REG
44 and $12, $2, $5              IM   REG  ALU  DM   REG  ┐ flush these
48 or  $13, $6, $2                   IM   REG  ALU  DM   │ instructions
52 add $14, $2, $2                        IM   REG  ALU  ┘ (set control to 0)
72 lw  $4, 50($7)                              IM   REG  ALU  DM   REG


Reducing Branch Delay
▪ Move hardware to determine outcome to ID stage
▪ Target address adder
▪ Register comparator
▪ Example: branch taken
36: sub $10, $4, $8
40: beq $1, $3, 7
44: and $12, $2, $5
48: or $13, $2, $6
52: add $14, $4, $2
56: slt $15, $6, $7
...
72: lw $4, 50($7)
8/15/2023 Faculty of Computer Science and Engineering 105
Example: Branch Taken (clock 3)

[Figure: and $12, $2, $5 in IF, beq $1, $3, 7 in ID, sub $10, $4, $8 in EX.
The ID-stage comparator finds $1 = $3, the target 44 + 28 = 72 is selected
into the PC, and IF.Flush turns the fetched and into a bubble.]


Example: Branch Taken (clock 4)

[Figure: lw $4, 50($7) (at address 72) is fetched; the flushed slot
travels down the pipeline as a bubble; sub $10, $4, $8 continues normally.]


Data Hazards for Branches
▪ If a comparison register is a destination of 2nd
or 3rd preceding ALU instruction
add $1, $2, $3       IF ID EX MEM WB
add $4, $5, $6          IF ID EX MEM WB
…                          IF ID EX MEM WB
beq $1, $4, target            IF ID EX MEM WB
▪ Can resolve using forwarding

8/15/2023 Faculty of Computer Science and Engineering 108


Data Hazards for Branches
▪ If a comparison register is a destination of
preceding ALU instruction or 2nd preceding
load instruction
▪ Need 1 stall cycle
lw  $1, addr         IF ID EX MEM WB
add $4, $5, $6          IF ID EX MEM WB
beq stalled                IF ID
beq $1, $4, target               ID EX MEM WB


Data Hazards for Branches
▪ If a comparison register is a destination of
immediately preceding load instruction
▪ Need 2 stall cycles
lw  $1, addr         IF ID EX MEM WB
beq stalled             IF ID
beq stalled                   ID
beq $1, $0, target               ID EX MEM WB
8/15/2023 Faculty of Computer Science and Engineering 110


Dynamic Branch Prediction
▪ In deeper and superscalar pipelines, branch penalty
is more significant
▪ Use dynamic prediction
▪ Branch prediction buffer (aka branch history table)
▪ Indexed by recent branch instruction addresses
▪ Stores outcome (taken/not taken)
▪ To execute a branch
▪ Check table, expect the same outcome
▪ Start fetching from fall-through or target
▪ If wrong, flush pipeline and flip prediction

1-Bit Predictor: Shortcoming
▪ Inner loop branches mispredicted twice!
outer: …

inner: …

beq …, …, inner

beq …, …, outer
◼ Mispredict as taken on last iteration of
inner loop
◼ Then mispredict as not taken on first
iteration of inner loop next time around
8/15/2023 Faculty of Computer Science and Engineering 112
2-Bit Predictor
▪ Only change prediction on two successive
mispredictions
[State diagram — four states:
  Predict taken (strong)   ⇄  Predict taken (weak)
  Predict not taken (weak) ⇄  Predict not taken (strong)
A taken branch moves one step toward "predict taken"; a not-taken branch
moves one step toward "predict not taken". The prediction only flips after
two successive mispredictions.]
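A Python sketch (illustrative, not from the slides) comparing the two predictors on the nested-loop pattern from the previous slide — three taken branches then one not-taken per inner loop, repeated:

```python
def mispredicts_1bit(outcomes):
    pred, misses = True, 0                 # start predicting "taken"
    for taken in outcomes:
        misses += pred != taken
        pred = taken                       # remember only the last outcome
    return misses

def mispredicts_2bit(outcomes):
    state, misses = 3, 0                   # 3 = strongly taken … 0 = strongly not
    for taken in outcomes:
        misses += (state >= 2) != taken
        state = min(3, state + 1) if taken else max(0, state - 1)
    return misses

inner_loop = [True, True, True, False] * 3   # three passes through the inner loop
print(mispredicts_1bit(inner_loop))  # 5: two misses per inner loop after the first
print(mispredicts_2bit(inner_loop))  # 3: only the final not-taken of each pass
```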
Calculating the Branch Target
▪ Even with predictor, still need to calculate
the target address
▪ 1-cycle penalty for a taken branch
▪ Branch target buffer
▪ Cache of target addresses
▪ Indexed by PC when instruction fetched
▪ If hit and instruction is branch predicted taken,
can fetch target immediately

8/15/2023 Faculty of Computer Science and Engineering 114


Exceptions and Interrupts
▪ “Unexpected” events requiring change
in flow of control
▪ Different ISAs use the terms differently
▪ Exception
▪ Arises within the CPU
▪ e.g., undefined opcode, overflow, syscall, …
▪ Interrupt
▪ From an external I/O controller
▪ Dealing with them without sacrificing performance
is hard
8/15/2023 Faculty of Computer Science and Engineering 115
Handling Exceptions
▪ In MIPS, exceptions managed by a System
Control Coprocessor (CP0)
▪ Save PC of offending (or interrupted) instruction
▪ In MIPS: Exception Program Counter (EPC)
▪ Save indication of the problem
▪ In MIPS: Cause register
▪ We’ll assume 1-bit
▪ 0 for undefined opcode, 1 for overflow
▪ Jump to handler at 8000 0180 (hex)
8/15/2023 Faculty of Computer Science and Engineering 116
An Alternate Mechanism
▪ Vectored Interrupts
▪ Handler address determined by the cause
▪ Example:
▪ Undefined opcode: C000 0000
▪ Overflow: C000 0020
▪ …: C000 0040
▪ Instructions either
▪ Deal with the interrupt, or
▪ Jump to real handler

8/15/2023 Faculty of Computer Science and Engineering 117


Handler Actions
▪ Read cause, and transfer to relevant handler
▪ Determine action required
▪ If restartable
▪ Take corrective action
▪ use EPC to return to program
▪ Otherwise
▪ Terminate program
▪ Report error using EPC, cause, …
8/15/2023 Faculty of Computer Science and Engineering 118
Exceptions in a Pipeline
▪ Another form of control hazard
▪ Consider overflow on add in EX stage
▪ add $1, $2, $1
▪ Prevent $1 from being clobbered
▪ Complete previous instructions
▪ Flush add and subsequent instructions
▪ Set Cause and EPC register values
▪ Transfer control to handler
▪ Similar to mispredicted branch
▪ Use much of the same hardware
8/15/2023 Faculty of Computer Science and Engineering 119
Pipeline with Exceptions

[Figure: the hazard/flush hardware is extended with IF.Flush, ID.Flush and
EX.Flush signals, a Cause register, an EPC register, and an extra PC source
80000180 (the handler address).]


Exception Properties
▪ Restartable exceptions
▪ Pipeline can flush the instruction
▪ Handler executes, then returns to the instruction
▪ Refetched and executed from scratch
▪ PC saved in EPC register
▪ Identifies causing instruction
▪ Actually PC + 4 is saved
▪ Handler must adjust
8/15/2023 Faculty of Computer Science and Engineering 121
Exception Example
▪ Exception on add in
40 sub $11, $2, $4
44 and $12, $2, $5
48 or $13, $2, $6
4C add $1, $2, $1
50 slt $15, $6, $7
54 lw $16, 50($7)

▪ Handler
80000180 sw $25, 1000($0)
80000184 sw $26, 1004($0)

8/15/2023 Faculty of Computer Science and Engineering 122


Exception Example (clock 6)

[Figure: add $1, $2, $1 overflows in EX. EX.Flush, ID.Flush and IF.Flush
turn add, slt and lw into bubbles; EPC receives 4C + 4 = 50, and the PC is
set to 80000180.]


Exception Example (clock 7)

[Figure: the first handler instruction is fetched from 80000180; the
flushed instructions travel on down the pipeline as bubbles.]


Multiple Exceptions
▪ Pipelining overlaps multiple instructions
▪ Could have multiple exceptions at once
▪ Simple approach: deal with exception from earliest
instruction
▪ Flush subsequent instructions
▪ “Precise” exceptions
▪ In complex pipelines
▪ Multiple instructions issued per cycle
▪ Out-of-order completion
▪ Maintaining precise exceptions is difficult!

8/15/2023 Faculty of Computer Science and Engineering 125


Imprecise Exceptions
▪ Just stop pipeline and save state
▪ Including exception cause(s)
▪ Let the handler work out
▪ Which instruction(s) had exceptions
▪ Which to complete or flush
▪ May require “manual” completion
▪ Simplifies hardware, but more complex handler
software
▪ Not feasible for complex multiple-issue
out-of-order pipelines
8/15/2023 Faculty of Computer Science and Engineering 126
Instruction-Level Parallelism (ILP)

▪ Pipelining: executing multiple instructions in parallel


▪ To increase ILP
▪ Deeper pipeline
▪ Less work per stage ⇒ shorter clock cycle
▪ Multiple issue
▪ Replicate pipeline stages ⇒ multiple pipelines
▪ Start multiple instructions per clock cycle
▪ CPI < 1, so use Instructions Per Cycle (IPC)
▪ E.g., 4GHz 4-way multiple-issue
▪ 16 BIPS, peak CPI = 0.25, peak IPC = 4
▪ But dependencies reduce this in practice

Multiple Issue
▪ Static multiple issue
▪ Compiler groups instructions to be issued together
▪ Packages them into “issue slots”
▪ Compiler detects and avoids hazards
▪ Dynamic multiple issue
▪ CPU examines instruction stream and chooses
instructions to issue each cycle
▪ Compiler can help by reordering instructions
▪ CPU resolves hazards using advanced techniques at
runtime

8/15/2023 Faculty of Computer Science and Engineering 128


Speculation
▪ “Guess” what to do with an instruction
▪ Start operation as soon as possible
▪ Check whether guess was right
▪ If so, complete the operation
▪ If not, roll-back and do the right thing
▪ Common to static and dynamic multiple issue
▪ Examples
▪ Speculate on branch outcome
▪ Roll back if path taken is different
▪ Speculate on load
▪ Roll back if location is updated

8/15/2023 Faculty of Computer Science and Engineering 129


Compiler/Hardware Speculation
▪ Compiler can reorder instructions
▪ e.g., move load before branch
▪ Can include “fix-up” instructions to recover from
incorrect guess
▪ Hardware can look ahead for instructions to
execute
▪ Buffer results until it determines they are actually
needed
▪ Flush buffers on incorrect speculation
8/15/2023 Faculty of Computer Science and Engineering 130
Speculation and Exceptions
▪ What if exception occurs on a speculatively
executed instruction?
▪ e.g., speculative load before null-pointer check
▪ Static speculation
▪ Can add ISA support for deferring exceptions
▪ Dynamic speculation
▪ Can buffer exceptions until instruction
completion (which may not occur)
8/15/2023 Faculty of Computer Science and Engineering 131
Static Multiple Issue
▪ Compiler groups instructions into “issue
packets”
▪ Group of instructions that can be issued on a single
cycle
▪ Determined by pipeline resources required
▪ Think of an issue packet as a very long
instruction
▪ Specifies multiple concurrent operations
▪ ⇒ Very Long Instruction Word (VLIW)
8/15/2023 Faculty of Computer Science and Engineering 132
Scheduling Static Multiple Issue

▪ Compiler must remove some/all hazards


▪ Reorder instructions into issue packets
▪ No dependencies within a packet
▪ Possibly some dependencies between packets
▪ Varies between ISAs; compiler must know!
▪ Pad with nop if necessary

8/15/2023 Faculty of Computer Science and Engineering 133


MIPS with Static Dual Issue
▪ Two-issue packets
▪ One ALU/branch instruction
▪ One load/store instruction
▪ 64-bit aligned
▪ ALU/branch, then load/store
▪ Pad an unused instruction with nop
Address   Instruction type   Pipeline stages
n + 00    ALU/branch         IF ID EX MEM WB
n + 04    Load/store         IF ID EX MEM WB
n + 08    ALU/branch            IF ID EX MEM WB
n + 12    Load/store            IF ID EX MEM WB
n + 16    ALU/branch               IF ID EX MEM WB
n + 20    Load/store               IF ID EX MEM WB
8/15/2023 Faculty of Computer Science and Engineering 134
MIPS with Static Dual Issue
[Figure: static dual-issue MIPS datapath — the PC fetches an aligned 64-bit instruction pair each cycle; the register file gains extra read/write ports, and a second sign-extend unit plus a separate address adder let one ALU/branch instruction and one load/store instruction execute in parallel.]
8/15/2023 Faculty of Computer Science and Engineering 135


Hazards in the Dual-Issue MIPS
▪ More instructions executing in parallel
▪ EX data hazard
▪ Forwarding avoided stalls with single-issue
▪ Now can’t use ALU result in load/store in same packet
add $t0, $s0, $s1
lw $s2, 0($t0)
▪ Split into two packets, effectively a stall
▪ Load-use hazard
▪ Still one cycle use latency, but now two instructions
▪ More aggressive scheduling required
8/15/2023 Faculty of Computer Science and Engineering 136
Scheduling Example
▪ Schedule this for dual-issue MIPS
Loop: lw $t0, 0($s1) # $t0=array element
addu $t0, $t0, $s2 # add scalar in $s2
sw $t0, 0($s1) # store result
addi $s1, $s1,–4 # decrement pointer
bne $s1, $zero, Loop # branch $s1!=0
ALU/branch Load/store cycle
Loop: nop lw $t0, 0($s1) 1
addi $s1, $s1,–4 nop 2
addu $t0, $t0, $s2 nop 3
bne $s1, $zero, Loop sw $t0, 0($s1) 4
◼ IPC = 5/4 = 1.25 (c.f. peak IPC = 2)
8/15/2023 Faculty of Computer Science and Engineering 137
Loop Unrolling
▪ Replicate loop body to expose more parallelism
▪ Reduces loop-control overhead
▪ Use different registers per replication
▪ Called “register renaming”
▪ Avoid loop-carried “anti-dependencies”
▪ Store followed by a load of the same register
▪ Aka “name dependence”
▪ Reuse of a register name

8/15/2023 Faculty of Computer Science and Engineering 138


Loop Unrolling Example
▪ IPC = 14/8 = 1.75
▪ Closer to 2, but at cost of registers and code size
ALU/branch Load/store cycle
Loop: addi $s1, $s1,–16 lw $t0, 0($s1) 1
nop lw $t1, 12($s1) 2
addu $t0, $t0, $s2 lw $t2, 8($s1) 3
addu $t1, $t1, $s2 lw $t3, 4($s1) 4
addu $t2, $t2, $s2 sw $t0, 16($s1) 5
addu $t3, $t3, $s2 sw $t1, 12($s1) 6
nop sw $t2, 8($s1) 7
bne $s1, $zero, Loop sw $t3, 4($s1) 8
8/15/2023 Faculty of Computer Science and Engineering 139
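The same transformation can be sketched at the C level (a hypothetical source-level view; the compiler actually performs it on the assembly, and the unrolled version assumes the trip count is a multiple of 4):

```c
#include <assert.h>

/* Original loop: add the scalar s to every element of a[0..n-1],
   walking downward as the MIPS code does with its pointer. */
void add_scalar(int *a, int n, int s) {
    for (int i = n - 1; i >= 0; i--)
        a[i] += s;
}

/* Unrolled by 4: four independent load/add/store groups per iteration
   and one loop-control overhead (addi/bne) instead of four.  Each group
   uses its own temporary (the $t0..$t3 of the slide) -- register
   renaming removes the name dependences between replications. */
void add_scalar_unrolled(int *a, int n, int s) {
    for (int i = n - 4; i >= 0; i -= 4) {
        a[i]     += s;
        a[i + 1] += s;
        a[i + 2] += s;
        a[i + 3] += s;
    }
}
```

Both functions compute the same result; the unrolled one simply exposes more independent work per iteration for the dual-issue pipeline to schedule.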
Dynamic Multiple Issue
▪ “Superscalar” processors
▪ CPU decides whether to issue 0, 1, 2, … each
cycle
▪ Avoiding structural and data hazards
▪ Avoids the need for compiler scheduling
▪ Though it may still help
▪ Code semantics ensured by the CPU
8/15/2023 Faculty of Computer Science and Engineering 140
Dynamic Pipeline Scheduling
▪ Allow the CPU to execute instructions out of
order to avoid stalls
▪ But commit result to registers in order
▪ Example
lw $t0, 20($s2)
addu $t1, $t0, $t2
sub $s4, $s4, $t3
slti $t5, $s4, 20
▪ Can start sub while addu is waiting for lw
8/15/2023 Faculty of Computer Science and Engineering 141
Dynamically Scheduled CPU
[Figure: the instruction fetch and decode unit issues in order, preserving dependencies; reservation stations in front of the functional units (integer units, floating point, load-store) hold pending operands; execution proceeds out of order, with results also sent to any waiting reservation stations; the commit unit's reorder buffer retires register writes in order and can supply operands for issued instructions.]
8/15/2023 Faculty of Computer Science and Engineering 142
Register Renaming
▪ Reservation stations and reorder buffer effectively
provide register renaming
▪ On instruction issue to reservation station
▪ If operand is available in register file or reorder buffer
▪ Copied to reservation station
▪ No longer required in the register; can be overwritten
▪ If operand is not yet available
▪ It will be provided to the reservation station by a function
unit
▪ Register update may not be required

8/15/2023 Faculty of Computer Science and Engineering 143


Speculation
▪ Predict branch and continue issuing
▪ Don’t commit until branch outcome determined
▪ Load speculation
▪ Avoid load and cache miss delay
▪ Predict the effective address
▪ Predict loaded value
▪ Load before completing outstanding stores
▪ Bypass stored values to load unit
▪ Don’t commit load until speculation cleared
8/15/2023 Faculty of Computer Science and Engineering 144
Why Do Dynamic Scheduling?
▪ Why not just let the compiler schedule code?
▪ Not all stalls are predictable
▪ e.g., cache misses
▪ Can’t always schedule around branches
▪ Branch outcome is dynamically determined
▪ Different implementations of an ISA have
different latencies and hazards

8/15/2023 Faculty of Computer Science and Engineering 145


Does Multiple Issue Work?
▪ Yes, but not as much as we’d like
▪ Programs have real dependencies that limit ILP
▪ Some dependencies are hard to eliminate
▪ e.g., pointer aliasing
▪ Some parallelism is hard to expose
▪ Limited window size during instruction issue
▪ Memory delays and limited bandwidth
▪ Hard to keep pipelines full
▪ Speculation can help if done well
8/15/2023 Faculty of Computer Science and Engineering 146
Fallacies
▪ Pipelining is easy (!)
▪ The basic idea is easy
▪ The devil is in the details
▪ e.g., detecting data hazards
▪ Pipelining is independent of technology
▪ So why haven’t we always done pipelining?
▪ More transistors make more advanced techniques
feasible
▪ Pipeline-related ISA design needs to take account of
technology trends
▪ e.g., predicated instructions
8/15/2023 Faculty of Computer Science and Engineering 148
Pitfalls
▪ Poor ISA design can make pipelining harder
▪ e.g., complex instruction sets (VAX, IA-32)
▪ Significant overhead to make pipelining work
▪ IA-32 micro-op approach
▪ e.g., complex addressing modes
▪ Register update side effects, memory indirection
▪ e.g., delayed branches
▪ Advanced pipelines have long delay slots
8/15/2023 Faculty of Computer Science and Engineering 149
Concluding Remarks
▪ ISA influences design of datapath and control
▪ Datapath and control influence design of ISA
▪ Pipelining improves instruction throughput
using parallelism
▪ More instructions completed per second
▪ Latency for each instruction not reduced
▪ Hazards: structural, data, control
▪ Multiple issue and dynamic scheduling (ILP)
▪ Dependencies limit achievable parallelism
▪ Complexity leads to the power wall
8/15/2023 Faculty of Computer Science and Engineering 150
Computer Architecture
Faculty of Computer Science & Engineering - HCMUT

Chapter 5: Large and Fast:


Exploiting Memory Hierarchy
Binh Tran-Thanh
[email protected]
This chapter contents
▪ Memory technology/ hierarchy
▪ Cache and Virtual Memory
▪ Memory performance
This chapter outcomes
Students who complete this course will be
able to
▪ Explain the structure of a memory hierarchy.
▪ Deeply understand how Memory, Cache, and
Virtual Memory work at the hardware level.
▪ Estimate the performance of a memory
hierarchy as well as a system.

11/13/2023 Faculty of Computer Science and Engineering 3


Principle of Locality
▪ Programs access a small proportion of their address
space at any time
▪ Temporal locality
▪ Items accessed recently are likely to be accessed again
soon
▪ e.g., instructions in a loop, induction variables
▪ Spatial locality
▪ Items near those accessed recently are likely to be
accessed soon
▪ E.g., sequential instruction access, array data

11/13/2023 Faculty of Computer Science and Engineering 4


Warm up
for (int i = 0; i < MAX_SIZE; i ++){
Sum += Array[i];
}
▪ Which variables/instructions exhibit
temporal locality?
▪ Which variables /instructions exhibit spatial
locality?
11/13/2023 Faculty of Computer Science and Engineering 5
Taking Advantage of Locality
▪ Memory hierarchy
▪ Store everything on disk
▪ Copy recently accessed (and nearby) items from
disk to smaller DRAM memory
▪ Main memory
▪ Copy more recently accessed (and nearby)
items from DRAM to smaller SRAM memory
▪ Cache memory attached to CPU

11/13/2023 Faculty of Computer Science and Engineering 6


Memory Hierarchy Levels
Processor
▪ Block (aka line): unit of copying
▪ May be multiple words
▪ If accessed data is present in upper level
▪ Hit: access satisfied by upper level
▪ Hit ratio: hits/accesses
Data is ▪ If accessed data is absent
transferred ▪ Miss: block copied from lower level
▪ Time taken: miss penalty
▪ Miss ratio: misses/accesses
= 1 – hit ratio
▪ Then accessed data supplied from upper
level
11/13/2023 Faculty of Computer Science and Engineering 7
Memory Technology
▪ Static RAM (SRAM)
▪ 0.5ns – 2.5ns, $500 – $1000 per GiB
▪ Dynamic RAM (DRAM)
▪ 50ns – 70ns, $10 – $20 per GiB
▪ Flash
▪ 5µs – 50 µs, $0.75 - $1.00 per GiB
▪ Magnetic disk
▪ 5ms – 20ms, $0.05 – $0.10 per GiB
▪ Ideal memory
▪ Access time of SRAM
▪ Capacity and cost/GB of disk

11/13/2023 Faculty of Computer Science and Engineering 8


Dynamic RAM (DRAM) cell

Source: internet
11/13/2023 Faculty of Computer Science and Engineering 9
Static RAM (SRAM) cell

Source: https://commons.wikimedia.org/wiki/File:SRAM_Cell_(6_Transistors).svg
11/13/2023 Faculty of Computer Science and Engineering 10
DRAM Technology
▪ Data stored as a charge in a capacitor
▪ Single transistor used to access the charge
▪ Must periodically be refreshed
▪ Read contents and write back
▪ Performed on a DRAM “row”
[Figure: DRAM organization — a bank is addressed by row and column, with activate (Act), precharge (Pre), and read/write (Rd/Wr) commands.]
11/13/2023 Faculty of Computer Science and Engineering 11
Advanced DRAM Organization
▪ Bits in a DRAM are organized as a rectangular
array
▪ DRAM accesses an entire row
▪ Burst mode: supply successive words from a row
with reduced latency
▪ Double data rate (DDR) DRAM
▪ Transfer on rising and falling clock edges
▪ Quad data rate (QDR) DRAM
▪ Separate DDR inputs and outputs

11/13/2023 Faculty of Computer Science and Engineering 12


DRAM Generations
Year Capacity $/GB
1980 64Kbit $1,500,000
1983 256Kbit $500,000
1985 1Mbit $200,000
1989 4Mbit $50,000
1992 16Mbit $15,000
1996 64Mbit $10,000
1998 128Mbit $4,000
2000 256Mbit $1,000
2004 512Mbit $250
2007 1Gbit $50
2010 2Gbit $30
2012 4Gbit $1
[Chart: row-access (Trac) and column-access (Tcac) times falling steadily from roughly 250ns/150ns in 1980 to a few ns by 2007.]
11/13/2023 Faculty of Computer Science and Engineering 13
DRAM Performance Factors
▪ Row buffer
▪ Allows several words to be read and refreshed in
parallel
▪ Synchronous DRAM
▪ Allows for consecutive accesses in bursts without
needing to send each address
▪ Improves bandwidth
▪ DRAM banking
▪ Allows simultaneous access to multiple DRAMs
▪ Improves bandwidth

11/13/2023 Faculty of Computer Science and Engineering 14


Main Memory Supporting Caches
▪ Use DRAMs for main memory
▪ Fixed width (e.g., 1 word)
▪ Connected by fixed-width clocked bus
▪ Bus clock is typically slower than CPU clock
▪ Example cache block read
▪ 1 bus cycle for address transfer
▪ 15 bus cycles per DRAM access
▪ 1 bus cycle per data transfer
11/13/2023 Faculty of Computer Science and Engineering 15
Increasing Memory Bandwidth
▪ For 4-word block, 1-word-wide
DRAM
▪ Miss penalty = 1 + 4×15 + 4×1 = 65
bus cycles
▪ Bandwidth = 16 bytes / 65 cycles =
0.25 B/cycle
▪ 4-word wide memory
▪ Miss penalty = 1 + 15 + 1 = 17 bus
cycles
▪ Bandwidth = 16 bytes / 17 cycles =
0.94 B/cycle
▪ 4-bank interleaved memory
▪ Miss penalty = 1 + 15 + 4×1 = 20 bus
cycles
▪ Bandwidth = 16 bytes / 20 cycles =
0.8 B/cycle
11/13/2023 Faculty of Computer Science and Engineering 16
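The three miss penalties above follow from the bus-cycle counts on the previous slide (1 cycle for the address, 15 per DRAM access, 1 per one-word transfer); a small sketch:

```c
#include <assert.h>

/* Miss penalty in bus cycles to fetch a 4-word block. */

/* 1-word-wide DRAM: four serial accesses, four serial transfers. */
int penalty_1word_wide(void)        { return 1 + 4 * 15 + 4 * 1; }

/* 4-word-wide memory: one wide access, one wide transfer. */
int penalty_4word_wide(void)        { return 1 + 15 + 1; }

/* 4-bank interleaved: the four accesses overlap, but the bus still
   carries one word per cycle. */
int penalty_4bank_interleaved(void) { return 1 + 15 + 4 * 1; }
```

Dividing the 16 transferred bytes by each penalty reproduces the bandwidths on the slide (0.25, 0.94, and 0.8 B/cycle).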
Flash Storage
▪ Nonvolatile semiconductor storage
▪ 100× – 1000× faster than disk
▪ Smaller, lower power, more robust
▪ But more $/GB (between disk and DRAM)

This Photo by Unknown Author is licensed under CC BY
11/13/2023 Faculty of Computer Science and Engineering 17
Flash Types
▪ NOR flash: bit cell like a NOR gate
▪ Was first introduced by Intel in 1988
▪ Random read/write access
▪ Used for instruction memory in embedded systems
▪ NAND flash: bit cell like a NAND gate (SSD)
▪ Was introduced by Toshiba in 1989
▪ Denser (bits/area), but block-at-a-time access
▪ Cheaper per GB
▪ Used for USB keys, media storage, …
▪ Flash bits wear out after 1000’s of accesses
▪ Not suitable for direct RAM or disk replacement
▪ Wear leveling: remap data to less used blocks

11/13/2023 Faculty of Computer Science and Engineering 18


Disk Storage
▪ Nonvolatile, rotating magnetic storage

11/13/2023 Faculty of Computer Science and Engineering 19


Disk Sectors and Access
▪ Each sector records
▪ Sector ID
▪ Data (512 bytes, 4096 bytes proposed)
▪ Error correcting code (ECC)
▪ Used to hide defects and recording errors
▪ Synchronization fields and gaps
▪ Access to a sector involves
▪ Queuing delay if other accesses are pending
▪ Seek: move the heads
▪ Rotational latency
▪ Data transfer
▪ Controller overhead

11/13/2023 Faculty of Computer Science and Engineering 20


Disk Access Example
▪ Given
▪ 512B sector, 15,000rpm, 4ms average seek time, 100MB/s
transfer rate, 0.2ms controller overhead, idle disk
▪ Average read time
▪ 4ms seek time
+ ½ / (15,000/60) = 2ms rotational latency
+ 512 / 100MB/s = 0.005ms transfer time
+ 0.2ms controller delay
= 6.2ms
▪ If actual average seek time is 1ms
▪ Average read time = 3.2ms

11/13/2023 Faculty of Computer Science and Engineering 21
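The average read time above is just a sum of four terms; a sketch in C (all times in milliseconds, using the slide's parameters):

```c
#include <assert.h>

/* Average time to read one 512B sector, in ms:
   seek + rotational latency (half a revolution on average)
   + transfer time + controller overhead. */
double avg_read_ms(double seek_ms, double rpm,
                   double transfer_MBps, double ctrl_ms) {
    double rotational_ms = 0.5 / (rpm / 60.0) * 1000.0;
    double transfer_ms   = 512.0 / (transfer_MBps * 1e6) * 1000.0;
    return seek_ms + rotational_ms + transfer_ms + ctrl_ms;
}
```

With the quoted 4ms seek this gives about 6.2ms; with a 1ms actual seek, about 3.2ms, matching the slide.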


Disk Performance Issues
▪ Manufacturers quote average seek time
▪ Based on all possible seeks
▪ Locality and OS scheduling lead to smaller actual average
seek times
▪ Smart disk controllers allocate physical sectors on disk
▪ Present logical sector interface to host
▪ SCSI, ATA, SATA
▪ Disk drives include caches
▪ Prefetch sectors in anticipation of access
▪ Avoid seek and rotational delay

11/13/2023 Faculty of Computer Science and Engineering 22


Cache Memory
▪ Cache memory
▪ The level of the memory hierarchy closest to the CPU
▪ Given accesses X1, …, Xn–1, Xn
▪ How do we know if the data is present?
▪ Where do we look?
[Figure: cache contents (a) before and (b) after the reference to Xn — the miss brings the block holding Xn into the cache alongside X1 … Xn–1.]
11/13/2023 Faculty of Computer Science and Engineering 23


Direct Mapped Cache
▪ Location determined by address
▪ Direct mapped: only one choice
▪ (Block address) modulo (#Blocks in cache)
▪ #Blocks is a power of 2
▪ Use low-order address bits
[Figure: 8-block direct-mapped cache with indices 000–111; memory blocks whose addresses share the same three low-order bits all map to the same cache index.]
11/13/2023 Faculty of Computer Science and Engineering 24


Addressing - offset
Address (bit 31 … 0): Tag (# bits) | Index (# bits) | Offset (# bits)
▪ Offset
▪ Determines the position (offset) of a datum within a block (line)
▪ Byte offset, half-word offset, word offset
[Figure: variables x and a inside one 8-byte block, in both cache and memory]
▪ Given an 8-byte block (line) → offset of x = 2, offset of a = 5
11/13/2023 Faculty of Computer Science and Engineering 25


Addressing - index
Address (bit 31 … 0): Tag (# bits) | Index (# bits) | Offset (# bits)
▪ Index
▪ Determines the position of a set in the cache
[Figure: x and a land in index 0 of an 8-block cache; memory is shown as consecutive groups of blocks Idx 0 … Idx 7]
▪ Given an 8-block cache → index of x, a = 0
11/13/2023 Faculty of Computer Science and Engineering 26
Addressing - Tag
Address (bit 31 … 0): Tag (# bits) | Index (# bits) | Offset (# bits)
▪ Tag
▪ Determines which block ID is stored at a specific index in the cache
▪ Block ID = {Tag, Idx}
[Figure: x and a are cached at index 0 with tag 1; memory blocks carry tags Tag 0 … Tag n]
▪ Block ID of x, a = 8 → tag of x, a = 1
11/13/2023 Faculty of Computer Science and Engineering 27
Addressing - Tag
Address (bit 31 … 0): Tag (# bits) | Index (# bits) | Offset (# bits)
▪ Tag
▪ Determines which block ID is stored at a specific index in the cache
▪ Block ID = {Tag, Idx}
[Figure: b and c are cached at index 0 with tag 0, while x and a (tag 1) map to the same index]
▪ Block ID of b, c = 0 → tag of b, c = 0
11/13/2023 Faculty of Computer Science and Engineering 28
Your turn
▪ What are physical address of x, a?
▪ What are the Tag, Index, and Byte Offset of
a variable where its address is 0x01020304
(Hex)
▪ Use the configuration in the previous slide

11/13/2023 Faculty of Computer Science and Engineering 29


Your turn
▪ What are the Tag, Index, Byte Offset, and
BlockID of a variable where its address is
0x10203040 (Hex)
▪ Direct mapped
▪ 32-word block
▪ 64-block cache

11/13/2023 Faculty of Computer Science and Engineering 30


Tags and Valid Bits
▪ How do we know which particular block is stored in
a cache location?
▪ Store block address as well as the data
▪ Actually, only need the high-order bits
▪ Called the tag
▪ What if there is no data in a location?
▪ Valid bit
▪ 1 = present.
▪ 0 = not present.
▪ Initially 0

11/13/2023 Faculty of Computer Science and Engineering 31
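Splitting an address into tag, index, and offset is just shifting and masking. A minimal sketch (field widths are parameters; the sample values in the test assume 16-byte blocks, i.e. 4 offset bits, and a 64-block direct-mapped cache, i.e. 6 index bits):

```c
#include <stdint.h>
#include <assert.h>

typedef struct { uint32_t tag, index, offset; } fields_t;

/* Extract the three address fields for a direct-mapped cache. */
fields_t split_address(uint32_t addr, unsigned offset_bits, unsigned index_bits) {
    fields_t f;
    f.offset = addr & ((1u << offset_bits) - 1);                 /* low bits   */
    f.index  = (addr >> offset_bits) & ((1u << index_bits) - 1); /* middle bits */
    f.tag    = addr >> (offset_bits + index_bits);               /* high bits  */
    return f;
}
```

For byte address 1200 this yields index 11 and tag 1, agreeing with the "Larger Block Size" example a few slides later.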


Miss/Hit ratio
Addressing of a block
Byte positions within an 8-byte block: +0 +1 +2 +3 +4 +5 +6 +7
▪ What are the addresses of a, b, c, …?
▪ What are the Tag/Index of a, b, c, …?
[Figure: memory layout — variables a and z, then d, c, i, then b and g, then h, spread across blocks Idx 0 … Idx 7 with tags Tag 0 … Tag n]
11/13/2023 Faculty of Computer Science and Engineering 32


Miss/Hit ratio
▪ After executing the given piece of C code, what is the miss/hit ratio?
▪ a = b + c;
▪ d = a + i;
▪ g = h + z;
▪ b = a + c;
▪ d = i + h;
All variables are 1-byte type.
In a statement of the form a = b + c, suppose b is read first, then c, then a is written.
[Figure: an empty direct-mapped cache (Idx 0 … Idx 7) above the same memory layout as the previous slide]
11/13/2023 Faculty of Computer Science and Engineering 33
Cache Example
▪ 8-blocks, 1 word/block, direct mapped
▪ Initial state
▪ Access word addresses: 22, 26, 22, 26, 16, 3, 16, 18
Index V Tag Data
000 N
001 N
010 N
011 N
100 N
101 N
110 N
111 N

11/13/2023 Faculty of Computer Science and Engineering 34


Cache Example
Access sequence: 22, 26, 22, 26, 16, 3, 16, 18
Word addr Binary addr Hit/miss Cache block
22 10 110 Miss 110
Index V Tag Data
000 N
001 N
010 N
011 N
100 N
101 N
110 Y 10 Mem[10110]
111 N
11/13/2023 Faculty of Computer Science and Engineering 35


Cache Example
Access sequence: 22, 26, 22, 26, 16, 3, 16, 18
Word addr Binary addr Hit/miss Cache block
26 11 010 Miss 010
Index V Tag Data
000 N
001 N
010 Y 11 Mem[11010]
011 N
100 N
101 N
110 Y 10 Mem[10110]
111 N
11/13/2023 Faculty of Computer Science and Engineering 36


Cache Example
Access sequence: 22, 26, 22, 26, 16, 3, 16, 18
Word addr Binary addr Hit/miss Cache block
22 10 110 Hit 110
26 11 010 Hit 010
Index V Tag Data
000 N
001 N
010 Y 11 Mem[11010]
011 N
100 N
101 N
110 Y 10 Mem[10110]
111 N
11/13/2023 Faculty of Computer Science and Engineering 37


Cache Example
Access sequence: 22, 26, 22, 26, 16, 3, 16, 18
Word addr Binary addr Hit/miss Cache block
16 10 000 Miss 000
3 00 011 Miss 011
16 10 000 Hit 000
Index V Tag Data
000 Y 10 Mem[10000]
001 N
010 Y 11 Mem[11010]
011 Y 00 Mem[00011]
100 N
101 N
110 Y 10 Mem[10110]
111 N
11/13/2023 Faculty of Computer Science and Engineering 38


Cache Example
Access sequence: 22, 26, 22, 26, 16, 3, 16, 18
Word addr Binary addr Hit/miss Cache block
18 10 010 Miss 010
Index V Tag Data
000 Y 10 Mem[10000]
001 N
010 Y 10 Mem[10010]
011 Y 00 Mem[00011]
100 N
101 N
110 Y 10 Mem[10110]
111 N
11/13/2023 Faculty of Computer Science and Engineering 39
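The worked example above can be replayed with a tiny direct-mapped cache simulator (a sketch: 8 one-word blocks, so index = word address mod 8 and tag = word address / 8):

```c
#include <assert.h>

#define NBLOCKS 8

/* Returns the number of hits for a sequence of n word addresses
   on an initially empty 8-block direct-mapped cache. */
int run_cache(const int *addrs, int n) {
    int valid[NBLOCKS] = {0}, tag[NBLOCKS] = {0}, hits = 0;
    for (int i = 0; i < n; i++) {
        int idx = addrs[i] % NBLOCKS;   /* low-order bits select the block */
        int t   = addrs[i] / NBLOCKS;   /* high-order bits are the tag     */
        if (valid[idx] && tag[idx] == t) {
            hits++;                     /* hit: entry valid, tag matches   */
        } else {
            valid[idx] = 1;             /* miss: fetch block, update tag   */
            tag[idx] = t;
        }
    }
    return hits;
}
```

Running it on the slide's sequence 22, 26, 22, 26, 16, 3, 16, 18 gives 3 hits and 5 misses, as the tables show (the final access to 18 evicts 26 from index 010).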


Your turn
▪ What is Hit/Miss ratio when a processor
accesses a sequence of byte address: 1, 4, 2,
12, 3, 32, 0, 33, 1, 44

11/13/2023 Faculty of Computer Science and Engineering 40


Address Subdivision
[Figure: 32-bit address split into a 20-bit tag (bits 31–12), a 10-bit index (bits 11–2), and a 2-bit byte offset; the index selects one of 1024 entries (valid bit, 20-bit tag, 32-bit data); comparing the stored tag with the address tag, gated by the valid bit, produces Hit, and the entry's data word is the output.]
11/13/2023 Faculty of Computer Science and Engineering 41


Example: Larger Block Size
▪ 64 blocks, 16 bytes/block
▪ To what block number does address 1200
map?
▪ Block address = 1200/16 = 75
▪ Block number (ID) = 75 modulo 64 = 11
31 10 9 4 3 0
Tag Index Offset
22 bits 6 bits 4 bits
11/13/2023 Faculty of Computer Science and Engineering 42
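The same mapping in code (block address = byte address / block size; block number = block address mod #blocks):

```c
#include <assert.h>

int block_address(int byte_addr, int block_bytes) {
    return byte_addr / block_bytes;   /* which block of memory */
}

int block_number(int block_addr, int nblocks) {
    return block_addr % nblocks;      /* which cache index it maps to */
}
```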
Block Size Considerations
▪ Larger blocks should reduce miss rate
▪ Due to spatial locality
▪ But in a fixed-sized cache
▪ Larger blocks → fewer of them
▪ More competition → increased miss rate
▪ Larger blocks → pollution (fetching much unnecessary data)
▪ Larger miss penalty
▪ Can override benefit of reduced miss rate
▪ Early restart and critical-word-first can help
11/13/2023 Faculty of Computer Science and Engineering 43
Cache Misses
▪ On cache hit, CPU proceeds normally
▪ On cache miss
▪ Stall the CPU pipeline
▪ Fetch block from next level of hierarchy
▪ Instruction cache miss
▪ Restart instruction fetch
▪ Data cache miss
▪ Complete data access

11/13/2023 Faculty of Computer Science and Engineering 44


Write-Through
▪ On data-write hit, could just update the block in cache
▪ But then cache and memory would be inconsistent
▪ Write through: also update memory
▪ But makes writes take longer
▪ e.g., if base CPI = 1, 10% of instructions are stores, write to
memory takes 100 cycles
▪ Effective CPI = 1 + 0.1×100 = 11
▪ Solution: write buffer
▪ Holds data waiting to be written to memory
▪ CPU continues immediately
▪ Only stalls on write if write buffer is already full

11/13/2023 Faculty of Computer Science and Engineering 45
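The effective-CPI figure above follows directly (a sketch with the slide's numbers: base CPI 1, 10% stores, 100-cycle memory writes):

```c
#include <assert.h>

/* Effective CPI when every store stalls for the full memory write. */
double write_through_cpi(double base_cpi, double store_frac, double write_cycles) {
    return base_cpi + store_frac * write_cycles;
}
```

This tenfold slowdown is why a write buffer, which lets the CPU continue while the write drains, matters so much.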


Write-Back
▪ Alternative: On data-write hit, just update
the block in cache
▪ Keep track of whether each block is dirty
▪ When a dirty block is replaced
▪ Write it back to memory
▪ Can use a write buffer to allow replacing block
to be read first

11/13/2023 Faculty of Computer Science and Engineering 46


Write Allocation
▪ What should happen on a write miss?
▪ Alternatives for write-through
▪ Allocate on miss (write allocate, or fetch-on-write): fetch the block
▪ Write around (write-no-allocate): don’t fetch the
block
▪ Since programs often write a whole block before
reading it (e.g., initialization)
▪ For write-back
▪ Usually fetch the block
11/13/2023 Faculty of Computer Science and Engineering 47
Cache coherence

11/13/2023 Faculty of Computer Science and Engineering 48


Example: Intrinsity FastMATH
▪ Embedded MIPS processor
▪ 12-stage pipeline
▪ Instruction and data access on each cycle
▪ Split cache: separate I-cache and D-cache
▪ Each 16KB: 256 blocks × 16 words/block
▪ D-cache: write-through or write-back
▪ SPEC2000 miss rates
▪ I-cache: 0.4%
▪ D-cache: 11.4%
▪ Weighted average: 3.2%
11/13/2023 Faculty of Computer Science and Engineering 49
Example: Intrinsity FastMATH
[Figure: FastMATH cache read — 32-bit address split into an 18-bit tag, an 8-bit index, a 4-bit block offset, and a 2-bit byte offset; 256 entries, each holding a valid bit, an 18-bit tag, and 512 bits (16 words) of data; a tag comparison produces Hit and a 16-to-1 multiplexor selects the requested 32-bit word.]
11/13/2023 Faculty of Computer Science and Engineering 50
Measuring Cache Performance
▪ Components of CPU time
▪ Program execution cycles
▪ Includes cache hit time
▪ Memory stall cycles
▪ Mainly from cache misses
▪ With simplifying assumptions:
Memory stall cycles
Memory accesses
=  Miss rate  Miss penalty
Program
Instructio ns Misses
=   Miss penalty
Program Instructio n
11/13/2023 Faculty of Computer Science and Engineering 51
I-cache, D-cache

[Figure: pipeline with split caches — the PC addresses the instruction cache (backed by instruction memory); register operands feed the ALU, whose address output accesses the data cache (backed by data memory).]
11/13/2023 Faculty of Computer Science and Engineering 52


I-cache, D-cache Miss
▪ I-cache miss (instruction fetch NOT FOUND):
IM → Stall → Stall → … → Stall → IM → REG → ALU → DM → REG
▪ D-cache miss (data access NOT FOUND):
IM → REG → ALU → DM → Stall → Stall → … → Stall → DM → REG
11/13/2023 Faculty of Computer Science and Engineering 53


Cache Performance Example
▪ Given
▪ I-cache miss rate = 2%
▪ D-cache miss rate = 4%
▪ Miss penalty = 100 cycles
▪ Base CPI (ideal cache) = 2
▪ Load & stores are 36% of instructions
▪ Miss cycles per instruction
▪ I-cache: 0.02 × 100 = 2
▪ D-cache: 0.36 × 0.04 × 100 = 1.44
▪ Actual CPI = 2 + 2 + 1.44 = 5.44
▪ Ideal CPU is 5.44/2 =2.72 times faster

11/13/2023 Faculty of Computer Science and Engineering 54
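The CPI calculation above, written out (a sketch; instruction-cache misses stall every instruction, data-cache misses only the loads and stores):

```c
#include <assert.h>

/* CPI including instruction- and data-cache miss stalls. */
double cpi_with_misses(double base_cpi, double i_miss_rate, double d_miss_rate,
                       double ls_frac, double penalty) {
    double i_stalls = i_miss_rate * penalty;             /* every fetch      */
    double d_stalls = ls_frac * d_miss_rate * penalty;   /* loads/stores only */
    return base_cpi + i_stalls + d_stalls;
}
```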


Average Access Time
▪ Hit time is also important for performance
▪ Average memory access time (AMAT)
▪ AMAT = Hit time + Miss rate × Miss penalty
▪ Example
▪ CPU with 1ns clock, hit time = 1 cycle, miss
penalty = 20 cycles, I-cache miss rate = 5%
▪ AMAT = 1 + 0.05 × 20 = 2ns
▪ 2 cycles per instruction
11/13/2023 Faculty of Computer Science and Engineering 55
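AMAT in code (a sketch; all times in the same unit, here nanoseconds):

```c
#include <assert.h>

/* Average memory access time: hit time plus the expected miss cost. */
double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}
```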
Performance Summary
▪ When CPU performance increased
▪ Miss penalty becomes more significant
▪ Decreasing base CPI
▪ Greater proportion of time spent on memory stalls
▪ Increasing clock rate
▪ Memory stalls account for more CPU cycles
▪ Can’t neglect cache behavior when evaluating
system performance
11/13/2023 Faculty of Computer Science and Engineering 56
Associative Caches
▪ Fully associative
▪ Allow a given block to go in any cache entry
▪ Requires all entries to be searched at once
▪ Comparator per entry (expensive)
▪ n-way set associative
▪ Each set contains n entries
▪ Block number determines which set
▪ (Block number) modulo (#Sets in cache)
▪ Search all entries in a given set at once
▪ n comparators (less expensive)

11/13/2023 Faculty of Computer Science and Engineering 57


Associative Cache Example
[Figure: placing one memory block — direct mapped: exactly one candidate entry (sets 0–7); 2-way set associative: both entries of the block's set (sets 0–3) are searched; fully associative: every entry is searched.]
11/13/2023 Faculty of Computer Science and Engineering 58


Your turn
▪ Given a configuration of cache system
▪ 32-word block
▪ 128-Kbyte cache.
▪ 4G RAM
▪ How wide are Tag, Index, and Byte Offset fields for:
▪ Direct mapped
▪ 4-way set associative
▪ Fully associative
▪ Given int A at 0x12345678. What are Tag, Index,
word offset of A for each configuration?
11/13/2023 Faculty of Computer Science and Engineering 59
Spectrum of Associativity
▪ For a cache with 8 entries
[Figure: the same 8 entries arranged as one-way set associative (direct mapped: 8 sets of 1), two-way set associative (4 sets of 2), four-way set associative (2 sets of 4), and eight-way set associative (fully associative: 1 set of 8), each entry holding a tag and data.]
11/13/2023 Faculty of Computer Science and Engineering 60
Associativity Example
▪ Compare 4-block caches
▪ Direct mapped, 2-way set associative,
fully associative
▪ Block access sequence: 0, 8, 0, 6, 8
▪ Direct mapped (100% Miss)
Block Cache Hit/miss Cache content after access
address index 0 1 2 3
0 0 miss Mem[0]
8 0 miss Mem[8]
0 0 miss Mem[0]
6 2 miss Mem[0] Mem[6]
8 0 miss Mem[8] Mem[6]
11/13/2023 Faculty of Computer Science and Engineering 61
Associativity Example
▪ 2-way set associative (80% Miss)
Block Cache Hit/miss Cache content after access
address index Set 0 Set 1
0 0 miss Mem[0]
8 0 miss Mem[0] Mem[8]
0 0 hit Mem[0] Mem[8]
6 0 miss Mem[0] Mem[6]
8 0 miss Mem[8] Mem[6]

▪ Fully associative (60% Miss)


Block Hit/miss Cache content after access
address
0 miss Mem[0]
8 miss Mem[0] Mem[8]
0 hit Mem[0] Mem[8]
6 miss Mem[0] Mem[8] Mem[6]
8 hit Mem[0] Mem[8] Mem[6]
11/13/2023 Faculty of Computer Science and Engineering 62
How Much Associativity
▪ Increased associativity decreases miss rate
▪ But with diminishing returns
▪ Simulation of a system with 64KB D-cache,
16-word blocks, SPEC2000
▪ 1-way: 10.3%
▪ 2-way: 8.6%
▪ 4-way: 8.3%
▪ 8-way: 8.1%
11/13/2023 Faculty of Computer Science and Engineering 63
Set Associative Cache Organization
[Figure: 4-way set-associative cache — 32-bit address split into a 22-bit tag, an 8-bit index, and a byte offset; the index selects one of 256 sets, each holding four (V, Tag, Data) entries; four comparators check the tags in parallel and a 4-to-1 multiplexor delivers the hit signal and data.]
11/13/2023 Faculty of Computer Science and Engineering 64


Replacement Policy
▪ Direct mapped: no choice
▪ Set associative
▪ Prefer non-valid entry, if there is one
▪ Otherwise, choose among entries in the set
▪ Least-recently used (LRU)
▪ Choose the one unused for the longest time
▪ Simple for 2-way, manageable for 4-way, too hard beyond that
▪ Random
▪ Gives approximately the same performance as LRU for high
associativity

11/13/2023 Faculty of Computer Science and Engineering 65


Multilevel Caches
▪ Primary cache attached to CPU
▪ Small, but fast
▪ Level-2 cache services misses from primary
cache
▪ Larger, slower, but still faster than main
memory
▪ Main memory services L-2 cache misses
▪ Some high-end systems include L-3 cache
11/13/2023 Faculty of Computer Science and Engineering 66
Multilevel Cache Example
▪ Given
▪ CPU base CPI = 1, clock rate = 4GHz
▪ Miss rate/instruction = 2%
▪ Main memory access time = 100ns
▪ With just primary cache
▪ Miss penalty = 100ns/0.25ns = 400 cycles
▪ Effective CPI = 1 + 0.02 × 400 = 9
11/13/2023 Faculty of Computer Science and Engineering 67
Multilevel Cache Example (cont.)
▪ Now add L-2 cache
▪ Access time = 5ns
▪ Global miss rate to main memory = 0.5%
▪ Primary miss with L-2 hit
▪ Penalty = 5ns/0.25ns = 20 cycles
▪ Primary miss with L-2 miss
▪ Extra penalty = 500 cycles
▪ CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4
▪ Performance ratio = 9/3.4 = 2.6
11/13/2023 Faculty of Computer Science and Engineering 68
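The two-level CPI above can be reproduced the same way (a sketch: every L1 miss pays the L2 access penalty, and the global misses additionally pay the main-memory penalty):

```c
#include <assert.h>

/* CPI with an L2 cache: base + L1-miss stalls (to L2) + global-miss
   stalls (to main memory). */
double two_level_cpi(double base_cpi, double l1_miss, double l2_penalty,
                     double global_miss, double mem_penalty) {
    return base_cpi + l1_miss * l2_penalty + global_miss * mem_penalty;
}
```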
Multilevel Cache Considerations
▪ Primary cache
▪ Focus on minimal hit time
▪ L-2 cache
▪ Focus on low miss rate to avoid main memory
access
▪ Hit time has less overall impact
▪ Results
▪ L-1 cache usually smaller than a single cache
▪ L-1 block size smaller than L-2 block size

11/13/2023 Faculty of Computer Science and Engineering 69


Interactions with Advanced CPUs
▪ Out-of-order CPUs can execute instructions
during cache miss
▪ Pending store stays in load/store unit
▪ Dependent instructions wait in reservation stations
▪ Independent instructions continue
▪ Effect of miss depends on program data flow
▪ Much harder to analyse
▪ Use system simulation
11/13/2023 Faculty of Computer Science and Engineering 70
Virtual Machines
▪ Host computer emulates guest operating system and
machine resources
▪ Improved isolation of multiple guests
▪ Avoids security and reliability problems
▪ Aids sharing of resources
▪ Virtualization has some performance impact
▪ Feasible with modern high-performance computers
▪ Examples: (Operating) System Virtual Machines
▪ IBM VM/370 (1970s technology!)
▪ VMWare
▪ Microsoft Virtual PC

11/13/2023 Faculty of Computer Science and Engineering 71


Virtual Machine Monitor (hypervisor)
▪ Maps virtual resources to physical resources
▪ Memory, I/O devices, CPUs
▪ Guest code runs on native machine in user
mode
▪ Traps to VMM on privileged instructions and access
to protected resources
▪ Guest OS may be different from host OS
▪ VMM handles real I/O devices
▪ Emulates generic virtual I/O devices for guest

11/13/2023 Faculty of Computer Science and Engineering 72


Virtual machine architecture
[Figure: VM1 and VM2 each run several applications on a guest operating system (Linux and Windows, respectively); both sit on the VMware hypervisor, which runs on the host operating system and the physical hardware.]
11/13/2023 Faculty of Computer Science and Engineering 73


Example: Timer Virtualization
▪ In native machine, on timer interrupt
▪ OS suspends current process, handles interrupt, selects
and resumes next process
▪ With Virtual Machine Monitor
▪ VMM suspends current VM, handles interrupt, selects
and resumes next VM
▪ If a VM requires timer interrupts
▪ VMM emulates a virtual timer
▪ Emulates interrupt for VM when physical timer
interrupt occurs

11/13/2023 Faculty of Computer Science and Engineering 74


Instruction Set Support
▪ User and System modes
▪ Privileged instructions only available in system mode
▪ Trap to system if executed in user mode
▪ All physical resources only accessible using
privileged instructions
▪ Including page tables, interrupt controls, I/O registers
▪ Renaissance of virtualization support
▪ Current ISAs (e.g., x86) adapting

11/13/2023 Faculty of Computer Science and Engineering 75


Virtual Memory
▪ Use main memory as a “cache” for secondary (disk) storage
▪ Managed jointly by CPU hardware and the operating system (OS)
▪ Programs share main memory
▪ Each gets a private virtual address space holding its frequently used
code and data
▪ Protected from other programs
▪ CPU and OS translate virtual addresses to physical addresses
▪ VM “block” is called a page
▪ VM translation “miss” is called a page fault

11/13/2023 Faculty of Computer Science and Engineering 76


Address Translation
▪ Fixed-size pages (e.g., 4K)
Virtual addresses Physical addresses
Address translation

Disk addresses

11/13/2023 Faculty of Computer Science and Engineering 77


Page Fault Penalty
▪ On page fault, the page must be fetched
from disk
▪ Takes millions of clock cycles
▪ Handled by OS code
▪ Try to minimize page fault rate
▪ Fully associative placement
▪ Smart replacement algorithms
Page Tables
▪ Stores placement information
▪ Array of page table entries, indexed by virtual page
number
▪ Page table register in CPU points to page table in
physical memory
▪ If page is present in memory
▪ PTE stores the physical page number
▪ Plus other status bits (referenced, dirty, …)
▪ If page is not present
▪ PTE can refer to location in swap space on disk
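A minimal page-table lookup can be sketched as follows; the dictionary-of-entries structure is an assumed simplification, not any real OS's layout:

```python
# Each page table entry (PTE) records whether the page is resident and,
# if so, its physical page number; otherwise its swap-space location.
page_table = {
    0: {"valid": True,  "ppn": 7},
    1: {"valid": False, "disk": "swap block 42"},
}

def translate(vpn):
    entry = page_table.get(vpn)
    if entry is None or not entry["valid"]:
        raise LookupError("page fault")   # OS must fetch the page
    return entry["ppn"]

print(translate(0))   # 7
```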



Translation Using a Page Table
[Figure: the page table register in the CPU points to the page table in
physical memory. The 32-bit virtual address splits into a 20-bit virtual
page number (bits 31–12) and a 12-bit page offset (bits 11–0). The virtual
page number indexes the page table; if the entry's valid bit is 0, the page
is not present in memory. Otherwise the entry's 18-bit physical page number
is concatenated with the unchanged page offset to form the physical address.]
Mapping Pages to Storage
[Figure: each page table entry, indexed by virtual page number, has a valid
bit. Entries with valid = 1 map the virtual page to a physical page in
memory; entries with valid = 0 instead point to the page's location in disk
(swap) storage.]


Replacement and Writes
▪ To reduce page fault rate, prefer least-recently used
(LRU) replacement
▪ Reference bit (aka use bit) in PTE set to 1 on access to page
▪ Periodically cleared to 0 by OS
▪ A page with reference bit = 0 has not been used recently
▪ Disk writes take millions of cycles
▪ Block at once, not individual locations
▪ Write through is impractical
▪ Use write-back
▪ Dirty bit in PTE set when page is written
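The reference-bit LRU approximation described above can be sketched as follows (an assumed simplification with four pages; a real OS would use a clock-style scan):

```python
# Reference-bit LRU approximation: hardware sets the bit on access,
# the OS periodically clears it, and a page with bit 0 has not been
# used recently and is a good replacement victim.
pages = {vpn: {"ref": 0} for vpn in range(4)}

def access(vpn):
    pages[vpn]["ref"] = 1        # set by hardware on each access

def os_clear_bits():
    for e in pages.values():     # periodically cleared by the OS
        e["ref"] = 0

def pick_victim():
    for vpn, e in pages.items():
        if e["ref"] == 0:        # not referenced since last clearing
            return vpn
    return next(iter(pages))     # all referenced: fall back to any

os_clear_bits()
access(0); access(2)
print(pick_victim())   # 1
```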



Fast Translation Using a TLB
▪ Translation-lookaside buffer
▪ Address translation would appear to require extra
memory references
▪ One to access the PTE
▪ Then the actual memory access
▪ But access to page tables has good locality
▪ So use a fast cache of PTEs within the CPU
▪ Called a Translation Look-aside Buffer (TLB)
▪ Typical: 16–512 PTEs, 0.5–1 cycle for hit, 10–100 cycles for
miss, 0.01%–1% miss rate
▪ Misses could be handled by hardware or software
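The TLB-as-a-cache-of-PTEs idea can be sketched like this (simplified to a fully associative, unbounded map; real TLBs hold only 16–512 entries):

```python
# A TLB caches recent virtual-to-physical translations so that most
# accesses avoid the extra memory reference to read the PTE.
tlb = {}                      # vpn -> ppn, filled on TLB misses
page_table = {0: 7, 1: 3}     # assumed resident pages

def translate(vpn):
    if vpn in tlb:
        return tlb[vpn], "TLB hit"
    ppn = page_table[vpn]     # extra memory reference for the PTE
    tlb[vpn] = ppn            # cache the translation for next time
    return ppn, "TLB miss"

print(translate(0))   # (7, 'TLB miss')
print(translate(0))   # (7, 'TLB hit')
```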



Fast Translation Using a TLB
[Figure: each TLB entry holds valid, dirty, and reference bits, a tag (the
virtual page number), and the physical page address. On a TLB hit the
translation comes straight from the TLB; on a miss the full page table is
consulted, whose entries point either to physical memory or, for
non-resident pages, to disk storage.]
TLB Misses
▪ If page is in memory
▪ Load the PTE from memory and retry
▪ Could be handled in hardware
▪ Can get complex for more complicated page table
structures
▪ Or in software
▪ Raise a special exception, with optimized handler
▪ If page is not in memory (page fault)
▪ OS handles fetching the page and updating the page
table
▪ Then restart the faulting instruction
TLB Miss Handler
▪ TLB miss indicates
▪ Page present, but PTE not in TLB
▪ Page not present
▪ Must recognize TLB miss before destination
register overwritten
▪ Raise exception
▪ Handler copies PTE from memory to TLB
▪ Then restarts instruction
▪ If page not present, page fault will occur



Page Fault Handler
▪ Use faulting virtual address to find PTE
▪ Locate page on disk
▪ Choose page to replace
▪ If dirty, write to disk first
▪ Read page into memory and update page table
▪ Make process runnable again
▪ Restart from faulting instruction
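The handler steps above can be sketched in Python; the page table, memory, and disk structures are hypothetical simplifications (a real handler runs in kernel mode on hardware state):

```python
# Page fault handling: write back the dirty victim, read the faulting
# page from disk, update the page table, then restart the instruction.
pt = {
    0: {"valid": True, "ppn": 1, "dirty": True, "disk": "swap:5"},
    1: {"valid": False, "disk": "swap:9"},
}
memory = {1: "modified data"}              # ppn -> page contents
disk = {"swap:5": "stale", "swap:9": "needed data"}

def handle_page_fault(vpn, victim_vpn):
    victim = pt[victim_vpn]
    ppn = victim["ppn"]
    if victim["dirty"]:                    # if dirty, write to disk first
        disk[victim["disk"]] = memory[ppn]
    victim["valid"] = False
    memory[ppn] = disk[pt[vpn]["disk"]]    # read page into memory
    pt[vpn].update(valid=True, ppn=ppn, dirty=False)
    # ...make the process runnable and restart the faulting instruction

handle_page_fault(vpn=1, victim_vpn=0)
print(memory[1], "|", disk["swap:5"])   # needed data | modified data
```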
TLB and Cache Interaction
▪ If cache tag uses physical address
▪ Need to translate before cache lookup
▪ Alternative: use virtual address tag
▪ Complications due to aliasing
▪ Different virtual addresses for shared physical address

[Figure: a physically tagged cache. The TLB translates the 20-bit virtual
page number into a physical page number, which is concatenated with the
12-bit page offset to form the physical address; the cache is then indexed
and tag-checked with that physical address.]
Memory Protection
▪ Different tasks can share parts of their virtual
address spaces
▪ But need to protect against errant access
▪ Requires OS assistance
▪ Hardware support for OS protection
▪ Privileged supervisor mode (aka kernel mode)
▪ Privileged instructions
▪ Page tables and other state information only
accessible in supervisor mode
▪ System call exception (e.g., syscall in MIPS)



The Memory Hierarchy
▪ Common principles apply at all levels of the
memory hierarchy
▪ Based on notions of caching
▪ At each level in the hierarchy
▪ Block placement
▪ Finding a block
▪ Replacement on a miss
▪ Write policy
Block Placement
▪ Determined by associativity
▪ Direct mapped (1-way associative)
▪ One choice for placement
▪ n-way set associative
▪ n choices within a set
▪ Fully associative
▪ Any location
▪ Higher associativity reduces miss rate
▪ Increases complexity, cost, and access time
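The placement choices above can be illustrated with a small sketch, assuming a cache of 8 block frames (the size is an assumption for illustration):

```python
# For a given associativity, list the frames where a block may live:
# direct mapped gives one choice, n-way gives n, fully associative
# gives every frame.
NUM_FRAMES = 8

def candidate_frames(block_addr, ways):
    num_sets = NUM_FRAMES // ways
    s = block_addr % num_sets              # set index
    return [s * ways + w for w in range(ways)]

print(candidate_frames(12, 1))   # [4]            direct mapped
print(candidate_frames(12, 2))   # [0, 1]         2-way set associative
print(candidate_frames(12, 8))   # [0, 1, ..., 7] fully associative
```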
Finding a Block
▪ Hardware caches
▪ Reduce comparisons to reduce cost
▪ Virtual memory
▪ Full table lookup makes full associativity feasible
▪ Benefit in reduced miss rate

Associativity           Location method                    Tag comparisons
Direct mapped           Index                              1
n-way set associative   Set index, then search entries     n
                        within the set
Fully associative       Search all entries                 #entries
                        Full lookup table                  0
Replacement
▪ Choice of entry to replace on a miss
▪ Least recently used (LRU)
▪ Complex and costly hardware for high
associativity
▪ Random
▪ Close to LRU, easier to implement
▪ Virtual memory
▪ LRU approximation with hardware support



Write Policy
▪ Write-through
▪ Update both upper and lower levels
▪ Simplifies replacement, but may require write
buffer
▪ Write-back
▪ Update upper level only
▪ Update lower level when block is replaced
▪ Need to keep more state
▪ Virtual memory
▪ Only write-back is feasible, given disk write latency
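The two write policies contrast as follows in a schematic sketch (the cache and memory here are assumed toy structures):

```python
# Write-through updates both levels immediately; write-back marks the
# block dirty and defers the lower-level update until eviction.
cache, memory = {}, {}
dirty = set()

def write_through(addr, value):
    cache[addr] = value
    memory[addr] = value       # lower level updated on every write

def write_back(addr, value):
    cache[addr] = value
    dirty.add(addr)            # lower level updated only on eviction

def evict(addr):
    if addr in dirty:          # dirty block: write back now
        memory[addr] = cache[addr]
        dirty.discard(addr)
    del cache[addr]

write_back(0x10, "A")
print(memory.get(0x10))   # None  (memory not yet updated)
evict(0x10)
print(memory.get(0x10))   # A
```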
Sources of Misses
▪ Compulsory misses (aka cold start misses)
▪ First access to a block
▪ Capacity misses
▪ Due to finite cache size
▪ A replaced block is later accessed again
▪ Conflict misses (aka collision misses)
▪ In a non-fully associative cache
▪ Due to competition for entries in a set
▪ Would not occur in a fully associative cache of the
same total size



Cache Design Trade-offs
Design change            Effect on miss rate          Negative performance effect
Increase cache size      Decrease capacity misses     May increase access time
Increase associativity   Decrease conflict misses     May increase access time
Increase block size      Decrease compulsory misses   Increases miss penalty. For very
                                                      large block size, may increase
                                                      miss rate due to pollution.



Cache Control
▪ Example cache characteristics
▪ Direct-mapped, write-back, write allocate
▪ Block size: 4 words (16 bytes)
▪ Cache size: 16 KB (1024 blocks)
▪ 32-bit byte addresses
▪ Valid bit and dirty bit per block
▪ Blocking cache
▪ CPU waits until access is complete
Address layout: bits 31–14 Tag (18 bits) | bits 13–4 Index (10 bits) | bits 3–0 Offset (4 bits)
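For the example cache (18-bit tag, 10-bit index, 4-bit offset), the field extraction can be sketched as:

```python
# Decompose a 32-bit byte address into tag, index, and block offset
# for a 16 KB direct-mapped cache with 16-byte blocks.
def decompose(addr):
    offset = addr & 0xF             # bits 3..0  (4 bits)
    index = (addr >> 4) & 0x3FF     # bits 13..4 (10 bits)
    tag = addr >> 14                # bits 31..14 (18 bits)
    return tag, index, offset

print(decompose(0x0001_2348))   # (4, 564, 8), i.e. index 0x234
```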
Interface Signals
[Figure: interface signals. CPU–cache: Read/Write, Valid, 32-bit Address,
32-bit Write Data, 32-bit Read Data, Ready. Cache–memory: Read/Write,
Valid, 32-bit Address, 128-bit Write Data, 128-bit Read Data, Ready;
memory takes multiple cycles per access.]



Cache Coherence Problem
▪ Suppose two CPU cores share a physical
address space
▪ Write-through caches
Time step  Event                CPU A’s cache  CPU B’s cache  Memory
0                                                             0
1          CPU A reads X        0                             0
2          CPU B reads X        0              0              0
3          CPU A writes 1 to X  1              0              1
Coherence Defined
▪ Informally: Reads return most recently written value
▪ Formally:
▪ P writes X; P reads X (no intervening writes)
⇒ read returns written value
▪ P1 writes X; P2 reads X (sufficiently later)
⇒ read returns written value
▪ c.f. CPU B reading X after step 3 in example
▪ P1 writes X, P2 writes X
⇒ all processors see writes in the same order
▪ End up with the same final value for X
Cache Coherence Protocols
▪ Operations performed by caches in multiprocessors to
ensure coherence
▪ Migration of data to local caches
▪ Reduces bandwidth for shared memory
▪ Replication of read-shared data
▪ Reduces contention for access
▪ Snooping protocols
▪ Each cache monitors bus reads/writes
▪ Directory-based protocols
▪ Caches and memory record sharing status of blocks in a
directory



Invalidating Snooping Protocols
▪ Cache gets exclusive access to a block when it is to
be written
▪ Broadcasts an invalidate message on the bus
▪ Subsequent read in another cache misses
▪ Owning cache supplies updated value
CPU activity         Bus activity       CPU A’s cache  CPU B’s cache  Memory
                                                                      0
CPU A reads X        Cache miss for X   0                             0
CPU B reads X        Cache miss for X   0              0              0
CPU A writes 1 to X  Invalidate for X   1                             0
CPU B reads X        Cache miss for X   1              1              1
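The invalidate-based trace above can be modeled with a toy sketch; this is a deliberately simplified illustration (no MESI states, bus modeled as direct dictionary access), not a real protocol implementation:

```python
# Two caches share variable X. A write broadcasts an invalidate; a
# subsequent read in the other cache misses and the owning cache
# supplies the updated value.
memory = {"X": 0}
caches = {"A": {}, "B": {}}

def read(cpu, var):
    if var not in caches[cpu]:                 # cache miss
        owner = next((c for c in caches if var in caches[c]), None)
        val = caches[owner][var] if owner else memory[var]
        memory[var] = val                      # owning cache supplies data
        caches[cpu][var] = val
    return caches[cpu][var]

def write(cpu, var, val):
    for other in caches:
        if other != cpu:
            caches[other].pop(var, None)       # broadcast invalidate
    caches[cpu][var] = val

read("A", "X"); read("B", "X")
write("A", "X", 1)
print(read("B", "X"))   # 1  (miss; cache A supplies the new value)
```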
Memory Consistency
▪ When are writes seen by other processors
▪ “Seen” means a read returns the written value
▪ Can’t be instantaneously
▪ Assumptions
▪ A write completes only when all processors have seen it
▪ A processor does not reorder writes with other accesses
▪ Consequence
▪ P writes X then writes Y
⇒ all processors that see new Y also see new X
▪ Processors can reorder reads, but not writes



Multilevel On-Chip Caches



2-Level TLB Organization



Supporting Multiple Issue
▪ Both have multi-banked caches that allow
multiple accesses per cycle assuming no bank
conflicts
▪ Core i7 cache optimizations
▪ Return requested word first
▪ Non-blocking cache
▪ Hit under miss
▪ Miss under miss
▪ Data prefetching
Pitfalls
▪ Byte vs. word addressing
▪ Example: 32-byte direct-mapped cache,
4-byte blocks
▪ Byte 36 maps to block 1
▪ Word 36 maps to block 4
▪ Ignoring memory system effects when writing
or generating code
▪ Example: iterating over rows vs. columns of arrays
▪ Large strides result in poor locality
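The byte- vs. word-addressing pitfall above checks out numerically (32-byte direct-mapped cache with 4-byte blocks, i.e. 8 blocks):

```python
# Map an address to its cache block, treating the address first as a
# byte address and then as a word address.
BLOCK_SIZE, NUM_BLOCKS = 4, 8

def block_for_byte_addr(addr):
    return (addr // BLOCK_SIZE) % NUM_BLOCKS

def block_for_word_addr(word):
    return block_for_byte_addr(word * 4)   # word addr -> byte addr

print(block_for_byte_addr(36))   # 1
print(block_for_word_addr(36))   # 4
```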
Pitfalls
▪ In multiprocessor with shared L2 or L3 cache
▪ Less associativity than cores results in conflict
misses
▪ More cores ⇒ need to increase associativity
▪ Using AMAT to evaluate performance of out-of-
order processors
▪ Ignores effect of non-blocked accesses
▪ Instead, evaluate performance by simulation
Pitfalls
▪ Extending address range using segments
▪ E.g., Intel 80286
▪ But a segment is not always big enough
▪ Makes address arithmetic complicated
▪ Implementing a VMM on an ISA not designed for
virtualization
▪ E.g., non-privileged instructions accessing hardware
resources
▪ Either extend ISA, or require guest OS not to use
problematic instructions



Concluding Remarks
▪ Fast memories are small, large memories are slow
▪ We really want fast, large memories ☹
▪ Caching gives this illusion ☺
▪ Principle of locality
▪ Programs use a small part of their memory space
frequently
▪ Memory hierarchy
▪ L1 cache  L2 cache  …  DRAM memory
 disk
▪ Memory system design is critical for multiprocessors
