
Caches & Memory

CS 3410
Computer System Organization & Programming

These slides are the product of many rounds of teaching CS 3410 by Professors Weatherspoon, Bala, Bracy, and Sirer.
Programs 101

C Code:

    int main (int argc, char* argv[]) {
      int i;
      int m = n;
      int sum = 0;
      for (i = 1; i <= m; i++) {
        sum += i;
      }
      printf("...", n, sum);
    }

MIPS Assembly:

    main: addiu $sp,$sp,-48
          sw    $31,44($sp)
          sw    $fp,40($sp)
          move  $fp,$sp
          sw    $4,48($fp)
          sw    $5,52($fp)
          la    $2,n
          lw    $2,0($2)
          sw    $2,28($fp)
          sw    $0,32($fp)
          li    $2,1
          sw    $2,24($fp)
    $L2:  lw    $2,24($fp)

Load/Store Architectures:
• Read data from memory (put in registers)
• Manipulate it
• Store it back to memory

Note the instructions that read from or write to memory.
1 Cycle Per Stage: the Biggest Lie (So Far)

[Figure: the five-stage pipelined datapath: PC and +4 feeding the instruction memory, the register file (A, B), the ALU (D), and the data memory (din/dout, M), with the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline latches, control signals, immediate extend, jump/branch target computation, and the hazard-detect and forwarding units.]

Code is stored in memory (also data and the stack), so both Instruction Fetch and the Memory stage access memory.

Instruction Fetch | Instruction Decode | Execute | Memory | Write-Back


What's the problem?

CPU ↔ Main Memory

Main memory is:
+ big
– slow
– far away

[SandyBridge Motherboard, 2011; image: http://news.softpedia.com]
The Need for Speed

CPU Pipeline

Instruction speeds:
• add, sub, shift: 1 cycle
• mult: 3 cycles
• load/store: 100 cycles

Off-chip memory access takes 50-70 ns; a 2-3 GHz processor has a ~0.5 ns clock, so one memory access costs on the order of 100 cycles.
What's the solution?
Caches!

[Intel Pentium 3 die photo, 1999: the Level 1 Data $, Level 1 Insn $, and Level 2 $ are visible on chip.]


Aside

• Go back to 04-state and 05-memory and look at how registers, SRAM and DRAM are built.
What lucky data gets to go in these small, fast on-chip caches?
Locality Locality Locality

If you ask for something, you're likely to ask for:
• the same thing again soon → Temporal Locality
• something near that thing, soon → Spatial Locality

    total = 0;
    for (i = 0; i < n; i++)
      total += a[i];
    return total;
Clicker Questions

This highlights the temporal and spatial locality of data.

    1 total = 0;
    2 for (i = 0; i < n; i++) {
    3   n--;
    4   total += a[i];
    5 return total;

Q1: Which line of code exhibits good temporal locality?
Q2: Which line of code exhibits good spatial locality with the line after it?
A) 1  B) 2  C) 3  D) 4  E) 5
Your life is full of Locality

• Last Called
• Speed Dial
• Favorites
• Contacts
• Google/Facebook/email
The Memory Hierarchy

Small, Fast
    Registers     1 cycle,    128 bytes
    L1 Caches     4 cycles,   64 KB
    L2 Cache      12 cycles,  256 KB
    L3 Cache      36 cycles,  2-20 MB
    Main Memory   50-70 ns,   512 MB - 4 GB
    Disk          5-20 ms,    16 GB - 4 TB
Big, Slow

(Intel Haswell Processor, 2013)
Some Terminology

Cache hit
• data is in the cache
• thit: time it takes to access the cache
• Hit rate (%hit): # cache hits / # cache accesses

Cache miss
• data is not in the cache
• tmiss: time it takes to get the data from below the $
• Miss rate (%miss): # cache misses / # cache accesses

Cacheline or cacheblock (or simply line or block): the unit of data stored in the cache and transferred to and from the level below.
The Memory Hierarchy

Average access time:
    tavg = thit + %miss × tmiss
         = 4 + 5% × 100
         = 9 cycles

(Here, for the L1 of the hierarchy above: a 4-cycle hit time, with 5% of accesses missing and paying a 100-cycle penalty.)


Single Core Memory Hierarchy

ON CHIP: the processor's registers (Regs), the L1 caches (I$ and D$), and the L2 cache.
OFF CHIP: main memory, then disk.
Multi-Core Memory Hierarchy

ON CHIP: four processors, each with its own Regs, I$, D$, and L2; a single L3 shared by all cores.
OFF CHIP: shared main memory, then disk.
Memory Hierarchy by the Numbers

CPU clock rates ~0.33 ns - 2 ns (3 GHz - 500 MHz)

    Memory technology  Transistor count*              Access time  Access time in cycles  $ per GiB in 2012  Capacity
    SRAM (on chip)     6-8 transistors                0.5-2.5 ns   1-3 cycles             $4k                256 KB
    SRAM (off chip)                                   1.5-30 ns    5-15 cycles            $4k                32 MB
    DRAM               1 transistor (needs refresh)   50-70 ns     150-200 cycles         $10-$20            8 GB
    SSD (Flash)                                       5k-50k ns    Tens of thousands      $0.75-$1           512 GB
    Disk                                              5M-20M ns    Millions               $0.05-$0.1         4 TB

*Registers, D-flip-flops: 10-100s of registers
Basic Cache Design

Direct Mapped Caches

24
16 Byte Memory

MEMORY
    addr  data
    0000  A
    0001  B
    0010  C
    0011  D
    0100  E
    0101  F
    0110  G
    0111  H
    1000  J
    1001  K
    1010  L
    1011  M
    1100  N
    1101  O
    1110  P
    1111  Q

load 1100 → r1

• Byte-addressable memory
• 4 address bits → 16 bytes total
• b addr bits → 2^b bytes in memory
4-Byte, Direct Mapped Cache

Address: XXXX

CACHE
    index  data
    00     A      Cache entry = row = (cache) line = (cache) block
    01     B
    10     C
    11     D

Block Size: 1 byte

Direct mapped:
• Each address maps to 1 cache block
• 4 entries → 2 index bits (2^n entries → n bits)

Index with LSBs:
• Supports spatial locality
Analogy to a Spice Rack

Spice Rack (Cache) vs. Spice Wall (Memory): the rack holds a few spices in indexed slots; the wall holds all of them, A through Z.

• Compared to your spice wall, the rack is:
  – Smaller
  – Faster
  – More costly (per oz.)

http://www.bedbathandbeyond.com
Analogy to a Spice Rack

Now each rack slot carries a tag (e.g. "innamon" labeling the Cinnamon jar).

• How do you know what's in the jar?
• Need labels
Tag = Ultra-minimalist label
4-Byte, Direct Mapped Cache

Address: tag|index (XXXX)

CACHE
    index  tag  data
    00     00   A
    01     00   B
    10     00   C
    11     00   D

Tag: minimalist label/address
address = tag + index
4-Byte, Direct Mapped Cache

One last tweak: a valid bit per entry.

CACHE
    index  V  tag  data
    00     0  00   X
    01     0  00   X
    10     0  00   X
    11     0  00   X
Simulation #1 of a 4-byte, DM Cache

Address: tag|index (XXXX). Lookup:
• Index into $
• Check tag
• Check valid bit

Start: all entries invalid.

load 1100 → Miss. Index 00, tag 11: the entry is invalid, so fetch M[1100] = N from memory and fill:

    index  V  tag  data
    00     1  11   N
    01     0  xx   X
    10     0  xx   X
    11     0  xx   X

load 1100 (again, later) → Hit! Index 00 is valid and tag 11 matches. Awesome!
Block Diagram: 4-entry, direct mapped Cache

[Figure: the address 1101 splits into a 2-bit tag (11) and a 2-bit index (01). The index selects a row of the V/tag/data array; the stored tag is compared against the address tag and ANDed with the valid bit to produce "Hit!", while the 8-bit data (1010 0101) is read out.]

Great! Are we done?
Simulation #2: 4-byte, DM Cache

Clicker for each access: A) Hit  B) Miss

Lookup: index into $, check tag, check valid bit. Start: all entries invalid.

load 1100 → Miss (cold). Fill index 00 with tag 11, data N.
load 1101 → Miss (cold). Fill index 01 with tag 11, data O.
load 0100 → Miss (cold). Index 00, tag 01: tag mismatch, so evict N and fill index 00 with tag 01, data E.
load 1100 → Miss (conflict). Index 00 now holds tag 01 but we need tag 11: evict E, refill N.

Final state:

    index  V  tag  data
    00     1  11   N
    01     1  11   O
    10     0  xx   X
    11     0  xx   X

4 misses, 0 hits. Disappointed!
Reducing Cold Misses
by Increasing Block Size
Leveraging Spatial Locality

45
Increasing Block Size

Address: XXXX, with the least significant bit now a block offset.

CACHE
    index  V  tag  data
    00     0  x    A | B
    01     0  x    C | D
    10     0  x    E | F
    11     0  x    G | H

• Block Size: 2 bytes
• Block Offset: least significant bits indicate where you live in the block
• Which bits are the index? tag?
Simulation #3: 8-byte, DM Cache

Address: tag|index|offset (XXXX): 1 tag bit, 2 index bits, 1 offset bit. Lookup: index into $, check tag, check valid bit. Start: all entries invalid.

load 1100 → Miss (cold). Index 10, tag 1: fill the whole block, N | O.
load 1101 → Hit! Same block as 1100, offset 1 → O. Spatial locality pays off.
load 0100 → Miss (cold). Index 10, tag 0: mismatch, evict N | O, fill E | F.
load 1100 → Miss (conflict). Index 10 now holds tag 0, and we need tag 1 again.

Final state:

    index  V  tag  data
    00     0  x    X | X
    01     0  x    X | X
    10     1  0    E | F
    11     0  x    X | X

1 hit, 3 misses. 3 bytes don't fit in a 4-entry cache?
Removing Conflict Misses
with Fully-Associative Caches

54
8 byte, fully-associative Cache

Address: tag|offset (XXXX).

CACHE (any block can go in any of the 4 entries):

    V tag data | V tag data | V tag data | V tag data
    0 xxx X|X  | 0 xxx X|X  | 0 xxx X|X  | 0 xxx X|X

Clicker: What should the offset be? What should the index be? What should the tag be?
With 2-byte blocks there is 1 offset bit, no index bits (there is only one "set" to search), and the remaining 3 bits are all tag.
Simulation #4: 8-byte, FA Cache

Address: tag|offset (XXXX): 3 tag bits, 1 offset bit. Lookup: check all tags and valid bits; an LRU pointer picks the entry to fill on a miss. Start: all entries invalid.

load 1100 → Miss. Fill entry 0: tag 110, data N | O.
load 1101 → Hit! Same block, offset 1 → O.
load 0100 → Miss. Fill entry 1 (the LRU): tag 010, data E | F.
load 1100 → Hit! Tag 110 is still resident in entry 0.

Final state:

    V tag data | V tag data | V tag data | V tag data
    1 110 N|O  | 1 010 E|F  | 0 xxx X|X  | 0 xxx X|X

2 hits, 2 misses: full associativity removed the conflict.
Pros and Cons of Full Associativity

+ No more conflicts!
+ Excellent utilization!
But either:
  Parallel Reads – lots of reading!
  Serial Reads – lots of waiting

tavg = thit + %miss × tmiss
    Direct mapped:      4 + 5% × 100 = 9 cycles
    Fully associative:  6 + 3% × 100 = 9 cycles
Pros & Cons

                          Direct Mapped   Fully Associative
    Tag Size              Smaller         Larger
    SRAM Overhead         Less            More
    Controller Logic      Less            More
    Speed                 Faster          Slower
    Price                 Less            More
    Scalability           Very            Not Very
    # of conflict misses  Lots            Zero
    Hit Rate              Low             High
    Pathological Cases    Common          ?
Reducing Conflict Misses
with Set-Associative Caches
Not too conflict-y. Not too slow.

… Just Right!

62
8 byte, 2-way set associative Cache

Address: tag|index|offset (XXXX).

CACHE (2 sets, 2 ways each):

    index  V tag data | V tag data
    0      0 xx  E|F  | 0 xx  N|O
    1      0 xx  C|D  | 0 xx  P|Q

What should the offset be? What should the index be? What should the tag be?
With 2-byte blocks: 1 offset bit; 2 sets → 1 index bit; the remaining 2 bits are the tag.
Clicker Question

5 bit address: XXXXX
2 byte block size
24 byte, 3-Way Set Associative CACHE

    index  V tag data | V tag data  | V tag data
    00     0 ?  X|Y   | 0 ?  X'|Y'  | 0 ?  X''|Y''
    01     0 ?  X|Y   | 0 ?  X'|Y'  | 0 ?  X''|Y''
    10     0 ?  X|Y   | 0 ?  X'|Y'  | 0 ?  X''|Y''
    11     0 ?  X|Y   | 0 ?  X'|Y'  | 0 ?  X''|Y''

How many tag bits?
A) 0  B) 1  C) 2  D) 3  E) 4

(Answer: C. 1 offset bit and 2 index bits leave 5 − 3 = 2 tag bits.)
8 byte, 2-way set associative Cache

Address: tag|index|offset (XXXX): 2 tag bits, 1 index bit, 1 offset bit. Lookup: index into $, check both ways' tags and valid bits; an LRU pointer per set picks the way to fill on a miss. Start: all entries invalid.

load 1100 → Miss. Set 0, way 0: tag 11, data N | O.
load 1101 → Hit! Same block, offset 1 → O.
load 0100 → Miss. Set 0, way 1 (the LRU way): tag 01, data E | F.
load 1100 → Hit! Tag 11 is still in way 0 of set 0.

Final state:

    index  V tag data | V tag data
    0      1 11  N|O  | 1 01  E|F
    1      0 xx  X|X  | 0 xx  X|X

2 hits, 2 misses: the same result as fully associative here, with only a 2-way search.
Eviction Policies

Which cache line should be evicted from the cache to make room for a new line?
• Direct-mapped: no choice, must evict the line selected by the index
• Associative caches:
  • Random: select one of the lines at random
  • Round-Robin: similar to random
  • FIFO: replace the oldest line
  • LRU: replace the line that has not been used in the longest time
Misses: the Three C's

• Cold (compulsory) Miss: never seen this address before
• Conflict Miss: cache associativity is too low
• Capacity Miss: cache is too small
Miss Rate vs. Block Size

[Figure: miss rate as a function of block size.]
Block Size Tradeoffs

• For a given total cache size, larger block sizes mean…
  – fewer lines
  – so fewer tags, less overhead
  – and fewer cold misses (within-block "prefetching")
• But also…
  – fewer blocks available (for scattered accesses!)
  – so more conflicts
  – can decrease performance if the working set can't fit in $
  – and larger miss penalty (time to fetch block)
Miss Rate vs. Associativity

[Figure: miss rate as a function of associativity.]
Clicker Question

What does NOT happen when you increase the associativity of the cache?

A) Conflict misses decrease
B) Tag overhead decreases
C) Hit time increases
D) Cache stays the same size

(Answer: B. More ways means fewer sets, so fewer index bits and larger tags: tag overhead increases.)
ABCs of Caches

tavg = thit + %miss × tmiss
+ Associativity: ↓ conflict misses, ↑ hit time
+ Block Size: ↓ cold misses, ↑ conflict misses
+ Capacity: ↓ capacity misses, ↑ hit time
Which caches get what properties?

tavg = thit + %miss × tmiss

Fast, small L1 caches: designed with speed in mind.
Big L2 and L3 caches: designed with miss rate in mind, so more associative, bigger block sizes, larger capacity.
Roadmap
• Things we have covered:
– The Need for Speed
– Locality to the Rescue!
– Calculating average memory access time
– $ Misses: Cold, Conflict, Capacity
– $ Characteristics: Associativity, Block Size, Capacity
• Things we will now cover:
– Cache Figures
– Cache Performance Examples
– Writes
79
2-Way Set Associative Cache (Reading)

Address: Tag | Index | Offset

[Figure: the index selects one set; each of the two ways holds V, Tag, and Data. Both stored tags are compared (=) against the address tag; line select picks the matching 64-byte line, word select picks the 32-bit word, producing "hit?" and "data".]
3-Way Set Associative Cache (Reading)

Address: Tag | Index | Offset

[Figure: the same structure as the 2-way cache, but with three ways and three tag comparators; line select over 64-byte lines, word select over 32-bit words, producing "hit?" and "data".]
How Big is the Cache?

Address: Tag | Index | Offset
n bit index, m bit offset, N-way Set Associative

Question: How big is the cache?
• Data only? (what we usually mean when we ask "how big" the cache is)
• Data + overhead?
How Big is the Cache?

n bit index, m bit offset, N-way set associative.
• Data only:
  Cache of 2^n sets, block size of 2^m bytes, N ways per set.
  Cache Size = 2^m bytes-per-block × (2^n sets × N-way-per-set)
             = N × 2^(n+m) bytes
How Big is the Cache?

n bit index, m bit offset, N-way set associative.
• Data + overhead:
  Cache of 2^n sets, block size of 2^m bytes, N ways per set.
  Tag field: 32 − (n + m) bits; valid bit: 1.
  SRAM Size = 2^n sets × N-way-per-set × (block size + tag size + valid bit size)
            = 2^n × N × (2^m bytes × 8 bits-per-byte + (32 − n − m) + 1) bits
Performance Calculation with $ Hierarchy

tavg = thit + %miss × tmiss
• Parameters
  – Reference stream: all loads
  – D$: thit = 1 ns, %miss = 5%
  – L2: thit = 10 ns, %miss = 20% (local miss rate)
  – Main memory: thit = 50 ns
• What is tavgD$ without an L2?
  – tmissD$ =
  – tavgD$ =
• What is tavgD$ with an L2?
  – tmissD$ =
  – tavgL2 =
  – tavgD$ =
Performance Calculation with $ Hierarchy

tavg = thit + %miss × tmiss
• Parameters: D$: thit = 1 ns, %miss = 5%; L2: thit = 10 ns, %miss = 20% (local); main memory: thit = 50 ns.
• Without an L2:
  – tmissD$ = thitM
  – tavgD$ = thitD$ + %missD$ × thitM = 1 ns + (0.05 × 50 ns) = 3.5 ns
• With an L2:
  – tmissD$ = tavgL2
  – tavgL2 = thitL2 + %missL2 × thitM = 10 ns + (0.2 × 50 ns) = 20 ns
  – tavgD$ = thitD$ + %missD$ × tavgL2 = 1 ns + (0.05 × 20 ns) = 2 ns
Performance Summary

Average memory access time (AMAT) depends on:
• cache architecture and size
• hit and miss rates
• access times and miss penalty

Cache design is a very complex problem:
• Cache size, block size (aka line size)
• Number of ways of set-associativity (1, N, ∞)
• Eviction policy
• Number of levels of caching, parameters for each
• Separate I-cache from D-cache, or unified cache
• Prefetching policies / instructions
• Write policy
Takeaway

Direct Mapped → fast, but low hit rate
Fully Associative → higher hit cost, higher hit rate
Set Associative → middle ground

Line size matters. Larger cache lines can increase performance due to prefetching. BUT, they can also decrease performance if the working set cannot fit in the cache.

Cache performance is measured by the average memory access time (AMAT), which depends on cache architecture and size, but also on the hit access time, miss penalty, and hit rate.
What about Stores?

We want to write to the cache.

If the data is not in the cache? Bring it in. (Write-allocate policy)

Should we also update memory?
• Yes: write-through policy
• No: write-back policy
Write-Through Cache

16 byte, byte-addressed memory.
4 byte, fully-associative cache: 2-byte blocks, write-allocate.
4 bit addresses: 3 bit tag, 1 bit offset. Two lines, each with an lru bit, V, tag, and data.

Memory (addr: value): 0: 78, 1: 29, 2: 120, 3: 123, 4: 71, 5: 150, 6: 162, 7: 173, 8: 18, 9: 21, 10: 33, 11: 28, 12: 19, 13: 200, 14: 210, 15: 225

Instructions:
    LB $1 ← M[ 1 ]
    LB $2 ← M[ 7 ]
    SB $2 → M[ 0 ]
    SB $1 → M[ 5 ]
    LB $2 ← M[ 10 ]
    SB $1 → M[ 5 ]
    SB $1 → M[ 10 ]

Starting counts: Misses 0, Hits 0, Reads 0, Writes 0.
Write-Through (REFs 1-7)

REF 1: LB $1 ← M[1] → Miss. Read block {M[0], M[1]} = {78, 29} into line 0 (tag 000); $1 = 29. (Misses 1, Reads 2)
REF 2: LB $2 ← M[7] → Miss. Read block {M[6], M[7]} = {162, 173} into line 1 (tag 011); $2 = 173. (Misses 2, Reads 4)
REF 3: SB $2 → M[0] (clicker: HIT or MISS?) → Hit. Update the cached byte 0 to 173 and write through: M[0] = 173. (Hits 1, Writes 1)
REF 4: SB $1 → M[5] → Miss. Write-allocate: read block {M[4], M[5]} = {71, 150} into line 1 (the LRU line, evicting block 6-7), set byte 5 to 29, write through: M[5] = 29. (Misses 3, Reads 6, Writes 2)
REF 5: LB $2 ← M[10] (clicker: HIT or MISS?) → Miss. Read block {M[10], M[11]} = {33, 28} into line 0 (evicting block 0-1); $2 = 33. (Misses 4, Reads 8)
REF 6: SB $1 → M[5] → Hit. Update cached byte 5 (still 29) and write through: M[5] = 29. (Hits 2, Writes 3)
REF 7: SB $1 → M[10] → Hit. Update cached byte 10 to 29 and write through: M[10] = 29. (Hits 3, Writes 4)

Final: Misses 4, Hits 3, Memory reads 8, Memory writes 4.
Summary: Write Through

Write-through policy with write allocate:
• Cache miss: read the entire block from memory
• Write: write only the updated item to memory
• Eviction: no need to write to memory
Next Goal: Write-Through vs. Write-Back

What if we DON'T want to write stores immediately to memory?
– Keep the current copy in the cache, and update memory when the data is evicted (write-back policy)
– Write back all evicted lines?
  • No, only written-to blocks
Write-Back Meta-Data (Valid, Dirty Bits)

    V  D  Tag  Byte 1  Byte 2  …  Byte N

• V = 1 means the line has valid data
• D = 1 means the bytes are newer than main memory
• When allocating a line: set V = 1, D = 0, fill in Tag and Data
• When writing a line: set D = 1
• When evicting a line:
  – If D = 0: just set V = 0
  – If D = 1: write back Data, then set D = 0, V = 0
Write-back Example

• Example: How does a write-back cache work?


• Assume write-allocate
Handling Stores (Write-Back)

16 byte, byte-addressed memory.
4 byte, fully-associative cache: 2-byte blocks, write-allocate.
4 bit addresses: 3 bit tag, 1 bit offset. Two lines, each with an lru bit, V, d (dirty), tag, and data.

Memory contents and instruction stream are the same as in the write-through example.

Starting counts: Misses 0, Hits 0, Reads 0, Writes 0.
Write-Back (REFs 1-3)

REF 1: LB $1 ← M[1] → Miss. Read block {M[0], M[1]} = {78, 29} into line 0 (tag 000, D = 0); $1 = 29. (Misses 1, Reads 2)
REF 2: LB $2 ← M[7] → Miss. Read block {M[6], M[7]} = {162, 173} into line 1 (tag 011, D = 0); $2 = 173. (Misses 2, Reads 4)
REF 3: SB $2 → M[0] → Hit. Update the cached byte 0 to 173 and set D = 1 on line 0. No memory write yet: M[0] still holds 78. (Hits 1, Writes 0)

State entering REF 4 (SB $1 → M[5]): line 0 = (V=1, D=1, tag 000, {173, 29}), line 1 = (V=1, D=0, tag 011, {162, 173}); Misses 2, Hits 1, Reads 4, Writes 0.
29 6 162
SB $1  M[ 5 ] 1 1 0 011 162 7 173
SB $1  M[ 10 ] 173 8 18
9 21
Register File Cache 10 33
$0 11 28
29 Misses: 2 12 19
$1
$2 173 Hits: 1 13 200
$3 Reads: 4 14 210
Writes: 0 15 225
Write-Back (REF 4)

Memory
Instructions:
0 78
LB $1  M[ 1 ] M 1 29
LB $2  M[ 7 ] M 2 120
SB $2  M[ 0 ] Hit 3 123
SB $1  M[ 5 ] lru V d tag data 4 71
LB $2  M[ 10 ] 0 1 1 000 173 5 150
29 6 162
SB $1  M[ 5 ] 1 1 0 011 162 7 173
SB $1  M[ 10 ] 173 8 18
9 21
Register File Cache 10 33
$0 11 28
29 Misses: 3 12 19
$1
$2 173 Hits: 1 13 200
$3 Reads: 6 14 210
Writes: 0 15 225
Write-Back (REF 4)

Memory
Instructions:
0 78
LB $1  M[ 1 ] M 1 29
LB $2  M[ 7 ] M 2 120
SB $2  M[ 0 ] Hit 3 123
SB $1  M[ 5 ] M lru V d tag data 4 71
LB $2  M[ 10 ] 1 1 1 000 173 5 150
29 6 162
SB $1  M[ 5 ] 0 1 1 010 71 7 173
SB $1  M[ 10 ] 150
29 8 18
9 21
Register File Cache 10 33
$0 11 28
29 Misses: 3 12 19
$1
$2 173 Hits: 1 13 200
$3 Reads: 6 14 210
Writes: 0 15 225
Write-Back (REF 5)

Memory
Instructions:
0 78
LB $1  M[ 1 ] M 1 29
LB $2  M[ 7 ] M 2 120
SB $2  M[ 0 ] Hit 3 123
SB $1  M[ 5 ] M lru V d tag data 4 71
LB $2  M[ 10 ] 1 1 1 000 173 5 150
29 6 162
SB $1  M[ 5 ] 0 1 1 010 71 7 173
SB $1  M[ 10 ] 29 8 18
9 21
Register File Cache 10 33
$0 11 28
29 Misses: 3 12 19
$1
$2 173 Hits: 1 13 200
$3 Reads: 6 14 210
Writes: 0 15 225
Write-Back (REF 5)

Eviction, WB dirty block


Memory
Instructions:
0 173
78
LB $1  M[ 1 ] M 1 29
LB $2  M[ 7 ] M 2 120
SB $2  M[ 0 ] Hit 3 123
SB $1  M[ 5 ] M lru V d tag data 4 71
LB $2  M[ 10 ] 1 1 1 000 173 5 150
29 6 162
SB $1  M[ 5 ] 0 1 1 010 71 7 173
SB $1  M[ 10 ] 29 8 18
9 21
Register File Cache 10 33
$0 11 28
29 Misses: 3 12 19
$1
$2 173 Hits: 1 13 200
$3 Reads: 6 14 210
Writes: 2 15 225
Write-Back (REF 5)

Memory
Instructions:
0 173
LB $1  M[ 1 ] M 1 29
LB $2  M[ 7 ] M 2 120
SB $2  M[ 0 ] Hit 3 123
SB $1  M[ 5 ] M lru V d tag data 4 71
LB $2  M[ 10 ] M 0 1 0 101 33 5 150
28 6 162
SB $1  M[ 5 ] 1 1 1 010 71 7 173
SB $1  M[ 10 ] 29 8 18
9 21
Register File Cache 10 33
$0 11 28
29 Misses: 4 12 19
$1
$2 33 Hits: 1 13 200
$3 Reads: 8 14 210
Writes: 2 15 225
Write-Back (REF 6)

CLICKER:
(A) HIT Memory
Instructions:
LB $1  M[ 1 ] M (B) MISS 0 173
1 29
LB $2  M[ 7 ] M 2 120
SB $2  M[ 0 ] Hit 3 123
SB $1  M[ 5 ] M lru V d tag data 4 71
LB $2  M[ 10 ] M 0 1 0 101 33 5 150
28 6 162
SB $1  M[ 5 ] 1 1 1 010 71 7 173
SB $1  M[ 10 ] 29 8 18
9 21
Register File Cache 10 33
$0 11 28
29 Misses: 4 12 19
$1
$2 33 Hits: 1 13 200
$3 Reads: 8 14 210
Writes: 2 15 225
Write-Back (REF 6)

Memory
Instructions:
0 173
LB $1  M[ 1 ] M 1 29
LB $2  M[ 7 ] M 2 120
SB $2  M[ 0 ] Hit 3 123
SB $1  M[ 5 ] M lru V d tag data 4 71
LB $2  M[ 10 ] M 1 1 0 101 33 5 150
28 6 162
SB $1  M[ 5 ] Hit 0 1 1 010 71 7 173
SB $1  M[ 10 ] 29 8 18
9 21
Register File Cache 10 33
$0 11 28
29 Misses: 4 12 19
$1
$2 33 Hits: 2 13 200
$3 Reads: 8 14 210
Writes: 2 15 225
Write-Back (REF 7)

Memory
Instructions:
0 173
LB $1  M[ 1 ] M 1 29
LB $2  M[ 7 ] M 2 120
SB $2  M[ 0 ] Hit 3 123
SB $1  M[ 5 ] M lru V d tag data 4 71
LB $2  M[ 10 ] M 1 1 0 101 33 5 150
28 6 162
SB $1  M[ 5 ] Hit 0 1 1 010 71 7 173
SB $1  M[ 10 ] 29 8 18
9 21
Register File Cache 10 33
$0 11 28
29 Misses: 4 12 19
$1
$2 33 Hits: 2 13 200
$3 Reads: 8 14 210
Writes: 2 15 225
Write-Back (REF 7)

Memory
Instructions:
0 173
LB $1  M[ 1 ] M 1 29
LB $2  M[ 7 ] M 2 120
SB $2  M[ 0 ] Hit 3 123
SB $1  M[ 5 ] M lru V d tag data 4 71
LB $2  M[ 10 ] M 0 1 1 101 29 5 150
28 6 162
SB $1  M[ 5 ] Hit 1 1 1 010 71 7 173
SB $1  M[ 10 ] Hit 29 8 18
9 21
Register File Cache 10 33
$0 11 28
29 Misses: 4 12 19
$1
$2 33 Hits: 3 13 200
$3 Reads: 8 14 210
Writes: 2 15 225
Write-Back (REF 8,9)

Cheap subsequent updates!


M
M Memory
Instructions:
0 173
... Hit
1 29
SB $1  M[ 5 ] M 2 120
LB $2  M[ 10 ] M 3 123
SB $1  M[ 5 ] Hit lru V d tag data 4 71
SB $1  M[ 10 ] Hit 0 1 1 101 29 5 150
28 6 162
SB $1  M[ 5 ] 1 1 1 010 71 7 173
SB $1  M[ 10 ] 29 8 18
9 21
Register File Cache 10 33
$0 11 28
29 Misses: 4 12 19
$1
$2 33 Hits: 3 13 200
$3 Reads: 8 14 210
Writes: 2 15 225
Write-Back (REF 8,9)

M
M Memory
Instructions:
0 173
... Hit
1 29
SB $1  M[ 5 ] M 2 120
LB $2  M[ 10 ] M 3 123
SB $1  M[ 5 ] Hit lru V d tag data 4 71
SB $1  M[ 10 ] Hit 0 1 1 101 29 5 150
28 6 162
SB $1  M[ 5 ] Hit
1 1 1 010 71 7 173
SB $1  M[ 10 ] Hit 29 8 18
9 21
Register File Cache 10 33
$0 11 28
29 Misses: 4 12 19
$1
$2 33 Hits: 3 13 200
$3 Reads: 8 14 210
Writes: 2 15 225
How Many Memory References?

Write-back performance
• How many reads?
– Each miss (read or write) reads a block from mem
– 4 misses  8 mem reads
• How many writes?
– Some evictions write a block to mem
– 1 dirty eviction  2 mem writes
– (+ 2 dirty evictions later  +4 mem writes)
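These counts can be checked by swapping the store path: a write-back version of the same two-line simulator (a sketch; names are ours). Stores only set the dirty bit, and memory is written only when a dirty victim is evicted.

```c
#include <assert.h>

#define LINES 2
#define BLOCK 2  /* bytes per block */

typedef struct { int valid, dirty, tag, lru; unsigned char data[BLOCK]; } Line;

static unsigned char mem[16];
static Line cache[LINES];
static int misses, hits, mem_reads, mem_writes;

static int lookup(int tag) {
    for (int i = 0; i < LINES; i++)
        if (cache[i].valid && cache[i].tag == tag) return i;
    return -1;
}

static void touch(int i) {            /* mark line i most recently used */
    for (int j = 0; j < LINES; j++) cache[j].lru++;
    cache[i].lru = 0;
}

/* Miss: evict the LRU line, writing it back only if dirty,
 * then fill from memory (write-allocate). */
static int fill(int addr) {
    int v = 0;
    for (int i = 1; i < LINES; i++)
        if (cache[i].lru > cache[v].lru) v = i;
    if (cache[v].valid && cache[v].dirty)
        for (int b = 0; b < BLOCK; b++) {
            mem[cache[v].tag * BLOCK + b] = cache[v].data[b];
            mem_writes++;             /* dirty eviction: write whole block */
        }
    cache[v].valid = 1; cache[v].dirty = 0; cache[v].tag = addr / BLOCK;
    for (int b = 0; b < BLOCK; b++) {
        cache[v].data[b] = mem[(addr / BLOCK) * BLOCK + b];
        mem_reads++;
    }
    return v;
}

unsigned char load(int addr) {
    int i = lookup(addr / BLOCK);
    if (i < 0) { misses++; i = fill(addr); } else hits++;
    touch(i);
    return cache[i].data[addr % BLOCK];
}

void store(int addr, unsigned char val) {
    int i = lookup(addr / BLOCK);
    if (i < 0) { misses++; i = fill(addr); } else hits++;
    touch(i);
    cache[i].data[addr % BLOCK] = val;
    cache[i].dirty = 1;               /* memory updated only at eviction */
}
```

On the slides' stream this ends with Misses: 4, Hits: 3, Reads: 8, Writes: 2: same misses as write-through, but only the one dirty eviction (REF 5) reaches memory on the write side.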
Write-back vs. Write-through Example
Assume: large associative cache, 16-byte lines
N 4-byte words

for (i=1; i<n; i++) Write-thru: n/4 reads


A[0] += A[i]; n writes
Write-back: n/4 reads
1 write
for (i=0; i<n; i++)
B[i] = A[i] Write-thru: 2 x n/4 reads
n writes
Write-back: 2 x n/4 reads
n/4 writes
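The per-policy traffic above reduces to simple formulas. A sketch (helper names are ours) for the first loop, assuming 16-byte lines and 4-byte words as stated:

```c
#include <assert.h>

/* Memory traffic for:  for (i = 1; i < n; i++) A[0] += A[i];
 * with 16-byte lines holding four 4-byte words, in a cache
 * large enough that each line of A is fetched only once. */
typedef struct { int reads, writes; } Traffic;

Traffic write_thru(int n) {
    /* n/4 line fills to stream A in; every store to A[0]
     * goes straight to memory (~n writes, as on the slide) */
    Traffic t = { n / 4, n };
    return t;
}

Traffic write_back(int n) {
    /* same n/4 line fills; only A[0]'s line ever gets dirty,
     * so a single write-back at eviction */
    Traffic t = { n / 4, 1 };
    return t;
}
```

For n = 64 that is 16 reads either way, but 64 writes versus 1: the gap the "So is write back just better?" slide is pointing at.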
So is write back just better?

Short Answer: Yes (fewer writes is a good thing)


Long Answer: It’s complicated.
• Evictions require entire line be written back to
memory (vs. just the data that was written)
• Write-back can lead to incoherent caches on
multi-core processors (later lecture)
Optimization: Write Buffering

• Q: Writes to main memory are slow!


• A: Use a write-back buffer
– A small queue holding dirty lines
– Add to end upon eviction
– Remove from front upon completion
• Q: When does it help?
• A: short bursts of writes (but not sustained writes)
• A: fast eviction reduces miss penalty
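The write-back buffer described here is just a small FIFO between the cache and memory. A minimal sketch (sizes and names are our assumptions):

```c
#include <assert.h>

#define WB_DEPTH 4             /* small queue of dirty lines */
#define BLOCK    16

typedef struct { int addr; unsigned char data[BLOCK]; } DirtyLine;

static DirtyLine buf[WB_DEPTH];
static int head, tail, count;

/* On eviction: add the dirty line to the end. Returns 0 if the
 * buffer is full, in which case the cache must stall until memory
 * drains an entry. */
int wb_push(DirtyLine line) {
    if (count == WB_DEPTH) return 0;
    buf[tail] = line;
    tail = (tail + 1) % WB_DEPTH;
    count++;
    return 1;
}

/* When the memory write completes: remove from the front. */
int wb_pop(DirtyLine *out) {
    if (count == 0) return 0;
    *out = buf[head];
    head = (head + 1) % WB_DEPTH;
    count--;
    return 1;
}
```

Eviction becomes a quick enqueue instead of a slow memory write, which is why the buffer absorbs short bursts but not a sustained stream of writes.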
Write-through vs. Write-back

• Write-through is slower
– But simpler (memory always consistent)

• Write-back is almost always faster


– write-back buffer hides large eviction cost
– But what about multiple cores with separate caches but
sharing memory?
• Write-back requires a cache coherency protocol
– Inconsistent views of memory
– Need to “snoop” in each other’s caches
– Extremely complex protocols, very hard to get right
Cache-coherency
• Q: Multiple readers and writers?
A: Potentially inconsistent views of memory
[Figure: four CPUs, each with private L1 caches, above a shared L2,
memory, disk, and network; stale copies of a value A (A, A') can
exist in several caches at once]

Cache coherency protocol


• May need to snoop on other CPU’s cache activity
• Invalidate cache line when other CPU writes
• Flush write-back caches before other CPU reads
• Or the reverse: Before writing/reading…
• Extremely complex protocols, very hard to get right
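The snoop/invalidate/flush bullets can be made concrete with a toy two-CPU model (our own sketch, far simpler than a real MSI/MESI protocol, which a later lecture covers):

```c
#include <assert.h>

#define NCPU 2

/* One line's copy in each CPU's private cache. */
typedef struct { int valid; int value; } Copy;

static Copy priv[NCPU];
static int mem_value;

/* CPU i writes: update its own copy, and invalidate every other
 * copy (the "snoop" on the shared bus). */
void cpu_write(int i, int v) {
    priv[i].valid = 1;
    priv[i].value = v;
    for (int j = 0; j < NCPU; j++)
        if (j != i) priv[j].valid = 0;
}

/* CPU i reads: on a miss, flush any other cache's valid copy to
 * memory first, then read the up-to-date value. */
int cpu_read(int i) {
    if (!priv[i].valid) {
        for (int j = 0; j < NCPU; j++)
            if (priv[j].valid) mem_value = priv[j].value;  /* flush */
        priv[i].valid = 1;
        priv[i].value = mem_value;
    }
    return priv[i].value;
}
```

Even this toy shows the cost: every write broadcasts an invalidation, and every read may force another cache to flush first.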
Takeaway
• Write-through policy with write allocate
• Cache miss: read entire block from memory
• Write: write only updated item to memory
• Eviction: no need to write to memory
• Slower, but cleaner

• Write-back policy with write allocate


• Cache miss: read entire block from memory
  **But may need to write back a dirty cache line first**
– Write: nothing to memory (just set the dirty bit)
– Eviction: must write the entire cache line back to memory,
because the single dirty bit can't say which bytes changed
– Faster, but more complicated, especially with multicore
Cache Conscious Programming
// H = 6, W = 10
int A[H][W];

for(x=0; x < W; x++)
  for(y=0; y < H; y++)
    sum += A[y][x];

[Figure: the 6×10 matrix is stored row-by-row in memory,
but this loop walks it column-by-column]

Every access a cache miss!
(unless entire matrix fits in cache)
Cache Conscious Programming
// H = 6, W = 10
int A[H][W];

for(x=0; x < H; x++)
  for(y=0; y < W; y++)
    sum += A[x][y];

[Figure: this loop walks the matrix row-by-row,
matching its layout in memory]

• Block size = 4 → 75% hit rate
• Block size = 8 → 87.5% hit rate
• Block size = 16 → 93.75% hit rate
• And you can easily prefetch to warm the cache
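The hit rates above can be reproduced with a one-block cache model (a sketch under stated assumptions: 4 array elements per block, row-major storage, only the most recently touched block cached):

```c
#include <assert.h>

#define H 6
#define W 10
#define BLK 4   /* array elements per cache block */

/* Misses for the good loop order (sum += A[x][y]):
 * count a miss whenever the element lands in a block other
 * than the last one touched. */
int misses_row_major(void) {
    int last = -1, misses = 0;
    for (int x = 0; x < H; x++)
        for (int y = 0; y < W; y++) {
            int block = (x * W + y) / BLK;  /* A is row-major */
            if (block != last) { misses++; last = block; }
        }
    return misses;
}

/* Misses for the bad loop order (sum += A[y][x]). */
int misses_col_major(void) {
    int last = -1, misses = 0;
    for (int x = 0; x < W; x++)
        for (int y = 0; y < H; y++) {
            int block = (y * W + x) / BLK;
            if (block != last) { misses++; last = block; }
        }
    return misses;
}
```

For the 6×10 matrix this gives 15 misses out of 60 accesses (the 75% hit rate on the slide) in row-major order, versus 60 misses (every access) in column-major order.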
Clicker Question
Choose the best block size for your cache among the
choices given. Assume that integers and pointers are all 4
bytes each and that the scores array is 4-byte aligned.
(a) 1 byte (b) 4 bytes (c) 8 bytes (d) 16 bytes (e) 32 bytes

int scores[NUM_STUDENTS] = {0};
int sum = 0;
for (i = 0; i < NUM_STUDENTS; i++) {
  sum += scores[i];
}

137
Clicker Question
Choose the best block size for your cache among the
choices given. Assume integers and pointers are 4 bytes.
(a) 1 byte (b) 4 bytes (c) 8 bytes (d) 16 bytes (e) 32 bytes
typedef struct item_t {
int value;
struct item_t *next;
char *name;
} item_t;

int sum = 0;
item_t *curr = list_head;
while (curr != NULL) {
sum += curr->value;
curr = curr->next;
} 139
By the end of the cache lectures…

A Real Example: Microsoft Surfacebook
Dual core, Intel i7-6600 CPU @ 2.6 GHz (purchased in 2016)

> dmidecode -t cache
Cache Information
  Socket Designation: L1 Cache
  Configuration: Enabled, Not Socketed, Level 1
  Operational Mode: Write Back
  Location: Internal
  Installed Size: 128 kB
  Maximum Size: 128 kB
  Supported SRAM Types: Synchronous
  Installed SRAM Type: Synchronous
  Speed: Unknown
  Error Correction Type: Parity
  System Type: Unified
  Associativity: 8-way Set-associative
Cache Information
  Socket Designation: L2 Cache
  Configuration: Enabled, Not Socketed, Level 2
  Operational Mode: Write Back
  Location: Internal
  Installed Size: 512 kB
  Maximum Size: 512 kB
  Supported SRAM Types: Synchronous
  Installed SRAM Type: Synchronous
  Speed: Unknown
  Error Correction Type: Single-bit ECC
  System Type: Unified
  Associativity: 4-way Set-associative
Cache Information
  Socket Designation: L3 Cache
  Configuration: Enabled, Not Socketed, Level 3
  Operational Mode: Write Back
  Location: Internal
  Installed Size: 4096 kB
  Maximum Size: 4096 kB
  Supported SRAM Types: Synchronous
  Installed SRAM Type: Synchronous
  Speed: Unknown
  Error Correction Type: Multi-bit ECC
  System Type: Unified
  Associativity: 16-way Set-associative
A Real Example
Dual-core 3.16GHz Intel (purchased in 2011)

> sudo dmidecode -t cache
Cache Information
  Configuration: Enabled, Not Socketed, Level 1
  Operational Mode: Write Back
  Installed Size: 128 KB
  Error Correction Type: None
Cache Information
  Configuration: Enabled, Not Socketed, Level 2
  Operational Mode: Varies With Memory Address
  Installed Size: 6144 KB
  Error Correction Type: Single-bit ECC

> cd /sys/devices/system/cpu/cpu0; grep cache/*/*
cache/index0/level:1
cache/index0/type:Data
cache/index0/ways_of_associativity:8
cache/index0/number_of_sets:64
cache/index0/coherency_line_size:64
cache/index0/size:32K
cache/index1/level:1
cache/index1/type:Instruction
cache/index1/ways_of_associativity:8
cache/index1/number_of_sets:64
cache/index1/coherency_line_size:64
cache/index1/size:32K
cache/index2/level:2
cache/index2/type:Unified
cache/index2/shared_cpu_list:0-1
cache/index2/ways_of_associativity:24
cache/index2/number_of_sets:4096
cache/index2/coherency_line_size:64
cache/index2/size:6144K
A Real Example
Dual-core 3.16GHz Intel (purchased in 2009)

• Dual 32K L1 Instruction caches
  – 8-way set associative
  – 64 sets
  – 64 byte line size
• Dual 32K L1 Data caches
  – Same as above
• Single 6M L2 Unified cache
  – 24-way set associative (!!!)
  – 4096 sets
  – 64 byte line size
• 4GB Main memory
• 1TB Disk
Summary
• Memory performance matters!
– often more than CPU performance
– … because it is the bottleneck, and not improving much
– … because most programs move a LOT of data
• Design space is huge
– Gambling against program behavior
– Cuts across all layers:
users  programs  os  hardware
• NEXT: Multi-core processors are complicated
– Inconsistent views of memory
– Extremely complex protocols, very hard to get right