
Caches & Memory

CS 3410
Computer System Organization & Programming

These slides are the product of many rounds of teaching CS 3410 by Professors Weatherspoon, Bala, Bracy, and Sirer.
Programs 101

C Code:

    int main (int argc, char* argv[]) {
      int i;
      int m = n;
      int sum = 0;
      for (i = 1; i <= m; i++) {
        sum += i;
      }
      printf("...", n, sum);
    }

MIPS Assembly:

    main: addiu $sp,$sp,-48
          sw    $31,44($sp)
          sw    $fp,40($sp)
          move  $fp,$sp
          sw    $4,48($fp)
          sw    $5,52($fp)
          la    $2,n
          lw    $2,0($2)
          sw    $2,28($fp)
          sw    $0,32($fp)
          li    $2,1
          sw    $2,24($fp)
    $L2:  lw    $2,24($fp)

Load/Store Architectures:
• Read data from memory (put in registers)
• Manipulate it
• Store it back to memory

Note the instructions that read from or write to memory.
1 Cycle Per Stage: the Biggest Lie (So Far)

[Figure: the five-stage pipelined datapath: PC and +4 feeding the instruction memory, the register file (A, B), the ALU (D), and the data memory (din/dout, M), with the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline latches, control signals, immediate extend, jump/branch target computation, and the hazard-detect and forwarding units.]

Code is stored in memory (also data and the stack), so both Instruction Fetch and the Memory stage access memory.

Instruction Fetch | Instruction Decode | Execute | Memory | Write-Back


What's the problem?

CPU ↔ Main Memory

Main memory is:
+ big
– slow
– far away

[SandyBridge Motherboard, 2011; image: http://news.softpedia.com]
The Need for Speed

CPU Pipeline

Instruction speeds:
• add, sub, shift: 1 cycle
• mult: 3 cycles
• load/store: 100 cycles

Off-chip memory access takes 50-70 ns; a 2-3 GHz processor has a ~0.5 ns clock, so one memory access costs on the order of 100 cycles.
What's the solution?
Caches!

[Intel Pentium 3 die photo, 1999: the Level 1 Data $, Level 1 Insn $, and Level 2 $ are visible on chip.]


Aside

• Go back to 04-state and 05-memory and look at how registers, SRAM and DRAM are built.
What lucky data gets to go in these small, fast on-chip caches?
Locality Locality Locality

If you ask for something, you're likely to ask for:
• the same thing again soon → Temporal Locality
• something near that thing, soon → Spatial Locality

    total = 0;
    for (i = 0; i < n; i++)
      total += a[i];
    return total;
Clicker Questions

This highlights the temporal and spatial locality of data.

    1 total = 0;
    2 for (i = 0; i < n; i++) {
    3   n--;
    4   total += a[i];
    5 return total;

Q1: Which line of code exhibits good temporal locality?
Q2: Which line of code exhibits good spatial locality with the line after it?
A) 1  B) 2  C) 3  D) 4  E) 5
Your life is full of Locality

• Last Called
• Speed Dial
• Favorites
• Contacts
• Google/Facebook/email
The Memory Hierarchy

Small, Fast
    Registers     1 cycle,    128 bytes
    L1 Caches     4 cycles,   64 KB
    L2 Cache      12 cycles,  256 KB
    L3 Cache      36 cycles,  2-20 MB
    Main Memory   50-70 ns,   512 MB - 4 GB
    Disk          5-20 ms,    16 GB - 4 TB
Big, Slow

(Intel Haswell Processor, 2013)
Some Terminology

Cache hit
• data is in the cache
• thit: time it takes to access the cache
• Hit rate (%hit): # cache hits / # cache accesses

Cache miss
• data is not in the cache
• tmiss: time it takes to get the data from below the $
• Miss rate (%miss): # cache misses / # cache accesses

Cacheline or cacheblock (or simply line or block): the unit of data stored in the cache and transferred to and from the level below.
The Memory Hierarchy

Average access time:
    tavg = thit + %miss × tmiss
         = 4 + 5% × 100
         = 9 cycles

(Here, for the L1 of the hierarchy above: a 4-cycle hit time, with 5% of accesses missing and paying a 100-cycle penalty.)


Single Core Memory Hierarchy

ON CHIP: the processor's registers (Regs), the L1 caches (I$ and D$), and the L2 cache.
OFF CHIP: main memory, then disk.
Multi-Core Memory Hierarchy

ON CHIP: four processors, each with its own Regs, I$, D$, and L2; a single L3 shared by all cores.
OFF CHIP: shared main memory, then disk.
Memory Hierarchy by the Numbers

CPU clock rates ~0.33 ns - 2 ns (3 GHz - 500 MHz)

    Memory technology  Transistor count*              Access time  Access time in cycles  $ per GiB in 2012  Capacity
    SRAM (on chip)     6-8 transistors                0.5-2.5 ns   1-3 cycles             $4k                256 KB
    SRAM (off chip)                                   1.5-30 ns    5-15 cycles            $4k                32 MB
    DRAM               1 transistor (needs refresh)   50-70 ns     150-200 cycles         $10-$20            8 GB
    SSD (Flash)                                       5k-50k ns    Tens of thousands      $0.75-$1           512 GB
    Disk                                              5M-20M ns    Millions               $0.05-$0.1         4 TB

*Registers, D-flip-flops: 10-100s of registers
Basic Cache Design

Direct Mapped Caches

24
16 Byte Memory

MEMORY
    addr  data
    0000  A
    0001  B
    0010  C
    0011  D
    0100  E
    0101  F
    0110  G
    0111  H
    1000  J
    1001  K
    1010  L
    1011  M
    1100  N
    1101  O
    1110  P
    1111  Q

load 1100 → r1

• Byte-addressable memory
• 4 address bits → 16 bytes total
• b addr bits → 2^b bytes in memory
4-Byte, Direct Mapped Cache

Address: XXXX

CACHE
    index  data
    00     A      Cache entry = row = (cache) line = (cache) block
    01     B
    10     C
    11     D

Block Size: 1 byte

Direct mapped:
• Each address maps to 1 cache block
• 4 entries → 2 index bits (2^n entries → n bits)

Index with LSBs:
• Supports spatial locality
Analogy to a Spice Rack

Spice Rack (Cache) vs. Spice Wall (Memory): the rack holds a few spices in indexed slots; the wall holds all of them, A through Z.

• Compared to your spice wall, the rack is:
  – Smaller
  – Faster
  – More costly (per oz.)

http://www.bedbathandbeyond.com
Analogy to a Spice Rack

Now each rack slot carries a tag (e.g. "innamon" labeling the Cinnamon jar).

• How do you know what's in the jar?
• Need labels
Tag = Ultra-minimalist label
4-Byte, Direct Mapped Cache

Address: tag|index (XXXX)

CACHE
    index  tag  data
    00     00   A
    01     00   B
    10     00   C
    11     00   D

Tag: minimalist label/address
address = tag + index
4-Byte, Direct Mapped Cache

One last tweak: a valid bit per entry.

CACHE
    index  V  tag  data
    00     0  00   X
    01     0  00   X
    10     0  00   X
    11     0  00   X
Simulation #1 of a 4-byte, DM Cache

Address: tag|index (XXXX). Lookup:
• Index into $
• Check tag
• Check valid bit

Start: all entries invalid.

load 1100 → Miss. Index 00, tag 11: the entry is invalid, so fetch M[1100] = N from memory and fill:

    index  V  tag  data
    00     1  11   N
    01     0  xx   X
    10     0  xx   X
    11     0  xx   X

load 1100 (again, later) → Hit! Index 00 is valid and tag 11 matches. Awesome!
Block Diagram: 4-entry, direct mapped Cache

[Figure: the address 1101 splits into a 2-bit tag (11) and a 2-bit index (01). The index selects a row of the V/tag/data array; the stored tag is compared against the address tag and ANDed with the valid bit to produce "Hit!", while the 8-bit data (1010 0101) is read out.]

Great! Are we done?
Simulation #2: 4-byte, DM Cache

Clicker for each access: A) Hit  B) Miss

Lookup: index into $, check tag, check valid bit. Start: all entries invalid.

load 1100 → Miss (cold). Fill index 00 with tag 11, data N.
load 1101 → Miss (cold). Fill index 01 with tag 11, data O.
load 0100 → Miss (cold). Index 00, tag 01: tag mismatch, so evict N and fill index 00 with tag 01, data E.
load 1100 → Miss (conflict). Index 00 now holds tag 01 but we need tag 11: evict E, refill N.

Final state:

    index  V  tag  data
    00     1  11   N
    01     1  11   O
    10     0  xx   X
    11     0  xx   X

4 misses, 0 hits. Disappointed!
Reducing Cold Misses
by Increasing Block Size
Leveraging Spatial Locality

45
Increasing Block Size

Address: XXXX, with the least significant bit now a block offset.

CACHE
    index  V  tag  data
    00     0  x    A | B
    01     0  x    C | D
    10     0  x    E | F
    11     0  x    G | H

• Block Size: 2 bytes
• Block Offset: least significant bits indicate where you live in the block
• Which bits are the index? tag?
Simulation #3: 8-byte, DM Cache

Address: tag|index|offset (XXXX): 1 tag bit, 2 index bits, 1 offset bit. Lookup: index into $, check tag, check valid bit. Start: all entries invalid.

load 1100 → Miss (cold). Index 10, tag 1: fill the whole block, N | O.
load 1101 → Hit! Same block as 1100, offset 1 → O. Spatial locality pays off.
load 0100 → Miss (cold). Index 10, tag 0: mismatch, evict N | O, fill E | F.
load 1100 → Miss (conflict). Index 10 now holds tag 0, and we need tag 1 again.

Final state:

    index  V  tag  data
    00     0  x    X | X
    01     0  x    X | X
    10     1  0    E | F
    11     0  x    X | X

1 hit, 3 misses. 3 bytes don't fit in a 4-entry cache?
Removing Conflict Misses
with Fully-Associative Caches

54
8 byte, fully-associative Cache

Address: tag|offset (XXXX).

CACHE (any block can go in any of the 4 entries):

    V tag data | V tag data | V tag data | V tag data
    0 xxx X|X  | 0 xxx X|X  | 0 xxx X|X  | 0 xxx X|X

Clicker: What should the offset be? What should the index be? What should the tag be?
With 2-byte blocks there is 1 offset bit, no index bits (there is only one "set" to search), and the remaining 3 bits are all tag.
Simulation #4: 8-byte, FA Cache

Address: tag|offset (XXXX): 3 tag bits, 1 offset bit. Lookup: check all tags and valid bits; an LRU pointer picks the entry to fill on a miss. Start: all entries invalid.

load 1100 → Miss. Fill entry 0: tag 110, data N | O.
load 1101 → Hit! Same block, offset 1 → O.
load 0100 → Miss. Fill entry 1 (the LRU): tag 010, data E | F.
load 1100 → Hit! Tag 110 is still resident in entry 0.

Final state:

    V tag data | V tag data | V tag data | V tag data
    1 110 N|O  | 1 010 E|F  | 0 xxx X|X  | 0 xxx X|X

2 hits, 2 misses: full associativity removed the conflict.
Pros and Cons of Full Associativity

+ No more conflicts!
+ Excellent utilization!
But either:
  Parallel Reads – lots of reading!
  Serial Reads – lots of waiting

tavg = thit + %miss × tmiss
    Direct mapped:      4 + 5% × 100 = 9 cycles
    Fully associative:  6 + 3% × 100 = 9 cycles
Pros & Cons

                          Direct Mapped   Fully Associative
    Tag Size              Smaller         Larger
    SRAM Overhead         Less            More
    Controller Logic      Less            More
    Speed                 Faster          Slower
    Price                 Less            More
    Scalability           Very            Not Very
    # of conflict misses  Lots            Zero
    Hit Rate              Low             High
    Pathological Cases    Common          ?
Reducing Conflict Misses
with Set-Associative Caches
Not too conflict-y. Not too slow.

… Just Right!

62
8 byte, 2-way set associative Cache

Address: tag|index|offset (XXXX).

CACHE (2 sets, 2 ways each):

    index  V tag data | V tag data
    0      0 xx  E|F  | 0 xx  N|O
    1      0 xx  C|D  | 0 xx  P|Q

What should the offset be? What should the index be? What should the tag be?
With 2-byte blocks: 1 offset bit; 2 sets → 1 index bit; the remaining 2 bits are the tag.
Clicker Question

5 bit address: XXXXX
2 byte block size
24 byte, 3-Way Set Associative CACHE

    index  V tag data | V tag data  | V tag data
    00     0 ?  X|Y   | 0 ?  X'|Y'  | 0 ?  X''|Y''
    01     0 ?  X|Y   | 0 ?  X'|Y'  | 0 ?  X''|Y''
    10     0 ?  X|Y   | 0 ?  X'|Y'  | 0 ?  X''|Y''
    11     0 ?  X|Y   | 0 ?  X'|Y'  | 0 ?  X''|Y''

How many tag bits?
A) 0  B) 1  C) 2  D) 3  E) 4

(Answer: C. 1 offset bit and 2 index bits leave 5 − 3 = 2 tag bits.)
8 byte, 2-way set associative Cache

Address: tag|index|offset (XXXX): 2 tag bits, 1 index bit, 1 offset bit. Lookup: index into $, check both ways' tags and valid bits; an LRU pointer per set picks the way to fill on a miss. Start: all entries invalid.

load 1100 → Miss. Set 0, way 0: tag 11, data N | O.
load 1101 → Hit! Same block, offset 1 → O.
load 0100 → Miss. Set 0, way 1 (the LRU way): tag 01, data E | F.
load 1100 → Hit! Tag 11 is still in way 0 of set 0.

Final state:

    index  V tag data | V tag data
    0      1 11  N|O  | 1 01  E|F
    1      0 xx  X|X  | 0 xx  X|X

2 hits, 2 misses: the same result as fully associative here, with only a 2-way search.
Eviction Policies

Which cache line should be evicted from the cache to make room for a new line?
• Direct-mapped: no choice, must evict the line selected by the index
• Associative caches:
  • Random: select one of the lines at random
  • Round-Robin: similar to random
  • FIFO: replace the oldest line
  • LRU: replace the line that has not been used in the longest time
Misses: the Three C's

• Cold (compulsory) Miss: never seen this address before
• Conflict Miss: cache associativity is too low
• Capacity Miss: cache is too small
Miss Rate vs. Block Size

[Figure: miss rate as a function of block size.]
Block Size Tradeoffs

• For a given total cache size, larger block sizes mean…
  – fewer lines
  – so fewer tags, less overhead
  – and fewer cold misses (within-block "prefetching")
• But also…
  – fewer blocks available (for scattered accesses!)
  – so more conflicts
  – can decrease performance if the working set can't fit in $
  – and larger miss penalty (time to fetch block)
Miss Rate vs. Associativity

[Figure: miss rate as a function of associativity.]
Clicker Question

What does NOT happen when you increase the associativity of the cache?

A) Conflict misses decrease
B) Tag overhead decreases
C) Hit time increases
D) Cache stays the same size

(Answer: B. More ways means fewer sets, so fewer index bits and larger tags: tag overhead increases.)
ABCs of Caches

tavg = thit + %miss × tmiss
+ Associativity: ↓ conflict misses, ↑ hit time
+ Block Size: ↓ cold misses, ↑ conflict misses
+ Capacity: ↓ capacity misses, ↑ hit time
Which caches get what properties?

tavg = thit + %miss × tmiss

Fast, small L1 caches: designed with speed in mind.
Big L2 and L3 caches: designed with miss rate in mind, so more associative, bigger block sizes, larger capacity.
Roadmap
• Things we have covered:
– The Need for Speed
– Locality to the Rescue!
– Calculating average memory access time
– $ Misses: Cold, Conflict, Capacity
– $ Characteristics: Associativity, Block Size, Capacity
• Things we will now cover:
– Cache Figures
– Cache Performance Examples
– Writes
79
2-Way Set Associative Cache (Reading)

Address: Tag | Index | Offset

[Figure: the index selects one set; each of the two ways holds V, Tag, and Data. Both stored tags are compared (=) against the address tag; line select picks the matching 64-byte line, word select picks the 32-bit word, producing "hit?" and "data".]
3-Way Set Associative Cache (Reading)

Address: Tag | Index | Offset

[Figure: the same structure as the 2-way cache, but with three ways and three tag comparators; line select over 64-byte lines, word select over 32-bit words, producing "hit?" and "data".]
How Big is the Cache?

Address: Tag | Index | Offset
n bit index, m bit offset, N-way Set Associative

Question: How big is the cache?
• Data only? (what we usually mean when we ask "how big" the cache is)
• Data + overhead?
How Big is the Cache?

n bit index, m bit offset, N-way set associative.
• Data only:
  Cache of 2^n sets, block size of 2^m bytes, N ways per set.
  Cache Size = 2^m bytes-per-block × (2^n sets × N-way-per-set)
             = N × 2^(n+m) bytes
How Big is the Cache?

n bit index, m bit offset, N-way set associative.
• Data + overhead:
  Cache of 2^n sets, block size of 2^m bytes, N ways per set.
  Tag field: 32 − (n + m) bits; valid bit: 1.
  SRAM Size = 2^n sets × N-way-per-set × (block size + tag size + valid bit size)
            = 2^n × N × (2^m bytes × 8 bits-per-byte + (32 − n − m) + 1) bits
Performance Calculation with $ Hierarchy

tavg = thit + %miss × tmiss
• Parameters
  – Reference stream: all loads
  – D$: thit = 1 ns, %miss = 5%
  – L2: thit = 10 ns, %miss = 20% (local miss rate)
  – Main memory: thit = 50 ns
• What is tavgD$ without an L2?
  – tmissD$ =
  – tavgD$ =
• What is tavgD$ with an L2?
  – tmissD$ =
  – tavgL2 =
  – tavgD$ =
Performance Calculation with $ Hierarchy

tavg = thit + %miss × tmiss
• Parameters: D$: thit = 1 ns, %miss = 5%; L2: thit = 10 ns, %miss = 20% (local); main memory: thit = 50 ns.
• Without an L2:
  – tmissD$ = thitM
  – tavgD$ = thitD$ + %missD$ × thitM = 1 ns + (0.05 × 50 ns) = 3.5 ns
• With an L2:
  – tmissD$ = tavgL2
  – tavgL2 = thitL2 + %missL2 × thitM = 10 ns + (0.2 × 50 ns) = 20 ns
  – tavgD$ = thitD$ + %missD$ × tavgL2 = 1 ns + (0.05 × 20 ns) = 2 ns
Performance Summary

Average memory access time (AMAT) depends on:
• cache architecture and size
• hit and miss rates
• access times and miss penalty

Cache design is a very complex problem:
• Cache size, block size (aka line size)
• Number of ways of set-associativity (1, N, ∞)
• Eviction policy
• Number of levels of caching, parameters for each
• Separate I-cache from D-cache, or unified cache
• Prefetching policies / instructions
• Write policy
Takeaway

Direct Mapped → fast, but low hit rate
Fully Associative → higher hit cost, higher hit rate
Set Associative → middle ground

Line size matters. Larger cache lines can increase performance due to prefetching. BUT, they can also decrease performance if the working set cannot fit in the cache.

Cache performance is measured by the average memory access time (AMAT), which depends on cache architecture and size, but also on the hit access time, miss penalty, and hit rate.
What about Stores?

We want to write to the cache.

If the data is not in the cache? Bring it in. (Write-allocate policy)

Should we also update memory?
• Yes: write-through policy
• No: write-back policy
Write-Through Cache

16 byte, byte-addressed memory.
4 byte, fully-associative cache: 2-byte blocks, write-allocate.
4 bit addresses: 3 bit tag, 1 bit offset. Two lines, each with an lru bit, V, tag, and data.

Memory (addr: value): 0: 78, 1: 29, 2: 120, 3: 123, 4: 71, 5: 150, 6: 162, 7: 173, 8: 18, 9: 21, 10: 33, 11: 28, 12: 19, 13: 200, 14: 210, 15: 225

Instructions:
    LB $1 ← M[ 1 ]
    LB $2 ← M[ 7 ]
    SB $2 → M[ 0 ]
    SB $1 → M[ 5 ]
    LB $2 ← M[ 10 ]
    SB $1 → M[ 5 ]
    SB $1 → M[ 10 ]

Starting counts: Misses 0, Hits 0, Reads 0, Writes 0.
Write-Through (REFs 1-7)

REF 1: LB $1 ← M[1] → Miss. Read block {M[0], M[1]} = {78, 29} into line 0 (tag 000); $1 = 29. (Misses 1, Reads 2)
REF 2: LB $2 ← M[7] → Miss. Read block {M[6], M[7]} = {162, 173} into line 1 (tag 011); $2 = 173. (Misses 2, Reads 4)
REF 3: SB $2 → M[0] (clicker: HIT or MISS?) → Hit. Update the cached byte 0 to 173 and write through: M[0] = 173. (Hits 1, Writes 1)
REF 4: SB $1 → M[5] → Miss. Write-allocate: read block {M[4], M[5]} = {71, 150} into line 1 (the LRU line, evicting block 6-7), set byte 5 to 29, write through: M[5] = 29. (Misses 3, Reads 6, Writes 2)
REF 5: LB $2 ← M[10] (clicker: HIT or MISS?) → Miss. Read block {M[10], M[11]} = {33, 28} into line 0 (evicting block 0-1); $2 = 33. (Misses 4, Reads 8)
REF 6: SB $1 → M[5] → Hit. Update cached byte 5 (still 29) and write through: M[5] = 29. (Hits 2, Writes 3)
REF 7: SB $1 → M[10] → Hit. Update cached byte 10 to 29 and write through: M[10] = 29. (Hits 3, Writes 4)

Final: Misses 4, Hits 3, Memory reads 8, Memory writes 4.
Summary: Write Through

Write-through policy with write allocate:
• Cache miss: read the entire block from memory
• Write: write only the updated item to memory
• Eviction: no need to write to memory
Next Goal: Write-Through vs. Write-Back

What if we DON'T want to write stores immediately to memory?
– Keep the current copy in the cache, and update memory when the data is evicted (write-back policy)
– Write back all evicted lines?
  • No, only written-to blocks
Write-Back Meta-Data (Valid, Dirty Bits)

    V  D  Tag  Byte 1  Byte 2  …  Byte N

• V = 1 means the line has valid data
• D = 1 means the bytes are newer than main memory
• When allocating a line: set V = 1, D = 0, fill in Tag and Data
• When writing a line: set D = 1
• When evicting a line:
  – If D = 0: just set V = 0
  – If D = 1: write back Data, then set D = 0, V = 0
Write-back Example

• Example: How does a write-back cache work?


• Assume write-allocate
Handling Stores (Write-Back)

16 byte, byte-addressed memory.
4 byte, fully-associative cache: 2-byte blocks, write-allocate.
4 bit addresses: 3 bit tag, 1 bit offset. Two lines, each with an lru bit, V, d (dirty), tag, and data.

Memory contents and instruction stream are the same as in the write-through example.

Starting counts: Misses 0, Hits 0, Reads 0, Writes 0.
Write-Back (REFs 1-3)

REF 1: LB $1 ← M[1] → Miss. Read block {M[0], M[1]} = {78, 29} into line 0 (tag 000, D = 0); $1 = 29. (Misses 1, Reads 2)
REF 2: LB $2 ← M[7] → Miss. Read block {M[6], M[7]} = {162, 173} into line 1 (tag 011, D = 0); $2 = 173. (Misses 2, Reads 4)
REF 3: SB $2 → M[0] → Hit. Update the cached byte 0 to 173 and set D = 1 on line 0. No memory write yet: M[0] still holds 78. (Hits 1, Writes 0)

State entering REF 4 (SB $1 → M[5]): line 0 = (V=1, D=1, tag 000, {173, 29}), line 1 = (V=1, D=0, tag 011, {162, 173}); Misses 2, Hits 1, Reads 4, Writes 0.
29 6 162
SB $1  M[ 5 ] 1 1 0 011 162 7 173
SB $1  M[ 10 ] 173 8 18
9 21
Register File Cache 10 33
$0 11 28
29 Misses: 2 12 19
$1
$2 173 Hits: 1 13 200
$3 Reads: 4 14 210
Writes: 0 15 225
Write-Back (REF 4)

Memory
Instructions:
0 78
LB $1  M[ 1 ] M 1 29
LB $2  M[ 7 ] M 2 120
SB $2  M[ 0 ] Hit 3 123
SB $1  M[ 5 ] lru V d tag data 4 71
LB $2  M[ 10 ] 0 1 1 000 173 5 150
29 6 162
SB $1  M[ 5 ] 1 1 0 011 162 7 173
SB $1  M[ 10 ] 173 8 18
9 21
Register File Cache 10 33
$0 11 28
29 Misses: 3 12 19
$1
$2 173 Hits: 1 13 200
$3 Reads: 6 14 210
Writes: 0 15 225
Write-Back (REF 4)

Memory
Instructions:
0 78
LB $1  M[ 1 ] M 1 29
LB $2  M[ 7 ] M 2 120
SB $2  M[ 0 ] Hit 3 123
SB $1  M[ 5 ] M lru V d tag data 4 71
LB $2  M[ 10 ] 1 1 1 000 173 5 150
29 6 162
SB $1  M[ 5 ] 0 1 1 010 71 7 173
SB $1  M[ 10 ] 150
29 8 18
9 21
Register File Cache 10 33
$0 11 28
29 Misses: 3 12 19
$1
$2 173 Hits: 1 13 200
$3 Reads: 6 14 210
Writes: 0 15 225
Write-Back (REF 5)

Memory
Instructions:
0 78
LB $1  M[ 1 ] M 1 29
LB $2  M[ 7 ] M 2 120
SB $2  M[ 0 ] Hit 3 123
SB $1  M[ 5 ] M lru V d tag data 4 71
LB $2  M[ 10 ] 1 1 1 000 173 5 150
29 6 162
SB $1  M[ 5 ] 0 1 1 010 71 7 173
SB $1  M[ 10 ] 29 8 18
9 21
Register File Cache 10 33
$0 11 28
29 Misses: 3 12 19
$1
$2 173 Hits: 1 13 200
$3 Reads: 6 14 210
Writes: 0 15 225
Write-Back (REF 5)

Eviction, WB dirty block


Memory
Instructions:
0 173
78
LB $1  M[ 1 ] M 1 29
LB $2  M[ 7 ] M 2 120
SB $2  M[ 0 ] Hit 3 123
SB $1  M[ 5 ] M lru V d tag data 4 71
LB $2  M[ 10 ] 1 1 1 000 173 5 150
29 6 162
SB $1  M[ 5 ] 0 1 1 010 71 7 173
SB $1  M[ 10 ] 29 8 18
9 21
Register File Cache 10 33
$0 11 28
29 Misses: 3 12 19
$1
$2 173 Hits: 1 13 200
$3 Reads: 6 14 210
Writes: 2 15 225
Write-Back (REF 5)

Memory
Instructions:
0 173
LB $1  M[ 1 ] M 1 29
LB $2  M[ 7 ] M 2 120
SB $2  M[ 0 ] Hit 3 123
SB $1  M[ 5 ] M lru V d tag data 4 71
LB $2  M[ 10 ] M 0 1 0 101 33 5 150
28 6 162
SB $1  M[ 5 ] 1 1 1 010 71 7 173
SB $1  M[ 10 ] 29 8 18
9 21
Register File Cache 10 33
$0 11 28
29 Misses: 4 12 19
$1
$2 33 Hits: 1 13 200
$3 Reads: 8 14 210
Writes: 2 15 225
Write-Back (REF 6)

CLICKER:
(A) HIT Memory
Instructions:
LB $1  M[ 1 ] M (B) MISS 0 173
1 29
LB $2  M[ 7 ] M 2 120
SB $2  M[ 0 ] Hit 3 123
SB $1  M[ 5 ] M lru V d tag data 4 71
LB $2  M[ 10 ] M 0 1 0 101 33 5 150
28 6 162
SB $1  M[ 5 ] 1 1 1 010 71 7 173
SB $1  M[ 10 ] 29 8 18
9 21
Register File Cache 10 33
$0 11 28
29 Misses: 4 12 19
$1
$2 33 Hits: 1 13 200
$3 Reads: 8 14 210
Writes: 2 15 225
Write-Back (REF 6)

Memory
Instructions:
0 173
LB $1  M[ 1 ] M 1 29
LB $2  M[ 7 ] M 2 120
SB $2  M[ 0 ] Hit 3 123
SB $1  M[ 5 ] M lru V d tag data 4 71
LB $2  M[ 10 ] M 1 1 0 101 33 5 150
28 6 162
SB $1  M[ 5 ] Hit 0 1 1 010 71 7 173
SB $1  M[ 10 ] 29 8 18
9 21
Register File Cache 10 33
$0 11 28
29 Misses: 4 12 19
$1
$2 33 Hits: 2 13 200
$3 Reads: 8 14 210
Writes: 2 15 225
Write-Back (REF 7)

Memory
Instructions:
0 173
LB $1  M[ 1 ] M 1 29
LB $2  M[ 7 ] M 2 120
SB $2  M[ 0 ] Hit 3 123
SB $1  M[ 5 ] M lru V d tag data 4 71
LB $2  M[ 10 ] M 1 1 0 101 33 5 150
28 6 162
SB $1  M[ 5 ] Hit 0 1 1 010 71 7 173
SB $1  M[ 10 ] 29 8 18
9 21
Register File Cache 10 33
$0 11 28
29 Misses: 4 12 19
$1
$2 33 Hits: 2 13 200
$3 Reads: 8 14 210
Writes: 2 15 225
Write-Back (REF 7)

Memory
Instructions:
0 173
LB $1  M[ 1 ] M 1 29
LB $2  M[ 7 ] M 2 120
SB $2  M[ 0 ] Hit 3 123
SB $1  M[ 5 ] M lru V d tag data 4 71
LB $2  M[ 10 ] M 0 1 1 101 29 5 150
28 6 162
SB $1  M[ 5 ] Hit 1 1 1 010 71 7 173
SB $1  M[ 10 ] Hit 29 8 18
9 21
Register File Cache 10 33
$0 11 28
29 Misses: 4 12 19
$1
$2 33 Hits: 3 13 200
$3 Reads: 8 14 210
Writes: 2 15 225
Write-Back (REF 8,9)

Cheap subsequent updates!


M
M Memory
Instructions:
0 173
... Hit
1 29
SB $1  M[ 5 ] M 2 120
LB $2  M[ 10 ] M 3 123
SB $1  M[ 5 ] Hit lru V d tag data 4 71
SB $1  M[ 10 ] Hit 0 1 1 101 29 5 150
28 6 162
SB $1  M[ 5 ] 1 1 1 010 71 7 173
SB $1  M[ 10 ] 29 8 18
9 21
Register File Cache 10 33
$0 11 28
29 Misses: 4 12 19
$1
$2 33 Hits: 3 13 200
$3 Reads: 8 14 210
Writes: 2 15 225
Write-Back (REF 8,9)

M
M Memory
Instructions:
0 173
... Hit
1 29
SB $1  M[ 5 ] M 2 120
LB $2  M[ 10 ] M 3 123
SB $1  M[ 5 ] Hit lru V d tag data 4 71
SB $1  M[ 10 ] Hit 0 1 1 101 29 5 150
28 6 162
SB $1  M[ 5 ] Hit
1 1 1 010 71 7 173
SB $1  M[ 10 ] Hit 29 8 18
9 21
Register File Cache 10 33
$0 11 28
29 Misses: 4 12 19
$1
$2 33 Hits: 3 13 200
$3 Reads: 8 14 210
Writes: 2 15 225
How Many Memory References?

Write-back performance
• How many reads?
– Each miss (read or write) reads a block from mem
– 4 misses  8 mem reads
• How many writes?
– Some evictions write a block to mem
– 1 dirty eviction  2 mem writes
– (+ 2 dirty evictions later  +4 mem writes)
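These counts can be checked by swapping the store path: a write-back version of the same two-line simulator (a sketch; names are ours). Stores only set the dirty bit, and memory is written only when a dirty victim is evicted.

```c
#include <assert.h>

#define LINES 2
#define BLOCK 2  /* bytes per block */

typedef struct { int valid, dirty, tag, lru; unsigned char data[BLOCK]; } Line;

static unsigned char mem[16];
static Line cache[LINES];
static int misses, hits, mem_reads, mem_writes;

static int lookup(int tag) {
    for (int i = 0; i < LINES; i++)
        if (cache[i].valid && cache[i].tag == tag) return i;
    return -1;
}

static void touch(int i) {            /* mark line i most recently used */
    for (int j = 0; j < LINES; j++) cache[j].lru++;
    cache[i].lru = 0;
}

/* Miss: evict the LRU line, writing it back only if dirty,
 * then fill from memory (write-allocate). */
static int fill(int addr) {
    int v = 0;
    for (int i = 1; i < LINES; i++)
        if (cache[i].lru > cache[v].lru) v = i;
    if (cache[v].valid && cache[v].dirty)
        for (int b = 0; b < BLOCK; b++) {
            mem[cache[v].tag * BLOCK + b] = cache[v].data[b];
            mem_writes++;             /* dirty eviction: write whole block */
        }
    cache[v].valid = 1; cache[v].dirty = 0; cache[v].tag = addr / BLOCK;
    for (int b = 0; b < BLOCK; b++) {
        cache[v].data[b] = mem[(addr / BLOCK) * BLOCK + b];
        mem_reads++;
    }
    return v;
}

unsigned char load(int addr) {
    int i = lookup(addr / BLOCK);
    if (i < 0) { misses++; i = fill(addr); } else hits++;
    touch(i);
    return cache[i].data[addr % BLOCK];
}

void store(int addr, unsigned char val) {
    int i = lookup(addr / BLOCK);
    if (i < 0) { misses++; i = fill(addr); } else hits++;
    touch(i);
    cache[i].data[addr % BLOCK] = val;
    cache[i].dirty = 1;               /* memory updated only at eviction */
}
```

On the slides' stream this ends with Misses: 4, Hits: 3, Reads: 8, Writes: 2: same misses as write-through, but only the one dirty eviction (REF 5) reaches memory on the write side.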
Write-back vs. Write-through Example
Assume: large associative cache, 16-byte lines
N 4-byte words

for (i=1; i<n; i++) Write-thru: n/4 reads


A[0] += A[i]; n writes
Write-back: n/4 reads
1 write
for (i=0; i<n; i++)
B[i] = A[i] Write-thru: 2 x n/4 reads
n writes
Write-back: 2 x n/4 reads
n/4 writes
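The per-policy traffic above reduces to simple formulas. A sketch (helper names are ours) for the first loop, assuming 16-byte lines and 4-byte words as stated:

```c
#include <assert.h>

/* Memory traffic for:  for (i = 1; i < n; i++) A[0] += A[i];
 * with 16-byte lines holding four 4-byte words, in a cache
 * large enough that each line of A is fetched only once. */
typedef struct { int reads, writes; } Traffic;

Traffic write_thru(int n) {
    /* n/4 line fills to stream A in; every store to A[0]
     * goes straight to memory (~n writes, as on the slide) */
    Traffic t = { n / 4, n };
    return t;
}

Traffic write_back(int n) {
    /* same n/4 line fills; only A[0]'s line ever gets dirty,
     * so a single write-back at eviction */
    Traffic t = { n / 4, 1 };
    return t;
}
```

For n = 64 that is 16 reads either way, but 64 writes versus 1: the gap the "So is write back just better?" slide is pointing at.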
So is write back just better?

Short Answer: Yes (fewer writes is a good thing)


Long Answer: It’s complicated.
• Evictions require entire line be written back to
memory (vs. just the data that was written)
• Write-back can lead to incoherent caches on
multi-core processors (later lecture)
Optimization: Write Buffering

• Q: Writes to main memory are slow!


• A: Use a write-back buffer
– A small queue holding dirty lines
– Add to end upon eviction
– Remove from front upon completion
• Q: When does it help?
• A: short bursts of writes (but not sustained writes)
• A: fast eviction reduces miss penalty
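The write-back buffer described here is just a small FIFO between the cache and memory. A minimal sketch (sizes and names are our assumptions):

```c
#include <assert.h>

#define WB_DEPTH 4             /* small queue of dirty lines */
#define BLOCK    16

typedef struct { int addr; unsigned char data[BLOCK]; } DirtyLine;

static DirtyLine buf[WB_DEPTH];
static int head, tail, count;

/* On eviction: add the dirty line to the end. Returns 0 if the
 * buffer is full, in which case the cache must stall until memory
 * drains an entry. */
int wb_push(DirtyLine line) {
    if (count == WB_DEPTH) return 0;
    buf[tail] = line;
    tail = (tail + 1) % WB_DEPTH;
    count++;
    return 1;
}

/* When the memory write completes: remove from the front. */
int wb_pop(DirtyLine *out) {
    if (count == 0) return 0;
    *out = buf[head];
    head = (head + 1) % WB_DEPTH;
    count--;
    return 1;
}
```

Eviction becomes a quick enqueue instead of a slow memory write, which is why the buffer absorbs short bursts but not a sustained stream of writes.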
Write-through vs. Write-back

• Write-through is slower
– But simpler (memory always consistent)

• Write-back is almost always faster


– write-back buffer hides large eviction cost
– But what about multiple cores with separate caches but
sharing memory?
• Write-back requires a cache coherency protocol
– Inconsistent views of memory
– Need to “snoop” in each other’s caches
– Extremely complex protocols, very hard to get right
Cache-coherency
• Q: Multiple readers and writers?
A: Potentially inconsistent views of memory
[Figure: four CPUs, each with private L1 caches, above a shared L2,
memory, disk, and network; stale copies of a value A (A, A') can
exist in several caches at once]

Cache coherency protocol


• May need to snoop on other CPU’s cache activity
• Invalidate cache line when other CPU writes
• Flush write-back caches before other CPU reads
• Or the reverse: Before writing/reading…
• Extremely complex protocols, very hard to get right
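The snoop/invalidate/flush bullets can be made concrete with a toy two-CPU model (our own sketch, far simpler than a real MSI/MESI protocol, which a later lecture covers):

```c
#include <assert.h>

#define NCPU 2

/* One line's copy in each CPU's private cache. */
typedef struct { int valid; int value; } Copy;

static Copy priv[NCPU];
static int mem_value;

/* CPU i writes: update its own copy, and invalidate every other
 * copy (the "snoop" on the shared bus). */
void cpu_write(int i, int v) {
    priv[i].valid = 1;
    priv[i].value = v;
    for (int j = 0; j < NCPU; j++)
        if (j != i) priv[j].valid = 0;
}

/* CPU i reads: on a miss, flush any other cache's valid copy to
 * memory first, then read the up-to-date value. */
int cpu_read(int i) {
    if (!priv[i].valid) {
        for (int j = 0; j < NCPU; j++)
            if (priv[j].valid) mem_value = priv[j].value;  /* flush */
        priv[i].valid = 1;
        priv[i].value = mem_value;
    }
    return priv[i].value;
}
```

Even this toy shows the cost: every write broadcasts an invalidation, and every read may force another cache to flush first.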
Takeaway
• Write-through policy with write allocate
• Cache miss: read entire block from memory
• Write: write only updated item to memory
• Eviction: no need to write to memory
• Slower, but cleaner

• Write-back policy with write allocate


• Cache miss: read entire block from memory
  **But may need to write back a dirty cache line first**
– Write: nothing to memory (just set the dirty bit)
– Eviction: must write the entire cache line back to memory,
because the single dirty bit can't say which bytes changed
– Faster, but more complicated, especially with multicore
Cache Conscious Programming
// H = 6, W = 10
int A[H][W];

for(x=0; x < W; x++)
  for(y=0; y < H; y++)
    sum += A[y][x];

[Figure: the 6×10 matrix is stored row-by-row in memory,
but this loop walks it column-by-column]

Every access a cache miss!
(unless entire matrix fits in cache)
Cache Conscious Programming
// H = 6, W = 10
int A[H][W];

for(x=0; x < H; x++)
  for(y=0; y < W; y++)
    sum += A[x][y];

[Figure: this loop walks the matrix row-by-row,
matching its layout in memory]

• Block size = 4 → 75% hit rate
• Block size = 8 → 87.5% hit rate
• Block size = 16 → 93.75% hit rate
• And you can easily prefetch to warm the cache
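The hit rates above can be reproduced with a one-block cache model (a sketch under stated assumptions: 4 array elements per block, row-major storage, only the most recently touched block cached):

```c
#include <assert.h>

#define H 6
#define W 10
#define BLK 4   /* array elements per cache block */

/* Misses for the good loop order (sum += A[x][y]):
 * count a miss whenever the element lands in a block other
 * than the last one touched. */
int misses_row_major(void) {
    int last = -1, misses = 0;
    for (int x = 0; x < H; x++)
        for (int y = 0; y < W; y++) {
            int block = (x * W + y) / BLK;  /* A is row-major */
            if (block != last) { misses++; last = block; }
        }
    return misses;
}

/* Misses for the bad loop order (sum += A[y][x]). */
int misses_col_major(void) {
    int last = -1, misses = 0;
    for (int x = 0; x < W; x++)
        for (int y = 0; y < H; y++) {
            int block = (y * W + x) / BLK;
            if (block != last) { misses++; last = block; }
        }
    return misses;
}
```

For the 6×10 matrix this gives 15 misses out of 60 accesses (the 75% hit rate on the slide) in row-major order, versus 60 misses (every access) in column-major order.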
Clicker Question
Choose the best block size for your cache among the
choices given. Assume that integers and pointers are all 4
bytes each and that the scores array is 4-byte aligned.
(a) 1 byte (b) 4 bytes (c) 8 bytes (d) 16 bytes (e) 32 bytes

int scores[NUM_STUDENTS] = {0};
int sum = 0;
for (i = 0; i < NUM_STUDENTS; i++) {
  sum += scores[i];
}

137
Clicker Question
Choose the best block size for your cache among the
choices given. Assume integers and pointers are 4 bytes.
(a) 1 byte (b) 4 bytes (c) 8 bytes (d) 16 bytes (e) 32 bytes
typedef struct item_t {
int value;
struct item_t *next;
char *name;
} item_t;

int sum = 0;
item_t *curr = list_head;
while (curr != NULL) {
sum += curr->value;
curr = curr->next;
} 139
By the end of the cache lectures…

A Real Example: Microsoft Surfacebook
Dual core, Intel i7-6600 CPU @ 2.6 GHz (purchased in 2016)

> dmidecode -t cache
Cache Information
  Socket Designation: L1 Cache
  Configuration: Enabled, Not Socketed, Level 1
  Operational Mode: Write Back
  Location: Internal
  Installed Size: 128 kB
  Maximum Size: 128 kB
  Supported SRAM Types: Synchronous
  Installed SRAM Type: Synchronous
  Speed: Unknown
  Error Correction Type: Parity
  System Type: Unified
  Associativity: 8-way Set-associative
Cache Information
  Socket Designation: L2 Cache
  Configuration: Enabled, Not Socketed, Level 2
  Operational Mode: Write Back
  Location: Internal
  Installed Size: 512 kB
  Maximum Size: 512 kB
  Supported SRAM Types: Synchronous
  Installed SRAM Type: Synchronous
  Speed: Unknown
  Error Correction Type: Single-bit ECC
  System Type: Unified
  Associativity: 4-way Set-associative
Cache Information
  Socket Designation: L3 Cache
  Configuration: Enabled, Not Socketed, Level 3
  Operational Mode: Write Back
  Location: Internal
  Installed Size: 4096 kB
  Maximum Size: 4096 kB
  Supported SRAM Types: Synchronous
  Installed SRAM Type: Synchronous
  Speed: Unknown
  Error Correction Type: Multi-bit ECC
  System Type: Unified
  Associativity: 16-way Set-associative
A Real Example
Dual-core 3.16GHz Intel (purchased in 2011)

> sudo dmidecode -t cache
Cache Information
  Configuration: Enabled, Not Socketed, Level 1
  Operational Mode: Write Back
  Installed Size: 128 KB
  Error Correction Type: None
Cache Information
  Configuration: Enabled, Not Socketed, Level 2
  Operational Mode: Varies With Memory Address
  Installed Size: 6144 KB
  Error Correction Type: Single-bit ECC

> cd /sys/devices/system/cpu/cpu0; grep cache/*/*
cache/index0/level:1
cache/index0/type:Data
cache/index0/ways_of_associativity:8
cache/index0/number_of_sets:64
cache/index0/coherency_line_size:64
cache/index0/size:32K
cache/index1/level:1
cache/index1/type:Instruction
cache/index1/ways_of_associativity:8
cache/index1/number_of_sets:64
cache/index1/coherency_line_size:64
cache/index1/size:32K
cache/index2/level:2
cache/index2/type:Unified
cache/index2/shared_cpu_list:0-1
cache/index2/ways_of_associativity:24
cache/index2/number_of_sets:4096
cache/index2/coherency_line_size:64
cache/index2/size:6144K
A Real Example
Dual-core 3.16GHz Intel (purchased in 2009)

• Dual 32K L1 Instruction caches
  – 8-way set associative
  – 64 sets
  – 64 byte line size
• Dual 32K L1 Data caches
  – Same as above
• Single 6M L2 Unified cache
  – 24-way set associative (!!!)
  – 4096 sets
  – 64 byte line size
• 4GB Main memory
• 1TB Disk
Summary
• Memory performance matters!
– often more than CPU performance
– … because it is the bottleneck, and not improving much
– … because most programs move a LOT of data
• Design space is huge
– Gambling against program behavior
– Cuts across all layers:
users  programs  os  hardware
• NEXT: Multi-core processors are complicated
– Inconsistent views of memory
– Extremely complex protocols, very hard to get right