CS252 Graduate Computer Architecture
Caches and Memory Systems I
Question: Who Cares About the
Memory Hierarchy?
[Figure: processor vs. DRAM performance, 1980-2000, log scale. CPU performance (“Moore’s Law”) grows ~60%/yr while DRAM (“Less’ Law?”) grows ~7%/yr, so the processor-memory performance gap grows ~50%/yr.]
• 1980: no cache in µproc; 1995: 2-level cache on chip
(1989: first Intel µproc with a cache on chip)
Generations of Microprocessors
• Time of a full cache miss in instructions executed:
  1st Alpha: 340 ns / 5.0 ns =  68 clks × 2 instr/clk = 136 instructions
  2nd Alpha: 266 ns / 3.3 ns =  80 clks × 4 instr/clk = 320 instructions
  3rd Alpha: 180 ns / 1.7 ns = 108 clks × 6 instr/clk = 648 instructions
• 1/2× the latency × 3× the clock rate × 3× the instructions/clock => ~5× the miss cost (648/136 ≈ 4.8)
Processor-Memory
Performance Gap “Tax”
Processor          % Area (cost)   % Transistors (power)
Alpha 21164              37%               77%
StrongArm SA110          61%               94%
Pentium Pro              64%               88%
  (2 dies per package: Proc/I$/D$ + L2$)
• Caches have no inherent value; they only try to close the performance gap
What is a cache?
• Small, fast storage used to improve average access
time to slow memory.
• Exploits spatial and temporal locality
• In computer architecture, almost everything is a cache!
  – Registers: a cache on variables
  – First-level cache: a cache on the second-level cache
  – Second-level cache: a cache on memory
  – Memory: a cache on disk (virtual memory)
  – TLB: a cache on the page table
  – Branch prediction: a cache on prediction information?
[Figure: the hierarchy Proc/Regs → L1-Cache → L2-Cache → Memory; levels get bigger moving away from the processor and faster moving toward it.]
Set Associative Cache
• N-way set associative: N entries for each Cache Index
  – N direct-mapped caches operate in parallel
• Example: Two-way set associative cache
  – Cache Index selects a “set” from the cache
  – The two tags in the set are compared to the input tag in parallel
  – Data is selected based on the tag comparison result
[Figure: two-way set associative lookup. The Cache Index selects one set; the two Cache Tags are compared against the Adr Tag in parallel, the compare results are ORed to form Hit, and a mux (Sel1/Sel0) selects the hitting way’s Cache Block.]
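To make the lookup concrete, here is a minimal C model of a two-way lookup; the 8 KB / 32-byte-block / 2-way geometry and all names are illustrative assumptions:

#include <stdint.h>
#include <stdbool.h>

#define BLOCK_BITS 5                      /* 32-byte blocks */
#define INDEX_BITS 7
#define NUM_SETS   (1u << INDEX_BITS)     /* 128 sets x 2 ways x 32 B = 8 KB */

typedef struct {
    bool     valid;
    uint32_t tag;
    uint8_t  data[1 << BLOCK_BITS];
} CacheLine;

static CacheLine cache[NUM_SETS][2];      /* [set][way] */

/* The two tag compares happen in parallel in hardware (Compare + OR);
   this software model simply tries each way in turn. */
bool lookup(uint32_t addr, uint8_t **block_out)
{
    uint32_t index = (addr >> BLOCK_BITS) & (NUM_SETS - 1);
    uint32_t tag   = addr >> (BLOCK_BITS + INDEX_BITS);

    for (int way = 0; way < 2; way++) {
        CacheLine *line = &cache[index][way];
        if (line->valid && line->tag == tag) {
            *block_out = line->data;      /* models the Sel0/Sel1 mux */
            return true;                  /* hit */
        }
    }
    return false;                         /* miss */
}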
Disadvantage of Set Associative Cache
• N-way Set Associative Cache versus Direct Mapped
Cache:
– N comparators vs. 1
– Extra MUX delay for the data
– Data comes AFTER Hit/Miss decision and set selection
• In a direct mapped cache, Cache Block is available
BEFORE Hit/Miss:
– Possible to assume a hit and continue. Recover later if miss.
Review: Cache performance
• Miss-oriented approach to memory access:

  CPUtime = IC × (CPI_Execution + (MemAccess/Inst) × MissRate × MissPenalty) × CycleTime

  CPUtime = IC × (CPI_Execution + (MemMisses/Inst) × MissPenalty) × CycleTime
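For instance (illustrative numbers only): with CPI_Execution = 1.1, 1.3 memory accesses per instruction, a 2% miss rate, and a 50-cycle miss penalty,

  CPUtime = IC × (1.1 + 1.3 × 0.02 × 50) × CycleTime = IC × 2.4 × CycleTime

so memory stalls more than double the effective CPI.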
Example: Harvard Architecture
• Unified vs. Separate I&D (Harvard)
[Figure: left, a processor with a single unified Cache-1 backed by a unified Cache-2; right, a processor with separate I-Cache-1 and D-Cache-1 backed by a unified Cache-2. Separate caches let an instruction fetch and a data access proceed in the same cycle.]
Review: Four Questions for
Memory Hierarchy Designers
• Q1: Where can a block be placed in the upper level?
(Block placement)
– Fully Associative, Set Associative, Direct Mapped
• Q2: How is a block found if it is in the upper level?
(Block identification)
– Tag/Block
• Q3: Which block should be replaced on a miss?
(Block replacement)
– Random, LRU
• Q4: What happens on a write?
(Write strategy)
– Write Back or Write Through (with Write Buffer)
Review: Improving Cache
Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
Reducing Misses
• Classifying Misses: 3 Cs
  – Compulsory—The first access to a block can never hit in the cache, so the block must be brought into the cache. Also called cold-start misses or first-reference misses.
    (Misses in even an infinite cache)
  – Capacity—If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur as blocks are discarded and later retrieved.
    (Misses in a fully associative cache of size X)
  – Conflict—If the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved when too many blocks map to its set. Also called collision misses or interference misses.
    (Misses in an N-way associative cache of size X)
• More recently, a 4th “C”:
  – Coherence—misses caused by cache coherence (e.g., invalidations from other processors).
3Cs Absolute Miss Rate
(SPEC92)
[Figure: absolute miss rate per type vs. cache size (1-128 KB) for 1-, 2-, 4-, and 8-way associativity (SPEC92). Conflict misses shrink as associativity rises, capacity misses dominate, and compulsory misses are vanishingly small.]
2:1 Cache Rule
miss rate of a 1-way associative cache of size X
  ≈ miss rate of a 2-way associative cache of size X/2
[Figure: the same 3Cs absolute miss-rate plot as above, annotated with the 2:1 rule.]
3Cs Relative Miss Rate
[Figure: relative miss rate per type (0%-100%) vs. cache size (1-128 KB) for 1-, 2-, 4-, and 8-way associativity.]
Flaws: for a fixed block size.
Good: insight => invention.
How Can We Reduce Misses?
• 3 Cs: Compulsory, Capacity, Conflict
• In all cases, assume the total cache size is unchanged
• What happens if:
1) Change Block Size:
Which of 3Cs is obviously affected?
2) Change Associativity:
Which of 3Cs is obviously affected?
3) Change Compiler:
Which of 3Cs is obviously affected?
1. Reduce Misses via Larger
Block Size
[Figure: miss rate vs. block size (16-256 bytes) for cache sizes 1K-256K. Larger blocks first lower the miss rate (spatial locality), then raise it for small caches as too few blocks remain.]
2. Reduce Misses via Higher
Associativity
• 2:1 Cache Rule:
  – Miss rate of a DM cache of size N ≈ miss rate of a 2-way cache of size N/2
• Beware: execution time is the only final measure!
  – Will clock cycle time increase?
  – Hill [1988] suggested the hit time for 2-way vs. 1-way is +10% for an external cache, +2% for an internal cache
Example: Avg. Memory Access
Time vs. Miss Rate
• Example: assume CCT = 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way vs. CCT of direct mapped

  Cache Size (KB)   1-way   2-way   4-way   8-way
        1           2.33    2.15    2.07    2.01
        2           1.98    1.86    1.76    1.68
        4           1.72    1.67    1.61    1.53
        8           1.46    1.48    1.47    1.43
       16           1.29    1.32    1.32    1.32
       32           1.20    1.24    1.25    1.27
       64           1.14    1.20    1.21    1.23
      128           1.10    1.17    1.18    1.20

  (AMAT in cycles; from 8 KB up, the slower clock of higher associativity outweighs the miss-rate gain.)
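The table is just AMAT = HitTime × CCT + MissRate × MissPenalty evaluated per configuration. A minimal C sketch; the 10-cycle miss penalty and the miss rates below are back-derived so the 1 KB row matches, not values from the slides:

#include <stdio.h>

/* AMAT in direct-mapped clock cycles: a 1-cycle hit stretched by the
   clock-cycle-time (CCT) factor, plus the expected miss stalls. */
double amat(double cct_factor, double miss_rate, double miss_penalty)
{
    return 1.0 * cct_factor + miss_rate * miss_penalty;
}

int main(void)
{
    printf("1 KB, 1-way: %.2f\n", amat(1.00, 0.133, 10.0));  /* 2.33 */
    printf("1 KB, 2-way: %.2f\n", amat(1.10, 0.105, 10.0));  /* 2.15 */
    return 0;
}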
3. Reducing Misses via a
“Victim Cache”
• How to combine the fast hit time of direct mapped yet still avoid conflict misses?
• Add a buffer to hold data discarded from the cache
• Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct-mapped data cache
• Used in Alpha, HP machines
[Figure: beside the main TAGS/DATA arrays, a fully associative victim cache of four entries, each a tag-and-comparator paired with one cache line of data.]
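In software-model form, the probe that happens on a main-cache miss might look like this (a sketch: the 4-entry, fully associative organization follows Jouppi’s experiment; names and the 32-byte line are assumptions):

#include <stdint.h>
#include <stdbool.h>

#define VC_ENTRIES 4

typedef struct {
    bool     valid;
    uint32_t block_addr;            /* full block address acts as the tag */
    uint8_t  data[32];
} VictimLine;

static VictimLine victim[VC_ENTRIES];

/* On a main-cache miss, probe every victim entry (fully associative).
   On a hit, the victim line and the line evicted from the main cache
   are swapped; on a miss, go on to the next memory level. */
bool victim_lookup(uint32_t block_addr, VictimLine **hit_line)
{
    for (int i = 0; i < VC_ENTRIES; i++) {
        if (victim[i].valid && victim[i].block_addr == block_addr) {
            *hit_line = &victim[i];
            return true;
        }
    }
    return false;
}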
4. Reducing Misses via
“Pseudo-Associativity”
• How to combine the fast hit time of direct mapped with the lower conflict misses of a 2-way SA cache?
• Divide the cache: on a miss, check the other half of the cache to see if the block is there; if so, it is a “pseudo-hit” (slow hit)
[Figure: access-time line; a pseudo-hit takes longer than a normal hit but much less than a full miss.]
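A minimal C sketch of the probe sequence, assuming the common implementation that finds the “other half” by flipping the most-significant index bit (all names and sizes here are illustrative):

#include <stdint.h>
#include <stdbool.h>

#define INDEX_BITS 8
#define NUM_SETS   (1u << INDEX_BITS)

static struct { bool valid; uint32_t tag; } cache[NUM_SETS];

static bool probe(uint32_t index, uint32_t tag)
{
    return cache[index].valid && cache[index].tag == tag;
}

/* 0 = fast hit in the home set, 1 = slow pseudo-hit in the other
   half (top index bit flipped), -1 = miss in both halves. */
int pseudo_assoc_lookup(uint32_t index, uint32_t tag)
{
    if (probe(index, tag)) return 0;
    if (probe(index ^ (NUM_SETS >> 1), tag)) return 1;
    return -1;
}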
5. Reducing Misses by Hardware
Prefetching of Instructions & Data
• E.g., instruction prefetching
  – Alpha 21064 fetches 2 blocks on a miss
  – The extra block is placed in a “stream buffer”
  – On a miss, check the stream buffer
• Works with data blocks too:
  – Jouppi [1990]: 1 data stream buffer caught 25% of the misses from a 4KB cache; 4 streams caught 43%
  – Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of the misses from two 64KB, 4-way set associative caches
• Prefetching relies on having extra memory bandwidth that can be used without penalty
6. Reducing Misses by
Software Prefetching Data
• Data prefetch
  – Load data into register (HP PA-RISC loads)
  – Cache prefetch: load into cache (MIPS IV, PowerPC, SPARC v. 9)
  – Special prefetching instructions cannot cause faults; a form of speculative execution
• Prefetching comes in two flavors:
  – Binding prefetch: requests load directly into a register.
    » Must be the correct address and register!
  – Non-binding prefetch: load into the cache (see the sketch after this list).
    » Can be incorrect. Frees HW/SW to guess!
• Issuing prefetch instructions takes time
  – Is the cost of issuing prefetches < the savings in reduced misses?
  – Wider superscalar issue reduces the difficulty of finding issue bandwidth for them
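On GCC or Clang, the non-binding flavor can be expressed with the real __builtin_prefetch(addr, rw, locality) builtin; the loop, the distance, and the parameter choices below are illustrative:

/* Non-binding cache prefetch: start fetching a[i+DIST] into the
   cache while summing a[i]; a useless prefetch cannot fault. */
#define DIST 16   /* prefetch distance in elements (a tuning assumption) */

long sum(const long *a, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST], 0 /* read */, 1 /* low reuse */);
        s += a[i];
    }
    return s;
}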
7. Reducing Misses by
Compiler Optimizations
• McFarling [1989] reduced cache misses by 75% on an 8KB direct-mapped cache with 4-byte blocks, in software
• Instructions
  – Reorder procedures in memory so as to reduce conflict misses
  – Profiling to look at conflicts (using tools they developed)
• Data
  – Merging arrays: improve spatial locality by using a single array of compound elements vs. 2 arrays
  – Loop interchange: change the nesting of loops to access data in the order it is stored in memory
  – Loop fusion: combine 2 independent loops that have the same looping and some variables in common
  – Blocking: improve temporal locality by accessing “blocks” of data repeatedly vs. going down whole columns or rows
Merging Arrays Example
/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];
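The merged version replaces the two parallel arrays with a single array of structures, so the val and key for the same index share a cache block (a direct instance of the transformation described above; SIZE as before):

/* After: 1 array of structures */
struct merge {
  int val;
  int key;
};
struct merge merged_array[SIZE];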
Loop Interchange Example
/* Before */
for (k = 0; k < 100; k = k+1)
for (j = 0; j < 100; j = j+1)
for (i = 0; i < 5000; i = i+1)
x[i][j] = 2 * x[i][j];
/* After */
for (k = 0; k < 100; k = k+1)
for (i = 0; i < 5000; i = i+1)
for (j = 0; j < 100; j = j+1)
x[i][j] = 2 * x[i][j];
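After the interchange, the inner loop walks x[i][j], x[i][j+1], ... which are adjacent in C’s row-major layout: sequential accesses instead of a stride of 100 words, improving spatial locality.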
Loop Fusion Example
/* Before */
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
d[i][j] = a[i][j] + c[i][j];
/* After */
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
{ a[i][j] = 1/b[i][j] * c[i][j];
d[i][j] = a[i][j] + c[i][j];}
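Before fusion, the second loop re-reads a[i][j] and c[i][j] long after the first loop touched them; after fusion, each element is consumed while it is still in the cache, turning two misses per access to a and c into one.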
Blocking Example
/* Before */
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
{r = 0;
for (k = 0; k < N; k = k+1){
r = r + y[i][k]*z[k][j];};
x[i][j] = r;
};
• Two inner loops:
  – Read all N×N elements of z[]
  – Read N elements of 1 row of y[] repeatedly
  – Write N elements of 1 row of x[]
• Capacity misses are a function of N & cache size:
  – 2N³ + N² words accessed => (assuming no conflict; otherwise …)
• Idea: compute on a B×B submatrix that fits in the cache
Blocking Example
/* After; assumes x[][] starts zeroed and min(a,b) is defined,
   e.g. #define min(a,b) ((a) < (b) ? (a) : (b)) */
for (jj = 0; jj < N; jj = jj+B)
for (kk = 0; kk < N; kk = kk+B)
for (i = 0; i < N; i = i+1)
for (j = jj; j < min(jj+B,N); j = j+1)
{r = 0;
for (k = kk; k < min(kk+B,N); k = k+1) {
r = r + y[i][k]*z[k][j];};
x[i][j] = x[i][j] + r;
};
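B is the blocking factor: with a B×B block of z[] and a row strip of y[] resident in the cache, the words accessed drop from 2N³ + N² to roughly 2N³/B + N².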
Reducing Conflict Misses by Blocking
[Figure: left, miss rate (0-0.1) vs. blocking factor (0-150); right, the resulting performance improvement (1x-3x).]
Write Policy:
Write-Through vs Write-Back
• Write-through: all writes update cache and underlying
memory/cache
– Can always discard cached data - most up-to-date data is in memory
– Cache control bit: only a valid bit
• Write-back: all writes simply update cache
– Can’t just discard cached data - may have to write it back to memory
– Cache control bits: both valid and dirty bits
• Other Advantages:
– Write-through:
» memory (or other processors) always have latest data
» Simpler management of cache
– Write-back:
» much lower bandwidth, since data often overwritten multiple times
» Better tolerance to long-latency memory?
Write Policy 2:
Write Allocate vs Non-Allocate
(What happens on write-miss)
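• Write allocate (“fetch on write”): on a write miss, allocate the block in the cache and then write into it; usually paired with write-back
• Write no-allocate (“write around”): on a write miss, send the write to the lower level without allocating a cache block; usually paired with write-through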
1. Reducing Miss Penalty:
Read Priority over Write on Miss
[Figure: the CPU’s reads and writes go through the cache; a write buffer between the cache and DRAM (or lower memory) holds outgoing writes.]
• Write-through with write buffers creates RAW conflicts with main-memory reads on cache misses
  – If we simply wait for the write buffer to empty, the read miss penalty might increase (by 50% on the old MIPS 1000)
  – Check write buffer contents before a read; if there are no conflicts, let the memory access continue (see the sketch after this list)
• Write-back also wants a buffer, to hold displaced dirty blocks
  – Read miss replacing a dirty block
  – Normal: write the dirty block to memory, and then do the read
  – Instead: copy the dirty block to a write buffer, then do the read, and then do the write
  – The CPU stalls less, since it restarts as soon as the read is done
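A minimal C sketch of the check-before-read rule; the buffer depth, field layout, and names are assumptions:

#include <stdint.h>
#include <stdbool.h>

#define WB_ENTRIES 4

typedef struct { bool valid; uint32_t addr; uint32_t data; } WBEntry;

static WBEntry write_buffer[WB_ENTRIES];

/* Before sending a read miss to memory, scan the write buffer.
   On an address match, forward the buffered value (the RAW hazard is
   satisfied here); otherwise the read may safely bypass the writes. */
bool read_check(uint32_t addr, uint32_t *data_out)
{
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (write_buffer[i].valid && write_buffer[i].addr == addr) {
            *data_out = write_buffer[i].data;
            return true;
        }
    }
    return false;   /* no conflict: let the memory read continue */
}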
2. Reduce Miss Penalty:
Early Restart and Critical Word
First
• Don’t wait for the full block to be loaded before restarting the CPU
  – Early restart—As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
  – Critical word first—Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first
• Generally useful only for large blocks
• Spatial locality is a problem: the CPU tends to want the next sequential word soon, so it is not clear early restart helps
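For example (illustrative): with an 8-word block and a miss on word 5, critical word first requests the words in the order 5, 6, 7, 0, 1, 2, 3, 4 (wrapping around), and the CPU restarts as soon as word 5 arrives.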
3. Reduce Miss Penalty: Non-blocking
Caches to Reduce Stalls on Misses
• A non-blocking cache or lockup-free cache allows the data cache to continue to supply cache hits during a miss
  – requires F/E bits on registers or out-of-order execution
  – requires multi-bank memories
• “Hit under miss” reduces the effective miss penalty by working during the miss vs. ignoring CPU requests
• “Hit under multiple miss” or “miss under miss” may further lower the effective miss penalty by overlapping multiple misses (see the MSHR sketch after this list)
  – Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
  – Requires multiple memory banks (otherwise cannot be supported)
  – Pentium Pro allows 4 outstanding memory misses
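Miss-under-miss bookkeeping is typically done with miss status holding registers (MSHRs), one per outstanding miss. A minimal C sketch; the field layout is illustrative, with 4 entries echoing the Pentium Pro bullet above:

#include <stdint.h>
#include <stdbool.h>

#define NUM_MSHR 4                 /* max outstanding misses */

typedef struct {
    bool     valid;                /* miss in flight? */
    uint32_t block_addr;           /* block being fetched */
    uint8_t  dest_reg;             /* register waiting on the data */
} MSHR;

static MSHR mshr[NUM_MSHR];

/* Allocate an MSHR for a new miss; if all are busy, the cache
   must stall until an outstanding miss returns. */
int mshr_allocate(uint32_t block_addr, uint8_t dest_reg)
{
    for (int i = 0; i < NUM_MSHR; i++) {
        if (!mshr[i].valid) {
            mshr[i] = (MSHR){ true, block_addr, dest_reg };
            return i;              /* MSHR id */
        }
    }
    return -1;                     /* all busy: stall */
}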
Value of Hit Under Miss for SPEC
[Figure: “Hit under i Misses” — average memory access time (0-1.8) for each SPEC92 benchmark (doduc, ora, ear, xlisp, fpppp, eqntott, tomcatv, alvinn, nasa7, wave5, mdljdp2, hydro2d, su2cor, mdljsp2, espresso, swm256, spice2g6, compress) under the Base configuration and with 0→1, 1→2, and 2→64 outstanding misses.]
• FP programs on average: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26
• Int programs on average: AMAT = 0.24 -> 0.20 -> 0.19 -> 0.19
• 8 KB data cache, direct mapped, 32B blocks, 16-cycle miss penalty
4: Add a second-level cache
• L2 equations:

  AMAT = Hit Time_L1 + Miss Rate_L1 × Miss Penalty_L1
  Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2
  AMAT = Hit Time_L1 + Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2)

• Definitions:
  – Local miss rate—misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)
  – Global miss rate—misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 × Miss Rate_L2)
  – Global miss rate is what matters
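For example (illustrative numbers): if L1 misses on 4% of CPU accesses and L2 misses on 50% of the accesses that reach it, the local L2 miss rate is 50%, but the global L2 miss rate is 0.04 × 0.50 = 2%, which is what the CPU actually experiences.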
Comparing Local and Global
Miss Rates
• 32 KByte 1st-level cache; increasing 2nd-level cache size
• Global miss rate is close to the single-level cache miss rate, provided L2 >> L1
• Don’t use the local miss rate
• L2 is not tied to the CPU clock cycle!
[Figure: local and global miss rates vs. L2 cache size, plotted on linear and log scales.]
Reducing Misses:
Which apply to L2 Cache?
• Reducing Miss Rate
1. Reduce Misses via Larger Block Size
2. Reduce Conflict Misses via Higher Associativity
3. Reducing Conflict Misses via Victim Cache
4. Reducing Conflict Misses via Pseudo-Associativity
5. Reducing Misses by HW Prefetching Instr, Data
6. Reducing Misses by SW Prefetching Data
7. Reducing Capacity/Conf. Misses by Compiler Optimizations
L2 cache block size &
A.M.A.T.
Relative CPU time vs. L2 block size:

  Block Size (bytes)   16     32     64     128    256    512
  Relative CPU Time    1.95   1.54   1.36   1.28   1.27   1.34

(128-256 byte blocks give the lowest relative CPU time.)
What is the Impact of What
You’ve Learned About Caches?
[Figure: processor vs. DRAM performance, 1980-2000, log scale; the same gap plot as the opening slide.]
• 1960-1985: Speed = ƒ(no. operations)
• 1990:
  – Pipelined execution & fast clock rate
  – Out-of-order execution
  – Superscalar instruction issue
• 1998: Speed = ƒ(non-cached memory accesses)
• Superscalar, out-of-order machines hide the L1 data cache miss (~5 clocks) but not the L2 cache miss (~50 clocks)?
Cache Optimization Summary
Technique                                    MR   MP   HT   Complexity
Larger Block Size                            +    –         0
Higher Associativity                         +         –    1
Victim Caches                                +              2
Pseudo-Associative Caches                    +              2
HW Prefetching of Instructions/Data          +              2-3
SW (Compiler-Controlled) Prefetching         +              3
Compiler Techniques to Reduce Misses         +              0
Read Priority over Writes on Miss                 +         1
Early Restart & Critical Word First               +         2
Non-Blocking Caches                               +         3
Second-Level Caches                               +         2

(MR = miss rate, MP = miss penalty, HT = hit time; + helps the factor, – hurts it.)