CS252 Graduate Computer Architecture Caches and Memory Systems I

The document discusses caches and memory systems. It notes that the performance gap between processors and memory grows over time, because processor performance improves far faster than DRAM performance. Caches help close this gap by exploiting locality to provide fast access to frequently used data. Cache performance depends on factors like hit rate, miss rate, hit time, and miss penalty. A cache can be direct mapped, set associative, or fully associative depending on its organization.


CS252

Graduate Computer Architecture


Lecture 3

Caches and Memory Systems I

January 24, 2001


Prof. John Kubiatowicz

Question: Who Cares About the Memory Hierarchy?

[Figure: processor vs. DRAM performance, 1980–2000, log scale 1–1000. µProc performance grows 60%/yr ("Moore's Law"); DRAM performance grows 7%/yr ("Less' Law?"); the processor-memory performance gap grows 50%/year.]
• 1980: no cache in µproc; 1995: 2-level cache on chip
(1989: first Intel µproc with a cache on chip)
Generations of Microprocessors
• Time of a full cache miss in instructions executed:
– 1st Alpha: 340 ns / 5.0 ns = 68 clks × 2-way issue, or 136 instructions
– 2nd Alpha: 266 ns / 3.3 ns = 80 clks × 4-way issue, or 320 instructions
– 3rd Alpha: 180 ns / 1.7 ns = 108 clks × 6-way issue, or 648 instructions
• 1/2X latency × 3X clock rate × 3X instr/clock ⇒ ≈5X
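The pattern behind these numbers, restated as a formula:

$$\text{miss cost (instruction slots)} = \frac{\text{miss latency}}{\text{cycle time}} \times \frac{\text{instructions}}{\text{cycle}}, \qquad \text{e.g. } \frac{180\ \text{ns}}{1.7\ \text{ns}} \times 6 \approx 648$$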

Processor-Memory Performance Gap "Tax"
Processor            % Area (≈ cost)   % Transistors (≈ power)
• Alpha 21164              37%                 77%
• StrongArm SA110          61%                 94%
• Pentium Pro              64%                 88%
  – 2 dies per package: Proc/I$/D$ + L2$
• Caches have no inherent value,
only try to close performance gap

What is a cache?
• Small, fast storage used to improve average access time to slow memory.
• Exploits spatial and temporal locality
• In computer architecture, almost everything is a cache!
– Registers: a cache on variables
– First-level cache: a cache on second-level cache
– Second-level cache: a cache on memory
– Memory: a cache on disk (virtual memory)
– TLB: a cache on the page table
– Branch prediction: a cache on prediction information?
The hierarchy, from smaller/faster to bigger/slower:
Proc/Regs → L1-Cache → L2-Cache → Memory → Disk, Tape, etc.
Example: 1 KB Direct Mapped Cache
• For a 2^N byte cache:
– The uppermost (32 − N) bits are always the Cache Tag
– The lowest M bits are the Byte Select (Block Size = 2^M)
Block address breakdown for the 1 KB cache with 32-byte blocks (32 lines):
– bits 31–10: Cache Tag (Ex: 0x50)
– bits 9–5: Cache Index (Ex: 0x01)
– bits 4–0: Byte Select (Ex: 0x00)
The tag is stored as part of the cache "state", along with a Valid Bit. Line 0 holds Bytes 0–31; line 1 (index 0x01, tag 0x50 in the example) holds Bytes 32–63; …; line 31 holds Bytes 992–1023.
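A minimal C sketch of this organization (type and function names are illustrative, not from the slides):

#include <stdint.h>

#define BLOCK_SIZE 32        /* 2^5 bytes per block */
#define NUM_LINES  32        /* 1 KB / 32 B */

typedef struct {
    int      valid;                 /* Valid Bit */
    uint32_t tag;                   /* Cache Tag, stored as cache "state" */
    uint8_t  data[BLOCK_SIZE];      /* Cache Data */
} line_t;

static line_t cache[NUM_LINES];

/* Decompose a 32-bit address exactly as on the slide:
   bits 4:0 Byte Select, bits 9:5 Cache Index, bits 31:10 Cache Tag. */
static int lookup(uint32_t addr, uint8_t *byte_out)
{
    uint32_t byte_sel = addr & 0x1F;
    uint32_t index    = (addr >> 5) & 0x1F;
    uint32_t tag      = addr >> 10;

    line_t *line = &cache[index];
    if (line->valid && line->tag == tag) {      /* tag match => hit */
        *byte_out = line->data[byte_sel];
        return 1;
    }
    return 0;   /* miss: fetch the block, then set tag and valid */
}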
Set Associative Cache
• N-way set associative: N entries for each Cache
Index
– N direct mapped caches operate in parallel
• Example: Two-way set associative cache
– Cache Index selects a “set” from the cache
– The two tags in the set are compared to the input in parallel
– Data is selected based on the tag result
[Figure: two-way set-associative lookup. The Cache Index selects a set; each way holds a Valid bit, Cache Tag, and Cache Data (Cache Block 0 of each way shown). The Adr Tag is compared against both stored tags in parallel; the compare results are ORed to produce Hit and drive the Sel1/Sel0 mux that picks the Cache Block.]
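Extending the sketch to two ways, 16 sets of two 32-byte lines for the same 1 KB (names are again illustrative; real hardware does the two tag compares in parallel rather than in a loop):

#include <stdint.h>

#define BLOCK_SIZE 32
#define NUM_SETS   16                 /* 1 KB / (32 B x 2 ways) */
#define NUM_WAYS   2

typedef struct {
    int      valid;
    uint32_t tag;
    uint8_t  data[BLOCK_SIZE];
} line_t;

static line_t sets[NUM_SETS][NUM_WAYS];

static int lookup_2way(uint32_t addr, uint8_t *byte_out)
{
    uint32_t byte_sel = addr & (BLOCK_SIZE - 1);      /* bits 4:0 */
    uint32_t index    = (addr >> 5) & (NUM_SETS - 1); /* bits 8:5 */
    uint32_t tag      = addr >> 9;                    /* bits 31:9 */

    for (int way = 0; way < NUM_WAYS; way++) {        /* the two compares */
        line_t *line = &sets[index][way];
        if (line->valid && line->tag == tag) {
            *byte_out = line->data[byte_sel];         /* mux selects the way */
            return 1;                                 /* ORed into Hit */
        }
    }
    return 0;
}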
Disadvantage of Set Associative Cache
• N-way Set Associative Cache versus Direct Mapped
Cache:
– N comparators vs. 1
– Extra MUX delay for the data
– Data comes AFTER Hit/Miss decision and set selection
• In a direct mapped cache, Cache Block is available
BEFORE Hit/Miss:
– Possible to assume a hit and continue. Recover later if miss.
Review: Cache performance
• Miss-oriented Approach to Memory Access:
$$\text{CPUtime} = \text{IC} \times \left(\text{CPI}_{\text{Execution}} + \frac{\text{MemAccess}}{\text{Inst}} \times \text{MissRate} \times \text{MissPenalty}\right) \times \text{CycleTime}$$

$$\text{CPUtime} = \text{IC} \times \left(\text{CPI}_{\text{Execution}} + \frac{\text{MemMisses}}{\text{Inst}} \times \text{MissPenalty}\right) \times \text{CycleTime}$$

– CPI_Execution includes ALU and Memory instructions

• Separating out the Memory component entirely:
– AMAT = Average Memory Access Time
– CPI_AluOps does not include memory instructions

$$\text{CPUtime} = \text{IC} \times \left(\frac{\text{AluOps}}{\text{Inst}} \times \text{CPI}_{\text{AluOps}} + \frac{\text{MemAccess}}{\text{Inst}} \times \text{AMAT}\right) \times \text{CycleTime}$$

$$\text{AMAT} = \text{HitTime} + \text{MissRate} \times \text{MissPenalty}$$
$$= \%\text{Inst} \times \left(\text{HitTime}_{\text{Inst}} + \text{MissRate}_{\text{Inst}} \times \text{MissPenalty}_{\text{Inst}}\right) + \%\text{Data} \times \left(\text{HitTime}_{\text{Data}} + \text{MissRate}_{\text{Data}} \times \text{MissPenalty}_{\text{Data}}\right)$$

(The instruction/data split is weighted by each stream's share of accesses, as in the Harvard example below.)
Impact on Performance
• Suppose a processor executes at
– Clock Rate = 200 MHz (5 ns per cycle), Ideal (no misses) CPI = 1.1
– 50% arith/logic, 30% ld/st, 20% control
• Suppose that 10% of memory operations get 50 cycle
miss penalty
• Suppose that 1% of instructions get same miss penalty
• CPI = ideal CPI + average stalls per instruction
1.1(cycles/ins) +
[ 0.30 (DataMops/ins)
x 0.10 (miss/DataMop) x 50 (cycle/miss)] +
[ 1 (InstMop/ins)
x 0.01 (miss/InstMop) x 50 (cycle/miss)]
= (1.1 + 1.5 + .5) cycle/ins = 3.1
• ≈65% of the time the processor is stalled waiting for memory (2.0 of 3.1 cycles per instruction)!
• AMAT = (1/1.3)×[1+0.01×50] + (0.3/1.3)×[1+0.1×50] = 2.54 cycles
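A throwaway C check of this arithmetic (not part of the original slides):

#include <stdio.h>

int main(void)
{
    double ideal_cpi    = 1.1;
    double data_refs    = 0.30;     /* loads/stores per instruction */
    double inst_refs    = 1.0;      /* one fetch per instruction */
    double data_mr      = 0.10, inst_mr = 0.01;
    double miss_penalty = 50.0;     /* cycles */

    double cpi  = ideal_cpi + data_refs * data_mr * miss_penalty
                            + inst_refs * inst_mr * miss_penalty;
    double accesses = inst_refs + data_refs;            /* 1.3 per instr */
    double amat = (inst_refs / accesses) * (1 + inst_mr * miss_penalty)
                + (data_refs / accesses) * (1 + data_mr * miss_penalty);

    printf("CPI = %.1f, AMAT = %.2f\n", cpi, amat);     /* 3.1, 2.54 */
    return 0;
}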

Example: Harvard Architecture
• Unified vs Separate I&D (Harvard)

• Table on page 384:


– 16KB I&D: Inst miss rate=0.64%, Data miss rate=6.47%
– 32KB unified: Aggregate miss rate=1.99%
• Which is better (ignore L2 cache)?
– Assume 33% data ops ⇒ 75% of accesses are instruction fetches (1.0/1.33)
– hit time=1, miss time=50
– Note that data hit has 1 stall for unified cache (only one port)

AMAT_Harvard = 75% × (1 + 0.64% × 50) + 25% × (1 + 6.47% × 50) = 2.05
AMAT_Unified = 75% × (1 + 1.99% × 50) + 25% × (1 + 1 + 1.99% × 50) = 2.24

[Figure: the two organizations. Harvard: Proc with separate I-Cache-1 and D-Cache-1, backed by a Unified Cache-2. Unified: Proc with a single Unified Cache-1, backed by a Unified Cache-2.]
Review: Four Questions for
Memory Hierarchy Designers
• Q1: Where can a block be placed in the upper level?
(Block placement)
– Fully Associative, Set Associative, Direct Mapped
• Q2: How is a block found if it is in the upper level?
(Block identification)
– Tag/Block
• Q3: Which block should be replaced on a miss?
(Block replacement)
– Random, LRU
• Q4: What happens on a write?
(Write strategy)
– Write Back or Write Through (with Write Buffer)

Review: Improving Cache
Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

Reducing Misses
• Classifying Misses: 3 Cs
– Compulsory—The first access to a block is not in the cache, so the block
must be brought into the cache. Also called cold start misses or first
reference misses.
(Misses in even an Infinite Cache)
– Capacity—If the cache cannot contain all the blocks needed during execution
of a program, capacity misses will occur due to blocks being discarded and
later retrieved.
(Misses in Fully Associative Size X Cache)
– Conflict—If block-placement strategy is set associative or direct mapped,
conflict misses (in addition to compulsory & capacity misses) will occur because
a block can be discarded and later retrieved if too many blocks map to its set.
Also called collision misses or interference misses.
(Misses in N-way Associative, Size X Cache)
• More recent, 4th “C”:
– Coherence - Misses caused by cache coherence.

3Cs Absolute Miss Rate (SPEC92)
[Figure: miss rate per type (0–0.14) vs. cache size (1–128 KB), SPEC92, for 1-way, 2-way, 4-way, and 8-way. The conflict component shrinks as associativity rises; the capacity component dominates; the compulsory component is vanishingly small.]
2:1 Cache Rule
miss rate of a 1-way associative cache of size X
≈ miss rate of a 2-way associative cache of size X/2
[Figure: the same 3Cs miss-rate plot (1–128 KB), illustrating the 2:1 rule.]
3Cs Relative Miss Rate
[Figure: miss rate per type as a percentage of all misses (0–100%) vs. cache size (1–128 KB), for 1-way, 2-way, 4-way, and 8-way.]
• Flaws: for fixed block size
• Good: insight => invention
How Can We Reduce Misses?
• 3 Cs: Compulsory, Capacity, Conflict
• In all cases, assume total cache size not changed:
• What happens if:
1) Change Block Size:
Which of 3Cs is obviously affected?

2) Change Associativity:
Which of 3Cs is obviously affected?

3) Change Compiler:
Which of 3Cs is obviously affected?

1. Reduce Misses via Larger Block Size
[Figure: miss rate (0–25%) vs. block size (16–256 bytes) for cache sizes 1K, 4K, 16K, 64K, 256K. Larger blocks cut the miss rate at first, but for small caches very large blocks drive it back up.]
2. Reduce Misses via Higher Associativity
• 2:1 Cache Rule:
– Miss Rate of DM cache of size N ≈ Miss Rate of 2-way cache of size N/2
• Beware: execution time is the only final measure!
– Will Clock Cycle time increase?
– Hill [1988] suggested hit time for 2-way vs. 1-way
external cache +10%,
internal + 2%

Example: Avg. Memory Access Time vs. Miss Rate
• Example: assume CCT (clock cycle time) = 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way vs. CCT of direct mapped
Cache Size (KB)   1-way   2-way   4-way   8-way
      1           2.33    2.15    2.07    2.01
      2           1.98    1.86    1.76    1.68
      4           1.72    1.67    1.61    1.53
      8           1.46    1.48    1.47    1.43
     16           1.29    1.32    1.32    1.32
     32           1.20    1.24    1.25    1.27
     64           1.14    1.20    1.21    1.23
    128           1.10    1.17    1.18    1.20

(In the original slide, red marks entries where A.M.A.T. is not improved by more associativity; from the data, that is most entries at 8 KB and above.)

3. Reducing Misses via a "Victim Cache"
• How to combine the fast hit time of direct mapped yet still avoid conflict misses?
• Add a buffer to place data discarded from the cache
• Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct mapped data cache
• Used in Alpha, HP machines
[Figure: a small fully associative buffer of four entries, each a tag-and-comparator plus one cache line of data, sitting between the cache and the next lower level in the hierarchy.]
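A sketch of the victim-cache probe on a main-cache miss (the structure is illustrative; Jouppi's victim cache is a small fully associative buffer, and a victim hit swaps the line back into the main cache):

#include <stdint.h>

#define BLOCK_SIZE     32
#define VICTIM_ENTRIES 4            /* 4 entries, as in Jouppi [1990] */

typedef struct {
    int      valid;
    uint32_t block_addr;            /* full block address acts as the tag */
    uint8_t  data[BLOCK_SIZE];
} victim_line_t;

static victim_line_t victim[VICTIM_ENTRIES];

/* Called only on a main-cache miss: search all entries (fully
   associative). A hit here is a conflict miss avoided. */
static int victim_probe(uint32_t block_addr, victim_line_t **hit)
{
    for (int i = 0; i < VICTIM_ENTRIES; i++) {
        if (victim[i].valid && victim[i].block_addr == block_addr) {
            *hit = &victim[i];      /* swap with the line being evicted */
            return 1;
        }
    }
    return 0;                       /* go to the next lower level */
}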
4. Reducing Misses via "Pseudo-Associativity"
• How to combine fast hit time of Direct Mapped and have the
lower conflict misses of 2-way SA cache?
• Divide the cache: on a miss, check the other half of the cache to see if the data is there; if so, we have a pseudo-hit (slow hit)

[Timing: a normal access takes Hit Time; a pseudo hit takes Pseudo Hit Time; a miss pays the full Miss Penalty.]
• Drawback: CPU pipeline design is hard if a hit can take 1 or 2 cycles
– Better for caches not tied directly to processor (L2)
– Used in MIPS R10000 L2 cache; similar in UltraSPARC
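One common way to pick the "other half", sketched in C (the bit-flip scheme is an assumption; the slide does not specify it):

#include <stdint.h>

#define OFFSET_BITS 5
#define INDEX_BITS  5
#define NUM_LINES   (1 << INDEX_BITS)

/* Primary probe uses the normal direct-mapped index; the "other half"
   is reached by flipping the top index bit. A hit there is a slow
   pseudo-hit, after which the two lines can be swapped so the next
   access is fast. */
static inline uint32_t primary_index(uint32_t addr)
{
    return (addr >> OFFSET_BITS) & (NUM_LINES - 1);
}

static inline uint32_t alternate_index(uint32_t index)
{
    return index ^ (1u << (INDEX_BITS - 1));
}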

5. Reducing Misses by Hardware Prefetching of Instructions & Data
• E.g., Instruction Prefetching
– Alpha 21064 fetches 2 blocks on a miss
– Extra block placed in “stream buffer”
– On miss check stream buffer
• Works with data blocks too:
– Jouppi [1990]: 1 data stream buffer caught 25% of misses from a 4KB cache; 4 streams caught 43%
– Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of misses from two 64KB, 4-way set associative caches
• Prefetching relies on having extra memory
bandwidth that can be used without penalty

6. Reducing Misses by Software Prefetching Data
• Data Prefetch
– Load data into register (HP PA-RISC loads)
– Cache Prefetch: load into cache (MIPS IV, PowerPC, SPARC v. 9)
– Special prefetching instructions cannot cause faults; a form of
speculative execution
• Prefetching comes in two flavors:
– Binding prefetch: Requests load directly into register.
» Must be correct address and register!
– Non-Binding prefetch: Load into cache.
» Can be incorrect. Frees HW/SW to guess!
• Issuing Prefetch Instructions takes time
– Is cost of prefetch issues < savings in reduced misses?
– Higher superscalar reduces difficulty of issue bandwidth
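A sketch of a non-binding cache prefetch, using GCC's __builtin_prefetch as a stand-in for the MIPS IV / PowerPC / SPARC V9 prefetch instructions (the loop and the prefetch distance are illustrative):

#define PREFETCH_DIST 16   /* tuning assumption */

long sum_with_prefetch(const long *a, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++) {
        /* Non-binding hint: start fetching a[i + PREFETCH_DIST] into
           the cache while working on a[i]. A wrong hint wastes issue
           slots but cannot fault. */
        if (i + PREFETCH_DIST < n)
            __builtin_prefetch(&a[i + PREFETCH_DIST], 0 /* read */, 1);
        sum += a[i];
    }
    return sum;
}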

7. Reducing Misses by Compiler Optimizations
• McFarling [1989] reduced cache misses by 75% on an 8KB direct mapped cache with 4 byte blocks, in software
• Instructions
– Reorder procedures in memory so as to reduce conflict misses
– Profiling to look at conflicts (using tools they developed)
• Data
– Merging Arrays: improve spatial locality by single array of compound elements
vs. 2 arrays
– Loop Interchange: change nesting of loops to access data in order stored in
memory
– Loop Fusion: Combine 2 independent loops that have same looping and some
variables overlap
– Blocking: Improve temporal locality by accessing “blocks” of data repeatedly
vs. going down whole columns or rows

Merging Arrays Example
/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
int val;
int key;
};
struct merge merged_array[SIZE];

Reduces conflicts between val & key and improves spatial locality.

Loop Interchange Example
/* Before */
for (k = 0; k < 100; k = k+1)
for (j = 0; j < 100; j = j+1)
for (i = 0; i < 5000; i = i+1)
x[i][j] = 2 * x[i][j];
/* After */
for (k = 0; k < 100; k = k+1)
for (i = 0; i < 5000; i = i+1)
for (j = 0; j < 100; j = j+1)
x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improved spatial locality.
Loop Fusion Example
/* Before */
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
d[i][j] = a[i][j] + c[i][j];
/* After */
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
{ a[i][j] = 1/b[i][j] * c[i][j];
d[i][j] = a[i][j] + c[i][j];}

Before: 2 misses per access to a & c; after: one miss per access. Improves temporal locality, since a[i][j] is reused while still in the cache.
Blocking Example
/* Before */
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
{r = 0;
for (k = 0; k < N; k = k+1){
r = r + y[i][k]*z[k][j];};
x[i][j] = r;
};
• Two Inner Loops:
– Read all NxN elements of z[]
– Read N elements of 1 row of y[] repeatedly
– Write N elements of 1 row of x[]
• Capacity misses are a function of N & cache size:
– 2N³ + N² words accessed (z: N³, y: N³, x: N²), assuming no conflict misses; otherwise …
• Idea: compute on BxB submatrix that fits
Blocking Example
/* After */
for (jj = 0; jj < N; jj = jj+B)
for (kk = 0; kk < N; kk = kk+B)
for (i = 0; i < N; i = i+1)
for (j = jj; j < min(jj+B,N); j = j+1)
{r = 0;
for (k = kk; k < min(kk+B,N); k = k+1) {
r = r + y[i][k]*z[k][j];};
x[i][j] = x[i][j] + r;
};

• B is called the Blocking Factor
• Capacity misses fall from 2N³ + N² to N³/B + 2N²
• Conflict misses too?

Reducing Conflict Misses by Blocking
[Figure: miss rate (0–0.1) vs. blocking factor (0–150) for a direct mapped cache and a fully associative cache; the fully associative curve is lower and smoother.]
• Conflict misses in caches that are not fully associative, vs. blocking size
– Lam et al [1991]: a blocking factor of 24 had one-fifth the misses of 48, despite both fitting in the cache
Summary of Compiler Optimizations to
Reduce Cache Misses (by hand)
[Figure: performance improvement (1× to 3×) from merged arrays, loop interchange, loop fusion, and blocking on vpenta (nasa7), gmty (nasa7), tomcatv, btrix (nasa7), mxm (nasa7), spice, cholesky (nasa7), and compress.]
Summary: Miss Rate Reduction
$$\text{CPUtime} = \text{IC} \times \left(\text{CPI}_{\text{Execution}} + \frac{\text{Memory accesses}}{\text{Instruction}} \times \text{Miss rate} \times \text{Miss penalty}\right) \times \text{Clock cycle time}$$

• 3 Cs: Compulsory, Capacity, Conflict
1. Reduce Misses via Larger Block Size
2. Reduce Misses via Higher Associativity
3. Reducing Misses via Victim Cache
4. Reducing Misses via Pseudo-Associativity
5. Reducing Misses by HW Prefetching Instr, Data
6. Reducing Misses by SW Prefetching Data
7. Reducing Misses by Compiler Optimizations
• Prefetching comes in two flavors:
– Binding prefetch: Requests load directly into register.
» Must be correct address and register!
– Non-Binding prefetch: Load into cache.
» Can be incorrect. Frees HW/SW to guess!
Review: Improving Cache
Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

Write Policy:
Write-Through vs Write-Back
• Write-through: all writes update cache and underlying
memory/cache
– Can always discard cached data - most up-to-date data is in memory
– Cache control bit: only a valid bit
• Write-back: all writes simply update cache
– Can’t just discard cached data - may have to write it back to memory
– Cache control bits: both valid and dirty bits
• Other Advantages:
– Write-through:
» memory (or other processors) always have latest data
» Simpler management of cache
– Write-back:
» much lower bandwidth, since data often overwritten multiple times
» Better tolerance to long-latency memory?
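The two policies side by side, as a sketch (the helper names and the memory array are illustrative stand-ins for the lower level):

#include <stdint.h>

#define BLOCK_SIZE 32
#define MEM_SIZE   4096

typedef struct {
    int      valid;
    int      dirty;                 /* write-back needs valid AND dirty */
    uint32_t tag;
    uint8_t  data[BLOCK_SIZE];
} line_t;

static uint8_t memory[MEM_SIZE];    /* stand-in for memory/lower cache */

/* Write-through: every store updates both the cache and memory,
   so memory always has the latest data and eviction is free. */
static void store_write_through(line_t *line, uint32_t addr, uint8_t b)
{
    line->data[addr % BLOCK_SIZE] = b;
    memory[addr % MEM_SIZE] = b;
}

/* Write-back: a store only updates the cache and sets dirty;
   the block is written to memory when it is evicted. */
static void store_write_back(line_t *line, uint32_t addr, uint8_t b)
{
    line->data[addr % BLOCK_SIZE] = b;
    line->dirty = 1;
}

static void evict(line_t *line, uint32_t block_base)
{
    if (line->valid && line->dirty)
        for (int i = 0; i < BLOCK_SIZE; i++)
            memory[(block_base + i) % MEM_SIZE] = line->data[i];
    line->valid = line->dirty = 0;
}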

Write Policy 2: Write Allocate vs Non-Allocate
(What happens on a write miss?)

• Write allocate: allocate a new cache line in the cache
– Usually means that you have to do a "read miss" to fill in the rest of the cache line!
– Alternative: per-word valid bits
• Write non-allocate (or “write-around”):
– Simply send write data through to underlying
memory/cache - don’t allocate new cache line!

1. Reducing Miss Penalty: Read Priority over Write on Miss

[Figure: a write buffer sits between the CPU and DRAM (or the lower memory level); stores enter the buffer so the CPU does not wait, and reads can be checked against the buffered writes.]
1. Reducing Miss Penalty: Read Priority over Write on Miss
• Write-through with write buffers offers RAW conflicts with main memory reads on cache misses
– If we simply wait for the write buffer to empty, the read miss penalty might increase (by 50% on the old MIPS 1000)
– Check write buffer contents before read;
if no conflicts, let the memory access continue
• Write-back also wants a buffer to hold displaced blocks
– Read miss replacing a dirty block
– Normal: write the dirty block to memory, and then do the read
– Instead: copy the dirty block to a write buffer, then do the read, and then do the write
– CPU stalls less since it can restart as soon as the read is done
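A sketch of the conflict check (the structure and names are illustrative; hardware implements this as comparators on the buffer entries):

#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4
#define BLOCK_MASK (~(uint32_t)31)  /* 32-byte blocks, illustrative */

typedef struct {
    bool     valid;
    uint32_t addr;
    uint32_t data;
} wb_entry_t;

static wb_entry_t write_buffer[WB_ENTRIES];

/* Before sending a read miss to memory, scan the write buffer.
   Only on a match must the read wait (or forward the buffered data). */
static bool conflicts_with_write_buffer(uint32_t read_addr)
{
    for (int i = 0; i < WB_ENTRIES; i++)
        if (write_buffer[i].valid &&
            (write_buffer[i].addr & BLOCK_MASK) == (read_addr & BLOCK_MASK))
            return true;            /* RAW hazard: drain or forward first */
    return false;                   /* safe: read proceeds ahead of writes */
}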

2. Reduce Miss Penalty: Early Restart and Critical Word First
• Don’t wait for full block to be loaded before
restarting CPU
– Early restart—As soon as the requested word of the block
arrives, send it to the CPU and let the CPU continue execution
– Critical Word First—Request the missed word first from memory
and send it to the CPU as soon as it arrives; let the CPU continue
execution while filling the rest of the words in the block. Also
called wrapped fetch and requested word first
• Generally useful only with large blocks
• Spatial locality is a problem: programs tend to want the next sequential word, so it is not clear whether early restart helps

3. Reduce Miss Penalty: Non-blocking Caches to Reduce Stalls on Misses
• Non-blocking cache or lockup-free cache allow data cache
to continue to supply cache hits during a miss
– requires F/E bits on registers or out-of-order execution
– requires multi-bank memories
• “hit under miss” reduces the effective miss penalty by
working during miss vs. ignoring CPU requests
• “hit under multiple miss” or “miss under miss” may further
lower the effective miss penalty by overlapping multiple
misses
– Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
– Requires multiple memory banks (otherwise multiple misses cannot be serviced)
– Pentium Pro allows 4 outstanding memory misses
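The usual bookkeeping is a set of Miss Status Holding Registers (MSHRs), one per outstanding miss; the slide does not name them, so the following is a generic sketch:

#include <stdbool.h>
#include <stdint.h>

#define NUM_MSHRS 4                 /* e.g. 4 outstanding misses, as in the Pentium Pro */

typedef struct {
    bool     valid;
    uint32_t block_addr;            /* miss in flight for this block */
    uint8_t  dest_reg;              /* where to deliver the data */
} mshr_t;

static mshr_t mshr[NUM_MSHRS];

/* Allocate an MSHR for a new miss; return -1 if all are busy,
   which is the only case where the cache must block. */
static int mshr_allocate(uint32_t block_addr, uint8_t dest_reg)
{
    for (int i = 0; i < NUM_MSHRS; i++) {
        if (!mshr[i].valid) {
            mshr[i] = (mshr_t){ true, block_addr, dest_reg };
            return i;
        }
    }
    return -1;
}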

Value of Hit Under Miss for SPEC
[Figure: average memory access time under "hit under n misses" for n = 0 (base), 1 (0->1), 2 (1->2), and 64 (2->64), for SPEC programs. Integer: eqntott, espresso, xlisp, compress; floating point: tomcatv, swm256, doduc, su2cor, wave5, mdljdp2, hydro2d, alvinn, nasa7, spice2g6, ora, ear, mdljsp2, fpppp.]

• FP programs on average: AMAT= 0.68 -> 0.52 -> 0.34 -> 0.26
• Int programs on average: AMAT= 0.24 -> 0.20 -> 0.19 -> 0.19
• 8 KB Data Cache, Direct Mapped, 32B blocks, 16-cycle miss penalty
4: Add a second-level cache
• L2 Equations:

$$\text{AMAT} = \text{HitTime}_{L1} + \text{MissRate}_{L1} \times \text{MissPenalty}_{L1}$$
$$\text{MissPenalty}_{L1} = \text{HitTime}_{L2} + \text{MissRate}_{L2} \times \text{MissPenalty}_{L2}$$
$$\text{AMAT} = \text{HitTime}_{L1} + \text{MissRate}_{L1} \times \left(\text{HitTime}_{L2} + \text{MissRate}_{L2} \times \text{MissPenalty}_{L2}\right)$$
• Definitions:
– Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)
– Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 × Miss Rate_L2)
– Global miss rate is what matters
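A worked example with illustrative numbers (not from the slide): suppose 4% of the CPU's accesses miss in L1, and 20% of those also miss in L2. Then

$$\text{MissRate}_{local,L2} = 20\%, \qquad \text{MissRate}_{global,L2} = 4\% \times 20\% = 0.8\%$$

The local rate looks alarming only because L2 sees just the accesses that L1 already filtered; the global rate is what enters the AMAT equation.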

Comparing Local and Global Miss Rates
[Figure: local and global miss rates vs. second-level cache size, for a 32 KByte first-level cache, plotted on linear and log scales.]
• Global miss rate is close to the single-level cache miss rate, provided L2 >> L1
• Don't use the local miss rate
• L2 is not tied to the CPU clock cycle!
• What matters for L2 is cost & A.M.A.T.
• Generally: fast hit times and fewer misses
• Since L2 hits are few, target miss reduction
Reducing Misses:
Which apply to L2 Cache?
• Reducing Miss Rate
1. Reduce Misses via Larger Block Size
2. Reduce Conflict Misses via Higher Associativity
3. Reducing Conflict Misses via Victim Cache
4. Reducing Conflict Misses via Pseudo-Associativity
5. Reducing Misses by HW Prefetching Instr, Data
6. Reducing Misses by SW Prefetching Data
7. Reducing Capacity/Conf. Misses by Compiler Optimizations

L2 cache block size & A.M.A.T.
[Figure: relative CPU time vs. L2 block size:]
Block Size (bytes):   16     32     64     128    256    512
Relative CPU Time:    1.95   1.54   1.36   1.28   1.27   1.34
• 32KB L1, 8 byte path to memory
Reducing Miss Penalty Summary
$$\text{CPUtime} = \text{IC} \times \left(\text{CPI}_{\text{Execution}} + \frac{\text{Memory accesses}}{\text{Instruction}} \times \text{Miss rate} \times \text{Miss penalty}\right) \times \text{Clock cycle time}$$
• Four techniques
– Read priority over write on miss
– Early Restart and Critical Word First on miss
– Non-blocking Caches (Hit under Miss, Miss under Miss)
– Second Level Cache
• Can be applied recursively to Multilevel Caches
– Danger is that time to DRAM will grow with multiple levels in
between
– First attempts at L2 caches can make things worse, since they increase the worst-case miss penalty

What is the Impact of What You've Learned About Caches?
[Figure: processor vs. DRAM performance, 1980–2000, as at the start of the lecture.]
• 1960–1985: Speed = ƒ(no. operations)
• 1990: Pipelined Execution & Fast Clock Rate; Out-of-Order execution; Superscalar Instruction Issue
• 1998: Speed = ƒ(non-cached memory accesses)
• Superscalar, Out-of-Order machines hide L1 data cache misses (≈5 clocks) but not L2 cache misses (≈50 clocks)?

Cache Optimization Summary
Technique                            MR   MP   HT   Complexity
Miss rate:
  Larger Block Size                   +    –         0
  Higher Associativity                +         –    1
  Victim Caches                       +              2
  Pseudo-Associative Caches           +              2
  HW Prefetching of Instr/Data        +              2
  Compiler Controlled Prefetching     +              3
  Compiler Reduce Misses              +              0
Miss penalty:
  Priority to Read Misses                  +         1
  Early Restart & Critical Word 1st        +         2
  Non-Blocking Caches                      +         3
  Second Level Caches                      +         2

