CS252 Graduate Computer Architecture
Caches and Memory Systems I
Question: Who Cares About the
Memory Hierarchy?
[Figure: processor vs. DRAM performance, 1980-2000, log scale. CPU performance (“Moore’s Law”) grows ~60%/yr while DRAM (“Less’ Law?”) grows ~7%/yr, so the processor-memory performance gap grows ~50%/yr.]
• 1980: no cache in µproc; 1995: 2-level cache on chip
(1989: first Intel µproc with a cache on chip)
Generations of Microprocessors
• Time of a full cache miss in instructions executed:
  1st Alpha: 340 ns / 5.0 ns =  68 clks × 2 instr/clk = 136 instructions
  2nd Alpha: 266 ns / 3.3 ns =  80 clks × 4 instr/clk = 320 instructions
  3rd Alpha: 180 ns / 1.7 ns = 108 clks × 6 instr/clk = 648 instructions
• 1/2× the latency × 3× the clock rate × 3× the instructions/clock => ~5× the miss cost (648/136 ≈ 4.8)
Processor-Memory
Performance Gap “Tax”
Processor          % Area (cost)   % Transistors (power)
Alpha 21164              37%               77%
StrongArm SA110          61%               94%
Pentium Pro              64%               88%
  (2 dies per package: Proc/I$/D$ + L2$)
• Caches have no inherent value; they only try to close the performance gap
What is a cache?
• Small, fast storage used to improve average access
time to slow memory.
• Exploits spatial and temporal locality
• In computer architecture, almost everything is a cache!
  – Registers: a cache on variables
  – First-level cache: a cache on the second-level cache
  – Second-level cache: a cache on memory
  – Memory: a cache on disk (virtual memory)
  – TLB: a cache on the page table
  – Branch prediction: a cache on prediction information?
[Figure: the hierarchy Proc/Regs → L1-Cache → L2-Cache → Memory; levels get bigger moving away from the processor and faster moving toward it.]
Set Associative Cache
• N-way set associative: N entries for each Cache Index
  – N direct-mapped caches operate in parallel
• Example: Two-way set associative cache
  – Cache Index selects a “set” from the cache
  – The two tags in the set are compared to the input tag in parallel
  – Data is selected based on the tag comparison result
[Figure: two-way set associative lookup. The Cache Index selects one set; the two Cache Tags are compared against the Adr Tag in parallel, the compare results are ORed to form Hit, and a mux (Sel1/Sel0) selects the hitting way’s Cache Block.]
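To make the lookup concrete, here is a minimal C model of a two-way lookup; the 8 KB / 32-byte-block / 2-way geometry and all names are illustrative assumptions:

#include <stdint.h>
#include <stdbool.h>

#define BLOCK_BITS 5                      /* 32-byte blocks */
#define INDEX_BITS 7
#define NUM_SETS   (1u << INDEX_BITS)     /* 128 sets x 2 ways x 32 B = 8 KB */

typedef struct {
    bool     valid;
    uint32_t tag;
    uint8_t  data[1 << BLOCK_BITS];
} CacheLine;

static CacheLine cache[NUM_SETS][2];      /* [set][way] */

/* The two tag compares happen in parallel in hardware (Compare + OR);
   this software model simply tries each way in turn. */
bool lookup(uint32_t addr, uint8_t **block_out)
{
    uint32_t index = (addr >> BLOCK_BITS) & (NUM_SETS - 1);
    uint32_t tag   = addr >> (BLOCK_BITS + INDEX_BITS);

    for (int way = 0; way < 2; way++) {
        CacheLine *line = &cache[index][way];
        if (line->valid && line->tag == tag) {
            *block_out = line->data;      /* models the Sel0/Sel1 mux */
            return true;                  /* hit */
        }
    }
    return false;                         /* miss */
}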
Disadvantage of Set Associative Cache
• N-way Set Associative Cache versus Direct Mapped
Cache:
– N comparators vs. 1
– Extra MUX delay for the data
– Data comes AFTER Hit/Miss decision and set selection
• In a direct mapped cache, Cache Block is available
BEFORE Hit/Miss:
– Possible to assume a hit and continue. Recover later if miss.
Review: Cache performance
• Miss-oriented approach to memory access:

  CPUtime = IC × (CPI_Execution + (MemAccess/Inst) × MissRate × MissPenalty) × CycleTime

  CPUtime = IC × (CPI_Execution + (MemMisses/Inst) × MissPenalty) × CycleTime
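For instance (illustrative numbers only): with CPI_Execution = 1.1, 1.3 memory accesses per instruction, a 2% miss rate, and a 50-cycle miss penalty,

  CPUtime = IC × (1.1 + 1.3 × 0.02 × 50) × CycleTime = IC × 2.4 × CycleTime

so memory stalls more than double the effective CPI.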
Example: Harvard Architecture
• Unified vs. Separate I&D (Harvard)
[Figure: left, a processor with a single unified Cache-1 backed by a unified Cache-2; right, a processor with separate I-Cache-1 and D-Cache-1 backed by a unified Cache-2. Separate caches let an instruction fetch and a data access proceed in the same cycle.]
Review: Four Questions for
Memory Hierarchy Designers
• Q1: Where can a block be placed in the upper level?
(Block placement)
– Fully Associative, Set Associative, Direct Mapped
• Q2: How is a block found if it is in the upper level?
(Block identification)
– Tag/Block
• Q3: Which block should be replaced on a miss?
(Block replacement)
– Random, LRU
• Q4: What happens on a write?
(Write strategy)
– Write Back or Write Through (with Write Buffer)
Review: Improving Cache
Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
Reducing Misses
• Classifying Misses: 3 Cs
  – Compulsory—The first access to a block can never hit in the cache, so the block must be brought into the cache. Also called cold-start misses or first-reference misses.
    (Misses in even an infinite cache)
  – Capacity—If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur as blocks are discarded and later retrieved.
    (Misses in a fully associative cache of size X)
  – Conflict—If the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved when too many blocks map to its set. Also called collision misses or interference misses.
    (Misses in an N-way associative cache of size X)
• More recently, a 4th “C”:
  – Coherence—misses caused by cache coherence (e.g., invalidations from other processors).
3Cs Absolute Miss Rate
(SPEC92)
[Figure: absolute miss rate per type vs. cache size (1-128 KB) for 1-, 2-, 4-, and 8-way associativity (SPEC92). Conflict misses shrink as associativity rises, capacity misses dominate, and compulsory misses are vanishingly small.]
2:1 Cache Rule
miss rate of a 1-way associative cache of size X
  ≈ miss rate of a 2-way associative cache of size X/2
[Figure: the same 3Cs absolute miss-rate plot as above, annotated with the 2:1 rule.]
3Cs Relative Miss Rate
[Figure: relative miss rate per type (0%-100%) vs. cache size (1-128 KB) for 1-, 2-, 4-, and 8-way associativity.]
Flaws: for a fixed block size.
Good: insight => invention.
How Can We Reduce Misses?
• 3 Cs: Compulsory, Capacity, Conflict
• In all cases, assume the total cache size is unchanged
• What happens if:
1) Change Block Size:
Which of 3Cs is obviously affected?
2) Change Associativity:
Which of 3Cs is obviously affected?
3) Change Compiler:
Which of 3Cs is obviously affected?
1. Reduce Misses via Larger
Block Size
[Figure: miss rate vs. block size (16-256 bytes) for cache sizes 1K-256K. Larger blocks first lower the miss rate (spatial locality), then raise it for small caches as too few blocks remain.]
2. Reduce Misses via Higher
Associativity
• 2:1 Cache Rule:
  – Miss rate of a DM cache of size N ≈ miss rate of a 2-way cache of size N/2
• Beware: execution time is the only final measure!
  – Will clock cycle time increase?
  – Hill [1988] suggested the hit time for 2-way vs. 1-way is +10% for an external cache, +2% for an internal cache
Example: Avg. Memory Access
Time vs. Miss Rate
• Example: assume CCT = 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way vs. CCT of direct mapped

  Cache Size (KB)   1-way   2-way   4-way   8-way
        1           2.33    2.15    2.07    2.01
        2           1.98    1.86    1.76    1.68
        4           1.72    1.67    1.61    1.53
        8           1.46    1.48    1.47    1.43
       16           1.29    1.32    1.32    1.32
       32           1.20    1.24    1.25    1.27
       64           1.14    1.20    1.21    1.23
      128           1.10    1.17    1.18    1.20

  (AMAT in cycles; from 8 KB up, the slower clock of higher associativity outweighs the miss-rate gain.)
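The table is just AMAT = HitTime × CCT + MissRate × MissPenalty evaluated per configuration. A minimal C sketch; the 10-cycle miss penalty and the miss rates below are back-derived so the 1 KB row matches, not values from the slides:

#include <stdio.h>

/* AMAT in direct-mapped clock cycles: a 1-cycle hit stretched by the
   clock-cycle-time (CCT) factor, plus the expected miss stalls. */
double amat(double cct_factor, double miss_rate, double miss_penalty)
{
    return 1.0 * cct_factor + miss_rate * miss_penalty;
}

int main(void)
{
    printf("1 KB, 1-way: %.2f\n", amat(1.00, 0.133, 10.0));  /* 2.33 */
    printf("1 KB, 2-way: %.2f\n", amat(1.10, 0.105, 10.0));  /* 2.15 */
    return 0;
}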
3. Reducing Misses via a
“Victim Cache”
• How to combine the fast hit time of direct mapped yet still avoid conflict misses?
• Add a buffer to hold data discarded from the cache
• Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct-mapped data cache
• Used in Alpha, HP machines
[Figure: beside the main TAGS/DATA arrays, a fully associative victim cache of four entries, each a tag-and-comparator paired with one cache line of data.]
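In software-model form, the probe that happens on a main-cache miss might look like this (a sketch: the 4-entry, fully associative organization follows Jouppi’s experiment; names and the 32-byte line are assumptions):

#include <stdint.h>
#include <stdbool.h>

#define VC_ENTRIES 4

typedef struct {
    bool     valid;
    uint32_t block_addr;            /* full block address acts as the tag */
    uint8_t  data[32];
} VictimLine;

static VictimLine victim[VC_ENTRIES];

/* On a main-cache miss, probe every victim entry (fully associative).
   On a hit, the victim line and the line evicted from the main cache
   are swapped; on a miss, go on to the next memory level. */
bool victim_lookup(uint32_t block_addr, VictimLine **hit_line)
{
    for (int i = 0; i < VC_ENTRIES; i++) {
        if (victim[i].valid && victim[i].block_addr == block_addr) {
            *hit_line = &victim[i];
            return true;
        }
    }
    return false;
}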
4. Reducing Misses via
“Pseudo-Associativity”
• How to combine the fast hit time of direct mapped with the lower conflict misses of a 2-way SA cache?
• Divide the cache: on a miss, check the other half of the cache to see if the block is there; if so, it is a “pseudo-hit” (slow hit)
[Figure: access-time line; a pseudo-hit takes longer than a normal hit but much less than a full miss.]
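A minimal C sketch of the probe sequence, assuming the common implementation that finds the “other half” by flipping the most-significant index bit (all names and sizes here are illustrative):

#include <stdint.h>
#include <stdbool.h>

#define INDEX_BITS 8
#define NUM_SETS   (1u << INDEX_BITS)

static struct { bool valid; uint32_t tag; } cache[NUM_SETS];

static bool probe(uint32_t index, uint32_t tag)
{
    return cache[index].valid && cache[index].tag == tag;
}

/* 0 = fast hit in the home set, 1 = slow pseudo-hit in the other
   half (top index bit flipped), -1 = miss in both halves. */
int pseudo_assoc_lookup(uint32_t index, uint32_t tag)
{
    if (probe(index, tag)) return 0;
    if (probe(index ^ (NUM_SETS >> 1), tag)) return 1;
    return -1;
}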
5. Reducing Misses by Hardware
Prefetching of Instructions & Data
• E.g., instruction prefetching
  – Alpha 21064 fetches 2 blocks on a miss
  – The extra block is placed in a “stream buffer”
  – On a miss, check the stream buffer
• Works with data blocks too:
  – Jouppi [1990]: 1 data stream buffer caught 25% of the misses from a 4KB cache; 4 streams caught 43%
  – Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of the misses from two 64KB, 4-way set associative caches
• Prefetching relies on having extra memory bandwidth that can be used without penalty
6. Reducing Misses by
Software Prefetching Data
• Data prefetch
  – Load data into register (HP PA-RISC loads)
  – Cache prefetch: load into cache (MIPS IV, PowerPC, SPARC v. 9)
  – Special prefetching instructions cannot cause faults; a form of speculative execution
• Prefetching comes in two flavors:
  – Binding prefetch: requests load directly into a register.
    » Must be the correct address and register!
  – Non-binding prefetch: load into the cache (see the sketch after this list).
    » Can be incorrect. Frees HW/SW to guess!
• Issuing prefetch instructions takes time
  – Is the cost of issuing prefetches < the savings in reduced misses?
  – Wider superscalar issue reduces the difficulty of finding issue bandwidth for them
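On GCC or Clang, the non-binding flavor can be expressed with the real __builtin_prefetch(addr, rw, locality) builtin; the loop, the distance, and the parameter choices below are illustrative:

/* Non-binding cache prefetch: start fetching a[i+DIST] into the
   cache while summing a[i]; a useless prefetch cannot fault. */
#define DIST 16   /* prefetch distance in elements (a tuning assumption) */

long sum(const long *a, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST], 0 /* read */, 1 /* low reuse */);
        s += a[i];
    }
    return s;
}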
7. Reducing Misses by
Compiler Optimizations
• McFarling [1989] reduced cache misses by 75% on an 8KB direct-mapped cache with 4-byte blocks, in software
• Instructions
  – Reorder procedures in memory so as to reduce conflict misses
  – Profiling to look at conflicts (using tools they developed)
• Data
  – Merging arrays: improve spatial locality by using a single array of compound elements vs. 2 arrays
  – Loop interchange: change the nesting of loops to access data in the order it is stored in memory
  – Loop fusion: combine 2 independent loops that have the same looping and some variables in common
  – Blocking: improve temporal locality by accessing “blocks” of data repeatedly vs. going down whole columns or rows
Merging Arrays Example
/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];
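The merged version replaces the two parallel arrays with a single array of structures, so the val and key for the same index share a cache block (a direct instance of the transformation described above; SIZE as before):

/* After: 1 array of structures */
struct merge {
  int val;
  int key;
};
struct merge merged_array[SIZE];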
Loop Interchange Example
/* Before */
for (k = 0; k < 100; k = k+1)
for (j = 0; j < 100; j = j+1)
for (i = 0; i < 5000; i = i+1)
x[i][j] = 2 * x[i][j];
/* After */
for (k = 0; k < 100; k = k+1)
for (i = 0; i < 5000; i = i+1)
for (j = 0; j < 100; j = j+1)
x[i][j] = 2 * x[i][j];
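After the interchange, the inner loop walks x[i][j], x[i][j+1], ... which are adjacent in C’s row-major layout: sequential accesses instead of a stride of 100 words, improving spatial locality.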
Loop Fusion Example
/* Before */
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
d[i][j] = a[i][j] + c[i][j];
/* After */
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
{ a[i][j] = 1/b[i][j] * c[i][j];
d[i][j] = a[i][j] + c[i][j];}
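Before fusion, the second loop re-reads a[i][j] and c[i][j] long after the first loop touched them; after fusion, each element is consumed while it is still in the cache, turning two misses per access to a and c into one.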
Blocking Example
/* Before */
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
{r = 0;
for (k = 0; k < N; k = k+1){
r = r + y[i][k]*z[k][j];};
x[i][j] = r;
};
• Two inner loops:
  – Read all N×N elements of z[]
  – Read N elements of 1 row of y[] repeatedly
  – Write N elements of 1 row of x[]
• Capacity misses are a function of N & cache size:
  – 2N³ + N² words accessed => (assuming no conflict; otherwise …)
• Idea: compute on a B×B submatrix that fits in the cache
Blocking Example
/* After; assumes x[][] starts zeroed and min(a,b) is defined,
   e.g. #define min(a,b) ((a) < (b) ? (a) : (b)) */
for (jj = 0; jj < N; jj = jj+B)
for (kk = 0; kk < N; kk = kk+B)
for (i = 0; i < N; i = i+1)
for (j = jj; j < min(jj+B,N); j = j+1)
{r = 0;
for (k = kk; k < min(kk+B,N); k = k+1) {
r = r + y[i][k]*z[k][j];};
x[i][j] = x[i][j] + r;
};
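B is the blocking factor: with a B×B block of z[] and a row strip of y[] resident in the cache, the words accessed drop from 2N³ + N² to roughly 2N³/B + N².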
Reducing Conflict Misses by Blocking
[Figure: left, miss rate (0-0.1) vs. blocking factor (0-150); right, the resulting performance improvement (1x-3x).]
Write Policy:
Write-Through vs Write-Back
• Write-through: all writes update cache and underlying
memory/cache
– Can always discard cached data - most up-to-date data is in memory
– Cache control bit: only a valid bit
• Write-back: all writes simply update cache
– Can’t just discard cached data - may have to write it back to memory
– Cache control bits: both valid and dirty bits
• Other Advantages:
– Write-through:
» memory (or other processors) always have latest data
» Simpler management of cache
– Write-back:
» much lower bandwidth, since data often overwritten multiple times
» Better tolerance to long-latency memory?
Write Policy 2:
Write Allocate vs Non-Allocate
(What happens on write-miss)
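• Write allocate (“fetch on write”): on a write miss, allocate the block in the cache and then write into it; usually paired with write-back
• Write no-allocate (“write around”): on a write miss, send the write to the lower level without allocating a cache block; usually paired with write-through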
1. Reducing Miss Penalty:
Read Priority over Write on Miss
[Figure: the CPU’s reads and writes go through the cache; a write buffer between the cache and DRAM (or lower memory) holds outgoing writes.]
• Write-through with write buffers creates RAW conflicts with main-memory reads on cache misses
  – If we simply wait for the write buffer to empty, the read miss penalty might increase (by 50% on the old MIPS 1000)
  – Check write buffer contents before a read; if there are no conflicts, let the memory access continue (see the sketch after this list)
• Write-back also wants a buffer, to hold displaced dirty blocks
  – Read miss replacing a dirty block
  – Normal: write the dirty block to memory, and then do the read
  – Instead: copy the dirty block to a write buffer, then do the read, and then do the write
  – The CPU stalls less, since it restarts as soon as the read is done
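A minimal C sketch of the check-before-read rule; the buffer depth, field layout, and names are assumptions:

#include <stdint.h>
#include <stdbool.h>

#define WB_ENTRIES 4

typedef struct { bool valid; uint32_t addr; uint32_t data; } WBEntry;

static WBEntry write_buffer[WB_ENTRIES];

/* Before sending a read miss to memory, scan the write buffer.
   On an address match, forward the buffered value (the RAW hazard is
   satisfied here); otherwise the read may safely bypass the writes. */
bool read_check(uint32_t addr, uint32_t *data_out)
{
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (write_buffer[i].valid && write_buffer[i].addr == addr) {
            *data_out = write_buffer[i].data;
            return true;
        }
    }
    return false;   /* no conflict: let the memory read continue */
}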
2. Reduce Miss Penalty:
Early Restart and Critical Word
First
• Don’t wait for the full block to be loaded before restarting the CPU
  – Early restart—As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
  – Critical word first—Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first
• Generally useful only for large blocks
• Spatial locality is a problem: the CPU tends to want the next sequential word soon, so it is not clear early restart helps
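For example (illustrative): with an 8-word block and a miss on word 5, critical word first requests the words in the order 5, 6, 7, 0, 1, 2, 3, 4 (wrapping around), and the CPU restarts as soon as word 5 arrives.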
3. Reduce Miss Penalty: Non-blocking
Caches to Reduce Stalls on Misses
• A non-blocking cache or lockup-free cache allows the data cache to continue to supply cache hits during a miss
  – requires F/E bits on registers or out-of-order execution
  – requires multi-bank memories
• “Hit under miss” reduces the effective miss penalty by working during the miss vs. ignoring CPU requests
• “Hit under multiple miss” or “miss under miss” may further lower the effective miss penalty by overlapping multiple misses (see the MSHR sketch after this list)
  – Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
  – Requires multiple memory banks (otherwise cannot be supported)
  – Pentium Pro allows 4 outstanding memory misses
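Miss-under-miss bookkeeping is typically done with miss status holding registers (MSHRs), one per outstanding miss. A minimal C sketch; the field layout is illustrative, with 4 entries echoing the Pentium Pro bullet above:

#include <stdint.h>
#include <stdbool.h>

#define NUM_MSHR 4                 /* max outstanding misses */

typedef struct {
    bool     valid;                /* miss in flight? */
    uint32_t block_addr;           /* block being fetched */
    uint8_t  dest_reg;             /* register waiting on the data */
} MSHR;

static MSHR mshr[NUM_MSHR];

/* Allocate an MSHR for a new miss; if all are busy, the cache
   must stall until an outstanding miss returns. */
int mshr_allocate(uint32_t block_addr, uint8_t dest_reg)
{
    for (int i = 0; i < NUM_MSHR; i++) {
        if (!mshr[i].valid) {
            mshr[i] = (MSHR){ true, block_addr, dest_reg };
            return i;              /* MSHR id */
        }
    }
    return -1;                     /* all busy: stall */
}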
Value of Hit Under Miss for SPEC
[Figure: “Hit under i Misses” — average memory access time (0-1.8) for each SPEC92 benchmark (doduc, ora, ear, xlisp, fpppp, eqntott, tomcatv, alvinn, nasa7, wave5, mdljdp2, hydro2d, su2cor, mdljsp2, espresso, swm256, spice2g6, compress) under the Base configuration and with 0→1, 1→2, and 2→64 outstanding misses.]
• FP programs on average: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26
• Int programs on average: AMAT = 0.24 -> 0.20 -> 0.19 -> 0.19
• 8 KB data cache, direct mapped, 32B blocks, 16-cycle miss penalty
4: Add a second-level cache
• L2 equations:

  AMAT = Hit Time_L1 + Miss Rate_L1 × Miss Penalty_L1
  Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2
  AMAT = Hit Time_L1 + Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2)

• Definitions:
  – Local miss rate—misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)
  – Global miss rate—misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 × Miss Rate_L2)
  – Global miss rate is what matters
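For example (illustrative numbers): if L1 misses on 4% of CPU accesses and L2 misses on 50% of the accesses that reach it, the local L2 miss rate is 50%, but the global L2 miss rate is 0.04 × 0.50 = 2%, which is what the CPU actually experiences.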
Comparing Local and Global
Miss Rates
• 32 KByte 1st-level cache; increasing 2nd-level cache size
• Global miss rate is close to the single-level cache miss rate, provided L2 >> L1
• Don’t use the local miss rate
• L2 is not tied to the CPU clock cycle!
[Figure: local and global miss rates vs. L2 cache size, plotted on linear and log scales.]
Reducing Misses:
Which apply to L2 Cache?
• Reducing Miss Rate
1. Reduce Misses via Larger Block Size
2. Reduce Conflict Misses via Higher Associativity
3. Reducing Conflict Misses via Victim Cache
4. Reducing Conflict Misses via Pseudo-Associativity
5. Reducing Misses by HW Prefetching Instr, Data
6. Reducing Misses by SW Prefetching Data
7. Reducing Capacity/Conf. Misses by Compiler Optimizations
L2 cache block size &
A.M.A.T.
Relative CPU time vs. L2 block size:

  Block Size (bytes)   16     32     64     128    256    512
  Relative CPU Time    1.95   1.54   1.36   1.28   1.27   1.34

(128-256 byte blocks give the lowest relative CPU time.)
What is the Impact of What
You’ve Learned About Caches?
[Figure: processor vs. DRAM performance, 1980-2000, log scale; the same gap plot as the opening slide.]
• 1960-1985: Speed = ƒ(no. operations)
• 1990:
  – Pipelined execution & fast clock rate
  – Out-of-order execution
  – Superscalar instruction issue
• 1998: Speed = ƒ(non-cached memory accesses)
• Superscalar, out-of-order machines hide the L1 data cache miss (~5 clocks) but not the L2 cache miss (~50 clocks)?
Cache Optimization Summary
Technique                                    MR   MP   HT   Complexity
Larger Block Size                            +    –         0
Higher Associativity                         +         –    1
Victim Caches                                +              2
Pseudo-Associative Caches                    +              2
HW Prefetching of Instructions/Data          +              2-3
SW (Compiler-Controlled) Prefetching         +              3
Compiler Techniques to Reduce Misses         +              0
Read Priority over Writes on Miss                 +         1
Early Restart & Critical Word First               +         2
Non-Blocking Caches                               +         3
Second-Level Caches                               +         2

(MR = miss rate, MP = miss penalty, HT = hit time; + helps the factor, – hurts it.)