
Caches

CIS 5710
Computer Organization and Design
Readings
• P&H Chapter 5
• 5.1-5.3, 5.5
• Appendix C.9

CIS 5710 | Prof Joseph Devietti


The Problem
• Using current technologies, memories can be
either large or fast, but not both.
• The size or capacity of a memory refers to the
number of bits it can store
• The speed of a memory typically refers to the
amount of time it takes to access a stored entry:
the delay between asking for the value stored at
an address and receiving the result.



Memory Technologies



Types of Memory
• Static RAM (SRAM)
• 6 transistors per bit
• Two inverters + transistors for reading/writing
• Optimized for speed (first) and density (second)
• Fast (sub-nanosecond latencies for small SRAM)
• Speed proportional to area (~ sqrt(number of bits))
• Mixes well with standard processor logic
• Dynamic RAM (DRAM)
• 1 transistor + 1 capacitor per bit
• Optimized for density (in terms of cost per bit)
• Slow (>30ns internal access, ~50ns pin-to-pin)
• Different fabrication steps (does not mix well with logic)
• Nonvolatile storage: magnetic disk, Flash, …
SRAM
• 6 transistors per bit
• used in register file, on-chip caches


Memory Technology Trends

(figure c/o O’Hallaron, Ganger, Kesden, CMU 15-213 / 18-213)



Locality of Memory Technologies
• For many (all?) memory technologies, it may take
a long time to get the first element out of the
memory but you can fetch a lot of data in the
vicinity very quickly
• It is usually a good idea to buy in bulk by fetching
a lot of elements in the vicinity rather than just
one.



The Memory Hierarchy



Big Picture Motivation
• Processor can compute only as fast as memory
• A 3GHz processor can execute an “add” operation in 0.33ns
• Today’s “main memory” latency is more than 33ns
• Naïve implementation: loads/stores can be 100x slower than
other operations
• Unobtainable goal:
• Memory that operates at processor speeds
• Memory as large as needed for all running programs
• Memory that is cost effective
• Can't achieve all of these goals at once



Known From the Beginning

“Ideally, one would desire an infinitely large memory capacity such that any particular word would be immediately available … We are forced to recognize the possibility of constructing a hierarchy of memories, each of which has a greater capacity than the preceding but which is less quickly accessible.”

Burks, Goldstine, von Neumann
“Preliminary discussion of the logical design of an electronic computing instrument”
IAS memo 1946



This Unit: Caches
[diagram: CPU → I$/D$ → L2 → Main Memory → Disk]
• “Cache”: hardware managed
  • Hardware automatically retrieves missing data
  • Built from fast SRAM, usually on-chip today
  • In contrast to off-chip, DRAM “main memory”
• Cache organization
  • Speed vs. Capacity
• ABCs of caches
• Miss classification
• Some example performance calculations
Key observation: Locality
• Temporal locality
• Recently referenced data is likely to be referenced again
soon
• Reactive: cache recently used data in small, fast memory
• Spatial locality
• More likely to reference data near recently referenced data
• Proactive: cache large chunks to include nearby data
• Both properties hold for data and instructions
• Most real-world programs have locality in some form
• Cache: finite-sized hashtable of recently used
data blocks
• In hardware, transparent to software
Food Caching
• How to get fast access to all of the food we might
want?
• Fast access to food in your kitchen
• but it has limited capacity
• The grocery store has more food, but is slow
• Far away
• Big (takes time to walk within the grocery store)
• How can you avoid these latencies?
• Keep recently used foods around (temporal locality)
• Put related foods close together (spatial locality)
• Guess what you'll need in the future (prefetching, later)



Netflix caching

[figure: Netflix examples of temporal locality and spatial locality]



Locality Example
• Which memory accesses have spatial locality?
• Which memory accesses have temporal locality?

int sum = 0;
int X[1000];
for (int c = 0; c < 1000; c++) {
  sum += X[c];
}



Exploiting locality with hierarchy
[diagram: CPU → M1 → M2 → M3 → M4]
• Hierarchy of memory components
• Upper components
  • Fast ↔ Small ↔ Expensive
• Lower components
  • Slow ↔ Big ↔ Cheap
• Connected by buses
  • Which have latency and bandwidth issues
• Most frequently accessed data in M1
  • next most frequently accessed in M2, etc.
• Move data up-down hierarchy
• Optimize average access time
  • tavg = taccess + (%miss * tmiss)
  • Attack each component
Concrete Memory Hierarchy
[diagram: Registers → I$/D$ → L2, L3 → Main Memory → Disk]
• Registers (compiler managed)
• Primary caches (hardware managed)
  • Split instruction (I$) and data (D$)
  • Typically 32-64KB each
• 2nd and 3rd level caches, L2 and L3 (hardware managed)
  • On-chip, typically made of SRAM
  • L2$ typically 256-512KB
  • “Last level cache” (LLC) 16-100MB
• Main memory (software managed, by the OS)
  • made of DRAM (“Dynamic” RAM)
  • ≥8GB for laptops, servers have TBs
• Disk (swap and files)
  • Uses magnetic disks or flash drives
Evolution of Cache Hierarchies
[figure: 1989 Intel 486 with 8KB unified I/D$, versus a modern chip with 64KB I$, 64KB D$, and 1.5MB L2, versus 2021 AMD Milan-X with 3D V-cache and 96MB total L3$ (plus on-die L3 tags)]
• Chips today are 30–70% cache by area
Caches



Logical Cache Organization
• Cache is a hardware hashtable
• The setup
  • 32-bit ISA → 4G addresses, 2^32 B address space
• Logical cache organization
  • 4KB, organized as 1K 4B blocks (aka lines)
  • Each block can hold a 4-byte word
• Physical cache implementation
  • 1K (1024 rows) by 4B SRAM
  • Called the data array
  • 10-bit address input
  • 32-bit data input/output
[diagram: 1024-row data array with addr and data ports]
Looking Up A Block
• Which 10 of the 32 address bits to use?
  • use bits [11:2]
• 2 least significant (LS) bits [1:0] are the offset bits
  • Locate byte within word
  • Don't need these to locate word
• Next 10 LS bits [11:2] are the index bits
  • These locate the word
  • These bits work best in practice
  • Why?
[diagram: index bits [11:2] drive the data array's addr port]
Every block has a designated spot



Is this the block you’re looking for?
• Each cache row corresponds to 2^20 blocks
• How to know which (if any) is currently there?
  • Tag each cache word with remaining address bits [31:12]
  • Build separate and parallel tag array
  • 1K by 21-bit SRAM
  • 20-bit (next slide) tag + 1 valid bit
• Lookup algorithm
  • Read tag indicated by index bits
  • If tag matches & valid bit set:
    then: Hit → data is good
    else: Miss → data is garbage, wait…
[diagram: tag array indexed by [11:2], tag compared against addr[31:12], hit signal gates the data output]
A Concrete Example
• Lookup address x000C14B8
  • Index = addr[11:2] = (addr >> 2) & x3FF = x12E
  • Tag = addr[31:12] = (addr >> 12) = x000C1
[figure: binary address 0000 0000 0000 1100 0001 0100 1011 10 00 split into tag x000C1, the index bits, and the 2-bit offset]


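Below is a minimal C sketch of this direct-mapped lookup, assuming the 4KB cache with 4B blocks from the preceding slides; the array and function names are illustrative, not from the slides.

  #include <stdint.h>
  #include <stdbool.h>

  /* 4KB direct-mapped cache, 4B blocks: 1024 frames, 2-bit offset,
     10-bit index [11:2], 20-bit tag [31:12]. */
  #define NUM_FRAMES 1024

  static uint32_t tag_array[NUM_FRAMES];   /* tag array */
  static bool     valid[NUM_FRAMES];       /* valid bits */
  static uint32_t data_array[NUM_FRAMES];  /* data array: one 4B word per frame */

  /* Returns true on a hit and fills *word; on a miss the controller must fill. */
  bool cache_lookup(uint32_t addr, uint32_t *word) {
    uint32_t index = (addr >> 2) & 0x3FF;  /* bits [11:2] */
    uint32_t tag   = addr >> 12;           /* bits [31:12] */
    if (valid[index] && tag_array[index] == tag) {
      *word = data_array[index];
      return true;                         /* hit */
    }
    return false;                          /* miss */
  }

For the address x000C14B8 above, this computes index x12E and tag x000C1, matching the slide.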
Cache operation: 1B block
• 8-bit addresses → 256B memory
  • Keeps diagrams simple
• 4B cache, 1B blocks
• Figure out number of sets: 4 (capacity / block-size)
• Figure out how address splits into offset/index/tag bits
  • Offset: least-significant log2(block-size) = log2(1) = 0 bits
  • Index: next log2(number-of-sets) = log2(4) = 2 bits
  • Tag: rest = 8 – 2 = 6 bits
• Address split: tag (6 bits) | index (2 bits)



Multi-Word Cache Blocks
• In most modern implementation we store more
than one address (>1 byte) in each cache block.
• The number of bytes or words stored in each
cache block is referred to as the block size.
• The entries in each block come from a contiguous
set of addresses to exploit locality of reference,
and to simplify indexing
• Cache blocks are also referred to as cache lines
• Related to cache frames – a frame is the bucket, and the
block is the data that goes in the bucket
• blocks move around due to fills & evictions
• frames are part of the cache structure and never move
Tag, Index, Block Offset
• Consider 8B cache with 2B blocks
• Figure out number of sets: 4 (capacity / block-size)
• Figure out how address splits into offset/index/tag bits
  • Offset: least-significant log2(block-size) = log2(2) = 1 bit
  • Index: middle log2(number-of-sets) = log2(4) = 2 bits
  • Tag: remaining high-order bits = 8 – 1 – 2 = 5 bits
• Address split: tag (5 bits) | index (2 bits) | block offset (1 bit)

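As a sketch of the recipe on this slide and the 1B-block slide, the helper below computes the field widths for a direct-mapped cache from its capacity and block size (both assumed to be powers of two); the function names are illustrative, not from the slides.

  #include <stdint.h>

  /* Integer log2 for a power-of-two x (illustrative helper). */
  static unsigned log2u(uint32_t x) {
    unsigned n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
  }

  /* Split an addr_bits-wide address into offset/index/tag field widths. */
  void split_fields(uint32_t capacity, uint32_t block_size, unsigned addr_bits,
                    unsigned *offset_bits, unsigned *index_bits, unsigned *tag_bits) {
    uint32_t num_sets = capacity / block_size;
    *offset_bits = log2u(block_size);
    *index_bits  = log2u(num_sets);
    *tag_bits    = addr_bits - *offset_bits - *index_bits;
  }

split_fields(8, 2, 8, …) gives offset = 1, index = 2, tag = 5, matching the slide above.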


example
• via http://comparchviz.com
• 8B DM cache with 2B blocks



Handling a Cache Miss
• What if requested data isn't in the cache?
• How does it get in there?
• Cache controller: finite state machine
• Remembers miss address
• Accesses next level of memory
• Waits for response
• Writes data/tag into proper locations
• Bringing a missing block into the cache is a cache
fill

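A hedged C sketch of that fill sequence, continuing the illustrative tag_array/data_array/valid arrays from the direct-mapped lookup sketch earlier; next_level_read() is a hypothetical stand-in for accessing the next level of memory (real hardware does this with a small finite state machine rather than a blocking call).

  #include <stdint.h>
  #include <stdbool.h>

  extern uint32_t tag_array[], data_array[];
  extern bool     valid[];
  extern uint32_t next_level_read(uint32_t addr);   /* hypothetical next-level access */

  /* Fill on a miss: remember the miss address, fetch the block,
     then write data and tag into the proper frame and mark it valid. */
  void handle_miss(uint32_t addr) {
    uint32_t index = (addr >> 2) & 0x3FF;
    uint32_t block = next_level_read(addr);  /* access next level, wait for response */
    data_array[index] = block;               /* write data */
    tag_array[index]  = addr >> 12;          /* write tag */
    valid[index]      = true;
  }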


Cache Misses and Pipeline Stalls
[pipeline diagram: fetch stage reads the I$, memory stage reads the D$; nops are injected while the pipeline is stalled]
• I$ and D$ misses stall pipeline just like data hazards
• Stall logic driven by miss signal
  • Cache “logically” re-evaluates hit/miss every cycle
  • Block is filled → miss signal goes low → pipeline restarts



Cache Performance Equation
• For a cache
  • Access: read or write to cache
  • Hit: desired data found in cache
  • Miss: desired data not found in cache
    • Must get from another component
    • No notion of “miss” in register file
  • Fill: action of placing data into cache
• %miss (miss-rate): #misses / #accesses
• taccess: time to check cache. If hit, all done
• tmiss: time to read data into cache
• Performance metric: average access time
  tavg = taccess + (%miss * tmiss)
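A quick worked example (the numbers are illustrative assumptions, roughly in line with the parameter table later in this unit, not measurements from the slides): for an L1 with taccess = 2ns, tmiss = 10ns, and %miss = 5%:
  tavg = taccess + (%miss * tmiss) = 2ns + (0.05 * 10ns) = 2.5ns
Halving the miss rate to 2.5% would bring tavg down to 2.25ns.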
Measuring Cache Performance
• Ultimate metric is tavg
• Cache capacity and circuits roughly determines taccess
• Lower-level memory structures determine tmiss
• Measure %miss
• Hardware performance counters
• Simulation



Capacity and Performance
• Simplest way to reduce %miss: increase capacity
  + Miss rate decreases monotonically
    • “Working set”: insns/data program is actively using
    • Diminishing returns
  – However taccess increases
    • Latency proportional to sqrt(capacity)
[plot: %miss vs. cache capacity, flattening out once capacity reaches the “working set” size]
• For a given capacity, adjust %miss via organization
Block Size
• Given capacity, manipulate %miss by changing organization
• One option: increase block size
  • Exploit spatial locality
  • Boundary between index and offset changes
  • Tag remains the same
• Ramifications
  + Reduce %miss (up to a point)
  + Reduce tag overhead (why?)
  – Potentially useless data transfer
  – Premature replacement of useful data
[diagram: 512-frame data array of 64B (512-bit) blocks; address split into tag [31:15], 9-bit index [14:6], and offset [5:0], with the offset selecting a word within the block]
Block Size and Tag Overhead
• 4KB cache with 1024 4B blocks?
  • 4B blocks → 2-bit offset, 1024 frames → 10-bit index
  • 32-bit address – 2-bit offset – 10-bit index = 20-bit tag
  • 20-bit tag / 32-bit block = 63% overhead
• 4KB cache with 512 8B blocks
  • 8B blocks → 3-bit offset, 512 frames → 9-bit index
  • 32-bit address – 3-bit offset – 9-bit index = 20-bit tag
  • 20-bit tag / 64-bit block = 32% overhead
  • Notice: tag size is the same, but data size is twice as big
• A realistic example: 64KB cache with 64B blocks
  • 16-bit tag / 512-bit block = ~3% overhead
• Note: tags are not optional
Effect of Block Size on Miss Rate
• Two effects on miss rate
  + Spatial prefetching (good)
    • For blocks with adjacent addresses
    • Turns miss/miss pairs into miss/hit pairs
  – Interference (bad)
    • For blocks with non-adjacent addresses (but in adjacent frames), large blocks turn hits into misses by disallowing simultaneous residence
    • Consider entire cache as one big block
• Both effects always present
  • Spatial prefetching dominates initially
  • Depends on size of the cache
  • Good block size is 32–256B; optimal size is program-dependent
[plot: %miss vs. block size, a U-shaped curve]



example
• 8B DM cache with 4B blocks



Block Size and Miss Penalty
• Does increasing block size increase tmiss?
• Don't larger blocks take longer to read, transfer, and fill?
• They do, but…
• tmiss of an isolated miss is not affected
• Critical Word First / Early Restart (CWF/ER)
• Requested word fetched first, pipeline restarts immediately
• Remaining words in block transferred/filled in the background
• tmiss'es of a cluster of misses will suffer
• Reads/transfers/fills of two misses can't happen at the same
time
• Latencies can start to pile up
• This is a bandwidth problem
Cache Conflicts
[diagram: 16-entry main memory, 4-bit addresses 0000–1111 holding values A–Q, mapped into a 4-set cache; address split: tag (1 bit) | index (2 bits) | offset (1 bit)]
• Pairs like “0010” and “1010” conflict
  • Same index!
• Can such pairs simultaneously reside in cache?
  • A: Yes, if we reorganize the cache to do so
example
• 8B DM cache with lots of conflicts



Associativity
• Set-associativity
  • Block can reside in one of a few frames
  • Frame groups called sets
  • Each frame in a set called a way
  • This is 2-way set-associative (SA)
  • 1-way → direct-mapped (DM)
  • 1-set → fully-associative (FA)
  + Reduces conflicts
  – Increases taccess: additional tag match & muxing
[diagram: two ways of 4B frames indexed by [10:2], each with its own tag comparator against addr[31:11]]
Associativity
• Lookup algorithm
  • Use index bits to find set
  • Read data/tags in all frames in parallel
  • Any (match and valid bit) is a hit
• Notice tag/index/offset bits
  • Only 9-bit index (versus 10-bit for direct-mapped)
[diagram: same 2-way structure, tag [31:11] compared in both ways in parallel]
2-way set-associative cache
• 8B cache, 2 ways, 2B blocks
• Figure out number of sets: 2 ((capacity / ways) / block-size)
• Figure out how address splits into offset/index/tag bits
  • Offset: least-significant log2(block-size) = log2(2) = 1 bit
  • Index: next log2(number-of-sets) = log2(2) = 1 bit
  • Tag: rest = 8 – 1 – 1 = 6 bits
• Address split: tag (6 bits) | index (1 bit) | block offset (1 bit)



example
• 8B 2-way cache with 4B blocks



Replacement Policies
• Set-associative caches present new design choice
• On cache miss, which block in set to replace (kick out)?
• Some options
• Random
• FIFO (first-in first-out)
• LRU (least recently used)
• Fits with temporal locality, LRU = least likely to be used in
future
• NMRU (not most recently used)
• An easier to implement approximation of LRU
• Equivalent to LRU for 2-way set-associative caches
• Belady's: replace block that will be used furthest in future
• Unachievable optimum



LRU and Miss Handling
[diagram: 2-way data array (frames 0–511 in way 0, 512–1023 in way 1), parallel tag comparators, a per-set LRU field, and fill data arriving from memory]
• Add LRU field to each set
  • “Least recently used”
  • LRU data is an encoded “way”
  • Hit? update MRU
  • LRU bits updated on each access
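A minimal C sketch of a 2-way set-associative lookup with a one-bit-per-set LRU policy; the sizes and names are illustrative, not taken from the slides.

  #include <stdint.h>
  #include <stdbool.h>

  #define NUM_SETS 512
  #define NUM_WAYS 2

  static uint32_t sa_tags[NUM_SETS][NUM_WAYS];
  static bool     sa_valid[NUM_SETS][NUM_WAYS];
  static uint8_t  sa_lru[NUM_SETS];   /* way number of the least recently used way */

  /* Check both ways in the set; on a hit, update the LRU bit. */
  bool sa_lookup(uint32_t set, uint32_t tag, unsigned *hit_way) {
    for (unsigned w = 0; w < NUM_WAYS; w++) {
      if (sa_valid[set][w] && sa_tags[set][w] == tag) {
        *hit_way = w;
        sa_lru[set] = 1 - w;          /* the other way is now least recently used */
        return true;
      }
    }
    return false;
  }

  /* On a miss, the LRU way is the victim to replace. */
  unsigned sa_choose_victim(uint32_t set) {
    return sa_lru[set];
  }

With two ways, this single bit is both NMRU and exact LRU, as noted on the Replacement Policies slide.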
Associativity and Performance
• Higher-associativity caches
  + Have better (lower) %miss
    • Diminishing returns
  – However taccess increases
    • The more associative, the slower
  • What about tavg?
[plot: %miss vs. associativity, with diminishing returns at modest associativity]
• Block-size and number of sets should be powers of two
  • Makes indexing easier (just rip bits out of the address)
  • 3-way set-associativity? No problem


Cache Glossary

[labeled diagram: tag array and data array; a frame holds a block/line plus its valid bit and tag; a set/row spans the ways; the address splits into tag | index | block offset (displacement)]



What About Stores?



Handling stores
• So far we have looked at reading from cache
• Instruction fetches, loads
• What about writing into cache?

• Several new issues


• Tag/data access
• Write-through vs. write-back
• Write-allocate vs. write-not-allocate
• Hiding write miss latency



Tag/Data Access
• Reads: read tag and data in parallel
  • Tag mis-match → data is wrong (OK, just stall until good data arrives)
• Writes: read tag, write data in parallel? No! Why not?
  • Tag mis-match → clobbered data (oops!)
  • For associative caches, which way was written into?
• Writes are a two-step (multi-cycle) process
  • Step 1: match tag
  • Step 2: write to matching way
  • Bypass (with address check) to avoid load stalls
  • May introduce structural hazards
Write Propagation
• When to propagate new value to lower-level
caches/memory?
• Option #1: Write-through: immediately
• On hit, update cache
• Immediately send the write to the next level
• Option #2: Write-back: when block is replaced
• Now we have multiple versions of the same block in various
caches and in memory!
• Requires additional “dirty” bit per block
• Evict clean block: no extra traffic
• there was only 1 version of the block
• Evict dirty block: extra “writeback” of block
• the dirty block is the most up-to-date version
Write-back Cache Operation
• Each cache block has an associated dirty bit
  • state is either clean or dirty

                       state  valid bit  tag  block
  initial state          -        I       -     -
  after lw r0 <= [A]     C        V       A   0x01
  after sw r1 => [A]     D        V       A   0x02
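A hedged C sketch of write-back, write-allocate store handling for one direct-mapped frame; writeback_to_next_level() and fill_from_next_level() are hypothetical stand-ins for the next-level interface, not from the slides.

  #include <stdint.h>
  #include <stdbool.h>

  typedef struct {
    bool valid, dirty;
    uint32_t tag;
    uint32_t data;                     /* one word per block, for simplicity */
  } frame_t;

  extern void     writeback_to_next_level(uint32_t tag, uint32_t data);  /* hypothetical */
  extern uint32_t fill_from_next_level(uint32_t tag);                    /* hypothetical */

  void store_word(frame_t *f, uint32_t tag, uint32_t value) {
    if (f->valid && f->tag == tag) {   /* write hit: update and mark dirty */
      f->data  = value;
      f->dirty = true;
      return;
    }
    /* write miss, write-allocate: evict (writing back if dirty), fill, then write */
    if (f->valid && f->dirty)
      writeback_to_next_level(f->tag, f->data);
    f->tag   = tag;
    f->data  = fill_from_next_level(tag);  /* fill (with 1-word blocks the store overwrites it) */
    f->data  = value;
    f->valid = true;
    f->dirty = true;
  }

This mirrors the table above: a load fill leaves the block clean (C), and the first store flips it to dirty (D).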


Write-backs across caches
• When a dirty block is evicted to a lower-level cache, it remains dirty
• Writing a block back to memory cleanses it
  • There are never dirty blocks in memory, only in caches
[diagram: dirty block (D V A 0x02) evicted from L1 into L2, then evicted from L2 to memory, where only the data 0x02 remains (no dirty/valid bits or tag needed) and the block is cleansed]



Optimizing Writebacks
+ Writeback buffer (WBB):
  • Hide latency of writebacks (keep them off the critical path)
  • Step #1: Send “fill” request to next level
  • Step #2: While waiting, write dirty block to buffer
  • Step #3: When the new block arrives, put it into the cache
  • Step #4: Write buffer contents to next level
[diagram: cache → WBB → next-level cache, with steps 1–4 annotated]



Write Propagation Comparison
• Write-through
– Requires additional bus bandwidth
• Consider repeated write hits
– Next level must handle small writes (1, 2, 4, 8-bytes)
+ No need for dirty bits in cache
+ No need to handle “writeback” operations
• Simplifies miss handling
• Used in GPUs, as they have low write temporal locality
• Write-back
+ Key advantage: uses less bandwidth
• Reverse of other pros/cons above
• Used in most CPU designs
Write Miss Handling
• How is a write miss actually handled?
• Write-allocate: fill block from next level, then
write it
+ Decreases read misses (next read to block will hit)
– Requires additional bandwidth
• Commonly used (especially with write-back caches)
• Write-non-allocate: just write to next level,
don’t allocate a block
– Potentially more read misses
+ Uses less bandwidth
• Use with write-through



Write Misses and Store Buffers
• Read miss?
  • Load can't go on without the data, it must stall
• Write miss?
  • no instruction is waiting for the data, so why stall?
• Store buffer: a small buffer
  • Stores put addr/value in store buffer, keep going
  • Store buffer writes stores to D$ in the background
  • Loads must search store buffer (in addition to D$)
  + Eliminates stalls on write misses (mostly)
  – Creates some problems for multicore (later)
• Store buffer vs. writeback buffer
  • Store buffer: “in front” of D$, hides store misses
  • Writeback buffer: “behind” D$, hides writebacks
[diagram: Processor → SB → Cache → WBB → next-level cache]
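A minimal C sketch of a store buffer with load forwarding; the size and names are illustrative, and draining entries into the D$ in the background is left out for brevity.

  #include <stdint.h>
  #include <stdbool.h>

  #define SB_ENTRIES 8

  typedef struct { uint32_t addr, value; bool valid; } sb_entry_t;
  static sb_entry_t sb[SB_ENTRIES];
  static unsigned sb_next;                 /* next slot to allocate */

  /* Stores record addr/value and retire immediately; the D$ write happens later. */
  void sb_store(uint32_t addr, uint32_t value) {
    sb[sb_next] = (sb_entry_t){ addr, value, true };
    sb_next = (sb_next + 1) % SB_ENTRIES;
  }

  /* Loads search the store buffer, newest entry first, before going to the D$. */
  bool sb_load_forward(uint32_t addr, uint32_t *value) {
    for (unsigned i = 0; i < SB_ENTRIES; i++) {
      unsigned idx = (sb_next + SB_ENTRIES - 1 - i) % SB_ENTRIES;
      if (sb[idx].valid && sb[idx].addr == addr) {
        *value = sb[idx].value;
        return true;
      }
    }
    return false;                          /* not found: read the D$ */
  }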
Improving Effectiveness of Memory
Hierarchy



Classifying Misses: 3C Model
• Divide cache misses into three categories
• Compulsory (cold): never seen this address before
• Would miss even in infinite cache
• Capacity: miss caused because cache is too small
• Would miss even in fully associative cache
• Identify? Consecutive accesses to block separated by
access to at least N other distinct blocks (N is number of
frames in cache)
• Conflict: miss caused because cache associativity is too low
• Identify? All other misses
• (Coherence): miss due to external invalidations
• Only in shared memory multiprocessors (later)



Miss Rate: ABC
• Why do we care about 3C miss model?
• So that we know what to do to eliminate misses
• If you don't have conflict misses, increasing associativity
won't help
• More Associativity (assuming fixed capacity)
+ Decreases conflict misses
– Increases taccess
• Larger Block Size (assuming fixed capacity)
– Increases conflict/capacity misses (fewer frames)
+ Decreases compulsory misses (spatial locality)
• No significant effect on taccess
• More Capacity
+ Decreases capacity misses
– Increases taccess
Victim Buffers for conflict misses
• Conflict misses: not enough associativity
  • High associativity is expensive, but also rarely needed
  • e.g., 3 blocks mapping to the same 2-way set and accessed as (XYZ)+
• Victim buffer (VB): small fully-associative cache
  • Sits on the I$/D$ miss path
  • Small (e.g., 8 entries) so very fast
  • Blocks kicked out of I$/D$ placed in VB
  • On miss, check VB: hit? Place block back in I$/D$
  • 8 extra ways, shared among all sets
  + Only a few sets will need it at any given time
  + Very effective in practice
[diagram: I$/D$ with the VB on the miss path to L2]


Prefetching
• Bring data into cache proactively/speculatively
  • If successful, reduces number of cache misses
• Key: anticipate upcoming miss addresses accurately
  • Can do in software or hardware
• Simple hardware prefetching: next-block prefetching
  • Miss on address X → anticipate miss on X+block-size
  + Works for insns: sequential execution
  + Works for data: arrays
• Table-driven hardware prefetching
  • Use a predictor to detect strides, common patterns
• Effectiveness determined by:
  • Timeliness: initiate prefetches sufficiently in advance
  • Coverage: prefetch for as many misses as possible
  • Accuracy: don't pollute with unnecessary data
[diagram: prefetch logic sits alongside the I$/D$, issuing requests to L2]


Software Prefetching
• Use a special “prefetch” instruction
• Tells the hardware to bring in data
• Just a hint
• Inserted by programmer or compiler
• Example
int tree_add(tree_t* t) {
  if (t == NULL) return 0;
  __builtin_prefetch(t->left);
  __builtin_prefetch(t->right);
  return t->val + tree_add(t->right) + tree_add(t->left);
}

• Multiple prefetches bring multiple blocks in parallel


• More “Memory-level” parallelism (MLP)
Software Restructuring: Data
• Capacity misses: poor spatial or temporal locality
• Several code restructuring techniques to improve both
– Compiler must know that restructuring preserves semantics
• Loop interchange: spatial locality
• Example: row-major matrix: X[i][j] then X[i][j+1]
• Poor code: X[i][j] followed by X[i+1][j]
  for (j = 0; j < NCOLS; j++)
    for (i = 0; i < NROWS; i++)
      sum += X[i][j];
• Better code
  for (i = 0; i < NROWS; i++)
    for (j = 0; j < NCOLS; j++)
      sum += X[i][j];
Software Restructuring: Data
• Loop blocking: temporal locality
• Poor code
  for (k = 0; k < NUM_ITERATIONS; k++)
    for (i = 0; i < NUM_ELEMS; i++)
      X[i] = f(X[i]);
• Better code
  • Cut array into CACHE_SIZE chunks
  • Run all phases on one chunk, proceed to next chunk
  for (i = 0; i < NUM_ELEMS; i += CACHE_SIZE)
    for (k = 0; k < NUM_ITERATIONS; k++)
      for (j = 0; j < CACHE_SIZE; j++)
        X[i+j] = f(X[i+j]);

– Assumes you know CACHE_SIZE, but do you?



Software Restructuring: Code
• Compiler can improve code temporal/spatial locality
  • If (a) { code1; } else { code2; } code3;
  • But, code2 case never happens (say, error condition)
  • Lay out code1 and code3 together and move code2 out of line, after code3 (a sketch follows below)
    + Better locality, including for the code after code3
    + Fewer taken branches
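One way to express this hint in source, as a hedged example: GCC/Clang's __builtin_expect lets the programmer mark the error path as unlikely so the compiler can lay it out out of line; the function below is hypothetical, not from the slides.

  #include <stdio.h>
  #include <stdlib.h>

  void process(int a) {
    if (__builtin_expect(a != 0, 1)) {     /* likely path: code1 */
      printf("common case\n");
    } else {                               /* unlikely path: code2 (error) */
      fprintf(stderr, "error\n");
      exit(1);
    }
    printf("code3 runs next\n");           /* code3, falls through from code1 */
  }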
Cache Hierarchies



Designing a Cache Hierarchy
• For any memory component: taccess vs. %miss tradeoff
• Upper components (I$, D$) emphasize low taccess
  • Frequent access → taccess important
  • tmiss is not bad → %miss less important
  • Lower capacity and lower associativity (to reduce taccess)
  • Small-medium block size (to reduce conflicts)
• Moving down (L2, L3) emphasis turns to %miss
  • Infrequent access → taccess less important
  • tmiss is bad → %miss important
  • High capacity, associativity, and block size (to reduce %miss)



Memory Hierarchy Parameters
Parameter I$/D$ L2 L3 Main Memory
taccess 2ns 10ns 30ns 100ns
tmiss 10ns 30ns 100ns 10ms (10M ns)
Capacity 32KB–64KB 256-512KB 16-100MB GBs-TBs
Block size 16B–64B 32B–128B 32B-256B 4KB-1GB
Associativity 2-8 4–16 4-16 n/a

• Some other design parameters


• Split vs. unified insns/data
• Inclusion vs. exclusion vs. nothing



Split vs Unified Caches
• Split I$/D$: insns and data in different caches
• To minimize structural hazards and taccess
• Larger unified I$/D$ would be slow, 2nd port even slower
• Optimize I$ and D$ separately
• No writes for I$, smaller reads for D$
• Why is 486 I/D$ unified?
• Unified L2, L3: insns and data together
• To minimize %miss
+ Fewer capacity misses: unused insn capacity is used for data
– More conflict misses: insn/data conflicts
• A much smaller effect in large caches
• Go further: unify L3 of multiple cores in a multi-core
Inclusive vs Exclusive Caches
• Inclusion
• Bring block from memory into L2 then L1
• A block in the L1 is always in the L2
• If block evicted from L2, must also evict it from L1
• Why? more on this when we talk about multicore
• Exclusion
• Bring block from memory into L1 but not L2
• Move block to L2 on L1 eviction
• L2 becomes a large victim cache
• Block is either in L1 or L2 (never both)
• Good if L2 is small relative to L1
• Example: AMD's Duron 64KB L1s, 64KB L2
Memory Performance Equation
[diagram: CPU above component M, with taccess, %miss, and tmiss annotated]
• For memory component M
  • Access: read or write to M
  • Hit: desired data found in M
  • Miss: desired data not found in M
    • Must get from another (slower) component
  • Fill: action of placing data in M
• %miss (miss-rate): #misses / #accesses
• taccess: time to read data from (write data to) M
• tmiss: time to read data into M
• Performance metric
  • tavg: average access time
    tavg = taccess + (%miss * tmiss)
Hierarchy Performance
[diagram: CPU → M1 → M2 → M3 → M4, where each level's tmiss is the next level's tavg]
tavg = tavg-M1
     = tacc-M1 + (%miss-M1 * tmiss-M1)
     = tacc-M1 + (%miss-M1 * tavg-M2)
     = tacc-M1 + (%miss-M1 * (tacc-M2 + (%miss-M2 * tmiss-M2)))
     = tacc-M1 + (%miss-M1 * (tacc-M2 + (%miss-M2 * tavg-M3)))
     = …
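A quick worked instance of this recursion, using the taccess values from the parameter table and assumed miss rates (10% of L1 accesses miss, and 20% of those also miss in L2; the rates are illustrative, not from the slides):
  tavg = tacc-L1 + (%miss-L1 * (tacc-L2 + (%miss-L2 * tmem)))
       = 2ns + (0.10 * (10ns + (0.20 * 100ns)))
       = 2ns + (0.10 * 30ns) = 5ns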


Miss Rates: per-access vs per-insn
• Miss rates can be expressed two ways:
• Misses per “instruction” (or instructions per miss), -or-
• Misses per “cache access” (or accesses per miss)
• For first-level caches, use insn mix to convert
• If memory ops are 1/3 of instructions…
• 2% of insns miss (1 in 50) is 6% of “accesses” miss (1 in 17)
• What about second-level caches?
• Misses per insn still straight-forward (“global” miss rate)
• Misses per “access” is trickier (“local” miss rate)
• Depends on number of accesses (which depends on L1
rate!)
• L1 acts as a filter for L2 accesses
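A small illustrative example of the local/global distinction (the rates are assumptions, not from the slides): if 6% of accesses miss in the L1 and one third of those also miss in the L2, the L2's local miss rate is 33%, while its global rate is only 2% of accesses (roughly 0.7% of instructions under the 1/3-memory-ops mix above).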
Main Memory As A Cache
Parameter I$/D$ L2 L3 Main Memory
taccess 2ns 10ns 30ns 100ns
tmiss 10ns 30ns 100ns 10ms (10M ns)
Capacity 32KB–64KB 256-512KB 16-100MB GBs-TBs
Block size 16B–64B 32B–128B 32B-256B 4KB-1GB
Associativity 2-8 4–16 4-16 full
Replacement LRU LRU LRU “working set”
Prefetching? yes yes yes sw-managed

• How would you internally organize main memory?


• tmiss is outrageously long, reduce %miss at all costs
• Full associativity: isn't that difficult to implement?
• Yes in hardware, but main memory is software-managed



Summary
[diagram: applications on top of system software, running on Mem / CPU / I/O]
• Average access time of a memory component
  • tavg = taccess + (%miss * tmiss)
  • Hard to get low taccess and %miss in one structure → build a hierarchy instead
• Memory hierarchy
  • Cache (SRAM) → memory (DRAM) → swap (Disk)
  • Smaller, faster, more expensive → bigger, slower, cheaper
• Cache ABCs (associativity, block size, capacity)
  • 3C miss model: compulsory, capacity, conflict
• Performance optimizations
  • %miss: prefetching
  • tmiss: victim buffer, critical-word-first
• Write issues
  • Write-back vs. write-through, write-allocate vs. write-no-allocate
