
Caches

CIS 5710
Computer Organization and Design
Readings
• P&H Chapter 5
• 5.1-5.3, 5.5
• Appendix C.9

CIS 5710 | Prof Joseph Devietti


The Problem
• Using current technologies, memories can be
either large or fast, but not both.
• The size or capacity of a memory refers to the
number of bits it can store
• The speed of a memory typically refers to the
amount of time it takes to access a stored entry:
the delay between asking for the value stored at
an address and receiving the result.



Memory Technologies



Types of Memory
• Static RAM (SRAM)
• 6 transistors per bit
• Two inverters + transistors for reading/writing
• Optimized for speed (first) and density (second)
• Fast (sub-nanosecond latencies for small SRAM)
• Speed proportional to area (~ sqrt(number of bits))
• Mixes well with standard processor logic
• Dynamic RAM (DRAM)
• 1 transistor + 1 capacitor per bit
• Optimized for density (in terms of cost per bit)
• Slow (>30ns internal access, ~50ns pin-to-pin)
• Different fabrication steps (does not mix well with logic)
• Nonvolatile storage: magnetic disk, Flash, …
SRAM
• 6 transistors per bit
• used in register file, on-chip caches


Memory Technology Trends

(figure c/o O’Hallaron, Ganger, Kesden, CMU 15-213 / 18-213)



Locality of Memory Technologies
• For many (all?) memory technologies, it may take
a long time to get the first element out of the
memory but you can fetch a lot of data in the
vicinity very quickly
• It is usually a good idea to buy in bulk by fetching
a lot of elements in the vicinity rather than just
one.



The Memory Hierarchy



Big Picture Motivation
• Processor can compute only as fast as memory
• A 3GHz processor can execute an “add” operation in 0.33ns
• Today’s “main memory” latency is more than 33ns
• Naïve implementation: loads/stores can be 100x slower than
other operations
• Unobtainable goal:
• Memory that operates at processor speeds
• Memory as large as needed for all running programs
• Memory that is cost effective
• Can't achieve all of these goals at once



Known From the Beginning

“Ideally, one would desire an infinitely large memory capacity such that any particular word would be immediately available … We are forced to recognize the possibility of constructing a hierarchy of memories, each of which has a greater capacity than the preceding but which is less quickly accessible.”

Burks, Goldstine, von Neumann
“Preliminary discussion of the logical design of an electronic computing instrument”
IAS memo 1946



This Unit: Caches
[diagram: CPU → I$/D$ → L2 → Main Memory → Disk]
• “Cache”: hardware managed
  • Hardware automatically retrieves missing data
  • Built from fast SRAM, usually on-chip today
  • In contrast to off-chip, DRAM “main memory”
• Cache organization
  • Speed vs. Capacity
• ABCs of caches
• Miss classification
• Some example performance calculations
Key observation: Locality
• Temporal locality
• Recently referenced data is likely to be referenced again
soon
• Reactive: cache recently used data in small, fast memory
• Spatial locality
• More likely to reference data near recently referenced data
• Proactive: cache large chunks to include nearby data
• Both properties hold for data and instructions
• Most real-world programs have locality in some form
• Cache: finite-sized hashtable of recently used
data blocks
• In hardware, transparent to software
Food Caching
• How to get fast access to all of the food we might
want?
• Fast access to food in your kitchen
• but it has limited capacity
• The grocery store has more food, but is slow
• Far away
• Big (takes time to walk within the grocery store)
• How can you avoid these latencies?
• Keep recently used foods around (temporal locality)
• Put related foods close together (spatial locality)
• Guess what you'll need in the future (prefetching, later)



Netflix caching

[figure: Netflix examples of temporal locality and spatial locality]



Locality Example
• Which memory accesses have spatial locality?
• Which memory accesses have temporal locality?

int sum = 0;
int X[1000];
for (int c = 0; c < 1000; c++) {
  sum += X[c];
}



Exploiting locality with hierarchy
[diagram: CPU → M1 → M2 → M3 → M4]
• Hierarchy of memory components
• Upper components
  • Fast ↔ Small ↔ Expensive
• Lower components
  • Slow ↔ Big ↔ Cheap
• Connected by buses
  • Which have latency and bandwidth issues
• Most frequently accessed data in M1
  • next most frequently accessed in M2, etc.
• Move data up-down hierarchy
• Optimize average access time
  • tavg = taccess + (%miss * tmiss)
  • Attack each component
Concrete Memory Hierarchy
[diagram: Registers → I$/D$ → L2, L3 → Main Memory → Disk]
• Registers (compiler managed)
• Primary caches (hardware managed)
  • Split instruction (I$) and data (D$)
  • Typically 32-64KB each
• 2nd and 3rd level caches, L2 and L3 (hardware managed)
  • On-chip, typically made of SRAM
  • L2$ typically 256-512KB
  • “Last level cache” (LLC) 16-100MB
• Main memory (software managed, by the OS)
  • made of DRAM (“Dynamic” RAM)
  • ≥8GB for laptops, servers have TBs
• Disk (swap and files)
  • Uses magnetic disks or flash drives
Evolution of Cache Hierarchies
[figure: 1989 Intel 486 with 8KB unified I/D$, versus a modern chip with 64KB I$, 64KB D$, and 1.5MB L2, versus 2021 AMD Milan-X with 3D V-cache and 96MB total L3$ (plus on-die L3 tags)]
• Chips today are 30–70% cache by area
Caches



Logical Cache Organization
• Cache is a hardware hashtable
• The setup
  • 32-bit ISA → 4G addresses, 2^32 B address space
• Logical cache organization
  • 4KB, organized as 1K 4B blocks (aka lines)
  • Each block can hold a 4-byte word
• Physical cache implementation
  • 1K (1024 rows) by 4B SRAM
  • Called the data array
  • 10-bit address input
  • 32-bit data input/output
[diagram: 1024-row data array with addr and data ports]
Looking Up A Block
• Which 10 of the 32 address bits to use?
  • use bits [11:2]
• 2 least significant (LS) bits [1:0] are the offset bits
  • Locate byte within word
  • Don't need these to locate word
• Next 10 LS bits [11:2] are the index bits
  • These locate the word
  • These bits work best in practice
  • Why?
[diagram: index bits [11:2] drive the data array's addr port]
Every block has a designated spot



Is this the block you’re looking for?
• Each cache row corresponds to 2^20 blocks
• How to know which (if any) is currently there?
  • Tag each cache word with remaining address bits [31:12]
  • Build separate and parallel tag array
  • 1K by 21-bit SRAM
  • 20-bit (next slide) tag + 1 valid bit
• Lookup algorithm
  • Read tag indicated by index bits
  • If tag matches & valid bit set:
    then: Hit → data is good
    else: Miss → data is garbage, wait…
[diagram: tag array indexed by [11:2], tag compared against addr[31:12], hit signal gates the data output]
A Concrete Example
• Lookup address x000C14B8
  • Index = addr[11:2] = (addr >> 2) & x3FF = x12E
  • Tag = addr[31:12] = (addr >> 12) = x000C1
[figure: binary address 0000 0000 0000 1100 0001 0100 1011 10 00 split into tag x000C1, the index bits, and the 2-bit offset]


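Below is a minimal C sketch of this direct-mapped lookup, assuming the 4KB cache with 4B blocks from the preceding slides; the array and function names are illustrative, not from the slides.

  #include <stdint.h>
  #include <stdbool.h>

  /* 4KB direct-mapped cache, 4B blocks: 1024 frames, 2-bit offset,
     10-bit index [11:2], 20-bit tag [31:12]. */
  #define NUM_FRAMES 1024

  static uint32_t tag_array[NUM_FRAMES];   /* tag array */
  static bool     valid[NUM_FRAMES];       /* valid bits */
  static uint32_t data_array[NUM_FRAMES];  /* data array: one 4B word per frame */

  /* Returns true on a hit and fills *word; on a miss the controller must fill. */
  bool cache_lookup(uint32_t addr, uint32_t *word) {
    uint32_t index = (addr >> 2) & 0x3FF;  /* bits [11:2] */
    uint32_t tag   = addr >> 12;           /* bits [31:12] */
    if (valid[index] && tag_array[index] == tag) {
      *word = data_array[index];
      return true;                         /* hit */
    }
    return false;                          /* miss */
  }

For the address x000C14B8 above, this computes index x12E and tag x000C1, matching the slide.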
Cache operation: 1B block
• 8-bit addresses → 256B memory
  • Keeps diagrams simple
• 4B cache, 1B blocks
• Figure out number of sets: 4 (capacity / block-size)
• Figure out how address splits into offset/index/tag bits
  • Offset: least-significant log2(block-size) = log2(1) = 0 bits
  • Index: next log2(number-of-sets) = log2(4) = 2 bits
  • Tag: rest = 8 – 2 = 6 bits
• Address split: tag (6 bits) | index (2 bits)



Multi-Word Cache Blocks
• In most modern implementation we store more
than one address (>1 byte) in each cache block.
• The number of bytes or words stored in each
cache block is referred to as the block size.
• The entries in each block come from a contiguous
set of addresses to exploit locality of reference,
and to simplify indexing
• Cache blocks are also referred to as cache lines
• Related to cache frames – a frame is the bucket, and the
block is the data that goes in the bucket
• blocks move around due to fills & evictions
• frames are part of the cache structure and never move
Tag, Index, Block Offset
• Consider 8B cache with 2B blocks
• Figure out number of sets: 4 (capacity / block-size)
• Figure out how address splits into offset/index/tag bits
  • Offset: least-significant log2(block-size) = log2(2) = 1 bit
  • Index: middle log2(number-of-sets) = log2(4) = 2 bits
  • Tag: remaining high-order bits = 8 – 1 – 2 = 5 bits
• Address split: tag (5 bits) | index (2 bits) | block offset (1 bit)

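As a sketch of the recipe on this slide and the 1B-block slide, the helper below computes the field widths for a direct-mapped cache from its capacity and block size (both assumed to be powers of two); the function names are illustrative, not from the slides.

  #include <stdint.h>

  /* Integer log2 for a power-of-two x (illustrative helper). */
  static unsigned log2u(uint32_t x) {
    unsigned n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
  }

  /* Split an addr_bits-wide address into offset/index/tag field widths. */
  void split_fields(uint32_t capacity, uint32_t block_size, unsigned addr_bits,
                    unsigned *offset_bits, unsigned *index_bits, unsigned *tag_bits) {
    uint32_t num_sets = capacity / block_size;
    *offset_bits = log2u(block_size);
    *index_bits  = log2u(num_sets);
    *tag_bits    = addr_bits - *offset_bits - *index_bits;
  }

split_fields(8, 2, 8, …) gives offset = 1, index = 2, tag = 5, matching the slide above.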


example
• via http://comparchviz.com
• 8B DM cache with 2B blocks



Handling a Cache Miss
• What if requested data isn't in the cache?
• How does it get in there?
• Cache controller: finite state machine
• Remembers miss address
• Accesses next level of memory
• Waits for response
• Writes data/tag into proper locations
• Bringing a missing block into the cache is a cache
fill

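A hedged C sketch of that fill sequence, continuing the illustrative tag_array/data_array/valid arrays from the direct-mapped lookup sketch earlier; next_level_read() is a hypothetical stand-in for accessing the next level of memory (real hardware does this with a small finite state machine rather than a blocking call).

  #include <stdint.h>
  #include <stdbool.h>

  extern uint32_t tag_array[], data_array[];
  extern bool     valid[];
  extern uint32_t next_level_read(uint32_t addr);   /* hypothetical next-level access */

  /* Fill on a miss: remember the miss address, fetch the block,
     then write data and tag into the proper frame and mark it valid. */
  void handle_miss(uint32_t addr) {
    uint32_t index = (addr >> 2) & 0x3FF;
    uint32_t block = next_level_read(addr);  /* access next level, wait for response */
    data_array[index] = block;               /* write data */
    tag_array[index]  = addr >> 12;          /* write tag */
    valid[index]      = true;
  }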


Cache Misses and Pipeline Stalls
[pipeline diagram: fetch stage reads the I$, memory stage reads the D$; nops are injected while the pipeline is stalled]
• I$ and D$ misses stall pipeline just like data hazards
• Stall logic driven by miss signal
  • Cache “logically” re-evaluates hit/miss every cycle
  • Block is filled → miss signal goes low → pipeline restarts



Cache Performance Equation
• For a cache
  • Access: read or write to cache
  • Hit: desired data found in cache
  • Miss: desired data not found in cache
    • Must get from another component
    • No notion of “miss” in register file
  • Fill: action of placing data into cache
• %miss (miss-rate): #misses / #accesses
• taccess: time to check cache. If hit, all done
• tmiss: time to read data into cache
• Performance metric: average access time
  tavg = taccess + (%miss * tmiss)
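A quick worked example (the numbers are illustrative assumptions, roughly in line with the parameter table later in this unit, not measurements from the slides): for an L1 with taccess = 2ns, tmiss = 10ns, and %miss = 5%:
  tavg = taccess + (%miss * tmiss) = 2ns + (0.05 * 10ns) = 2.5ns
Halving the miss rate to 2.5% would bring tavg down to 2.25ns.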
Measuring Cache Performance
• Ultimate metric is tavg
• Cache capacity and circuits roughly determines taccess
• Lower-level memory structures determine tmiss
• Measure %miss
• Hardware performance counters
• Simulation



Capacity and Performance
• Simplest way to reduce %miss: increase capacity
  + Miss rate decreases monotonically
    • “Working set”: insns/data program is actively using
    • Diminishing returns
  – However taccess increases
    • Latency proportional to sqrt(capacity)
[plot: %miss vs. cache capacity, flattening out once capacity reaches the “working set” size]
• For a given capacity, adjust %miss via organization
Block Size
• Given capacity, manipulate %miss by changing organization
• One option: increase block size
  • Exploit spatial locality
  • Boundary between index and offset changes
  • Tag remains the same
• Ramifications
  + Reduce %miss (up to a point)
  + Reduce tag overhead (why?)
  – Potentially useless data transfer
  – Premature replacement of useful data
[diagram: 512-frame data array of 64B (512-bit) blocks; address split into tag [31:15], 9-bit index [14:6], and offset [5:0], with the offset selecting a word within the block]
Block Size and Tag Overhead
• 4KB cache with 1024 4B blocks?
  • 4B blocks → 2-bit offset, 1024 frames → 10-bit index
  • 32-bit address – 2-bit offset – 10-bit index = 20-bit tag
  • 20-bit tag / 32-bit block = 63% overhead
• 4KB cache with 512 8B blocks
  • 8B blocks → 3-bit offset, 512 frames → 9-bit index
  • 32-bit address – 3-bit offset – 9-bit index = 20-bit tag
  • 20-bit tag / 64-bit block = 32% overhead
  • Notice: tag size is the same, but data size is twice as big
• A realistic example: 64KB cache with 64B blocks
  • 16-bit tag / 512-bit block = ~3% overhead
• Note: tags are not optional
Effect of Block Size on Miss Rate
• Two effects on miss rate
  + Spatial prefetching (good)
    • For blocks with adjacent addresses
    • Turns miss/miss pairs into miss/hit pairs
  – Interference (bad)
    • For blocks with non-adjacent addresses (but in adjacent frames), large blocks turn hits into misses by disallowing simultaneous residence
    • Consider entire cache as one big block
• Both effects always present
  • Spatial prefetching dominates initially
  • Depends on size of the cache
  • Good block size is 32–256B; optimal size is program-dependent
[plot: %miss vs. block size, a U-shaped curve]



example
• 8B DM cache with 4B blocks



Block Size and Miss Penalty
• Does increasing block size increase tmiss?
• Don't larger blocks take longer to read, transfer, and fill?
• They do, but…
• tmiss of an isolated miss is not affected
• Critical Word First / Early Restart (CWF/ER)
• Requested word fetched first, pipeline restarts immediately
• Remaining words in block transferred/filled in the background
• tmiss'es of a cluster of misses will suffer
• Reads/transfers/fills of two misses can't happen at the same
time
• Latencies can start to pile up
• This is a bandwidth problem
Cache Conflicts
[diagram: 16-entry main memory, 4-bit addresses 0000–1111 holding values A–Q, mapped into a 4-set cache; address split: tag (1 bit) | index (2 bits) | offset (1 bit)]
• Pairs like “0010” and “1010” conflict
  • Same index!
• Can such pairs simultaneously reside in cache?
  • A: Yes, if we reorganize the cache to do so
example
• 8B DM cache with lots of conflicts



Associativity
• Set-associativity
  • Block can reside in one of a few frames
  • Frame groups called sets
  • Each frame in a set called a way
  • This is 2-way set-associative (SA)
  • 1-way → direct-mapped (DM)
  • 1-set → fully-associative (FA)
  + Reduces conflicts
  – Increases taccess: additional tag match & muxing
[diagram: two ways of 4B frames indexed by [10:2], each with its own tag comparator against addr[31:11]]
Associativity
• Lookup algorithm
  • Use index bits to find set
  • Read data/tags in all frames in parallel
  • Any (match and valid bit) is a hit
• Notice tag/index/offset bits
  • Only 9-bit index (versus 10-bit for direct-mapped)
[diagram: same 2-way structure, tag [31:11] compared in both ways in parallel]
2-way set-associative cache
• 8B cache, 2 ways, 2B blocks
• Figure out number of sets: 2 ((capacity / ways) / block-size)
• Figure out how address splits into offset/index/tag bits
  • Offset: least-significant log2(block-size) = log2(2) = 1 bit
  • Index: next log2(number-of-sets) = log2(2) = 1 bit
  • Tag: rest = 8 – 1 – 1 = 6 bits
• Address split: tag (6 bits) | index (1 bit) | block offset (1 bit)



example
• 8B 2-way cache with 4B blocks



Replacement Policies
• Set-associative caches present new design choice
• On cache miss, which block in set to replace (kick out)?
• Some options
• Random
• FIFO (first-in first-out)
• LRU (least recently used)
• Fits with temporal locality, LRU = least likely to be used in
future
• NMRU (not most recently used)
• An easier to implement approximation of LRU
• Equivalent to LRU for 2-way set-associative caches
• Belady's: replace block that will be used furthest in future
• Unachievable optimum



LRU and Miss Handling
[diagram: 2-way data array (frames 0–511 in way 0, 512–1023 in way 1), parallel tag comparators, a per-set LRU field, and fill data arriving from memory]
• Add LRU field to each set
  • “Least recently used”
  • LRU data is an encoded “way”
  • Hit? update MRU
  • LRU bits updated on each access
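A minimal C sketch of a 2-way set-associative lookup with a one-bit-per-set LRU policy; the sizes and names are illustrative, not taken from the slides.

  #include <stdint.h>
  #include <stdbool.h>

  #define NUM_SETS 512
  #define NUM_WAYS 2

  static uint32_t sa_tags[NUM_SETS][NUM_WAYS];
  static bool     sa_valid[NUM_SETS][NUM_WAYS];
  static uint8_t  sa_lru[NUM_SETS];   /* way number of the least recently used way */

  /* Check both ways in the set; on a hit, update the LRU bit. */
  bool sa_lookup(uint32_t set, uint32_t tag, unsigned *hit_way) {
    for (unsigned w = 0; w < NUM_WAYS; w++) {
      if (sa_valid[set][w] && sa_tags[set][w] == tag) {
        *hit_way = w;
        sa_lru[set] = 1 - w;          /* the other way is now least recently used */
        return true;
      }
    }
    return false;
  }

  /* On a miss, the LRU way is the victim to replace. */
  unsigned sa_choose_victim(uint32_t set) {
    return sa_lru[set];
  }

With two ways, this single bit is both NMRU and exact LRU, as noted on the Replacement Policies slide.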
Associativity and Performance
• Higher-associativity caches
  + Have better (lower) %miss
    • Diminishing returns
  – However taccess increases
    • The more associative, the slower
  • What about tavg?
[plot: %miss vs. associativity, with diminishing returns at modest associativity]
• Block-size and number of sets should be powers of two
  • Makes indexing easier (just rip bits out of the address)
  • 3-way set-associativity? No problem


Cache Glossary

[labeled diagram: tag array and data array; a frame holds a block/line plus its valid bit and tag; a set/row spans the ways; the address splits into tag | index | block offset (displacement)]



What About Stores?



Handling stores
• So far we have looked at reading from cache
• Instruction fetches, loads
• What about writing into cache?

• Several new issues


• Tag/data access
• Write-through vs. write-back
• Write-allocate vs. write-not-allocate
• Hiding write miss latency



Tag/Data Access
• Reads: read tag and data in parallel
  • Tag mis-match → data is wrong (OK, just stall until good data arrives)
• Writes: read tag, write data in parallel? No! Why not?
  • Tag mis-match → clobbered data (oops!)
  • For associative caches, which way was written into?
• Writes are a two-step (multi-cycle) process
  • Step 1: match tag
  • Step 2: write to matching way
  • Bypass (with address check) to avoid load stalls
  • May introduce structural hazards
Write Propagation
• When to propagate new value to lower-level
caches/memory?
• Option #1: Write-through: immediately
• On hit, update cache
• Immediately send the write to the next level
• Option #2: Write-back: when block is replaced
• Now we have multiple versions of the same block in various
caches and in memory!
• Requires additional “dirty” bit per block
• Evict clean block: no extra traffic
• there was only 1 version of the block
• Evict dirty block: extra “writeback” of block
• the dirty block is the most up-to-date version
Write-back Cache Operation
• Each cache block has an associated dirty bit
  • state is either clean or dirty

                       state  valid bit  tag  block
  initial state          -        I       -     -
  after lw r0 <= [A]     C        V       A   0x01
  after sw r1 => [A]     D        V       A   0x02
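A hedged C sketch of write-back, write-allocate store handling for one direct-mapped frame; writeback_to_next_level() and fill_from_next_level() are hypothetical stand-ins for the next-level interface, not from the slides.

  #include <stdint.h>
  #include <stdbool.h>

  typedef struct {
    bool valid, dirty;
    uint32_t tag;
    uint32_t data;                     /* one word per block, for simplicity */
  } frame_t;

  extern void     writeback_to_next_level(uint32_t tag, uint32_t data);  /* hypothetical */
  extern uint32_t fill_from_next_level(uint32_t tag);                    /* hypothetical */

  void store_word(frame_t *f, uint32_t tag, uint32_t value) {
    if (f->valid && f->tag == tag) {   /* write hit: update and mark dirty */
      f->data  = value;
      f->dirty = true;
      return;
    }
    /* write miss, write-allocate: evict (writing back if dirty), fill, then write */
    if (f->valid && f->dirty)
      writeback_to_next_level(f->tag, f->data);
    f->tag   = tag;
    f->data  = fill_from_next_level(tag);  /* fill (with 1-word blocks the store overwrites it) */
    f->data  = value;
    f->valid = true;
    f->dirty = true;
  }

This mirrors the table above: a load fill leaves the block clean (C), and the first store flips it to dirty (D).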


Write-backs across caches
• When a dirty block is evicted to a lower-level cache, it remains dirty
• Writing a block back to memory cleanses it
  • There are never dirty blocks in memory, only in caches
[diagram: dirty block (D V A 0x02) evicted from L1 into L2, then evicted from L2 to memory, where only the data 0x02 remains (no dirty/valid bits or tag needed) and the block is cleansed]



Optimizing Writebacks
+ Writeback buffer (WBB):
  • Hide latency of writebacks (keep them off the critical path)
  • Step #1: Send “fill” request to next level
  • Step #2: While waiting, write dirty block to buffer
  • Step #3: When the new block arrives, put it into the cache
  • Step #4: Write buffer contents to next level
[diagram: cache → WBB → next-level cache, with steps 1–4 annotated]



Write Propagation Comparison
• Write-through
– Requires additional bus bandwidth
• Consider repeated write hits
– Next level must handle small writes (1, 2, 4, 8-bytes)
+ No need for dirty bits in cache
+ No need to handle “writeback” operations
• Simplifies miss handling
• Used in GPUs, as they have low write temporal locality
• Write-back
+ Key advantage: uses less bandwidth
• Reverse of other pros/cons above
• Used in most CPU designs
Write Miss Handling
• How is a write miss actually handled?
• Write-allocate: fill block from next level, then
write it
+ Decreases read misses (next read to block will hit)
– Requires additional bandwidth
• Commonly used (especially with write-back caches)
• Write-non-allocate: just write to next level,
don’t allocate a block
– Potentially more read misses
+ Uses less bandwidth
• Use with write-through



Write Misses and Store Buffers
• Read miss?
  • Load can't go on without the data, it must stall
• Write miss?
  • no instruction is waiting for the data, so why stall?
• Store buffer: a small buffer
  • Stores put addr/value in store buffer, keep going
  • Store buffer writes stores to D$ in the background
  • Loads must search store buffer (in addition to D$)
  + Eliminates stalls on write misses (mostly)
  – Creates some problems for multicore (later)
• Store buffer vs. writeback buffer
  • Store buffer: “in front” of D$, hides store misses
  • Writeback buffer: “behind” D$, hides writebacks
[diagram: Processor → SB → Cache → WBB → next-level cache]
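A minimal C sketch of a store buffer with load forwarding; the size and names are illustrative, and draining entries into the D$ in the background is left out for brevity.

  #include <stdint.h>
  #include <stdbool.h>

  #define SB_ENTRIES 8

  typedef struct { uint32_t addr, value; bool valid; } sb_entry_t;
  static sb_entry_t sb[SB_ENTRIES];
  static unsigned sb_next;                 /* next slot to allocate */

  /* Stores record addr/value and retire immediately; the D$ write happens later. */
  void sb_store(uint32_t addr, uint32_t value) {
    sb[sb_next] = (sb_entry_t){ addr, value, true };
    sb_next = (sb_next + 1) % SB_ENTRIES;
  }

  /* Loads search the store buffer, newest entry first, before going to the D$. */
  bool sb_load_forward(uint32_t addr, uint32_t *value) {
    for (unsigned i = 0; i < SB_ENTRIES; i++) {
      unsigned idx = (sb_next + SB_ENTRIES - 1 - i) % SB_ENTRIES;
      if (sb[idx].valid && sb[idx].addr == addr) {
        *value = sb[idx].value;
        return true;
      }
    }
    return false;                          /* not found: read the D$ */
  }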
Improving Effectiveness of Memory
Hierarchy



Classifying Misses: 3C Model
• Divide cache misses into three categories
• Compulsory (cold): never seen this address before
• Would miss even in infinite cache
• Capacity: miss caused because cache is too small
• Would miss even in fully associative cache
• Identify? Consecutive accesses to block separated by
access to at least N other distinct blocks (N is number of
frames in cache)
• Conflict: miss caused because cache associativity is too low
• Identify? All other misses
• (Coherence): miss due to external invalidations
• Only in shared memory multiprocessors (later)



Miss Rate: ABC
• Why do we care about 3C miss model?
• So that we know what to do to eliminate misses
• If you don't have conflict misses, increasing associativity
won't help
• More Associativity (assuming fixed capacity)
+ Decreases conflict misses
– Increases taccess
• Larger Block Size (assuming fixed capacity)
– Increases conflict/capacity misses (fewer frames)
+ Decreases compulsory misses (spatial locality)
• No significant effect on taccess
• More Capacity
+ Decreases capacity misses
– Increases taccess
Victim Buffers for conflict misses
• Conflict misses: not enough associativity
  • High associativity is expensive, but also rarely needed
  • e.g., 3 blocks mapping to the same 2-way set and accessed as (XYZ)+
• Victim buffer (VB): small fully-associative cache
  • Sits on the I$/D$ miss path
  • Small (e.g., 8 entries) so very fast
  • Blocks kicked out of I$/D$ placed in VB
  • On miss, check VB: hit? Place block back in I$/D$
  • 8 extra ways, shared among all sets
  + Only a few sets will need it at any given time
  + Very effective in practice
[diagram: I$/D$ with the VB on the miss path to L2]


Prefetching
• Bring data into cache proactively/speculatively
  • If successful, reduces number of cache misses
• Key: anticipate upcoming miss addresses accurately
  • Can do in software or hardware
• Simple hardware prefetching: next-block prefetching
  • Miss on address X → anticipate miss on X+block-size
  + Works for insns: sequential execution
  + Works for data: arrays
• Table-driven hardware prefetching
  • Use a predictor to detect strides, common patterns
• Effectiveness determined by:
  • Timeliness: initiate prefetches sufficiently in advance
  • Coverage: prefetch for as many misses as possible
  • Accuracy: don't pollute with unnecessary data
[diagram: prefetch logic sits alongside the I$/D$, issuing requests to L2]


Software Prefetching
• Use a special “prefetch” instruction
• Tells the hardware to bring in data
• Just a hint
• Inserted by programmer or compiler
• Example
int tree_add(tree_t* t) {
  if (t == NULL) return 0;
  __builtin_prefetch(t->left);
  __builtin_prefetch(t->right);
  return t->val + tree_add(t->right) + tree_add(t->left);
}

• Multiple prefetches bring multiple blocks in parallel


• More “Memory-level” parallelism (MLP)
Software Restructuring: Data
• Capacity misses: poor spatial or temporal locality
• Several code restructuring techniques to improve both
– Compiler must know that restructuring preserves semantics
• Loop interchange: spatial locality
• Example: row-major matrix: X[i][j] then X[i][j+1]
• Poor code: X[i][j] followed by X[i+1][j]
  for (j = 0; j < NCOLS; j++)
    for (i = 0; i < NROWS; i++)
      sum += X[i][j];
• Better code
  for (i = 0; i < NROWS; i++)
    for (j = 0; j < NCOLS; j++)
      sum += X[i][j];
Software Restructuring: Data
• Loop blocking: temporal locality
• Poor code
  for (k = 0; k < NUM_ITERATIONS; k++)
    for (i = 0; i < NUM_ELEMS; i++)
      X[i] = f(X[i]);
• Better code
  • Cut array into CACHE_SIZE chunks
  • Run all phases on one chunk, proceed to next chunk
  for (i = 0; i < NUM_ELEMS; i += CACHE_SIZE)
    for (k = 0; k < NUM_ITERATIONS; k++)
      for (j = 0; j < CACHE_SIZE; j++)
        X[i+j] = f(X[i+j]);

– Assumes you know CACHE_SIZE, but do you?



Software Restructuring: Code
• Compiler can improve code temporal/spatial locality
  • If (a) { code1; } else { code2; } code3;
  • But, code2 case never happens (say, error condition)
  • Lay out code1 and code3 together and move code2 out of line, after code3 (a sketch follows below)
    + Better locality, including for the code after code3
    + Fewer taken branches
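One way to express this hint in source, as a hedged example: GCC/Clang's __builtin_expect lets the programmer mark the error path as unlikely so the compiler can lay it out out of line; the function below is hypothetical, not from the slides.

  #include <stdio.h>
  #include <stdlib.h>

  void process(int a) {
    if (__builtin_expect(a != 0, 1)) {     /* likely path: code1 */
      printf("common case\n");
    } else {                               /* unlikely path: code2 (error) */
      fprintf(stderr, "error\n");
      exit(1);
    }
    printf("code3 runs next\n");           /* code3, falls through from code1 */
  }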
Cache Hierarchies



Designing a Cache Hierarchy
• For any memory component: taccess vs. %miss tradeoff
• Upper components (I$, D$) emphasize low taccess
  • Frequent access → taccess important
  • tmiss is not bad → %miss less important
  • Lower capacity and lower associativity (to reduce taccess)
  • Small-medium block size (to reduce conflicts)
• Moving down (L2, L3) emphasis turns to %miss
  • Infrequent access → taccess less important
  • tmiss is bad → %miss important
  • High capacity, associativity, and block size (to reduce %miss)



Memory Hierarchy Parameters
Parameter I$/D$ L2 L3 Main Memory
taccess 2ns 10ns 30ns 100ns
tmiss 10ns 30ns 100ns 10ms (10M ns)
Capacity 32KB–64KB 256-512KB 16-100MB GBs-TBs
Block size 16B–64B 32B–128B 32B-256B 4KB-1GB
Associativity 2-8 4–16 4-16 n/a

• Some other design parameters


• Split vs. unified insns/data
• Inclusion vs. exclusion vs. nothing



Split vs Unified Caches
• Split I$/D$: insns and data in different caches
• To minimize structural hazards and taccess
• Larger unified I$/D$ would be slow, 2nd port even slower
• Optimize I$ and D$ separately
• No writes for I$, smaller reads for D$
• Why is 486 I/D$ unified?
• Unified L2, L3: insns and data together
• To minimize %miss
+ Fewer capacity misses: unused insn capacity is used for data
– More conflict misses: insn/data conflicts
• A much smaller effect in large caches
• Go further: unify L3 of multiple cores in a multi-core
Inclusive vs Exclusive Caches
• Inclusion
• Bring block from memory into L2 then L1
• A block in the L1 is always in the L2
• If block evicted from L2, must also evict it from L1
• Why? more on this when we talk about multicore
• Exclusion
• Bring block from memory into L1 but not L2
• Move block to L2 on L1 eviction
• L2 becomes a large victim cache
• Block is either in L1 or L2 (never both)
• Good if L2 is small relative to L1
• Example: AMD's Duron 64KB L1s, 64KB L2
Memory Performance Equation
[diagram: CPU above component M, with taccess, %miss, and tmiss annotated]
• For memory component M
  • Access: read or write to M
  • Hit: desired data found in M
  • Miss: desired data not found in M
    • Must get from another (slower) component
  • Fill: action of placing data in M
• %miss (miss-rate): #misses / #accesses
• taccess: time to read data from (write data to) M
• tmiss: time to read data into M
• Performance metric
  • tavg: average access time
    tavg = taccess + (%miss * tmiss)
Hierarchy Performance
[diagram: CPU → M1 → M2 → M3 → M4, where each level's tmiss is the next level's tavg]
tavg = tavg-M1
     = tacc-M1 + (%miss-M1 * tmiss-M1)
     = tacc-M1 + (%miss-M1 * tavg-M2)
     = tacc-M1 + (%miss-M1 * (tacc-M2 + (%miss-M2 * tmiss-M2)))
     = tacc-M1 + (%miss-M1 * (tacc-M2 + (%miss-M2 * tavg-M3)))
     = …
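A quick worked instance of this recursion, using the taccess values from the parameter table and assumed miss rates (10% of L1 accesses miss, and 20% of those also miss in L2; the rates are illustrative, not from the slides):
  tavg = tacc-L1 + (%miss-L1 * (tacc-L2 + (%miss-L2 * tmem)))
       = 2ns + (0.10 * (10ns + (0.20 * 100ns)))
       = 2ns + (0.10 * 30ns) = 5ns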


Miss Rates: per-access vs per-insn
• Miss rates can be expressed two ways:
• Misses per “instruction” (or instructions per miss), -or-
• Misses per “cache access” (or accesses per miss)
• For first-level caches, use insn mix to convert
• If memory ops are 1/3 of instructions…
• 2% of insns miss (1 in 50) is 6% of “accesses” miss (1 in 17)
• What about second-level caches?
• Misses per insn still straight-forward (“global” miss rate)
• Misses per “access” is trickier (“local” miss rate)
• Depends on number of accesses (which depends on L1
rate!)
• L1 acts as a filter for L2 accesses
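A small illustrative example of the local/global distinction (the rates are assumptions, not from the slides): if 6% of accesses miss in the L1 and one third of those also miss in the L2, the L2's local miss rate is 33%, while its global rate is only 2% of accesses (roughly 0.7% of instructions under the 1/3-memory-ops mix above).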
Main Memory As A Cache
Parameter I$/D$ L2 L3 Main Memory
taccess 2ns 10ns 30ns 100ns
tmiss 10ns 30ns 100ns 10ms (10M ns)
Capacity 32KB–64KB 256-512KB 16-100MB GBs-TBs
Block size 16B–64B 32B–128B 32B-256B 4KB-1GB
Associativity 2-8 4–16 4-16 full
Replacement LRU LRU LRU “working set”
Prefetching? yes yes yes sw-managed

• How would you internally organize main memory?


• tmiss is outrageously long, reduce %miss at all costs
• Full associativity: isn't that difficult to implement?
• Yes in hardware, but main memory is software-managed



Summary
[diagram: applications on top of system software, running on Mem / CPU / I/O]
• Average access time of a memory component
  • tavg = taccess + (%miss * tmiss)
  • Hard to get low taccess and %miss in one structure → build a hierarchy instead
• Memory hierarchy
  • Cache (SRAM) → memory (DRAM) → swap (Disk)
  • Smaller, faster, more expensive → bigger, slower, cheaper
• Cache ABCs (associativity, block size, capacity)
  • 3C miss model: compulsory, capacity, conflict
• Performance optimizations
  • %miss: prefetching
  • tmiss: victim buffer, critical-word-first
• Write issues
  • Write-back vs. write-through, write-allocate vs. write-no-allocate
