Memory 2
Memory Hierarchy
o Motivation
m Exploiting locality to provide a large, fast, and inexpensive memory
Cache Basics
o A cache is a high-speed buffer between the CPU and main memory
o Memory is divided into blocks
m Q1: Where can a block be placed in the upper
level? (Block placement)
m Q2: How is a block found if it is in the upper
level? (Block identification)
m Q3: Which block should be replaced on a
miss? (Block replacement)
m Q4: What happens on a write? (Write strategy)
Q1: Block Placement
o Fully associative, direct mapped, or set associative
m Example: block 12 placed in an 8-block cache:
n Mapping = block number modulo number of sets
n Direct mapped: (12 mod 8) = block 4
n 2-way set associative: (12 mod 4) = set 0
n Fully associative: any of the 8 cache blocks
[Figure: block 12 of a 32-block memory mapped into an 8-block cache under each organization]
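A minimal sketch of the mapping arithmetic above (illustrative code, not from the slide; variable names are made up):

/* Where may memory block 12 go in an 8-block cache? */
void placement_example(void) {
    int block = 12;
    int direct_mapped_index = block % 8;  /* 8 sets of 1 block:  12 mod 8 = block 4 */
    int two_way_set_index   = block % 4;  /* 4 sets of 2 blocks: 12 mod 4 = set 0   */
    /* fully associative: a single set of 8 blocks, so block 12 may go anywhere */
}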
Q2: Block Identification
o Tag on each block
m No need to check index or block offset
o Increasing associativity ⇒ shrinks the index ⇒ expands the tag
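A minimal sketch of the address breakdown (illustrative; the 64-byte block and 64-set sizes are assumptions, not from the slide):

/* Address split into offset | index | tag for a 32-bit address,
   64-byte blocks (6 offset bits) and 64 sets (6 index bits). */
void identify_example(unsigned int addr) {
    unsigned int offset = addr & 0x3F;         /* bits 5..0                        */
    unsigned int index  = (addr >> 6) & 0x3F;  /* bits 11..6, selects the set      */
    unsigned int tag    = addr >> 12;          /* bits 31..12, stored with the block */
    /* Doubling the associativity halves the number of sets: the index
       loses one bit and the tag gains one. */
}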
Q3: Block Replacement
o Easy for direct-mapped caches
o Set associative or fully associative:
m Random
n Easy to implement
m LRU (Least Recently Used)
n Relies on the past to predict the future; harder to implement
m FIFO
n Roughly approximates LRU
m Not Recently Used (NRU)
n Maintain reference and dirty bits; clear the reference bits periodically;
divide blocks into four categories and choose a victim from the lowest
non-empty category
m Optimal replacement?
n Label each block in the cache with the number of instructions to be
executed before that block is next referenced, then replace the block
with the highest label
n Unrealizable: it requires knowing the future!
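A minimal sketch of LRU bookkeeping for one set (illustrative only; real caches usually approximate this with a few bits per block rather than full counters):

/* LRU victim selection for a 4-way set: last_used[i] holds the value of a
   global access counter at the block's most recent use; the victim is the
   block with the smallest (oldest) value. */
#define WAYS 4
unsigned long last_used[WAYS];

int lru_victim(void) {
    int victim = 0;
    for (int i = 1; i < WAYS; i++)
        if (last_used[i] < last_used[victim])
            victim = i;
    return victim;
}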
Q4: Write Strategy
o Write-Through
m Policy: data written to the cache block is also written to lower-level memory
m Implementation: easy
m Do read misses produce writes? No
m Do repeated writes make it to the lower level? Yes
o Write-Back
m Policy: write data only to the cache; update the lower level when the
block falls out of (is evicted from) the cache
m Implementation: hard
m Do read misses produce writes? Yes (a dirty victim must be written back)
m Do repeated writes make it to the lower level? No
Write Buffers
[Figure: the processor writes into the cache and into a write buffer; the buffer drains to lower-level memory so the processor does not wait for the slower level]
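A rough sketch contrasting the two policies on a store hit (illustrative names; write_buffer_enqueue is a hypothetical helper, not a real API):

/* Store-hit handling under the two write policies (sketch only). */
struct block { int dirty; /* plus data, tag, valid ... */ };

void write_buffer_enqueue(void);   /* hypothetical: queue the write; the buffer
                                      drains to lower-level memory on its own */

void store_hit_write_back(struct block *b) {
    /* ... update the cached data ... */
    b->dirty = 1;            /* lower level updated only when the block is evicted */
}

void store_hit_write_through(struct block *b) {
    /* ... update the cached data ... */
    write_buffer_enqueue();  /* lower level updated on every write, but the
                                processor does not stall for it */
}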
Performance Example
o Two data caches (assume one clock cycle for hit)
m I: 8KB, 44% miss rate, 1ns hit time
m II: 64KB, 37% miss rate, 2ns hit time
m Miss penalty: 60ns, 30% memory accesses
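m A rough check with AMAT = Hit time + Miss rate × Miss penalty (charging the full 60ns penalty to every miss):
n Cache I: 1ns + 0.44 × 60ns = 27.4ns
n Cache II: 2ns + 0.37 × 60ns = 24.2ns
n So the larger cache II has the lower average access time despite its slower hit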
Miss Penalty in OOO Environment
o In processors with out-of-order execution
m Memory accesses can overlap with other
computation
m Latency of memory accesses is not always
fully exposed
Cache Performance Optimizations
o Performance formulas
m AMAT = T_hit + Miss rate × T_miss_penalty
o Reducing miss rate
m Change cache configurations, compiler optimizations
o Reducing hit time
m Simple cache, fast access and address translation
o Reducing miss penalty
m Multilevel caches, read and write policies
o Taking advantage of parallelism
m Cache serving multiple requests simultaneously
m Prefetching
Cache Miss Rate
o Three C’s
o Compulsory misses (cold misses)
m The first access to a block: miss regardless of cache
size
o Capacity misses
m Cache too small to hold all data needed
o Conflict misses
m More blocks mapped to a set than the associativity
o Reducing miss rate
m Larger block size (compulsory)
m Larger cache size (capacity, conflict)
m Higher associativity (conflict)
m Compiler optimizations (all three)
Miss Rate vs. Block Size
Reducing Cache Miss Penalty
o A difficult decision is
m whether to make the cache hit time fast, to keep pace with the high
clock rate of processors,
m or to make the cache large to reduce the gap between the
processor accesses and main memory accesses.
o Solution:
m Use multi-level cache:
n The first-level cache can be small enough to match a fast clock
cycle time.
n The second-level (and lower) caches can be large enough to capture many
accesses that would otherwise go to main memory.
n Multilevel caches are also more power-efficient than a single large cache.
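n One common way to quantify this (not on the slide): AMAT = Hit time_L1 + Miss rate_L1 × (Hit time_L2 + Miss rate_L2 × Miss penalty_L2)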
Compiler Optimizations for Cache
o Increasing locality of programs
m Temporal locality, spatial locality
o Rearrange code
m Targeting instruction cache directly
m Reorder instructions based on the set of data accessed
o Reorganize data
m Padding to eliminate conflicts (see the sketch after this list):
n Change the addresses of two variables so that they do not map to
the same cache location
n Change the size of an array via padding
m Group data that tend to be accessed together in one block
o Example optimizations
m Merging arrays, loop interchange, loop fusion
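A minimal padding sketch (illustrative sizes, not from the slide): when rows are a power of two apart, a column walk can keep hitting the same cache sets; a small pad shifts successive rows to different sets.

/* Before: rows are a power-of-two number of bytes apart */
int a[1024][1024];
/* After: the pad of 8 elements is an arbitrary illustrative choice */
int a_padded[1024][1024 + 8];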
Merging Arrays
/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];
/* After: 1 array of structures */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];
o Improves spatial locality
m If val[i] and key[i] tend to be accessed together
o Reduces conflicts between val and key
Loop Interchange
o Idea: switch the nesting order of two or more loops so that the data
are accessed in the order they are stored in memory
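The slide gives no code, so here is a typical before/after sketch (illustrative, assuming a row-major array x): interchanging the loops turns a stride-N column walk into a stride-1 row walk, improving spatial locality.

/* Before: inner loop walks down a column (stride N in memory) */
for (j = 0; j < N; j = j+1)
    for (i = 0; i < N; i = i+1)
        x[i][j] = 2 * x[i][j];
/* After: inner loop walks along a row (stride 1 in memory) */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        x[i][j] = 2 * x[i][j];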
Loop Fusion
o Takes multiple compatible loop nests and
combines their bodies into one loop nest
m Is legal if no data dependences are reversed
o Improves locality directly by merging accesses to
the same cache line into one loop iteration
/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1){
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }
Seminar
o Pipelining Cache
o Prefetching Cache
Main Memory Background
o Main memory performance
m Latency: cache miss penalty
n Access time: time between the request and the arrival of the word
n Cycle time: minimum time between successive requests
m Bandwidth: matters for multiprocessors, I/O, and the miss penalty of
large blocks
o Main memory technology
m Memory is DRAM: Dynamic Random Access Memory
n Dynamic: must be refreshed periodically
n Reads are destructive, so data must be written back after being read
n Concerned with cost per bit and capacity
m Cache is SRAM: Static Random Access Memory
n Concerned with speed and capacity
Memory vs. Virtual Memory
o Analogy to cache
m Size: cache << memory << address space
m Both provide the illusion of a big, fast memory by exploiting locality
Four Memory Hierarchy Questions
o Where can a block be placed in main memory?
m OS allows block to be placed anywhere: fully
associative
n No conflict misses
o Which block should be replaced?
m An approximation of LRU: true LRU too costly and
adds little benefit
n A reference bit is set when a page is accessed
n The bit is periodically shifted into a history register
n When replacing, choose the page with the smallest value in its history
register (see the sketch after this list)
o What happens on a write?
m Write back: write through is prohibitively expensive
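A minimal sketch of the aging scheme described above (illustrative; the 8-bit history register and page count are assumptions):

/* Aging approximation of LRU for page replacement. */
#define NPAGES 4
unsigned char history[NPAGES];   /* 8-bit history register per page    */
unsigned char ref_bit[NPAGES];   /* set by hardware when page is used  */

void age_tick(void) {                         /* run once per period */
    for (int p = 0; p < NPAGES; p++) {
        history[p] = (history[p] >> 1) | (ref_bit[p] << 7);
        ref_bit[p] = 0;                       /* clear for the next period */
    }
}

int pick_victim(void) {                       /* smallest history = least recently used */
    int victim = 0;
    for (int p = 1; p < NPAGES; p++)
        if (history[p] < history[victim])
            victim = p;
    return victim;
}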
Four Memory Hierarchy Questions
o How is a block found in main memory?
m Use page table to translate virtual address into
physical address
• 32-bit virtual address, 4KB page size, 4 bytes per page table entry: how large is the page table?
• (2^32 / 2^12) × 2^2 = 2^22 bytes, or 4MB
Fast Address Translation
o Motivation
m Page table is too large to be stored in cache
n May even span multiple pages itself
m Multiple page table levels
o Solution: exploit locality and cache recent
translations
m Example: four page table levels (see the index sketch below)
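A rough sketch of the index arithmetic for a four-level table (illustrative; the 4KB page and 9-bit per-level indexes are assumed, x86-64-style, and are not specified on the slide):

/* Split of a virtual address for a four-level page-table walk. */
void split_va(unsigned long va) {
    unsigned long offset = va & 0xFFF;          /* bits 11..0          */
    unsigned long l1     = (va >> 12) & 0x1FF;  /* lowest-level index  */
    unsigned long l2     = (va >> 21) & 0x1FF;
    unsigned long l3     = (va >> 30) & 0x1FF;
    unsigned long l4     = (va >> 39) & 0x1FF;  /* top-level index     */
    /* Each index selects one of 512 entries at its level; the final entry
       holds the physical page frame number, combined with the offset. */
}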
Fast Address Translation
o TLB: translation look-aside buffer
m A special fully-associative cache for recent translations
m Tag: virtual address
m Data: physical page frame number, protection field,
valid bit, use bit, dirty bit
o Translation
m Send the virtual page number to all tags
m Check for protection violations
m The matching tag sends out the physical page frame number
m Combine it with the page offset to form the full physical address
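A minimal sketch of that lookup (illustrative structure, not a real TLB implementation; the 4KB page size and 32 entries are assumptions):

/* Fully-associative TLB lookup. */
#define TLB_ENTRIES 32

struct tlb_entry {
    unsigned long vpn;     /* tag: virtual page number          */
    unsigned long pfn;     /* data: physical page frame number  */
    int valid;             /* plus protection, use, dirty bits  */
};

struct tlb_entry tlb[TLB_ENTRIES];

/* Returns 1 on a hit and fills *pa; 0 means a miss (walk the page table). */
int tlb_lookup(unsigned long va, unsigned long *pa) {
    unsigned long vpn    = va >> 12;
    unsigned long offset = va & 0xFFF;
    for (int i = 0; i < TLB_ENTRIES; i++)   /* hardware compares all tags in parallel */
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *pa = (tlb[i].pfn << 12) | offset;
            return 1;
        }
    return 0;
}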
Virtual Memory and Cache
o Physical cache: index cache using physical
address
m Address translation must always happen before the cache access
m Simple to implement, but translation adds to the hit time (a performance issue)
Virtual Memory and Cache
o Physical cache (PIPT)
[Figure: the processor core sends the virtual address (VA) to the TLB; the resulting physical address (PA) indexes the physical cache; a hit returns the cache line, a miss goes to main memory]
• Slow: address translation sits on the critical path of every cache access
Advantages of Virtual Memory
o Translation
m Program can be given a consistent view of memory,
even though physical memory is scrambled
m Only the most important part of the program (its “working set”) must
be in physical memory
o Protection
m Different threads/processes protected from each other
m Different pages can be given special behavior
n Read only, invisible to user programs, etc.
m Kernel data protected from user programs
m Very important for protection from malicious programs
o Sharing
m Can map same physical page to multiple users