5 Memory Hierarchy

MEMORY HIERARCHY
Memory
Just an “ocean of bits”
Many technologies are available
Key issues
Technology (how bits are stored)
Placement (where bits are stored)
Identification (finding the right bits)
Replacement (finding space for new bits)
Write policy (propagating changes to bits)
Must answer these regardless of memory type
TYPICAL MEMORY HIERARCHY

[Figure: typical memory hierarchy]

MEMORY SYSTEMS

[Figure: processor, memory, and input/output]
How can we supply the CPU with enough data to keep it busy?
We will focus on memory issues, which are frequently bottlenecks that limit the performance of a system.
[Figure: hierarchy levels — on-chip SRAM, off-chip SRAM, DRAM, disk]
WHY MEMORY HIERARCHY?
Need lots of bandwidth
Need lots of storage
64MB (minimum) to multiple TB
Must be cheap per bit
(TB x anything) is a lot of money!
These requirements seem incompatible
WHY MEMORY HIERARCHY?
Fast and small memories
Enable quick access (fast cycle time)
Enable lots of bandwidth (1+ L/S/I-fetch/cycle)
Slower larger memories
Capture larger share of memory
Still relatively fast
Slow huge memories
Hold rarely-needed state
Needed for correctness
All together: provide the appearance of a large, fast memory at the cost of a cheap one
WHY DOES A HIERARCHY WORK?
Locality of reference
Temporal locality
Reference same memory location repeatedly
Spatial locality
Reference near neighbors around the same time
Empirically observed
Significant!
Even small local storage (8KB) often satisfies >90% of references to a multi-MB data set
WHY LOCALITY?
Analogy:
Library (Disk)
Bookshelf (Main memory)
Stack of books on desk (off-chip cache)
Opened book on desk (on-chip cache)
Likelihood of:
Referring to same book or chapter again?
Probability decays over time
Book moves to bottom of stack, then bookshelf, then library
Referring to chapter n+1 if looking at chapter n?
MEMORY HIERARCHY

Temporal locality: keep recently referenced items at higher levels, so future references are satisfied quickly
Spatial locality: bring neighbors of recently referenced items to higher levels, so future references are satisfied quickly

[Figure: CPU → I & D L1 caches → shared L2 cache → main memory → disk]
CACHE
INTRODUCING CACHES

[Figure: a small, fast cache sits between the CPU and main memory]
THE PRINCIPLE OF LOCALITY
Why does the hierarchy work?
Because most programs exhibit locality,
which the cache can take advantage of.
The principle of temporal locality says that if a
program accesses one memory address, there is a
good chance that it will access the same address
again.
The principle of spatial locality says that if a
program accesses one memory address, there is a
good chance that it will also access other nearby
addresses.
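To make the two principles concrete, here is a small illustrative sketch (Python, with hypothetical sizes; CPython's interpreter overhead hides the raw cache effects, but the access patterns are what matter):

    # Locality sketch (illustrative; sizes hypothetical).
    N = 1 << 20
    a = list(range(N))

    # Spatial locality: a sequential sweep touches neighboring addresses,
    # so every element of each fetched cache block gets used.
    total = 0
    for i in range(N):
        total += a[i]

    # Temporal locality: the small working set (total, i) is re-referenced
    # on every iteration and stays resident in the fastest level.

    # Poor spatial locality: a large stride touches one element per block
    # and wastes the rest of each fetched block.
    total2 = 0
    for i in range(0, N, 16):
        total2 += a[i]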
HOW CACHES TAKE ADVANTAGE OF LOCALITY

[Figure: CPU, cache, and main memory]
A cache miss
occurs if the cache does not contain the requested data. This is bad, since the
CPU must then wait for the slower main memory.
A SIMPLE CACHE DESIGN

Caches are divided into blocks, which may be of various sizes. The number of blocks in a cache is usually a power of 2.

Here is an example cache with eight blocks, each holding one byte. A direct-mapped cache is the simplest approach: each main memory address maps to exactly one cache block.

[Figure: a 16-byte main memory (block addresses 0-15) mapping onto an 8-block cache (indices 000-111)]
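A minimal sketch of this mapping (Python, assuming the eight-block, one-byte-per-block cache above): the low three address bits select the block.

    NUM_BLOCKS = 8          # the example cache: eight one-byte blocks

    def cache_index(address):
        return address % NUM_BLOCKS      # i.e. the low 3 bits: address & 0b111

    for addr in (0, 5, 8, 13):
        print(f"memory address {addr:2} ({addr:04b}) -> cache block {cache_index(addr):03b}")
    # Addresses 0 and 8 collide in block 000 -- the reason tags are needed next.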
FOUR IMPORTANT QUESTIONS
1. When we copy a block of data from main memory to
the cache, where exactly should we put it?
2. How can we tell if a word is already in the cache, or
if it has to be fetched from main memory first?
3. Eventually, the small cache memory might fill up.
To load a new block from main RAM, we’d have to
replace one of the existing blocks in the cache...
which one?
4. How can write operations be handled by the
memory system?
ADDING TAGS

[Figure: 16 memory addresses (0000-1111) mapping into a four-block cache]

Index | Tag | Data
00    | 00  | ...
01    | ??  | ...
10    | 01  | ...
11    | 01  | ...
We need to add tags to the cache, which supply the rest of the
address bits to let us distinguish between different memory
locations that map to the same cache block.
FIGURING OUT WHAT’S IN THE CACHE
Now we can tell exactly which
addresses of main memory are
stored in the cache, by
concatenating the cache block
tags with the block indices.
Index | Tag | Main memory address in cache block
00    | 00  | 00 + 00 = 0000
01    | 11  | 11 + 01 = 1101
10    | 01  | 01 + 10 = 0110
11    | 01  | 01 + 11 = 0111
ONE MORE DETAIL: THE VALID BIT
When started, the cache is empty and does not contain valid data.
We should account for this by adding a valid bit for each cache
block.
When the system is initialized, all the valid bits are set to 0.
When data is loaded into a particular cache block, the
corresponding valid bit is set to 1.
Index | Valid bit | Tag | Main memory address in cache block
00    | 1         | 00  | 00 + 00 = 0000
01    | 0         | 11  | invalid
10    | 0         | 01  | ???
11    | 1         | 01  | ???
So the cache contains more than just copies of the data in memory;
it also has bits to help us find data within the cache and verify its
validity.
WHAT HAPPENS ON A CACHE HIT
When the CPU tries to read from memory, the address will be
sent to a cache controller.
The lowest k bits of the block address will index a block in
the cache.
If the block is valid and the tag matches the upper (m - k)
bits of the m-bit address, then that data will be sent to the
CPU.
Here is a diagram of a 32-bit memory address and a 2^10-byte cache.

[Figure: the 32-bit address splits into a 22-bit tag and a 10-bit index; the index selects one of 1024 cache entries; if the entry is valid and its stored tag equals the address tag, it is a hit and the data goes to the CPU]
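A sketch of the hit check in software terms (Python; the lists stand in for the hardware valid/tag/data columns, and blocks are one byte so there are no offset bits):

    INDEX_BITS = 10                      # 2^10 one-byte blocks

    valid = [False] * (1 << INDEX_BITS)
    tags  = [0] * (1 << INDEX_BITS)
    data  = [0] * (1 << INDEX_BITS)

    def read(address):
        index = address & ((1 << INDEX_BITS) - 1)   # low 10 bits pick the entry
        tag = address >> INDEX_BITS                  # remaining 22 bits
        if valid[index] and tags[index] == tag:
            return data[index]                       # hit: data goes to the CPU
        return None                                  # miss: fetch from main memory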
WHAT HAPPENS ON A CACHE MISS
On cache hit, CPU proceeds normally
On cache miss
Stall the CPU pipeline
Fetch block from next level of hierarchy
Instruction cache miss
Restart instruction fetch
Data cache miss
Complete data access
MEMORY HIERARCHY BASICS
When a word is not found in the cache, a miss occurs:
  Fetch word from lower level in hierarchy, requiring a higher-latency reference
    Lower level may be another cache or the main memory
  Also fetch the other words contained within the block
    Takes advantage of spatial locality
  Place block into cache in any location within its set, determined by address
PLACEMENT

Type         | Placement                | Comments
Registers    | Anywhere (int, FP, SPR)  | Compiler/programmer manages
Cache (SRAM) | Fixed in H/W             | Direct-mapped, set-associative, fully-associative
DRAM         | Anywhere                 | O/S manages
Disk         | Anywhere                 | O/S manages
CACHE SETS & WAYS
CACHE SETS AND WAYS

[Figure: cache as a grid of sets × ways — a block/line is mapped to one set by its address, but can go in any way within that set]
DIRECT-MAPPED CACHE

[Figure: 1-way cache — one block/line per set]
SET ASSOCIATIVE CACHE

[Figure: 4-way set-associative cache — 4 sets, 4 blocks/lines per set]
FULLY ASSOCIATIVE CACHE

[Figure: fully associative cache — 1 set, 16 ways (blocks/lines)]

Fully associative: each block can be mapped to any cache line
Aka m-way set associative, where m = size of cache in blocks
SET ASSOCIATIVE CACHE ORGANIZATION

[Figure: set-associative cache organization]
CACHE ADDRESSING

[Figure: n ways × s sets — a block/line is mapped to one set by its address and can occupy any of the n ways]
REPLACEMENT POLICY

Random
  Gives approximately the same performance as LRU for high associativity
WRITE-THROUGH

[Figure: all accesses go from the CPU to L1; L1 misses go to L2, with writes passing through a write buffer]
WRITE-BACK
WRITE ALLOCATION
For write-back
Usually fetch the block
CACHE EXAMPLE

32B cache: <b=4, s=4, m=8>
6-bit address: o=2, i=2, t=2; 2-way set-associative
Initially empty
Only the tag array is shown (two tag columns per set, plus an LRU bit); the original slides step through the trace one reference at a time

Trace execution of:

Reference | Binary | Set/Way | Hit/Miss
Load 0x2A | 101010 | 2/0     | Miss
Load 0x2B | 101011 | 2/0     | Hit
Load 0x3C | 111100 | 3/0     | Miss
Load 0x20 | 100000 | 0/0     | Miss

After these references the tag array holds tag 10 in set 0 way 0, tag 10 in set 2 way 0, and tag 11 in set 3 way 0, with each set's LRU bit updated on every access.
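A small LRU simulator (Python) that reproduces the trace above under the stated <b=4, s=4, m=8> parameters:

    # 32B cache, 4B blocks, 4 sets, 2 ways; address split o=2, i=2, t=2; LRU.
    SETS, WAYS, OFFSET_BITS, INDEX_BITS = 4, 2, 2, 2
    cache = [[] for _ in range(SETS)]     # each set: list of tags, MRU last

    def access(addr):
        index = (addr >> OFFSET_BITS) & (SETS - 1)
        tag = addr >> (OFFSET_BITS + INDEX_BITS)
        s = cache[index]
        if tag in s:
            s.remove(tag); s.append(tag)  # move to MRU position
            return "Hit"
        if len(s) == WAYS:
            s.pop(0)                      # evict LRU (front of list)
        s.append(tag)
        return "Miss"

    for a in (0x2A, 0x2B, 0x3C, 0x20):
        print(f"Load 0x{a:02X} ({a:06b}): set {(a >> 2) & 3}, {access(a)}")
    # -> Miss, Hit, Miss, Miss, matching the table.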
MEASURING CACHE PERFORMANCE
Components of CPU time
Program execution cycles: Includes cache hit time
Memory stall cycles: Mainly from cache misses
Example:
  Given: I-cache miss rate = 2%, D-cache miss rate = 4%, miss penalty = 100 cycles, base CPI (ideal cache) = 2, loads & stores are 36% of instructions
  Miss cycles per instruction:
    I-cache: 0.02 × 100 = 2
    D-cache: 0.36 × 0.04 × 100 = 1.44
  Actual CPI = 2 + 2 + 1.44 = 5.44
  The ideal-cache CPU is 5.44/2 = 2.72 times faster
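The same arithmetic as executable Python, for checking:

    base_cpi = 2.0
    i_stalls = 0.02 * 100          # I-cache: 2 stall cycles per instruction
    d_stalls = 0.36 * 0.04 * 100   # D-cache: 1.44 stall cycles per instruction
    actual_cpi = base_cpi + i_stalls + d_stalls
    print(actual_cpi, actual_cpi / base_cpi)   # 5.44 2.72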
AVERAGE ACCESS TIME
Example:
  CPU with 1ns clock, hit time = 1 cycle, miss penalty = 20 cycles, I-cache miss rate = 5%
  AMAT = 1 + 0.05 × 20 = 2 cycles
  I.e., 2 cycles (2ns) per instruction fetch on average
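The AMAT formula as a one-liner (Python):

    def amat(hit_time, miss_rate, miss_penalty):
        return hit_time + miss_rate * miss_penalty

    print(amat(1, 0.05, 20))   # 2.0 cycles, i.e. 2ns with a 1ns clock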
MEASURING/CLASSIFYING MISSES
How to find out?
  Cold misses: simulate a fully associative cache of infinite size
  Capacity misses: simulate a fully associative cache of the target size, then deduct cold misses
  Conflict misses: simulate the target cache configuration, then deduct cold and capacity misses
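A sketch of this 3C methodology (Python, LRU replacement; the block-address trace is hypothetical):

    from collections import OrderedDict

    def misses_fully_assoc(trace, capacity_blocks):
        """LRU fully associative cache; capacity_blocks=None means infinite."""
        lru, misses = OrderedDict(), 0
        for b in trace:
            if b in lru:
                lru.move_to_end(b)
            else:
                misses += 1
                lru[b] = True
                if capacity_blocks and len(lru) > capacity_blocks:
                    lru.popitem(last=False)    # evict LRU block
        return misses

    def misses_set_assoc(trace, sets, ways):
        cache, misses = [OrderedDict() for _ in range(sets)], 0
        for b in trace:
            s = cache[b % sets]                # block address picks the set
            if b in s:
                s.move_to_end(b)
            else:
                misses += 1
                s[b] = True
                if len(s) > ways:
                    s.popitem(last=False)
        return misses

    trace = [0, 8, 0, 8, 1, 0, 8]              # hypothetical block addresses
    cold = misses_fully_assoc(trace, None)
    capacity = misses_fully_assoc(trace, 4) - cold
    conflict = misses_set_assoc(trace, 4, 1) - cold - capacity
    print(cold, capacity, conflict)            # 3 0 4: blocks 0 and 8 thrash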
MULTILEVEL CACHE CONSIDERATIONS
Primary cache
Focus on minimal hit time
L2 cache
Focus on low miss rate to avoid main
memory access
Hit time has less overall impact
Results
L1 cache usually smaller than a single-level cache
L1 block size smaller than L2 block size
VIRTUAL MEMORY
MEMORY HIERARCHY

[Figure: hierarchy levels — registers, on-chip SRAM, off-chip SRAM, DRAM, disk]
WHY VIRTUAL MEMORY?
MAPPING VIRTUAL TO PHYSICAL MEMORY

Divide memory into equal-sized "chunks" or pages (typically 4KB each)
Any chunk of virtual memory can be assigned to any chunk of physical memory

[Figure: a single process's virtual address space (code, static, heap, stack, growing from 0 toward ∞) mapped onto 64 MB of physical memory]
PAGED VIRTUAL MEMORY
Virtual address space divided into pages
Physical address space divided into pageframes
Page missing in Main Memory = page fault
Pages not in Main Memory are on disk: swap-in/swap-out
Or have never been allocated
New page may be placed anywhere in MM (fully associative map)
CACHE VS VM

             | Cache                                | Virtual Memory
Unit         | Block or line                        | Page
Miss         | Miss                                 | Page fault
Size         | Block size: 32-64B                   | Page size: 4K-16KB
Placement    | Direct mapped, N-way set associative | Fully associative
Replacement  | LRU or random                        | LRU approximation
Write policy | Write through or back                | Write back
How managed  | Hardware                             | Hardware + software (operating system)
HANDLING PAGE FAULTS
A page fault is like a cache miss
Must find page in lower level of hierarchy
PERFORMING ADDRESS TRANSLATION

VM divides memory into equal-sized pages
Address translation relocates entire pages
  Offsets within the pages do not change
  If the page size is a power of two, the virtual address separates into two fields (like the cache index and offset fields): virtual page number | page offset
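A sketch of the split and translation (Python, assuming 4KB pages; the page_table dict is a stand-in for the real structure):

    PAGE_BITS = 12
    PAGE_SIZE = 1 << PAGE_BITS           # 4KB pages: offset = low 12 bits

    def split(vaddr):
        return vaddr >> PAGE_BITS, vaddr & (PAGE_SIZE - 1)  # (VPN, offset)

    def translate(vaddr, page_table):
        vpn, offset = split(vaddr)
        ppn = page_table[vpn]             # page fault if the mapping is absent
        return (ppn << PAGE_BITS) | offset  # offset unchanged by translation

    print(hex(translate(0x3ABC, {0x3: 0x7F})))   # -> 0x7fabc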
MAPPING VIRTUAL TO PHYSICAL ADDRESS

[Figure: a virtual address's page number is translated to a physical page number; the page offset passes through unchanged]
ADDRESS TRANSLATION

[Figure: a page table in main memory, indexed by virtual page number; each entry holds a valid bit, other flags, and the physical page number]
PAGE TABLE
Page table translates address
MAPPING PAGES TO STORAGE

[Figure: page table entries point either to page frames in physical memory or to locations on disk]
REPLACEMENT AND WRITES
To reduce page fault rate, prefer least-recently used (LRU) replacement
  Reference bit (aka use bit) in the page table entry (PTE) set to 1 on access to page
  Periodically cleared to 0 by OS
  A page with reference bit = 0 has not been used recently
Disk writes take millions of cycles
Block at once, not individual locations
Write through is impractical
Use write-back
Dirty bit in PTE set when page is written
OPTIMIZING VM
FAST ADDRESS TRANSLATION
Problem: virtual memory requires two memory accesses!
  One to translate the virtual address into a physical address (page table lookup)
  One to transfer the actual data (hit)
  But the page table is in physical memory => 2 main memory accesses!
FAST TRANSLATION USING A TLB
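A minimal sketch of the idea (Python; dicts stand in for the hardware TLB and the in-memory page table, and the contents are hypothetical):

    page_table = {0: 41, 1: 17}   # vpn -> ppn, lives in main memory
    tlb = {}                      # small and fast; hardware in reality

    def translate_with_tlb(vpn):
        if vpn in tlb:                 # TLB hit: no extra memory access
            return tlb[vpn]
        ppn = page_table[vpn]          # TLB miss: extra access to walk the page table
        tlb[vpn] = ppn                 # fill the TLB so the next access hits
        return ppn

    print(translate_with_tlb(0), translate_with_tlb(0))   # miss then hit -> 41 41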
TLB TRANSLATION

[Figure: the virtual page number is compared against the TLB tags; if the tags match and the entry is valid, the physical page number from the entry is joined with the byte offset to form the physical address]

If the page is in memory:
  Load the PTE from memory and retry
  Could be handled in hardware
    Can get complex for more complicated page table structures
  Or in software
    Raise a special exception, with an optimized handler
TLB MISS HANDLER
TLB AND CACHE INTERACTION
If the cache tag uses the physical address
  Need to translate before cache lookup
  Physically indexed, physically tagged
TLB AND CACHE ADDRESSING
Cache review
Set or block field indexes are used to get tags
2 steps to determine hit:
Index (lookup) to find tags (using block or set bits)
Compare tags to determine hit
Sequential connection between indexing and tag comparison
CACHE & VIRTUAL MEMORY
SUMMARY
MAIN MEMORY
MEMORY HIERARCHY

[Figure: hierarchy levels — registers, on-chip SRAM, off-chip SRAM, DRAM, disk]
MAIN MEMORY DESIGN
DRAM CHIP ORGANIZATION

[Figure: DRAM chip — a row decoder drives word lines across the memory cell array (each cell is one transistor and one capacitor on a bitline); sense amps capture a row into the row buffer, and a column decoder selects a word onto the data bus]

• Optimized for density, not speed
• Data stored as charge in a capacitor
• Discharge on reads => destructive reads
• Charge leaks over time: refresh every 64ms
• Read entire row at once (RAS, page open)
• Read word from row (CAS)
• Burst mode (sequential words)
• Write row back (precharge, page close)
MAIN MEMORY DESIGN

A single DRAM chip has multiple internal banks
MAIN MEMORY ACCESS

• Each memory access (in DRAM bus clocks, ~10x CPU cycle time):
  – 5 cycles to send row address (page open, or RAS)
  – 1 cycle to send column address
  – 3 cycles DRAM access latency
  – 1 cycle to send data (CAS latency = 1 + 3 + 1 = 5)
  – 5 cycles to send precharge (page close)
  – 4-word cache block
MAIN MEMORY ACCESS

• One word wide, burst mode (pipelined): r r r r r c d d d b b b b p p p p p
  – 5 + 1 + 3 + 4 = 13 cycles
  – Interleaving is similar, but words can be from different rows, each open in a different bank
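The cycle counts above, as a quick check (Python; the non-burst per-word figure is an assumed comparison point, not from the slides):

    RAS, COL, ACC, XFER, PRE = 5, 1, 3, 1, 5   # cycles from the slides
    words = 4

    burst = RAS + COL + ACC + words * XFER      # 5 + 1 + 3 + 4 = 13 cycles
                                                # (precharge overlapped afterwards)
    naive = words * (RAS + COL + ACC + XFER + PRE)  # assumed: reopen row per word
    print(burst, naive)                         # 13 vs. 60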
ERROR CORRECTING CODES

Probabilities: P(1 word, no errors) > P(single error) > P(two errors) >> P(>2 errors)
ECC CODES FOR ONE BIT

Power   | Correct | # bits | Comments
Nothing | 0,1     | 1      |
ECC

Reduce overhead by applying codes to a word, not a bit
A larger word means higher P(>= 2 errors)

# bits | SED overhead | SECDED overhead
1      | 1 (100%)     | 3 (300%)
32     | 1 (3%)       | 7 (22%)
64     | 1 (1.6%)     | 8 (13%)
n      | 1 (1/n)      | ≈ 2 + log2(n)
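The table's check-bit counts follow from the Hamming bound; a quick check (Python):

    def sec_bits(n):
        # Hamming bound: need k check bits with 2^k >= n + k + 1
        k = 1
        while (1 << k) < n + k + 1:
            k += 1
        return k

    for n in (1, 32, 64):
        secded = sec_bits(n) + 1   # one extra parity bit to detect double errors
        print(n, secded, f"{100 * secded / n:.1f}%")
    # -> 1: 3 (300.0%), 32: 7 (21.9%), 64: 8 (12.5%), matching the table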
64-BIT ECC
Use data0 to compute check0
Store data0 and check0
To load
Read data1 and check1
Use data1 to compute check2
Syndrome = check1 xor check2
I.e. make sure check bits are equal
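A toy version of this flow on 8 data bits (Python, Hamming(12,8) parity groups; real memories protect 64-bit words in hardware, so this is only a sketch):

    def compute_check(data):                  # data: 8-bit int
        d = [(data >> i) & 1 for i in range(8)]
        c0 = d[0] ^ d[1] ^ d[3] ^ d[4] ^ d[6]
        c1 = d[0] ^ d[2] ^ d[3] ^ d[5] ^ d[6]
        c2 = d[1] ^ d[2] ^ d[3] ^ d[7]
        c3 = d[4] ^ d[5] ^ d[6] ^ d[7]
        return c0 | (c1 << 1) | (c2 << 2) | (c3 << 3)

    # Store: compute check0 from data0 and keep both.
    data0 = 0b10110010
    check0 = compute_check(data0)

    # Load: read data1/check1, recompute, and XOR to form the syndrome.
    data1 = data0 ^ (1 << 3)                  # inject a single-bit error in d3
    syndrome = check0 ^ compute_check(data1)
    print(f"syndrome = {syndrome:04b}")       # 0111: non-zero => error detected
                                              # (equals Hamming position 7 = d3)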
ECC SYNDROME