Memory Hierarchy Design: A Quantitative Approach, Fifth Edition
Chapter 2
Memory Hierarchy Design
Example: each core of an Intel Core i7 can generate two 64-bit data references per clock; with four cores at 3.2 GHz, that is a peak of 25.6 billion data references/s plus about 12.8 billion 128-bit instruction references/s, a total peak bandwidth of 409.6 GB/s!
[Diagram: Processor at the top, backed by a Cache and a Write Buffer, above the Lower-Level Memory]
Causes of misses
• Compulsory: first reference to a block, also called a “cold miss”
• Capacity: blocks discarded (lack of space) and later retrieved
• Conflict: the program makes repeated references to multiple addresses from different blocks that map to the same location in the cache
Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty
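As a worked example (numbers assumed for illustration): with a 1-cycle hit time, a 2% miss rate, and a 100-cycle miss penalty, Average Memory Access Time = 1 + 0.02 × 100 = 3 cycles.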
[Diagram: CPU wired directly to Memory via address lines A0-A31 and data lines D0-D31]
• All programs share one address space: the physical address space
• Machine language programs must be aware of the machine organization
• No way to prevent a program from accessing any machine resource
Solution: Add a Layer of Indirection
[Diagram: the CPU issues “Virtual Addresses” (A0-A31) that are translated into “Physical Addresses” (A0-A31) on the way to Memory; data flows across unchanged]
• User programs run in a standardized
virtual address space
• Address Translation hardware, managed
by the operating system (OS), maps
virtual address to physical memory
• Hardware supports “modern” OS features:
Protection, Translation, Sharing
Three Advantages of Virtual Memory
• Translation:
– Program can be given consistent view of memory, even though physical
memory is scrambled
– Makes multithreading reasonable (now used a lot!)
– Only the most important part of the program (the “Working Set”) must be in
physical memory.
– Contiguous structures (like stacks) use only as much physical memory
as necessary yet can still grow later.
• Protection:
– Different threads (or processes) protected from each other.
– Different pages can be given special behavior
» (Read-only, invisible to user programs, etc.)
– Kernel data protected from User programs
– Very important for protection from malicious programs
• Sharing:
– Can map same physical page to multiple users
(“Shared memory”)
Virtual Page & Physical Page
[Diagram: Virtual Memory pages V.P. 0-5 mapped onto Physical Memory pages P.P. 0-3; page size: 4KB]
Addressing
[Diagram: a Virtual Address splits into Virtual Page No. + Page Offset; translation maps the Virtual Page No. to a Physical Page No., so the Physical Address is Physical Page No. + the same Page Offset; page size: 4KB]
Addressing
[Diagram: the page table maps Virtual Page No. + Page Offset in the Virtual Address to Physical Page No. + Page Offset in the Physical Address]
Each page-table entry also carries status bits:
• Valid/Present bit: if set, the page pointed to is resident in memory; otherwise it is on disk or not allocated
• Protection bits: restrict access (read-only, read/write, system-only access)
• Reference bit: needed by replacement policies; if set, the page has been referenced
• Dirty bit: if set, at least one word in the page has been modified
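A minimal sketch of how these fields drive translation, assuming 4KB pages and a flat page table (the names and layout are illustrative, not from the slides):

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_OFFSET_BITS 12               /* 4KB pages */
#define PAGE_SIZE        (1u << PAGE_OFFSET_BITS)

typedef struct {
    uint32_t ppn;        /* physical page number                     */
    bool     valid;      /* page resident in memory?                 */
    bool     readonly;   /* protection: read-only access             */
    bool     referenced; /* set on each access, for replacement      */
    bool     dirty;      /* set when a word in the page is modified  */
} pte_t;

/* Translate a virtual address using flat page table pt.
   Returns false (a "page fault") when the page is not resident. */
bool translate(pte_t *pt, uint64_t va, uint64_t *pa)
{
    uint64_t vpn    = va >> PAGE_OFFSET_BITS;  /* virtual page number */
    uint64_t offset = va & (PAGE_SIZE - 1);    /* page offset         */

    if (!pt[vpn].valid)
        return false;                          /* OS handles the fault */

    pt[vpn].referenced = true;
    *pa = ((uint64_t)pt[vpn].ppn << PAGE_OFFSET_BITS) | offset;
    return true;
}
```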
[Diagram: a TLB caches recent virtual-page to physical-page translations (frame/page pairs); each page-table entry holds V, P, R, D bits plus a Physical Page No.]
• V=0 pages either reside on disk or have not yet been allocated; the OS handles V=0 as a “page fault”
• MIPS handles TLB misses in software (with random replacement); other machines use hardware
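A toy sketch of the lookup and the software refill, in the spirit of the MIPS approach (entry count, names, and the fully associative search are assumptions for illustration):

```c
#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 64

typedef struct {
    bool     valid;
    uint64_t vpn;   /* virtual page number  */
    uint64_t ppn;   /* physical page number */
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];

/* Search all entries; hardware does these comparisons in parallel. */
bool tlb_lookup(uint64_t vpn, uint64_t *ppn)
{
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *ppn = tlb[i].ppn;
            return true;          /* TLB hit */
        }
    return false;                 /* TLB miss: trap to software */
}

/* Software refill with random replacement (simple LCG as the
   random source here), as MIPS does on a TLB miss. */
void tlb_refill(uint64_t vpn, uint64_t ppn, unsigned *seed)
{
    int victim = (*seed = *seed * 1103515245u + 12345u) % TLB_ENTRIES;
    tlb[victim] = (tlb_entry_t){ .valid = true, .vpn = vpn, .ppn = ppn };
}
```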
Can TLB and caching be overlapped?
[Diagram: the Page Offset's index and byte-select bits address the cache (tags, valid bits, data) while the Translation Look-Aside Buffer (TLB) translates the Virtual Page Number; the resulting physical Cache Tag is compared to signal a Hit and select the data out]
This works, but ...
Q. What is the downside?
A. Inflexibility. Size of cache
limited by page size.
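To see why (my arithmetic, not on the slide): for the overlap to work, the index + byte-select bits must fit inside the page offset, since only those bits are unchanged by translation. With 4KB pages that is 12 bits, so a direct-mapped cache can be at most 4KB; a larger cache is only possible by raising associativity while keeping the same 12 bits of index + offset.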
Example memory hierarchy parameters:
• VA: 64 bits; PA: 40 bits
• Page size: 16KB
• TLB: 2-way set associative, 256 entries
• Cache block: 64B
• L1: direct-mapped, 16KB
• L2: 4-way set associative, 4MB
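A quick sanity check on these numbers (my arithmetic, not from the slide): a 16KB page gives a 14-bit page offset; a 64B block gives a 6-bit block offset; the 16KB direct-mapped L1 has 16KB / 64B = 256 sets, i.e., 8 index bits. Index + block offset = 14 bits, exactly the page offset, so the L1 can be indexed in parallel with the TLB lookup.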
Advanced Optimizations
Ten Advanced Optimizations
Small and simple first-level caches
• Critical timing path: addressing tag memory, then comparing tags, then selecting the correct set
• Direct-mapped caches can overlap tag compare and transmission of data
• Lower associativity reduces power because fewer cache lines are accessed
[Figure: write-buffer behavior without and with write buffering]
Blocking
• Instead of accessing entire rows or columns, subdivide matrices into blocks
• Requires more memory accesses but improves locality of accesses
All N×N elements of Y and Z are accessed N times and each element of X is accessed once. Thus, there are N³ operations and 2N³ + N² reads! Capacity misses are a function of N and the cache size in this case.
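A minimal sketch of blocking in C, following the classic jj/kk tiling of X = Y × Z (the block factor B is a tunable assumption; X must be zeroed beforehand):

```c
#include <stddef.h>

/* Blocked (tiled) matrix multiply: X += Y * Z for N x N matrices.
   Each B x B sub-block of Y and Z is reused while it is still
   resident in the cache, improving temporal locality. */
void matmul_blocked(size_t N, size_t B,
                    double X[N][N], double Y[N][N], double Z[N][N])
{
    for (size_t jj = 0; jj < N; jj += B)
        for (size_t kk = 0; kk < N; kk += B)
            for (size_t i = 0; i < N; i++)
                for (size_t j = jj; j < jj + B && j < N; j++) {
                    double r = 0.0;
                    /* only a B-long strip of Y and Z is touched here */
                    for (size_t k = kk; k < kk + B && k < N; k++)
                        r += Y[i][k] * Z[k][j];
                    X[i][j] += r;
                }
}
```

With blocking, capacity misses depend on the block factor B rather than on N, so B can be chosen so that three B×B tiles fit in the cache.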
Pentium 4 Prefetching
• Register prefetch: loads data into a register
• Cache prefetch: loads data into the cache
http://www.bit-tech.net/hardware/memory/2007/11/15/the_secrets_of_pc_memory_part_1/3
http://en.wikipedia.org/wiki/Static_random_access_memory
http://en.wikipedia.org/wiki/Dynamic_random-access_memory
Some DRAM optimizations:
• Multiple accesses to the same row
• Synchronous DRAM: added a clock to the DRAM interface, enabling pipelining
• Burst mode with critical word first
• Wider interfaces
• Double data rate (DDR)
• Multiple banks on each DRAM device
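As a worked example of what DDR and wider interfaces buy (standard DDR3 figures): DDR3-1600 transfers on both edges of an 800 MHz clock across an 8-byte-wide DIMM, so peak bandwidth is 1600 MT/s × 8 B = 12.8 GB/s, hence the module name PC3-12800.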
http://en.wikipedia.org/wiki/DIMM
Comparison
NAND Flash Memory
• Main storage component of Solid State Drives (SSDs)
• Found in USB drives, cell phones, touch pads, …
Advantages of NAND flash
• Fast random reads (25 µs)
• Energy efficiency
• High reliability (no moving parts) compared to hard disks
• Widely deployed in high-end laptops: MacBook Air, ThinkPad X series, touch pads, …
• Increasingly deployed in enterprise environments, either as a secondary cache or as main storage
Disadvantages of SSD
• Garbage collection (GC) problem
– Stems from flash's out-of-place update characteristic: update requests invalidate the old version of a page and write the new version to a new location
– GC runs periodically to erase victim blocks, first copying their valid pages to free blocks (extra I/Os; erase is slow: ~10× a write, ~100× a read)
• Blocks in the SSD have a limited number of erase cycles
– 100,000 for Single-Level Cell (SLC), 5,000-10,000 for Multi-Level Cell (MLC), can be as low as 3,000
– May wear out quickly in enterprise environments
• Performance is very unpredictable
– Due to unpredictable triggering of the time-consuming GC process
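A toy sketch of the GC loop described above (sizes and the greedy victim policy are illustrative assumptions; real flash translation layers are far more elaborate):

```c
#include <stddef.h>
#include <stdbool.h>

#define PAGES_PER_BLOCK 64

typedef struct {
    bool valid[PAGES_PER_BLOCK]; /* which pages still hold live data       */
    int  erase_count;            /* wear: ~100K erases for SLC, less for MLC */
} block_t;

/* Greedy victim selection: the block with the fewest valid pages
   minimizes the copying GC must do before erasing it. */
static size_t pick_victim(const block_t *blocks, size_t nblocks)
{
    size_t victim = 0, best = PAGES_PER_BLOCK + 1;
    for (size_t b = 0; b < nblocks; b++) {
        size_t live = 0;
        for (size_t p = 0; p < PAGES_PER_BLOCK; p++)
            live += blocks[b].valid[p];
        if (live < best) { best = live; victim = b; }
    }
    return victim;
}

/* One GC step: relocate the victim's live pages (the extra I/Os that
   make performance unpredictable), then erase the block (slow) and
   charge one of its finite erase cycles. */
void gc_step(block_t *blocks, size_t nblocks,
             void (*copy_page)(size_t blk, size_t pg))
{
    size_t v = pick_victim(blocks, nblocks);
    for (size_t p = 0; p < PAGES_PER_BLOCK; p++)
        if (blocks[v].valid[p]) {
            copy_page(v, p);
            blocks[v].valid[p] = false;
        }
    blocks[v].erase_count++;
}
```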
Hybrid Main Memory System
• DRAM + flash memory
– Uses a small DRAM as a cache to buffer writes and cache reads, leveraging access locality
– Uses large flash memory to store cold data
• Advantages
– Similar performance to DRAM
– Low power consumption
– Low cost
Role of architecture:
• Provide user mode and supervisor mode
• Protect certain aspects of CPU state
• Provide mechanisms for switching between user mode and supervisor mode
• Provide mechanisms to limit memory accesses
• Provide a TLB to translate addresses
[Diagram: a cache above a memory of 32 numbered blocks (0-31), introducing block placement]
[Diagram: direct-mapped block placement: each memory block (addresses 00-4C) maps to exactly one cache location]
Fully Associative Block Placement
[Diagram: any memory block (addresses 00-4C) may be placed in any cache location]
Set-Associative Block Placement
[Diagram: each memory block (addresses 00-4C) maps to one set and may be placed in any way within that set]
Q2: How is a block found if it is in the upper level?
Set-Associative Cache Design
[Diagram: a 4-way set-associative cache: the address's index selects one of 256 sets (rows 0-255); four 22-bit tags from the 32-bit address are compared in parallel, and a 4-to-1 multiplexor selects the Hit Data]
• Advantage:
– Better hit rate
• Disadvantage:
– More tag bits
– More hardware
– Higher access time
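A toy C sketch consistent with the figure above (32-bit addresses, 256 sets, one 4-byte word per line, hence 2 offset bits, 8 index bits, and 22 tag bits; all parameters assumed for illustration):

```c
#include <stdint.h>
#include <stdbool.h>

#define WAYS     4
#define SETS     256
#define OFF_BITS 2
#define IDX_BITS 8

typedef struct {
    bool     valid;
    uint32_t tag;
    uint32_t data;   /* one word per line in this toy design */
} line_t;

static line_t cache[SETS][WAYS];

/* All four ways of the selected set are checked; in hardware the four
   comparators work in parallel and a 4-to-1 mux picks the hit data. */
bool lookup(uint32_t addr, uint32_t *data_out)
{
    uint32_t index = (addr >> OFF_BITS) & ((1u << IDX_BITS) - 1);
    uint32_t tag   = addr >> (OFF_BITS + IDX_BITS);

    for (int w = 0; w < WAYS; w++)
        if (cache[index][w].valid && cache[index][w].tag == tag) {
            *data_out = cache[index][w].data;   /* hit */
            return true;
        }
    return false;                               /* miss */
}
```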
– What are the lengths (in bits) of the index field and
the tag field in the address if the placement is 1-
way set-associative?
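A worked answer under assumed parameters (the slide's setup is not shown): for a 1-way (direct-mapped) cache of 16KB with 64B blocks and 32-bit addresses, the block offset is 6 bits, the index is log2(16KB / 64B) = 8 bits, and the tag is 32 − 8 − 6 = 18 bits.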
This exploits the principle of spatial locality: the larger the block, the greater the chance that parts of it will be used again.
[Figure: miss rate (0-20%) vs. block size (16-256 bytes) for cache sizes 1K-256K; miss rate first falls as blocks grow, then rises again for the smaller caches]
Increasing Block Size
• One way to reduce the miss rate is to increase
the block size
– Take advantage of spatial locality
– Decreases compulsory misses
• However, larger blocks have disadvantages
– May increase the miss penalty (need to get more
data)
– May increase hit time (need to read more data from
cache and larger mux)
– May increase the miss rate, since fewer blocks fit in the cache, causing more conflict misses
• Increasing the block size can help, but don’t
overdo it.
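A worked illustration with assumed numbers: if doubling the block size from 32B to 64B cuts the miss rate from 5% to 4% but raises the miss penalty from 80 to 120 cycles, AMAT goes from 1 + 0.05 × 80 = 5 cycles to 1 + 0.04 × 120 = 5.8 cycles, so the larger block actually hurts.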