5 Memory Hierarchy

MEMORY HIERARCHY

MEMORY
Memory
 Just an “ocean of bits”
 Many technologies are available
Key issues
 Technology (how bits are stored)
 Placement (where bits are stored)
 Identification (finding the right bits)
 Replacement (finding space for new bits)
 Write policy (propagating changes to bits)
Must answer these regardless of memory
type
2
TYPICAL MEMORY HIERARCHY
[Diagram: Processor – Memory – Input/Output]
3
MEMORY SYSTEMS

How can we supply the CPU with enough data to keep it busy?
We will focus on memory issues,
 which are frequently bottlenecks that limit the performance of a system.

Storage       Speed     Cost       Capacity   Delay          Cost/GB
Static RAM    Fastest   Expensive  Smallest   0.5 – 2.5 ns   $1,000's
Dynamic RAM   Slow      Cheap      Large      50 – 70 ns     $10's
Hard disks    Slowest   Cheapest   Largest    5 – 20 ms      $0.1's

Ideal memory: large, fast and cheap


4
MEMORY HIERARCHY
[Pyramid: Registers → On-Chip SRAM → Off-Chip SRAM → DRAM → Disk; speed and cost decrease going down, capacity increases]
5
WHY MEMORY
HIERARCHY?
Need lots of bandwidth
Need lots of storage
 64MB (minimum) to multiple TB
Must be cheap per bit
 (TB x anything) is a lot of money!
These requirements seem incompatible

6
WHY MEMORY
HIERARCHY?
Fast and small memories
 Enable quick access (fast cycle time)
 Enable lots of bandwidth (1+ L/S/I-fetch/cycle)
Slower larger memories
 Capture larger share of memory
 Still relatively fast
Slow huge memories
 Hold rarely-needed state
 Needed for correctness
All together: they provide the appearance of a large, fast memory at close to the cost of the cheapest level
7
WHY DOES A
HIERARCHY WORK?
Locality of reference
 Temporal locality
 Reference same memory location repeatedly
 Spatial locality
 Reference near neighbors around the same time

Empirically observed
 Significant!
 Even small local storage (8KB) often satisfies
>90% of references to multi-MB data set

8
WHY LOCALITY?
Analogy:
 Library (Disk)
 Bookshelf (Main memory)
 Stack of books on desk (off-chip cache)
 Opened book on desk (on-chip cache)
Likelihood of:
 Referring to same book or chapter again?
 Probability decays over time
 Book moves to bottom of stack, then bookshelf, then
library
 Referring to chapter n+1 if looking at chapter n?
9
MEMORY HIERARCHY
Temporal Locality
• Keep recently referenced items at higher levels
• Future references satisfied quickly
Spatial Locality
• Bring neighbors of recently referenced items to higher levels
• Future references satisfied quickly
[Hierarchy: CPU → I & D L1 Cache → Shared L2 Cache → Main Memory → Disk]
10
CACHE

INTRODUCING CACHES
Cache – a small amount of fast, expensive memory.
 The cache goes between the processor and the slower, dynamic main memory.
 It keeps a copy of the most frequently used data from the main memory.
Memory access speed increases overall, because we've made the common case faster.
 Reads and writes to the most frequently used addresses will be serviced by the cache.
 We only need to access the slower main memory for less frequently used data.
[Diagram: CPU ↔ a little static RAM (cache) ↔ lots of dynamic RAM]
12
THE PRINCIPLE OF
LOCALITY
Why does the hierarchy work?
Because most programs exhibit locality,
which the cache can take advantage of.
 The principle of temporal locality says that if a
program accesses one memory address, there is a
good chance that it will access the same address
again.
 The principle of spatial locality says that if a
program accesses one memory address, there is a
good chance that it will also access other nearby
addresses.

13
HOW CACHES TAKE ADVANTAGE OF LOCALITY
The first time the processor reads from an address in main memory, a copy of that data is also stored in the cache.
 The next time that same address is read, we can use the copy of the data in the cache instead of accessing the slower dynamic memory.
 So the first read is a little slower than before since it goes through both main memory and the cache, but subsequent reads are much faster.
This takes advantage of temporal locality—commonly accessed data is stored in the faster cache memory.
By storing a block (multiple words) we also take advantage of spatial locality.
[Diagram: CPU ↔ a little static RAM (cache) ↔ lots of dynamic RAM]
14
DEFINITIONS: HITS AND
MISSES
A cache hit
 occurs if the cache contains the data that we’re looking for. Hits are good, because
the cache can return the data much faster than main memory.

A cache miss
 occurs if the cache does not contain the requested data. This is bad, since the
CPU must then wait for the slower main memory.

There are two basic measurements of cache performance.


 The hit rate is the percentage of memory accesses that are handled by the cache.
 The miss rate = (1 - hit rate) is the percentage of accesses that must be handled
by the slower main RAM.
 Typical caches have a hit rate of 95% or higher, so in fact most memory accesses
will be handled by the cache and will be dramatically faster.

15
A SIMPLE CACHE DESIGN
[Diagram: main memory addresses 0–15 mapping onto cache block indices 000–111]
Here is an example cache with eight blocks, each holding one byte.
A direct-mapped cache is the simplest approach: each main memory address maps to exactly one cache block.
Caches are divided into blocks, which may be of various sizes.
 The number of blocks in a cache is usually a power of 2.
16
FOUR IMPORTANT
QUESTIONS
1. When we copy a block of data from main memory to
the cache, where exactly should we put it?
2. How can we tell if a word is already in the cache, or
if it has to be fetched from main memory first?
3. Eventually, the small cache memory might fill up.
To load a new block from main RAM, we’d have to
replace one of the existing blocks in the cache...
which one?
4. How can write operations be handled by the
memory system?

17
ADDING TAGS
0000
0001
0010
0011
0100
0101 Index Tag Data
0110 00 00
0111 01 ??
1000 10 01
1001 11 01
1010
1011
1100
1101
1110
1111

We need to add tags to the cache, which supply the rest of the
address bits to let us distinguish between different memory
locations that map to the same cache block.
18
FIGURING OUT WHAT’S
IN THE CACHE
Now we can tell exactly which
addresses of main memory are
stored in the cache, by
concatenating the cache block
tags with the block indices.
Index   Tag   Data   Main memory address in cache block
00      00           00 + 00 = 0000
01      11           11 + 01 = 1101
10      01           01 + 10 = 0110
11      01           01 + 11 = 0111

19
ONE MORE DETAIL: THE
VALID BIT
When started, the cache is empty and does not contain valid data.
We should account for this by adding a valid bit for each cache
block.
 When the system is initialized, all the valid bits are set to 0.
 When data is loaded into a particular cache block, the
corresponding valid bit is set to 1.
Index   Valid Bit   Tag   Data   Main memory address in cache block
00      1           00           00 + 00 = 0000
01      0           11           Invalid
10      0           01           ???
11      1           01           ???

So the cache contains more than just copies of the data in memory;
it also has bits to help us find data within the cache and verify its
validity.

20
WHAT HAPPENS ON A CACHE HIT
When the CPU tries to read from memory, the address will be sent to a cache controller.
 The lowest k bits of the block address will index a block in the cache.
 If the block is valid and the tag matches the upper (m - k) bits of the m-bit address, then that data will be sent to the CPU.
[Diagram: a 32-bit memory address split into a 22-bit tag and a 10-bit index, looked up in a 2^10-byte cache; the valid bit and the tag comparison produce the Hit signal and the data goes to the CPU]
21
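As a concrete illustration of the hit check above, here is a minimal Python sketch assuming a direct-mapped cache with one-byte blocks and the 22-bit tag / 10-bit index split from the diagram; the dm_lookup helper and the tuple layout of each entry are made up for illustration, not taken from the slides.

# Minimal sketch: hit check for a direct-mapped cache with 2**k one-byte blocks.
def dm_lookup(cache, address, k):
    index = address & ((1 << k) - 1)      # lowest k bits select the block
    tag = address >> k                    # remaining upper bits are the tag
    valid, stored_tag, data = cache[index]
    if valid and stored_tag == tag:       # valid entry with matching tag => hit
        return True, data
    return False, None                    # miss: would go to main memory

cache = [(False, 0, 0) for _ in range(1 << 10)]        # 2**10 blocks, all invalid
cache[0x12345 & 0x3FF] = (True, 0x12345 >> 10, 0xAB)   # pretend this block was loaded earlier
print(dm_lookup(cache, 0x12345, 10))    # (True, 0xAB): hit
print(dm_lookup(cache, 0x52345, 10))    # same index, different tag: (False, None)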
WHAT HAPPENS ON A CACHE
MISS
On cache hit, CPU proceeds normally
On cache miss
 Stall the CPU pipeline
 Fetch block from next level of hierarchy
 Instruction cache miss
 Restart instruction fetch
 Data cache miss
 Complete data access

The delays that we have been assuming for


memories (e.g., 2ns) are really assuming
cache hits.
22
LOADING A BLOCK INTO
THE CACHE
After data is read from main memory, putting a copy of that
data into the cache is straightforward.
 The lowest k bits of the block address specify a cache block.
 The upper (m - k) address bits are stored in the block’s tag field.
 The data from main memory is stored in the block’s data field.
 The valid bit is set to 1.

[Diagram: the 32-bit address is split into a 22-bit tag and a 10-bit index; the indexed block's valid bit is set to 1, its tag field gets the tag, and its data field gets the data from main memory]
23
MEMORY HIERARCHY
BASICS
When a word is not found in the cache,
a miss occurs:
 Fetch word from lower level in hierarchy,
requiring a higher latency reference
 Lower level may be another cache or the
main memory
 Also fetch the other words contained within
the block
 Takes advantage of spatial locality
 Place block into cache in any location within
its set, determined by address
24
PLACEMENT
Type                       Placement      Comments
Registers (Int, FP, SPR)   Anywhere       Compiler/programmer manages
Cache (SRAM)               Fixed in H/W   Direct-mapped, set-associative, fully-associative
DRAM                       Anywhere       O/S manages
Disk                       Anywhere       O/S manages

25
CACHE SETS &
WAYS
26
CACHE SETS AND WAYS
Ways: a block can go anywhere within its set
Sets: a block is mapped to one set by its address
[Diagram: 16-block cache arranged as 4 sets × 4 ways (4-way set associative); each entry is one block/line]
n-way set associative (here, 4-way set associative)
Example: Cache size = 16 blocks
27
27
DIRECT-MAPPED CACHE
[Diagram: 16 sets × 1 way; one block/line per set]
Direct mapped cache
 Each block maps to only one cache line
 aka 1-way set associative
28
SET ASSOCIATIVE CACHE
[Diagram: 4 sets × 4 ways; one block/line per way]
n-way set associative
 Each block can be mapped to a set of n lines
 Set number is based on the block address
 (here, 4-way set associative)

29
FULLY ASSOCIATIVE CACHE
[Diagram: 1 set × 16 ways; one block/line per way]
Fully associative
 Each block can be mapped to any cache line
 aka m-way set associative, where m = size of cache in blocks

30
SET ASSOCIATIVE CACHE
ORGANIZATION

31
CACHE ADDRESSING
n ways: a block can go anywhere within its set
s sets: a block is mapped to one set by its address

Address = Tag (remainder) | Index (sets) | Offset (block size)
 Tag bits = 32 - log2(s) - log2(b)
 Index bits = log2(s)
 Offset bits = log2(b)

m = size of cache in blocks
n = number of ways
b = block size in bytes
Cache size = m * b
# of Sets (s) = m / n
32
CACHE ADDRESSING
Ex. 64KB cache, direct mapped, 16-byte block
 Address = Tag (remainder, t = 32 - i - o) | Index (sets, i = log2 s) | Offset (block size, o = log2 b)
 Tag = 16 bits, Index = 12 bits, Offset = 4 bits

 m = size of cache in blocks = 64K/16 = 4096
 b = block size in bytes = 16 -> o = 4
 n = number of ways = 1
 # of Sets (s) = m / n = 4096/1 = 4096 -> i = 12
 t = 32 – 12 – 4 = 16
33
CACHE ADDRESSING
Ex. 64KB cache, 2-way set associative, 16-byte block
 Address = Tag (remainder, t = 32 - i - o) | Index (sets, i = log2 s) | Offset (block size, o = log2 b)
 Tag = 17 bits, Index = 11 bits, Offset = 4 bits

 m = size of cache in blocks = 64K/16 = 4096
 b = block size in bytes = 16 -> o = 4
 n = number of ways = 2
 # of Sets (s) = m / n = 4096/2 = 2048 -> i = 11
 t = 32 – 11 – 4 = 17
34
CACHE ADDRESSING
Ex. 64KB cache, fully associative, 16-byte block
 Address = Tag (remainder, t = 32 - i - o) | Index (sets, i = log2 s) | Offset (block size, o = log2 b)
 Tag = 28 bits, Index = 0 bits, Offset = 4 bits

 m = size of cache in blocks = 64K/16 = 4096
 b = block size in bytes = 16 -> o = 4
 n = number of ways = m = 4096
 # of Sets (s) = m / n = 4096/4096 = 1 -> i = 0
 t = 32 – 0 – 4 = 28
35
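The three examples above all follow the same recipe; a small sketch (the address_split helper is an illustrative name, not from the slides) reproduces the 16/12/4, 17/11/4 and 28/0/4 splits:

from math import log2

def address_split(cache_bytes, block_bytes, ways, addr_bits=32):
    m = cache_bytes // block_bytes     # cache size in blocks
    s = m // ways                      # number of sets
    o = int(log2(block_bytes))         # offset bits
    i = int(log2(s))                   # index bits (0 when fully associative)
    t = addr_bits - i - o              # tag bits = remainder
    return t, i, o

print(address_split(64 * 1024, 16, 1))        # direct mapped     -> (16, 12, 4)
print(address_split(64 * 1024, 16, 2))        # 2-way             -> (17, 11, 4)
print(address_split(64 * 1024, 16, 4096))     # fully associative -> (28, 0, 4)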
WHAT IF THE CACHE
FILLS UP?
Our third question was what to do if we run
out of space in our cache, or if we need to
reuse a block for a different memory address.
A miss causes a new block to be loaded into
the cache, automatically overwriting any
previously stored data.
 This is a least recently used replacement policy,
which assumes that older data is less likely to be
requested than newer data.

There are other policies.

36
REPLACEMENT POLICY

Direct mapped: no choice


Set associative/Fully associative
 Prefer non-valid entry, if there is one
 Otherwise, choose among entries in the set

Least-recently used (LRU)


 Choose the one unused for the longest time
 Simple for 2-way, manageable for 4-way, too hard beyond
that

Random
 Gives approximately the same performance as
LRU for high associativity
37
WRITE-THROUGH
[Diagram: CPU → L1 (all accesses) → L2 (L1 misses); all stores also go into a Write Buffer]

On data-write hit, could just update the block in cache


 But then cache and memory would be inconsistent

Write through: also update memory


But makes writes take longer
 e.g., if base CPI = 1, 10% of instructions are stores, write to
memory takes 100 cycles
 Effective CPI = 1 + 0.1×100 = 11

Solution: write buffer


 Holds data waiting to be written to memory
 CPU continues immediately
 Only stalls on write if write buffer is already full

38
WRITE-BACK

Alternative: On data-write hit, just


update the block in cache
 Keep track of whether each block is dirty

When a dirty block is replaced


 Write it back to memory
 Can use a write buffer to allow replacing
block to be read first

39
WRITE ALLOCATION

What should happen on a write miss?


Alternatives for write-through
 Allocate on miss: fetch the block
 Write around: don’t fetch the block
 Since programs often write a whole block before reading it
(e.g., initialization)

For write-back
 Usually fetch the block

40
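To make the write policies above concrete, here is a rough bookkeeping-only sketch (class and function names are mine; a real controller is hardware): write-through pushes every store to memory, while write-back only sets a dirty bit and writes the block back when it is replaced.

class Line:
    def __init__(self):
        self.valid = self.dirty = False
        self.data = None

def store_write_through(line, value, memory, addr):
    line.data = value
    memory[addr] = value             # every store also updates memory (or a write buffer)

def store_write_back(line, value):
    line.data = value
    line.dirty = True                # memory is now stale; the dirty bit remembers that

def evict(line, memory, addr):
    if line.valid and line.dirty:
        memory[addr] = line.data     # a dirty block is written back only when replaced
    line.valid = line.dirty = False

memory = {}
line = Line(); line.valid = True
store_write_through(line, 42, memory, 0x100)   # memory sees 42 immediately
store_write_back(line, 7)                      # only the cache sees 7 for now
evict(line, memory, 0x100)                     # the deferred write happens here
print(memory[0x100])                           # 7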
CACHE EXAMPLE
32B Cache: <b=4, s=4, m=8>, 6-bit address
 o=2, i=2, t=2; 2-way set-associative
 Initially empty
 Only the tag array is shown (per set: way-0 tag, way-1 tag, LRU bit)

Trace execution of:
Reference   Binary   Set/Way   Hit/Miss
Load 0x2A   101010   2/0       Miss
Load 0x2B   101011   2/0       Hit
Load 0x3C   111100   3/0       Miss
Load 0x20   100000   0/0       Miss

Tag array as the trace proceeds (slides 41–48):
 Initially all entries are empty and all LRU bits are 0
 After Load 0x2A (miss): set 2, way 0 = 10, LRU = 1
 After Load 0x2B (hit):  set 2 unchanged, LRU = 1
 After Load 0x3C (miss): set 3, way 0 = 11, LRU = 1
 After Load 0x20 (miss): set 0, way 0 = 10, LRU = 1
 The remaining slides show later references filling set 0, way 1 with tag 11, then replacing set 0, way 0 with tag 01, and finally marking set 2, way 0 dirty (d)
48
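The trace above can be checked with a tiny simulator; this sketch models only the tag array of the <b=4, s=4, m=8> 2-way cache (function and field names are illustrative) and reproduces the Miss/Hit/Miss/Miss results for the four loads shown.

O_BITS, I_BITS = 2, 2                              # offset and index bits from the example
sets = [{"tags": [None, None], "lru": 0} for _ in range(1 << I_BITS)]

def access(addr):
    index = (addr >> O_BITS) & ((1 << I_BITS) - 1)
    tag = addr >> (O_BITS + I_BITS)
    s = sets[index]
    if tag in s["tags"]:                           # hit: the other way becomes LRU
        way = s["tags"].index(tag)
        s["lru"] = 1 - way
        return index, way, "Hit"
    way = s["tags"].index(None) if None in s["tags"] else s["lru"]
    s["tags"][way] = tag                           # miss: fill an empty way, else the LRU way
    s["lru"] = 1 - way
    return index, way, "Miss"

for addr in (0x2A, 0x2B, 0x3C, 0x20):
    print(hex(addr), access(addr))
# 0x2a (2, 0, 'Miss'), 0x2b (2, 0, 'Hit'), 0x3c (3, 0, 'Miss'), 0x20 (0, 0, 'Miss')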
MEASURING CACHE
PERFORMANCE
Components of CPU time
 Program execution cycles: Includes cache hit time
 Memory stall cycles: Mainly from cache misses

With simplifying assumptions:
 Memory stall cycles per instruction = Memory accesses per instruction × Miss rate × Miss penalty

Example:
 Given:
 I-cache miss rate = 2%, D-cache miss rate = 4%, Miss penalty = 100 cycles, Base CPI (ideal cache) = 2, Loads & stores are 36% of instructions
 Miss cycles per instruction
 I-cache: 0.02 × 100 = 2
 D-cache: 0.36 × 0.04 × 100 = 1.44
 Actual CPI = 2 + 2 + 1.44 = 5.44
 Ideal CPU is 5.44/2 = 2.72 times faster
49
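A quick recomputation of the example above, with the values taken straight from the slide:

base_cpi, miss_penalty = 2, 100
i_miss, d_miss, mem_frac = 0.02, 0.04, 0.36
i_stall = i_miss * miss_penalty                # 2 cycles per instruction
d_stall = mem_frac * d_miss * miss_penalty     # 1.44 cycles per instruction
actual_cpi = base_cpi + i_stall + d_stall      # 5.44
print(i_stall, d_stall, actual_cpi, actual_cpi / base_cpi)   # 2.0 1.44 5.44 2.72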
AVERAGE ACCESS TIME
Hit time is also important for performance
Average memory access time (AMAT)
 AMAT = Hit time + Miss rate × Miss penalty
 Equivalently: Hit rate × Hit time + Miss rate × (Hit time + Miss penalty)

Example
 CPU with 1ns clock, hit time = 1 cycle, miss
penalty = 20 cycles, I-cache miss rate = 5%
 AMAT = 1 + 0.05 × 20 = 2 (cycle)
 2 cycles per instruction

50
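The same kind of check for the AMAT example; the second line also confirms that the two formulations given above are equivalent:

hit_time, miss_rate, miss_penalty = 1, 0.05, 20     # cycles, from the example (1ns clock)
amat = hit_time + miss_rate * miss_penalty
amat_alt = (1 - miss_rate) * hit_time + miss_rate * (hit_time + miss_penalty)
print(amat, amat_alt)    # both 2.0 cycles = 2ns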
MEASURING/
CLASSIFYING MISSES
How to find out?
 Cold misses: Simulate a fully associative infinite cache size
 Capacity misses: Simulate fully associative cache, then
deduct cold misses
 Conflict misses: Simulate target cache configuration then
deduct cold and capacity misses

Classification is useful to understand how to


eliminate misses
High conflict misses → need higher associativity
High capacity misses → need larger cache
51
MULTILEVEL CACHES

Primary cache attached to CPU


 Small, but fast

Level-2 cache services misses


from primary cache
 Larger, slower, but still faster than main
memory

Main memory services L-2 cache


misses
Some high-end systems include L-3 cache
52
MULTILEVEL CACHE
EXAMPLE
Given
 CPU base CPI = 1, clock rate = 4GHz
 Miss rate/instruction = 2%
 Main memory access time = 100ns

With just primary cache


 Miss penalty = 100ns/0.25ns = 400 cycles
 Effective CPI = 1 + 0.02 × 400 = 9

Now add L-2 cache


 Access time = 5ns
 Global miss rate to main memory = 0.5%
 Primary miss with L2 hit
 Penalty = 5ns/0.25ns = 20 cycles
 Primary miss with L2 miss
 Extra penalty = 400 cycles
 CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4
 Performance ratio = 9/3.4 = 2.6

53
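Recomputing the multilevel-cache example above (0.25ns cycle at 4GHz); the variable names are mine:

cycle_ns = 0.25
mem_penalty = 100 / cycle_ns     # 400 cycles to main memory
l2_penalty = 5 / cycle_ns        # 20 cycles to the L2 cache

cpi_l1_only = 1 + 0.02 * mem_penalty                         # 9.0
cpi_with_l2 = 1 + 0.02 * l2_penalty + 0.005 * mem_penalty    # 3.4
print(cpi_l1_only, cpi_with_l2, cpi_l1_only / cpi_with_l2)   # 9.0 3.4 ~2.6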
MULTILEVEL CACHE
CONSIDERATIONS

Primary cache
 Focus on minimal hit time

L2 cache
 Focus on low miss rate to avoid main
memory access
 Hit time has less overall impact

Results
 L-1 cache usually smaller than a single
cache
 L-1 block size smaller than L-2 block size
54
VIRTUAL MEMORY
MEMORY HIERARCHY
[Pyramid: Registers → On-Chip SRAM → Off-Chip SRAM → DRAM → Disk; speed and cost decrease going down, capacity increases]
56
WHY VIRTUAL MEMORY?

Allows applications to be bigger than main memory size

Helps with multiple process management


 Each process gets its own chunk of memory
 Protection of processes against each other
 Mapping of multiple processes to memory
 Relocation
 Application and CPU run in virtual space
 Mapping of virtual to physical space is invisible to the application

Management between main memory and disk


 Miss in main memory is a page fault or address fault
 Block is a page

57
MAPPING VIRTUAL TO PHYSICAL MEMORY
Divide memory into equal sized "chunks" or pages (typically 4KB each)
Any chunk of Virtual Memory can be assigned to any chunk of Physical Memory
[Diagram: a single process's virtual memory (Code, Static, Heap, Stack, from address 0 up to ∞) mapped onto 64 MB of physical memory]
58
PAGED VIRTUAL
MEMORY
Virtual address space divided into pages
Physical address space divided into pageframes
Page missing in Main Memory = page fault
 Pages not in Main Memory are on disk: swap-in/swap-out
 Or have never been allocated
 New page may be placed anywhere in MM (fully associative map)

Dynamic address translation


 Effective address is virtual
 Must be translated to physical for every access
 Virtual to physical translation through page table in Main Memory

59
CACHE VS VM
                 Cache                    Virtual Memory
                 Block or Line            Page
                 Miss                     Page Fault
                 Block Size: 32-64B       Page Size: 4K-16KB
Placement:       Direct Mapped,           Fully Associative
                 N-way Set Associative
Replacement:     LRU or Random            LRU approximation
                 Write Thru or Back       Write Back
How Managed:     Hardware                 Hardware + Software (Operating System)
60
HANDLING PAGE FAULTS
A page fault is like a cache miss
 Must find page in lower level of hierarchy

If valid bit is zero, the Physical Page Number


points to a page on disk
When OS starts new process, it creates space
on disk for all the pages of the process, sets
all valid bits in page table to zero, and all
Physical Page Numbers to point to disk
 called Demand Paging - pages of the process are loaded
from disk only as needed
 Create “swap” space for all virtual pages on disk

61
PERFORMING ADDRESS TRANSLATION
VM divides memory into equal sized pages
Address translation relocates entire pages
 offsets within the pages do not change
 if page size is a power of two, the virtual address separates into two fields (like cache index and offset fields):
 virtual address = Virtual Page Number | Page Offset
62
MAPPING VIRTUAL TO
PHYSICAL ADDRESS

63
ADDRESS TRANSLATION

Want fully associative page placement


How to locate the physical page?
 Search impractical (too many pages)

A page table is a data structure which contains


the mapping of virtual pages to physical pages
 There are several different ways, all up to the
operating system, to keep this data around
Each process running in the system has its own
page table
64
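A toy sketch of the lookup this slide describes, assuming 4KB pages and a flat per-process table indexed by virtual page number; the dictionary-based table, its contents and the translate name are illustrative only.

PAGE_SIZE = 4096      # 4KB pages: the low 12 bits are the page offset

page_table = {0x00005: (True, 0x1A3),     # vpn -> (valid, physical page number)
              0x00006: (False, None)}

def translate(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    valid, ppn = page_table.get(vpn, (False, None))
    if not valid:
        raise RuntimeError("page fault: the OS must bring the page into memory")
    return ppn * PAGE_SIZE + offset        # physical page number, offset unchanged

print(hex(translate(0x5ABC)))    # vpn 0x5 -> ppn 0x1A3, offset 0xABC -> 0x1a3abc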
PAGE TABLE AND ADDRESS TRANSLATION
[Diagram: the page table register points to the page table in main memory; the virtual page number indexes an entry containing valid bits and other flags, which yields the translation used to access main memory]

65
PAGE TABLE
Page table translates address

66
MAPPING PAGES TO
STORAGE

67
REPLACEMENT AND
WRITES
To reduce page fault rate, prefer least-
recently used (LRU) replacement
 Reference bit (aka use bit) in (Page Table Entry)
PTE set to 1 on access to page
 Periodically cleared to 0 by OS
 A page with reference bit = 0 has not been used
recently
Disk writes take millions of cycles
 Block at once, not individual locations
 Write through is impractical
 Use write-back
 Dirty bit in PTE set when page is written
68
OPTIMIZING VM

Page Table too big!


 4GB Virtual address space / 4 KB page
 2^20 page table entries. Assume 4B per entry.
 4MB just for the Page Table of a single process
 With 100 processes, 400MB of memory is required!

Virtual Memory too slow!


 Requires two memory accesses.
 One to access page table to get the memory address
 Another to get the real data

69
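The arithmetic in the first bullet, spelled out:

virtual_space = 4 * 2**30            # 4GB virtual address space
page_size = 4 * 2**10                # 4KB pages
entry_bytes = 4
entries = virtual_space // page_size          # 2**20 = 1,048,576 entries
table_mb = entries * entry_bytes // 2**20     # 4MB per process
print(entries, table_mb, 100 * table_mb)      # 1048576 4 400 (MB for 100 processes)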
FAST ADDRESS
TRANSLATION
Problem: Virtual Memory requires two memory accesses!
 one to translate Virtual Address into Physical Address (page table
lookup)
 one to transfer the actual data (hit)
 But Page Table is in physical memory! => 2 main memory accesses!

Observation: since there is locality in pages of data,


must be locality in virtual addresses of those pages!
Why not create a cache of virtual to physical address
translations to make translation fast? (smaller is faster)
For historical reasons, such a “page table cache” is
called a Translation Lookaside Buffer, or TLB

70
FAST TRANSLATION
USING A TLB

71
TLB TRANSLATION
[Diagram: the virtual address (virtual page number + byte offset) is looked up in the TLB; if a tag matches and the entry is valid, the TLB supplies the physical page number and other flags, which combine with the byte offset to form the physical address (cache tag, cache index, byte offset in word)]
Virtual-to-physical address translation by a TLB and how the resulting physical address is used to access the cache memory.
72
TLB MISSES

If page is in memory
 Load the PTE from memory and retry
 Could be handled in hardware
 Can get complex for more complicated page table structures
 Or in software
 Raise a special exception, with optimized handler

If page is not in memory (page fault)


 OS handles fetching the page and updating the
page table
 Then restart the faulting instruction

73
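A sketch tying the last few slides together: try the TLB first, read the PTE from the page table on a TLB miss, and raise a page fault if the page is not resident. The small fully associative dict standing in for the TLB and all names are illustrative.

PAGE_SIZE = 4096
tlb = {}                                                  # vpn -> ppn, a tiny "cache of PTEs"
page_table = {0x10: (True, 0x2B), 0x11: (False, None)}    # vpn -> (valid, ppn)

def translate(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:                                        # TLB hit: no page table access
        return tlb[vpn] * PAGE_SIZE + offset
    valid, ppn = page_table.get(vpn, (False, None))       # TLB miss: fetch the PTE
    if not valid:
        raise RuntimeError("page fault: OS loads the page, then the instruction restarts")
    tlb[vpn] = ppn                                        # refill the TLB and retry
    return ppn * PAGE_SIZE + offset

print(hex(translate(0x10ABC)))    # TLB miss, refill -> 0x2babc
print(hex(translate(0x10FFF)))    # now a TLB hit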
TLB MISS HANDLER

TLB miss indicates


 Page present, but PTE not in TLB
 Page not present

Must recognize TLB miss before


destination register overwritten
 Raise exception

Handler copies PTE from memory to


TLB
 Then restarts instruction
 If page not present, page fault will occur
74
PAGE FAULT HANDLER

Use faulting virtual address to find PTE


Locate page on disk
Choose page to replace
 If dirty, write to disk first

Read page into memory and update


page table
Make process runnable again
 Restart from faulting instruction

75
TLB AND CACHE
INTERACTION
If cache tag uses
physical address
 Need to translate before
cache lookup

Physically Indexed,
Physically Tagged

76
TLB AND CACHE
ADDRESSING
Cache review
 Set or block field indexes are used to get tags
 2 steps to determine hit:
 Index (lookup) to find tags (using block or set bits)
 Compare tags to determine hit
 Sequential connection between indexing and tag comparison

Rather than waiting for address


translation and then performing this two-
step hit process, can we overlap the
translation and portions of the hit
sequence?
 Yes!
77
CACHE INDEX/TAG
OPTIONS
Physically indexed, physically tagged
(PIPT)
 Wait for full address translation
 Then use physical address for both
indexing and tag comparison

Virtually indexed, physically tagged


(VIPT)
 Use portion of the virtual address for
indexing then wait for address
translation and use physical address for
tag comparisons

Virtually indexed, virtually tagged (VIVT)


 Use virtual address for both indexing
and tagging…No TLB access unless
cache miss
 Requires invalidation of cache lines on context switch or use of process ID as part of tags
78
VIRTUALLY INDEXED, PHYSICALLY TAGGED

79
CACHE & VIRTUAL
MEMORY

80
SUMMARY

Virtual Memory overcomes main


memory size limitations
VM supported through Page Tables
TLB enables fast address
translation

81
MAIN MEMORY
MEMORY HIERARCHY
[Pyramid: Registers → On-Chip SRAM → Off-Chip SRAM → DRAM → Disk; speed and cost decrease going down, capacity increases]
83
MAIN MEMORY DESIGN

Commodity DRAM chips


Wide design space for
 Minimizing cost, latency
 Maximizing bandwidth, storage

Susceptible to soft errors


 Protect with ECC (SECDED)
 ECC also widely used in on-chip memories,
busses

84
DRAM CHIP ORGANIZATION
[Diagram: row address → row decoder → memory cell array (wordlines × bitlines; each cell is a transistor plus a capacitor) → sense amps → row buffer → column decoder (column address) → data bus]
• Optimized for density, not speed
• Data stored as charge in a capacitor
• Discharge on reads => destructive reads
• Charge leaks over time
  – refresh every 64ms
• Read entire row at once (RAS, page open)
• Read word from row (CAS)
• Burst mode (sequential words)
• Write row back (precharge, page close)
85
MAIN MEMORY DESIGN
Single DRAM chip has multiple internal banks
86
MAIN MEMORY ACCESS
• Each memory access (DRAM bus clocks, 10x CPU cycle time)
– 5 cycles to send row address (page open or RAS)
– 1 cycle to send column address
– 3 cycle DRAM access latency
– 1 cycle to send data (CAS latency = 1+3+1 = 5)
– 5 cycles to send precharge (page close)
– 4 word cache block

• One word wide: (r = row addr, c = col addr, d = delay, b = bus, p = precharge)
  rrrrrcdddbcdddbcdddbcdddbppppp
  – 5 + 4 * (1 + 3 + 1) = 25 cycles delay
  – 5 more busy cycles (precharge) till next command
87
MAIN MEMORY ACCESS
• One word wide, burst mode (pipelined): later words only occupy the bus
  rrrrrcdddbbbbppppp
  – 5 + 1 + 3 + 4 = 13 cycles
  – Interleaving is similar, but words can be from different rows, each open in a different bank

• Four word wide memory:
  rrrrrcdddbppppp
  – 5 + 1 + 3 + 1 = 10 cycles
88
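The cycle counts on these two slides can be reproduced directly; this back-of-the-envelope sketch uses my own parameter names for the per-phase latencies.

RAS, CAS, ACCESS, BUS, PRECHARGE = 5, 1, 3, 1, 5     # DRAM bus cycles from the slides

def one_word_wide(words):          # every word pays CAS + access + transfer
    return RAS + words * (CAS + ACCESS + BUS)

def burst_mode(words):             # later words only occupy the bus
    return RAS + CAS + ACCESS + words * BUS

def four_word_wide():              # the whole block moves in one transfer
    return RAS + CAS + ACCESS + BUS

print(one_word_wide(4), burst_mode(4), four_word_wide())   # 25 13 10
# plus PRECHARGE (5) busy cycles before the next row can be opened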
ERROR DETECTION AND
CORRECTION
Main memory stores a huge number of bits
 Probability of bit flip becomes nontrivial
 Bit flips (called soft errors) caused by
 Slight manufacturing defects
 Gamma rays and alpha particles
 Electrical interference
 Etc.
 Getting worse with smaller feature sizes
Reliable systems must be protected from
soft errors via ECC (error correction codes)
 Even PCs support ECC these days

89
ERROR CORRECTING
CODES
Probabilities:
P(1 word no errors) > P(single error) > P(two errors) >>
P(>2 errors)

Detection - signal a problem


Correction - restore data to correct value
Most common
 Parity - single error detection
 SECDED - single error correction; double error detection

90
ECC CODES FOR ONE BIT
Power     Correct codes   # bits   Comments
Nothing   0, 1            1
SED       00, 11          2        01, 10 detect errors
SEC       000, 111        3        001, 010, 100 => 0; 110, 101, 011 => 1
SECDED    0000, 1111      4        One 1 => 0; two 1's => error; three 1's => 1
91
ECC
# 1's     0    1    2     3    4
Result    0    0    Err   1    1
Hamming distance
 No. of bit flips to convert one valid code to another
 All legal SECDED codes are at Hamming distance of 4
 I.e. in single-bit SECDED, all 4 bits flip to go from the representation for '0' (0000) to the representation for '1' (1111)
92
ECC
Reduce overhead by applying codes to a word, not a bit
 Larger word means higher p(>=2 errors)

# bits   SED overhead   SECDED overhead
1        1 (100%)       3 (300%)
32       1 (3%)         7 (22%)
64       1 (1.6%)       8 (13%)
n        1 (1/n)        1 + log2 n + a little
93
64-BIT ECC

64 bits data with 8 check bits


 dddd…..d cccccccc
DIMM with 9 x 8-bit-wide DRAM chips = 72 bits
Intuition
 One check bit is parity
 Other check bits point to
 Error in data, or
 Error in all check bits, or
 No error
94
ECC

To store (write)
 Use data0 to compute check0
 Store data0 and check0

To load
 Read data1 and check1
 Use data1 to compute check2
 Syndrome = check1 xor check2
 I.e. make sure check bits are equal

95
ECC SYNDROME
Syndrome   Parity   Implications
0          OK       data1 == data0
n != 0     Not OK   Flip bit n of data1 to get data0
n != 0     OK       Signals uncorrectable error
96
4-BIT SECDED CODE
Check equations:
 C1 = b1 ⊕ b2 ⊕ b4
 C2 = b1 ⊕ b3 ⊕ b4
 C3 = b2 ⊕ b3 ⊕ b4
 P = even parity over the whole codeword

Bit Position   001   010   011   100   101   110   111
Codeword       C1    C2    b1    C3    b2    b3    b4    P
C1             X           X           X           X
C2                   X     X                 X     X
C3                               X     X     X     X
P              X     X     X     X     X     X     X     X

Cn parity bits chosen specifically to:
 Identify errors in bits where bit n of the index is 1
 C1 checks all odd bit positions (where LSB=1)
 C2 checks all positions where middle bit=1
 C3 checks all positions where MSB=1
Hence, a nonzero syndrome points to the faulty bit
97
4-BIT SECDED EXAMPLE C1 b1  b2  b4
C2 b1  b3  b4
Bit Position 1 2 3 4 5 6 7
C3 b2  b3  b4
Codeword C C b1 C b2 b3 b4 P P even _ parity
1 2 3

Original data 1 0 1 1 0 1 0 0 Syndrome


No corruption 1 0 1 1 0 1 0 0 0 0 0, P ok
1 bit corrupted 1 0 0 1 0 1 0 0 0 1 1, P !ok
2 bits 1 0 0 1 1 1 0 0 1 1 0, P ok
corrupted
4 data bits, 3 check bits, 1 parity bit
Syndrome is xor of check bits C1-3
 If (syndrome==0) and (parity OK) => no error
 If (syndrome != 0) and (parity !OK) => flip bit position
pointed to by syndrome
 If (syndrome != 0) and (parity OK) => double-bit error
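A small sketch of the Hamming (7,4) code plus overall parity used in the last two slides; the encode/check helpers are my own naming, and the syndrome is assembled as (C3 C2 C1) as in the example rows.

def encode(b1, b2, b3, b4):
    c1 = b1 ^ b2 ^ b4
    c2 = b1 ^ b3 ^ b4
    c3 = b2 ^ b3 ^ b4
    word = [c1, c2, b1, c3, b2, b3, b4]      # bit positions 1..7
    return word + [sum(word) % 2]            # P = even parity over the whole codeword

def check(word):
    c1, c2, b1, c3, b2, b3, b4, p = word
    syndrome = ((b2 ^ b3 ^ b4 ^ c3) << 2) | ((b1 ^ b3 ^ b4 ^ c2) << 1) | (b1 ^ b2 ^ b4 ^ c1)
    parity_ok = sum(word) % 2 == 0
    if syndrome == 0 and parity_ok:
        return "no error"
    if syndrome != 0 and not parity_ok:
        return "single error: flip bit position %d" % syndrome
    return "double error (uncorrectable)"

cw = encode(1, 0, 1, 0)          # -> [1, 0, 1, 1, 0, 1, 0, 0], the row on the slide
print(cw, check(cw))             # no error
bad1 = cw.copy(); bad1[2] ^= 1   # corrupt b1 (bit position 3)
print(check(bad1))               # single error: flip bit position 3
bad2 = bad1.copy(); bad2[4] ^= 1 # also corrupt b2 (bit position 5)
print(check(bad2))               # double error (uncorrectable)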
SUMMARY

Commodity DRAM chips


Wide design space for
 Minimizing cost, latency
 Maximizing bandwidth, storage

Susceptible to soft errors


 Protect with ECC (SECDED)
 ECC also widely used in on-chip memories,
busses

99
