L - 3-AssociativeMapping - Virtual Memory

This document discusses different types of associative caches, including fully associative caches and n-way set associative caches. It provides examples to illustrate how direct mapped, 2-way set associative, and fully associative caches handle block access sequences. The document also discusses factors that influence cache performance such as associativity, replacement policies, multilevel caches, and virtual memory. Virtual memory uses main memory as a cache for secondary storage and relies on address translation between virtual and physical addresses via page tables.


Associative Caches

 Fully associative
 Allow a given block to go in any cache entry
 Requires all entries to be searched at once
 Comparator per entry (expensive)
 n-way set associative
 Each set contains n entries
 Block number determines which set
 (Block number) modulo (#Sets in cache); see the sketch below
 Search all entries in a given set at once
 n comparators (less expensive)
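As a rough illustration of the set-index computation and the per-set search described above, here is a minimal C sketch; the cache parameters, structure fields, and function names are illustrative assumptions, not part of the slides.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 256u      /* assumed number of sets (a power of two) */
#define WAYS     4u        /* assumed associativity (n = 4)           */

struct line {              /* one cache entry (illustrative fields)   */
    bool     valid;
    uint32_t tag;
};

static struct line cache[NUM_SETS][WAYS];

/* The block number selects one set; in hardware all n entries of that
 * set are compared against the tag at once (n comparators); here the
 * "parallel" comparison is simply a loop. */
bool lookup(uint32_t block_number)
{
    uint32_t set = block_number % NUM_SETS;   /* (block number) modulo (#sets) */
    uint32_t tag = block_number / NUM_SETS;   /* remaining bits form the tag   */

    for (uint32_t way = 0; way < WAYS; way++)
        if (cache[set][way].valid && cache[set][way].tag == tag)
            return true;                      /* hit */
    return false;                             /* miss */
}
```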
Associative Cache Example



Spectrum of Associativity
 For a cache with 8 entries



Associativity Example
 Compare 4-block caches
 Direct mapped, 2-way set associative, fully associative
 Block access sequence: 0, 8, 0, 6, 8
 Direct mapped

Block address   Cache index   Hit/miss   Cache content after access (indices 0–3)
0               0             miss       Mem[0]
8               0             miss       Mem[8]
0               0             miss       Mem[0]
6               2             miss       Mem[0], Mem[6]
8               0             miss       Mem[8], Mem[6]

Result: 5 misses
Associativity Example
 2-way set associative (2 sets, LRU replacement assumed)

Block address   Cache set   Hit/miss   Cache content after access (set 0)
0               0           miss       Mem[0]
8               0           miss       Mem[0], Mem[8]
0               0           hit        Mem[0], Mem[8]
6               0           miss       Mem[0], Mem[6]
8               0           miss       Mem[8], Mem[6]

Result: 4 misses, 1 hit

Associativity Example
 Fully associative

Block address   Hit/miss   Cache content after access
0               miss       Mem[0]
8               miss       Mem[0], Mem[8]
0               hit        Mem[0], Mem[8]
6               miss       Mem[0], Mem[8], Mem[6]
8               hit        Mem[0], Mem[8], Mem[6]

Result: 3 misses, 2 hits
(A small simulation that reproduces these counts follows below.)
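As a cross-check of the three tables above, the following small C program simulates the block access sequence 0, 8, 0, 6, 8 for all three configurations; it is an illustrative sketch, not from the slides, and the names and the 4-block / LRU parameters are assumptions chosen to match the example.

```c
#include <stdio.h>

#define BLOCKS 4                 /* 4-block cache, as in the example */

/* Simulate an n-way set associative cache with BLOCKS entries in total
 * and LRU replacement.  ways = 1 is direct mapped; ways = BLOCKS is
 * fully associative.  Returns the number of misses for the trace.     */
static int simulate(int ways, const int *trace, int n)
{
    int sets = BLOCKS / ways;
    int tag[BLOCKS], last_use[BLOCKS];   /* -1 = invalid / never used */
    int misses = 0;

    for (int i = 0; i < BLOCKS; i++) { tag[i] = -1; last_use[i] = -1; }

    for (int t = 0; t < n; t++) {
        int set = trace[t] % sets, base = set * ways;
        int hit = -1, victim = base;

        for (int w = 0; w < ways; w++) {
            if (tag[base + w] == trace[t]) hit = base + w;
            if (last_use[base + w] < last_use[victim]) victim = base + w;
        }
        if (hit >= 0) {
            last_use[hit] = t;            /* refresh LRU state on a hit   */
        } else {
            misses++;
            tag[victim] = trace[t];       /* replace LRU (or empty) entry */
            last_use[victim] = t;
        }
    }
    return misses;
}

int main(void)
{
    const int trace[] = { 0, 8, 0, 6, 8 };
    const int n = sizeof trace / sizeof trace[0];

    printf("direct mapped    : %d misses\n", simulate(1, trace, n)); /* 5 */
    printf("2-way set assoc. : %d misses\n", simulate(2, trace, n)); /* 4 */
    printf("fully associative: %d misses\n", simulate(4, trace, n)); /* 3 */
    return 0;
}
```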
How Much Associativity
 Increased associativity decreases miss rate
 But with diminishing returns
 Simulation of a system with 64KB D-cache, 16-word blocks, SPEC2000:
 1-way: 10.3%
 2-way: 8.6%
 4-way: 8.3%
 8-way: 8.1%



Set Associative Cache Organization



Replacement Policy
 Direct mapped: no choice
 Set associative
 Prefer non-valid entry, if there is one
 Otherwise, choose among entries in the set
 Least-recently used (LRU)
 Choose the one unused for the longest time
 Simple for 2-way, manageable for 4-way, too hard beyond that
 Random
 Gives approximately the same performance as LRU for high associativity



Multilevel Caches
 Primary cache attached to CPU
 Small, but fast
 Level-2 cache services misses from primary cache
 Larger, slower, but still faster than main memory
 Main memory services L-2 cache misses
 Some high-end systems include L-3 cache



Multilevel Cache Example
 Suppose we have a processor with a base CPI of 1.0, assuming all references hit in the primary cache, and a clock rate of 4 GHz. Assume a main memory access time of 100 ns, including all the miss handling. Suppose the miss rate per instruction at the primary cache is 2%. How much faster will the processor be if we add a secondary cache that has a 5 ns access time for either a hit or a miss and is large enough to reduce the miss rate to main memory to 0.5%?



Multilevel Cache Example
 Given
 CPU base CPI = 1, clock rate = 4GHz
 Miss rate/instruction = 2%
 Main memory access time = 100ns
 With just primary cache
 Miss penalty = 100ns/0.25ns = 400 cycles
 Effective CPI = 1 + 0.02 × 400 = 9



Example (cont.)
 Now add L-2 cache
 Access time = 5ns
 Global miss rate to main memory = 0.5%
 Primary miss with L-2 hit
 Penalty = 5ns/0.25ns = 20 cycles
 CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4
 Performance ratio = 9/3.4 = 2.6
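The arithmetic above can be reproduced with a few lines of C; this is just a back-of-the-envelope check, with the given figures hard-coded (nothing here comes from the slides beyond those numbers).

```c
#include <stdio.h>

int main(void)
{
    double clock_ns = 0.25;    /* 4 GHz clock -> 0.25 ns per cycle */
    double mem_ns   = 100.0;   /* main memory access time          */
    double l2_ns    = 5.0;     /* L2 access time (hit or miss)     */
    double miss_l1  = 0.02;    /* primary cache miss rate/instr.   */
    double miss_mem = 0.005;   /* global miss rate to main memory  */

    double mem_cycles = mem_ns / clock_ns;   /* 400 cycles */
    double l2_cycles  = l2_ns / clock_ns;    /*  20 cycles */

    double cpi_l1only = 1.0 + miss_l1 * mem_cycles;                        /* 9.0 */
    double cpi_l1l2   = 1.0 + miss_l1 * l2_cycles + miss_mem * mem_cycles; /* 3.4 */

    printf("CPI with L1 only : %.1f\n", cpi_l1only);
    printf("CPI with L1 + L2 : %.1f\n", cpi_l1l2);
    printf("Speedup          : %.1f\n", cpi_l1only / cpi_l1l2);  /* ~2.6 */
    return 0;
}
```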



Multilevel Cache Considerations
 Primary cache
 Focus on minimal hit time
 L-2 cache
 Focus on low miss rate to avoid main memory access
 Hit time has less overall impact
 Results
 L-1 cache usually smaller than a single cache
 L-1 block size smaller than L-2 block size



Virtual Memory
 Use main memory as a “cache” for secondary (disk) storage
 Managed jointly by CPU hardware and the operating system (OS)
 Programs share main memory
 Each gets a private virtual address space holding its frequently used code and data
 Protected from other programs
 CPU and OS translate virtual addresses to physical addresses
 VM “block” is called a page
 VM translation “miss” is called a page fault
Virtual Memory
Consider a collection of programs running at once on a computer.
The total memory required by all the programs may be larger than the amount of main memory available on the computer, yet only a fraction of this memory is actively used at any point in time.
Main memory therefore needs to hold only the active portions of the many programs, just as a cache contains only the active portion of one program.
To allow multiple programs to share the same memory, we must ensure that a program can only read and write the portions of main memory that have been assigned to it.
We also cannot know which programs will share the memory with other programs when we compile them.
Virtual Memory
 Programs sharing the memory change dynamically while the programs are running
 Because of this dynamic interaction, each program is compiled into its own address space, i.e. a separate range of memory locations accessible only to this program
 Virtual memory implements the translation of a program's address space to physical addresses
– This translation process enforces protection of a program's address space from other programs
 VM also allows a single user program to exceed the size of primary memory
– Formerly, if a program became too large for memory, it was up to the programmer to make it fit
Virtual Memory
 Programmers divided programs into pieces (overlays) and then identified the pieces that were mutually exclusive
 The program never tried to access an overlay that was not loaded, and the overlays loaded never exceeded the total size of the memory
– Overlays were organised as modules, each containing both code and data
 Overlaying one module with another was achieved by calls between the procedures
Address Translation
 Fixed-size pages (e.g., 4K)



Address Translation
 The preceding figure shows a virtually addressed memory with pages mapped to main memory
 This is called address mapping or address translation
– E.g., today the two memory hierarchy levels controlled by virtual memory are DRAM and magnetic disks
 Working principle:
– VM simplifies loading the program for execution by providing relocation
– Relocation maps the virtual addresses used by a program to different physical addresses before the addresses are used to access memory
– This relocation allows us to load the program anywhere in main memory
Address Translation
 Advantages:
– Eliminates the need to find a contiguous block of memory to allocate to a program
– Formerly, relocation problems required special hardware and special support in the operating system
• Today, virtual memory also provides this function
– In virtual memory, the address is broken into a virtual page number and a page offset
Translation Using a Page Table



Translation Using a Page Table
 The preceding figure shows the translation of a virtual page number to a physical page number
 Physical page number: the upper portion of the physical address
 Page offset: the lower portion of the physical address
 The number of bits in the page offset field determines the page size (see the sketch below)
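To make the split concrete, here is a minimal C sketch of the address split and recombination for an assumed 4 KB page size (12 offset bits); the flat page-table array and the names are illustrative, not from the slides.

```c
#include <stdint.h>

#define OFFSET_BITS 12u                       /* assumed 4 KB pages */
#define PAGE_SIZE   (1u << OFFSET_BITS)

/* Illustrative flat page table: one physical page number per virtual
 * page.  A real entry also carries valid, dirty, referenced, ... bits. */
static uint32_t page_table[1u << (32 - OFFSET_BITS)];

uint32_t translate(uint32_t virtual_address)
{
    uint32_t vpn    = virtual_address >> OFFSET_BITS;     /* virtual page number     */
    uint32_t offset = virtual_address & (PAGE_SIZE - 1);  /* page offset (unchanged) */
    uint32_t ppn    = page_table[vpn];                    /* physical page number    */

    return (ppn << OFFSET_BITS) | offset;                 /* physical address        */
}
```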
Translation Using a Page Table
 The number of pages addressable with the virtual address need not match the number of pages addressable with the physical address
 A larger number of virtual pages than physical pages gives the illusion of an essentially unbounded amount of virtual memory
 A miss in virtual memory is called a page fault
Page Fault Penalty
 On page fault, the page must be fetched from disk
 Takes millions of clock cycles
 Handled by OS code
 Try to minimize page fault rate
 Fully associative placement
 Smart replacement algorithms



Page Tables
 Stores placement information
 Array of page table entries, indexed by virtual page number
 Page table register in CPU points to the page table in physical memory
 If page is present in memory
 PTE stores the physical page number
 Plus other status bits (referenced, dirty, …)
 If page is not present
 PTE can refer to a location in swap space on disk (see the sketch below)
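As an illustration of what a page table entry might hold, here is a small C sketch; the field names and widths are assumptions for the sketch only, since real PTE formats are machine-specific.

```c
#include <stdint.h>

/* One page table entry (an illustrative layout, not any real ISA's format). */
struct pte {
    uint32_t valid      : 1;   /* page is present in physical memory        */
    uint32_t dirty      : 1;   /* page has been written (write-back policy) */
    uint32_t referenced : 1;   /* set on access, periodically cleared by OS */
    uint32_t ppn        : 20;  /* physical page number, meaningful if valid */
    /* If the page is not present, the OS records its location in swap
     * space on disk (often by reusing these bits).                          */
};
```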
Mapping Pages to Storage



Mapping Pages to Storage
 Page table : It resides in memory . In VM
systems, locate pages by using a table that
indexes the memory
 Page table is indexed with page number from the virtual
address to discover the corresponding physical page
number
 Each program has its own page table, which maps virtual
address space of that program to main memory
Mapping Pages to Storage
 Page table register: to indicate the location of the page table in memory, the hardware includes a register that points to the start of the page table
Replacement and Writes
 To reduce page fault rate, prefer least-recently used (LRU) replacement
 Reference bit (aka use bit) in PTE set to 1 on access to page
 Periodically cleared to 0 by OS
 A page with reference bit = 0 has not been used recently (see the sketch below)
 Disk writes take millions of cycles
 Block at once, not individual locations
 Write through is impractical
 Use write-back
 Dirty bit in PTE set when page is written
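A rough sketch of the reference-bit approximation to LRU described above; everything here (structure layout, function names, the linear victim scan) is an assumption for illustration, not the slides' or any OS's actual mechanism.

```c
#include <stddef.h>
#include <stdint.h>

struct pte { uint32_t valid : 1, dirty : 1, referenced : 1, ppn : 20; };

/* Called periodically by the OS: clear every reference bit so that, by
 * the next sweep, pages with referenced == 0 are known not to have been
 * used recently.                                                         */
void clear_reference_bits(struct pte *table, size_t npages)
{
    for (size_t i = 0; i < npages; i++)
        table[i].referenced = 0;
}

/* Prefer to evict a page that has not been referenced since the last
 * sweep -- a cheap approximation of least-recently used.                */
size_t choose_victim(const struct pte *table, size_t npages)
{
    for (size_t i = 0; i < npages; i++)
        if (table[i].valid && !table[i].referenced)
            return i;
    return 0;   /* fallback: every resident page was referenced recently */
}
```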
Translation-Lookaside Buffer
Fast Translation Using a TLB
 Page tables are stored in main memory
 Address translation would therefore appear to require extra memory references
 One memory access to obtain the physical address
 A second memory access to get the data



Fast Translation Using a TLB
 How can access performance be improved?
– Access to page tables has good locality
 So use a fast cache of PTEs within the CPU
 Called a Translation Look-aside Buffer (TLB)
 Typical: 16–512 PTEs, 0.5–1 cycle for hit, 10–100 cycles for miss, 0.01%–1% miss rate
 Misses can be handled by hardware or software (see the sketch below)
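The following C fragment sketches the idea of checking a small TLB before walking the page table; the TLB size and organization, the structure fields, and the helper names are assumptions for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

#define OFFSET_BITS 12u              /* assumed 4 KB pages   */
#define TLB_ENTRIES 64u              /* assumed TLB capacity */

struct tlb_entry { bool valid; uint32_t vpn, ppn; };

static struct tlb_entry tlb[TLB_ENTRIES];
static uint32_t page_table[1u << (32 - OFFSET_BITS)];   /* flat table (sketch) */

/* Slow path on a TLB miss: one extra memory reference to fetch the PTE. */
static uint32_t walk_page_table(uint32_t vpn)
{
    return page_table[vpn];
}

uint32_t translate(uint32_t va)
{
    uint32_t vpn    = va >> OFFSET_BITS;
    uint32_t offset = va & ((1u << OFFSET_BITS) - 1);
    uint32_t idx    = vpn % TLB_ENTRIES;       /* direct-mapped TLB for simplicity */
    uint32_t ppn;

    if (tlb[idx].valid && tlb[idx].vpn == vpn) {
        ppn = tlb[idx].ppn;                    /* TLB hit: no extra memory access */
    } else {
        ppn = walk_page_table(vpn);            /* TLB miss: fetch PTE from memory */
        tlb[idx] = (struct tlb_entry){ true, vpn, ppn };   /* refill the TLB      */
    }
    return (ppn << OFFSET_BITS) | offset;
}
```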



Fast Translation Using a TLB



TLB Misses
 If page is in memory
 Load the PTE from memory and retry
 Could be handled in hardware
 Can get complex for more complicated page table structures
 Or in software
 Raise a special exception, with optimized handler
 If page is not in memory (page fault)
 OS handles fetching the page and updating the page table
 Then restart the faulting instruction
TLB Miss Handler
 TLB miss indicates
 Page present, but PTE not in TLB
 Page not present
 Must recognize TLB miss before the destination register is overwritten
 Raise exception
 Handler copies PTE from memory to TLB
 Then restarts instruction
 If page not present, page fault will occur



Page Fault Handler
 Use faulting virtual address to find PTE
 Locate page on disk
 Choose page to replace
 If dirty, write to disk first
 Read page into memory and update the page table
 Make process runnable again
 Restart from the faulting instruction (see the sketch below)
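In code form, the steps above might look roughly like the following C skeleton; it is a toy sketch with stubbed disk and victim-selection helpers, and every name in it is an assumption rather than a real OS API.

```c
#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 12u
#define NUM_VPAGES  16u        /* tiny illustrative virtual address space */

struct pte { uint32_t valid : 1, dirty : 1, ppn : 20; };

static struct pte page_table[NUM_VPAGES];

/* Stand-ins for the OS/disk machinery (assumptions for the sketch). */
static uint32_t choose_victim_page(void) { return 0; }
static void write_page_to_disk(uint32_t ppn)
{ printf("write back physical page %u\n", (unsigned)ppn); }
static void read_page_from_disk(uint32_t vpn, uint32_t ppn)
{ printf("load virtual page %u into physical page %u\n", (unsigned)vpn, (unsigned)ppn); }

/* Steps from the slide: find the PTE, choose a victim, write it back
 * if dirty, read the new page in, update the page table, and let the
 * process restart at the faulting instruction.                        */
void handle_page_fault(uint32_t faulting_va)
{
    uint32_t vpn        = (faulting_va >> OFFSET_BITS) % NUM_VPAGES;
    uint32_t victim_ppn = choose_victim_page();

    /* Evict whichever virtual page currently occupies the victim frame. */
    for (uint32_t v = 0; v < NUM_VPAGES; v++) {
        if (page_table[v].valid && page_table[v].ppn == victim_ppn) {
            if (page_table[v].dirty)
                write_page_to_disk(victim_ppn);   /* write back dirty page first */
            page_table[v].valid = 0;
        }
    }

    read_page_from_disk(vpn, victim_ppn);         /* fetch the faulting page */
    page_table[vpn] = (struct pte){ .valid = 1, .dirty = 0, .ppn = victim_ppn };
    /* The OS would now make the process runnable again and restart it
     * at the faulting instruction.                                     */
}
```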



TLB and Cache Interaction
 If cache tag uses physical address
 Need to translate before cache lookup
 Alternative: use virtual address tag
 Complications due to aliasing
 Different virtual addresses for shared physical address



Memory Protection
 Different tasks can share parts of their virtual address spaces
 But need to protect against errant access
 Requires OS assistance
 Hardware support for OS protection
 Privileged supervisor mode (aka kernel mode)
 Privileged instructions
 Page tables and other state information only accessible in supervisor mode
 System call exception (e.g., syscall in MIPS)
§5.8 A Common Framework for Memory Hierarchies
The Memory Hierarchy
The BIG Picture
 Common principles apply at all levels of
the memory hierarchy
 Based on notions of caching
 At each level in the hierarchy
 Block placement
 Finding a block
 Replacement on a miss
 Write policy
Block Placement
 Determined by associativity
 Direct mapped (1-way associative)
 One choice for placement
 n-way set associative
 n choices within a set
 Fully associative
 Any location
 Higher associativity reduces miss rate
 Increases complexity, cost, and access time



Finding a Block

Associativity            Location method                                   Tag comparisons
Direct mapped            Index                                             1
n-way set associative    Set index, then search entries within the set    n
Fully associative        Search all entries                                #entries
Fully associative        Full lookup table                                 0



 Hardware caches
 Reduce comparisons to reduce cost
 Virtual memory
 Full table lookup makes full associativity feasible
 Benefit in reduced miss rate
Replacement
 Choice of entry to replace on a miss
 Least recently used (LRU)
 Complex and costly hardware for high associativity
 Random
 Close to LRU, easier to implement
 Virtual memory
 LRU approximation with hardware support



Write Policy
 Write-through
 Update both upper and lower levels
 Simplifies replacement, but may require write buffer
 Write-back
 Update upper level only
 Update lower level when block is replaced
 Need to keep more state
 Virtual memory
 Only write-back is feasible, given disk write latency



Sources of Misses
 Compulsory misses (aka cold start misses)
 First access to a block
 Capacity misses
 Due to finite cache size
 A replaced block is later accessed again
 Conflict misses (aka collision misses)
 In a non-fully associative cache
 Due to competition for entries in a set
 Would not occur in a fully associative cache of the same total size



Cache Design Trade-offs

Design change            Effect on miss rate            Negative performance effect
Increase cache size      Decreases capacity misses      May increase access time
Increase associativity   Decreases conflict misses      May increase access time
Increase block size      Decreases compulsory misses    Increases miss penalty; for very large block size, may increase miss rate due to pollution



§5.9 Using a Finite State Machine to Control A Simple Cache
Cache Control
 Example cache characteristics
 Direct-mapped, write-back, write allocate
 Block size: 4 words (16 bytes)
 Cache size: 16 KB (1024 blocks)
 32-bit byte addresses
 Valid bit and dirty bit per block
 Blocking cache
 CPU waits until access is complete

Tag: bits 31–14 (18 bits)   |   Index: bits 13–4 (10 bits)   |   Offset: bits 3–0 (4 bits)
(see the address-splitting sketch below)
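A small C sketch of how these fields would be extracted from a 32-bit byte address for this cache configuration; the function and variable names are illustrative assumptions.

```c
#include <stdint.h>

/* 16 KB direct-mapped cache, 4-word (16-byte) blocks:
 * offset = 4 bits, index = 10 bits, tag = 18 bits.               */
#define OFFSET_BITS 4u
#define INDEX_BITS  10u

void split_address(uint32_t addr, uint32_t *tag, uint32_t *index, uint32_t *offset)
{
    *offset = addr & ((1u << OFFSET_BITS) - 1);                  /* bits 3-0   */
    *index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);  /* bits 13-4  */
    *tag    = addr >> (OFFSET_BITS + INDEX_BITS);                /* bits 31-14 */
}
```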



Interface Signals

Signal        CPU ↔ Cache    Cache ↔ Memory
Read/Write    control        control
Valid         control        control
Address       32 bits        32 bits
Write Data    32 bits        128 bits
Read Data     32 bits        128 bits
Ready         control        control

Multiple cycles per access on the memory side



§5.16 Concluding Remarks
Concluding Remarks
 Fast memories are small, large memories are slow
 We really want fast, large memories
 Caching gives this illusion
 Principle of locality
 Programs use a small part of their memory space frequently
 Memory hierarchy
 L1 cache ↔ L2 cache ↔ … ↔ DRAM memory ↔ disk
 Memory system design is critical for multiprocessors

