
Memory Hierarchy-II

Memory Hierarchy

- Motivation
  - Exploiting locality to provide a large, fast, and inexpensive memory
Cache Basics
- Cache is a high-speed buffer between the CPU and main memory
- Memory is divided into blocks
  - Q1: Where can a block be placed in the upper level? (Block placement)
  - Q2: How is a block found if it is in the upper level? (Block identification)
  - Q3: Which block should be replaced on a miss? (Block replacement)
  - Q4: What happens on a write? (Write strategy)
Q1: Block Placement
- Fully associative, direct mapped, or set associative
  - Example: memory block 12 placed in an 8-block cache
    - Mapping: set index = block number mod number of sets
    - Direct mapped (8 sets of one block): 12 mod 8 = slot 4
    - 2-way set associative (4 sets of two blocks): 12 mod 4 = set 0
    - Fully associative (one set of 8 blocks): block 12 may go in any slot
  - (Figure: the 32 memory blocks and the 8 cache slots under each placement scheme)
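A minimal sketch of the mapping in the example above (geometry taken from the slide: an 8-block cache), where the set index is simply the block number modulo the number of sets:

#include <stdio.h>

int main(void) {
    int block      = 12;   /* memory block from the example */
    int num_blocks = 8;    /* 8-block cache */

    int direct_slot = block % num_blocks;        /* direct mapped: 8 sets of 1 -> 12 mod 8 = 4 */
    int two_way_set = block % (num_blocks / 2);  /* 2-way: 4 sets of 2        -> 12 mod 4 = 0 */
    /* fully associative: one set of 8, so the block may go in any of the 8 slots */

    printf("direct-mapped slot = %d, 2-way set = %d\n", direct_slot, two_way_set);
    return 0;
}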
Q2: Block Identification
- Tag on each block
  - No need to check the index or block offset
- Increasing associativity ⇒ shrinks the index ⇒ expands the tag

  Address layout:  [ Tag | Index | Block Offset ]
  (Tag and Index together form the block address; the offset selects a byte within the block)
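A minimal sketch of the address split; the bit widths below are illustrative assumptions, not values from the slide:

#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 6   /* assumed 64-byte blocks */
#define INDEX_BITS  7   /* assumed 128 sets       */

int main(void) {
    uint32_t addr   = 0x12345678u;
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    /* Doubling the associativity halves the number of sets: INDEX_BITS
       shrinks by one bit and the tag grows by one bit. */
    printf("tag=0x%x index=%u offset=%u\n",
           (unsigned)tag, (unsigned)index, (unsigned)offset);
    return 0;
}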
Q3: Block Replacement
- Easy for direct-mapped caches (only one candidate)
- Set associative or fully associative:
  - Random
    - Easy to implement
  - LRU (Least Recently Used)
    - Relies on the past to predict the future; hard to implement exactly (see the sketch after this list)
  - FIFO
    - A rough approximation of LRU
  - Not Recently Used (NRU)
    - Maintain reference bits and dirty bits; clear the reference bits periodically; divide all blocks into four categories and choose a victim from the lowest category
  - Optimal replacement?
    - Label each cached block with the number of instructions to be executed before that block is next referenced, then choose the block with the highest label
    - Unrealizable! (requires knowledge of the future)
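A minimal sketch of LRU victim selection within one set, assuming each block keeps a timestamp of its last use (one of several possible implementations):

#include <stdint.h>

#define WAYS 4  /* assumed associativity */

struct block {
    int      valid;
    uint32_t tag;
    uint64_t last_use;  /* time of the most recent access */
};

/* Return the way to evict: an invalid block if one exists, otherwise the
   least recently used block (smallest last_use timestamp). */
int choose_victim(struct block set[WAYS]) {
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (!set[w].valid)
            return w;
        if (set[w].last_use < set[victim].last_use)
            victim = w;
    }
    return victim;
}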
Q4: Write Strategy

                                     Write-Through                    Write-Back
  Policy                             Data written to the cache        Write data only to the cache;
                                     block is also written to         update the lower level when the
                                     lower-level memory               block falls out of the cache
  Implementation                     Easy                             Hard
  Do read misses produce writes?     No                               Yes
  Do repeated writes make it
  to the lower level?                Yes                              No
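A toy sketch contrasting the two policies; the arrays below stand in for the cache and the lower-level memory, and the names are illustrative only:

#define NUM_BLOCKS 8

static int cache_data[NUM_BLOCKS];
static int memory_data[NUM_BLOCKS];
static int dirty[NUM_BLOCKS];

/* Write-through: every write updates the cache and the lower level. */
void write_through(int block, int data) {
    cache_data[block]  = data;
    memory_data[block] = data;
}

/* Write-back: only the cache is updated; the dirty block reaches the
   lower level when it is eventually evicted. */
void write_back(int block, int data) {
    cache_data[block] = data;
    dirty[block] = 1;
}

void evict(int block) {
    if (dirty[block]) {
        memory_data[block] = cache_data[block];
        dirty[block] = 0;
    }
}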
Write Buffers

  (Figure: Processor -> Cache -> Lower-Level Memory, with a write buffer between the cache and the lower level)

- A write-through cache uses the write buffer to hold data awaiting write-through to lower-level memory
- Q: Why a write buffer? A: So the CPU doesn't stall on writes.
- Q: Why a buffer, why not just one register? A: Bursts of writes are common.
- Q: Are Read After Write (RAW) hazards an issue for the write buffer? A: Yes! Either drain the buffer before the next read, or check the write buffer before the read and perform the read only when there is no conflict.
Cache Performance

- Average memory access time
  - Time_total mem access = N_hit × T_hit + N_miss × T_miss
                          = N_mem access × T_hit + N_miss × T_miss penalty
  - AMAT = T_hit + miss rate × T_miss penalty
- Miss penalty: time to replace a block from the lower level, including the time to deliver the block to the CPU
  - Access time: time to reach the lower level (latency)
  - Transfer time: time to transfer the block (bandwidth)
Performance Example

- Two data caches (assume one clock cycle per hit)
  - I: 8 KB, 44% miss rate, 1 ns hit time
  - II: 64 KB, 37% miss rate, 2 ns hit time
  - Miss penalty: 60 ns; 30% memory accesses
  - AMAT_I  = 1 ns + 44% × 60 ns = 27.4 ns
  - AMAT_II = 2 ns + 37% × 60 ns = 24.2 ns
  - Larger cache ⇒ smaller miss rate but longer T_hit ⇒ here, a reduced AMAT overall
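A minimal sketch that reproduces the two AMAT figures above (times in ns, rates as fractions):

#include <stdio.h>

/* AMAT = hit time + miss rate * miss penalty */
static double amat(double hit_ns, double miss_rate, double penalty_ns) {
    return hit_ns + miss_rate * penalty_ns;
}

int main(void) {
    printf("AMAT I  = %.1f ns\n", amat(1.0, 0.44, 60.0));  /* 27.4 ns */
    printf("AMAT II = %.1f ns\n", amat(2.0, 0.37, 60.0));  /* 24.2 ns */
    return 0;
}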
Miss Penalty in OOO Environment

- In processors with out-of-order execution
  - Memory accesses can overlap with other computation
  - The latency of memory accesses is not always fully exposed
  - E.g., 8 KB cache, 44% miss rate, 1 ns hit time, 60 ns miss penalty, only 70% of the penalty exposed on average
  - AMAT = 1 ns + 44% × (60 ns × 70%) = 19.5 ns
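Extending the same calculation with an exposure factor, as a sketch of the 70%-exposed case above:

#include <stdio.h>

/* Only the exposed fraction of the miss penalty adds to AMAT when misses
   overlap with out-of-order execution. */
static double amat_ooo(double hit_ns, double miss_rate,
                       double penalty_ns, double exposed) {
    return hit_ns + miss_rate * (penalty_ns * exposed);
}

int main(void) {
    printf("AMAT = %.1f ns\n", amat_ooo(1.0, 0.44, 60.0, 0.70));  /* ~19.5 ns */
    return 0;
}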
Cache Performance Optimizations

- Performance formula
  - AMAT = T_hit + miss rate × T_miss penalty
- Reducing miss rate
  - Change cache configuration, compiler optimizations
- Reducing hit time
  - Simple cache, fast access and address translation
- Reducing miss penalty
  - Multilevel caches, read and write policies
- Taking advantage of parallelism
  - Cache serving multiple requests simultaneously
  - Prefetching
Cache Miss Rate

- Three C's
  - Compulsory misses (cold misses)
    - The first access to a block misses regardless of cache size
  - Capacity misses
    - The cache is too small to hold all the data needed
  - Conflict misses
    - More blocks map to a set than its associativity can hold
- Reducing miss rate
  - Larger block size (compulsory)
  - Larger cache size (capacity, conflict)
  - Higher associativity (conflict)
  - Compiler optimizations (all three)
Miss Rate vs. Block Size

- Larger blocks reduce compulsory misses, but may increase conflict misses (or even capacity misses, if the cache is small); they may also increase the miss penalty
Reducing Cache Miss Rate

- Larger cache
  - Fewer capacity misses
  - Fewer conflict misses
    - Implies higher associativity: less competition for the same set
  - Must balance hit time, energy consumption, and cost
- Higher associativity
  - Fewer conflict misses
  - Miss rate (2-way, size X) ≈ miss rate (direct-mapped, size 2X)
  - Similarly, need to balance hit time and energy consumption: diminishing returns on reducing conflict misses
Reducing Cache Miss Penalty

- A difficult decision:
  - whether to make the cache hit time fast, to keep pace with the high clock rate of processors,
  - or to make the cache large, to reduce the gap between processor accesses and main memory accesses
- Solution: use a multi-level cache
  - The first-level cache can be small enough to match a fast clock cycle time
  - Higher-level caches can be large enough to capture many accesses that would otherwise go to main memory
  - Multilevel caches are more power-efficient than a single aggregated cache
Compiler Optimizations for Cache

- Increase the locality of programs
  - Temporal locality, spatial locality
- Rearrange code
  - Targets the instruction cache directly
  - Reorder instructions based on the set of data accessed
- Reorganize data
  - Padding to eliminate conflicts:
    - Change the addresses of two variables so that they do not map to the same cache location
    - Change the size of an array via padding
  - Group data that tend to be accessed together into one block
- Example optimizations
  - Merging arrays, loop interchange, loop fusion
Merging Arrays
/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];

- Improves spatial locality
  - If val[i] and key[i] tend to be accessed together
- Reduces conflicts between val and key
Loop Interchange

- Idea: switch the nesting order of two or more loops
  - Sequential accesses instead of striding through memory; improved spatial locality (see the sketch below)
- Safety of loop interchange
  - Must preserve true data dependences
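The slide gives no code, so below is a typical before/after sketch for a row-major C array; the array name and bounds are illustrative:

#define N 1024
double x[N][N];

/* Before: the inner loop strides through memory N elements at a time
   (a column-wise walk of a row-major array), so spatial locality is poor. */
void before(void) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            x[i][j] = 2 * x[i][j];
}

/* After interchange: the inner loop touches consecutive elements of a row,
   so each cache block is fully used before it is evicted. */
void after(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x[i][j] = 2 * x[i][j];
}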
Loop Fusion

- Takes multiple compatible loop nests and combines their bodies into one loop nest
  - Legal if no data dependences are reversed
- Improves locality directly by merging accesses to the same cache line into one loop iteration

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }
Seminar

- Pipelining Cache
- Prefetching Cache
Main Memory Background

- Main memory performance
  - Latency: determines the cache miss penalty
    - Access time: time between the request and the arrival of the word
    - Cycle time: minimum time between requests
  - Bandwidth: matters for multiprocessors, I/O, and the miss penalty of large blocks
- Main memory technology
  - Memory is DRAM: Dynamic Random Access Memory
    - Dynamic because it needs to be refreshed periodically
    - Requires data to be written back after being read
    - Concerned with cost per bit and capacity
  - Cache is SRAM: Static Random Access Memory
    - Concerned with speed and capacity
Memory vs. Virtual Memory

- Analogy to cache
  - Size: cache << memory << address space
  - Both provide big, fast memory by exploiting locality
  - Both need a policy: the 4 memory hierarchy questions
- Differences from cache
  - Cache primarily focuses on speed
  - VM facilitates transparent memory management
    - Provides a large address space
    - Sharing and protection in a multi-programming environment
Four Memory Hierarchy Questions

- Where can a block be placed in main memory?
  - The OS allows a block (page) to be placed anywhere: fully associative
    - No conflict misses
- Which block should be replaced?
  - An approximation of LRU: true LRU is too costly and adds little benefit
    - A reference bit is set when a page is accessed
    - The bit is shifted into a history register periodically
    - When replacing, choose the page with the smallest value in its history register
- What happens on a write?
  - Write back: write through is prohibitively expensive
Four Memory Hierarchy Questions

- How is a block found in main memory?
  - Use the page table to translate the virtual address into a physical address
  - Example: 32-bit virtual address, 4 KB pages, 4 bytes per page table entry; how big is the page table?
    - (2^32 / 2^12) × 2^2 = 2^22 bytes, or 4 MB
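A quick sketch of the arithmetic above:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint64_t page_size = 1ull << 12;                /* 4 KB pages        */
    uint64_t pte_size  = 4;                         /* 4 bytes per entry */
    uint64_t num_pages = (1ull << 32) / page_size;  /* 2^20 pages        */

    uint64_t table_bytes = num_pages * pte_size;    /* 2^22 bytes        */
    printf("page table size = %llu bytes = %llu MB\n",
           (unsigned long long)table_bytes,
           (unsigned long long)(table_bytes >> 20));
    return 0;
}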
Fast Address Translation

- Motivation
  - The page table is too large to be stored in the cache
    - It may even span multiple pages itself
  - Multiple page table levels
- Solution: exploit locality and cache recent translations

  (Figure: an example translation through four page table levels)
Fast Address Translation

- TLB: translation look-aside buffer
  - A special fully associative cache for recent translations
  - Tag: the virtual address (virtual page number)
  - Data: physical page frame number, protection field, valid bit, use bit, dirty bit
- Translation (a minimal sketch follows below)
  - Send the virtual address to all tags
  - Check for protection violations
  - The matching tag supplies the physical page frame number
  - Combine it with the page offset to get the full physical address
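A minimal sketch of a fully associative TLB lookup; the entry layout and sizes are assumptions for illustration:

#include <stdint.h>

#define TLB_ENTRIES 64
#define PAGE_SHIFT  12   /* assumed 4 KB pages */

struct tlb_entry {
    int      valid;
    uint64_t vpn;  /* virtual page number: the tag */
    uint64_t pfn;  /* physical page frame number   */
                   /* real entries also hold protection, use, and dirty bits */
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Compare the virtual page number against every entry (fully associative);
   on a hit, combine the frame number with the page offset.  Returns -1 on
   a miss, which would trigger a page table walk and a TLB refill. */
int64_t translate(uint64_t vaddr) {
    uint64_t vpn    = vaddr >> PAGE_SHIFT;
    uint64_t offset = vaddr & ((1ull << PAGE_SHIFT) - 1);

    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return (int64_t)((tlb[i].pfn << PAGE_SHIFT) | offset);

    return -1;
}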
Virtual Memory and Cache

- Physical cache: index the cache using the physical address
  - Address translation always precedes the cache access
  - Simple implementation, but a performance issue
- Virtual cache: index the cache using the virtual address to avoid translation
  - Address translation only on cache misses
  - Issues
    - Protection: copy protection info to each block
    - Context switch: add a PID to the address tag
    - Synonyms/aliases: different virtual addresses map to the same physical address
      - Check multiple places, or enforce that aliases are identical in a fixed number of bits (page coloring)
Virtual Memory and Cache

- Physical cache (PIPT)
  - (Figure: Processor Core --VA--> TLB --PA--> Physical Cache; a hit returns the cache line, a miss goes to Main Memory)
  - Slow: translation happens before every cache access
- Virtual cache (VIVT)
  - (Figure: Processor Core --VA--> Virtual Cache; a hit returns the cache line, a miss goes through the TLB to Main Memory)
  - Protection bits are missing from the cache
  - A context switch forces a cache flush
  - Aliasing issue
Virtually-Indexed Physically-Tagged

- Virtually-indexed, physically-tagged (VIPT) cache
  - Use the page offset (identical in the virtual and physical addresses) to index the cache
  - Keep the physical address of the block as the verification tag
  - The cache read proceeds in parallel with address translation; tag matching then uses the physical address
  - Issue: the cache size is limited by the page size (the number of offset bits)
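A quick sketch of that constraint: the index and offset bits must fit within the page offset, so each way can be at most one page and the total size is bounded by page size × associativity (the numbers below are assumptions, not from the slide):

#include <stdio.h>

int main(void) {
    int page_size     = 4096;  /* 4 KB pages: 12 untranslated offset bits */
    int associativity = 8;     /* assumed 8-way set-associative cache     */

    int max_way_size   = page_size;  /* index + offset must fit in 12 bits */
    int max_cache_size = max_way_size * associativity;

    printf("max VIPT cache size = %d KB\n", max_cache_size / 1024);
    return 0;
}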
Advantages of Virtual Memory

- Translation
  - A program can be given a consistent view of memory, even though physical memory is scrambled
  - Only the most important part of the program (the "working set") must be in physical memory
- Protection
  - Different threads/processes are protected from each other
  - Different pages can be given special behavior
    - Read only, invisible to user programs, etc.
  - Kernel data is protected from user programs
  - Very important for protection against malicious programs
- Sharing
  - The same physical page can be mapped for multiple users
