Cache & Virtual Memory
Reference:
http://arstechnica.com/articles/paedia/cpu/caching.ars/2
The memory hierarchy
Level     | Access Time | Typical Size | Technology  | Managed By
Registers | 1-3 ns      | ~1 KB        | Custom CMOS | Compiler
Locality principle
• Programs access a relatively small portion of the address space at any given time
• 90/10 rule: 90% of accesses go to 10% of memory locations
Types of locality
- Temporal: if a location has been accessed recently, it tends to be accessed again soon
- Spatial: if a location has been accessed recently, neighboring locations tend to be accessed soon (see the sketch below)
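As a concrete illustration, here is a minimal C sketch (the array size and names are illustrative): the row-major loop touches consecutive addresses (good spatial locality), the column-major loop strides N ints between accesses (poor spatial locality), and in both cases the tiny loop body itself is reused constantly (temporal locality for code).

#include <stddef.h>

#define N 1024

/* Good spatial locality: consecutive iterations touch
 * neighboring addresses, so each cache line is fully used. */
void row_major(int a[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            a[i][j] = 0;
}

/* Poor spatial locality: consecutive iterations jump
 * N * sizeof(int) bytes, touching a new cache line each time. */
void column_major(int a[N][N])
{
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            a[i][j] = 0;
}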
Temporal locality
• Consider a simple Photoshop filter that inverts an image to produce a
negative one; there's a small piece of code that performs the same
inversion on each pixel starting at one corner and going in sequence
all the way across and down to the opposite corner.
• This code is just a small loop that gets executed repeatedly on each
pixel, so it's an example of code that is reused again and again. Media
apps, games, and simulations, since they use lots of small loops that
iterate through very large datasets, have excellent temporal locality
for code.
• However, it's important to note that these kinds of apps have
extremely poor temporal locality for data.
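A sketch of the inversion loop described above, assuming an 8-bit grayscale image (the names are illustrative): the few instructions in the loop body are reused once per pixel (excellent temporal locality for code), while each pixel byte is read and written exactly once (almost no temporal locality for data).

#include <stddef.h>
#include <stdint.h>

/* Invert an 8-bit image in place: the loop code is reused for
 * every pixel, but each data byte is touched only once. */
void invert(uint8_t *pixels, size_t n)
{
    for (size_t i = 0; i < n; i++)
        pixels[i] = 255 - pixels[i];
}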
Remarks.
Returning to our MP3 example, a music file is usually played through
once in sequence and none of its parts are repeated. This being the
case, it's actually kind of a waste to store any of that file in the cache,
since it's only going to stop off there temporarily before passing through
to the CPU. When an app fills up the cache with data that doesn't really
need to be cached because it won't be used again and as a result winds
up bumping out of the cache data that will be reused, that app is said to
"pollute the cache."
Media apps, games, and the like are big cache polluters, which is why
they weren't too affected by the original Celeron's lack of cache.
Because they were streaming data through the CPU at a very fast rate,
they didn't actually even care that their data wasn't being cached. Since
this data wasn't going to be needed again anytime soon, the fact that it
wasn't in a readily accessible cache didn't really matter.
• How efficient is this mechanism?
- On a Pentium 100 (P100), a 16 KB cache memory already holds ~90% of the required addresses,
so ~90% of accesses are fast SRAM accesses
- the price: extra cost and power
[Graph: relative system performance as a function of cache size (www.intel.com)]
2. Cache memory components, related terms
SRAM is the static RAM block that holds the cached data/code
• Snooping - the cache controller's monitoring of the address lines during a transfer;
• Snarf - the update operation, in which the cache takes over the information from the data lines;
• The snoop/snarf processes allow the cache to keep its contents consistent
3. The CACHE Architecture
The "look aside" architecture
[Diagram: CPU, SRAM (cache), cache controller and tag RAM all sit on the common system interface bus]
The "look aside" cache is:
- simple
- cheap
- provides a good response time in case of a "cache miss"
The "look through" architecture
[Diagram: the CPU talks to the tag RAM, SRAM cache and cache controller, which in turn connect to the system interface]
- more complex
- access to main memory is slower, because main memory is accessed only after the cache access fails
- more expensive
WRITE techniques
• "write back" - the cache memory works like a buffer;
- when the processor initiates a write cycle, the cache receives the data and finalizes the cycle; then, when the system bus is available, the cache writes the data into the main memory;
- provides maximum performance, allowing the processor to continue working while the main memory is updated later;
- complex, expensive
• "write through" - the processor writes into the main memory through the cache memory. The cache updates its content, but the write cycle continues until the data is stored into the main memory;
- less complex and cheaper;
- its performance is poorer, because the processor must wait until the main memory stores the new data.
A toy sketch of both policies follows below.
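The sketch models the two policies for a single cache line; the structure and names are illustrative, not any real controller.

#include <stdbool.h>
#include <stdint.h>

extern uint32_t main_memory[];   /* stands in for slow DRAM */

typedef struct {
    uint32_t addr;
    uint32_t data;
    bool     dirty;
} cache_line_t;

/* Write-through: cache and main memory are updated together;
 * the write cycle ends only when main memory has the data. */
void write_through(cache_line_t *line, uint32_t addr, uint32_t data)
{
    line->addr = addr;
    line->data = data;
    main_memory[addr] = data;            /* processor waits for this */
}

/* Write-back: only the cache is updated and the line is marked
 * dirty; main memory is written later (on eviction, or when the
 * system bus is free), so the processor continues immediately. */
void write_back(cache_line_t *line, uint32_t addr, uint32_t data)
{
    if (line->dirty && line->addr != addr)
        main_memory[line->addr] = line->data;   /* flush old data */
    line->addr  = addr;
    line->data  = data;
    line->dirty = true;
}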
4. Cache memory organization
[Diagram: DRAM is divided into cache pages; each page consists of lines 0..m, and the cache memory holds one page's worth of lines (line 0 .. line m)]
Fully-associative cache
[Diagram: any line 0..m of DRAM may be placed in any line 0..k of the cache memory]
!!!: - best performance
- high complexity
- used only for small caches (< 4 KB)
Direct-mapped cache
!!!: simple, cheap, poorer performance
1 KB Direct Mapped Cache, 32B blocks
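For this configuration the cache has 1024 / 32 = 32 lines, so a 32-bit address splits into a 5-bit byte offset, a 5-bit line index and a 22-bit tag. A minimal sketch of that breakdown (the address value is arbitrary):

#include <stdint.h>
#include <stdio.h>

#define BLOCK_BITS 5u   /* 32-byte blocks -> 5-bit byte offset   */
#define INDEX_BITS 5u   /* 1 KB / 32 B = 32 lines -> 5-bit index */

int main(void)
{
    uint32_t addr   = 0x12345678u;                       /* arbitrary */
    uint32_t offset = addr & ((1u << BLOCK_BITS) - 1);
    uint32_t index  = (addr >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (BLOCK_BITS + INDEX_BITS);
    /* The index selects the only line the block can occupy;
     * a hit requires the stored tag to match. */
    printf("tag=0x%06x index=%u offset=%u\n", tag, index, offset);
    return 0;
}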
Two-way set-associative cache
[Diagram: DRAM pages 0..m, each with lines 0..n; a given line may be placed in the corresponding line of either Way 0 or Way 1 of the cache memory]
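A minimal sketch of the lookup in such a cache; the offset and index widths here are assumptions for illustration, not taken from the slide.

#include <stdbool.h>
#include <stdint.h>

#define OFFSET_BITS 5u   /* 32-byte blocks (assumed) */
#define INDEX_BITS  6u   /* 64 sets (assumed)        */
#define WAYS        2u

typedef struct { bool valid; uint32_t tag; } way_t;

static way_t cache[1u << INDEX_BITS][WAYS];

/* A block maps to exactly one set, but may sit in either way
 * of it, so both tags in the set are compared. */
bool lookup(uint32_t addr)
{
    uint32_t set = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag = addr >> (OFFSET_BITS + INDEX_BITS);
    for (uint32_t w = 0; w < WAYS; w++)
        if (cache[set][w].valid && cache[set][w].tag == tag)
            return true;                 /* hit */
    return false;                        /* miss: fill one of the ways */
}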
80486DX: 8 KB L1
Pentium: 16 KB L1
Is the external cache still desirable?
• In short, the answer is yes; this organization is called a two-level cache.
• The internal cache is L1 and the external cache is L2.
• With no L2 cache (SRAM), in case of a MISS the CPU has to access RAM or ROM directly through the system bus (slow, decreasing performance).
• Many systems use a separate bus between L2 and the processor to reduce the burden on the system bus.
• With the continuous shrinking of processor components, many systems place L2 on the processor chip (improving performance).
Unified versus split cache
• Initially, the L1 cache was used for both data and instructions.
• Recently it has become common to split the cache into one for data and one for instructions.
—Reasons for a unified cache:
—For a given size it will have a higher hit rate (as a result of balancing data and instructions automatically).
—Only one cache has to be designed and implemented.
Unified versus split cache
• Split caches eliminate contention, particularly for superscalar processors (PowerPC and Pentium), which emphasize parallel instruction execution and prefetching of predicted future instructions. This is very important for any design that depends on pipelining.
Intel Cache Evolution
• Problem: external memory slower than the system bus.
Solution: add external cache using faster memory technology. (First appears: 386)
• Problem: increased processor speed results in the external bus becoming a bottleneck for cache access.
Solution: move the external cache on-chip, operating at the same speed as the processor. (First appears: 486)
• Problem: the internal cache is rather small, due to limited space on the chip.
Solution: add an external L2 cache using faster technology than main memory. (First appears: 486)
• Problem: contention occurs when both the Instruction Prefetcher and the Execution Unit simultaneously require access to the cache; the Prefetcher is then stalled while the Execution Unit's data access takes place.
Solution: create separate data and instruction caches. (First appears: Pentium)
Pentium 4 Core Processor
• Fetch/Decode Unit
— Fetches instructions from L2 cache
— Decode into micro-ops
— Store micro-ops in L1 cache
• Out of order execution logic
— Schedules micro-ops
— Based on data dependence and resources
— May speculatively execute
• Execution units
— Execute micro-ops
— Data from L1 cache
— Results in registers
• Memory subsystem
— L2 cache and system bus
Pentium 4 Design Reasoning
• Decodes instructions into RISC like micro-ops before L1 cache
• Micro-ops fixed length
— Superscalar pipelining and scheduling
• Pentium instructions long & complex
• Performance improved by separating decoding from scheduling &
pipelining
— (More later – ch14)
• Data cache is write back
— Can be configured to write through
• L1 cache controlled by 2 bits in register
— CD = cache disable
— NW = not write through
— 2 instructions to invalidate (flush) cache and write back then invalidate
• L2 and L3 8-way set-associative
— Line size 128 bytes
6. Cache memory features identification
• The CPUID instruction returns data about the internal caches: with EAX = 2, CPUID loads the EAX, EBX, ECX and EDX registers with descriptors that describe the cache and TLB features.
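A minimal sketch of that query using the <cpuid.h> helper available in GCC and Clang on x86; decoding the descriptor table itself is omitted.

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(2, &eax, &ebx, &ecx, &edx))
        return 1;                        /* leaf 2 not supported */
    unsigned int regs[4] = { eax, ebx, ecx, edx };
    /* Each register carries up to four one-byte cache/TLB
     * descriptors; a set bit 31 marks the register as invalid,
     * and the low byte of EAX is a call count, not a descriptor. */
    for (int r = 0; r < 4; r++) {
        if (regs[r] & 0x80000000u)
            continue;
        for (int b = (r == 0 ? 1 : 0); b < 4; b++)
            printf("descriptor 0x%02x\n", (regs[r] >> (8 * b)) & 0xff);
    }
    return 0;
}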
Motivation for Virtual Memory
1. Use DRAM as a cache for the hard disk
– Address space of a process can exceed DRAM physical size
– Sum of address spaces of processes can exceed DRAM size
2. Simplify memory management
– Each process gets the same uniform linear address space
3. Provide protection
– One process can't interfere with another, because they operate in different address spaces
– User process can't access privileged information
• Different sections of address spaces have different permissions.
A System with Physical Memory Only
Examples: early PCs, nearly all embedded systems, etc.
Page Faults (similar to “Cache Misses”)
• What if an object is on disk rather than in memory?
–Page table entry indicates virtual address not in memory
–OS exception handler invoked to move data from disk into memory
Virtual Address Translation
Virtual-to-physical address translation performed by MMU
● Virtual address is broken into a virtual page number and an offset
● Mapping from virtual page to physical frame provided by a page table
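A minimal sketch of that split, assuming 4 KB pages (so a 12-bit offset); the address value is arbitrary.

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12u                        /* log2(4096) */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1u)  /* 0xFFF */

int main(void)
{
    uint32_t vaddr  = 0x12345ABCu;            /* arbitrary */
    uint32_t vpn    = vaddr >> PAGE_SHIFT;    /* indexes the page table   */
    uint32_t offset = vaddr & PAGE_MASK;      /* unchanged by translation */
    /* The page table maps vpn -> physical frame number (pfn); the
     * physical address is then (pfn << PAGE_SHIFT) | offset. */
    printf("VPN=0x%05x offset=0x%03x\n", vpn, offset);
    return 0;
}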
Page Table Entries (PTEs)
-Typical PTE format (depends on CPU architecture!)
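Since the concrete layout depends on the CPU, the C bitfield below is only illustrative, with the fields most architectures provide in some form; the positions and widths are assumptions.

#include <stdint.h>

/* Illustrative 32-bit PTE, assuming 4 KB pages; real encodings differ. */
typedef struct {
    uint32_t valid      : 1;   /* page is resident in physical memory   */
    uint32_t dirty      : 1;   /* page was written since it was loaded  */
    uint32_t referenced : 1;   /* page was accessed (used for eviction) */
    uint32_t user       : 1;   /* accessible from user mode             */
    uint32_t writable   : 1;   /* write permission                      */
    uint32_t reserved   : 7;
    uint32_t frame      : 20;  /* physical frame number                 */
} pte_t;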
Page tables store the virtual-to-physical address mappings.
They are located in memory!
The MMU has a special register, the page table base pointer, which points to the physical memory address of the top of the page table for the currently running process.
● On every memory access, a separate access is needed to consult the page tables!
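A minimal sketch of the lookup implied above, with a reduced version of the illustrative pte_t and 4 KB pages; page_table stands in for the table the MMU reaches through its base-pointer register.

#include <stdint.h>

typedef struct { uint32_t valid : 1, frame : 20; } pte_t;  /* illustrative */

uint32_t translate(const pte_t *page_table, uint32_t vaddr)
{
    uint32_t vpn    = vaddr >> 12;     /* virtual page number */
    uint32_t offset = vaddr & 0xFFFu;
    pte_t pte = page_table[vpn];       /* the extra memory access */
    if (!pte.valid)
        return 0;                      /* simplification: would page-fault */
    return ((uint32_t)pte.frame << 12) | offset;
}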
TLB
• Very fast (but small) cache directly on the CPU
• P6-family systems have separate data and instruction TLBs, 64 entries each
• TLB caches most recent virtual to physical address translations
• Implemented as fully associative cache
• Any address can be stored in any entry in the cache
• All entries searched “in parallel” on every address translation
• A TLB miss requires that the MMU actually try to do the address translation
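A software model of such a fully associative TLB (the structure is illustrative): the hardware compares all entries in parallel, and a loop stands in for that here.

#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 64

typedef struct {
    bool     valid;
    uint32_t vpn;    /* virtual page number: the tag being matched */
    uint32_t pfn;    /* cached translation: physical frame number  */
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];

/* Returns true on a hit; on a miss the MMU must walk the page
 * table and then install the new translation in some entry. */
bool tlb_lookup(uint32_t vpn, uint32_t *pfn_out)
{
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *pfn_out = tlb[i].pfn;
            return true;
        }
    }
    return false;
}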
• Memory Management Unit (MMU)
- Hardware that translates a virtual address to a physical address
- Each memory reference is passed through the MMU, which translates the virtual address to a physical address
• The main advantage of virtual memory systems is the ability to load and execute a process that requires more memory than is physically available, by loading the process in parts and executing them as needed.
• Another advantage is the system's ability to eliminate external fragmentation.