This chapter discusses two approaches to the problem called either the “von Neumann
bottleneck” or just the “memory bottleneck”. Recall that the standard modern computer is
based on a design called “stored program”. Because the stored–program design originated with
the EDVAC, co–designed by John von Neumann in 1945, the design is also called a “von Neumann
machine”. The bottleneck refers to the fact that main memory is simply not fast enough to keep up
with the modern CPU. Consider a modern CPU operating with a 2.5 GHz clock; the corresponding
clock period is 0.4 nanoseconds. If we assume that the memory can be referenced every other clock
pulse, that is one reference every 0.8 nanoseconds. The access time for modern memory is
more like 80 nanoseconds; memory is 100 times too slow.
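The arithmetic behind this claim is easy to check. The short C fragment below is only a sketch of
the calculation; the 2.5 GHz clock, the one–reference–every–two–clocks assumption, and the
80 nanosecond access time are the figures quoted above, not measurements.

    #include <stdio.h>

    int main(void) {
        double clock_hz  = 2.5e9;            /* 2.5 GHz CPU clock               */
        double period_ns = 1e9 / clock_hz;   /* 0.4 ns clock period             */
        double demand_ns = 2.0 * period_ns;  /* one reference every two clocks  */
        double dram_ns   = 80.0;             /* typical DRAM access time        */

        printf("clock period = %.1f ns\n", period_ns);
        printf("reference demanded every %.1f ns\n", demand_ns);
        printf("memory is %.0f times too slow\n", dram_ns / demand_ns);
        return 0;
    }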
This chapter discusses a few design tricks that reduce the effect of the bottleneck by allowing
memory to deliver data to the CPU at a rate closer to that dictated by the CPU clock.
We begin with another discussion of SDRAM (Synchronous DRAM, first discussed in Chapter 6
of this book) as an improvement on the main memory system of the computer. We then place
the main memory within a memory hierarchy, the goal of which is to mimic a very large memory
(say several terabytes) with the access time more typical of a semiconductor register.
We preview the memory hierarchy by reviewing the technologies available for data storage.
The general property of the technologies now in use is that the faster memories cost more per byte.
Older technologies, such as mercury delay lines, which cost more and are slower, are no longer in
use. If a device is slower, it must be cheaper or it will not be used.
The faster memory devices will be found on the CPU chip. Originally, the only on–chip memory
was the set of general–purpose registers. As manufacturing techniques improved, cache memory
was added to the chip, first an L1 cache and then both L1 and L2 caches. We shall justify the
multi–level cache scheme a bit later, but for now we note that anything on the CPU chip will
have a smaller access time than anything on a different chip (such as main memory).
The typical memory hierarchy in a modern computer has the following components.
Component Typical Size Access Time
CPU registers 16 – 256 bytes 0.5 nanoseconds
L1 Cache 32 kilobytes 2 nanoseconds
L2 Cache 1 – 4 MB 7 nanoseconds
Main memory 1 – 4 GB 55 nanoseconds
Disk Drives 100 GB and up 10 – 20 milliseconds
The numbers in this table make the problem clear: main memory cannot supply the CPU at a
sufficient data rate. Fortunately, there have been a number of developments that have alleviated
this problem. We will soon discuss the idea of cache memory, in which a large memory with a
50 to 100 nanosecond access time can be coupled with a small memory with a 10 nanosecond
access time. While cache memory does help, the main problem remains: main memory is too slow.
In his 2010 book [R033], William Stallings introduced his section on advanced DRAM
organization (Section 5.3, pages 173 to 179) with the following analysis of standard memory
technology, which I quote verbatim.
“As discussed in Chapter 2 [of the reference], one of the most critical system
bottlenecks when using high–performance processors is the interface to main
internal memory. This interface is the most important pathway in the entire
computer system. The basic building block of memory remains the DRAM
chip, as it has for decades; until recently, there had been no significant changes
in DRAM architecture since the early 1970s. The traditional DRAM chip is
constrained both by its internal architecture and by its interface to the
processor’s memory bus.”
Modern computer designs, in an effort to avoid the von Neumann bottleneck, use several tricks,
including multi–level caches and DDR SDRAM main memory. We continue to postpone the
discussion of cache memory, and focus on methods to speed up the primary memory in order to
make it more compatible with the faster, and more expensive, cache.
Suppose that the byte with address 0x124A is requested and found not to be in the L2 cache. A
cache line in the L2 cache would be filled with the 16 bytes with addresses ranging from 0x1240
through 0x124F. This might be done in two transfers of 8 bytes each.
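The address arithmetic in this example is simply the truncation of the low four bits. Here is a
minimal C sketch; the address 0x124A and the 16–byte line size come from the example above, and
the two 8–byte transfers are shown only as address ranges.

    #include <stdio.h>

    int main(void) {
        unsigned addr      = 0x124A;
        unsigned line_size = 16;
        unsigned base      = addr & ~(line_size - 1);   /* 0x1240 */

        printf("cache line fill: 0x%04X through 0x%04X\n", base, base + line_size - 1);
        printf("first transfer:  0x%04X - 0x%04X\n", base, base + 7);       /* 8 bytes */
        printf("second transfer: 0x%04X - 0x%04X\n", base + 8, base + 15);  /* 8 bytes */
        return 0;
    }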
We close this part of the discussion by examining some specifications of a memory chip that, as
of July 2011, seemed to be state-of-the-art. This is the Micron DDR2 SDRAM, offered in three models:
MT46H512M4 64 MEG x 4 x 8 banks
MT47H256M8 32 MEG x 8 x 8 banks
MT47H128M16 16 MEG x 16 x 8 banks
Collectively, the memories are described by Micron [R89] as “high–speed dynamic random–
access memory that uses a 4n–prefetch architecture with an interface designed to transfer two
data words per clock cycle at the I/O bond pads.” But what is a “prefetch architecture”?
According to Wikipedia [R034]
“The prefetch buffer takes advantage of the specific characteristics of memory
accesses to a DRAM. Typical DRAM memory operations involve three phases
(line precharge, row access, column access). Row access is … the long and slow
phase of memory operation. However once a row is read, subsequent column
accesses to that same row can be very quick, as the sense amplifiers also act as
latches. For reference, a row of a 1Gb DDR3 device is 2,048 bits wide, so that
internally 2,048 bits are read into 2,048 separate sense amplifiers during the row
access phase. Row accesses might take 50 ns depending on the speed of the
DRAM, whereas column accesses off an open row are less than 10 ns.”
“In a prefetch buffer architecture, when a memory access occurs to a row the buffer
grabs a set of adjacent datawords on the row and reads them out ("bursts" them) in
rapid-fire sequence on the IO pins, without the need for individual column address
requests. This assumes the CPU wants adjacent datawords in memory which in
practice is very often the case. For instance when a 64 bit CPU accesses a 16 bit
wide DRAM chip, it will need 4 adjacent 16 bit datawords to make up the full 64
bits. A 4n prefetch buffer would accomplish this exactly ("n" refers to the IO width
of the memory chip; it is multiplied by the burst depth "4" to give the size in bits of
the full burst sequence).”
“The prefetch buffer depth can also be thought of as the ratio between the core
memory frequency and the IO frequency. In an 8n prefetch architecture (such as
DDR3), the IOs will operate 8 times faster than the memory core (each memory
access results in a burst of 8 datawords on the IOs). Thus a 200 MHz memory core
is combined with IOs that each operate eight times faster (1600 megabits/second).
If the memory has 16 IOs, the total read bandwidth would be 200 MHz x 8
datawords/access x 16 IOs = 25.6 gigabits/second (Gbps), or 3.2 gigabytes/second
(GBps). Modules with multiple DRAM chips can provide correspondingly higher
bandwidth.”
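The bandwidth figure at the end of that quotation can be reproduced directly. The following C
sketch only restates the arithmetic of the quotation (200 MHz core, 8n prefetch, 16 I/O pins); it
is not taken from any data sheet.

    #include <stdio.h>

    int main(void) {
        double core_mhz = 200.0;   /* memory core frequency             */
        double prefetch = 8.0;     /* 8n prefetch: 8 datawords / access */
        double io_pins  = 16.0;    /* width of the data interface       */

        double gbps = core_mhz * 1e6 * prefetch * io_pins / 1e9;
        printf("read bandwidth = %.1f Gbps = %.1f GBps\n", gbps, gbps / 8.0);
        return 0;
    }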
Each is compatible with 1066 MT/s (million transfers per second) operation at double data rate.
For the MT47H128M16 (16 Meg x 16 x 8 banks, or 128 Meg x 16), the 4n–prefetch architecture
apparently allows the data bus to transfer at four times the rate of the internal memory core;
hence the 1066 MT/s figure.
Here is a functional block diagram of the 128 Meg x 16 configuration, taken from the Micron
reference [R91]. Note that there is a lot going on inside that chip.
Here are the important data and address lines to the memory chip.
A[13:0] The address inputs; either row address or column address.
DQ[15:0] Bidirectional data input/output lines for the memory chip.
A few of these control signals are worth mention. Note that most of the control signals are
active–low; in the modern notation this is denoted by the “#” appended to the signal name.
CS# Chip Select. This is active low, hence the “#” at the end of the signal name.
When low, it enables the memory chip command decoder.
When high, it disables the command decoder, and the chip is idle.
RAS# Row Address Strobe. When enabled, the address refers to the row number.
CAS# Column Address Strobe. When enabled, the address refers to the column number.
WE# Write Enable. When enabled, the CPU is writing to the memory.
The following truth table explains the operation of the chip; a small decoding sketch in C follows the table.
CS# RAS# CAS# WE# Command / Action
1 d d d Deselect / Continue previous operation
0 1 1 1 NOP / Continue previous operation
0 0 1 1 Select and activate row
0 1 0 1 Select column and start READ burst
0 1 0 0 Select column and start WRITE burst
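The truth table can be read as a small decoder. The C function below is purely illustrative (its
name and interface are inventions of this text, not part of any Micron specification); each
argument is the electrical level of the corresponding active–low signal, with 1 for high and 0 for low.

    #include <stdio.h>

    const char *ddr_command(int cs_n, int ras_n, int cas_n, int we_n) {
        if (cs_n == 1)                             return "Deselect / continue previous operation";
        if (ras_n == 1 && cas_n == 1 && we_n == 1) return "NOP / continue previous operation";
        if (ras_n == 0 && cas_n == 1 && we_n == 1) return "Select and activate row";
        if (ras_n == 1 && cas_n == 0 && we_n == 1) return "Select column and start READ burst";
        if (ras_n == 1 && cas_n == 0 && we_n == 0) return "Select column and start WRITE burst";
        return "other command";
    }

    int main(void) {
        printf("%s\n", ddr_command(0, 1, 0, 1));   /* prints the READ burst case */
        return 0;
    }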
Cache memory works because of the principle of locality: the memory references made by an
executing program tend to cluster together within a small range that could easily fit into a small
cache memory. There are generally considered to be two types of locality. Spatial locality refers
to the tendency of program execution to reference memory locations that are clustered; if an
address is accessed, then one very near it will be accessed soon. Temporal locality refers to the
tendency of a processor to access memory locations that have been accessed recently. In the less common case
that a memory reference is to a “distant address”, the cache memory must be loaded from
another level of memory. This event, called a “memory miss”, is rare enough that most memory
references will be to addresses represented in the cache. References to addresses in the cache are
called “memory hits”; the percentage of memory references found in the cache is called the
“hit ratio”.
It is possible, though artificial, to write programs that will not display locality and thus defeat the
cache design. Most modern compilers will arrange data structures to take advantage of locality,
thus putting the cache system to best use.
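A short example may make the two forms of locality concrete. In the C fragment below, the
running variable sum is touched on every iteration (temporal locality) and the array is traversed
in the order in which it is laid out in memory (spatial locality). The array size is arbitrary,
chosen only for illustration.

    #include <stdio.h>

    #define ROWS 1024
    #define COLS 1024

    static int a[ROWS][COLS];          /* C stores this array row by row */

    int main(void) {
        long sum = 0;                  /* reused every iteration: temporal locality */
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                sum += a[i][j];        /* consecutive addresses: spatial locality */
        printf("%ld\n", sum);
        return 0;
    }

Swapping the two loops so that the array is traversed column by column strides through memory
and largely defeats spatial locality; this is the kind of artificial access pattern mentioned above.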
Effective Access Time for Multilevel Memory
We have stated that the success of a multilevel memory system is due to the principle of locality.
The measure of the effectiveness of this system is the hit ratio, reflected in the effective access
time of the memory system.
We shall consider a multilevel memory system with primary and secondary memory. What we
derive now is true for both cache memory and virtual memory systems. In this course, we shall
use cache memory as an example. This could easily be applied to virtual memory.
In a standard memory system, an addressable item is referenced by its address. In a two level
memory system, the primary memory is first checked for the address. If the addressed item is
present in the primary memory, we have a hit, otherwise we have a miss. The hit ratio is defined
as the number of hits divided by the total number of memory accesses, so that 0.0 ≤ h ≤ 1.0. Given a
faster primary memory with an access time TP and a slower secondary memory with access time
TS, we compute the effective access time as a function of the hit ratio. The applicable formula is
TE = h·TP + (1.0 – h)·TS.
RULE: In this formula we must have TP < TS. This inequality defines the terms “primary”
and “secondary”. In this course TP always refers to the cache memory.
For our first example, we consider cache memory, with a fast cache acting as a front-end for
primary memory. In this scenario, we speak of cache hits and cache misses. The hit ratio is
also called the cache hit ratio in these circumstances. For example, consider TP = 10
nanoseconds and TS = 80 nanoseconds. The formula for effective access time becomes
TE = h·10 + (1.0 – h)·80. For sample values of the hit ratio:
Hit Ratio    Effective Access Time (ns)
0.5 45.0
0.9 17.0
0.99 10.7
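The table above can be reproduced directly from the formula TE = h·TP + (1.0 – h)·TS. The C
sketch below uses the 10 ns and 80 ns figures of this example.

    #include <stdio.h>

    /* Effective access time for a two-level memory system. */
    double effective_time(double h, double t_primary, double t_secondary) {
        return h * t_primary + (1.0 - h) * t_secondary;
    }

    int main(void) {
        double ratios[] = { 0.5, 0.9, 0.99 };
        for (int i = 0; i < 3; i++)
            printf("h = %.2f   TE = %.1f ns\n",
                   ratios[i], effective_time(ratios[i], 10.0, 80.0));
        return 0;
    }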
The reason that cache memory works is that the principle of locality enables high values of the
hit ratio; in fact h ≥ 0.90 is a reasonable value. For this reason, a multi-level memory structure
behaves almost as if it were a very large memory with the access time of the smaller and faster
memory. Having come up with a technique for speeding up our large monolithic memory, we
now investigate techniques that allow us to fabricate such a large main memory.
As an example, consider the 24–bit address 0xAB7129. The block containing that address contains every item with address beginning with
0xAB712: 0xAB7120, 0xAB7121, … , 0xAB7129, 0xAB712A, … 0xAB712F.
We should point out immediately that the secondary memory will be divided into blocks of size
identical to the cache line. If the secondary memory has 16–byte blocks, this is due to the
organization of the cache as having cache lines holding 16 bytes of data.
This block, as stored in a cache line, would have 16 entries, indexed 0 through F. It would have
the 20–bit tag 0xAB712 associated with it, either explicitly or implicitly.
At system start–up, the faster cache contains no valid data, which are copied as needed from
the slower secondary memory. Each block would have three fields associated with it
The tag field identifying the memory addresses contained
Valid bit set to 0 at system start–up.
set to 1 when valid data have been copied into the block
Dirty bit set to 0 at system start–up.
set to 1 whenever the CPU writes to the faster memory
set to 0 whenever the contents are copied to the slower memory.
The basic unit of a cache is called a “cache line”, which comprises the data copied from the
slower secondary memory and the required ID fields. A 16–KB cache might contain 1,024
cache lines with the following structure.
D bit V Bit Tag 16 indexed entries (16 bytes total)
0 1 0xAB712 M[0xAB7120] … M[0xAB712F]
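In C, the bookkeeping for one such cache line might be declared as follows. This is only a sketch
for illustration; a real cache is implemented in hardware, not as a structure in program memory.
The field widths match the running example: a 20–bit tag and 16 data bytes, with 1,024 lines
giving a 16–KB data store.

    #include <stdint.h>

    #define LINE_SIZE 16                /* bytes of data per cache line        */
    #define NUM_LINES 1024              /* 1,024 lines: a 16-KB data store     */

    struct cache_line {
        unsigned dirty : 1;             /* D bit: written since it was loaded  */
        unsigned valid : 1;             /* V bit: line holds real data         */
        unsigned tag   : 20;            /* e.g. 0xAB712                        */
        uint8_t  data[LINE_SIZE];       /* M[0xAB7120] ... M[0xAB712F]         */
    };

    static struct cache_line cache[NUM_LINES];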
We now face a problem that is unique to cache memories. How do we find an addressed item?
In the primary memory, the answer is simple; just go to the address and access the item. The
cache has far fewer addressable entities than the secondary memory. For example, this cache
has 16 kilobytes set aside to store a selection of data from a 16 MB memory. It is not possible to
assign a unique cache location to every block of the larger memory, so the cache must somehow
be searched for the requested address. One option is associative memory, also called
content–addressable memory, in which an entry is located by its contents rather than by an address.
Associative memory would find the item in one search. Think of the control circuitry as
“broadcasting” the data value (here 0xAB712) to all memory cells at the same time. If one of
the memory cells has the value, it raises a Boolean flag and the item is found.
We do not consider duplicate entries in the associative memory. This can be handled by some
rather straightforward circuitry, but is not done in associative caches. We now focus on the use
of associative memory in a cache design, called an “associative cache”.
Assume a two–way set–associative cache with 256 cache lines, each holding 16 bytes, and a
24–bit address. Recall that 256 = 2^8, so that we
need eight bits to select the cache line. Consider addresses 0xCD4128 and 0xAB7129. Each
would be stored in cache line 0x12. Set 0 of this cache line would have one block, and set 1
would have the other.
          Entry 0                                Entry 1
D  V  Tag    Contents                      D  V  Tag    Contents
1  1  0xCD4  M[0xCD4120] to M[0xCD412F]    0  1  0xAB7  M[0xAB7120] to M[0xAB712F]
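A lookup in this two–way arrangement checks both entries of the selected cache line. The C sketch
below is an assumption of this text, meant only to show the tag comparison; replacement on a miss
is ignored here. The address split (12–bit tag, 8–bit line number, 4–bit offset) follows the example above.

    #include <stdint.h>

    #define WAYS 2

    struct entry { unsigned dirty : 1, valid : 1, tag : 12; uint8_t data[16]; };
    static struct entry cache[256][WAYS];        /* 256 cache lines, 2 entries each */

    /* Return a pointer to the addressed byte, or 0 (null) on a cache miss. */
    uint8_t *lookup(unsigned addr) {
        unsigned offset = addr & 0xF;            /* low 4 bits                */
        unsigned line   = (addr >> 4) & 0xFF;    /* next 8 bits, e.g. 0x12    */
        unsigned tag    = (addr >> 12) & 0xFFF;  /* high 12 bits, e.g. 0xCD4  */

        for (int w = 0; w < WAYS; w++)
            if (cache[line][w].valid && cache[line][w].tag == tag)
                return &cache[line][w].data[offset];
        return 0;                                /* miss: the line must be filled */
    }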
Examples of Cache Memory
We need to review cache memory and work some specific examples. The idea is simple, but
fairly abstract. We must make it clear and obvious. To review, we consider the main memory of
a computer. This memory might have a size of 384 MB, 512 MB, 1 GB, etc. It is divided into
blocks of size 2^K bytes, with K > 2.
In general, the N–bit address is broken into two parts, a block tag and an offset.
The most significant (N – K) bits of the address are the block tag
The least significant K bits represent the offset within the block.
We use a specific example for clarity.
We have a byte addressable memory, with a 24–bit address.
The cache block size is 16 bytes, so the offset part of the address is K = 4 bits.
In our example, the address layout for main memory is as follows:
Divide the 24–bit address into two parts: a 20–bit tag and a 4–bit offset.
Bits      23 – 4     3 – 0
Fields    Tag        Offset
Let’s examine the sample address, 0xAB7129, in terms of the bit divisions above.
Bits: 23 – 20 19 – 16 15 – 12 11 – 8 7–4 3–0
Hex Digit A B 7 1 2 9
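The same division can be computed with a shift and a mask. Here is a brief C sketch using the
sample address; the constants follow directly from K = 4.

    #include <stdio.h>

    int main(void) {
        unsigned addr   = 0xAB7129;     /* 24-bit byte address       */
        unsigned offset = addr & 0xF;   /* low K = 4 bits            */
        unsigned tag    = addr >> 4;    /* high N - K = 20 bits      */

        printf("tag = 0x%05X, offset = 0x%X\n", tag, offset);   /* 0xAB712, 0x9 */
        return 0;
    }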
In the write–back strategy, the cache line is written back to the slower memory only when it is
replaced. The advantage of this is that it is a faster strategy: writes always proceed at cache
speed. Furthermore, this plays on the locality theme. Suppose each entry in the cache line is
written, a total of 16 cache writes. At the end of this sequence, the cache line will eventually
be written to the slower memory. This is one slow memory write for 16 cache writes. The
disadvantage of this strategy is that it is more complex,
requiring the use of a dirty bit.
Cache Line Replacement
Assume that memory block 0xAB712 is present in cache line 0x12. We now get a memory
reference to address 0x895123. This is found in memory block 0x89512, which must be placed
in cache line 0x12. The following holds for both a memory read from and a memory write to
address 0x895123. The process is as follows; a sketch in code appears after the steps.
1. The valid bit for cache line 0x12 is examined. If (Valid = 0), there is nothing
in the cache line, so go to Step 5.
2. The memory tag for cache line 0x12 is examined and compared to the desired
tag 0x895. If (Cache Tag = 0x895) go to Step 6.
3. The cache tag does not hold the required value. Check the dirty bit.
If (Dirty = 0) go to Step 5.
4. Here, we have (Dirty = 1). Write the cache line back to memory block 0xAB712.
5. Read memory block 0x89512 into cache line 0x12. Set Valid = 1 and Dirty = 0.
6. With the desired block in the cache line, perform the memory operation.
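The six steps can be summarized in code. The C sketch below is a paraphrase of the procedure, not
a complete simulator; read_block and write_block are hypothetical helpers standing in for the
transfers to and from the slower memory, and their bodies are omitted.

    #include <stdint.h>

    struct line { unsigned dirty : 1, valid : 1, tag : 20; uint8_t data[16]; };

    /* Hypothetical helpers for the slower memory; real transfer code omitted. */
    static void write_block(unsigned tag, const uint8_t *src) { (void)tag; (void)src; }
    static void read_block (unsigned tag, uint8_t *dest)      { (void)tag; (void)dest; }

    /* Ensure the cache line holds the memory block with the given 20-bit tag. */
    void ensure_present(struct line *cl, unsigned tag) {
        if (cl->valid && cl->tag == tag)     /* Steps 1 and 2: block already present */
            return;                          /* proceed to Step 6                    */
        if (cl->valid && cl->dirty)          /* Steps 3 and 4: old block is dirty,   */
            write_block(cl->tag, cl->data);  /* write it back (e.g. block 0xAB712)   */
        read_block(tag, cl->data);           /* Step 5: fetch new block (0x89512)    */
        cl->tag   = tag;
        cl->valid = 1;
        cl->dirty = 0;
    }
    /* Step 6: the caller now performs the read or write on cl->data. */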
We have three different major strategies for cache mapping.
Direct Mapping is the simplest strategy, but it is rather rigid. One can devise “almost realistic”
programs that defeat this mapping. It is possible to have considerable cache line replacement
even when the cache is mostly empty.
Fully Associative offers the most flexibility, in that all cache lines can be used. This is also the
most complex, because it uses a larger associative memory, which is complex and costly.
N–Way Set Associative is a mix of the two strategies. It uses a smaller (and simpler)
associative memory. Each cache line holds N = 2^K sets, each the size of a memory block.
Each cache line has N cache tags, one for each set.
Consider variations of mappings to store 256 memory blocks.
Direct Mapped Cache 256 cache lines
“1–Way Set Associative” 256 cache lines 1 set per line
2–Way Set Associative 128 cache lines 2 sets per line
4–Way Set Associative 64 cache lines 4 sets per line
8–Way Set Associative 32 cache lines 8 sets per line
16–Way Set Associative 16 cache lines 16 sets per line
32–Way Set Associative 8 cache lines 32 sets per line
64–Way Set Associative 4 cache lines 64 sets per line
128–Way Set Associative 2 cache lines 128 sets per line
256–Way Set Associative 1 cache line 256 sets per line
Fully Associative Cache 256 sets
N–Way Set Associative caches can be seen as a hybrid of the Direct Mapped Caches
and Fully Associative Caches. As N goes up, the performance of an N–Way Set Associative
cache improves. After about N = 8, the improvement is so slight as not to be worth the
additional cost.
We now address two questions for this design before addressing the utility of a third level in the
cache. The first question is why the L1 cache is split into two parts. The second question is why
the cache has two levels. Suffice it to say that each design decision has been well validated by
empirical studies; we just give a rationale.
There are several reasons to have a split cache, either between the CPU and main memory or
between the CPU and a higher level of cache. One advantage is the “one way” nature of the L1
Instruction Cache; the CPU cannot write to it. This means that the I–Cache is simpler and faster
than the D–Cache; faster is always better. In addition, having the I–Cache provides some
security against self modifying code; it is difficult to change an instruction just fetched and write
it back to main memory. There is also slight security against execution of data; nothing read
through the D–Cache can be executed as an instruction.
The primary advantage of the split level–1 cache is support of a modern pipelined CPU. A
pipeline is akin to a modern assembly line. Consider an assembly line in an auto plant: there
are many cars in various stages of completion on the same line. In the CPU pipeline, there
are many instructions (generally 5 to 12) in various stages of execution. Even in the simplest
design, it is almost always the case that the CPU will try to fetch an instruction in the same
clock cycle in which it accesses data for an earlier instruction.
Here is a schematic of the pipelined CPU for the MIPS computer [R007].
This shows two of the five stages of the MIPS pipeline. In any one clock period, the control unit
will access the Level 1 I–Cache and the ALU might access the L1 D–Cache. As the I–Cache and
D–Cache are separate memories, they can be accessed at the same time with no conflict.
We note here that the ALU does not directly access the D–Cache; it is the control unit either
feeding data to a register or writing the output from the ALU to primary memory, through the
D–Cache. The basic idea is sound: let instruction fetches and data accesses proceed through
separate caches so that they do not conflict.
The following example of a two–level cache is based on data for the Apple iMAC G5, as reported
in class lectures by David Patterson [R035]. The access times and sizes for the various memory
levels are as follows:
Registers L1 I–Cache L1 D–Cache L2 Cache DRAM
Size 1 KB 64 KB 32 KB 512 KB 256 MB
Access Time 0.6 ns 1.9 ns 1.9 ns 6.9 ns 55 ns
The basic point is that smaller caches have faster access times. This, coupled with the principle
of locality, implies that the two–level cache will have better performance than a larger unified
cache. Again, industrial practice has borne this out.
The utility of a multi–level cache is illustrated by the following example, based on the
access times given in the previous table.
Suppose the following numbers for each of the three memory levels.
L1 Cache Access Time = 0.60 nanoseconds Hit rate = 95%
L2 Cache Access Time = 1.90 nanoseconds Hit rate = 98%
Main Memory Access Time = 55.0 nanoseconds.
The one–level cache would be implemented with the access time and hit rate of the L2
cache, as the one–level cache would be that size. The effective access time is thus:
TE = 0.98·1.90 + (1 – 0.98)·55.0 = 1.862 + 1.10 = 2.962 nanoseconds.
The two–level cache would use the L1 and L2 caches above and have access time:
TE = 0.95·0.60 + (1 – 0.95)·[0.98·1.90 + (1 – 0.98)·55.0]
   = 0.95·0.60 + 0.05·2.962 = 0.570 + 0.148 = 0.718 nanoseconds.
The two–level cache system is about four times faster than the bigger unified cache.
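These two results can be checked with the same effective access time formula, applied twice for
the two–level case. A short C sketch using the figures just quoted:

    #include <stdio.h>

    double eff(double h, double t_hit, double t_miss) {
        return h * t_hit + (1.0 - h) * t_miss;
    }

    int main(void) {
        double t_l1 = 0.60, h_l1 = 0.95;    /* L1 cache  */
        double t_l2 = 1.90, h_l2 = 0.98;    /* L2 cache  */
        double t_mm = 55.0;                 /* DRAM      */

        double one_level = eff(h_l2, t_l2, t_mm);        /* about 2.96 ns */
        double two_level = eff(h_l1, t_l1, one_level);   /* about 0.72 ns */

        printf("one-level cache: %.3f ns\n", one_level);
        printf("two-level cache: %.3f ns\n", two_level);
        printf("speedup: %.1f\n", one_level / two_level);
        return 0;
    }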
As the density of transistors on a CPU chip increases, the power density on the chip increases
and the chip temperature climbs into a range not compatible with stable operation.
One way to handle this heat problem is to devote more of the chip to cache memory and less to
computation. As noted by Stallings [R033], “Memory transistors are smaller and have a power
density an order of magnitude lower than logic. … the percentage of the chip area devoted to
memory has grown to exceed 50% as the chip transistor density has increased.”
Here is a diagram of a quad–core Intel Core i7 CPU. Each core has its own L1 caches as well
as dedicated L2 cache. The four cores share an 8–MB Level 3 cache.
Virtual Memory
We now turn to the next example of a memory hierarchy, one in which a magnetic disk normally
serves as a “backing store” for primary core memory. This is virtual memory. While many of
the details differ, the design strategy for virtual memory has much in common with that of cache
memory. In particular, VM is based on the idea of program locality.
Virtual memory has a precise definition and a definition implied by common usage. We discuss
both. Precisely speaking, virtual memory is a mechanism for translating logical addresses (as
issued by an executing program) into actual physical memory addresses. The address
translation circuitry is called an MMU (Memory Management Unit).
This definition alone provides a great advantage to an Operating System, which can then
allocate processes to distinct physical memory locations according to some optimization. This
has implications for security; individual programs do not have direct access to physical memory.
This allows the OS to protect specific areas of memory from unauthorized access.
Virtual Memory in Practice
Although this is not the definition, virtual memory has always been implemented by pairing a
fast DRAM Main Memory with a bigger, slower “backing store”. Originally, this was magnetic
drum memory, but it soon became magnetic disk memory. Here again is the generic two–stage
memory diagram, this time focusing on virtual memory.
The invention of time–sharing operating systems introduced another variant of VM, now part
of the common definition. A program and its data could be “swapped out” to the disk to allow
another program to run, and then “swapped in” later to resume.
Virtual memory allows the program to have a logical address space much larger than the
computer's physical address space. It maps logical addresses onto physical addresses and moves
“pages” of memory between disk and main memory to keep the program running.
An address space is the range of addresses, considered as unsigned integers, that can be
generated. An N–bit address can access 2^N items, with addresses 0 through 2^N – 1.
16–bit address    2^16 items    0 to 65,535
20–bit address    2^20 items    0 to 1,048,575
32–bit address    2^32 items    0 to 4,294,967,295
In all modern applications, the physical address space is no larger than the logical address space.
It is often somewhat smaller than the logical address space. As examples, we use a number of
machines with 32–bit logical address spaces.
Machine Physical Memory Logical Address Space
VAX–11/780 16 MB 4 GB (4,096 MB)
Pentium (2004) 128 MB 4 GB
Desktop Pentium 512 MB 4 GB
Server Pentium 4 GB 4 GB
IBM z/10 Mainframe 384 GB 2^64 bytes = 2^34 GB
Organization of Virtual Memory
Virtual memory is organized very much in the same way as cache memory. In particular, the
formula for effective access time for a two–level memory system (pages 381 and 382 of this text)
still applies. The dirty bit and valid bit are still used, with the same meaning. The names are
different, and the timings are quite different. When we speak of virtual memory, we use the
terms “page” and “page frame” rather than “memory block” and “cache line”. In the virtual
memory scenario, a page of the address space is copied from the disk and placed into an equally
sized page frame in main memory.
Another minor difference between standard cache memory and virtual memory is the way in
which the memory blocks are stored. In cache memory, both the tags and the data are stored in a
single fast memory called the cache. In virtual memory, each page is stored in main memory in a
place selected by the operating system, and the address recorded in a page table for use of the
program.
Here is an example based on a configuration that runs through this textbook. Consider a
computer with a 32–bit address space. This means that it can generate 32–bit logical addresses.
Suppose that the memory is byte addressable, and that there are 2^24 bytes of physical memory,
requiring 24 bits to address. The logical address is divided as follows:
Bits     31 – 12          11 – 0
Field    Page Number      Offset in Page
The physical address associated with the page frame in main memory is organized as follows:
Bits     23 – 12          11 – 0
Field    Address Tag      Offset in Page Frame
Virtual memory uses the page table to translate virtual addresses into physical addresses. In
most systems, there is one page table per process. Conceptually, the page table is an array of
the address tags associated with each process, indexed by the logical page number. But note that
such an array can be larger than the main memory itself. In our example, each address tag is a
12–bit value, requiring two bytes to store, as the architecture cannot access fractional bytes. The
page number is a 20–bit number, from 0 through 1,048,575. The full page table would require
two megabytes of memory to store.
Each process on a computer will be allocated a small page table containing mappings for the
most recently used logical addresses. Each table entry contains the following information (a sketch in C follows the list):
1. The valid bit, which indicates whether or not there is a valid address tag (physical
page number) present in that entry of the page table.
2. The dirty bit, indicating whether or not the data in the referenced page frame
has been altered by the CPU. This is important for page replacement policies.
3. The 20–bit page number from the logical address, indicating what logical page
is being stored in the referenced page frame.
4. The 12–bit unsigned number representing the address tag (physical page number).
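Putting these pieces together, a table entry and the translation step might be sketched in C as
follows. The structure layout and the translate function are assumptions of this text, made only
for illustration; they match the 20–bit page number and 12–bit address tag of the running
example, and a real MMU performs this lookup in hardware.

    #include <stdint.h>
    #include <stddef.h>

    struct pte {                 /* one entry of the small page table          */
        unsigned valid : 1;      /* entry holds a valid mapping                */
        unsigned dirty : 1;      /* the page frame has been written by the CPU */
        unsigned page  : 20;     /* logical page number stored in this entry   */
        unsigned frame : 12;     /* address tag: physical page frame number    */
    };

    /* Translate a 32-bit logical address to a 24-bit physical address.
       Returns 1 on success; returns 0 (a miss) if no entry matches,
       in which case the operating system must supply the mapping.      */
    int translate(const struct pte table[], size_t n,
                  uint32_t logical, uint32_t *physical) {
        uint32_t page   = logical >> 12;      /* 20-bit page number */
        uint32_t offset = logical & 0xFFF;    /* 12-bit offset      */

        for (size_t i = 0; i < n; i++) {
            if (table[i].valid && table[i].page == page) {
                *physical = ((uint32_t)table[i].frame << 12) | offset;
                return 1;
            }
        }
        return 0;
    }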
More on Virtual Memory: Can It Work?
Consider again the virtual memory system just discussed. Each memory reference is based on
a logical address, and must access the page table for translation.
But wait! The page table is in memory.
Does this imply two memory accesses for each memory reference?
This is where the TLB (Translation Look–aside Buffer) comes in. It is a cache for a page
table, more accurately called the “Translation Cache”.
The TLB is usually implemented as a split associative cache.
One associative cache for instruction pages, and
One associative cache for data pages.
A page table entry in main memory is accessed only if the TLB has a miss.
When two or more logical addresses give access to the same physical page frame, this is called
memory aliasing. In such scenarios, simple VM management systems will fail. This problem can be
handled, as long as one is aware of it.
The topic of virtual memory is worthy of considerable study. Mostly it is found in a course on
Operating Systems. The reader is encouraged to consult any one of the large number of
excellent textbooks on the subject for a more thorough examination of virtual memory.