Virtual Memory
The virtual address space of a process refers to the logical (or
virtual) view of how a process is stored in memory.
The actual physical layout is controlled by the
process's page table.
The large blank space (or hole) between the
heap and the stack is part of the virtual address
space but will require actual physical pages
only if the heap or stack grows.
Virtual address spaces that include holes are
known as sparse address spaces.
Using a sparse address space is beneficial
because the holes can be filled as the stack or
heap segments grow or if we wish to
dynamically link libraries (or possibly other
shared objects) during program execution.
Virtual memory also allows the sharing of files and memory by multiple
processes, with several benefits:
• System libraries can be shared by mapping them into the virtual address
space of more than one process.
• Processes can also share virtual memory by mapping the same block of
memory to more than one process.
• Process pages can be shared during a fork() system call, eliminating the
need to copy all of the pages of the original (parent) process.
Demand Paging
• The basic idea behind demand paging is that when a process is swapped
in, its pages are not all swapped in at once. Rather, they are swapped in
only when the process needs them (on demand). This is termed a lazy
swapper, although a pager is the more accurate term.
• Pages that are never accessed are thus never loaded into physical
memory. A demand-paging system is similar to a paging system with
swapping, where processes reside in secondary memory.
A lazy swapper never swaps a page into memory unless that page will be
needed. A swapper manipulates entire processes, whereas a pager is
concerned with the individual pages of a process.
The basic idea behind demand paging is that when a process is
swapped in, the pager loads into memory only those pages that it
expects the process to need right away.
Pages that are not loaded into memory are marked as
invalid in the page table, using the invalid bit.
The rest of the page table entry may either be blank or
contain information about where to find the swapped-out
page on the disk.
If the process only ever accesses pages that are loaded in
memory (memory-resident pages), then it runs exactly as if
all of its pages had been loaded into memory.
Page table when some pages are not in main memory.
Access to a page marked invalid causes a page fault. The procedure for handling
this page fault is straightforward.
1. The memory address requested is first checked, to make sure it was
a valid memory request.
2. If the reference was invalid, the process is terminated. Otherwise,
the page must be paged in.
3. A free frame is located, possibly from a free-frame list.
4. A disk operation is scheduled to bring in the necessary page from
disk. (This will usually block the process on an I/O wait, allowing
some other process to use the CPU in the meantime.)
5. When the I/O operation is complete, the process's page table is
updated with the new frame number, and the invalid bit is changed
to indicate that this is now a valid page reference.
6. The instruction that caused the page fault must now be restarted
from the beginning (as soon as this process gets another turn on
the CPU).
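As a concrete illustration, here is a minimal Python sketch of steps 1-6. The PTE class, the free-frame list, and the read_from_disk callback are hypothetical stand-ins for kernel structures, not a real API:

```python
from dataclasses import dataclass

@dataclass
class PTE:                      # hypothetical page-table entry
    legal: bool                 # does the address belong to the process?
    valid: bool = False         # is the page resident in memory?
    frame: int = -1             # frame number once resident
    disk_addr: int = 0          # where the page lives on the swap device

def handle_page_fault(pte, free_frames, read_from_disk):
    # Steps 1-2: check that the reference was valid; terminate otherwise.
    if not pte.legal:
        raise MemoryError("invalid reference: terminate the process")
    # Step 3: locate a free frame (e.g., from the free-frame list).
    frame = free_frames.pop()
    # Step 4: schedule a disk read into that frame (blocks the process).
    read_from_disk(pte.disk_addr, frame)
    # Step 5: update the page table and mark the entry valid.
    pte.frame, pte.valid = frame, True
    # Step 6: the faulting instruction is restarted when the process is
    # next scheduled; nothing more to do here.
```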
• In the extreme case, we can start executing a process with no
pages in memory. NO pages are swapped in for a process until
they are requested by page faults. This is known as pure
demand paging.
• In theory each instruction could generate multiple page faults. In
practice this is very rare, due to locality of reference, which
results in reasonable performance from demand paging.
The hardware to support demand paging is the same as the
hardware for paging and swapping:
1. Page table: This table has the ability to mark an entry invalid
through a valid–invalid bit or a special value of protection bits.
2. Secondary memory: This memory holds those pages that are not
present in main memory. The secondary memory is usually a high-
speed disk. It is known as the swap device, and the section of disk
used for this purpose is known as swap space.
A crucial requirement for demand paging is the ability to
restart any instruction from scratch once the desired page
has been made available in memory (after a page fault).
For most simple instructions this is not a major difficulty.
However, some architectures allow a single instruction to
modify a fairly large block of data (which may span a page
boundary), such as a block-move instruction; if some of the
data has already been modified when the page fault occurs,
this can cause problems.
Paging is added between the CPU and the memory in a
computer system and should be entirely transparent to the
user process.
People often assume that paging can be added to any
system. Although this assumption is true for a non-demand-
paging environment, where a page fault represents a fatal
error, it is not true where a page fault means only that an
additional page must be brought into memory and the
process restarted.
As long as we have no page faults, the effective access time
is equal to the memory access time.
Let p be the probability of a page fault (0 ≤ p ≤ 1). We
would expect p to be close to zero—that is, we would
expect to have only a few page faults. The effective access
time is then
effective access time = (1 − p) × ma + p × page-fault time,
where ma denotes the memory-access time.
A page fault causes the following sequence to occur:
1. Trap to the operating system.
2. Save the user registers and process state.
3. Determine that the interrupt was a page fault.
4. Check that the page reference was legal and determine the location of the page
on the disk.
5. Issue a read from the disk to a free frame:
a. Wait in a queue for this device until the read request is serviced.
b. Wait for the device seek and/or latency time.
c. Begin the transfer of the page to a free frame.
6. While waiting, allocate the CPU to some other user (CPU scheduling, optional).
7. Receive an interrupt from the disk I/O subsystem (I/O completed).
8. Save the registers and process state for the other user (if step 6 is executed).
9. Determine that the interrupt was from the disk.
10. Correct the page table and other tables to show that the desired page is now in
memory.
11. Wait for the CPU to be allocated to this process again.
12. Restore the user registers, process state, and new page table, and then resume
the interrupted instruction.
There are many steps that occur when servicing a page fault, and
some of the steps are optional or variable. But suppose that a
normal memory access requires 200 nanoseconds, and that
servicing a page fault takes 8 milliseconds (8,000,000
nanoseconds, or 40,000 times a normal memory access). With a
page-fault rate of p (on a scale from 0 to 1), the effective access
time is now
(1 − p) × 200 + p × 8,000,000
= 200 + 7,999,800 × p,
which clearly depends heavily on p! Even if only one access in
1,000 causes a page fault, the effective access time rises from
200 nanoseconds to 8.2 microseconds, a slowdown by a factor of
about 40. To keep the slowdown below 10%, the page-fault rate
must be less than 0.0000025, or less than one in 399,990
accesses.
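The arithmetic above is easy to check with a few lines of Python, using the example's 200 ns memory-access time and 8 ms fault-service time:

```python
def effective_access_time(p, ma_ns=200, fault_ns=8_000_000):
    """EAT = (1 - p) * ma + p * page-fault time, in nanoseconds."""
    return (1 - p) * ma_ns + p * fault_ns

print(effective_access_time(0.001))       # 8199.8 ns, about 8.2 microseconds
print(effective_access_time(0.0000025))   # about 220 ns, a 10% slowdown
```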
• A subtlety is that swap space is faster to access than the
regular file system, because it does not have to go through
the whole directory structure. For this reason some
systems will transfer an entire process from the file system
to swap space before starting it up, so that all future
paging occurs from the (relatively) faster swap space.
• Some systems use demand paging directly from the file
system for binary code (which never changes and hence
never has to be written back to disk when its frame is
reclaimed), and reserve the swap space for data segments
that must be written out. This approach is used by both
Solaris and BSD UNIX.
Copy-on-Write
Copy-on-write is a technique that works by allowing the
parent and child processes initially to share the same pages.
These shared pages are marked as copy-on-write pages, meaning
that if either process writes to a shared page, a copy of that
page is created.
The idea behind a copy-on-write fork() is that the pages of a
parent process do not actually have to be copied for the child
until one or the other of the processes changes a page.
They can simply be shared between the two processes in the
meantime, with a bit set indicating that the page must be copied
if it ever gets written to.
This is a reasonable approach, since the child process usually
issues an exec() system call immediately after the fork; a toy
model follows the figure below.
Before process 1 modifies page C.
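Below is a toy Python model of copy-on-write. The Frame and Process classes are invented for illustration; a real kernel does this inside the page-fault handler using per-frame reference counts, but the share-then-copy behavior is the same:

```python
class Frame:
    def __init__(self, data):
        self.data = data
        self.refs = 1                    # number of processes mapping this frame

class Process:
    def __init__(self, pages):
        self.pages = pages               # page number -> Frame

    def fork(self):
        # The child shares every frame; no page contents are copied yet.
        for f in self.pages.values():
            f.refs += 1
        return Process(dict(self.pages))

    def write(self, page, offset, value):
        f = self.pages[page]
        if f.refs > 1:                   # shared copy-on-write page
            f.refs -= 1
            f = Frame(list(f.data))      # first write: copy the page now
            self.pages[page] = f
        f.data[offset] = value

parent = Process({0: Frame([0] * 4)})
child = parent.fork()                    # nothing is copied here
child.write(0, 0, 42)                    # this write triggers the copy
print(parent.pages[0].data[0])           # 0: the parent's page is untouched
```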
Page Replacement
In order to make the most use of virtual memory, we load
several processes into memory at the same time. Since we
only load the pages that are actually needed by each
process at any given time, there is room to load many
more processes than if we had to load in the entire
process.
FIFO Page Replacement
The simplest page-replacement algorithm is a first-in,
first-out (FIFO) algorithm.
A FIFO replacement algorithm associates with each
page the time when that page was brought into memory.
When a page must be replaced, the oldest page is
chosen.
A FIFO queue can be created to hold all pages in
memory. We replace the page at the head of the queue.
When a page is brought into memory, we insert it at the
tail of the queue.
For example, let's use the following reference string for a memory with three frames:
7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1
Initially, our three frames are empty. The first three references (7, 0, 1) cause
page faults and are brought into these empty frames; in all, FIFO produces 15
page faults on this string.
FIFO can also behave counterintuitively. For the reference string
1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5, notice that the number of faults for four
frames (ten) is greater than the number of faults for three frames (nine)!
This unexpected result is known as Belady's anomaly.
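Both fault counts are easy to reproduce with a small FIFO simulator; this is a counting sketch, not production code:

```python
from collections import deque

def fifo_faults(refs, nframes):
    resident, queue, faults = set(), deque(), 0
    for page in refs:
        if page in resident:
            continue
        faults += 1
        if len(resident) == nframes:          # evict the oldest page
            resident.remove(queue.popleft())
        resident.add(page)
        queue.append(page)
    return faults

string1 = [7,0,1,2,0,3,0,4,2,3,0,3,2,1,2,0,1,7,0,1]
string2 = [1,2,3,4,1,2,5,1,2,3,4,5]
print(fifo_faults(string1, 3))                           # 15
print(fifo_faults(string2, 3), fifo_faults(string2, 4))  # 9 10 (Belady's anomaly)
```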
Optimal Page Replacement
The discovery of Belady's anomaly led to the search for
an optimal page-replacement algorithm: one that yields
the lowest possible page-fault rate of all algorithms and
never suffers from Belady's anomaly.
Such an algorithm does exist, and it is called OPT or MIN.
This algorithm is simply: "Replace the page that will not
be used for the longest period of time."
Use of this page-replacement algorithm guarantees the
lowest possible page-fault rate for a fixed number of
frames.
Let's use the same reference string for a memory with three frames.
7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1
OPT yields nine faults on this string.
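OPT cannot be implemented online (it requires future knowledge), but it is easy to simulate when the whole reference string is known in advance, as in this sketch:

```python
def opt_faults(refs, nframes):
    resident, faults = set(), 0
    for i, page in enumerate(refs):
        if page in resident:
            continue
        faults += 1
        if len(resident) == nframes:
            future = refs[i + 1:]
            # Evict the resident page whose next use is farthest in the
            # future (or that is never used again).
            victim = max(resident,
                         key=lambda p: future.index(p) if p in future
                                       else float('inf'))
            resident.remove(victim)
        resident.add(page)
    return faults

refs = [7,0,1,2,0,3,0,4,2,3,0,3,2,1,2,0,1,7,0,1]
print(opt_faults(refs, 3))    # 9
```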
LRU Page Replacement
The prediction behind the LRU (least recently used)
algorithm is that the page that has not been used for the
longest time is the one least likely to be used again in the
near future.
The distinction between FIFO and LRU: the former looks
at the oldest load time, and the latter looks at the oldest use
time.
LRU is analogous to OPT, except that it looks backward in
time instead of forward. OPT has the interesting property
that for any reference string S and its reverse R, OPT will
generate the same number of page faults for S and for R. It
turns out that LRU has this same property.
Let's use the same reference string for a memory with three frames.
7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1
⁂ LRU yields 12 page faults on our sample string, compared with 15 for
FIFO and 9 for OPT.
⁂ The LRU policy is often used as a page-replacement algorithm and is
considered to be good. The major problem is how to implement LRU
replacement.
⁂ An LRU page-replacement algorithm may require substantial hardware
assistance. The problem is to determine an order for the frames defined by
the time of last use. A simple simulation appears below.
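A list-based simulator (least recently used page at the front, most recent at the tail) confirms the 12-fault figure; it is a sketch for counting faults, not an efficient implementation:

```python
def lru_faults(refs, nframes):
    frames, faults = [], 0
    for page in refs:
        if page in frames:
            frames.remove(page)       # refresh: move the page to the tail
        else:
            faults += 1
            if len(frames) == nframes:
                frames.pop(0)         # evict the least recently used page
        frames.append(page)
    return faults

refs = [7,0,1,2,0,3,0,4,2,3,0,3,2,1,2,0,1,7,0,1]
print(lru_faults(refs, 3))            # 12
```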
There are two simple approaches commonly used:
1. Counters: every memory access increments a counter, and the
current value of this counter is stored in the page-table entry for
that page. Finding the LRU page then involves simply searching
the table for the page with the smallest counter value. Note that
overflow of the counter must be considered.
2. Stack: another approach is to use a stack; whenever a page
is accessed, pull that page from the middle of the stack and place
it on top. The LRU page will always be at the bottom of the
stack. Because this requires removing entries from the middle of
the stack, a doubly linked list is the recommended data structure
(see the sketch after this list).
• Note that both implementations of LRU require hardware support,
either for incrementing the counter or for managing the stack, as
these operations must be performed for every memory access.
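The stack approach maps naturally onto a doubly linked list. In Python, collections.OrderedDict (which is backed by a doubly linked list) can play that role; this tracker is purely illustrative, since real LRU bookkeeping would have to happen in hardware or microcode:

```python
from collections import OrderedDict

class LRUStack:
    """Stack-style LRU bookkeeping: most recently used page on top."""
    def __init__(self):
        self.stack = OrderedDict()

    def access(self, page):
        self.stack[page] = True
        self.stack.move_to_end(page)     # pull the page to the top

    def lru_page(self):
        return next(iter(self.stack))    # the bottom holds the LRU page

s = LRUStack()
for p in [7, 0, 1, 0, 7]:
    s.access(p)
print(s.lru_page())                      # 1 is the least recently used page
```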
Neither LRU nor OPT exhibits Belady's anomaly. Both belong to a
class of page-replacement algorithms called stack algorithms, which
can never exhibit Belady's anomaly.
A stack algorithm is one in which the set of pages kept in memory
with a frame set of size N will always be a subset of the pages kept
with a frame set of size N + 1.
LRU-Approximation Page Replacement
⁂ Unfortunately, a full implementation of LRU requires hardware support, and
few systems provide the full hardware support necessary.
⁂ In particular, many systems provide a reference bit for every entry in a page
table, which is set any time that page is accessed. Initially all bits are set to
zero, and they can also all be cleared at any time. One bit of precision is
enough to distinguish pages that have been accessed since the last clear
from those that have not, but it does not provide any finer grain of detail.
Additional-Reference-Bits Algorithm
Finer grain is possible by storing the most recent 8 reference bits for each page
in an 8-bit byte in the page-table entry, which is interpreted as an unsigned integer.
At periodic intervals (clock interrupts), the OS takes over and shifts each
of the reference bytes right by one bit.
The high-order (leftmost) bit is then filled in with the current value of the
reference bit, and the reference bits are cleared.
At any given time, the page with the smallest value for its reference byte is
the LRU page.
Obviously the specific number of bits used and the frequency with which
the reference byte is updated are adjustable, and they are tuned to give the
fastest performance on a given hardware platform.
E.g., the page with reference bits 11000100 has been used more recently
than the page with reference bits 01110111.
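One aging step is just a shift and an OR; this hypothetical helper makes the bookkeeping explicit:

```python
def age_tick(ref_byte, referenced):
    """One clock interrupt: shift the history right one bit and record
    the current reference bit in the high-order position."""
    ref_byte >>= 1
    if referenced:
        ref_byte |= 0x80
    return ref_byte

# The slide's example, compared as unsigned integers:
# 0b11000100 (196) > 0b01110111 (119), so the first page was
# used more recently than the second.
print(0b11000100 > 0b01110111)    # True
```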
Second-Chance Algorithm
The second chance algorithm is essentially a FIFO, except the reference
bit is used to give pages a second chance at staying in the page table.
When a page must be replaced, the page table is scanned in a FIFO
(circular queue) manner.
If a page is found with its reference bit not set, then that page is selected
as the next victim.
If, however, the next page in the FIFO does have its reference bit set,
then it is given a second chance:
The reference bit is cleared, and the FIFO search continues.
If some other page is found that does not have its reference bit set, then
that page will be selected as the victim, and the page that was given the
second chance will be allowed to stay in the page table.
If, however, no other page has its reference bit clear, then this page will
be selected as the victim when the FIFO search circles back around to it
on the second pass.
If all reference bits in the table are set, then second chance degenerates into
FIFO, but it also requires a complete search of the table for every page
replacement. As long as some pages have their reference bits clear, any page
referenced frequently enough gets to stay in the page table indefinitely. A
clock-style sketch follows.
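Here is a circular-queue ("clock") sketch of second chance. The hand sweeps over the frames, clearing reference bits until it finds one already clear; the exact fault count depends on such implementation choices, so the number below is specific to this sketch:

```python
def second_chance_faults(refs, nframes):
    frames = [None] * nframes
    refbit = [0] * nframes
    hand, faults = 0, 0
    for page in refs:
        if page in frames:
            refbit[frames.index(page)] = 1   # hardware sets this on access
            continue
        faults += 1
        # Pages with the reference bit set get a second chance.
        while refbit[hand]:
            refbit[hand] = 0
            hand = (hand + 1) % nframes
        frames[hand] = page                  # victim found; replace it
        refbit[hand] = 1                     # the new page counts as referenced
        hand = (hand + 1) % nframes
    return faults

refs = [7,0,1,2,0,3,0,4,2,3,0,3,2,1,2,0,1,7,0,1]
print(second_chance_faults(refs, 3))         # 14 with this sketch's choices
```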
Enhanced Second-Chance Algorithm
The enhanced second-chance algorithm looks at the reference bit and the
modify bit (dirty bit) as an ordered pair, and classifies pages into one of
four classes:
(0, 0): neither recently used nor modified, the best page to replace.
(0, 1): not recently used but modified, so the page must be written out
before it can be replaced.
(1, 0): recently used but clean, and probably about to be used again.
(1, 1): recently used and modified, the worst candidate for replacement.
Minimum Number of Frames
⁂ The absolute minimum number of frames that a process must be
allocated is dependent on system architecture, and corresponds to the
worst-case scenario of the number of pages that could be touched by a
single (machine) instruction.
⁂ If an instruction (and its operands) spans a page boundary, then multiple
pages could be needed just for the instruction fetch.
⁂ The memory references made by an instruction touch additional pages,
and if those locations can span page boundaries, then multiple pages could
be needed for operand access as well.
⁂ The worst case involves indirect addressing, particularly where multiple
levels of indirection are allowed. For this reason, architectures place a
limit on the number of levels of indirection allowed in an instruction,
enforced with a counter that is initialized to the limit and decremented at
every level of indirection; if the counter reaches zero, an "excessive
indirection" trap occurs.
Allocation Algorithms
⁂ Equal Allocation - If there are m frames available and n processes
to share them, each process gets m / n frames, and the leftovers are
kept in a free-frame buffer pool.
⁂ Proportional Allocation - Allocate the frames proportionally to the
size of each process, relative to the total size of all processes. So if the
size of process i is S_i, and S is the sum of all S_i, then the
allocation for process P_i is a_i = m × S_i / S (see the sketch after
this list).
⁂ Variations on proportional allocation could consider the priority of
processes rather than just their size.
⁂ Obviously all allocations fluctuate over time as the number of
available free frames, m, fluctuates, and all are also subject to the
constraints of minimum allocation. (If the minimum allocations
cannot be met, then processes must either be swapped out or not
allowed to start until more free frames become available.)
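A quick sketch of proportional allocation, using integer division so the leftovers stay in the free pool; the 62-frame split between a 10-page and a 127-page process is a common textbook illustration:

```python
def proportional_allocation(m, sizes):
    """a_i = m * S_i / S, rounded down; leftover frames stay free."""
    S = sum(sizes)
    return [m * s // S for s in sizes]

# 62 free frames shared by processes of 10 and 127 pages:
# 62 * 10 / 137 ~ 4 frames and 62 * 127 / 137 ~ 57 frames.
print(proportional_allocation(62, [10, 127]))    # [4, 57]
```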
Global versus Local Allocation
⁂ One big question is whether frame allocation (page
replacement) occurs on a local or global level.
⁂ With local replacement, the number of pages allocated to a
process is fixed, and page replacement occurs only amongst the
pages allocated to this process.
⁂ With global replacement, any page may be a potential victim,
whether it currently belongs to the process seeking a free frame
or not.
⁂ Local page replacement allows processes to better control their
own page fault rates, and leads to more consistent performance
of a given process over different system load levels.
⁂ Global page replacement is overall more efficient, and is the
more commonly used approach.
Non-Uniform Memory Access
⁂ The above arguments all assume that all memory is equivalent, or at least
has equivalent access times.
⁂ This may not be the case in multiple-processor systems, especially where
each CPU is physically located on a separate circuit board which also holds
some portion of the overall system memory.
⁂ In these latter systems, CPUs can access memory that is physically located
on the same board much faster than the memory on the other boards.
⁂ The basic solution is akin to processor affinity: at the same time that we
try to schedule processes on the same CPU to minimize cache misses, we
also try to allocate memory for those processes on the same boards, to
minimize access times.
⁂ The presence of threads complicates the picture, especially when the
threads get loaded onto different processors.
⁂ Solaris uses an lgroup as a solution, in a hierarchical fashion based on
relative latency. For example, all processors and RAM on a single board
would probably be in the same lgroup. Memory assignments are made
within the same lgroup if possible, or to the next nearest lgroup otherwise.
Thrashing
If a process cannot maintain its minimum required
number of frames, then it must be swapped out, freeing
up frames for other processes. This is an intermediate
level of CPU scheduling.
But what about a process that can keep its minimum, but
cannot keep all of the frames that it is currently using on
a regular basis? In this case it is forced to page out pages
that it will need again in the very near future, leading to
large numbers of page faults.
A process that is spending more time paging than
executing is said to be thrashing.
Cause of Thrashing
⁂ Early process scheduling schemes would control the level of
multiprogramming allowed based on CPU utilization, adding in more
processes when CPU utilization was low.
⁂ The problem is that when memory filled up and processes started spending
lots of time waiting for their pages to be paged in, CPU utilization would
fall, causing the scheduler to add in even more processes and exacerbating
the problem! Eventually the system would essentially grind to a halt.
⁂ Local page replacement policies can prevent one thrashing process from
taking pages away from other processes, but it still tends to clog up the I/O
queue, thereby slowing down any other process that needs to do even a
little bit of paging (or any other I/O for that matter.)
To prevent thrashing we must
provide processes with as many
frames as they really need "right
now", but how do we know what
that is?
The locality model notes that
processes typically access
memory references in a given
locality, making lots of references
to the same general area of
memory before moving
periodically to a new locality.
If we could just keep as many
frames as are involved in the
current locality, then page faulting
would occur primarily on
switches from one locality to
another.
Working-Set Model
⁂ The working-set model is based on the concept of locality, and defines a
working-set window of length Δ (delta). Whatever pages are included in the
most recent Δ page references are said to be in the process's working-set
window, and they comprise its current working set.
The selection of Δ is critical to the success of the working-set model: if it is too
small, it does not encompass all of the pages of the current locality, and if it is
too large, it encompasses pages that are no longer being frequently accessed.
The total demand, D, is the sum of the sizes of the working sets of all processes. If
D exceeds the total number of available frames, then at least one process is
thrashing, because there are not enough frames available to satisfy its minimum
working set. If D is significantly less than the number of currently available
frames, then additional processes can be launched.
⁂ The hard part of the working-set model is keeping track of what
pages are in the current working set, since every reference adds one
to the set and removes one older page. An approximation can be
made using reference bits and a timer that goes off after a set interval
of memory references:
For example, suppose that we set the timer to go off after every
5000 references (by any process), and we can store two
additional historical reference bits in addition to the current
reference bit.
Every time the timer goes off, the current reference bit is copied
to one of the two historical bits, and then cleared.
If any of the three bits is set, then that page was referenced
within the last 15,000 references, and is considered to be in that
process's working set.
Finer resolution can be achieved with more historical bits and a
more frequent timer, at the expense of greater overhead.
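Offline, the working set at time t falls directly out of the definition, which makes it easy to experiment with different values of Δ (the in-kernel difficulty is doing this online for every reference, hence the bit-based approximation above). The reference string below is made up for illustration:

```python
def working_set(refs, t, delta):
    """Pages touched by the most recent `delta` references ending at time t."""
    return set(refs[max(0, t - delta + 1): t + 1])

refs = [1, 2, 5, 6, 2, 1, 2, 1, 2, 3, 4, 4, 4, 3, 4]
print(working_set(refs, 9, 5))     # {1, 2, 3}: references 5 through 9
```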
Page-Fault Frequency
⁂ A more direct approach is to recognize that what we really want to control
is the page-fault rate, and to allocate frames based on this directly
measurable value. If the page-fault rate exceeds a certain upper bound then
that process needs more frames, and if it is below a given lower bound,
then it can afford to give up some of its frames to other processes.
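A sketch of page-fault frequency as a simple control rule; the bounds and the minimum allocation are assumed values chosen for illustration, not system constants:

```python
MIN_FRAMES = 2     # assumed architecture-dependent floor

def adjust_allocation(fault_rate, frames, lower=0.01, upper=0.10):
    """Grow the allocation above the upper bound, shrink it below the lower."""
    if fault_rate > upper:
        return frames + 1              # the process needs another frame
    if fault_rate < lower and frames > MIN_FRAMES:
        return frames - 1              # the process can give a frame back
    return frames

print(adjust_allocation(0.20, 8))      # 9: faulting too often, add a frame
print(adjust_allocation(0.001, 8))     # 7: plenty of headroom, reclaim one
```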