EC6009 Advanced Computer Architecture Unit V Memory and I/O: Cache Performance
CACHE PERFORMANCE
The Figure shows a multilevel memory hierarchy, including typical sizes and speeds of access.
When a word is not found in the cache, the word must be fetched from a lower level in
the hierarchy (which may be another cache or the main memory) and placed in the cache before
continuing.
Multiple words, called a block (or line), are moved for efficiency reasons. Each cache
block includes a tag to indicate which memory address it corresponds to.
A key design decision is where blocks (or lines) can be placed in a cache:
(i) Set associative, where a set is a group of blocks in the cache. A block is first mapped
onto a set, and then the block can be placed anywhere within that set. Finding a block consists of
first mapping the block address to the set and then searching the set, usually in parallel, to find
the block. The set is chosen by the address of the data:
(Block address) MOD (Number of sets in cache)
If there are n blocks in a set, the cache placement is called n-way set associative.
(ii) A direct-mapped cache has just one block per set (so a block is always placed in the
same location).
(iii) A fully associative cache has just one set (so a block can be placed anywhere).
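As an illustration of the placement rule above, the following C sketch decomposes a byte address into a block offset, set index, and tag. The block size and number of sets are assumed example values, not parameters taken from these notes.

#include <stdint.h>
#include <stdio.h>

/* Assumed example geometry: 64-byte blocks and 128 sets
   (e.g., a 32 KB, 4-way set associative cache). */
#define BLOCK_SIZE 64
#define NUM_SETS   128

int main(void) {
    uint32_t addr       = 0x1234ABCD;             /* example byte address */
    uint32_t block_addr = addr / BLOCK_SIZE;      /* drop the block-offset bits */
    uint32_t set_index  = block_addr % NUM_SETS;  /* (Block address) MOD (Number of sets) */
    uint32_t tag        = block_addr / NUM_SETS;  /* remaining high-order bits */
    printf("set = %u, tag = 0x%X\n", set_index, tag);
    return 0;
}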
Caching data that is only read is easy, since the copy in the cache and memory will be identical.
Caching writes is more difficult; for example, how can the copy in the cache and memory
be kept consistent? There are two main strategies. A write-through cache updates the item in the
cache and writes through to update main memory. A write-back cache only updates the copy in
the cache. When the block is about to be replaced, it is copied back to memory. Both write
strategies can use a write buffer to allow the cache to proceed as soon as the data are placed in
the buffer rather than wait the full latency to write the data into memory.
A measure of average memory access time:
Average memory access time = Hit time + Miss rate x Miss penalty
Hit time is the time to hit in the cache.
Miss penalty is the time to replace the block from memory (that is, the cost of a miss).
Miss rate is simply the fraction of cache accesses that result in a miss—that is, the number of
accesses that miss divided by the number of accesses.
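For example (the numbers are illustrative, not taken from these notes): with a hit time of 1 clock cycle, a miss rate of 5%, and a miss penalty of 100 clock cycles,
Average memory access time = 1 + 0.05 x 100 = 6 clock cycles.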
The three Cs model sorts all misses into three simple categories:
Compulsory - The very first access to a block cannot be in the cache, so the block must
be brought into the cache. Compulsory misses are those that occur even if you had an infinite
sized cache.
Capacity - If the cache cannot contain all the blocks needed during execution of a
program, capacity misses (in addition to compulsory misses) will occur because of blocks being
discarded and later retrieved.
Conflict - If the block placement strategy is not fully associative, conflict misses (in
addition to compulsory and capacity misses) will occur because a block may be discarded and
later retrieved if multiple blocks map to its set and accesses to the different blocks are
intermingled.
Another approach reduces conflict misses and yet maintains the hit speed of a direct-mapped
cache. In way prediction, extra bits are kept in the cache to predict the way, or block within the
set, of the next cache access.
Added to each block of a cache are block predictor bits. The bits select which of the blocks to
try on the next cache access. If the predictor is correct, the cache access latency is the fast hit
time. If not, it tries the other block, changes the way predictor, and has a latency of one extra
clock cycle. Simulations suggest that set prediction accuracy is in excess of 90% for a two-way
set associative cache and 80% for a four-way set associative cache.
An extended form of way prediction can also be used to reduce power consumption by using
the way prediction bits to decide which cache block to actually access. The way prediction bits
are essentially extra address bits. This approach, which might be called way selection, saves
power when the way prediction is correct but adds significant time on a way misprediction.
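A minimal software model of way prediction for a two-way set associative cache is sketched below in C. The structure layout, latency values, and function name are assumptions for illustration, not hardware details from these notes.

#include <stdint.h>
#include <stdbool.h>

#define NUM_SETS 128

typedef struct {
    uint32_t tag[2];     /* tags of the two blocks in the set */
    bool     valid[2];   /* valid bits */
    int      pred_way;   /* predictor bit: which way to try first */
} set_t;

static set_t cache[NUM_SETS];

/* Returns the access latency in cycles: 1 on a correct prediction,
   2 on a hit after a way mispredict; -1 denotes a miss. */
int lookup(uint32_t set_index, uint32_t tag) {
    set_t *s = &cache[set_index];
    int first = s->pred_way;
    if (s->valid[first] && s->tag[first] == tag)
        return 1;                  /* fast hit: prediction was correct */
    int other = 1 - first;
    if (s->valid[other] && s->tag[other] == tag) {
        s->pred_way = other;       /* update the way predictor */
        return 2;                  /* hit, but with one extra clock cycle */
    }
    return -1;                     /* miss: block is not in this set */
}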
(i) Critical Word First and Early Restart to Reduce Miss Penalty
This technique is based on the observation that the processor normally needs just one word of the
block at a time. The strategy is impatience: don't wait for the full block to be loaded before
sending the requested word and restarting the processor. Here are two specific strategies:
Critical word first - Request the missed word first from memory and send it to the
processor as soon as it arrives; let the processor continue execution while filling the rest of the
words in the block.
Early restart - Fetch the words in normal order, but as soon as the requested word of the
block arrives send it to the processor and let the processor continue execution.
Generally, these techniques only benefit designs with large cache blocks, since the
benefit is low unless blocks are large. Note that caches normally continue to satisfy accesses to
other blocks while the rest of the block is being filled.
Write-through caches rely on write buffers, as all stores must be sent to the next lower level
of the hierarchy. Even write-back caches use a simple buffer when a block is replaced. If the
write buffer is empty, the data and the full address are written in the buffer, and the write is
finished from the processor’s perspective. The processor continues working while the write
buffer prepares to write the word to memory. If the buffer contains other modified blocks, the
addresses can be checked to see if the address of the new data matches the address of a valid
write buffer entry. If so, the new data are combined with that entry. This optimization is called
write merging. The Intel Core i7, among many others, uses write merging.
If the buffer is full and there is no address match, the cache (and processor) must wait until
the buffer has an empty entry. This optimization uses the memory more efficiently since
multiword writes are usually faster than writes performed one word at a time.
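The following C sketch illustrates the write-merging check described above. The buffer depth, entry format, and word size are assumptions for illustration only.

#include <stdint.h>
#include <stdbool.h>

#define ENTRIES         4
#define WORDS_PER_ENTRY 4   /* each entry holds one aligned 4-word block */

typedef struct {
    bool     valid;
    uint32_t block_addr;                 /* address of the aligned block */
    uint32_t data[WORDS_PER_ENTRY];
    bool     word_valid[WORDS_PER_ENTRY];
} wb_entry_t;

static wb_entry_t buf[ENTRIES];

/* Returns true if the write was accepted (merged into an existing entry or
   placed in a free one); false means the buffer is full and the processor
   must wait for an entry to drain. */
bool write_buffer_put(uint32_t addr, uint32_t data) {
    uint32_t block = addr / (4 * WORDS_PER_ENTRY);
    uint32_t word  = (addr / 4) % WORDS_PER_ENTRY;
    for (int i = 0; i < ENTRIES; i++)        /* first try to merge */
        if (buf[i].valid && buf[i].block_addr == block) {
            buf[i].data[word] = data;
            buf[i].word_valid[word] = true;
            return true;
        }
    for (int i = 0; i < ENTRIES; i++)        /* otherwise take a free entry */
        if (!buf[i].valid) {
            buf[i].valid      = true;
            buf[i].block_addr = block;
            buf[i].data[word] = data;
            buf[i].word_valid[word] = true;
            return true;
        }
    return false;                            /* buffer full: stall */
}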
Write merging is illustrated in the accompanying figure, which contrasts a write buffer without
write merging against one that uses it.
For pipelined computers that allow out-of-order execution, the processor need not stall on a
data cache miss. For example, the processor could continue fetching instructions from the
instruction cache while waiting for the data cache to return the missing data. A nonblocking
cache or lockup-free cache escalates the potential benefits of such a scheme by allowing the data
cache to continue to supply cache hits during a miss. This “hit under miss” optimization reduces
the effective miss penalty by being helpful during a miss instead of ignoring the requests of the
processor.
Compiler optimizations can reduce miss rates without any hardware changes. This magical reduction comes
from optimized software. The increasing performance gap between processors and main memory
has inspired compiler writers to scrutinize the memory hierarchy to see if compile time
optimizations can improve performance. The optimizations presented below are found in many
modern compilers.
Loop interchange:
Some programs have nested loops that access data in memory in nonsequential order.
Simply exchanging the nesting of the loops can make the code access the data in the order in
which they are stored. Assuming the arrays do not fit in the cache, this technique reduces misses.
Reordering maximizes use of data in a cache block before they are discarded.
For example, if x is a two-dimensional array of size [5000,100] allocated so that x[i,j] and
x[i,j+1] are adjacent, then the two pieces of code below show how the accesses can be optimized:
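A representative version of the two loop nests is shown below; the loop body (doubling each element) is an assumption for illustration.

/* Before: the inner loop strides through memory in steps of 100 words */
for (j = 0; j < 100; j = j + 1)
    for (i = 0; i < 5000; i = i + 1)
        x[i][j] = 2 * x[i][j];

/* After loop interchange: the inner loop walks sequentially through memory */
for (i = 0; i < 5000; i = i + 1)
    for (j = 0; j < 100; j = j + 1)
        x[i][j] = 2 * x[i][j];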
The original code would skip through memory in strides of 100 words, while the revised
version accesses all the words in one cache block before going to the next block. This
optimization improves cache performance without affecting the number of instructions executed.
(v) Hardware Prefetching of Instructions and Data to Reduce Miss Rate or Miss
Penalty
Another approach to reduce miss penalty is to prefetch items before the processor
requests them. Both instructions and data can be prefetched, either directly into the caches or into
an external buffer that can be more quickly accessed than main memory.
Instruction prefetch is frequently done in hardware outside of the cache. Typically, the
processor fetches two blocks on a miss: the requested block and the next consecutive block. The
requested block is placed in the instruction cache when it returns, and the prefetched block is
placed into the instruction stream buffer. If the requested block is present in the instruction
stream buffer, the original cache request is canceled, the block is read from the stream buffer,
and the next prefetch request is issued.
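A rough software model of this sequential prefetch scheme is sketched below in C. The cache size, the one-entry stream buffer, and the printed messages are all assumptions for illustration.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define ICACHE_BLOCKS 256                /* tiny direct-mapped cache model */

static uint32_t icache_tag[ICACHE_BLOCKS];
static bool     icache_valid[ICACHE_BLOCKS];
static uint32_t stream_buf;              /* one-entry instruction stream buffer */
static bool     stream_buf_valid = false;

static bool in_icache(uint32_t block) {
    uint32_t idx = block % ICACHE_BLOCKS;
    return icache_valid[idx] && icache_tag[idx] == block;
}

static void insert_icache(uint32_t block) {
    uint32_t idx = block % ICACHE_BLOCKS;
    icache_valid[idx] = true;
    icache_tag[idx]   = block;
}

/* On a miss, the requested block goes into the cache and the next
   sequential block is prefetched into the stream buffer; a later miss
   that hits in the stream buffer is satisfied from there instead. */
void fetch_block(uint32_t block) {
    if (in_icache(block))
        return;                                   /* normal cache hit */
    if (stream_buf_valid && stream_buf == block)
        printf("block %u read from stream buffer\n", block);
    else
        printf("block %u fetched from memory\n", block);
    insert_icache(block);
    stream_buf       = block + 1;                 /* issue the next prefetch */
    stream_buf_valid = true;
    printf("block %u prefetched into stream buffer\n", block + 1);
}

int main(void) {
    fetch_block(10);   /* miss: fetch 10, prefetch 11 */
    fetch_block(11);   /* satisfied from the stream buffer */
    return 0;
}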
MAIN MEMORY
Main memory is the name given to the level below the cache(s) in the memory hierarchy.
A main memory may have a few MBytes for a typical personal computer, tens to hundreds of
MBytes for a workstation, and hundreds of MBytes to GBytes for supercomputers. The capacity of
main memory has continuously increased over the years, as prices have dramatically dropped.
The main memory must satisfy the cache requests as quickly as possible, and must provide
sufficient bandwidth for I/O devices and for vector units.
The access time is defined as the time between the moment the read command is issued
and the moment the requested data is at the outputs. The cycle time is defined as the minimum time
between successive accesses to memory. The cycle time is usually greater than the access time.
Main memory is built from DRAM. To access data in a memory chip with a capacity of N x M bits,
one must provide a number of address bits equal to log2 N, where N is the number of "words" the
chip has and each "word" is M bits wide. As the technology improved, packaging costs became a
real concern, because the number of address lines grew larger and larger.
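For example (illustrative numbers only): a 1M x 4 DRAM chip has N = 1M words of M = 4 bits each, so it needs log2(1M) = 20 address bits; with the multiplexed row and column addressing described below, only 10 address pins are required, carrying the row half and then the column half of the address in turn.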
DRAM is dynamic because it must be refreshed periodically (roughly every 8 ms). The address is
divided into two halves: a row address, signalled by RAS (Row Access Strobe), and a column
address, signalled by CAS (Column Access Strobe). The control signals (RAS_L, CAS_L, WE_L,
OE_L) are all active low. Din and Dout are combined on a single pin (D): when WE_L is asserted
(low) and OE_L is deasserted (high), D serves as the data input pin; when WE_L is deasserted
(high) and OE_L is asserted (low), D is the data output pin. The row and column addresses share
the same pins (A): when RAS_L goes low, the A pins are latched in as the row address, and when
CAS_L goes low, the A pins are latched in as the column address. RAS and CAS are edge-sensitive.
Average seek time is the subject of considerable misunderstanding. Disk manufacturers report
minimum seek time, maximum seek time, and average seek time in their manuals. The first two
are easy to measure, but the average is open to wide interpretation. The time for the requested
sector to rotate under the head is the rotation latency or rotational delay.
One challenger to magnetic disks is the optical compact disk, or CD, together with its successor,
first called the Digital Video Disc and then the Digital Versatile Disc, or just DVD. Both the CD-ROM
and the DVD-ROM are removable and inexpensive to manufacture, but they are read-only media.
These 4.7-inch diameter disks hold 0.65 and 4.7 GB, respectively, although some DVDs write on
both sides to double their capacity. Their high capacity and low cost have led to CD-ROMs and
DVD-ROMs replacing floppy disks as the favorite medium for distributing software and other
types of computer data.
The popularity of CDs, and of music that can be downloaded from the Internet, led to a market for
rewritable CDs, conveniently called CD-RW, and write-once CDs, called CD-R. In 2001, there was a small
cost premium for drives that can record on CD-RW. The media itself costs about $0.20 per CD-R
disk or $0.60 per CD-RW disk. CD-RWs and CD-Rs read at about half the speed of CD-ROMs
and CD-RWs and CD-Rs write at about a quarter the speed of CD-ROMs.
Magnetic tapes have been part of computer systems as long as disks because they use
similar technology to disks, and hence historically have followed the same density
improvements. The inherent cost/performance difference between disks and tapes is
based on their geometries:
Fixed rotating platters offer random access in milliseconds, but disks have a limited
storage area and the storage medium is sealed within each reader.
Long strips wound on removable spools of “unlimited” length mean many tapes
can be used per reader, but tapes require sequential access that can take seconds.
One of the limits of tapes had been the speed at which the tapes can spin without
breaking or jamming. A technology called helical scan tapes solves this problem by
keeping the tape speed the same but recording the information on a diagonal to the tape
with a tape reader that spins much faster than the tape is moving. This technology
increases recording density by about a factor of 20 to 50. Helical scan tapes were
developed for low-cost VCRs and camcorders, which brought down the cost of the tapes
and readers.
Embedded devices also need nonvolatile storage, but premiums placed on space and power
normally lead to the use of Flash memory instead of magnetic recording. Flash memory is also
used as a rewritable ROM in embedded systems, typically to allow software to be upgraded
without having to replace chips. Applications are typically prohibited from writing to Flash
memory in such circumstances. Like electrically erasable and programmable read-only memories
(EEPROM), Flash memory is written by inducing the tunneling of charge from the transistor channel to
a floating gate. The floating gate acts as a potential well which stores the charge, and the charge
cannot move from there without applying an external force. The primary difference between
EEPROM and Flash memory is that Flash restricts writes to multikilobyte blocks, increasing
memory capacity per chip by reducing area dedicated to control. Compared to disks, Flash
memories offer low power consumption (less than 50 milliwatts), can be sold in small sizes, and
offer read access times comparable to DRAMs. In 2001, a 16 Mbit Flash memory has a 65 ns
access time, and a 128 Mbit Flash memory has a 150 ns access time.
BUSES
Buses were traditionally classified as CPU-memory buses or I/O buses. I/O buses may be
lengthy, may have many types of devices connected to them, have a wide range in the data
bandwidth of the devices connected to them, and normally follow a bus standard. CPU-memory
buses, on the other hand, are short, generally high speed, and matched to the memory system to
maximize memory-CPU bandwidth. During the design phase, the designer of a CPU-memory
bus knows all the types of devices that must connect together, while the I/O bus designer must
accept devices varying in latency and bandwidth capabilities. To lower costs, some computers
have a single bus for both memory and I/O devices. In the quest for higher I/O performance,
some buses are a hybrid of the two. For example, PCI is relatively short, and is used to connect
to more traditional I/O buses via bridges that speak both PCI on one end and the I/O bus protocol
on the other. To indicate their intermediate position, such buses are sometimes called mezzanine buses.
The design of a bus presents several options, as the figure below shows. Like the rest of the
computer system, decisions depend on cost and performance goals. The first three options in the
figure are clear—separate address and data lines, wider data lines, and multiple-word transfers
all give higher performance at more cost.
The next item in the figure concerns the number of bus masters. These devices can initiate
a read or write transaction; the CPU, for instance, is always a bus master. A bus has multiple
masters when there are multiple CPUs or when I/O devices can initiate a bus transaction. With
multiple masters, a bus can offer higher bandwidth by using packets, as opposed to holding the
bus for the full transaction. This technique is called split transactions.
The final item in above figure, clocking, concerns whether a bus is synchronous or
asynchronous. If a bus is synchronous, it includes a clock in the control lines and a fixed protocol
for sending address and data relative to the clock. Since little or no logic is needed to decide
what to do next, these buses can be both fast and inexpensive.
Bus Standards
Standards that let the computer designer and I/O-device designer work independently
play a large role in buses. As long as both designers meet the requirements, any I/O device can
connect to any computer. The I/O bus standard is the document that defines how to connect
devices to computers.
Machines sometimes grow to be so popular that their I/O buses become de facto
standards; examples are the PDP-11 Unibus and the IBM PC-AT Bus. The intelligent peripheral
interface (IPI) and Ethernet are examples of standards that resulted from the cooperation of
manufacturers.
A typical interface of I/O devices and an I/O bus to the CPU-memory bus is shown in the figure
below:
The processor can interface with the I/O bus using two techniques: one based on interrupts and the
other based on memory-mapped I/O.
I/O Control Structures
Polling
Interrupts
DMA
I/O Controllers
I/O Processors
Polling : The simple interface, in which the CPU periodically checks status bits to see if it is
time for the next I/O operation, is called polling.
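A minimal polling loop in C might look like the following sketch; the register addresses and the ready-bit position are made up for illustration and do not correspond to a real device.

#include <stdint.h>

/* Hypothetical memory-mapped device registers (addresses are made up). */
#define DEV_STATUS (*(volatile uint32_t *)0x40001000)
#define DEV_DATA   (*(volatile uint32_t *)0x40001004)
#define READY_BIT  0x1u

uint32_t poll_read(void) {
    while ((DEV_STATUS & READY_BIT) == 0)
        ;                        /* busy-wait: the CPU keeps checking the status bit */
    return DEV_DATA;             /* device is ready: read the data */
}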
Interrupts : Interrupt-driven I/O, used by most systems for at least some devices, allows the
CPU to work on some other process while waiting for the I/O device. For example, the LP11 has
a mode that allows it to interrupt the CPU whenever the done bit or error bit is set. In general-
purpose applications, interrupt-driven I/O is the key to multitasking operating systems and good
response times.
DMA: The DMA hardware is a specialized processor that transfers data between memory and an
I/O device while the CPU goes on with other tasks. Thus, it is external to the CPU and must act
as a master on the bus. The CPU first sets up the DMA registers, which contain a memory
address and number of bytes to be transferred. More sophisticated DMA devices support
scatter/gather, whereby a DMA device can write or read data from a list of separate addresses.
Once the DMA transfer is complete, the DMA controller interrupts the CPU. There may be
multiple DMA devices in a computer system.
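A sketch of how a driver might program such DMA registers is shown below in C; the register layout, addresses, and control bits are hypothetical, not those of any real controller.

#include <stdint.h>

/* Hypothetical DMA controller register block (layout is made up). */
typedef struct {
    volatile uint32_t mem_addr;    /* starting memory address */
    volatile uint32_t byte_count;  /* number of bytes to transfer */
    volatile uint32_t control;     /* bit 0 = start, bit 1 = direction */
} dma_regs_t;

#define DMA ((dma_regs_t *)0x40002000)

void dma_start(uint32_t addr, uint32_t nbytes, int to_memory) {
    DMA->mem_addr   = addr;
    DMA->byte_count = nbytes;
    DMA->control    = 0x1u | (to_memory ? 0x2u : 0x0u);  /* kick off the transfer */
    /* The CPU now continues with other work; the controller raises an
       interrupt when the transfer completes. */
}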
The traditional scheme for tolerating disk failure, called mirroring or shadowing, uses twice
as many disks as RAID 0. Whenever data is written to one disk, that data is also written to a
redundant disk, so that there are always two copies of the information. If a disk fails, the system
just goes to the "mirror" to get the desired information. Mirroring is the most expensive RAID
solution, since it requires the most disks. When mirroring is combined with striping, the RAID
terminology calls striping across mirrored pairs RAID 1+0 or RAID 10 ("striped mirrors") and
mirroring two striped sets RAID 0+1 or RAID 01 ("mirrored stripes").
The cost of higher availability can be reduced to 1/N, where N is the number of disks in a
protection group. Rather than have a complete copy of the original data for each disk, we need
only add enough redundant information to restore the lost information on a failure. Reads or
writes go to all disks in the group, with one extra disk to hold the check information in case there
is a failure. RAID 3 is popular in applications with large data sets, such as multimedia and some
scientific codes.
Parity is one such scheme. Readers unfamiliar with parity can think of the redundant disk as
having the sum of all the data in the other disks. When a disk fails, then you subtract all the data
in the good disks from the parity disk; the remaining information must be the missing
information. Parity is simply the sum modulo two. The assumption behind this technique is that
failures are so rare that taking longer to recover from failure but reducing redundant storage is a
good trade-off.
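The sum modulo two is just a bitwise XOR across the disks in the protection group. The C sketch below shows parity computation and recovery of a failed disk; the number of data disks and the block size are illustrative assumptions.

#include <stdint.h>
#include <stddef.h>

#define NDISKS     4              /* data disks in the protection group */
#define BLOCK_SIZE 512            /* bytes per block (illustrative) */

/* The parity block is the XOR of the corresponding blocks on all data disks. */
void compute_parity(uint8_t data[NDISKS][BLOCK_SIZE], uint8_t parity[BLOCK_SIZE]) {
    for (size_t i = 0; i < BLOCK_SIZE; i++) {
        parity[i] = 0;
        for (int d = 0; d < NDISKS; d++)
            parity[i] ^= data[d][i];
    }
}

/* Recover a failed disk by XORing the parity with the surviving disks. */
void reconstruct(uint8_t data[NDISKS][BLOCK_SIZE], const uint8_t parity[BLOCK_SIZE],
                 int failed) {
    for (size_t i = 0; i < BLOCK_SIZE; i++) {
        uint8_t v = parity[i];
        for (int d = 0; d < NDISKS; d++)
            if (d != failed)
                v ^= data[d][i];
        data[failed][i] = v;
    }
}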
The distributed-parity organization (RAID 5) allows multiple writes to occur simultaneously as long
as the stripe units are not located on the same disks. For example, a write to block 8 on the right must also access its
parity block P2, thereby occupying the first and third disks. A second write to block 5 on the
right, implying an update to its parity block P1, accesses the second and fourth disks and thus
could occur at the same time as the write to block 8. Those same writes to the organization on the
left would result in changes to blocks P1 and P2, both on the fifth disk, which would be a
bottleneck.
Parity-based schemes protect against a single self-identifying failure. When correcting a single failure
is not sufficient, parity can be generalized to have a second calculation over the data and another
check disk of information. Yet another parity block is added to allow recovery from a second
failure. Thus, the storage overhead is twice that of RAID 5. There are six disk accesses to update
both P and Q information.
I/O throughput (bandwidth) – the amount of data that can move through the system in a certain time.
I/O response time (latency) – the total elapsed time to accomplish an input or output
operation. It is an especially important performance metric in real-time systems.
Many applications require both high throughput and short response times.