
EC6009 ADVANCED COMPUTER ARCHITECTURE

UNIT V MEMORY AND I/O


Cache Performance – Reducing Cache Miss Penalty and Miss Rate – Reducing Hit Time – Main Memory
and Performance – Memory Technology. Types of Storage Devices – Buses – RAID – Reliability,
Availability and Dependability – I/O Performance Measures.

CACHE PERFORMANCE
The Figure shows a multilevel memory hierarchy, including typical sizes and speeds of access.

A quick review of caches and their operation follows:

•  When a word is not found in the cache, the word must be fetched from a lower level in
the hierarchy (which may be another cache or the main memory) and placed in the cache before
continuing.
•  Multiple words, called a block (or line), are moved for efficiency reasons. Each cache
block includes a tag to indicate which memory address it corresponds to.


•  A key design decision is where blocks (or lines) can be placed in a cache:
(i) Set associative, where a set is a group of blocks in the cache. A block is first mapped
onto a set, and then the block can be placed anywhere within that set. Finding a block consists of
first mapping the block address to the set and then searching the set—usually in parallel—to find
the block. The set is chosen by the address of the data (a sketch of this mapping follows the list
below):
(Block address) MOD (Number of sets in cache)
If there are n blocks in a set, the cache placement is called n-way set associative.
(ii) A direct-mapped cache has just one block per set (so a block is always placed in the
same location)
(iii) A fully associative cache has just one set (so a block can be placed anywhere).
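As a rough sketch of the set mapping in (i) above, the following C fragment splits a byte address into a set index and a tag; the block size, number of sets, and 64-bit addresses are illustrative assumptions, not values from the text.

    #include <stdint.h>

    #define BLOCK_SIZE 64     /* bytes per block (assumed)   */
    #define NUM_SETS   256    /* sets in the cache (assumed) */

    /* Split a byte address into its set index and tag. */
    static void map_address(uint64_t addr, uint64_t *set_index, uint64_t *tag)
    {
        uint64_t block_addr = addr / BLOCK_SIZE;   /* drop the block offset                   */
        *set_index = block_addr % NUM_SETS;        /* (Block address) MOD (Number of sets)    */
        *tag       = block_addr / NUM_SETS;        /* remaining high-order bits become the tag */
    }

A direct-mapped cache is the special case with one block per set, and a fully associative cache the case with a single set.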
•  Caching data that is only read is easy, since the copy in the cache and memory will be identical.
•  Caching writes is more difficult; for example, how can the copy in the cache and memory
be kept consistent? There are two main strategies. A write-through cache updates the item in the
cache and writes through to update main memory. A write-back cache only updates the copy in
the cache. When the block is about to be replaced, it is copied back to memory. Both write
strategies can use a write buffer to allow the cache to proceed as soon as the data are placed in
the buffer rather than waiting the full latency to write the data into memory.
•  A measure of average memory access time:
Average memory access time = Hit time + Miss rate x Miss penalty
Hit time is the time to hit in the cache.
Miss penalty is the time to replace the block from memory (that is, the cost of a miss).
Miss rate is simply the fraction of cache accesses that result in a miss—that is, the number of
accesses that miss divided by the number of accesses.
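To make the formula concrete, a minimal sketch in C that evaluates it; the hit time, miss rate, and miss penalty values are assumed for illustration only.

    #include <stdio.h>

    int main(void)
    {
        /* Assumed example parameters, not figures from the text. */
        double hit_time     = 1.0;     /* cycles to hit in the cache         */
        double miss_rate    = 0.05;    /* fraction of accesses that miss     */
        double miss_penalty = 100.0;   /* cycles to fetch the block from memory */

        /* Average memory access time = Hit time + Miss rate x Miss penalty */
        double amat = hit_time + miss_rate * miss_penalty;
        printf("AMAT = %.1f cycles\n", amat);   /* 1 + 0.05 * 100 = 6 cycles */
        return 0;
    }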
The three Cs model sorts all misses into three simple categories:
•  Compulsory - The very first access to a block cannot be in the cache, so the block must
be brought into the cache. Compulsory misses are those that occur even if you had an
infinite-sized cache.
•  Capacity - If the cache cannot contain all the blocks needed during execution of a
program, capacity misses (in addition to compulsory misses) will occur because of blocks being
discarded and later retrieved.
•  Conflict - If the block placement strategy is not fully associative, conflict misses (in
addition to compulsory and capacity misses) will occur because a block may be discarded and
later retrieved if multiple blocks map to its set and accesses to the different blocks are
intermingled.

Cache optimizations are done based on these metrics:


1. Reducing the hit time - Small and simple first-level caches and way-prediction. Both
techniques also generally decrease power consumption.
2. Reducing the miss penalty - Critical word first and merging write buffers. These
optimizations have little impact on power.
3. Reducing the miss rate - Compiler optimizations. Obviously any improvement at
compile time improves power consumption.
4. Reducing the miss penalty or miss rate via parallelism - Hardware prefetching and
compiler prefetching. These optimizations generally increase power consumption, primarily due
to prefetched data that are unused.

1. REDUCING HIT TIME :

(i) Small and Simple First-Level Caches to Reduce Hit Time


The pressure of both a fast clock cycle and power limitations encourages limited size for
first-level caches. Similarly, use of lower levels of associativity can reduce both hit time and
power, although such trade-offs are more complex than those involving size.
The critical timing path in a cache hit is the three-step process of addressing the tag
memory using the index portion of the address, comparing the read tag value to the address, and
setting the multiplexor to choose the correct data item if the cache is set associative. Direct-
mapped caches can overlap the tag check with the transmission of the data, effectively reducing
hit time. Furthermore, lower levels of associativity will usually reduce power because fewer
cache lines must be accessed.
Although the total amount of on-chip cache has increased dramatically with new generations
of microprocessors, due to the clock rate impact arising from a larger L1 cache, the size of the L1
caches has recently increased either slightly or not at all. In many recent processors, designers
have opted for more associativity rather than larger caches.

(ii) Way Prediction to Reduce Hit Time

Another approach reduces conflict misses and yet maintains the hit speed of a direct-mapped
cache. In way prediction, extra bits are kept in the cache to predict the way, or block within the
set, of the next cache access.
Added to each block of a cache are block predictor bits. The bits select which of the blocks to
try on the next cache access. If the predictor is correct, the cache access latency is the fast hit
time. If not, it tries the other block, changes the way predictor, and has a latency of one extra
clock cycle. Simulations suggest that set prediction accuracy is in excess of 90% for a two-way
set associative cache and 80% for a four-way set associative cache.
An extended form of way prediction can also be used to reduce power consumption by using
the way prediction bits to decide which cache block to actually access. The way prediction bits
are essentially extra address bits. This approach, which might be called way selection, saves
power when the way prediction is correct but adds significant time on a way misprediction.
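A minimal sketch of the way-prediction idea in C for a two-way set associative cache; the data structures, array sizes, and return-value convention are illustrative assumptions, not a description of real hardware.

    #define WAYS 2
    struct cache_set { unsigned long tag[WAYS]; int valid[WAYS]; };
    static struct cache_set  sets[256];           /* 256 sets assumed             */
    static unsigned char     predicted_way[256];  /* way to try first, per set    */

    /* Returns cycles taken: 1 = fast (predicted) hit, 2 = slow hit, -1 = miss. */
    int lookup(unsigned set, unsigned long tag)
    {
        unsigned first = predicted_way[set];       /* try the predicted way first  */
        if (sets[set].valid[first] && sets[set].tag[first] == tag)
            return 1;                              /* predictor correct: fast hit  */

        unsigned other = 1 - first;                /* two-way: try the other block */
        predicted_way[set] = (unsigned char)other; /* change the way predictor     */
        if (sets[set].valid[other] && sets[set].tag[other] == tag)
            return 2;                              /* hit, but one extra clock cycle */
        return -1;                                 /* miss: fetch from lower level */
    }

Way selection would additionally use predicted_way to decide which way's data array to power up on the access.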

2. REDUCING THE MISS PENALTY and MISS RATE:

(i) Critical Word First and Early Restart to Reduce Miss Penalty

This technique is based on the observation that the processor normally needs just one word of the
block at a time. This strategy is impatience: Don’t wait for the full block to be loaded before
sending the requested word and restarting the processor. Here are two specific strategies:
•  Critical word first - Request the missed word first from memory and send it to the
processor as soon as it arrives; let the processor continue execution while filling the rest of the
words in the block.
•  Early restart - Fetch the words in normal order, but as soon as the requested word of the
block arrives, send it to the processor and let the processor continue execution.


Generally, these techniques only benefit designs with large cache blocks, since the
benefit is low unless blocks are large. Note that caches normally continue to satisfy accesses to
other blocks while the rest of the block is being filled.

(ii) Merging Write Buffer to Reduce Miss Penalty

Write-through caches rely on write buffers, as all stores must be sent to the next lower level
of the hierarchy. Even write-back caches use a simple buffer when a block is replaced. If the
write buffer is empty, the data and the full address are written in the buffer, and the write is
finished from the processor’s perspective. The processor continues working while the write
buffer prepares to write the word to memory. If the buffer contains other modified blocks, the
addresses can be checked to see if the address of the new data matches the address of a valid
write buffer entry. If so, the new data are combined with that entry. This optimization is called
write merging. The Intel Core i7, among many others, uses write merging.
If the buffer is full and there is no address match, the cache (and processor) must wait until
the buffer has an empty entry. This optimization uses the memory more efficiently since
multiword writes are usually faster than writes performed one word at a time.

Write merging is illustrated in the figure below, which shows the same write buffer without write
merging (top) and with write merging (bottom). The buffer has four entries, and each entry holds
four 64-bit words. The address for each entry is on the left, with a valid bit (V) indicating
whether the next sequential 8 bytes in this entry are occupied. With write merging, the four
writes are merged into a single buffer entry; without it, the buffer is full even though
three-fourths of each entry is wasted.
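A rough sketch of the merging check described above, written in C; the entry layout, sizes, and function name are assumptions for illustration, not the actual hardware organization.

    #include <stdint.h>
    #include <string.h>

    #define ENTRIES          4
    #define WORDS_PER_ENTRY  4                    /* four 64-bit words per entry  */

    struct wb_entry {
        uint64_t block_addr;                      /* address of the aligned block */
        uint64_t word[WORDS_PER_ENTRY];
        int      valid[WORDS_PER_ENTRY];          /* per-word valid bits (V)      */
        int      in_use;
    };
    static struct wb_entry buf[ENTRIES];

    /* Try to place one 64-bit store in the write buffer; returns 1 if accepted. */
    int write_buffer_put(uint64_t addr, uint64_t data)
    {
        uint64_t block  = addr / (8 * WORDS_PER_ENTRY);
        unsigned offset = (unsigned)((addr / 8) % WORDS_PER_ENTRY);

        for (int i = 0; i < ENTRIES; i++)         /* check for an address match   */
            if (buf[i].in_use && buf[i].block_addr == block) {
                buf[i].word[offset]  = data;      /* merge into the existing entry */
                buf[i].valid[offset] = 1;
                return 1;
            }

        for (int i = 0; i < ENTRIES; i++)         /* otherwise take an empty entry */
            if (!buf[i].in_use) {
                memset(&buf[i], 0, sizeof buf[i]);
                buf[i].in_use        = 1;
                buf[i].block_addr    = block;
                buf[i].word[offset]  = data;
                buf[i].valid[offset] = 1;
                return 1;
            }
        return 0;                                 /* buffer full: processor must wait */
    }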


(iii) Nonblocking Caches to Reduce Miss Penalty

For pipelined computers that allow out-of-order execution, the processor need not stall on a
data cache miss. For example, the processor could continue fetching instructions from the
instruction cache while waiting for the data cache to return the missing data. A nonblocking
cache or lockup-free cache escalates the potential benefits of such a scheme by allowing the data
cache to continue to supply cache hits during a miss. This “hit under miss” optimization reduces
the effective miss penalty by being helpful during a miss instead of ignoring the requests of the
processor.

(iv) Compiler Optimizations to Reduce Miss Rate

This technique reduces miss rates without any hardware changes. This magical reduction comes
from optimized software. The increasing performance gap between processors and main memory
has inspired compiler writers to scrutinize the memory hierarchy to see if compile time
optimizations can improve performance. The optimizations presented below are found in many
modern compilers.
Loop interchange:
Some programs have nested loops that access data in memory in nonsequential order.
Simply exchanging the nesting of the loops can make the code access the data in the order in
which they are stored. Assuming the arrays do not fit in the cache, this technique reduces misses.
Reordering maximizes use of data in a cache block before they are discarded.
For example, if x is a two-dimensional array of size [5000,100] allocated so that x[i,j] and
x[i,j+1] are adjacent, then the two pieces of code below show how the accesses can be optimized:
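A sketch of the two versions in C (row-major layout, so x[i][j] and x[i][j+1] are adjacent; the exact loop bodies are the standard textbook form and an assumption here):

    /* x[i][j] and x[i][j+1] are adjacent in memory (row-major C layout). */
    int x[5000][100];

    /* Before: the inner loop walks down a column, striding through memory
       in steps of 100 words, touching a new cache block on almost every access. */
    void before(void)
    {
        for (int j = 0; j < 100; j = j + 1)
            for (int i = 0; i < 5000; i = i + 1)
                x[i][j] = 2 * x[i][j];
    }

    /* After loop interchange: the inner loop walks along a row, so all the
       words in a cache block are used before the block is discarded. */
    void after(void)
    {
        for (int i = 0; i < 5000; i = i + 1)
            for (int j = 0; j < 100; j = j + 1)
                x[i][j] = 2 * x[i][j];
    }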

The original code would skip through memory in strides of 100 words, while the revised
version accesses all the words in one cache block before going to the next block. This
optimization improves cache performance without affecting the number of instructions executed.

(v) Hardware Prefetching of Instructions and Data to Reduce Miss Rate or Miss
Penalty

Another approach to reduce miss penalty is to prefetch items before the processor
requests them. Both instructions and data can be prefetched, either directly into the caches or into
an external buffer that can be more quickly accessed than main memory.

Instruction prefetch is frequently done in hardware outside of the cache. Typically, the
processor fetches two blocks on a miss: the requested block and the next consecutive block. The
requested block is placed in the instruction cache when it returns, and the prefetched block is
placed into the instruction stream buffer. If the requested block is present in the instruction
stream buffer, the original cache request is canceled, the block is read from the stream buffer,
and the next prefetch request is issued.

(vi) Compiler-Controlled Prefetching to Reduce Miss Rate or Miss Penalty

An alternative to hardware prefetching is for the compiler to insert prefetch instructions to


request data before the processor needs it. There are two flavors of prefetch:
•  Register prefetch will load the value into a register.
•  Cache prefetch loads data only into the cache and not the register.
Either of these can be faulting or nonfaulting; that is, the address does or does not cause an
exception for virtual address faults and protection violations. The most effective prefetch is
“semantically invisible” to a program: It doesn’t change the contents of registers and memory,
and it cannot cause virtual memory faults. Most processors today offer nonfaulting cache
prefetches.
Prefetching makes sense only if the processor can proceed while prefetching the data.
The caches do not stall but continue to supply instructions and data while waiting for the
prefetched data to return.
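A minimal sketch of what compiler-inserted nonfaulting cache prefetches look like, written here with GCC's __builtin_prefetch; the prefetch distance of 16 elements and the function itself are assumptions for illustration.

    /* Sum an array, prefetching data a fixed distance ahead of its use. */
    double sum(const double *a, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++) {
            /* Nonfaulting cache prefetch: only a hint, so prefetching past the
               end of the array cannot cause an exception. */
            __builtin_prefetch(&a[i + 16], 0 /* read */, 1 /* low temporal locality */);
            s += a[i];
        }
        return s;
    }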

MAIN MEMORY
Main memory is the name given to the level below the cache(s) in the memory hierarchy.
A main memory may have a few MBytes for a typical Personal Computer, tens to hundreds of
MBytes for a workstation, hundreds of MBytes to GBytes for supercomputers. The capacity of
main memory has continuously increased over the years, as prices have dramatically dropped.
The main memory must satisfy the cache requests as quickly as possible, and must provide
sufficient bandwidth for I/O devices and for vector units.
The access time is defined as the time between the moment the read command is issued
and the moment the requested data is at the outputs. The cycle time is defined as the minimum time
between successive accesses to memory. The cycle time is usually greater than the access time.
Main memory is DRAM. To access data in a memory chip with a capacity of N x M bits,
one must provide a number of address bits equal to log2(N), where
N is the number of “words” each chip has and each “word” is M bits wide. As the technology
improved, packaging costs became a real concern, as the number of address lines grew greater
and greater.
DRAM is dynamic since it needs to be refreshed periodically (8 ms).
Addresses are divided into two halves:
» RAS, or Row Access Strobe
» CAS, or Column Access Strobe


Logic diagram of DRAM

• Control Signals (RAS_L, CAS_L, WE_L, OE_L) are all active low
• Din and Dout are combined (D):
– When WE_L is asserted (Low) and OE_L is deasserted (High),
» D serves as the data input pin
– When WE_L is deasserted (High) and OE_L is asserted (Low),
» D is the data output pin
• Row and column addresses share the same pins (A)
– RAS_L goes low: Pins A are latched in as row address
– CAS_L goes low: Pins A are latched in as column address
– RAS/CAS edge-sensitive
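A small sketch of how a memory controller might split a word address into the row and column halves that are time-multiplexed on pins A (the column-address width is an assumption):

    #include <stdint.h>

    #define COL_BITS 10                              /* assumed column-address width */

    /* Split a DRAM word address into the two halves driven on the address pins:
       the row half is latched when RAS_L falls, the column half when CAS_L falls. */
    static void split_address(uint32_t word_addr, uint32_t *row, uint32_t *col)
    {
        *col = word_addr & ((1u << COL_BITS) - 1);   /* low-order bits: column address  */
        *row = word_addr >> COL_BITS;                /* high-order bits: row address    */
    }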


STORAGE SYSTEM – TYPES OF STORAGE DEVICES


There are various types of Storage devices such as magnetic disks, magnetic tapes, automated
tape libraries, CDs, and DVDs.
(i) Magnetic disks
Magnetic disks have dominated nonvolatile storage since 1965.
Magnetic disks play two roles in computer systems:
•  Long-term, nonvolatile storage for files, even when no programs are running
•  A level of the memory hierarchy below main memory, used as a backing store for
virtual memory during program execution.

A magnetic disk consists of a collection of platters (generally 1 to 12), rotating on a
spindle at 3,600 to 15,000 revolutions per minute (RPM). These platters are metal or glass disks
covered with magnetic recording material on both sides, so 10 platters have 20 recording
surfaces. The disk surface is divided into concentric circles, designated tracks. There are
typically 5,000 to 30,000 tracks on each surface. Each track in turn is divided into sectors that
contain the information; a track might have 100 to 500 sectors. A sector is the smallest unit that
can be read or written. IBM mainframes allow users to select the size of the sectors, although
most systems fix their size, typically at 512 bytes of data. The sequence recorded on the
magnetic media is a sector number, a gap, the information for that sector including error
correction code, a gap, the sector number of the next sector, and so on. To read and write
information into a sector, a movable arm containing a read/ write head is located over each
surface. To read or write a sector, the disk controller sends a command to move the arm over the
proper track. This operation is called a seek, and the time to move the arm to the desired track is
called seek time.


Average seek time is the subject of considerable misunderstanding. Disk manufacturers report
minimum seek time, maximum seek time, and average seek time in their manuals. The first two
are easy to measure, but the average is open to wide interpretation. The time for the requested
sector to rotate under the head is the rotation latency or rotational delay.

(ii) Optical Disks:

One challenger to magnetic disks is optical compact disks, or CDs, and their successors,
called Digital Video Discs and then Digital Versatile Discs, or just DVDs. Both CD-ROM
and DVD-ROM are removable and inexpensive to manufacture, but they are read-only media.
These 4.7-inch diameter disks hold 0.65 and 4.7 GB, respectively, although some DVDs write on
both sides to double their capacity. Their high capacity and low cost have led to CD-ROMs and
DVD-ROMs replacing floppy disks as the favorite medium for distributing software and other
types of computer data.
The popularity of CDs and of music that can be downloaded from the Web led to a market for rewritable
CDs, conveniently called CD-RW, and write-once CDs, called CD-R. In 2001, there is a small
cost premium for drives that can record on CD-RW. The media itself costs about $0.20 per CD-R
disk or $0.60 per CD-RW disk. CD-RWs and CD-Rs read at about half the speed of CD-ROMs,
and they write at about a quarter of the speed of CD-ROMs.

(iii) Magnetic Tape:

Magnetic tapes have been part of computer systems as long as disks because they use
similar technology to disks, and hence historically have followed the same density
improvements. The inherent cost/performance difference between disks and tapes is
based on their geometries:
•  Fixed rotating platters offer random access in milliseconds, but disks have a limited
storage area and the storage medium is sealed within each reader.
•  Long strips wound on removable spools of “unlimited” length mean many tapes
can be used per reader, but tapes require sequential access that can take seconds.
One of the limits of tapes had been the speed at which the tapes can spin without
breaking or jamming. A technology called helical scan tapes solves this problem by
keeping the tape speed the same but recording the information on a diagonal to the tape
with a tape reader that spins much faster than the tape is moving. This technology
increases recording density by about a factor of 20 to 50. Helical scan tapes were
developed for low-cost VCRs and camcorders, which brought down the cost of the tapes
and readers.

(iv) Flash Memory

Embedded devices also need nonvolatile storage, but premiums placed on space and power
normally lead to the use of Flash memory instead of magnetic recording. Flash memory is also
used as a rewritable ROM in embedded systems, typically to allow software to be upgraded
without having to replace chips. Applications are typically prohibited from writing to Flash
memory in such circumstances. Like electrically erasable and programmable read-only memories
(EEPROM), Flash memory is written by inducing the tunneling of charge from the transistor channel to
a floating gate. The floating gate acts as a potential well which stores the charge, and the charge
cannot move from there without applying an external force. The primary difference between
EEPROM and Flash memory is that Flash restricts write to multikilobyte blocks, increasing
memory capacity per chip by reducing area dedicated to control. Compared to disks, Flash
memories offer low power consumption (less than 50 milliwatts), can be sold in small sizes, and
offer read access times comparable to DRAMs. In 2001, a 16 Mbit Flash memory has a 65 ns
access time, and a 128 Mbit Flash memory has a 150 ns access time.

BUSES

Connecting I/O Devices to CPU/Memory

Buses were traditionally classified as CPU-memory buses or I/O buses. I/O buses may be
lengthy, may have many types of devices connected to them, have a wide range in the data
bandwidth of the devices connected to them, and normally follow a bus standard. CPU-memory
buses, on the other hand, are short, generally high speed, and matched to the memory system to
maximize memory-CPU bandwidth. During the design phase, the designer of a CPU-memory
bus knows all the types of devices that must connect together, while the I/O bus designer must
accept devices varying in latency and bandwidth capabilities. To lower costs, some computers
have a single bus for both memory and I/O devices. In the quest for higher I/O performance,
some buses are a hybrid of the two. For example, PCI is relatively short, and is used to connect
to more traditional I/O buses via bridges that speak both PCI on one end and the I/O bus protocol
on the other. To indicate their intermediate state, such buses are sometimes called mezzanine buses.

Bus Design Decisions

The design of a bus presents several options, as Figure below shows. Like the rest of the
computer system, decisions depend on cost and performance goals. The first three options in the
figure are clear—separate address and data lines, wider data lines, and multiple-word transfers
all give higher performance at more cost.


The next item in the figure concerns the number of bus masters. These devices can initiate
a read or write transaction; the CPU, for instance, is always a bus master. A bus has multiple
masters when there are multiple CPUs or when I/O devices can initiate a bus transaction. With
multiple masters, a bus can offer higher bandwidth by using packets, as opposed to holding the
bus for the full transaction. This technique is called split transactions.
The final item in above figure, clocking, concerns whether a bus is synchronous or
asynchronous. If a bus is synchronous, it includes a clock in the control lines and a fixed protocol
for sending address and data relative to the clock. Since little or no logic is needed to decide
what to do next, these buses can be both fast and inexpensive.

Bus Standards

Standards that let the computer designer and I/O-device designer work independently
play a large role in buses. As long as both designers meet the requirements, any I/O device can
connect to any computer. The I/O bus standard is the document that defines how to connect
devices to computers.
Machines sometimes grow to be so popular that their I/O buses become de facto
standards; examples are the PDP-11 Unibus and the IBM PC-AT Bus. The intelligent peripheral
interface (IPI) and Ethernet are examples of standards that resulted from the cooperation of
manufacturers.

Interfacing Storage Devices to the CPU

A typical interface of I/O devices and an I/O bus to the CPU-memory bus is shown in the figure
below:


The processor can interface with the I/O bus using two techniques: one using interrupts and the
other using memory-mapped I/O.
I/O Control Structures
•  Polling
•  Interrupts
•  DMA
•  I/O Controllers
•  I/O Processors
Polling : The simple interface, in which the CPU periodically checks status bits to see if it is
time for the next I/O operation, is called polling.
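A minimal sketch of polling a memory-mapped device in C; the register addresses, bit layout, and function name are hypothetical, chosen only to illustrate the idea.

    #include <stdint.h>

    #define DEV_STATUS  ((volatile uint32_t *)0x40001000u)  /* hypothetical status register */
    #define DEV_DATA    ((volatile uint32_t *)0x40001004u)  /* hypothetical data register   */
    #define READY_BIT   0x1u

    /* Busy-wait until the device reports it is ready, then read one word. */
    uint32_t poll_and_read(void)
    {
        while ((*DEV_STATUS & READY_BIT) == 0)
            ;                               /* CPU repeatedly checks the status bit */
        return *DEV_DATA;
    }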
Interrupts : Interrupt-driven I/O, used by most systems for at least some devices, allows the
CPU to work on some other process while waiting for the I/O device. For example, the LP11 has
a mode that allows it to interrupt the CPU whenever the done bit or error bit is set. In general-
purpose applications, interrupt-driven I/O is the key to multitasking operating systems and good
response times.
DMA: The DMA hardware is a specialized processor that transfers data between memory and an
I/O device while the CPU goes on with other tasks. Thus, it is external to the CPU and must act
as a master on the bus. The CPU first sets up the DMA registers, which contain a memory
address and number of bytes to be transferred. More sophisticated DMA devices support
scatter/gather, whereby a DMA device can write or read data from a list of separate addresses.
Once the DMA transfer is complete, the DMA controller interrupts the CPU. There may be
multiple DMA devices in a computer system.
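A sketch of the register setup this paragraph describes; the register names, addresses, and control bit are hypothetical and stand in for whatever a real DMA controller defines.

    #include <stdint.h>

    /* Hypothetical memory-mapped DMA controller registers. */
    #define DMA_ADDR   ((volatile uint32_t *)0x40002000u)  /* memory address      */
    #define DMA_COUNT  ((volatile uint32_t *)0x40002004u)  /* number of bytes     */
    #define DMA_CTRL   ((volatile uint32_t *)0x40002008u)  /* control / start bit */
    #define DMA_START  0x1u

    void start_dma_transfer(uint32_t mem_addr, uint32_t nbytes)
    {
        *DMA_ADDR  = mem_addr;    /* CPU sets up the DMA registers with a memory address */
        *DMA_COUNT = nbytes;      /* ...and the number of bytes to be transferred        */
        *DMA_CTRL  = DMA_START;   /* transfer proceeds while the CPU does other work;
                                     the controller interrupts the CPU when it completes */
    }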


RAID : REDUNDANT ARRAYS OF INEXPENSIVE DISKS


An innovation that improves both dependability and performance of storage systems is
disk arrays. One argument for arrays is that potential throughput can be increased by having
many disk drives and, hence, many disk arms, rather than one large drive with one disk arm.
Although a disk array would have more faults than a smaller number of larger disks when each
disk has the same reliability, dependability can be improved by adding redundant disks to the
array to tolerate faults. That is, if a single disk fails, the lost information can be reconstructed
from redundant information. The only danger is in having another disk fail between the time the
first disk fails and the time it is replaced (termed mean time to repair, or MTTR). Since the mean
time to failure (MTTF) of disks is tens of years, and the MTTR is measured in hours, redundancy
can make the measured reliability of 100 disks much higher than that of a single disk. These
systems have become known by the acronym RAID, standing originally for redundant array of
inexpensive disks, although some have renamed it to redundant array of independent disks.
The several approaches to redundancy have different overhead and performance. Figure
below shows the standard RAID levels. It shows how eight disks of user data must be
supplemented by redundant or check disks at each RAID level. It also shows the minimum
number of disk failures that a system would survive.

(i) No Redundancy (RAID 0)

This notation refers to a disk array in which data is
striped but there is no redundancy to tolerate disk failure. Striping across a set of disks makes the
collection appear to software as a single large disk, which simplifies storage management. It also
improves performance for large accesses, since many disks can operate at once. Video editing
systems, for example, often stripe their data.
The term RAID 0 is something of a misnomer: there is no redundancy, it is not in the original
RAID taxonomy, and striping predates RAID. However, RAID levels are often left to the
operator to set when creating a storage system, and RAID 0 is often listed as one of the options.
Hence, the term RAID 0 has become widely used.

(ii) Mirroring (RAID 1)

This traditional scheme for tolerating disk failure, called mirroring or shadowing, uses twice
as many disks as does RAID 0. Whenever data is written to one disk, that data is also written to a
redundant disk, so that there are always two copies of the information. If a disk fails, the system
just goes to the “mirror” to get the desired information. Mirroring is the most expensive RAID
solution, since it requires the most disks. When mirroring is combined with striping, the data can
either be striped across mirrored pairs or mirrored across two striped sets; the RAID terminology
has evolved to call the former RAID 1+0 or RAID 10 (“striped mirrors”) and the latter RAID 0+1
or RAID 01 (“mirrored stripes”).

(iii) Bit-Interleaved Parity (RAID 3)

The cost of higher availability can be reduced to 1/N, where N is the number of disks in a
protection group. Rather than have a complete copy of the original data for each disk, we need
only add enough redundant information to restore the lost information on a failure. Reads or
writes go to all disks in the group, with one extra disk to hold the check information in case there
is a failure. RAID 3 is popular in applications with large data sets, such as multimedia and some
scientific codes.
Parity is one such scheme. Readers unfamiliar with parity can think of the redundant disk as
having the sum of all the data in the other disks. When a disk fails, then you subtract all the data
in the good disks from the parity disk; the remaining information must be the missing
information. Parity is simply the sum modulo two. The assumption behind this technique is that
failures are so rare that taking longer to recover from failure but reducing redundant storage is a
good trade-off.
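The “sum modulo two” is just exclusive-OR, so reconstruction can be sketched in a few lines of C; the group size and block size below are assumptions for illustration.

    #include <stddef.h>
    #include <stdint.h>

    #define NDISKS     5                    /* data disks in the protection group (assumed) */
    #define BLOCK_SIZE 512                  /* bytes per block (assumed)                    */

    /* Rebuild the failed disk's block: XOR the parity block with the
       corresponding blocks of all the surviving data disks. */
    void reconstruct(uint8_t blocks[NDISKS][BLOCK_SIZE],
                     const uint8_t parity[BLOCK_SIZE],
                     int failed)
    {
        for (size_t i = 0; i < BLOCK_SIZE; i++) {
            uint8_t b = parity[i];
            for (int d = 0; d < NDISKS; d++)
                if (d != failed)
                    b ^= blocks[d][i];      /* “subtract” the good disks' data            */
            blocks[failed][i] = b;          /* what remains is the missing information    */
        }
    }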

(iv) Block-Interleaved Parity and Distributed Block-Interleaved Parity (RAID 4 and
RAID 5)
In RAID 3, every access went to all disks. Some applications would prefer to do smaller
accesses, allowing independent accesses to occur in parallel. That is the purpose of the next
RAID levels. Since error-detection information in each sector is checked on reads to see if data is
correct, such “small reads” to each disk can occur independently as long as the minimum access
is one sector.
Writes are another matter. It would seem that each small write would demand that all
other disks be accessed to read the rest of the information needed to recalculate the new parity. A
“small write” would require reading the old data and old parity, adding the new information, and
then writing the new parity to the parity disk and the new data to the data disk.
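Because parity is XOR, the new parity can be computed from just the old data, the old parity, and the new data, without reading the other data disks; a minimal per-byte sketch (block layout assumed):

    #include <stddef.h>
    #include <stdint.h>

    /* RAID 4/5 small write: new parity = old parity XOR old data XOR new data. */
    void small_write_parity(uint8_t *parity, const uint8_t *old_data,
                            const uint8_t *new_data, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            parity[i] ^= old_data[i] ^ new_data[i];
    }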
RAID 4 efficiently supports a mixture of large reads, large writes, small reads, and small writes.
One drawback to the system is that the parity disk must be updated on every write, so it is the
bottleneck for back-to-back writes. To fix the parity-write bottleneck, the parity information can
be spread throughout all the disks so that there is no single bottleneck for writes. The distributed
parity organization is RAID 5.
The figure below shows how data are distributed in RAID 4 vs. RAID 5. As the organization on
the right shows, in RAID 5 the parity associated with each row of data blocks is no longer
restricted to a single disk.

This organization allows multiple writes to occur simultaneously as long as the stripe units are
not located in the same disks. For example, a write to block 8 on the right must also access its
parity block P2, thereby occupying the first and third disks. A second write to block 5 on the
right, implying an update to its parity block P1, accesses the second and fourth disks and thus
could occur at the same time as the write to block 8. Those same writes to the organization on the
left would result in changes to blocks P1 and P2, both on the fifth disk, which would be a
bottleneck.

(v) P+Q redundancy (RAID 6)

Parity-based schemes protect against a single, self-identifying failure. When tolerating a single failure
is not sufficient, parity can be generalized to have a second calculation over the data and another
check disk of information. Yet another parity block is added to allow recovery from a second
failure. Thus, the storage overhead is twice that of RAID 5. There are six disk accesses to update
both P and Q information.

Reliability, Availability and Dependability: Refer to Unit I notes


I/O PERFORMANCE MEASURES


•  I/O bandwidth (throughput) – the amount of information that can be input (or output) and
communicated across an interconnect (e.g., a bus) to the processor/memory (or I/O device)
per unit time. It can be measured in two ways:

1. The amount of data that can move through the system in a certain time.

2. The number of I/O operations that can be done per unit time.

•  I/O response time (latency) – the total elapsed time to accomplish an input or output
operation. It is an especially important performance metric in real-time systems.
•  Many applications require both high throughput and short response times.

Throughput versus Response time
