Lec38: Dependability
UC Berkeley CS61C: Computer Architecture (a.k.a. Machine Structures)
Teaching Professor Dan Garcia, Professor Bora Nikolić
cs61c.org
6 Great Ideas in Computer Architecture
1. Abstraction (Layers of Representation/Interpretation)
2. Moore’s Law
3. Principle of Locality/Memory Hierarchy
4. Parallelism
5. Performance Measurement & Improvement
6. Dependability via Redundancy
Computers Fail…
May fail transiently… or permanently
[Diagram: three redundant units each compute 1+1=2; the answer is accepted when 2 of 3 agree]
Failure: deviation from the specified service
Restoration: return to the specified service
The period between a failure and its restoration is a service interruption
Dependability via Redundancy: Time vs. Space
Spatial Redundancy – replicated data, check information, or hardware to handle hard and soft (transient) failures
Temporal Redundancy – redundancy in time (retry) to handle soft (transient) failures
Dependability Measures
Reliability: Mean Time To Failure (MTTF)
Service interruption: Mean Time To Repair (MTTR)
Mean time between failures (MTBF)
MTBF = MTTF + MTTR
Availability = MTTF / (MTTF + MTTR)
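To make the relationships concrete, here is a minimal Python sketch of these formulas; the MTTF/MTTR numbers are made up for illustration, not from the lecture.

```python
# MTBF = MTTF + MTTR;  Availability = MTTF / (MTTF + MTTR)

def mtbf(mttf_hours: float, mttr_hours: float) -> float:
    """Mean Time Between Failures."""
    return mttf_hours + mttr_hours

def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the service is up."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Hypothetical numbers: a part that fails about every 10,000 hours and takes 10 hours to repair.
print(f"MTBF         = {mtbf(10_000, 10):,.0f} hours")    # 10,010 hours
print(f"Availability = {availability(10_000, 10):.3%}")   # 99.900%
```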
Improving Availability
Increase MTTF: more reliable hardware/software + fault tolerance
Reduce MTTR: improved tools and processes for diagnosis and repair
Availability Measures
Availability = MTTF / (MTTF + MTTR) as %
MTTF, MTBF usually measured in hours
Since we hope systems are rarely down, the shorthand is “number of 9s of availability per year”
1 nine: 90% => 36 days of repair/year
2 nines: 99% => 3.6 days of repair/year
3 nines: 99.9% => 526 minutes of repair/year
4 nines: 99.99% => 53 minutes of repair/year
5 nines: 99.999% => 5 minutes of repair/year
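As a sanity check on the table above, a short Python sketch that converts a number of 9s into downtime per year:

```python
# Convert "number of 9s" of availability into downtime per year (cf. the table above).
HOURS_PER_YEAR = 365 * 24  # 8760

for nines in range(1, 6):
    avail = 1 - 10 ** (-nines)            # 1 nine = 90%, 2 nines = 99%, ...
    downtime_hours = (1 - avail) * HOURS_PER_YEAR
    print(f"{nines} nine(s) = {avail:.3%}: "
          f"{downtime_hours / 24:6.2f} days or {downtime_hours * 60:7.1f} minutes of downtime/year")
```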
Reliability Measures
Another measure is the average number of failures per year: the Annualized Failure Rate (AFR)
E.g., 1000 disks with 100,000 hour MTTF
365 days * 24 hours = 8760 hours
(1000 disks * 8760 hrs/year) / 100,000
= 87.6 failed disks per year on average
87.6/1000 = 8.76% annual failure rate
Google’s 2007 study* found that actual AFRs for individual drives ranged from 1.7% for first-year drives to over 8.6% for three-year-old drives
*research.google.com/archive/disk_failures.pdf
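The slide’s AFR arithmetic, reproduced as a small Python sketch:

```python
# The AFR worked example: 1000 disks, each with a 100,000-hour MTTF.
HOURS_PER_YEAR = 365 * 24                        # 8760 hours

n_disks = 1000
mttf_hours = 100_000
failures_per_year = n_disks * HOURS_PER_YEAR / mttf_hours
afr = failures_per_year / n_disks                # annualized failure rate per disk

print(f"{failures_per_year:.1f} failed disks per year on average")  # 87.6
print(f"AFR = {afr:.2%}")                                           # 8.76%
```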
Hard Drive Failures
[Chart: annualized hard-drive failure rates]
Failures In Time (FIT) Rate
The Failures In Time (FIT) rate of a device is the number of failures that can be expected in one billion (10^9) device-hours of operation
Or 1000 devices for 1 million hours each, or 1 million devices for 1000 hours each
MTBF = 1,000,000,000 x 1/FIT
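A minimal sketch of the FIT/MTBF conversion; reusing a 100,000-hour figure (as in the AFR example) is my own choice for illustration:

```python
# FIT: expected failures per one billion (10^9) device-hours; MTBF (hours) = 10^9 / FIT.

def mtbf_from_fit(fit_rate: float) -> float:
    return 1e9 / fit_rate

def fit_from_mtbf(mtbf_hours: float) -> float:
    return 1e9 / mtbf_hours

# Hypothetical example: a device with a 100,000-hour MTBF has a FIT rate of 10,000.
print(fit_from_mtbf(100_000))   # 10000.0
print(mtbf_from_fit(10_000))    # 100000.0
```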
Parity for Error Detection
[Diagram: parity bit p formed from the data bits; a check bit c is recomputed on read]
Minimum Hamming distance of a parity code is 2
A non-zero parity check indicates an error occurred:
2 errors (on different bits) are not detected
Nor is any even number of errors; only odd numbers of errors are detected
Parity Example
Write to memory:
Data 0101 0101 has 4 ones, so parity is already even
Store 0101 0101 0 to keep parity even
Read from memory:
0101 0101 0 has 4 ones => even parity, so no error
Write to memory:
Data 0101 0111 has 5 ones (odd parity now)
Store 0101 0111 1 to make parity even
Read from memory:
1101 0101 0 has 5 ones => odd parity, so error
What if the error is in the parity bit?
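A small Python sketch of the even-parity scheme, reproducing the example above; the helper names are mine, not from the lecture:

```python
# Even parity: store a parity bit so the total number of 1s (data + parity) is even.

def parity_bit(bits: str) -> str:
    """Return the even-parity bit for a string of '0'/'1' characters."""
    return str(bits.count("1") % 2)

def check(word_with_parity: str) -> bool:
    """True if the stored word (data + parity bit) still has even parity."""
    return word_with_parity.count("1") % 2 == 0

data = "01010101"                    # 4 ones -> parity bit 0
stored = data + parity_bit(data)     # "010101010" written to memory
print(check(stored))                 # True: no error detected

corrupted = "1" + stored[1:]         # flip the first bit: "110101010"
print(check(corrupted))              # False: odd parity, error detected

double = "111" + stored[3:]          # flip bits 0 and 2: an even number of errors
print(check(double))                 # True (!): even numbers of errors go undetected
```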
Suppose We Want to Correct One Error?
Hamming came up with a simple-to-understand mapping that allows error correction at a minimum distance of three
Single error correction, double error detection
Called “Hamming ECC”
Hamming worked weekends on a relay computer with an unreliable card reader; frustrated with manually restarting jobs, he got interested in error correction and published in 1950:
R. W. Hamming, “Error Detecting and Correcting Codes,” The Bell System Technical Journal, Vol. XXVI, No. 2 (April 1950), pp. 147-160.
Detecting/Correcting Code Concept
[Diagram: the space of possible bit patterns (2^N), split into valid codewords and invalid (non-code) patterns]
Encoded data bits: p1 p2 d1 p4 d2 d3 d4 p8 d5 d6 d7 d8 d9 d10 d11 p16 d12 d13 d14 d15
Parity bit coverage (each parity bit p_k covers the positions whose index includes k in binary):
p1: positions 1, 3, 5, 7, 9, 11, 13, 15, 17, 19
p2: positions 2, 3, 6, 7, 10, 11, 14, 15, 18, 19
p4: positions 4–7, 12–15, 20
p8: positions 8–15
p16: positions 16–20
Hamming ECC
Set parity bits to create even parity for each group
A byte of data: 10011010
Create the coded word, leaving spaces for the parity bits:
_ _ 1 _ 0 0 1 _ 1 0 1 0
1 2 3 4 5 6 7 8 9 a b c – bit position
Setting each parity bit for even parity over its group (p1 = 0, p2 = 1, p4 = 1, p8 = 0) gives the coded word:
0 1 1 1 0 0 1 0 1 0 1 0
Bit position: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Encoded data bits: p1 p2 d1 p4 d2 d3 d4 p8 d5 d6 d7 d8 d9 d10 d11 p16 d12 d13 d14 d15
(Parity bit coverage for p1, p2, p4, p8, p16 as in the table above)
Hamming ECC Error Check
Suppose we receive 011100101110
Check each even-parity group:
Group 1 (positions 1, 3, 5, 7, 9, 11): 0 1 0 1 1 1 √
Group 2 (positions 2, 3, 6, 7, 10, 11): 11 01 11 X (parity 2 in error)
Group 4 (positions 4, 5, 6, 7, 12): 1001 0 √
Group 8 (positions 8–12): 01110 X (parity 8 in error)
Implies position 8 + 2 = 10 is in error: 011100101110
Hamming ECC Error Correct
Flip the incorrect bit (position 10): 011100101010
Re-check the corrected word 011100101010:
Group 1 (positions 1, 3, 5, 7, 9, 11): 0 1 0 1 1 1 √
Group 2 (positions 2, 3, 6, 7, 10, 11): 11 01 01 √
Group 4 (positions 4, 5, 6, 7, 12): 1001 0 √
Group 8 (positions 8–12): 01010 √
All groups check out, so the corrected word is a valid codeword
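The encode/check/correct procedure walked through on these slides can be sketched in Python as below; the function names and structure are my own, but the example values match the slides.

```python
# Hamming single-error-correcting code as on the slides:
# 8 data bits -> 12-bit codeword with even-parity bits at positions 1, 2, 4, 8.
# Positions are 1-based; parity bit p_k covers every position whose index has bit k set.

PARITY_POS = [1, 2, 4, 8]
DATA_POS   = [3, 5, 6, 7, 9, 10, 11, 12]

def encode(data: str) -> str:
    """Encode 8 data bits (e.g. '10011010') into a 12-bit Hamming codeword."""
    word = [0] * 13                                   # index 0 unused
    for pos, bit in zip(DATA_POS, data):
        word[pos] = int(bit)
    for p in PARITY_POS:
        covered = [i for i in range(1, 13) if i & p]  # positions checked by this parity bit
        word[p] = sum(word[i] for i in covered if i != p) % 2
    return "".join(str(b) for b in word[1:])

def correct(codeword: str) -> str:
    """Check each parity group; the sum of the failing groups is the bad position."""
    word = [0] + [int(b) for b in codeword]
    syndrome = 0
    for p in PARITY_POS:
        if sum(word[i] for i in range(1, 13) if i & p) % 2 != 0:
            syndrome += p                             # this group has odd parity
    if syndrome:
        word[syndrome] ^= 1                           # flip the single bad bit
    return "".join(str(b) for b in word[1:])

print(encode("10011010"))        # 011100101010, as on the encoding slide
print(correct("011100101110"))   # syndrome 2 + 8 = 10 -> corrects back to 011100101010
```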
What if More Than 2-Bit Errors?
Use double-error correction, triple-error detection (DECTED)
For network transmissions, disks, and distributed storage, the common failure mode is bursts of bit errors, not just one- or two-bit errors
A burst is a contiguous sequence of B bits in which the first, last, and any number of intermediate bits are in error
Bursts are caused by impulse noise or by fading in wireless channels; the effect is greater at higher data rates
Solve with Cyclic Redundancy Check (CRC), interleaving, or other more advanced codes
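A sketch of CRC-based burst detection using Python's standard-library zlib.crc32; the slides do not name a specific polynomial, so CRC-32 here is just a convenient stand-in.

```python
# Burst-error detection with a CRC, using Python's standard-library CRC-32.
import zlib

message = b"example data block"
crc = zlib.crc32(message)                      # 32-bit checksum sent/stored alongside the data

# A burst error: several contiguous bytes corrupted in transit.
corrupted = message[:4] + b"XXXX" + message[8:]
print(zlib.crc32(corrupted) == crc)            # False: the burst is detected
print(zlib.crc32(message) == crc)              # True: intact data passes the check
```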
RAID: Redundant Arrays of (Inexpensive) Disks
Data is stored across multiple disks
Files are "striped" across multiple disks
Redundancy yields high data availability
Availability: service still provided to the user, even if some components have failed
Disks will still fail
Contents reconstructed from data
redundantly stored in the array
− Capacity penalty to store redundant info
− Bandwidth penalty to update redundant info
Redundant Arrays of Inexpensive Disks
RAID 1: Disk Mirroring/Shadowing
[Diagram: each data disk is fully duplicated onto a mirror disk in its recovery group]
RAID 5: High I/O Rate Interleaved Parity
Independent writes are possible because parity is interleaved across the disks
[Diagram: data and parity blocks striped across five disks, with logical disk addresses increasing down the columns]
D0  D1  D2  D3  P
D4  D5  D6  P   D7
D8  D9  P   D10 D11
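A minimal sketch of the XOR parity idea behind RAID 5: the parity block is the XOR of the data blocks in a stripe, so any one lost block can be rebuilt from the rest. Block contents and sizes here are toy values for illustration.

```python
# RAID-style XOR parity: P = D0 ^ D1 ^ D2 ^ D3, so any single lost block is recoverable.
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, byte_tuple) for byte_tuple in zip(*blocks))

stripe = [b"D0D0", b"D1D1", b"D2D2", b"D3D3"]    # data blocks in one stripe (toy sizes)
parity = xor_blocks(stripe)                      # parity block for the stripe

# The disk holding D2 fails: rebuild it from the surviving blocks plus parity.
survivors = [stripe[0], stripe[1], stripe[3], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == stripe[2])                      # True
```

This also shows why there is a bandwidth penalty: every small write must update the stripe's parity block as well as the data block.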