Lec38: Dependability
UC Berkeley CS61C: Computer Architecture (a.k.a. Machine Structures)
Teaching Professor Dan Garcia, Professor Bora Nikolić
cs61c.org
6 Great Ideas in Computer Architecture
1. Abstraction (Layers of Representation/Interpretation)
2. Moore’s Law
3. Principle of Locality/Memory Hierarchy
4. Parallelism
5. Performance Measurement & Improvement
6. Dependability via Redundancy
Computers Fail…
May fail transiently… or permanently
[Diagram: three redundant units each compute 1+1=2; the answer is accepted when 2 of 3 agree]
Failure: deviation from the specified service
Restoration: return to the specified service
The period between a failure and its restoration is a service interruption
Dependability via Redundancy: Time vs. Space
Spatial Redundancy – replicated data, check information, or hardware to handle hard and soft (transient) failures
Temporal Redundancy – redundancy in time (retry) to handle soft (transient) failures
Dependability Measures
Reliability: Mean Time To Failure (MTTF)
Service interruption: Mean Time To Repair (MTTR)
Mean time between failures (MTBF)
MTBF = MTTF + MTTR
Availability = MTTF / (MTTF + MTTR)
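To make the relationships concrete, here is a minimal Python sketch of these formulas; the MTTF/MTTR numbers are made up for illustration, not from the lecture.

```python
# MTBF = MTTF + MTTR;  Availability = MTTF / (MTTF + MTTR)

def mtbf(mttf_hours: float, mttr_hours: float) -> float:
    """Mean Time Between Failures."""
    return mttf_hours + mttr_hours

def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the service is up."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Hypothetical numbers: a part that fails about every 10,000 hours and takes 10 hours to repair.
print(f"MTBF         = {mtbf(10_000, 10):,.0f} hours")    # 10,010 hours
print(f"Availability = {availability(10_000, 10):.3%}")   # 99.900%
```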
Improving Availability
Increase MTTF: more reliable hardware/software + fault tolerance
Reduce MTTR: improved tools and processes for diagnosis and repair
Availability Measures
Availability = MTTF / (MTTF + MTTR) as %
MTTF, MTBF usually measured in hours
Since we hope systems are rarely down, the shorthand is “number of 9s of availability per year”
1 nine: 90% => 36 days of repair/year
2 nines: 99% => 3.6 days of repair/year
3 nines: 99.9% => 526 minutes of repair/year
4 nines: 99.99% => 53 minutes of repair/year
5 nines: 99.999% => 5 minutes of repair/year
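As a sanity check on the table above, a short Python sketch that converts a number of 9s into downtime per year:

```python
# Convert "number of 9s" of availability into downtime per year (cf. the table above).
HOURS_PER_YEAR = 365 * 24  # 8760

for nines in range(1, 6):
    avail = 1 - 10 ** (-nines)            # 1 nine = 90%, 2 nines = 99%, ...
    downtime_hours = (1 - avail) * HOURS_PER_YEAR
    print(f"{nines} nine(s) = {avail:.3%}: "
          f"{downtime_hours / 24:6.2f} days or {downtime_hours * 60:7.1f} minutes of downtime/year")
```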
Reliability Measures
Another measure is the average number of failures per year: the Annualized Failure Rate (AFR)
E.g., 1000 disks with 100,000 hour MTTF
365 days * 24 hours = 8760 hours
(1000 disks * 8760 hrs/year) / 100,000
= 87.6 failed disks per year on average
87.6/1000 = 8.76% annual failure rate
Google’s 2007 study* found that actual AFRs for individual drives ranged from 1.7% for first-year drives to over 8.6% for three-year-old drives
*research.google.com/archive/disk_failures.pdf
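The slide’s AFR arithmetic, reproduced as a small Python sketch:

```python
# The AFR worked example: 1000 disks, each with a 100,000-hour MTTF.
HOURS_PER_YEAR = 365 * 24                        # 8760 hours

n_disks = 1000
mttf_hours = 100_000
failures_per_year = n_disks * HOURS_PER_YEAR / mttf_hours
afr = failures_per_year / n_disks                # annualized failure rate per disk

print(f"{failures_per_year:.1f} failed disks per year on average")  # 87.6
print(f"AFR = {afr:.2%}")                                           # 8.76%
```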
Hard Drive Failures
[Chart: annualized hard-drive failure rates]
Failures In Time (FIT) Rate
The Failures In Time (FIT) rate of a device is the number of failures that can be expected in one billion (10^9) device-hours of operation
Or 1000 devices for 1 million hours each, or 1 million devices for 1000 hours each
MTBF = 1,000,000,000 x 1/FIT
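A minimal sketch of the FIT/MTBF conversion; reusing a 100,000-hour figure (as in the AFR example) is my own choice for illustration:

```python
# FIT: expected failures per one billion (10^9) device-hours; MTBF (hours) = 10^9 / FIT.

def mtbf_from_fit(fit_rate: float) -> float:
    return 1e9 / fit_rate

def fit_from_mtbf(mtbf_hours: float) -> float:
    return 1e9 / mtbf_hours

# Hypothetical example: a device with a 100,000-hour MTBF has a FIT rate of 10,000.
print(fit_from_mtbf(100_000))   # 10000.0
print(mtbf_from_fit(10_000))    # 100000.0
```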
Parity for Error Detection
[Diagram: parity bit p formed from the data bits; a check bit c is recomputed on read]
Minimum Hamming distance of a parity code is 2
A non-zero parity check indicates an error occurred:
2 errors (on different bits) are not detected
Nor is any even number of errors; only odd numbers of errors are detected
Parity Example
Write to memory:
Data 0101 0101 has 4 ones, so parity is already even
Store 0101 0101 0 to keep parity even
Read from memory:
0101 0101 0 has 4 ones => even parity, so no error
Write to memory:
Data 0101 0111 has 5 ones (odd parity now)
Store 0101 0111 1 to make parity even
Read from memory:
1101 0101 0 has 5 ones => odd parity, so error
What if the error is in the parity bit?
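A small Python sketch of the even-parity scheme, reproducing the example above; the helper names are mine, not from the lecture:

```python
# Even parity: store a parity bit so the total number of 1s (data + parity) is even.

def parity_bit(bits: str) -> str:
    """Return the even-parity bit for a string of '0'/'1' characters."""
    return str(bits.count("1") % 2)

def check(word_with_parity: str) -> bool:
    """True if the stored word (data + parity bit) still has even parity."""
    return word_with_parity.count("1") % 2 == 0

data = "01010101"                    # 4 ones -> parity bit 0
stored = data + parity_bit(data)     # "010101010" written to memory
print(check(stored))                 # True: no error detected

corrupted = "1" + stored[1:]         # flip the first bit: "110101010"
print(check(corrupted))              # False: odd parity, error detected

double = "111" + stored[3:]          # flip bits 0 and 2: an even number of errors
print(check(double))                 # True (!): even numbers of errors go undetected
```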
Suppose We Want to Correct One Error?
Hamming came up with a simple-to-understand mapping that allows error correction at a minimum distance of three
Single error correction, double error detection
Called “Hamming ECC”
Hamming worked weekends on a relay computer with an unreliable card reader; frustrated with manually restarting jobs, he got interested in error correction and published in 1950:
R. W. Hamming, “Error Detecting and Correcting Codes,” The Bell System Technical Journal, Vol. XXVI, No. 2 (April 1950), pp. 147-160.
Detecting/Correcting Code Concept
[Diagram: the space of possible bit patterns (2^N), split into valid codewords and invalid (non-code) patterns]
Encoded data bits: p1 p2 d1 p4 d2 d3 d4 p8 d5 d6 d7 d8 d9 d10 d11 p16 d12 d13 d14 d15
Parity bit coverage (each parity bit p_k covers the positions whose index includes k in binary):
p1: positions 1, 3, 5, 7, 9, 11, 13, 15, 17, 19
p2: positions 2, 3, 6, 7, 10, 11, 14, 15, 18, 19
p4: positions 4–7, 12–15, 20
p8: positions 8–15
p16: positions 16–20
Hamming ECC
Set parity bits to create even parity for each group
A byte of data: 10011010
Create the coded word, leaving spaces for the parity bits:
_ _ 1 _ 0 0 1 _ 1 0 1 0
1 2 3 4 5 6 7 8 9 a b c – bit position
Setting each parity bit for even parity over its group (p1 = 0, p2 = 1, p4 = 1, p8 = 0) gives the coded word:
0 1 1 1 0 0 1 0 1 0 1 0
Bit position: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Encoded data bits: p1 p2 d1 p4 d2 d3 d4 p8 d5 d6 d7 d8 d9 d10 d11 p16 d12 d13 d14 d15
(Parity bit coverage for p1, p2, p4, p8, p16 as in the table above)
Hamming ECC Error Check
Suppose we receive 011100101110
Check each even-parity group:
Group 1 (positions 1, 3, 5, 7, 9, 11): 0 1 0 1 1 1 √
Group 2 (positions 2, 3, 6, 7, 10, 11): 11 01 11 X (parity 2 in error)
Group 4 (positions 4, 5, 6, 7, 12): 1001 0 √
Group 8 (positions 8–12): 01110 X (parity 8 in error)
Implies position 8 + 2 = 10 is in error: 011100101110
Hamming ECC Error Correct
Flip the incorrect bit (position 10): 011100101010
Re-check the corrected word 011100101010:
Group 1 (positions 1, 3, 5, 7, 9, 11): 0 1 0 1 1 1 √
Group 2 (positions 2, 3, 6, 7, 10, 11): 11 01 01 √
Group 4 (positions 4, 5, 6, 7, 12): 1001 0 √
Group 8 (positions 8–12): 01010 √
All groups check out, so the corrected word is a valid codeword
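The encode/check/correct procedure walked through on these slides can be sketched in Python as below; the function names and structure are my own, but the example values match the slides.

```python
# Hamming single-error-correcting code as on the slides:
# 8 data bits -> 12-bit codeword with even-parity bits at positions 1, 2, 4, 8.
# Positions are 1-based; parity bit p_k covers every position whose index has bit k set.

PARITY_POS = [1, 2, 4, 8]
DATA_POS   = [3, 5, 6, 7, 9, 10, 11, 12]

def encode(data: str) -> str:
    """Encode 8 data bits (e.g. '10011010') into a 12-bit Hamming codeword."""
    word = [0] * 13                                   # index 0 unused
    for pos, bit in zip(DATA_POS, data):
        word[pos] = int(bit)
    for p in PARITY_POS:
        covered = [i for i in range(1, 13) if i & p]  # positions checked by this parity bit
        word[p] = sum(word[i] for i in covered if i != p) % 2
    return "".join(str(b) for b in word[1:])

def correct(codeword: str) -> str:
    """Check each parity group; the sum of the failing groups is the bad position."""
    word = [0] + [int(b) for b in codeword]
    syndrome = 0
    for p in PARITY_POS:
        if sum(word[i] for i in range(1, 13) if i & p) % 2 != 0:
            syndrome += p                             # this group has odd parity
    if syndrome:
        word[syndrome] ^= 1                           # flip the single bad bit
    return "".join(str(b) for b in word[1:])

print(encode("10011010"))        # 011100101010, as on the encoding slide
print(correct("011100101110"))   # syndrome 2 + 8 = 10 -> corrects back to 011100101010
```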
What if More Than 2-Bit Errors?
Use double-error correction, triple-error detection (DECTED)
For network transmissions, disks, and distributed storage, the common failure mode is bursts of bit errors, not just one- or two-bit errors
A burst is a contiguous sequence of B bits in which the first, last, and any number of intermediate bits are in error
Bursts are caused by impulse noise or by fading in wireless channels; the effect is greater at higher data rates
Solve with Cyclic Redundancy Check (CRC), interleaving, or other more advanced codes
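A sketch of CRC-based burst detection using Python's standard-library zlib.crc32; the slides do not name a specific polynomial, so CRC-32 here is just a convenient stand-in.

```python
# Burst-error detection with a CRC, using Python's standard-library CRC-32.
import zlib

message = b"example data block"
crc = zlib.crc32(message)                      # 32-bit checksum sent/stored alongside the data

# A burst error: several contiguous bytes corrupted in transit.
corrupted = message[:4] + b"XXXX" + message[8:]
print(zlib.crc32(corrupted) == crc)            # False: the burst is detected
print(zlib.crc32(message) == crc)              # True: intact data passes the check
```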
RAID: Redundant Arrays of (Inexpensive) Disks
Data is stored across multiple disks
Files are "striped" across multiple disks
Redundancy yields high data availability
Availability: service still provided to the user, even if some components have failed
Disks will still fail
Contents reconstructed from data
redundantly stored in the array
− Capacity penalty to store redundant info
− Bandwidth penalty to update redundant info
Redundant Arrays of Inexpensive Disks
RAID 1: Disk Mirroring/Shadowing
[Diagram: each data disk is fully duplicated onto a mirror disk in its recovery group]
RAID 5: High I/O Rate Interleaved Parity
Independent writes are possible because parity is interleaved across the disks
[Diagram: data and parity blocks striped across five disks, with logical disk addresses increasing down the columns]
D0  D1  D2  D3  P
D4  D5  D6  P   D7
D8  D9  P   D10 D11
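A minimal sketch of the XOR parity idea behind RAID 5: the parity block is the XOR of the data blocks in a stripe, so any one lost block can be rebuilt from the rest. Block contents and sizes here are toy values for illustration.

```python
# RAID-style XOR parity: P = D0 ^ D1 ^ D2 ^ D3, so any single lost block is recoverable.
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, byte_tuple) for byte_tuple in zip(*blocks))

stripe = [b"D0D0", b"D1D1", b"D2D2", b"D3D3"]    # data blocks in one stripe (toy sizes)
parity = xor_blocks(stripe)                      # parity block for the stripe

# The disk holding D2 fails: rebuild it from the surviving blocks plus parity.
survivors = [stripe[0], stripe[1], stripe[3], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == stripe[2])                      # True
```

This also shows why there is a bandwidth penalty: every small write must update the stripe's parity block as well as the data block.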