Fault Tolerant System Design
Lecturer
Elena Dubrova
Electronic Systems (ES)
ICT/KTH
[email protected]
http://www.ict.kth.se/~dubrova
p. 2 Design of Fault Tolerant Systems Elena Dubrova, ESDlab
Office hours
• No fixed time
• Send me an email with your questions or ask for
a meeting
Teaching Assistant
Shohreh Sarif Mansouri
PhD Student, Electronic Systems
ICT/KTH
Text book
• E. Dubrova, Fault-Tolerant Design: An
Introduction, draft
• Available from my homepage
Course evaluation
• Midterm exam (20%)
• Final exam (60%)
• 5 assignments (20%)
Assignments
• 5 assignments, worth 20% of the final
grade
– each consists of 5-6 tasks, worth 1-3 points
– should be handed in to me on the due date
– late assignments will get reduced points (25%
per day)
Examinations
• Midterm exam, 45 min, worth 20% of the
final grade
– held during a lecture in the middle of the
course, 4-5 tasks
– cannot be retaken
• Final exam, 4 hours, worth 60% of the final
grade
– 10-12 tasks
PhD students
• Additional component for PhD students:
– select 2 interesting papers/problems, related to
the course material
– bring them to me for discussion
– I will select one of them
– you will read this paper/solve the problem,
write a 2-page report and give a 20 min talk at
the last lecture
Objectives
• understanding fault tolerance
– faults and their effects (errors, failures)
– redundancy techniques
– evaluation of fault-tolerant systems
• balance
– concepts, underlying principles
– applications
Overview
• Introduction
– definition of fault tolerance, applications
• Fundamentals of dependability
– dependability attributes: reliability, availability, safety
– dependability impairments: faults, errors, failures
– dependability means
• Dependability evaluation techniques
– common measures: failure rate, MTTF, MTTR
– reliability block diagrams
– Markov processes
Overview
• Redundancy techniques
– space redundancy
• hardware redundancy
• information redundancy
• software redundancy
– time redundancy
Introduction to Fault Tolerance
Fault tolerance
Easily testable system
• An easily testable system is one whose ability
to work correctly can be verified in a simple
manner
Why do we need fault tolerance?
• It is practically impossible to build a perfect
system
– suppose each component has a reliability of
99.99%
– a system consisting of 100 non-redundant
components, all of which must work, will have
a reliability of about 99.00% (0.9999^100,
assuming independent failures)
– a system consisting of 10,000 such components
will have a reliability of 36.79%
• It is hard to foresee all the factors
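The reliability figures on this slide can be reproduced with a few lines of Python. This is a minimal sketch, assuming that component failures are independent and that the system works only if every component works (a series system); the function name `series_reliability` is chosen here for illustration.

```python
def series_reliability(component_reliability: float, n_components: int) -> float:
    """Reliability of a non-redundant system of n identical components.

    Assumes independent failures and that all components must work,
    so the individual reliabilities multiply.
    """
    return component_reliability ** n_components


if __name__ == "__main__":
    # Component reliability of 99.99%, as in the slide.
    for n in (1, 100, 10_000):
        print(f"{n:>6} components: {series_reliability(0.9999, n):.2%}")
```

Even with very good components, the system reliability drops quickly as the number of components grows.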
Redundancy
• Redundancy is the provision of functional
capabilities that would be unnecessary in a
fault-free environment
– replicated hardware component
– parity check bit attached to digital data
– a line of program code verifying the correctness of
the result
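As an illustration of information redundancy, the following sketch attaches an even-parity bit to a data word and checks it on receipt. It assumes even parity and a word represented as a list of 0/1 integers; the helper names `add_even_parity` and `parity_ok` are hypothetical.

```python
def add_even_parity(bits: list[int]) -> list[int]:
    """Append one redundant bit so that the total number of 1s is even."""
    parity = sum(bits) % 2
    return bits + [parity]


def parity_ok(word: list[int]) -> bool:
    """Return True if the word (data bits plus parity bit) has an even number of 1s."""
    return sum(word) % 2 == 0


if __name__ == "__main__":
    word = add_even_parity([1, 0, 1, 1])
    print(word, parity_ok(word))          # fault-free transfer

    corrupted = word.copy()
    corrupted[2] ^= 1                      # flip a single bit
    print(corrupted, parity_ok(corrupted))
```

The parity bit would be unnecessary in a fault-free environment. It detects any single-bit error, but it cannot locate the error and it misses errors that flip an even number of bits.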
History
• early computer systems
– basic components had very low reliability
– fault-tolerant techniques were needed to
overcome it
• redundant structures with voting
• error-detection and error-correction codes
History
• early computer systems
– EDVAC (1949)
• duplicate ALU and compare results of both
• continue processing if the results agree, else report an error
– Bell Relay Computer (1950)
• 2 CPUs
• one unit begins executing the next instruction if the
other encounters an error
– IBM 650, UNIVAC (1955)
• parity check on data transfers
History
• Advent of transistors
– more reliable components
– led to a temporary decrease in the emphasis on
fault-tolerant computing
– designers thought it was enough to depend on
the improved reliability of the transistor to
guarantee correct computations
History
• last decades
– more critical applications
• space programs, military applications
• control of nuclear power stations
• banking transactions
– VLSI made the implementation of many redundancy
techniques practical and cost effective
– faults other than hardware component faults need to be
tolerated:
• transient faults (soft errors) caused by environmental factors
• software faults
Applications
• safety-critical applications
– critical to human safety
• aircraft flight control
– environmental disaster must be avoided
• chemical plants, nuclear plants
– requirements
• 99.99999% probability of being operational at the end
of a 3-hour period
Applications
• mission-critical applications
– it is important to complete the mission
– repair is impossible or prohibitively expensive
• Pioneer 10 was launched on 2 March 1972 and
crossed the orbit of Neptune on 13 June 1983
– requirements
• 95% probability of being operational at the end of
the mission (e.g. 10 years)
• may be degraded or reconfigured before then (operator
interaction possible)
Applications
• business-critical applications
– users want to have a high probability of
receiving service when it is requested
– transaction processing (banking, stock
exchange or other time-shared systems)
• ATM: < 10 hours/year unavailable
• airline reservation: < 1 min/day unavailable
Applications
• maintenance postponement applications
– avoid unscheduled maintenance
– should continue to function until the next planned
repair (economic benefits)
– examples:
• remotely controlled systems
• telephone switching systems (in remote areas)
Goals of fault tolerance
Dependability
• Dependability is the ability of a system to
deliver its intended level of service to its users
Dependability tree
• attributes
– availability
– reliability
– safety
• means
– fault tolerance
– fault prevention
– fault removal
– fault forecasting
• impairments
– faults
– errors
– failures
Reliability
• R(t) is the probability that a system
operates without failure in the interval [0,t],
given that it worked at time 0
• We need high reliability when:
– even momentary periods of incorrect
performance are unacceptable (aircraft, heart
pacemaker)
– no repair possible (satellite, spacecraft)
High reliability examples
• airplane:
– R(several hours) = 0.999 999 9 (seven nines)
• spacecraft:
– R(several years) = 0.95
Reliability versus fault tolerance
• Fault tolerance is a technique that can
improve reliability, but
– a fault tolerant system does not necessarily
have a high reliability
– a system can be designed to tolerate any
single error, but the probability of such an error
occurring can be so high that the reliability is very
low
Reliability versus fault tolerance
• A highly reliable system is not necessarily
fault tolerant
– a very simple system can be designed using
very good components such that the probability
of hardware failing is very low
– but if the hardware fails, the system cannot
continue its functions
How fault tolerance helps
• Fault tolerance can improve a system’s
reliability by keeping the system operational
when hardware or software faults occur
– a computer system with one redundant
processor can be designed to continue working
correctly even if one of the processors fails
– QUESTION: Will a fault-tolerant system always
be more reliable than an individual component?
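One way to explore the question is a redundant structure with voting. The sketch below uses triple modular redundancy (TMR) as an example and is not taken from the slides: it assumes three identical modules with independent failures, a perfect voter, and a system that works whenever at least two modules work, which gives R_TMR = 3R^2 - 2R^3.

```python
def tmr_reliability(r: float) -> float:
    """Reliability of a 2-out-of-3 majority-voted system.

    Assumes identical modules of reliability r, independent failures
    and a voter that never fails:
    P(all three work) + P(exactly two work) = r^3 + 3 r^2 (1 - r).
    """
    return 3 * r**2 - 2 * r**3


if __name__ == "__main__":
    for r in (0.99, 0.9, 0.5, 0.4):
        print(f"module R = {r:.2f}  ->  TMR R = {tmr_reliability(r):.4f}")
```

Under these assumptions the redundant system is more reliable than a single module only when the module reliability is above 0.5. Below that value the redundancy makes the system less reliable than one module on its own.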
Availability
• A(t) is the probability that a system is
functioning correctly at the instant of time t
• depends on
– how frequently the system becomes
non-operational
– how quickly it can be repaired
Steady-state availability
• Often the availability assumes a
time-independent value after some initial time
interval
• This value is called the steady-state availability,
A_ss
• Steady-state availability is often specified in
terms of downtime per year
– A_ss = 90%, downtime = 36.5 days/year
– A_ss = 99%, downtime = 3.65 days/year
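The conversion between steady-state availability and downtime per year is simple arithmetic. The sketch below assumes a 365-day year; the function name `downtime_days_per_year` is chosen here for illustration.

```python
DAYS_PER_YEAR = 365  # assumption: leap years are ignored


def downtime_days_per_year(availability: float) -> float:
    """Expected downtime in days per year for a given steady-state availability."""
    return (1.0 - availability) * DAYS_PER_YEAR


if __name__ == "__main__":
    for a in (0.90, 0.99, 0.999, 0.99999):
        days = downtime_days_per_year(a)
        print(f"A_ss = {a}: {days:.4f} days/year ({days * 24 * 60:.1f} min/year)")
```

Each additional "nine" of availability cuts the allowed downtime by a factor of ten.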
Reliability versus availability
• reliability depends on an interval of time
• availability is taken at an instant of time
• a system can be highly available yet
experience frequent periods of being
non-operational, as long as the length of each
period is extremely short
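A small numeric sketch can make this point concrete. It assumes the standard steady-state relation A_ss = MTTF / (MTTF + MTTR), which is not stated on this slide, and uses made-up numbers: a system that fails on average once per hour but recovers in one second.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: leap years are ignored


def steady_state_availability(mttf_s: float, mttr_s: float) -> float:
    """Fraction of time the system is up, given mean time to failure and to repair."""
    return mttf_s / (mttf_s + mttr_s)


def outages_per_year(mttf_s: float, mttr_s: float) -> float:
    """Average number of failure/repair cycles completed in one year."""
    return SECONDS_PER_YEAR / (mttf_s + mttr_s)


if __name__ == "__main__":
    mttf, mttr = 3600.0, 1.0  # hypothetical: fails hourly, repaired in 1 s
    print(f"availability:     {steady_state_availability(mttf, mttr):.5f}")
    print(f"outages per year: {outages_per_year(mttf, mttr):.0f}")
```

With these numbers the system is non-operational thousands of times per year, so it is very unlikely to run through even one day without a failure, yet it is up more than 99.9% of the time.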
High availability examples
• examples
– transaction processing
• ATM: A_ss = 0.999 (< 10 hours/year unavailable)
• banking: A_ss = 0.997 (< 10 s/hour unavailable)
– computing
• supercomputer centres:
A_ss = 0.97 (< 10 days/year unavailable)
– embedded
• telecom: A_ss = 0.99999 (< 5 min/year unavailable)
How fault tolerance helps
• Fault tolerance can improve a system’s
availability by keeping the system
operational when a failure occurs
– a spare processor can perform the functions of
the system, keeping it available for use, while
the primary processor is being repaired
Safety
• Safety is the probability that a system will
either perform its function correctly or will
discontinue its operation in a safe way
• A system is safe
– if it functions correctly, or
– if, when it fails, it remains in a safe state
High safety examples
• railway signalling
– all semaphores red
• nuclear energy
– stop the reactor if a problem occurs
• banking
– don’t give the money if in doubt
Reliability versus safety
• Reliability is the probability that a system
will perform its functions correctly
• Safety is the probability that a system will
either work correctly or will stop in a
manner that causes no harm
How fault tolerance helps
• Fault tolerance techniques can improve
safety by turning a system off if a failure of
a certain sort is detected
– in a nuclear power plant the reaction process
should be stopped if some discrepancy is
detected
Summary: attributes of dependability
• reliability:
– continuity of service
• availability:
– readiness for usage
• safety:
– non-occurrence of catastrophic consequences
for the environment
Next lecture
• Faults, errors and failures
• Design philosophies to combat faults