0% found this document useful (0 votes)
38 views

001. Lesson 1 - Introduction to Fault-Tolerant Computing

Introduction to Fault Tolerance

Uploaded by

Paul Pogba Clive
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
38 views

001. Lesson 1 - Introduction to Fault-Tolerant Computing

Introduction to Fault Tolerance

Uploaded by

Paul Pogba Clive
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 6

Lesson 1: Introduction to Fault-Tolerant Computing

1. Introduction
In computing systems, failures can occur due to various reasons, such as hardware malfunctions,
software bugs, or external factors like power outages. These failures can lead to system crashes,
data loss, or service disruptions. Fault tolerance refers to the ability of a system to continue
operating, even in the presence of faults. This is achieved by using various techniques that allow
a system to detect, isolate, and recover from faults without affecting the system's overall
functionality.
Fault tolerance is critical in systems that require high availability, reliability, and uninterrupted
service, such as in aviation, banking, healthcare, and telecommunications.
2. Learning Outcomes
By the end of the lesson, students should be able to:
1. Demonstrate the understanding of the concept of fault tolerance and its significance in
computing.
2. Identify different fault-tolerant schemes.
3. Explain the role of fault tolerance in critical systems like aviation and banking.

3. The Concept of Fault Tolerance


3.1 What is a Fault?
A fault is any abnormal condition that causes a system to deviate from the expected behavior.
Faults can be categorized into different types:
a. Transient Faults: Occur temporarily and disappear without any intervention.
b. Intermittent Faults: Appear and disappear at irregular intervals.
c. Permanent Faults: Persist until repaired or replaced.
3.2 Fault Tolerance
Fault tolerance is the property that enables a system to continue functioning even when one or
more of its components fail. This is achieved through redundancy and other mechanisms that
help in fault detection, isolation, and recovery.
3.3 Importance of Fault Tolerance in Modern Computing
a. Ensures that systems remain operational with minimal downtime, which is critical for
industries like healthcare and financial services.

1
b. Improves the overall reliability of a system by preventing complete system failure due to
component faults.
c. In systems like aviation and autonomous vehicles, fault tolerance ensures that the
system continues to operate safely, even in the event of a fault.
4. Fault-Tolerant Schemes
Fault-tolerant schemes are strategies used to detect, mask, and recover from faults. Some
common fault tolerance schemes include:
4.1 Redundancy
Redundancy is one of the most widely used fault-tolerant schemes. It involves duplicating critical
components or systems to ensure that a backup is available if the primary system fails.
Redundancy can be applied in three domains as follows:
a. Hardware Redundancy: Involves having multiple hardware components, such as
processors, power supplies, or disks, that take over in case of failure.
b. Software Redundancy: Uses diverse software versions or duplicate software modules
that can substitute for one another in case of failure.
c. Information Redundancy: Includes error-detecting and error-correcting codes like
Hamming Code or Reed-Solomon Code, which ensure data integrity.
Diagram: Hardware Redundancy in a RAID System

An illustration showing multiple hard drives in a RAID setup, where redundant data is distributed
across disks to ensure fault tolerance.

2
4.2 Replication
In replication, identical copies of processes or data are maintained. This ensures that if one
instance fails, others can take over with no loss of data or service.
 Data Replication: Data is stored in multiple locations. If one data center fails, another can
provide the same data.
 Process Replication: Critical processes are run on multiple machines or virtual
environments. In case of failure in one, the others can continue executing without
disruption.
Example of Process Replication in Distributed Systems
The diagram below shows a leader-based replication process. The Leader-based replication is
an ideal choice for read-scaling scenarios where the read requests processed by a distributed
system are far more than the number of write requests. This is often true of internet applications.
The number of followers can be increased as the read load on the system increases.

An illustration of process replication across multiple servers in a distributed system, ensuring


continuity of service.

3
4.3 Failover Systems
Failover systems automatically switch to a redundant or standby system when a fault is
detected. This is common in critical services, such as web hosting and financial services, where
downtime is unacceptable.
 Cold Failover: Involves switching to a backup system that is not running until a failure is
detected. This approach has some delay due to startup time.
 Hot Failover: The standby system runs in parallel with the active system and can take
over almost instantly when the primary system fails.
Diagram: Failover Mechanism

An
illustration showing the primary system and a standby system, where the standby system takes
over during a failure.

4
5. Role of Fault Tolerance in Critical Systems
Fault tolerance is essential in systems that require high reliability and availability. Some
applications are found in banking, aviation, automobiles, defense etc.
5.1 In Aviation
In aviation, fault tolerance is critical for safety. Modern aircraft use triple-modular redundancy
(TMR) in their flight control systems. In TMR, three independent systems run the same
calculations. If one system gives a different result from the other two, it is automatically isolated,
and the majority decision is taken as correct.
Diagram: Triple-Modular Redundancy in Aircraft

An illustration of TMR in a flight control system where three processors make parallel decisions.
5.2 In Banking
In banking, systems like Automated Teller Machines (ATMs) and online transaction services
require continuous availability. Financial institutions use data replication across multiple data
centers to ensure that transactions are processed even if one server fails. Additionally, RAID
systems (Redundant Array of Independent Disks) are used to protect critical financial data
against hardware failure.
Diagram: Data Replication in Banking Systems

5
An illustration showing how transaction data is replicated across multiple data centers to ensure
availability.
5.3 In Medical Devices: In life-supporting medical devices like pacemakers and ventilators, fault
tolerance ensures that the devices function reliably. Redundant components and error-
correcting mechanisms help avoid failures that could result in life-threatening situations.
5.4 In Financial Systems: Banks, stock exchanges, and payment processing systems require
fault tolerance to ensure continuous service availability and data integrity, even during hardware
failures, network outages, or security breaches.
5.5 In Data Centers and Cloud Computing: Fault tolerance ensures that cloud services and data
centers maintain high availability by employing redundant servers, storage systems, and failover
mechanisms. This minimizes downtime and protects against data loss.
5.6 In Nuclear Power Plants: Fault tolerance is essential for the safe operation of nuclear power
plants. It helps in managing and mitigating potential system failures that could lead to radiation
leaks or meltdowns. Redundant sensors, control systems, and backup safety protocols are
commonly used.
Fault tolerance is a fundamental aspect of designing modern computing systems, especially for
critical applications like aviation, banking, and healthcare. By employing various schemes like
redundancy, replication, and failover, fault-tolerant systems can continue to provide reliable and
uninterrupted services, even in the face of faults and failures.
Evaluation Questions
1. Define fault tolerance and explain its importance in modern computing systems.
2. Describe the following three fault-tolerant schemes with real-world examples.
i. Redundancy
ii. Failover Systems
iii. Replication

You might also like