System Reliability & Availability Calculations
A business imperative for companies of all sizes, cloud computing allows organizations to consume
IT services on a usage-based subscription model. The promise of cloud computing rests on two vital
metrics that describe the dependability of a system:
Service reliability
Service availability
Vendors offer service level agreements (SLAs) that commit to specific standards of reliability and
availability. An SLA breach not only incurs a cost penalty for the vendor but also compromises the
end-user experience of apps and solutions running on the cloud network.
Though reliability and availability are often used interchangeably, they are different concepts in the
engineering domain. Let’s explore the distinction between reliability and availability, then move into
how both are calculated.
What is reliability?
Reliability is the probability that a system performs correctly during a specific time duration. During
this correct operation:
No repair is required or performed
The system adequately follows the defined performance specifications
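One common way to quantify this, assuming a constant failure rate λ (a standard model, not a formula taken from this article), is the exponential reliability function:
R(t) = e^(−λ·t)
For example, a component with λ = 0.001 failures per hour has R(1,000 hours) = e^(−1) ≈ 0.368, i.e., roughly a 37% chance of operating for 1,000 hours without failure.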
What is availability?
Availability refers to the probability that a system performs correctly at a specific instant in time
(rather than over a duration). Interruptions may occur before or after the instant for which the
system’s availability is calculated. At that instant, the service must:
Be operational
Adequately satisfy the defined specifications at the time of its usage
Availability is measured at steady state, accounting for potential downtime incidents that can (and
will) render a service unavailable during its projected usage duration. For example, a 99.999%
(five-nines) availability corresponds to 5 minutes and 15 seconds of downtime per year.
(Learn more about availability metrics and the 9s of availability.)
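The downtime figure follows directly from the availability percentage. Taking 525,600 minutes in a year:
(1 − 0.99999) × 525,600 minutes ≈ 5.26 minutes, or about 5 minutes and 15 seconds of downtime per year.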
Failure rate
The frequency of component failure per unit time. It is usually denoted by the Greek letter λ
(Lambda) and is used to calculate the metrics specified later in this post. In reliability engineering
calculations, the failure rate is treated as the forecasted failure intensity, given that the component is
fully operational in its initial condition. The formula is given for repairable and non-repairable
systems respectively as follows:
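In standard reliability-engineering notation, a commonly used form of these definitions is:
λ = Number of failures / Total operating time (average failure rate of a repairable system)
λ(t) = f(t) / R(t) (instantaneous failure rate of a non-repairable system, where f(t) is the failure density function and R(t) is the reliability function)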
Repair rate
The frequency of successful repair operations performed on a failed component per unit time. It is
usually denoted by the Greek letter μ (Mu) and is used to calculate the metrics specified later in this
post. Repair rate is defined mathematically as follows:
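In the same notation, a commonly used definition is:
μ = Number of repairs / Total repair (maintenance) time
which is equivalently the reciprocal of the mean time to repair: μ = 1 / MTTR.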
For series-connected components, the effective failure rate is the sum of the failure rates of the
individual components.
For N series-connected components:
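Written out, the rule above gives:
λ_effective = λ_1 + λ_2 + … + λ_N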
For parallel-connected components, MTTF is determined as the reciprocal sum of the failure rates
of the system’s components.
For N parallel-connected components:
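For the common special case of N identical components in parallel, each with failure rate λ, a standard result (offered here as an illustration, assuming independent exponentially distributed failures) is:
MTTF = 1/(N·λ) + 1/((N−1)·λ) + … + 1/λ = (1/λ) × (1 + 1/2 + … + 1/N)
Each term is the reciprocal of the effective failure rate while that many components remain operational.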
For hybrid systems, the connections may be reduced to series or parallel configurations first.
For series-connected components, the network reliability (or availability) is the product of the
individual component values.
For N series-connected components:
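Written out, with A_i denoting component availability (the same form applies to reliability):
A_series = A_1 × A_2 × … × A_N
The parallel counterpart, which the example below relies on, is one minus the product of the individual unavailabilities:
A_parallel = 1 − (1 − A_1) × (1 − A_2) × … × (1 − A_N)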
It can be observed that the reliability and availability of a series-connected network of components
are lower than the specifications of the individual components. For example, two components with
99% availability connected in series yield 98.01% availability. The converse is true for the parallel
combination model. If one component has a 99% availability specification, then two such components
combined in parallel yield 99.99% availability, and four components in parallel yield 99.999999%
availability. Adding redundant components to the network further improves reliability and
availability performance.
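As a quick check of these figures, here is a minimal Python sketch (not from the original article) that applies the series and parallel rules above:

def series_availability(components):
    # Series network: multiply the individual availabilities together.
    result = 1.0
    for a in components:
        result *= a
    return result

def parallel_availability(components):
    # Parallel network: 1 minus the product of the individual unavailabilities.
    unavailability = 1.0
    for a in components:
        unavailability *= (1.0 - a)
    return 1.0 - unavailability

print(f"{series_availability([0.99, 0.99]):.2%}")    # 98.01%
print(f"{parallel_availability([0.99, 0.99]):.2%}")  # 99.99%
print(f"{parallel_availability([0.99] * 4):.6%}")    # 99.999999%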
It’s important to note a few caveats regarding these incident metrics and the associated reliability
and availability calculations.
1. These metrics are relative terms. Failure may be defined differently for the
same components in different applications, use cases, and organizations.
2. The values of metrics such as MTTF, MTTR, MTBF, and MTTD are averages observed in
experimentation under controlled or specific environments. These measurements may not
hold consistently in real-world applications.
Organizations should therefore map system reliability and availability calculations to business value
and end-user experience. Decisions may require strategic trade-offs among cost, performance, and
security, and decision makers will need to ask questions that go beyond the system dependability
metrics and specifications followed by IT departments.
Additional resources
The following literature was referenced for system reliability and availability calculations described
in this article:
Johnson, Barry. (1988). Design & analysis of fault tolerant digital systems. Chapters 1-4.
Johnson, Barry. (1996). An introduction to the design and analysis of fault-tolerant systems.
1-87.
Related reading
BMC IT Operations Blog
BMC Service Management Blog
MTBF vs. MTTF vs. MTTR: Defining IT Failure
MTTR Explained: Repair vs Recovery in a Digitized Environment
What Is High Availability? Concepts & Best Practices