
Identifying Reliability-Critical Components in High Performance Computing Systems to Improve MTTF

Nithin Nakka†, Alok Choudhary†, Gary Grider§, John Bent§, James Nunez§ and Satsangat Khalsa§

†Department of Electrical Engineering and Computer Science, Northwestern University, Evanston, IL, USA
{nakka, choudhar}@eecs.northwestern.edu
§Los Alamos National Laboratory, Albuquerque, New Mexico, U.S.A.
{ggrider, johnbent, jnunez, satsang}@lanl.gov
Abstract

Mean Time To Failure, MTTF, is a commonly accepted metric for reliability. In this paper we present a novel approach to achieve the desired MTTF with minimum redundancy. We analyze the failure behavior of large scale systems using failure logs collected by Los Alamos National Laboratory. We analyze the root cause of failures and present a choice of specific hardware and software components to be made fault-tolerant to achieve target MTTF at minimum expense. Not all components show similar failure behavior in the systems. Our objective, therefore, was to arrive at an ordering of components to be incrementally selected for protection to achieve a target MTTF. We propose a model for MTTF for tolerating failures in a specific component, system-wide, and order components according to the coverage provided. Systems grouped based on hardware configuration showed similar improvements in MTTF when different components in them were targeted for fault-tolerance.

1 Introduction

Computers are being employed increasingly in highly mission- and life-critical and long-running applications. In this scenario, there is a corresponding demand for high reliability and availability of the systems. Since failures are inevitable in a system, the best use of the bad bargain is to employ fault-detection and recovery techniques to meet the requirements. Mean Time To Failure (MTTF) is an important, well-accepted measure for the reliability of a system. MTTF is the time elapsed, on average, between any two failures in the component being studied.

Broadly speaking, two types of applications demand high reliability: (i) those which can be stopped and their state captured at a suitable point and their execution resumed at a later point in time from the captured state, also called the checkpointed state; (ii) those programs that cannot be interrupted and need to execute for a minimum amount of time, till the application (or the mission) is completed. Most long-running scientific applications are examples of applications in the former category. They require efficient mechanisms to take checkpoints of the entire state of the application at a suitable point, and effective techniques to detect failures so as to roll back execution to the checkpointed state. For applications in the latter category, such as systems and software for flight or spacecraft control, a time for the length of the mission (or mission time) is pre-determined and appropriate fault-tolerance techniques need to be deployed to ensure that the entire system does not fail within the mission time. The mission time of a flight system directly determines the length of its travel and is a highly critical decision point.

The MTTF of a system is an estimate of the time for which the system can be expected to work without any failures. Therefore, for applications that can be checkpointed, the MTTF could be used to determine the checkpointing interval, within which the application's state must be checkpointed. This would ensure that the checkpoint state itself is not corrupted, and hence, by rolling back to this state on detecting a failure, the application will continue correct execution. For applications of the latter category, the MTTF can be used to determine the mission time, before which the system executes without any failure.

Understanding the failure behavior of a system can greatly benefit the design of fault-tolerance and reliability techniques for that as well as other systems with similar characteristics, thereby increasing their MTTF. Failure and repair logs are a valuable source of field failure information. The extent to which the logs aid in reliable design depends on the granularity at which the logging is performed. System-level logs could assist in system-wide techniques such as global synchronous checkpointing. However, logging at a finer granularity, like that at the node level, improves the effectiveness of techniques applied at the node level. An important observation that we make in this paper is that "all components in a system are not equal" (either by functionality or by failure behavior).

Component-level reliability information also helps in designing and deploying techniques such as duplication selectively to the most critical portions of the system. This decreases setup, maintenance and performance costs for the system. Furthermore, with a framework for selective fault-tolerance in place, the techniques could be customized to meet the specific reliability requirements of the application. In this paper we show how component-level selective fault-tolerance can be customized to meet an application's target MTTF.

The key contributions of this work are:
1. Analysis of failures in specific components and their correlation with system configuration.
2. Data-driven estimation of the coverage provided for fault tolerance in components in the system.
3. A methodology for selecting an optimal or near-optimal subset of components to be duplicated to achieve the MTTF requirements of the application.

2 Description of Systems under study and data sets

Los Alamos National Laboratory has collected and published data regarding the failure and usage of 22 of their supercomputing clusters. This data was previously analyzed by Schroeder et al. [10] from CMU to study the statistics of the data in terms of the root cause of the failures, the mean time between failures and the mean time to repair.

Table 1 provides the details of the systems. In the "Production Time" column the range of times specifies the node installation to decommissioning times. A value of "N/A" indicates that the nodes in the system had been installed before the observation period began. For these machines, we consider the beginning of the observation period (November 1996) as the starting date for calculating the production time. For some machines, multiple ranges of production times have been provided. This is because the systems were upgraded during the observation period and each production time range corresponds to a set of nodes within the system. For the sake of simplicity, for all such machines, we consider the production time to be the longest range among the specified ranges.
Table 1: Characteristics of Systems under study

Type  Category  System Number  # nodes  # procs in node  # procs  Network Topology      Production Time
A     smp       7              1        8                8        Not Applicable        N/A – 12/99
B     smp       24             1        32               32       Not Applicable        N/A – 12/03
C     smp       22             1        4                4        Not Applicable        N/A – 04/03
D     cluster   8              164      2                328      GB Ethernet Tree      04/01 – 11/05, 12/02 – 11/05
E     cluster   20             256      4                1024     Dual rail fat tree    12/01 – 11/05
E     cluster   21             128      4                512      Dual rail fat tree    09/01 – 01/02
E     cluster   18             1024     4                4096     Dual rail fat tree    05/02 – 11/05
E     cluster   19             1024     4                4096     Dual rail fat tree    10/02 – 11/05
E     cluster   3              128      4                512      Single rail fat tree  09/03 – 11/05
E     cluster   4              128      4                512      Single rail fat tree  09/03 – 11/05
E     cluster   5              128      4                512      Single rail fat tree  09/03 – 11/05
E     cluster   6              32       4                128      Single rail fat tree  09/03 – 11/05
F     cluster   14             128      2                256      Single rail fat tree  09/03 – 11/05
F     cluster   9              256      2                512      Single rail fat tree  09/03 – 11/05
F     cluster   10             256      2                512      Single rail fat tree  09/03 – 11/05
F     cluster   11             256      2                512      Single rail fat tree  09/03 – 11/05
F     cluster   13             256      2                512      Single rail fat tree  09/03 – 11/05
F     cluster   12             512      2                1024     Single rail fat tree  09/03 – 11/05, 03/05 – 06/05
G     cluster   16             16       128              2048     Multi rail fat tree   12/96 – 09/02, 01/97 – 11/05
G     cluster   2              49       128, 80          6152     Multi rail fat tree   06/05 – 11/05, 10/98 – 12/04, 01/98 – 12/04
G     cluster   23             5        128, 32          544      Multi rail fat tree   11/02 – 11/05, 11/05 – 12/04
H     numa      15             1        256              256                            11/04 – 11/05

Remark (systems of type F): Most jobs run on only one 256-node segment.
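Two conventions in Table 1 are easy to mechanize: the "# procs" column is the sum over node types of (number of nodes of that type) × (processors per node of that type), and a system with multiple production-time ranges is assigned its single longest range, with "N/A" starts replaced by the observation start of November 1996. A minimal sketch of both rules; the function names and date handling are ours, not the paper's:

```python
from datetime import date

def total_processors(node_types):
    """Sum of n_i * p_i over node types (n_i nodes with p_i processors each)."""
    return sum(n * p for n, p in node_types)

def production_range(ranges, obs_start=date(1996, 11, 1)):
    """Pick the longest production-time range for a system.

    `ranges` is a list of (start, end) date pairs; a start of None stands
    for "N/A" (nodes installed before the observation period), so the
    observation start, November 1996, is substituted.
    """
    normalized = [((s or obs_start), e) for s, e in ranges]
    return max(normalized, key=lambda r: (r[1] - r[0]).days)

# System 8 (type D): 164 nodes with 2 processors each -> 328 processors,
# and the longer of its two production ranges is kept.
procs = total_processors([(164, 2)])
longest = production_range([(date(2001, 4, 1), date(2005, 11, 1)),
                            (date(2002, 12, 1), date(2005, 11, 1))])
```

For a heterogeneous system, `total_processors` would be called with one (count, procs) pair per node type.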
The failure data for a single system includes the following fields:

System: The number of the system under study (as referred to by Los Alamos National Laboratory).
machine type: The type of the system (smp, cluster, numa).
number of nodes: The number of nodes in the system.
number of processors per node: Some systems are heterogeneous in that they contain nodes with different numbers of processors per node.
number of processors total: Total number of processors in the system, ∑ n_i × p_i, where, for a system with k node types, n_i is the number of nodes of type i and p_i is the number of processors in a node of type i.
node number starting with zero: This node numbering is provided by Los Alamos so as to maintain consistency among all systems in terms of node numbering, for easier comparison of systems.
install date: Date that the node was installed in the system. Since the systems under study were upgraded during the period of study, this field can vary for nodes within a system.
production date: Date, after the installation date, when the node has been tested for initial burn-in effects and known operational failures, so that the node could be used to run production applications.
decommission date: Date that the node was removed from the system.
field replaceable unit type: Whether, on a failure, the repair procedure involves replacing the entire unit or a portion of it.
memory: The amount of memory on the node.
node purpose: The function of the node in the overall system: (i) compute, (ii) front end, (iii) graphics, or a combination of the three such as graphics.fe (both graphics and front end).
Problem Started (mm/dd/yy hh:mm): The time of occurrence of the failure event. This is logged by an automated fault detection engine.
Problem Fixed (mm/dd/yy hh:mm): The time at which the failure was repaired and the system restored. This is logged by the administrators and repair staff.
Down Time: The time elapsed between the occurrence of the problem and restoration of the system.

The root cause for the failures is classified in the following categories: (i) Facilities, (ii) Hardware, (iii) Human Error, (iv) Network, (v) Undetermined, or (vi) Software.

The data includes a field for the root cause for each failure as determined by the administrators or maintenance personnel of the systems. This field provides information on the specific component of the node/system that caused the failure. The root causes are broadly classified into Facilities, Hardware, Operator Error (or Human Error), Network Error, and Software. Those failures whose root cause could not be resolved are placed in the Undetermined category. In this analysis we consider failures for all systems together as well as for each system at a time. The first approach provides insight into the failure rate of each component irrespective of the type of system it is part of. By conditioning this analysis on the system, we can understand how the failure behavior of each of the components changes with the specific type and configuration of the system.

Each of the six broad categories is further classified into sub-categories or components. The failure analysis traces the root cause of each failure to one of these sub-categories/components. Table 2 tabulates the failure categories and their corresponding set of sub-categories/components (for brevity we will refer to sub-categories/components as sub-components for the following discussion).

Table 2: Failure Categories and subcomponents

Failure Category  Failing Sub-Category or Sub-Component
Operator Error    Human Error
Network Error     Network
Undetermined      Security, Unresolvable, Undetermined
Facilities        Environment, Chillers, Power Spike, UPS, Power Outage
Software          Compilers and libraries, Scratch Drive, Security Software, Vizscratch FS, …
Hardware          WACS Logic, SSD Logic, Site Network Interface, KGPSA, SAN Fiber Cable, …

3 Related Work

There has been significant study over the past few decades on analyzing failure logs from large-scale computers to understand the failure behavior of such systems and possibly use the knowledge in improving system design. Plank et al. [1] derive an optimal checkpointing interval for an application using the failure logs obtained from networks of workstations. They derive the failure rate of the machines using the time between failures observed in the logs. In another study, Nath et al. [2] study real-world failure traces to recommend design principles that can be used to tolerate correlated failures. In particular, they have used them in recommending data placement strategies for correlated failures.

Previous work in analyzing the failure data set used in this paper also aimed at optimizing node-level redundancy [3]. However, the metric used was erroneously referred to as Mean Time To Failure (MTTF), even though the theory, analysis and results were presented for the metric Mean Time Between Failures (MTBF). It is to be noted that MTBF is not the same as the MTTF. Apart from the MTTF itself, MTBF includes the time required to repair the previous failure. MTTR is defined as the Mean Time To Repair a failure. Therefore, MTBF = MTTF + MTTR. In this paper, we perform the analysis and present the results for the metric MTTF.

Prior work in analyzing failure rates tried to arrive at reasonable curves that fit the failure distribution of the systems [4]-[9]. Schroeder and Gibson [10] have presented the characteristics of the failure data that has been used in this study as well. However, a limitation of these studies is that they do not evaluate how system design choices could be affected by the failure characteristics that they have derived from the data. In this work we specifically attempt to understand the failure behavior and use that knowledge in the design of an optimal component-level fault-tolerance strategy, with the aim of increasing the overall availability of the system at the least cost.

There is also interesting research work in understanding the correlations between system parameters and failure rate. Sahoo et al. [7] show that the workload of the system is closely correlated with its failure rate, whereas Iyer [11] and Castillo [12] bring out the correlation between workload intensity and the failure rate. In our study we examine the dependency of failure rate on network configuration, which in turn determines the workload characteristics of the system. For example, a fat tree topology necessitates higher communication and computation bandwidth and load at the higher levels of the tree structure.

Oliner and Stearley [13] have analyzed system logs from five supercomputers and critically evaluate the interpretations of system administrators from patterns observed in the logs. They propose a filtering algorithm to identify alerts from system logs. They also recommend enhancements to the logging procedures so as to include information crucial in distinguishing alerts from non-alerts.

Lan et al. [14][15] have proposed a machine-learning based automatic diagnosis and prognosis engine for failures through their analysis of the logs on Blue Gene/L systems deployed at multiple locations. Their goal is to feed the knowledge inferred from the logs to checkpointing and migration tools [16][17] to reduce the overall application completion time. Oliner et al. [18] derive failure distributions from multiple supercomputing systems and propose novel job-scheduling algorithms that take into account the occurrence of failures in the system. They evaluated the impact of this on the average bounded slowdown, average response time and system utilization. In our study we have utilized failure and repair time distributions to propose a novel approach to selective fault-tolerance. The metric of evaluation used is the mean time to failure (MTTF) of the system.

Duplication, both at the system and node level, has been a topic of active and extensive research in the micro-architecture and fault-tolerant computing areas. Error Detection Using Duplicated Instructions (EDDI) [22] duplicates original instructions in the program but with different registers and variables. Duplication at the application level increases the code size of the application in memory. More importantly, it reduces the instruction supply bandwidth from the memory to the processor. Error Detection by Diverse Data and Duplicated Instructions (ED4I) [23] is a software-implemented hardware fault tolerance technique in which two "different" programs with the same functionality are executed, but with different data sets, and their outputs are compared. The "different" programs are generated by multiplying all variables and constants in the original program by a diversity factor k.

In the realm of commercial processors, the IBM G5 processor [24] has extra I- and E-units to provide duplicate execution of instructions. To support duplicate execution, the G5 is restricted to a single-issue processor and incurs 35% hardware overhead. In experimental research, simultaneous multithreading (SMT) [25] and chip multiprocessor (CMP) architectures have been ideal bases for space- and time-redundant fault-tolerant designs because of their inherent redundancy. In a simultaneously and redundantly threaded (SRT) processor, only instructions whose side effects are visible beyond the boundaries of the processor core are checked [26]-[28]. This was subsequently extended in SRTR to include recovery [21]. Another fault-tolerant architecture is proposed in the DIVA design [19][20]. DIVA comprises an aggressive out-of-order superscalar processor along with a simple in-order checker processor. Microprocessor-based introspection (MBI) [29] achieves time redundancy by scheduling the redundant execution of a program during idle cycles in which a long-latency cache miss is being serviced. SRTR [21] and MBI [29] have reported up to 30% performance overhead. These results counter the widely-held belief that full duplication at the processor level incurs little or no performance overhead.

SLICK [30] is an SRT-based approach to provide partial replication of an application. The goals of this approach are similar to ours. However, unlike this approach, we do not rely on a multi-threaded architecture for the replication. Instead, this paper presents modifications to a general superscalar
processor to support partial or selective replication of the application.

As for research and production systems employing system-level duplication, the space mission to land on the moon used a TMR-enhanced computer system [31]. The TANDEM, now HP, Integrity S2 computer system [32] provided reliability through the concept of full duplication at the hardware level. The AT&T No. 5 ESS telecommunications switch [33], [34] uses duplication in its administrative module, consisting of the 3B20S processor, an I/O processor, and an automatic message accounting unit, to provide high reliability and availability. The JPL STAR computer [35] system for space applications primarily used hardware subsystem fault-tolerance techniques, such as functional unit redundancy, voting, power-spare switching, coding, and self-checks.

4 Approach

This section describes our approach for analyzing the data and building a model used for selective component-level fault-tolerance. The records for a single component are ordered according to the time of occurrence of the failure, as given by the "Prob Started" field. The time elapsed between the repair of one failure and the occurrence of the next failure in this ordered list gives the time to failure for the second failure (in the case of the first failure, the time to failure is the time from the installation of the component to the failure). Following this procedure, the times to failure for each failure of the component are calculated. The average of all these times is used as an estimate of the mean time to failure (MTTF) for the component. Letting s_i be the "Prob Started" time and e_i the "Prob Ended" time of failure i, and t_install the installation time, the Time To Failure (TTF) for a failure i is given by:

    TTF_1 = s_1 − t_install
    TTF_i = s_i − e_{i−1},  i > 1                                  Eq. 2

and therefore,

    ∑_{i=1}^{n} TTF_i = s_n − t_install − ∑_{i=1}^{n−1} (e_i − s_i)
                      = s_n − t_install − ∑_{i=1}^{n−1} TTR_i

MTTF is calculated as:

    MTTF = (1/n) ∑_{i=1}^{n} TTF_i                                 Eq. 1

The period of study of a system is its production time, defined elsewhere as the time between its installation and its decommissioning or the end of the observation period, whichever occurs first. Total downtime for all failures is the sum of the downtimes of all failures observed for the system.

The "Downtime" field provides the time required by the administrators to fix the failure and bring the system back to its original state. It can be calculated as the difference between the "Prob Ended" and "Prob Started" fields. This is the repair time for this failure. Averaging this field over all the records for a single component provides an estimate of the mean time to repair (MTTR) for this component. The Time To Repair (TTR) for a failure i is:

    TTR_i = e_i − s_i

Therefore, the Mean Time To Repair is given by:

    MTTR = (1/n) ∑_{i=1}^{n} TTR_i

4.1 Introducing component protection

From the previous analysis procedures, the MTTF and the MTTR for a component have been estimated. Now, we introduce a methodology to understand the effect on component failure if we augment it with a spare component. Figure 1 shows the states through which a duplicated component transitions on failure and repair events. When both the original and the spare component are working correctly, the system is in state "2". It is assumed that, after a failure is detected in the component, the system has the reconfiguration capability to fail over to the spare component instantaneously. Computation therefore continues uninterrupted. This state of the component is represented by state "1" in the figure. In the meantime, the original component is repaired. The roles of the spare component and the original component are switched. If no other failure occurs in the component before the original component is repaired, then the original component assumes the role of the spare component, while the computation continues on the spare component. Essentially, the system is brought back to its pristine, fault-free state (state "2" in the figure). However, if the next failure for the component occurs within the mean time to repair for that component, then it is not possible to continue computation on that component. The component reaches state "0". We declare that protecting this component cannot cover this second failure.

There are other possible transitions between these states, shown as dotted lines in Figure 1. They are: (i) state "0" to "2", when both the original and spare components are repaired and the component returns to its normal state; (ii) state "0" to "1", when one of the failed components (the original or the spare) is repaired and computation continues on this component; (iii) state "2" to "0", when both components fail at the same time. However, it is to be noted that for the analysis based on the data these transitions need not be considered. There would not be a transition from state "2" to "0", since the data represent failures only in one single component and would not therefore have two simultaneous failures. The purpose of the analysis (aided by the state transition diagram) is to decide whether a particular failure can be covered by protecting this component or not. Once the component reaches state "0" it is declared that the failure cannot be covered by protecting it. Therefore, outward transitions from state "0" (to states "1" and "2") are not considered.

[Figure 1: State transition diagram for component failure with single fault-tolerance. State "2": both original and spare working; state "1": first failure, failure of the original node, repaired within the mean time to repair for the node; state "0": a second failure occurs before the repair completes.]

Based on this analysis, conducted for each component individually, we evaluate all the failures that are covered by providing fault-tolerance to that component. This analysis provides an estimate of the components which, when duplicated, provide the most benefit in terms of improvement in the MTTF of the system. The next part of the study is used to achieve application requirements of MTTF.

Before choosing a component to be duplicated, let T be the total time the system was in operation and let n be the number of failures. Then the MTTF of the system at this time is given by MTTF = T/n. Let f_i be the number of failures in a component i that are covered by duplicating it and let d_i be the downtime due to these failures. If component i is duplicated, the total time the system is in operation is given by T + d_i and the number of failures is n − f_i. Therefore, the MTTF of the system if component i is duplicated is given by:

    MTTF_i = (T + d_i) / (n − f_i)

If component i is to be chosen as the next best candidate for duplication in improving the MTTF of the system, then:

    (T + d_i) / (n − f_i) ≥ (T + d_j) / (n − f_j)  ∀ j ≠ i

We note that this fraction depends not only on d_i and f_i but also on T and n. The choice of the best component i therefore cannot be made only by comparing the corresponding d_i's and f_i's. Rather, before making every consecutive choice of the best component, the current T and n must be noted, and the fraction (T + d_i)/(n − f_i) must be calculated for each component i that has not yet been duplicated. Then the component j that gives the maximum value of this fraction is chosen for duplication.

5 Root Cause Analysis

Referring to [10] we see that the 22 systems under study are divided into 8 categories based on the types of CPU and memory and the network configuration. Of these we will limit our analysis to sizeable systems (with more than 500 processors). Thus only systems of Type E, F and G are considered in this analysis.
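The estimation and coverage machinery of Section 4 reduces to one chronological pass over a component's records: TTFs and TTRs fall out of consecutive start/repair times, and the Figure 1 state machine reduces, for this data, to the rule that a failure is uncovered exactly when it arrives within the mean repair time of the previous failure. A minimal sketch under those definitions; the field and function names are illustrative, not from the paper:

```python
def mttf_mttr(records, install_time):
    """Per-component MTTF/MTTR from records sorted by "Prob Started".

    `records` is a list of (prob_started, prob_fixed) times.  TTF_1 is
    measured from installation; TTF_i (i > 1) from the previous repair,
    as in Eq. 2.  MTTF and MTTR are the averages (Eq. 1).
    """
    ttfs, ttrs, prev_fixed = [], [], install_time
    for started, fixed in records:
        ttfs.append(started - prev_fixed)   # time to this failure
        ttrs.append(fixed - started)        # time to repair it
        prev_fixed = fixed
    n = len(records)
    return sum(ttfs) / n, sum(ttrs) / n

def covered_failures(failure_times, mttr):
    """Figure 1 coverage rule for a component backed by one spare: a
    failure is uncovered (state "0") if it occurs within `mttr` of the
    previous failure, before the failed half could be repaired."""
    flags, prev = [], None
    for t in failure_times:
        flags.append(prev is None or (t - prev) >= mttr)
        prev = t
    return flags

# Failures at t=10 and t=22 (repaired at 12 and 25), installed at t=0:
# TTFs are 10 and 10, TTRs are 2 and 3.
mttf, mttr = mttf_mttr([(10, 12), (22, 25)], install_time=0)
# With MTTR = 5, a failure arriving 3 time units after the previous one
# cannot be covered by the spare.
flags = covered_failures([10, 13, 30], mttr=5)
```

In the paper's procedure the same pass is run once per component, and the covered failures and their downtimes feed the d_i and f_i of Section 4.1.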
5.1 Failures for all systems

Failure data analysis independent of the system brings out the impact of the specific category and sub-component where the failure occurred. For this reason, this section focuses on the distribution of failures from all systems and their impact. Figure 3 shows the distribution of failures across the six categories. Figure 3(a) shows the frequency of occurrence of the failures, Figure 3(b) shows the total downtime caused by these failures, and Figure 3(c) shows the average downtime due to each of the six categories. From Figure 3(a) we can see that most of the failures occurred in hardware components of the systems, while human error caused the least number of errors. Figure 3(b) shows that hardware components also had the highest overall impact on the system in terms of their combined contribution to the total downtime. However, when seen on an average per failure (as shown in Figure 3(c)), a failure occurring in the facilities category had a higher impact (in terms of downtime) than one in any other category.

[Figure 3: Failure Distribution for All Systems — (a) failure frequency, (b) total downtime, and (c) average downtime per failure, for the six categories.]

5.2 Failure distribution for all systems within a category

Of the six categories, Operator Error, Network Error and Undetermined have only 1 to 3 sub-components. Therefore, we do not consider these in our detailed analysis of failures in sub-components. The data shown consider only failures in the Facilities, Software and Hardware categories, for all systems put together and for individual systems within a characteristic group. Among the 22 systems, we will focus on the larger systems for the root cause analysis, to understand the most critical components in each system. These large systems are further divided into groups based on their characteristics such as CPU type, memory type etc. The groups and the constituent systems are given in Table 3.

Table 3: Grouping of systems based on configuration

Group  Systems
E      3, 4, 5, 6, 18, 19, 20, 21
F      9, 10, 11, 12, 13, 14
G      16, 2, 23

In the previous section it was determined which component would provide the highest improvement in MTTF when it is duplicated. We now analyze each system and group failures according to the component in which they occur. Based on this grouping we determine the component which, when protected, provides the best improvement in MTTF. The set of failing components for a system is a subset of those listed in Table 2.

The analytical procedure presented in Section 4.1 is used for the components as well. In place of a single component, every instance of a component throughout the entire system is protected against failures. For example, for failures in CPUs, all CPUs in the system are duplicated to cover any failures. We follow the state diagram shown in Figure 1 to determine the coverage of failures. As in the analysis shown in Section 4.1, let d_i be the downtime due to all failures in component i, let f_i be the number of failures occurring in component i, and let T and n be the total system operation time and number of failures at the time of choosing the next best component for fault-tolerance. If component i is chosen for protection, then the resultant MTTF is given by (T + d_i)/(n − f_i). The resultant MTTF on protecting each of the components, one at a time, is calculated. These values are compared and the component providing the highest MTTF is chosen. The order of choosing components for different systems is shown in the following figures.

From Figure 4 we see that systems within a group undergo similar types of failures. The set of failure categories is almost the same for all systems in a group. The curves for the improvement in MTTF of similar systems are also similar, showing that the specific components and their order of choice are also more or less similar across the systems in a group. For example, for Systems 9, 10, 11, and 12, HW-Memory Dimm (hardware) is the most critical component, followed by HW-Interconnect (Soft Error/Interface) and so on.

[Figure 2: Comparison of Critical components of systems within a group — MTTF improvement (0% to 80%) for Systems 9, 10, 11 and 12.]

Table 4 shows the improvement in MTTF for systems within a group (here Systems 9, 10, 11, and 12) when different components are chosen as the first component for fault-tolerance. We see that all systems show more or less similar improvements for specific components, but for a few exceptions. This trend is also shown in Figure 2, which graphically depicts the data presented in Table 4. The x-axis shows components in which errors occurred in any of the 4 systems. Each set of 4 consecutive columns associated with a component represents the MTTF improvement in the 4 systems had this component been chosen as the first critical component for fault-tolerance. Most of the peaks are in the beginning few components, and consistently for all systems there is little or no MTTF improvement for the components towards the right side of the x-axis.
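The "protect one component at a time, recompute, pick the best" procedure above can be sketched as a greedy selection over (T + d_i)/(n − f_i), with T and n updated after every choice as Section 4.1 requires. All names and numbers here are illustrative, not from the paper's data:

```python
def duplicated_mttf(T, n, d, f):
    """Resultant system MTTF if a component with covered downtime d and
    covered failure count f is protected (Section 4.1)."""
    return (T + d) / (n - f)

def greedy_order(T, n, stats):
    """Repeatedly pick the unprotected component maximizing the resultant
    MTTF, updating T and n after each choice.  stats: {name: (d_i, f_i)}."""
    remaining, order = dict(stats), []
    while remaining:
        best = max(remaining,
                   key=lambda c: duplicated_mttf(T, n, *remaining[c]))
        d, f = remaining.pop(best)
        T, n = T + d, n - f   # recovered downtime becomes uptime;
        order.append(best)    # covered failures no longer count
    return order

# Hypothetical system: 1000 hours of operation, 100 failures.
stats = {"HW-Memory Dimm": (40.0, 35),
         "Undet-Undetermined": (20.0, 30),
         "SW-Network": (5.0, 5)}
order = greedy_order(1000.0, 100, stats)
```

Because T and n change after each pick, this loop can produce a different ordering than simply sorting the components once by their initial d_i and f_i, which is exactly the point made at the end of Section 4.1.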
Table 4: Comparison of MTTF Improvement for different components across systems within a group

Component                     System 9  System 10  System 11  System 12
Undet-Undetermined              41.2%     32.2%      47.6%      35.3%
HW-Memory Dimm                  35.5%     63.1%      45.3%      68.5%
HW-Interconnect Interface       27.2%      6.9%       4.4%       7.3%
HW-Interconnect Soft Error      13.2%     23.7%      21.5%      17.7%
SW-Parallel File System         10.8%     11.0%       6.4%       7.1%
HW-Power Supply                  6.5%      0.0%       3.6%       1.1%
SW-Network                       5.6%      1.3%       2.3%       1.2%
SW-User code                     2.9%      1.3%       3.9%       1.2%
HW-Disk Drive                    2.7%      6.9%      33.4%       3.5%
HW-System Board                  2.0%     27.0%       5.3%       5.6%
HW-Console Network Device        1.8%      1.4%       0.8%       1.5%
HW-40MM Cooling Fan              1.5%      0.4%       0.0%       2.0%
HE-Human Error                   1.1%      1.7%       0.4%       1.8%
Net-Network                      1.1%      0.6%       1.1%       0.9%
Facs-Power Outage                1.1%      1.2%       1.1%       1.5%
SW-Upgrade/Install OS sftw       1.0%      1.1%       1.1%       1.0%
HW-Interconnect Cable            0.8%      0.0%       0.0%       0.0%
SW-Kernel software               0.4%      0.4%       0.0%       0.4%
SW-NFS                           0.4%      0.0%       0.0%       0.4%
HW-Other                         0.4%      0.0%       0.0%       0.0%
SW-Scheduler Software            0.4%      0.4%       0.4%       0.4%
HW-CPU                           0.0%      0.5%       0.4%       2.0%
HW-Heatsink bracket              0.0%      0.0%       2.6%       0.0%
HW-IDE Cable                     0.0%      0.0%       0.6%       0.0%
HW-Memory Module                 0.0%      0.0%       0.8%       1.5%
HW-Node Board                    0.0%      0.0%       0.4%       0.4%
HW-Riser Card                    0.0%      0.0%       0.4%       0.0%
HW-Temp Probe                    0.0%      2.1%       1.2%       0.4%
SW-Interconnect                  0.0%      0.0%       0.0%       0.4%
Undet-Unresolvable               0.0%      0.4%       0.0%       0.4%
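Reading the first critical component for each system off Table 4 is a per-column argmax. A small sketch over an illustrative subset of the table (component names and percentages taken from Table 4) makes the point that the best first choice differs across systems in the same group:

```python
# Subset of Table 4: percentage MTTF improvement if the component is
# chosen first for fault tolerance, per system (9, 10, 11, 12).
table4 = {
    "Undet-Undetermined":        [41.2, 32.2, 47.6, 35.3],
    "HW-Memory Dimm":            [35.5, 63.1, 45.3, 68.5],
    "HW-Interconnect Interface": [27.2,  6.9,  4.4,  7.3],
    "HW-Disk Drive":             [ 2.7,  6.9, 33.4,  3.5],
}
systems = [9, 10, 11, 12]

first_choice = {}
for i, sysid in enumerate(systems):
    # argmax over components for this system's column of the table
    first_choice[sysid] = max(table4, key=lambda c: table4[c][i])

# Systems 10 and 12 gain most from protecting HW-Memory Dimm first, while
# systems 9 and 11 gain most from covering Undetermined failures.
```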
Figure 4: Improvement in MTTF incrementally covering failures in different components for (a) System 9, (b) System 10, (c) System 11, (d) System 12. Each panel plots the percentage MTTF improvement (log scale, 10% to 10000%) against the number of nodes duplicated, with one curve per failing component.
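The shape of the curves in Figure 4 can be approximated with a simplified incremental-coverage model. This is a sketch, not the paper's model (which also uses logged repair times and node counts), and the failure counts below are hypothetical: components are protected one by one in decreasing order of failure count, and the cumulative MTTF improvement is recomputed after each step.

```python
def incremental_mttf_improvement(failure_counts, window_hours):
    """failure_counts: {component: number of failures in the window}.
    Protect components one by one, most failure-prone first, and report
    the cumulative percentage MTTF improvement after each step."""
    total = sum(failure_counts.values())
    base_mttf = window_hours / total
    curve = []
    covered = 0
    for comp, n in sorted(failure_counts.items(), key=lambda kv: kv[1], reverse=True):
        covered += n
        remaining = total - covered
        if remaining == 0:
            curve.append((comp, float("inf")))
        else:
            pct = 100.0 * (window_hours / remaining - base_mttf) / base_mttf
            curve.append((comp, pct))
    return curve

# Hypothetical counts for one system over a one-year window
counts = {"HW-Memory Dimm": 40, "Undet-Undetermined": 30, "SW-Network": 20, "HW-CPU": 10}
curve = incremental_mttf_improvement(counts, window_hours=8760.0)
# The improvement rises steeply for the first few components and saturates
# afterwards, matching the shape of the curves in Figure 4.
```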
6 Conclusions and future directions
In this paper, we have presented our analysis of the failure behavior of large scale systems using the failure logs collected by LANL on 22 of their computing clusters. We note that not all components show similar failure behavior in the systems. Our objective, therefore, was to arrive at an ordering of components to be incrementally (one by one) selected for duplication so as to achieve a target MTTF for the system after duplicating the least number of components. Using the start times and the down times logged for the failures, we derived the times to failure and the mean time to repair for failures on a component. Using these quantities, we arrived at a model for the fault coverage provided by duplicating each component and ordered the components according to the MTTF improvement that duplicating each one provides. We analyzed the failures grouped by the components in which they occur to understand the critical components and failure types. We observed that systems of similar hardware and software configurations showed similar MTTF improvement when specific components or failure types are targeted for fault tolerance.

The failure data from LANL provides node-level failure information, even though each node has multiple processors and the number of processors differs across nodes. Therefore, more fine-grained logging of failures at the processor level could further reduce the hardware overhead needed to achieve higher levels of system-level MTTF.
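The derivation of times to failure and repair times from the logs can be sketched as follows. This is an illustrative reconstruction, not the authors' code, and it assumes (hypothetically) that each log record carries a failure start time and a downtime duration in hours: time to failure is the uptime gap between the end of one outage and the start of the next, and MTTF/MTTR are the respective means.

```python
def mttf_mttr(records):
    """records: list of (start_hour, downtime_hours) failure events for one
    component, sorted by start time. Returns (MTTF, MTTR) in hours."""
    mttr = sum(down for _, down in records) / len(records)
    ttfs = []
    for (prev_start, prev_down), (start, _) in zip(records, records[1:]):
        ttfs.append(start - (prev_start + prev_down))  # uptime between outages
    mttf = sum(ttfs) / len(ttfs)
    return mttf, mttr

# Hypothetical log: three outages on one component
events = [(0.0, 2.0), (102.0, 4.0), (206.0, 2.0)]
mttf, mttr = mttf_mttr(events)
# Uptime gaps between repair end and next failure start are 100.0 and 100.0 hours
```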
