
Identifying Reliability-Critical Components in High Performance Computing Systems to Improve MTTF

Nithin Nakka†, Alok Choudhary†, Gary Grider§, John Bent§, James Nunez§ and Satsangat Khalsa§

†Department of Electrical Engineering and Computer Science, Northwestern University, Evanston, IL, USA
{nakka, choudhar}@eecs.northwestern.edu
§Los Alamos National Laboratory, Albuquerque, New Mexico, U.S.A.
{ggrider, johnbent, jnunez, satsang}@lanl.gov
Abstract

Mean Time To Failure, MTTF, is a commonly accepted metric for reliability. In this paper we present a novel approach to achieve the desired MTTF with minimum redundancy. We analyze the failure behavior of large scale systems using failure logs collected by Los Alamos National Laboratory. We analyze the root cause of failures and present a choice of specific hardware and software components to be made fault-tolerant to achieve target MTTF at minimum expense. Not all components show similar failure behavior in the systems. Our objective, therefore, was to arrive at an ordering of components to be incrementally selected for protection to achieve a target MTTF. We propose a model for MTTF for tolerating failures in a specific component, system-wide, and order components according to the coverage provided. Systems grouped based on hardware configuration showed similar improvements in MTTF when different components in them were targeted for fault-tolerance.

1 Introduction

Computers are being employed increasingly in highly mission- and life-critical and long-running applications. In this scenario, there is a corresponding demand for high reliability and availability of the systems. Since failures are inevitable in a system, the best use of the bad bargain is to employ fault-detection and recovery techniques to meet the requirements. Mean Time To Failure (MTTF) is an important, well-accepted measure for the reliability of a system. MTTF is the time elapsed, on average, between any two failures in the component being studied.

Broadly speaking, two types of applications demand high reliability: (i) those which can be stopped and their state captured at a suitable point and their execution resumed at a later point in time from the captured state, also called the checkpointed state; (ii) those programs that cannot be interrupted and need to execute for a minimum amount of time, till the application (or the mission) is completed. Most long-running scientific applications are examples of applications in the former category. They require efficient mechanisms to take checkpoints of the entire state of the application at a suitable point, and effective techniques to detect failures so as to roll back execution to the checkpointed state. For applications in the latter category, such as systems and software for flight or spacecraft control, a time for the length of the mission (or mission time) is pre-determined and appropriate fault-tolerance techniques need to be deployed to ensure that the entire system does not fail within the mission time. The mission time of a flight system directly determines the length of its travel and is a highly critical decision point.

The MTTF of a system is an estimate of the time for which the system can be expected to work without any failures. Therefore, for applications that can be checkpointed, the MTTF could be used to determine the checkpointing interval, within which the application's state must be checkpointed. This would ensure that the checkpoint state itself is not corrupted, and hence, by rolling back to this state on detecting a failure, the application will continue correct execution. For applications of the latter category, the MTTF can be used to determine the mission time, before which the system executes without any failure.

Understanding the failure behavior of a system can greatly benefit the design of fault-tolerance and reliability techniques for that as well as other systems with similar characteristics, thereby increasing their MTTF. Failure and repair logs are a valuable source of field failure information. The extent to which the logs aid in reliable design depends on the granularity at which the logging is performed. System-level logs could assist in system-wide techniques such as global synchronous checkpointing. However, logging at a finer granularity, like that at the node level, improves the effectiveness of techniques applied at the node level. An important observation that we make in this paper is that "all components in a system are not equal" (either by functionality or by failure behavior).

Component-level reliability information also helps in designing and deploying techniques such as duplication selectively to the most critical portions of the system. This decreases setup, maintenance and performance costs for the system. Furthermore, with a framework for selective fault-tolerance in place, the techniques could be customized to meet the specific reliability requirements of the application. In this paper we show how component-level selective fault-tolerance can be customized to meet an application's target MTTF.

The key contributions of this work are:
1. Analysis of failures in specific components and their correlation with system configuration.
2. Data-driven estimation of the coverage provided for fault tolerance in components in the system.
3. A methodology for selecting an optimal or near-optimal subset of components to be duplicated to achieve the MTTF requirements of the application.

2 Description of Systems under study and data sets

Los Alamos National Laboratory has collected and published data regarding the failure and usage of 22 of their supercomputing clusters. This data was previously analyzed by Schroeder et al. [10] from CMU to study the statistics of the data in terms of the root cause of the failures, the mean time between failures and the mean time to repair.

Table 1 provides the details of the systems. In the "Production Time" column the range of times specifies the node installation to decommissioning times. A value of "N/A" indicates that the nodes in the system had been installed before the observation period began. For these machines, we consider the beginning of the observation period (November 1996) as the starting date for calculating the production time. For some machines, multiple ranges of production times have been provided. This is because the systems were upgraded during the observation period and each production time range corresponds to a set of nodes within the system. For the sake of simplicity, for all such machines, we consider the production time to be the longest range among the specified ranges.
Table 1: Characteristics of Systems under study

Type  Category  System Number  # nodes  # procs in node  # procs  Network Topology      Production Time
A     smp       7              1        8                8        Not Applicable        N/A – 12/99
B     smp       24             1        32               32       Not Applicable        N/A – 12/03
C     smp       22             1        4                4        Not Applicable        N/A – 04/03
D     cluster   8              164      2                328      GB Ethernet Tree      04/01 – 11/05, 12/02 – 11/05
E     cluster   20             256      4                1024     Dual rail fat tree    12/01 – 11/05
E     cluster   21             128      4                512      Dual rail fat tree    09/01 – 01/02
E     cluster   18             1024     4                4096     Dual rail fat tree    05/02 – 11/05
E     cluster   19             1024     4                4096     Dual rail fat tree    10/02 – 11/05
E     cluster   3              128      4                512      Single rail fat tree  09/03 – 11/05
E     cluster   4              128      4                512      Single rail fat tree  09/03 – 11/05
E     cluster   5              128      4                512      Single rail fat tree  09/03 – 11/05
E     cluster   6              32       4                128      Single rail fat tree  09/03 – 11/05
F     cluster   14             128      2                256      Single rail fat tree  09/03 – 11/05
F     cluster   9              256      2                512      Single rail fat tree  09/03 – 11/05
F     cluster   10             256      2                512      Single rail fat tree  09/03 – 11/05
F     cluster   11             256      2                512      Single rail fat tree  09/03 – 11/05
F     cluster   13             256      2                512      Single rail fat tree  09/03 – 11/05
F     cluster   12             512      2                1024     Single rail fat tree  09/03 – 11/05, 03/05 – 06/05
G     cluster   16             16       128              2048     Multi rail fat tree   12/96 – 09/02, 01/97 – 11/05
G     cluster   2              49       128, 80          6152     Multi rail fat tree   06/05 – 11/05, 10/98 – 12/04, 01/98 – 12/04
G     cluster   23             5        128, 32          544      Multi rail fat tree   11/02 – 11/05, 11/05 – 12/04
H     numa      15             1        256              256                            11/04 – 11/05

Remark (systems of type F): Most jobs run on only one 256-node segment.
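Two conventions in Table 1 are easy to mechanize: the "# procs" column is the sum over node types of (number of nodes of that type) × (processors per node of that type), and a system with multiple production-time ranges is assigned its single longest range, with "N/A" starts replaced by the observation start of November 1996. A minimal sketch of both rules; the function names and date handling are ours, not the paper's:

```python
from datetime import date

def total_processors(node_types):
    """Sum of n_i * p_i over node types (n_i nodes with p_i processors each)."""
    return sum(n * p for n, p in node_types)

def production_range(ranges, obs_start=date(1996, 11, 1)):
    """Pick the longest production-time range for a system.

    `ranges` is a list of (start, end) date pairs; a start of None stands
    for "N/A" (nodes installed before the observation period), so the
    observation start, November 1996, is substituted.
    """
    normalized = [((s or obs_start), e) for s, e in ranges]
    return max(normalized, key=lambda r: (r[1] - r[0]).days)

# System 8 (type D): 164 nodes with 2 processors each -> 328 processors,
# and the longer of its two production ranges is kept.
procs = total_processors([(164, 2)])
longest = production_range([(date(2001, 4, 1), date(2005, 11, 1)),
                            (date(2002, 12, 1), date(2005, 11, 1))])
```

For a heterogeneous system, `total_processors` would be called with one (count, procs) pair per node type.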
The failure data for a single system includes the following fields:

System: The number of the system under study (as referred to by Los Alamos National Laboratory).
machine type: The type of the system (smp, cluster, numa).
number of nodes: The number of nodes in the system.
number of processors per node: Some systems are heterogeneous in that they contain nodes with different numbers of processors per node.
number of processors total: Total number of processors in the system, ∑ n_i × p_i, where, for a system with k node types, n_i is the number of nodes of type i and p_i is the number of processors in a node of type i.
node number starting with zero: This node numbering is provided by Los Alamos so as to maintain consistency among all systems in terms of node numbering, for easier comparison of systems.
install date: Date that the node was installed in the system. Since the systems under study were upgraded during the period of study, this field can vary for nodes within a system.
production date: Date, after the installation date, when the node has been tested for initial burn-in effects and known operational failures, so that the node could be used to run production applications.
decommission date: Date that the node was removed from the system.
field replaceable unit type: Whether, on a failure, the repair procedure involves replacing the entire unit or a portion of it.
memory: The amount of memory on the node.
node purpose: The function of the node in the overall system: (i) compute, (ii) front end, (iii) graphics, or a combination of the three such as graphics.fe (both graphics and front end).
Problem Started (mm/dd/yy hh:mm): The time of occurrence of the failure event. This is logged by an automated fault detection engine.
Problem Fixed (mm/dd/yy hh:mm): The time at which the failure was repaired and the system restored. This is logged by the administrators and repair staff.
Down Time: The time elapsed between the occurrence of the problem and restoration of the system.

The root cause for the failures is classified in the following categories: (i) Facilities, (ii) Hardware, (iii) Human Error, (iv) Network, (v) Undetermined, or (vi) Software.

The data includes a field for the root cause for each failure as determined by the administrators or maintenance personnel of the systems. This field provides information on the specific component of the node/system that caused the failure. The root causes are broadly classified into Facilities, Hardware, Operator Error (or Human Error), Network Error, and Software. Those failures whose root cause could not be resolved are placed in the Undetermined category. In this analysis we consider failures for all systems together as well as for each system at a time. The first approach provides insight into the failure rate of each component irrespective of the type of system it is part of. By conditioning this analysis on the system, we can understand how the failure behavior of each of the components changes with the specific type and configuration of the system.

Each of the six broad categories is further classified into sub-categories or components. The failure analysis traces the root cause of each failure to one of these sub-categories/components. Table 2 tabulates the failure categories and their corresponding set of sub-categories/components (for brevity we will refer to sub-categories/components as sub-components for the following discussion).

Table 2: Failure Categories and subcomponents

Failure Category  Failing Sub-Category or Sub-Component
Operator Error    Human Error
Network Error     Network
Undetermined      Security, Unresolvable, Undetermined
Facilities        Environment, Chillers, Power Spike, UPS, Power Outage
Software          Compilers and libraries, Scratch Drive, Security Software, Vizscratch FS, …
Hardware          WACS Logic, SSD Logic, Site Network Interface, KGPSA, SAN Fiber Cable, …

3 Related Work

There has been significant study over the past few decades on analyzing failure logs from large-scale computers to understand the failure behavior of such systems and possibly use the knowledge in improving system design. Plank et al. [1] derive an optimal checkpointing interval for an application using the failure logs obtained from networks of workstations. They derive the failure rate of the machines using the time between failures observed in the logs. In another study, Nath et al. [2] study real-world failure traces to recommend design principles that can be used to tolerate correlated failures. In particular, they have used them in recommending data placement strategies for correlated failures.

Previous work in analyzing the failure data set used in this paper also aimed at optimizing node-level redundancy [3]. However, the metric used was erroneously referred to as Mean Time To Failure (MTTF), even though the theory, analysis and results were presented for the metric Mean Time Between Failures (MTBF). It is to be noted that MTBF is not the same as the MTTF. Apart from the MTTF itself, MTBF includes the time required to repair the previous failure. MTTR is defined as the Mean Time To Repair a failure. Therefore, MTBF = MTTF + MTTR. In this paper, we perform the analysis and present the results for the metric MTTF.

Prior work in analyzing failure rates tried to arrive at reasonable curves that fit the failure distribution of the systems [4]-[9]. Schroeder and Gibson [10] have presented the characteristics of the failure data that has been used in this study as well. However, a limitation of these studies is that they do not evaluate how system design choices could be affected by the failure characteristics that they have derived from the data. In this work we specifically attempt to understand the failure behavior and use that knowledge in the design of an optimal component-level fault-tolerance strategy, with the aim of increasing the overall availability of the system at the least cost.

There is also interesting research work in understanding the correlations between system parameters and failure rate. Sahoo et al. [7] show that the workload of the system is closely correlated with its failure rate, whereas Iyer [11] and Castillo [12] bring out the correlation between workload intensity and the failure rate. In our study we examine the dependency of failure rate on network configuration, which in turn determines the workload characteristics of the system. For example, a fat tree topology necessitates higher communication and computation bandwidth and load at the higher levels of the tree structure.

Oliner and Stearley [13] have analyzed system logs from five supercomputers and critically evaluate the interpretations of system administrators from patterns observed in the logs. They propose a filtering algorithm to identify alerts from system logs. They also recommend enhancements to the logging procedures so as to include information crucial in distinguishing alerts from non-alerts.

Lan et al. [14][15] have proposed a machine-learning based automatic diagnosis and prognosis engine for failures through their analysis of the logs on Blue Gene/L systems deployed at multiple locations. Their goal is to feed the knowledge inferred from the logs to checkpointing and migration tools [16][17] to reduce the overall application completion time. Oliner et al. [18] derive failure distributions from multiple supercomputing systems and propose novel job-scheduling algorithms that take into account the occurrence of failures in the system. They evaluated the impact of this on the average bounded slowdown, average response time and system utilization. In our study we have utilized failure and repair time distributions to propose a novel approach to selective fault-tolerance. The metric of evaluation used is the mean time to failure (MTTF) of the system.

Duplication, both at the system and node level, has been a topic of active and extensive research in the micro-architecture and fault-tolerant computing areas. Error Detection Using Duplicated Instructions (EDDI) [22] duplicates original instructions in the program but with different registers and variables. Duplication at the application level increases the code size of the application in memory. More importantly, it reduces the instruction supply bandwidth from the memory to the processor. Error Detection by Diverse Data and Duplicated Instructions (ED4I) [23] is a software-implemented hardware fault tolerance technique in which two "different" programs with the same functionality are executed, but with different data sets, and their outputs are compared. The "different" programs are generated by multiplying all variables and constants in the original program by a diversity factor k.

In the realm of commercial processors, the IBM G5 processor [24] has extra I- and E-units to provide duplicate execution of instructions. To support duplicate execution, the G5 is restricted to a single-issue processor and incurs 35% hardware overhead. In experimental research, simultaneous multithreading (SMT) [25] and chip multiprocessor (CMP) architectures have been ideal bases for space- and time-redundant fault-tolerant designs because of their inherent redundancy. In a simultaneously and redundantly threaded (SRT) processor, only instructions whose side effects are visible beyond the boundaries of the processor core are checked [26]-[28]. This was subsequently extended in SRTR to include recovery [21]. Another fault-tolerant architecture is proposed in the DIVA design [19][20]. DIVA comprises an aggressive out-of-order superscalar processor along with a simple in-order checker processor. Microprocessor-based introspection (MBI) [29] achieves time redundancy by scheduling the redundant execution of a program during idle cycles in which a long-latency cache miss is being serviced. SRTR [21] and MBI [29] have reported up to 30% performance overhead. These results counter the widely-held belief that full duplication at the processor level incurs little or no performance overhead.

SLICK [30] is an SRT-based approach to provide partial replication of an application. The goals of this approach are similar to ours. However, unlike this approach, we do not rely on a multi-threaded architecture for the replication. Instead, this paper presents modifications to a general superscalar
processor to support partial or selective replication of the application.

As for research and production systems employing system-level duplication, the space mission to land on the moon used a TMR-enhanced computer system [31]. The TANDEM, now HP, Integrity S2 computer system [32] provided reliability through the concept of full duplication at the hardware level. The AT&T No. 5 ESS telecommunications switch [33], [34] uses duplication in its administrative module, consisting of the 3B20S processor, an I/O processor, and an automatic message accounting unit, to provide high reliability and availability. The JPL STAR computer [35] system for space applications primarily used hardware subsystem fault-tolerance techniques, such as functional unit redundancy, voting, power-spare switching, coding, and self-checks.

4 Approach

This section describes our approach for analyzing the data and building a model used for selective component-level fault-tolerance. The records for a single component are ordered according to the time of occurrence of the failure, as given by the "Prob Started" field. The time elapsed between the repair of one failure and the occurrence of the next failure in this ordered list gives the time to failure for the second failure (in the case of the first failure, the time to failure is the time from the installation of the component to the failure). Following this procedure, the times to failure for each failure of the component are calculated. The average of all these times is used as an estimate of the mean time to failure (MTTF) for the component. Letting s_i be the "Prob Started" time and e_i the "Prob Ended" time of failure i, and t_install the installation time, the Time To Failure (TTF) for a failure i is given by:

    TTF_1 = s_1 − t_install
    TTF_i = s_i − e_{i−1},  i > 1                                  Eq. 2

and therefore,

    ∑_{i=1}^{n} TTF_i = s_n − t_install − ∑_{i=1}^{n−1} (e_i − s_i)
                      = s_n − t_install − ∑_{i=1}^{n−1} TTR_i

MTTF is calculated as:

    MTTF = (1/n) ∑_{i=1}^{n} TTF_i                                 Eq. 1

The period of study of a system is its production time, defined elsewhere as the time between its installation and its decommissioning or the end of the observation period, whichever occurs first. Total downtime for all failures is the sum of the downtimes of all failures observed for the system.

The "Downtime" field provides the time required by the administrators to fix the failure and bring the system back to its original state. It can be calculated as the difference between the "Prob Ended" and "Prob Started" fields. This is the repair time for this failure. Averaging this field over all the records for a single component provides an estimate of the mean time to repair (MTTR) for this component. The Time To Repair (TTR) for a failure i is:

    TTR_i = e_i − s_i

Therefore, the Mean Time To Repair is given by:

    MTTR = (1/n) ∑_{i=1}^{n} TTR_i

4.1 Introducing component protection

From the previous analysis procedures, the MTTF and the MTTR for a component have been estimated. Now, we introduce a methodology to understand the effect on component failure if we augment it with a spare component. Figure 1 shows the states through which a duplicated component transitions on failure and repair events. When both the original and the spare component are working correctly, the system is in state "2". It is assumed that, after a failure is detected in the component, the system has the reconfiguration capability to fail over to the spare component instantaneously. Computation therefore continues uninterrupted. This state of the component is represented by state "1" in the figure. In the meantime, the original component is repaired. The roles of the spare component and the original component are switched. If no other failure occurs in the component before the original component is repaired, then the original component assumes the role of the spare component, while the computation continues on the spare component. Essentially, the system is brought back to its pristine, fault-free state (state "2" in the figure). However, if the next failure for the component occurs within the mean time to repair for that component, then it is not possible to continue computation on that component. The component reaches state "0". We declare that protecting this component cannot cover this second failure.

There are other possible transitions between these states, shown as dotted lines in Figure 1. They are: (i) state "0" to "2", when both the original and spare components are repaired and the component returns to its normal state; (ii) state "0" to "1", when one of the failed components (the original or the spare) is repaired and computation continues on this component; (iii) state "2" to "0", when both components fail at the same time. However, it is to be noted that for the analysis based on the data these transitions need not be considered. There would not be a transition from state "2" to "0", since the data represent failures only in one single component and would not therefore have two simultaneous failures. The purpose of the analysis (aided by the state transition diagram) is to decide whether a particular failure can be covered by protecting this component or not. Once the component reaches state "0" it is declared that the failure cannot be covered by protecting it. Therefore, outward transitions from state "0" (to states "1" and "2") are not considered.

[Figure 1: State transition diagram for component failure with single fault-tolerance. State "2": both original and spare working; state "1": first failure, failure of the original node, repaired within the mean time to repair for the node; state "0": a second failure occurs before the repair completes.]

Based on this analysis, conducted for each component individually, we evaluate all the failures that are covered by providing fault-tolerance to that component. This analysis provides an estimate of the components which, when duplicated, provide the most benefit in terms of improvement in the MTTF of the system. The next part of the study is used to achieve application requirements of MTTF.

Before choosing a component to be duplicated, let T be the total time the system was in operation and let n be the number of failures. Then the MTTF of the system at this time is given by MTTF = T/n. Let f_i be the number of failures in a component i that are covered by duplicating it and let d_i be the downtime due to these failures. If component i is duplicated, the total time the system is in operation is given by T + d_i and the number of failures is n − f_i. Therefore, the MTTF of the system if component i is duplicated is given by:

    MTTF_i = (T + d_i) / (n − f_i)

If component i is to be chosen as the next best candidate for duplication in improving the MTTF of the system, then:

    (T + d_i) / (n − f_i) ≥ (T + d_j) / (n − f_j)  ∀ j ≠ i

We note that this fraction depends not only on d_i and f_i but also on T and n. The choice of the best component i therefore cannot be made only by comparing the corresponding d_i's and f_i's. Rather, before making every consecutive choice of the best component, the current T and n must be noted, and the fraction (T + d_i)/(n − f_i) must be calculated for each component i that has not yet been duplicated. Then the component j that gives the maximum value of this fraction is chosen for duplication.

5 Root Cause Analysis

Referring to [10] we see that the 22 systems under study are divided into 8 categories based on the types of CPU and memory and the network configuration. Of these we will limit our analysis to sizeable systems (with more than 500 processors). Thus only systems of Type E, F and G are considered in this analysis.
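The estimation and coverage machinery of Section 4 reduces to one chronological pass over a component's records: TTFs and TTRs fall out of consecutive start/repair times, and the Figure 1 state machine reduces, for this data, to the rule that a failure is uncovered exactly when it arrives within the mean repair time of the previous failure. A minimal sketch under those definitions; the field and function names are illustrative, not from the paper:

```python
def mttf_mttr(records, install_time):
    """Per-component MTTF/MTTR from records sorted by "Prob Started".

    `records` is a list of (prob_started, prob_fixed) times.  TTF_1 is
    measured from installation; TTF_i (i > 1) from the previous repair,
    as in Eq. 2.  MTTF and MTTR are the averages (Eq. 1).
    """
    ttfs, ttrs, prev_fixed = [], [], install_time
    for started, fixed in records:
        ttfs.append(started - prev_fixed)   # time to this failure
        ttrs.append(fixed - started)        # time to repair it
        prev_fixed = fixed
    n = len(records)
    return sum(ttfs) / n, sum(ttrs) / n

def covered_failures(failure_times, mttr):
    """Figure 1 coverage rule for a component backed by one spare: a
    failure is uncovered (state "0") if it occurs within `mttr` of the
    previous failure, before the failed half could be repaired."""
    flags, prev = [], None
    for t in failure_times:
        flags.append(prev is None or (t - prev) >= mttr)
        prev = t
    return flags

# Failures at t=10 and t=22 (repaired at 12 and 25), installed at t=0:
# TTFs are 10 and 10, TTRs are 2 and 3.
mttf, mttr = mttf_mttr([(10, 12), (22, 25)], install_time=0)
# With MTTR = 5, a failure arriving 3 time units after the previous one
# cannot be covered by the spare.
flags = covered_failures([10, 13, 30], mttr=5)
```

In the paper's procedure the same pass is run once per component, and the covered failures and their downtimes feed the d_i and f_i of Section 4.1.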
5.1 Failures for all systems

Failure data analysis independent of the system brings out the impact of the specific category and sub-component where the failure occurred. For this reason, this section focuses on the distribution of failures from all systems and their impact. Figure 3 shows the distribution of failures across the six categories. Figure 3(a) shows the frequency of occurrence of the failures, Figure 3(b) shows the total downtime caused by these failures, and Figure 3(c) shows the average downtime due to each of the six categories. From Figure 3(a) we can see that most of the failures occurred in hardware components of the systems, while human error caused the least number of errors. Figure 3(b) shows that hardware components also had the highest overall impact on the system in terms of their combined contribution to the total downtime. However, when seen on an average per failure (as shown in Figure 3(c)), a failure occurring in the facilities category had a higher impact (in terms of downtime) than one in any other category.

[Figure 3: Failure Distribution for All Systems — (a) failure frequency, (b) total downtime, and (c) average downtime per failure, for the six categories.]

5.2 Failure distribution for all systems within a category

Of the six categories, Operator Error, Network Error and Undetermined have only 1 to 3 sub-components. Therefore, we do not consider these in our detailed analysis of failures in sub-components. The data shown consider only failures in the Facilities, Software and Hardware categories, for all systems put together and for individual systems within a characteristic group. Among the 22 systems, we will focus on the larger systems for the root cause analysis, to understand the most critical components in each system. These large systems are further divided into groups based on their characteristics such as CPU type, memory type etc. The groups and the constituent systems are given in Table 3.

Table 3: Grouping of systems based on configuration

Group  Systems
E      3, 4, 5, 6, 18, 19, 20, 21
F      9, 10, 11, 12, 13, 14
G      16, 2, 23

In the previous section it was determined which component would provide the highest improvement in MTTF when it is duplicated. We now analyze each system and group failures according to the component in which they occur. Based on this grouping we determine the component which, when protected, provides the best improvement in MTTF. The set of failing components for a system is a subset of those listed in Table 2.

The analytical procedure presented in Section 4.1 is used for the components as well. In place of a single component, every instance of a component throughout the entire system is protected against failures. For example, for failures in CPUs, all CPUs in the system are duplicated to cover any failures. We follow the state diagram shown in Figure 1 to determine the coverage of failures. As in the analysis shown in Section 4.1, let d_i be the downtime due to all failures in component i, let f_i be the number of failures occurring in component i, and let T and n be the total system operation time and number of failures at the time of choosing the next best component for fault-tolerance. If component i is chosen for protection, then the resultant MTTF is given by (T + d_i)/(n − f_i). The resultant MTTF on protecting each of the components, one at a time, is calculated. These values are compared and the component providing the highest MTTF is chosen. The order of choosing components for different systems is shown in the following figures.

From Figure 4 we see that systems within a group undergo similar types of failures. The set of failure categories is almost the same for all systems in a group. The curves for the improvement in MTTF of similar systems are also similar, showing that the specific components and their order of choice are also more or less similar across the systems in a group. For example, for Systems 9, 10, 11, and 12, HW-Memory Dimm (hardware) is the most critical component, followed by HW-Interconnect (Soft Error/Interface) and so on.

[Figure 2: Comparison of Critical components of systems within a group — MTTF improvement (0% to 80%) for Systems 9, 10, 11 and 12.]

Table 4 shows the improvement in MTTF for systems within a group (here Systems 9, 10, 11, and 12) when different components are chosen as the first component for fault-tolerance. We see that all systems show more or less similar improvements for specific components, but for a few exceptions. This trend is also shown in Figure 2, which graphically depicts the data presented in Table 4. The x-axis shows components in which errors occurred in any of the 4 systems. Each set of 4 consecutive columns associated with a component represents the MTTF improvement in the 4 systems had this component been chosen as the first critical component for fault-tolerance. Most of the peaks are in the beginning few components, and consistently for all systems there is little or no MTTF improvement for the components towards the right side of the x-axis.
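The "protect one component at a time, recompute, pick the best" procedure above can be sketched as a greedy selection over (T + d_i)/(n − f_i), with T and n updated after every choice as Section 4.1 requires. All names and numbers here are illustrative, not from the paper's data:

```python
def duplicated_mttf(T, n, d, f):
    """Resultant system MTTF if a component with covered downtime d and
    covered failure count f is protected (Section 4.1)."""
    return (T + d) / (n - f)

def greedy_order(T, n, stats):
    """Repeatedly pick the unprotected component maximizing the resultant
    MTTF, updating T and n after each choice.  stats: {name: (d_i, f_i)}."""
    remaining, order = dict(stats), []
    while remaining:
        best = max(remaining,
                   key=lambda c: duplicated_mttf(T, n, *remaining[c]))
        d, f = remaining.pop(best)
        T, n = T + d, n - f   # recovered downtime becomes uptime;
        order.append(best)    # covered failures no longer count
    return order

# Hypothetical system: 1000 hours of operation, 100 failures.
stats = {"HW-Memory Dimm": (40.0, 35),
         "Undet-Undetermined": (20.0, 30),
         "SW-Network": (5.0, 5)}
order = greedy_order(1000.0, 100, stats)
```

Because T and n change after each pick, this loop can produce a different ordering than simply sorting the components once by their initial d_i and f_i, which is exactly the point made at the end of Section 4.1.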
Table 4: Comparison of MTTF Improvement for different components across systems within a group

Component                     System 9  System 10  System 11  System 12
Undet-Undetermined              41.2%     32.2%      47.6%      35.3%
HW-Memory Dimm                  35.5%     63.1%      45.3%      68.5%
HW-Interconnect Interface       27.2%      6.9%       4.4%       7.3%
HW-Interconnect Soft Error      13.2%     23.7%      21.5%      17.7%
SW-Parallel File System         10.8%     11.0%       6.4%       7.1%
HW-Power Supply                  6.5%      0.0%       3.6%       1.1%
SW-Network                       5.6%      1.3%       2.3%       1.2%
SW-User code                     2.9%      1.3%       3.9%       1.2%
HW-Disk Drive                    2.7%      6.9%      33.4%       3.5%
HW-System Board                  2.0%     27.0%       5.3%       5.6%
HW-Console Network Device        1.8%      1.4%       0.8%       1.5%
HW-40MM Cooling Fan              1.5%      0.4%       0.0%       2.0%
HE-Human Error                   1.1%      1.7%       0.4%       1.8%
Net-Network                      1.1%      0.6%       1.1%       0.9%
Facs-Power Outage                1.1%      1.2%       1.1%       1.5%
SW-Upgrade/Install OS sftw       1.0%      1.1%       1.1%       1.0%
HW-Interconnect Cable            0.8%      0.0%       0.0%       0.0%
SW-Kernel software               0.4%      0.4%       0.0%       0.4%
SW-NFS                           0.4%      0.0%       0.0%       0.4%
HW-Other                         0.4%      0.0%       0.0%       0.0%
SW-Scheduler Software            0.4%      0.4%       0.4%       0.4%
HW-CPU                           0.0%      0.5%       0.4%       2.0%
HW-Heatsink bracket              0.0%      0.0%       2.6%       0.0%
HW-IDE Cable                     0.0%      0.0%       0.6%       0.0%
HW-Memory Module                 0.0%      0.0%       0.8%       1.5%
HW-Node Board                    0.0%      0.0%       0.4%       0.4%
HW-Riser Card                    0.0%      0.0%       0.4%       0.0%
HW-Temp Probe                    0.0%      2.1%       1.2%       0.4%
SW-Interconnect                  0.0%      0.0%       0.0%       0.4%
Undet-Unresolvable               0.0%      0.4%       0.0%       0.4%
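Reading the first critical component for each system off Table 4 is a per-column argmax. A small sketch over an illustrative subset of the table (component names and percentages taken from Table 4) makes the point that the best first choice differs across systems in the same group:

```python
# Subset of Table 4: percentage MTTF improvement if the component is
# chosen first for fault tolerance, per system (9, 10, 11, 12).
table4 = {
    "Undet-Undetermined":        [41.2, 32.2, 47.6, 35.3],
    "HW-Memory Dimm":            [35.5, 63.1, 45.3, 68.5],
    "HW-Interconnect Interface": [27.2,  6.9,  4.4,  7.3],
    "HW-Disk Drive":             [ 2.7,  6.9, 33.4,  3.5],
}
systems = [9, 10, 11, 12]

first_choice = {}
for i, sysid in enumerate(systems):
    # argmax over components for this system's column of the table
    first_choice[sysid] = max(table4, key=lambda c: table4[c][i])

# Systems 10 and 12 gain most from protecting HW-Memory Dimm first, while
# systems 9 and 11 gain most from covering Undetermined failures.
```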
Figure 4: Improvement in MTTF incrementally covering failures in different components for (a) System 9, (b) System 10, (c) System 11, (d) System 12. Each panel plots the percentage MTTF improvement (log scale, 10% to 10000%) against the number of nodes duplicated, with one curve per failing component.
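The shape of the curves in Figure 4 can be approximated with a simplified incremental-coverage model. This is a sketch, not the paper's model (which also uses logged repair times and node counts), and the failure counts below are hypothetical: components are protected one by one in decreasing order of failure count, and the cumulative MTTF improvement is recomputed after each step.

```python
def incremental_mttf_improvement(failure_counts, window_hours):
    """failure_counts: {component: number of failures in the window}.
    Protect components one by one, most failure-prone first, and report
    the cumulative percentage MTTF improvement after each step."""
    total = sum(failure_counts.values())
    base_mttf = window_hours / total
    curve = []
    covered = 0
    for comp, n in sorted(failure_counts.items(), key=lambda kv: kv[1], reverse=True):
        covered += n
        remaining = total - covered
        if remaining == 0:
            curve.append((comp, float("inf")))
        else:
            pct = 100.0 * (window_hours / remaining - base_mttf) / base_mttf
            curve.append((comp, pct))
    return curve

# Hypothetical counts for one system over a one-year window
counts = {"HW-Memory Dimm": 40, "Undet-Undetermined": 30, "SW-Network": 20, "HW-CPU": 10}
curve = incremental_mttf_improvement(counts, window_hours=8760.0)
# The improvement rises steeply for the first few components and saturates
# afterwards, matching the shape of the curves in Figure 4.
```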
6 Conclusions and future directions
In this paper, we have presented our analysis of the failure behavior of large scale systems using the failure logs collected by LANL on 22 of their computing clusters. We note that not all components show similar failure behavior in the systems. Our objective, therefore, was to arrive at an ordering of components to be incrementally (one by one) selected for duplication so as to achieve a target MTTF for the system after duplicating the least number of components. Using the start times and the down times logged for the failures, we derived the times to failure and the mean time to repair for failures on a component. Using these quantities, we arrived at a model for the fault coverage provided by duplicating each component and ordered the components according to the MTTF improvement that duplicating each one provides. We analyzed the failures grouped by the components in which they occur to understand the critical components and failure types. We observed that systems of similar hardware and software configurations showed similar MTTF improvement when specific components or failure types are targeted for fault tolerance.

The failure data from LANL provides node-level failure information, even though each node has multiple processors and the number of processors differs across nodes. Therefore, more fine-grained logging of failures at the processor level could further reduce the hardware overhead needed to achieve higher levels of system-level MTTF.
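The derivation of times to failure and repair times from the logs can be sketched as follows. This is an illustrative reconstruction, not the authors' code, and it assumes (hypothetically) that each log record carries a failure start time and a downtime duration in hours: time to failure is the uptime gap between the end of one outage and the start of the next, and MTTF/MTTR are the respective means.

```python
def mttf_mttr(records):
    """records: list of (start_hour, downtime_hours) failure events for one
    component, sorted by start time. Returns (MTTF, MTTR) in hours."""
    mttr = sum(down for _, down in records) / len(records)
    ttfs = []
    for (prev_start, prev_down), (start, _) in zip(records, records[1:]):
        ttfs.append(start - (prev_start + prev_down))  # uptime between outages
    mttf = sum(ttfs) / len(ttfs)
    return mttf, mttr

# Hypothetical log: three outages on one component
events = [(0.0, 2.0), (102.0, 4.0), (206.0, 2.0)]
mttf, mttr = mttf_mttr(events)
# Uptime gaps between repair end and next failure start are 100.0 and 100.0 hours
```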
