group. Among the 22 systems, we will focus on the
larger systems for the analysis for root cause analysis
configuration

Group  Systems
E      3, 4, 5, 6, 18, 19, 20, 21
F      9, 10, 11, 12, 13, 14
G      16, 2, 23

[Chart: "Comparison of Critical components of systems within a group"; y-axis: Meantime To Failure (percentage); series: System 9, System 10, System 11, System 12]
Figure 2: Comparison of MTTF Improvement for different components across systems within a group

The previous section determined which component would provide the highest improvement in MTTF when duplicated. We now analyze each system and group its failures according to the component in which they occur. Based on this grouping, we determine the component that, when protected, provides the best improvement in MTTF. The set of failing components for a system is a subset of those listed in Table 2.

The analytical procedure presented in Section 4.1 is used for the components as well. In place of a

Table 4 shows the improvement in MTTF for systems within a group (here Systems 9, 10, 11, and 12) when different components are chosen as the first component for fault-tolerance. With a few exceptions, all systems show broadly similar improvements for a given component. This trend is also shown in Figure 2, which graphically depicts the data presented in Table 4. The x-axis shows the components in which errors occurred in any of the 4 systems. Each set of 4 consecutive columns associated with a component represents the MTTF improvement in the 4 systems had this component been chosen as the first critical component for fault-tolerance. Most of the peaks fall among the first few components, and consistently across all systems there is little or no MTTF improvement for the components towards the right side of the x-axis.
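The grouping step described above (assign each failure to the component in which it occurred, then ask which component yields the largest MTTF gain when protected) can be illustrated with a minimal sketch. It does not reproduce the analytical procedure of Section 4.1, which is not shown here. It assumes a deliberately simplified model: MTTF is the observation window divided by the number of failures, and protecting a component masks all of its failures. The function name `rank_components` and the sample log are hypothetical.

```python
from collections import Counter


def rank_components(failure_components):
    """Rank components by MTTF improvement if that component alone is protected.

    failure_components: iterable of component labels, one per logged failure.

    Simplifying assumption (not the paper's model): MTTF = window / failures,
    and protecting a component removes all of its failures. The relative
    improvement is then n_c / (n - n_c), where n is the total number of
    failures and n_c the number attributed to component c.
    """
    counts = Counter(failure_components)
    total = sum(counts.values())
    ranking = []
    for component, n_c in counts.items():
        remaining = total - n_c
        # If no failures remain, the improvement is unbounded under this model.
        improvement = float("inf") if remaining == 0 else n_c / remaining
        ranking.append((component, improvement))
    ranking.sort(key=lambda item: item[1], reverse=True)
    return ranking


# Hypothetical failure log: one component label per failure record.
sample_log = ["HW-Disk Drive"] * 5 + ["HW-Memory Dimm"] * 4 + ["SW-Network"]
for name, gain in rank_components(sample_log):
    print(f"{name}: {gain:.1%}")
```

Under this model the ranking depends only on failure counts. A model that also weighs repair times or partial fault coverage, as the paper's conclusions suggest, could order the components differently.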
Figure 3: Failure Distribution for All Systems, panels (a), (b), and (c)
Table 4: Comparison of MTTF Improvement for different components across systems within a group

Component                    System 9   System 10  System 11  System 12
Undet-Undetermined           41.2%      32.2%      47.6%      35.3%
HW-Memory Dimm               35.5%      63.1%      45.3%      68.5%
HW-Interconnect Interface    27.2%      6.9%       4.4%       7.3%
HW-Interconnect Soft Error   13.2%      23.7%      21.5%      17.7%
SW-Parallel File System      10.8%      11.0%      6.4%       7.1%
HW-Power Supply              6.5%       0.0%       3.6%       1.1%
SW-Network                   5.6%       1.3%       2.3%       1.2%
SW-User code                 2.9%       1.3%       3.9%       1.2%
HW-Disk Drive                2.7%       6.9%       33.4%      3.5%
HW-System Board              2.0%       27.0%      5.3%       5.6%
HW-Console Network Device    1.8%       1.4%       0.8%       1.5%
HW-40MM Cooling Fan          1.5%       0.4%       0.0%       2.0%
HE-Human Error               1.1%       1.7%       0.4%       1.8%
Net-Network                  1.1%       0.6%       1.1%       0.9%
Facs-Power Outage            1.1%       1.2%       1.1%       1.5%
SW-Upgrade/Install OS SW     1.0%       1.1%       1.1%       1.0%
HW-Interconnect Cable        0.8%       0.0%       0.0%       0.0%
SW-Kernel software           0.4%       0.4%       0.0%       0.4%
SW-NFS                       0.4%       0.0%       0.0%       0.4%
HW-Other                     0.4%       0.0%       0.0%       0.0%
SW-Scheduler Software        0.4%       0.4%       0.4%       0.4%
HW-CPU                       0.0%       0.5%       0.4%       2.0%
HW-Heatsink bracket          0.0%       0.0%       2.6%       0.0%
HW-IDE Cable                 0.0%       0.0%       0.6%       0.0%
HW-Memory Module             0.0%       0.0%       0.8%       1.5%
HW-Node Board                0.0%       0.0%       0.4%       0.4%
HW-Riser Card                0.0%       0.0%       0.4%       0.0%
HW-Temp Probe                0.0%       2.1%       1.2%       0.4%
SW-Interconnect              0.0%       0.0%       0.0%       0.4%
Undet-Unresolvable           0.0%       0.4%       0.0%       0.4%
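To make the comparison in Table 4 concrete, the sketch below takes a few of the highest rows of the table (values copied from Table 4), picks the best first component for each system, and measures how far each component's improvement varies across the four systems. The names `TABLE4_TOP_ROWS`, `best_first_component`, and `cross_system_spread` are illustrative and not part of the original analysis.

```python
SYSTEMS = (9, 10, 11, 12)

# Percentage MTTF improvement per system, copied from Table 4
# (only a subset of rows is included here).
TABLE4_TOP_ROWS = {
    "Undet-Undetermined": (41.2, 32.2, 47.6, 35.3),
    "HW-Memory Dimm": (35.5, 63.1, 45.3, 68.5),
    "HW-Interconnect Interface": (27.2, 6.9, 4.4, 7.3),
    "HW-Interconnect Soft Error": (13.2, 23.7, 21.5, 17.7),
    "SW-Parallel File System": (10.8, 11.0, 6.4, 7.1),
    "HW-Disk Drive": (2.7, 6.9, 33.4, 3.5),
    "HW-System Board": (2.0, 27.0, 5.3, 5.6),
}


def best_first_component(table, systems):
    """Return, for each system, the component with the highest improvement."""
    best = {}
    for idx, system in enumerate(systems):
        best[system] = max(table, key=lambda name: table[name][idx])
    return best


def cross_system_spread(table):
    """Return max minus min improvement per component, in percentage points."""
    return {name: max(values) - min(values) for name, values in table.items()}


print(best_first_component(TABLE4_TOP_ROWS, SYSTEMS))
for name, spread in sorted(cross_system_spread(TABLE4_TOP_ROWS).items(),
                           key=lambda item: item[1], reverse=True):
    print(f"{name}: {spread:.1f} points")
```

A large spread flags a component whose benefit is concentrated in one system, such as HW-Disk Drive on System 11 or HW-System Board on System 10 in the rows above.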
Figure 4: Improvement in MTTF incrementally covering failures in different components for (a) System 9 (b) System 10 (c) System 11 (d) System 12
[Panel titles: "System N MTTF Improvement with Node Duplication" for N = 9, 10, 11, 12; x-axis: Number of Nodes Duplicated; y-axis: Percentage MTTF Improvement, with ticks at 10%, 100%, 1000%, and 10000%; one legend entry per component]
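The incremental covering plotted in Figure 4 can be sketched as a greedy loop: cover components one at a time, largest contributor first, and stop once a target MTTF improvement is reached. The sketch assumes the simplified count-based model in which covering a component masks all of its failures, so the cumulative improvement is covered / (total - covered). This is an illustrative assumption and not the paper's fault coverage model. The function name and the failure counts are hypothetical.

```python
def components_needed(failure_counts, target_improvement):
    """Greedily cover components until the target MTTF improvement is met.

    failure_counts: dict mapping component name to its number of failures.
    target_improvement: fractional improvement, e.g. 1.0 means +100%.

    Returns (chosen components in order, improvement achieved).
    Assumption: covering a component removes all of its failures.
    """
    total = sum(failure_counts.values())
    if total == 0:
        raise ValueError("no failures recorded")
    covered = 0
    chosen = []
    improvement = 0.0
    # Visit components from most to fewest failures.
    for component, count in sorted(failure_counts.items(),
                                   key=lambda item: item[1], reverse=True):
        chosen.append(component)
        covered += count
        remaining = total - covered
        improvement = float("inf") if remaining == 0 else covered / remaining
        if improvement >= target_improvement:
            break
    return chosen, improvement


# Hypothetical failure counts for four components.
counts = {"A": 50, "B": 30, "C": 15, "D": 5}
print(components_needed(counts, 1.0))   # target: +100%
print(components_needed(counts, 4.0))   # target: +400%
print(components_needed(counts, 10.0))  # target: +1000%
```

Because the improvement grows as covered / (total - covered), each additional component contributes more than the previous one in relative terms, which is consistent with the logarithmic y-axis used in Figure 4.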
6 Conclusions and future directions

In this paper, we have presented our analysis of the failure behavior of large-scale systems using the failure logs collected by LANL on 22 of their computing clusters. We note that not all components show similar failure behavior in the systems. Our objective, therefore, was to arrive at an ordering of components to be incrementally (one by one) selected for duplication so as to achieve a target MTTF for the system after duplicating the fewest components. Using the start times and the down times logged for the failures, we derived the time to failure and the mean time to repair for failures on a component. Using these quantities, we arrived at a model for the fault coverage provided by duplicating each component and ordered the components according to the MTTF improvement provided by duplicating each one. We analyzed the failures grouped by the components in which they occur to understand the critical components and failure types. We observed that systems of similar hardware and software configurations showed similar MTTF improvement when specific components or failure types are targeted for fault tolerance.

The failure data from LANL provides node-level failure information even though each node has multiple processors and the number of processors differs across nodes. Therefore, more fine-grained logging of failures at the processor level could further reduce the hardware overhead needed to achieve higher system-level MTTF.
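The derivation of time to failure and mean time to repair from logged start times and down times, mentioned in the conclusions, can be sketched as follows. The record layout (a start time and a down time per failure) and the choice to measure time to failure from the end of one repair to the start of the next failure are assumptions made for illustration. The actual log format and the exact definition used in the analysis may differ.

```python
from datetime import datetime, timedelta


def failure_stats(records):
    """Compute (MTTF, MTTR) for one component from its failure records.

    records: list of (start_time, down_time) tuples, where start_time is a
    datetime and down_time is a timedelta.

    Assumption: time to failure is measured from the end of the previous
    repair to the start of the next failure. Overlapping records would
    produce negative intervals and are not handled here.
    """
    ordered = sorted(records, key=lambda record: record[0])
    ttfs = []
    for (prev_start, prev_down), (next_start, _) in zip(ordered, ordered[1:]):
        ttfs.append(next_start - (prev_start + prev_down))
    mttf = sum(ttfs, timedelta()) / len(ttfs) if ttfs else None
    mttr = (sum((down for _, down in ordered), timedelta()) / len(ordered)
            if ordered else None)
    return mttf, mttr


# Hypothetical records for a single component.
records = [
    (datetime(2005, 1, 1, 0, 0), timedelta(hours=2)),
    (datetime(2005, 1, 3, 0, 0), timedelta(hours=4)),
    (datetime(2005, 1, 6, 2, 0), timedelta(hours=6)),
]
mttf, mttr = failure_stats(records)
print("MTTF:", mttf)
print("MTTR:", mttr)
```

With at least two records the function yields both quantities. With a single record only the mean time to repair is defined and the MTTF is returned as `None`.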