Empirical System Reliability
Overview
System reliability is a major challenge in system design. Unreliable systems are not only a major source of user frustration,
they are also expensive: avoiding downtime and the cost of actual downtime together make up more than 40% of the total cost of
ownership for modern IT systems. Unfortunately, with the large component counts in today's large-scale systems, failures
are quickly becoming the norm rather than the exception.
We believe that the key to building more reliable systems is to first better understand what makes systems unreliable, i.e.
what failures in today's large-scale production systems look like. Although system reliability has been a key concern
since the first computer systems were built 50 years ago, we know embarrassingly little about the basic characteristics of
failures in real systems. Much research, in industry as well as academia, is based on hypothetical and often simplistic
assumptions, e.g. ``the time between failures is exponentially distributed'' and ``failures are independent''. The reason is
that virtually no data on failures in real large-scale systems is publicly available that could be used to derive more
realistic models. The long-term goal of this project is to enable the creation of more reliable systems through a deeper
understanding of real-world failures.
In our recent work, we have collected and analyzed failure data on node outages in a large number of HPC clusters and
data on storage failures in several large production systems. Our initial analysis shows that many commonly used models
and assumptions about failures are not realistic [FAST07, DSN06]. Below we first describe some of our recent results and
then outline our long-term research plans.
Recent work
Understanding failures in storage systems
As part of this project, we have analyzed field-gathered disk replacement data from a number of large production systems,
including high-performance computing sites and internet service sites. About 100,000 disks are covered by this data,
some for an entire lifetime of five years. The data include drives with SCSI, FC, and SATA interfaces. The mean
time to failure (MTTF) of those drives, as specified in their datasheets, ranges from 1,000,000 to 1,500,000 hours,
suggesting a nominal annual failure rate of at most 0.88%. Below is a summary of a few of our results.
Figure 1: Comparison of datasheet annual failure rates (solid and dashed lines in the graph) and annual replacement rates (ARR) observed in the field for 14 different disk drive populations.
Figure 2: Annual replacement rates observed in the field as a function of drive age. Note that rates in the field are continuously rising with age, while common models suggest steady state during years 2-5 of a drive's nominal lifetime.
Large-scale installation field usage appears to differ widely from nominal datasheet MTTF conditions. The field
replacement rates of systems were significantly larger than we expected based on datasheet MTTFs. For drives
less than five years old, field replacement rates were larger than what the datasheet MTTF suggested by a factor
of 2-10. For five to eight year old drives, field replacement rates were a factor of 30 higher than what the
datasheet MTTF suggested. Figure 1 above shows the annual replacement rates (ARR) for the 14 different disk
populations in our study that included only disks less than 5 years old. Nearly all exhibit significantly higher
replacement rates than the datasheet MTTFs (solid and dashed lines) suggest.
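For reference, the nominal rates are simple arithmetic: under the datasheet's constant-failure-rate assumption, an MTTF quoted in hours converts to an annual failure rate (AFR) using 8,760 hours per year,

    \mathrm{AFR} = 1 - e^{-8760/\mathrm{MTTF}} \approx \frac{8760}{\mathrm{MTTF}},
    \qquad \frac{8760}{1{,}000{,}000} \approx 0.88\%,
    \qquad \frac{8760}{1{,}500{,}000} \approx 0.58\%,

which is where the nominal annual failure rate of at most 0.88% quoted above comes from.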
Interestingly, the replacement rates of SATA disks are not worse than the replacement rates of SCSI or FC disks,
contrary to what is commonly assumed. In Figure 1 above, the blue bars and the right-most of the cyan bars correspond
to SATA disk populations, while all other bars correspond to SCSI or FC populations. This may indicate that
disk-independent factors, such as operating conditions, usage, and environmental factors, affect replacement rates
more than component-specific factors.
Changes in disk replacement rates during the first five years of the lifecycle were more dramatic than often
assumed. While replacement rates are often expected to be in steady state in years 2-5 of operation (the bottom of
the ``bathtub curve''), we observed a continuous increase in replacement rates, starting as early as the second
year of operation. Figure 2 above shows the increase in replacement rates as a function of drive age for one of
the disk drive populations in our study.
The common concern that MTTFs underrepresent infant mortality has led to proposals for new standards that
incorporate infant mortality. Our findings suggest that underrepresentation of the early onset of wear-out is a
much more serious factor than underrepresentation of infant mortality, and we recommend that new standards take
wear-out into account as well.
Understanding failures in high-performance computing systems
In our recent work [DSN06] we analyze data on node outages in high-performance computing clusters. The data were
collected over the past 9 years at Los Alamos National Laboratory and include 23,000 failures recorded on more than 20
different systems, mostly large clusters of SMP and NUMA nodes. Our findings include, for example, that average failure
rates differ wildly across systems, ranging from 20 to more than 700 failures per year, and that the time between failures is
modeled well by a Weibull distribution with decreasing hazard rate. From one system to another, mean repair time varies
from less than an hour to more than a day, and repair times are well modeled by a lognormal distribution.
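The kind of fitting involved can be sketched as follows (a minimal illustration using SciPy on synthetic data as a stand-in for the LANL traces; all variable names and parameter values below are our own assumptions):

    # Sketch: fitting Weibull/lognormal models of the kind described above,
    # using synthetic data as a stand-in for the LANL traces.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # Hypothetical time-between-failure samples, in hours.
    tbf = stats.weibull_min.rvs(0.75, scale=400, size=1000, random_state=rng)

    # Fit a Weibull with the location fixed at 0; a shape parameter < 1
    # indicates a decreasing hazard rate, as reported for the node outage data.
    shape, _, scale = stats.weibull_min.fit(tbf, floc=0)
    print(f"Weibull fit: shape={shape:.2f}, scale={scale:.0f} h")

    # Repair times are well modeled by a lognormal distribution in the study.
    repair = stats.lognorm.rvs(1.0, scale=4.0, size=1000, random_state=rng)
    sigma, _, median = stats.lognorm.fit(repair, floc=0)
    print(f"lognormal fit: sigma={sigma:.2f}, median={median:.1f} h")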
Figure 3: The expected decrease in mean time to interrupt (MTTI) in future HPC systems, assuming that the number of sockets increases as needed to stay on top500.org.
Figure 4: Projections of effective application utilization in future HPC systems under the checkpoint-restart model.
In a related thread of work [SciDAC07, CTWatch], we use the failure data for projections of failure rates of future petascale
systems and of how these rates will affect application effectiveness. We predict that if current technology and failure trends
continue, the mean time to interrupt (MTTI) can be expected to drop dramatically over the next couple of years (see Figure
3 above). Assuming that fault tolerance is implemented by checkpoint-restart, this means that the largest applications will
surrender large fractions of the system's resources to taking checkpoints and restarting from a checkpoint after an
interruption, leading to greatly reduced application utilization (see Figure 4 above). We also discuss coping strategies
such as application-level checkpoint compression and system-level process-pairs fault tolerance for supercomputing.
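The mechanics behind such projections can be sketched with a simple first-order model (our own back-of-the-envelope sketch, not the exact model behind Figure 4; the checkpoint cost, restart cost, and MTTI values are hypothetical). It uses Young's classic approximation for the optimal checkpoint interval:

    # Back-of-the-envelope checkpoint-restart model. delta = time to write one
    # checkpoint, M = mean time to interrupt (MTTI), R = restart cost, in hours.
    import math

    def utilization(delta, M, R=0.5):
        """First-order effective utilization under periodic checkpointing,
        with Young's approximation tau = sqrt(2*delta*M) for the interval."""
        tau = math.sqrt(2 * delta * M)
        # Overhead = checkpointing + expected lost work + restarts.
        overhead = delta / tau + tau / (2 * M) + R / M
        return max(0.0, 1.0 - overhead)

    # Hypothetical scenario: a fixed 1-hour checkpoint cost as MTTI shrinks.
    for mtti in (100.0, 24.0, 8.0, 2.0):
        print(f"MTTI={mtti:5.1f} h -> utilization ~ {utilization(1.0, mtti):.0%}")

Even this crude model shows utilization collapsing as the MTTI approaches the checkpoint cost, which is the qualitative behavior projected in Figure 4.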
A significant part of our work focuses on the analysis of the statistical properties of failures, as recorded in our data. Better
knowledge about the statistical properties of failure processes is not only necessary for a realistic reliability
evaluation of new system designs (e.g. when creating synthetic failure workloads in simulations), but may also empower
researchers and designers to develop new, more reliable and available systems.
The most common assumption about the statistical characteristics of failures is that they form a Poisson process, which
implies two key properties: exponentially distributed time between failures and independence of failures. Our analysis of
both disk failures and cluster node outages shows that this assumption is not very realistic. Below we provide
some more detail on our results.
Figure 5: Illustration of decreasing hazard rates in cluster node failure data.
Figure 6: The autocorrelation of disk failures at different lags.
While many have suspected that, for disk failures, the commonly made assumption of exponentially distributed
time between failures/replacements is not realistic, previous studies have not found enough evidence to reject
this assumption with significant statistical confidence. Based on our data analysis, we are able to reject the
hypothesis of exponentially distributed time between disk replacements with high confidence. We suggest that
researchers and designers use field replacement data, when possible, or two-parameter distributions, such as the
Weibull distribution.
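As a sketch of the kind of test involved (our own minimal illustration on synthetic stand-in data; the paper's analysis is more careful, e.g. about estimated parameters and correlated samples):

    # Sketch: testing the exponential hypothesis for time between disk
    # replacements on synthetic stand-in data.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    # Hypothetical replacement gaps, drawn from a Weibull with shape 0.7
    # (i.e. higher variability than exponential), in hours.
    gaps = stats.weibull_min.rvs(0.7, scale=500, size=2000, random_state=rng)

    # Kolmogorov-Smirnov test against an exponential with the sample mean.
    stat, p = stats.kstest(gaps, "expon", args=(0, gaps.mean()))
    print(f"KS statistic = {stat:.3f}, p-value = {p:.1e}")  # tiny p => reject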
For both disk failures and cluster node outages, we identify higher levels of variability and decreasing hazard
rates as the key features that distinguish the empirical distributions from the exponential distribution. Figure 5
illustrates the decreasing hazard rates observed for the time between cluster node outages. We find that the
empirical distributions are fit well by a Weibull distribution with a shape parameter between 0.7 and 0.8.
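For reference, the hazard rate of a Weibull distribution with shape k and scale \lambda is

    h(t) = \frac{k}{\lambda} \left( \frac{t}{\lambda} \right)^{k-1},

which is strictly decreasing in t exactly when k < 1, so fitted shape parameters of 0.7-0.8 directly encode the decreasing hazard rates visible in Figure 5.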
We also find, for both disk replacements and cluster node outages, strong evidence for the existence of various
types of correlations. For example, the empirical data exhibit significant levels of autocorrelation and long-range
dependence. Figure 6 shows the autocorrelation function for the disk replacement process.
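A minimal sketch of how such an autocorrelation function can be estimated (our own code on synthetic counts; Figure 6 shows the corresponding function for the real disk replacement process):

    # Sketch: estimating the autocorrelation function of a failure-count series.
    import numpy as np

    def acf(x, max_lag):
        """Sample autocorrelation at lags 1..max_lag."""
        x = np.asarray(x, dtype=float) - np.mean(x)
        var = np.dot(x, x)
        return [np.dot(x[:-k], x[k:]) / var for k in range(1, max_lag + 1)]

    rng = np.random.default_rng(2)
    # Hypothetical weekly replacement counts driven by a slowly varying
    # "environment" level, which induces positive correlation across weeks.
    level = np.cumsum(rng.normal(0.0, 0.2, 200)).clip(-2, 2)
    counts = rng.poisson(np.exp(1.0 + level))

    values = acf(counts, 20)
    for lag in (1, 5, 10, 20):
        print(f"lag {lag:2d}: acf = {values[lag - 1]:+.2f}")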
Long-term research plans
Our plan is to collect detailed failure data from a diverse set of real, large-scale production systems that cover all aspects
of system failures: software failures, hardware failures, failures due to operator error, network failures, and failures due to
environmental problems (e.g. power outages). At this point, we have established relationships with more than a dozen
large commercial sites and high-performance computing (HPC) sites, five of which have already contributed data. We are
currently working with the Usenix Association to create a public failure data repository to host these data. A first draft of
the repository can be viewed here.
While collecting and sharing failure data might seem like a purely mechanical process, it turns out to involve many
research questions in itself. One question, for example, is how to efficiently and reliably sanitize and anonymize
gigabytes of free-form text data, such as trouble tickets. Several of these problems will require techniques from other
areas. For example, we plan to investigate the use of methods from text analysis and document retrieval to help automate
the anonymization and analysis of free-form text data.
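As a toy illustration of the sanitization problem (a minimal rule-based sketch; the patterns and the sample ticket are made up, and the point of the research is precisely to go beyond such brittle hand-written rules):

    # Sketch: rule-based sanitization of trouble-ticket text. The patterns
    # and the sample ticket are made up for illustration.
    import re

    PATTERNS = [
        (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),     # IPv4 address
        (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "<EMAIL>"),  # email address
        (re.compile(r"\bnode\d+\b", re.IGNORECASE), "<HOST>"),    # node names
    ]

    def sanitize(text):
        """Replace identifying substrings with neutral tokens."""
        for pattern, token in PATTERNS:
            text = pattern.sub(token, text)
        return text

    print(sanitize("node17 unreachable from 10.0.3.2, paged ops@example.com"))
    # -> "<HOST> unreachable from <IP>, paged <EMAIL>"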
Our initial results indicate a strong need for new, more realistic failure models. We plan to identify and characterize the
most relevant aspects of failure behavior in large IT systems with the goal of deriving accurate failure and repair models
for a wide range of systems. Important aspects include, for example, various statistical properties of the failure
process, but also correlations between system parameters, such as workload, and the failure behavior. The results of this
work will provide a more realistic basis for both experimental and analytical research on system reliability. While our initial
results above are very recent, they are already being used by several researchers to parameterize their experiments and
simulations.
In our analysis we plan not only to use traditional statistical methods, but also to investigate techniques from data mining,
which might be particularly useful in identifying relationships and correlations between various aspects of system behavior
and observed failure modes.
A key question will be how complex new failure models need to be. While highly complex models with a large number of
parameters will provide a better fit to observed data, they not only pose a risk of overfitting, but will also be harder to use,
since they are computationally and conceptually more complex. We are looking for the simplest models that still provide
realistic results.
Armed with more realistic failure models, a natural next step will be to re-examine existing algorithms and techniques for
fault-tolerant systems, to understand where simpler (standard) models result in poor design choices, and to explore new
algorithms for those cases. As one example, we revisit the old question of estimating the probability of losing data in a RAID
system. We find that the probabilities derived with standard methods (assuming exponential time between failures and
independent failures) can be two orders of magnitude lower than estimates derived from real data.
Figure 7: The probability of a second drive failure in a RAID system during reconstruction, estimated in four different ways.
Figure 7 above illustrates this point by plotting the probability that a second drive in a RAID fails during reconstruction,
derived in four different ways: the purple bar estimates the probability based on exponential time between failures using
the datasheet MTTF; the blue bar estimates the probability based on exponential time between failures, but using the
actual empirical MTTF; the orange bar uses a Weibull distribution fit to the empirical data; and the green bar shows
estimates derived directly from the data. As the graph shows, the estimates derived using the standard approaches (purple
and blue bars) can greatly underestimate the probability of a RAID failure.
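A minimal sketch of the calculation (our own simplified version with illustrative numbers; it is not the exact computation behind Figure 7, but it shows why high-variability models drive the estimate up):

    # Sketch: probability that a second drive in a RAID group fails during
    # reconstruction, under different time-between-failure models. Group size,
    # window, and MTTF values are illustrative assumptions.
    import math
    from scipy import stats

    N = 9                   # surviving drives in the group after the first failure
    RECON_H = 24.0          # assumed reconstruction window, in hours
    DATASHEET_MTTF = 1.0e6  # datasheet MTTF, in hours
    FIELD_MTTF = 3.0e5      # hypothetical empirical MTTF, in hours

    def p_second_failure(dist):
        # P(at least one of N drives fails within the window), assuming
        # (optimistically) that drives fail independently.
        return 1 - (1 - dist.cdf(RECON_H)) ** N

    models = {
        "exponential, datasheet MTTF": stats.expon(scale=DATASHEET_MTTF),
        "exponential, field MTTF": stats.expon(scale=FIELD_MTTF),
        # Weibull with shape 0.7, scale chosen so the mean matches FIELD_MTTF.
        "Weibull, field data":
            stats.weibull_min(0.7, scale=FIELD_MTTF / math.gamma(1 + 1 / 0.7)),
    }

    for name, dist in models.items():
        print(f"{name:28s} P = {p_second_failure(dist):.1e}")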
We also plan to investigate whether we can directly exploit some of the statistical properties of failure behavior. For
example, we find that the time between node outages in HPC clusters exhibits decreasing hazard rates, and we are currently
investigating how this property can be used to design more efficient checkpoint protocols. In the realm of storage systems,
one could investigate whether statistical properties of latent sector errors could be exploited to develop smarter scrubbing
algorithms or better algorithms for deciding when to replace a drive. Another general question is whether we can exploit
correlations between past system behavior and future failures for proactive fault management or for automated diagnosis.
An interesting avenue for future work would be to investigate the use of data mining and machine learning techniques to
solve some of these problems.
Publications
The work on this project has resulted in several publications, which are listed below.
Bianca Schroeder, Garth Gibson. "Disk failures in the real world: What does an MTTF of 1,000,000 hours
mean to you?" 5th USENIX Conference on File and Storage Technologies (FAST '07).
An extended version of this paper appeared in ACM Transactions on Storage (TOS), Volume 3, Issue 3,
October 2007, under the title "Understanding disk failure rates: What does an MTTF of 1,000,000 hours
mean to you?".
Bianca Schroeder, Garth Gibson. "A large-scale study of failures in high-performance computing systems."
International Symposium on Dependable Systems and Networks (DSN '06).
As one of the best DSN '06 papers, it was invited to IEEE Transactions on Dependable and Secure Computing
(TDSC).
In the media
This project has been featured in a number of online media reports.
Acknowledgements
We would like to thank Gary Grider, Laura Davey, and Jamez Nunez from the High Performance Computing Division at
Los Alamos National Lab, and Katie Vargo, J. Ray Scott, and Robin Flaus from the Pittsburgh Supercomputing Center, for
collecting and providing us with data and helping us to interpret it. We also thank the other people and
organizations who have provided us with data but would like to remain unnamed. For discussions relating to the use of
high-end systems, we would like to thank Mark Seager and Dave Fox of the Lawrence Livermore National Lab.
We thank the members and companies of the PDL Consortium (including APC, Cisco, EMC, Hewlett-Packard, Hitachi,
IBM, Intel, Network Appliance, Oracle, Panasas, Seagate, and Symantec) for their interest and support. This material is
based upon work supported by the Department of Energy under Award Number DE-FC02-06ER25767 and on research
sponsored in part by the Army Research Office under agreement number DAAD19-02-1-0389.
For questions and comments regarding this page, please contact Bianca Schroeder.