
Reliability Engineering and System Safety 84 (2004) 225–239

www.elsevier.com/locate/ress

Bayesian estimation of unavailability

Corwin L. Atwood^a,*, Max Engelhardt^b

^a Statwood Consulting, 2905 Covington Rd., Silver Spring, MD 20910, USA
^b Engelhardt Consulting, 57 Cedar Lane, Silex, MO 63377, USA

Received 20 September 2003; accepted 10 November 2003

Abstract

This paper gives two Bayesian methods for estimating test-and-maintenance unavailability. Both unplanned and periodic maintenance are considered. One estimation method uses 'detailed data,' the individual outage times. The other method uses 'summary data,' totals of outage time and exposure time in various time periods such as calendar months. Either method can use either a noninformative or an informative prior distribution. Both methods are illustrated with an example data set, and the results are compared.
© 2003 Elsevier Ltd. All rights reserved.

Keywords: Alternating renewal process; Quantile–quantile plots; Data aggregation; Exponential durations; Noninformative prior; Informative prior

1. Introduction

Probabilistic Risk Assessments (PRAs) commonly use Bayesian distributions to quantify the uncertainty in parameters. These distributions are propagated through logic models to give a Bayesian distribution on the frequency of some undesirable end state. This approach is used when the parameter is an initiating-event frequency, a probability that a standby component fails to start, a rate of failure to run for a running component, or some other parameter.

Bayesian analysis of unavailability seems to be less well known than Bayesian analysis of other parameters. This article presents ways to analyze two kinds of unavailability data in a Bayesian way. That is, it shows ways to obtain a posterior distribution of the unavailability, by using data to update a noninformative or informative prior. This posterior distribution may be put in the PRA.

The discussion here is presented in terms of trains, combinations of hardware that combine to perform some function such as fluid injection, and that can be considered either in service or out of service, with no intermediate states. For example, a motor-driven train of the Auxiliary Feedwater (AFW) system in a nuclear power plant is normally in standby condition and available if it should be demanded, but sometimes it is out of service for planned or unplanned testing or maintenance. The event of a train being out of service will be called an outage here, and the length of time when it is out of service is called an outage duration or out-of-service time.

Several definitions of train unavailability have been given in the literature. Apostolakis and Chu [1] list four:

(1) Pointwise unavailability,
    q(t) ≡ Pr(train is out of service at time t).
(2) Limiting unavailability,
    q ≡ lim_{t→∞} q(t), if the limit exists.
(3) Average unavailability over (0, T),
    q_av ≡ (1/T) ∫₀^T q(t) dt.
(4) Limiting average unavailability,
    q_av∞ ≡ lim_{T→∞} (1/T) ∫₀^T q(t) dt, if the limit exists.

The limiting average unavailability is appropriate for both periodic maintenance and unplanned maintenance, and is the definition used in this paper. It corresponds to the unavailability at a random time, after any initial conditions have been forgotten. Thus, it is appropriate for use in PRA. This is discussed further in Section 2.

* Corresponding author. Fax: +1-301-589-7158. E-mail addresses: [email protected] (C.L. Atwood); [email protected] (M. Engelhardt).
0951-8320/$ - see front matter © 2003 Elsevier Ltd. All rights reserved.
doi:10.1016/j.ress.2003.11.010

When collecting data, both outage data and exposure times are needed. The exposure time is the time when
the train should have been in service. Some workers call it the required time. It should not be confused with 'fault exposure time,' an unrelated concept used in a different context. This paper presents Bayesian methods for estimating q_av∞ based on reported outage data and exposure times.

2. Assumptions and basic facts

2.1. Two types of maintenance

Both unplanned and planned maintenance are considered in this paper, although unplanned maintenance is emphasized.

2.1.1. Unplanned maintenance

The assumed underlying model for unplanned maintenance is an alternating renewal process. At any point in time a train is in one of two states, 'up' or 'down', corresponding in our application to being in service or being out of service. Initially the train is up and it remains up for a random time Y1, and then it goes down and stays down for a random time Z1. Then it goes up for a time Y2, and then down for a time Z2, and so forth. Assume that the Yi's have a common continuous distribution with finite mean, that the Zi's also have a common continuous distribution with finite mean, and that all the random variables are independent of each other.

Ross [2] provides results about alternating renewal processes. Although the primary focus of Ross is properties of a train in the up-state, the results yield similar conclusions about the down-state (i.e. they apply to train unavailability). The methods discussed by Ross yield the following conclusions. Denote the mean outage duration E(Z) by MTTR (an acronym for mean time to repair), and denote the mean time between outages E(Y) by MTTF (an acronym for mean time to failure). Denote the mean number of outages at or before time t by m_down(t). Then it can be shown using the methods of Ross [2, p. 61] that the limiting value of m_down(t)/t as t approaches infinity is 1/(MTTR + MTTF), and thus this limit can be interpreted as the outage rate, or frequency of outages. By the methods of Ross [2, p. 67] it also follows that under the above assumptions for an alternating renewal process, the limiting unavailability equals

q = MTTR/(MTTR + MTTF).   (1)

The numerator is the mean outage duration, and the denominator has been discussed earlier in this paragraph. It follows that

q = (mean outage duration) × (frequency of outages).   (2)

Because the limiting unavailability exists, it is easy to show that the limiting average unavailability also equals this value. That is,

q_av∞ = q for an alternating renewal process.

2.1.2. Planned maintenance

The assumed underlying model for planned maintenance is periodic maintenance. The system is taken out of service at times iτ, for some known maintenance interval τ and every integer i ≥ 0. The ith outage has duration Zi, where the Zi's are independent and have a common continuous distribution with Pr(Zi < τ) = 1. It follows that E(Zi) < τ. Mathematically, this is simpler than the alternating renewal process, because the times of the outage onsets are known, not random.

For periodic maintenance, it is shown below that

q_av∞ = MTTR/τ.   (3)

This is the periodic-maintenance analogue of Eq. (1). From the viewpoint of parameter estimation, Eq. (1) has two unknown parameters. Eq. (3) has only one, because the value of τ is known.

To derive Eq. (3), write F for the cumulative distribution of the repair time Z, and write F̄(t) = 1 − F(t). For 0 ≤ t ≤ τ note that q(t) = Pr(Zi > t) = F̄(t). Let us consider the integral contained in the definition of q_av, and as a first special case set T = τ:

∫₀^τ q(t) dt = ∫₀^τ F̄(t) dt.

Integration by parts yields

∫₀^τ q(t) dt = [F̄(t)·t]₀^τ + ∫₀^τ t f(t) dt = 0 + E(Z) = MTTR.

Now consider the integral with more general limits. For any t < τ and any nonnegative integer i, we have

q(iτ + t) = Pr(Zi > t) = F̄(t).

That is, q(t) is periodic. It follows for any s < τ that

∫₀^{iτ+s} q(t) dt = i × MTTR + ∫₀^s F̄(t) dt,

where the last term on the right is between 0 and MTTR. Eq. (3) then follows immediately.

Note in passing that the limit q does not exist for periodic maintenance, because q(t) is periodic. That is the reason why q_av∞ is chosen as 'the' definition of unavailability in this paper, valid for both unplanned and planned maintenance.

2.2. Two types of data

Two types of data are considered in this paper.

• Detailed Data. The onset time and duration of each outage is recorded. The exposure times are also recorded.
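As an illustrative check (ours, not part of the paper; the MTTF, MTTR, τ, and mean-duration values below are arbitrary choices), the limiting results (1) and (3) can be verified by a small simulation:

```python
import random

rng = random.Random(12345)

# Alternating renewal process (Section 2.1.1): exponential up- and
# down-times.  The long-run fraction of time down should approach
# Eq. (1), q = MTTR/(MTTR + MTTF).
mttf, mttr, n_cycles = 500.0, 10.0, 100_000
up = sum(rng.expovariate(1 / mttf) for _ in range(n_cycles))
down = sum(rng.expovariate(1 / mttr) for _ in range(n_cycles))
q_sim = down / (up + down)
q_eq1 = mttr / (mttr + mttf)

# Periodic maintenance (Section 2.1.2): an outage of random duration Z
# at the start of each interval of length tau.  Pr(Z < tau) = 1 is
# enforced here by truncation, and Eq. (3) gives q_av_inf = MTTR/tau.
tau, z_mean, n_intervals = 720.0, 8.0, 100_000
z_total = sum(min(rng.expovariate(1 / z_mean), tau) for _ in range(n_intervals))
q_per = z_total / (n_intervals * tau)

print(q_sim, q_eq1, q_per, z_mean / tau)
```

With 100,000 simulated cycles the two empirical fractions should agree closely with MTTR/(MTTR + MTTF) and MTTR/τ, respectively.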
Table 1
Exposure and outage times (h) for CVC trains

Month   Exposure time  Train 1                         Train 2                         Both
        per train      Outages             Total down  Outages             Total down  Total down
 1        364          -                      0        -                      0           0
 2        720          25.23                 25.23     24.88, 75.08          99.96      125.19
 3        744          -                      0        -                      0           0
 4        711          23.45                 23.45     -                      0          23.45
 5        621          9.75, 0.49, 1.24      11.48     15.15, 1.02, 0.49     16.66       28.14
 6        502          0.34, 2.90, 4.43       7.67     4.43, 0.34             4.77       12.44
 7          0          -                      0        -                      0           0
 8        637          18.02                 18.02     12.47, 9.48, 18.02    39.97       57.99
 9        676          -                      0        -                      0           0
10        595          -                      0        -                      0           0
11        600          11.05                 11.05     -                      0          11.05
12        546          -                      0        -                      0           0
13        745          -                      0        52.25                 52.25       52.25
14        720          -                      0        -                      0           0
15        744          -                      0        -                      0           0
Totals   8925                                96.90                          213.61      310.51
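The totals row of Table 1 can be reproduced directly from the individual entries; the lists below simply transcribe the table:

```python
# Table 1: (month, exposure h per train, train-1 outages, train-2 outages)
table1 = [
    (1, 364, [], []),
    (2, 720, [25.23], [24.88, 75.08]),
    (3, 744, [], []),
    (4, 711, [23.45], []),
    (5, 621, [9.75, 0.49, 1.24], [15.15, 1.02, 0.49]),
    (6, 502, [0.34, 2.90, 4.43], [4.43, 0.34]),
    (7, 0, [], []),
    (8, 637, [18.02], [12.47, 9.48, 18.02]),
    (9, 676, [], []),
    (10, 595, [], []),
    (11, 600, [11.05], []),
    (12, 546, [], []),
    (13, 745, [], [52.25]),
    (14, 720, [], []),
    (15, 744, [], []),
]

expos_per_train = sum(e for _, e, _, _ in table1)              # 8925 h
t1_down = sum(sum(o1) for _, _, o1, _ in table1)               # 96.90 h
t2_down = sum(sum(o2) for _, _, _, o2 in table1)               # 213.61 h
total_down = t1_down + t2_down                                 # 310.51 h
n_outages = sum(len(o1) + len(o2) for _, _, o1, o2 in table1)  # 21 outages
print(expos_per_train, round(t1_down, 2), round(t2_down, 2),
      round(total_down, 2), n_outages)
```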

Data are recorded separately for planned and unplanned maintenance.

• Summary Data. The history is grouped into bins, such as months, quarter years, or years. For each bin, the total outage time and the total exposure time are recorded. In this case, planned and unplanned maintenance data may be kept separate, or may be combined.

3. Example data

Unavailability data are proprietary, and we were unable to obtain real data for planned and unplanned maintenance. The following data were taken from a PRA of a commercial nuclear power plant. The PRA was developed in the 1980s, and does not distinguish between planned and unplanned maintenance. We will use the data to illustrate the estimation methods of this paper, recognizing that the example is for illustration only, and not perfectly realistic.

During a 15-month period, two chemical and volume control (CVC) pump trains at a nuclear power plant had a number of outages. The exposure hours (when the trains should have been available) and outage hours for maintenance are shown in Table 1, for 15 calendar months. The data here include both planned and unplanned outages. For each month and train the individual outage durations are given in the columns labeled 'outages', and are summed as the 'total down time.' Month 5 illustrates this summing.

The many zero values may be surprising. They appear to rule out any periodic maintenance with a monthly or even quarterly cycle. It has been pointed out to us (Steven A. Eide, personal communication, Sept. 22, 2003) that the present NRC Reactor Oversight Process (and earlier SSPI) allows plants to perform periodic testing without counting planned outage time if certain conditions are met. For example, when emergency diesel generators (EDGs) are tested, if an operator is locally stationed during the test, then if an actual unplanned demand occurred the EDG can be quickly reconfigured to respond to that demand. In that case, the outage time for the test is not reported. Also, some components automatically re-align if an unplanned demand occurs. This may be the reason that zero entries are possible for planned maintenance hours over an entire quarter, even with monthly testing.

It might be noted that months 5, 6, and 8 contain instances of both trains being down for exactly the same time. This is probably not coincidence. It may be that both trains were taken out together, violating the independence assumption given earlier. If that is the case, the 18.02-h outage is troubling because it constitutes over 10% of the total outage time. If, indeed, both trains were out at the same time, not only is the independence assumption violated, but the PRA model should include a basic event that both trains are out of service simultaneously. Another possibility, however, is that the equal times are an artifact of reporting; that is, perhaps the trains were taken out in succession and the average time was reported for each train. These concerns
cannot be addressed at this time, years after the events. The data are used for illustration in the analysis below, ignoring any possible violation of the assumptions.

4. Analysis of detailed data

Given the individual outage data, the parameters can be estimated in a Bayesian way. It is assumed here that the planned-maintenance and unplanned-maintenance outages are recorded separately. First, the analysis of unplanned-maintenance data is presented. The details can be simple or messy, depending on the assumptions that are justifiable.

4.1. Analysis of unplanned outages under simple assumptions

Because this section considers unplanned outages, the process is assumed to be an alternating renewal process. The limiting average unavailability q_av∞ is equal to the limiting unavailability, and therefore will be written as q in this section.

The simplest assumptions are that

1. the in-service or up-time durations (Y in Section 2) have an exponential(λ_up) distribution;
2. the outage durations (Z in Section 2) have an exponential(λ_down) distribution;
3. the mean down-time is small compared to the mean up-time. This can be rewritten in terms of frequencies as λ_down ≫ λ_up.

Here, both λ_up and λ_down have units 1/time. The parameter λ_down can be interpreted as the reciprocal of the mean outage duration MTTR, and λ_up is the reciprocal of MTTF. Let t_up and t_down denote the total up time and total down time, with t_expos = t_up + t_down.

The Bayesian analysis proceeds as follows. Let a 'cycle' consist of the time from the onset of one outage to the onset of the next. Treat the number of outages in any time t as Poisson(λ_cycle t). Here λ_cycle is the outage frequency. It can be interpreted as the reciprocal of the mean time from one outage to the next, that is λ_cycle = 1/(MTTF + MTTR). Assume a gamma(a_cycle,0, b_cycle,0) prior distribution for λ_cycle.

This Poisson distribution is an approximation that is valid if MTTR ≪ MTTF, so that the duration of a cycle is dominated by the exponentially distributed up-time. The case with MTTR and MTTF comparable in size is discussed in Section 4.2.1.

Also for simplicity, assume a gamma(a_down,0, b_down,0) prior for λ_down.

Finally, assume that the prior distributions for λ_cycle and λ_down are independent. That is, if the mean duration of an outage were to increase or decrease, this would provide no information about whether outage frequency increases or decreases. This independence cannot be exactly true, because

1/λ_down = MTTR < MTTR + MTTF = 1/λ_cycle.

However, if MTTR ≪ MTTF, the constraint has little effect and independence can be approximately true.

Under the above assumptions, the likelihood is the Poisson probability of n outages, multiplied by the product of the exponential densities of the n outage durations. When this is multiplied by the joint prior density, the result is proportional to a gamma density for λ_cycle times a gamma density for λ_down. In summary, the posterior distributions are independent, with

λ_cycle ~ gamma(a_cycle,1, b_cycle,1) = χ²(2a_cycle,1)/2b_cycle,1
λ_down ~ gamma(a_down,1, b_down,1) = χ²(2a_down,1)/2b_down,1,

where

a_cycle,1 = a_cycle,0 + n
b_cycle,1 = b_cycle,0 + t_expos
a_down,1 = a_down,0 + n
b_down,1 = b_down,0 + t_down.

Eq. (1) or (2) implies that

q = λ_cycle/λ_down.

Here λ_cycle/λ_down is a ratio of two independent gamma variables, and so is proportional to a ratio of two independent chi-squared variables. However, the ratio of two independent chi-squared variables, each divided by its degrees of freedom, is F distributed, as is shown in many books on statistics. Thus the posterior distribution is

q ~ (b_down,1/b_cycle,1) × (a_cycle,1/a_down,1) F(2a_cycle,1, 2a_down,1).   (4)

This result—the fact that the posterior is a rescaled F distribution under these assumptions—has long been known. One reference is Example 10.3 in Ref. [3].

The mean and variance of an F(m, n) distribution are given by Bain and Engelhardt [4] as

mean = n/(n − 2)
variance = 2n²(m + n − 2)/[m(n − 2)²(n − 4)].

Percentiles of the F distribution are calculated by many popular software packages, including widely-used spreadsheets.

What prior parameters should be used? Informative priors would be quite appropriate. However, if the Jeffreys noninformative priors were to be used, they would be

a_cycle,0 = 0.5
b_cycle,0 = 0
a_down,0 = 0
b_down,0 = 0.

Then

q ~ (t_down/t_expos) [(n + 0.5)/n] F(2n + 1, 2n).   (5)

4.2. Analysis of unplanned outages under other assumptions

Two possible departures from the above assumptions are considered. Section 4.2.1 gives an exact analysis when MTTR is not necessarily small, that is, when Assumption 3, listed at the start of Section 4.1, is not necessarily true. Then Section 4.2.2 describes an analysis when the exponential distributions in Assumptions 1 and 2 are replaced by gamma distributions with known shape parameters. Details of both cases are given in Appendix A, so as not to interrupt the flow of the paper.

4.2.1. Exact analysis when MTTR is not necessarily small

In a way similar to Section 4.1, assume a gamma(a_up,0, b_up,0) prior for λ_up and a gamma(a_down,0, b_down,0) prior for λ_down. Let us also make the reasonable assumption that the prior distributions are independent; that is, if the mean duration of up times were to increase or decrease, this would provide no information about whether the mean outage duration increases or decreases.

In Section 4.1, outages were considered nearly instantaneous when compared to the total exposure time. The present section, however, must consider the possibility that the total exposure period may begin or end with the train in the down condition. Thus, define n as the number of outage onsets (or up-time completions) and r as the number of up-time onsets (or repair completions) in the observable data period. Because of the alternating nature of up-times and down-times, n and r can differ from each other by at most 1.

Appendix A develops the formula for the likelihood, and derives the posterior distributions. They are independent, with

λ_up ~ gamma(a_up,1, b_up,1) = χ²(2a_up,1)/2b_up,1
λ_down ~ gamma(a_down,1, b_down,1) = χ²(2a_down,1)/2b_down,1,

where

a_up,1 = a_up,0 + n
b_up,1 = b_up,0 + t_up
a_down,1 = a_down,0 + r
b_down,1 = b_down,0 + t_down.

By Eq. (1), it follows that

q = (1/λ_down)/(1/λ_down + 1/λ_up) = 1/(1 + λ_down/λ_up).

Here λ_down/λ_up is proportional to a ratio of two independent chi-squared variables, and therefore is a multiple of an F-distributed variable, just as in Section 4.1. Thus the posterior distribution of q is

q ~ 1/{1 + [(a_down,1 b_up,1)/(a_up,1 b_down,1)] F(2a_down,1, 2a_up,1)}.   (6)

The Jeffreys prior is not as simple as in Section 4.1. It can be calculated from first principles, but is rather complicated. In the special case when MTTR ≪ MTTF, it is consistent with Section 4.1:

a_up,0 = 0.5
b_up,0 = 0
a_down,0 = 0
b_down,0 = 0.

Then

q ~ 1/{1 + [r/(n + 0.5)] (t_up/t_down) F(2r, 2n + 1)}.   (7)

Eqs. (6) and (7) are slightly more complicated than their counterparts (4) and (5).

4.2.2. Analysis when durations have gamma distributions

An article by Kuo [5] has a quite different perspective from this paper. However, in that article Kuo generalizes the exponential assumptions as follows. Suppose that Z1, Z2, … are independently gamma(k_down, λ_down) distributed, with k_down assumed to be known and λ_down unknown. Let λ_down have a prior gamma(a_down,0, b_down,0) distribution. Appendix A imitates the analysis of Section 4.1 or 4.2.1. The results when MTTR ≪ MTTF are summarized here, imitating Section 4.1 and following Kuo.

Assume that (Y1 + Z1), (Y2 + Z2), … are independently distributed with a gamma(k_cycle, λ_cycle) distribution. Assign a prior gamma(a_cycle,0, b_cycle,0) distribution to λ_cycle, and assume that λ_cycle and λ_down are a priori independent, at least to good approximation. Let n′ be the number of cycles that are fully observable. Examples are given in Appendix A. It is the analyst's choice whether a cycle is defined as an up-time followed by an outage or as an outage followed by an up-time. The prior distributions of λ_down and λ_cycle are assumed to be independent to good approximation, just as in Section 4.1.

Similarly, let r′ be the number of outages whose durations are fully observable, which presumably are all of the outages in the present case with small MTTR. As shown in Appendix A, the posterior distribution of λ_down is gamma(a_down,1, b_down,1), with

a_down,1 = a_down,0 + r′ × k_down
b_down,1 = b_down,0 + t_down.
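Returning for a moment to the simple setting of Section 4.1: with the Jeffreys priors and the Table 1 totals analyzed in Section 4.4 (n = 21 outages, t_expos = 17,850 h, t_down = 310.51 h), the conjugate update is a two-line computation, and the posterior (4)–(5) can be sampled by Monte Carlo as an alternative to F-distribution percentiles. This is a sketch of ours, not part of the paper:

```python
import random

n, t_expos, t_down = 21, 17850.0, 310.51

# Jeffreys-prior conjugate update of Section 4.1
a_cycle, b_cycle = 0.5 + n, 0.0 + t_expos   # posterior of lambda_cycle
a_down,  b_down  = 0.0 + n, 0.0 + t_down    # posterior of lambda_down

# Sample q = lambda_cycle/lambda_down.  random.gammavariate takes
# (shape, scale), so a rate parameter b enters as scale 1/b.
rng = random.Random(2004)
qs = sorted(rng.gammavariate(a_cycle, 1 / b_cycle) /
            rng.gammavariate(a_down, 1 / b_down) for _ in range(100_000))

q_mean = sum(qs) / len(qs)
q05, q95 = qs[4_999], qs[94_999]
# Analytic posterior mean: E(lambda_cycle) * E(1/lambda_down)
mean_exact = (a_cycle / b_cycle) * (b_down / (a_down - 1))
print(q_mean, mean_exact, q05, q95)
```

The analytic mean reproduces the value (310.51/17850)(21.5/21)(42/40), about 1.87 × 10⁻², quoted in Section 4.4, and the sampled percentiles should land near the F-based 1.07 × 10⁻² and 2.97 × 10⁻².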
The posterior distribution of λ_cycle is gamma(a_cycle,1, b_cycle,1), with

a_cycle,1 = a_cycle,0 + n′ × k_cycle
b_cycle,1 = b_cycle,0 + t_expos.

The posterior distributions of λ_down and λ_cycle are independent to good approximation. By Eq. (1) or (2), it follows that

q = (k_down/λ_down)/(k_cycle/λ_cycle).

Therefore, the analogue of Eq. (4) is

q ~ (k_down/k_cycle)(b_down,1/b_cycle,1) × (a_cycle,1/a_down,1) F(2a_cycle,1, 2a_down,1).

4.2.3. Analysis when durations have other distributions

In more general cases, the foundation of the method remains Eq. (1) or (2), with the two pieces each estimated in a Bayesian way. There are many possible distributions for the durations, such as a lognormal, two-parameter gamma, and Weibull. Each distribution has its own techniques for estimation.

In many cases, however, it will be much simpler to convert the data to summary data and follow the analysis method given in Section 5. For this reason, the above approach will not be pursued further here.

4.3. Analysis of planned outages

From the viewpoint of parameter estimation, planned maintenance is a simpler setting. Instead of estimating the cycle period, MTTR + MTTF, we know its value, τ. The mean down time can be estimated just as in Sections 4.1 or 4.2. Eq. (3) then gives

q_av∞ = (1/τ)/λ_down.

The parameter λ_down has a gamma posterior distribution, the same distribution as in Sections 4.1 or 4.2. Therefore, τq_av∞ has an inverted gamma distribution, or equivalently, 1/(τq_av∞) has a gamma distribution.

If the exponential (Section 4.1) or gamma (Section 4.2.2) assumption on Z is not consistent with the data, then the data can be aggregated, as presented in Section 5.

4.4. Example data analysis

For purposes of illustration, the data of Table 1 will be treated as if they consisted only of unplanned outages. The table shows 21 outages, with a total duration t_down = 310.51 train-hours and a total exposure time t_expos = 17,850 train-hours. Because the down time is a small fraction of the exposure time, the approximation of Section 4.1 is used.

One diagnostic plot for checking whether times follow an exponential distribution is a quantile–quantile plot, or a Q–Q plot [6]. Denote the times generically as t1, …, tn. Let the ordered times, or order statistics, be denoted t(1) ≤ t(2) ≤ ··· ≤ t(n). If the times are a random sample from a population with cumulative distribution F, it should be true that

t(i) ≈ F⁻¹[i/(n + 1)].   (8)

Eq. (8) is the basis of a Q–Q plot. When the distribution is exponential(λ), we have

F(t(i)) = 1 − exp(−λt(i)).

Applying F to both sides of Eq. (8) gives

1 − exp(−λt(i)) ≈ i/(n + 1),

and a little algebraic manipulation transforms this into

t(i) ≈ −ln[1 − i/(n + 1)]/λ.

Thus, a plot of the ordered times t(i) against −ln[1 − i/(n + 1)] should fall approximately on a straight line. This plot can be constructed without estimating λ. The linearity or nonlinearity of the plot shows whether the distribution appears to be exponential or not.

Fig. 1 shows the Q–Q plot for the 21 outage durations of Table 1.

Fig. 1. Quantile–quantile plot for examining whether outage durations are exponentially distributed.

The line is not perfectly straight. The two largest times are a bit too large, and there are too many very small times. Thus, the true distribution is apparently more skewed than an exponential distribution. Nevertheless, we will assume that the exponential distribution is approximately correct, while recognizing that the resulting uncertainty intervals are not exact.

A formal goodness-of-fit test could also be applied, such as a chi-squared test or a Kolmogorov–Smirnov test. However, such detail is not considered here.

The Q–Q plot also allows a check of the assumption that the outages occur following a Poisson process. If the assumption is true, then the time from the start of one
outage to the start of the next is exponentially distributed. Thus, the same type of plot as given above can check the 'Poissonness.' The example data do not give the exact times of the outage onsets. However, an approximation can be made by assuming that each single outage in a month occurs at midmonth, each pair of outages in a month occurs in an equally spaced way with one third of the exposure time between the two events, and so forth. For example, if two outages occur in a 720 h month, they are assumed to begin at 240 h and 480 h. When this is done, Fig. 2 results. The horizontal axis shows the times between the approximate onsets of the outages.

Fig. 2. Quantile–quantile plot for examining whether times between outages are exponentially distributed.

The approximations used involve errors of at most a few hundred hours, and so have little effect on the overall form of the plot. This figure shows the same general curvature that Fig. 1 showed, indicating that the times between outages may have a more skewed distribution than assumed. Nevertheless, we press on with the example analysis.

Although an informative prior would certainly be reasonable, and would apply to Eq. (4), only the Jeffreys noninformative priors are considered here, corresponding to Eq. (5). Then the posterior distribution of the unavailability q is

(310.51/17850)(21.5/21) F(43, 42).

The 5th and 95th percentiles of the F distribution, as calculated by Microsoft Excel [7] (mention of specific products and/or manufacturers implies neither endorsement or preference, nor disapproval), are 0.601 and 1.667. Therefore, the 5th and 95th percentiles of the unavailability are 1.07 × 10⁻² and 2.97 × 10⁻². By the way, if the exact method of Section 4.2.1 were used, the 5th and 95th percentiles would differ from these numbers only in the third significant digit.

The mean of the posterior distribution of unavailability is

(310.51/17850)(21.5/21)(42/40) = 1.87 × 10⁻².

As a sanity check, this can be compared with the simple estimate of the total outage time divided by the total exposure time, 1.74 × 10⁻².

5. Analysis of summary data

Data from many train-months may be reported (28 in the example, not counting the one month when the plant was shut down and its trains were not required to be available). The task now is to use the summary data only, not the data from individual outages, to obtain a Bayesian distribution for q_av∞. The fundamental technique is data aggregation to yield quantities that are approximately normally distributed, as described in Section 5.1.

For planned outages, the method here aggregates the data into subsets so that each subset contains the same number of test intervals. For example, with monthly maintenance each subset contains the same number of train-months. Because the mean and variance of the time to repair are finite, the Central Limit Theorem (for example, see Ref. [4]) states that the total outage time in a subset is asymptotically normal, as the number of maintenance events in a subset becomes large.

It appears difficult to construct a corresponding theoretical proof of asymptotic normality for unplanned outages. The Central Limit Theorem cannot be used in any direct way, because it concerns the limit of the sum Z1 + ··· + Zn. This sum, when normalized by functions of n, is approximately normal(0,1) when n is large. In the present setting, however, n is random, the number of up and down cycles in a fixed aggregation period. We have not been able to prove asymptotic normality in that setting. Instead, we have constructed simulations, described in Appendix B, giving empirical evidence that the sum is indeed asymptotically normal when the aggregation interval is long enough. The required assumptions are not known, although presumably it is necessary for the durations to have finite means and variances.

When aggregating data, there appears to be no advantage—at least from the perspective of estimating total unavailability—to separating planned and unplanned outages. The two kinds of data give total outage times that are approximately normal. Therefore their sum is approximately normal, and may be used directly to estimate the total unavailability.

5.1. Data aggregation

For simplicity of explanation, assume that the data are summarized by month, for each train. Denote the total exposure time for the ith train-month by e_i and denote the corresponding outage time by o_i. The corresponding simple point estimate of the unavailability q_av∞ is the ratio x_i = o_i/e_i. This gives one such estimate of q_av∞ for each train-month of data. The estimate from any one train-month is not
Table 2
Sample statistics for estimates of q_av∞, with different levels of aggregation

                Train-month    System-month   Train-quarter   System-quarter
n                   28             14             10               5
Mean            1.63 × 10⁻²    1.63 × 10⁻²    1.78 × 10⁻²    1.78 × 10⁻²
Median          0.00           4.60 × 10⁻³    1.38 × 10⁻²    1.75 × 10⁻²
St. Dev., s     3.07 × 10⁻²    2.51 × 10⁻²    1.64 × 10⁻²    1.16 × 10⁻²
s/√n            5.81 × 10⁻³    6.70 × 10⁻³    5.18 × 10⁻³    5.19 × 10⁻³
Skewness        2.79           2.02           1.25            0.33
No. of zeros    17             7              2               0
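The train-month column of Table 2 can be reproduced from the monthly totals of Table 1 (a sketch; month 7 is dropped because its exposure time is zero):

```python
from statistics import mean, median

# Per-train monthly totals from Table 1, month 7 (zero exposure) excluded
exposure = [364, 720, 744, 711, 621, 502, 637, 676, 595, 600, 546, 745, 720, 744]
train1 = [0, 25.23, 0, 23.45, 11.48, 7.67, 18.02, 0, 0, 11.05, 0, 0, 0, 0]
train2 = [0, 99.96, 0, 0, 16.66, 4.77, 39.97, 0, 0, 0, 0, 52.25, 0, 0]

# One estimate x_i = o_i/e_i per train-month (28 in all)
x = [o / e for e, o in zip(exposure * 2, train1 + train2)]
print(len(x), round(mean(x), 4), median(x), sum(v == 0 for v in x))
# → 28 0.0163 0.0 17

# Pooling the two trains for month 2, as in the text: o_2/e_2 = 125.19/1440
x2_pooled = (train1[1] + train2[1]) / (2 * exposure[1])   # about 8.69e-2
```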

very good, because it is based on only a small data set. Indeed, if e_i = 0 the estimate is undefined.

The data may contain many zeros, as seen in columns 4, 6, and 7 of Table 1. As a result of the many zeros and few relatively large outage times, the data can be quite skewed. To eliminate some of the zeros and make the data less skewed, the data can be pooled in various ways. For example, the rightmost column of Table 1 shows the total outage time for the two trains. Similarly, the data could be aggregated by time periods longer than one month, such as by calendar quarter or calendar year. This aggregation over time could be done separately for each train or for the pooled data from the trains.

This aggregation groups the data into subsets, for example train-months (the least aggregation), or train-quarters, or system-months, etc. Note that the parameter to estimate is still the train unavailability, not system unavailability, even if the data from the two trains in the system are pooled. Let o_i and e_i now denote the outage time and exposure time for the ith subset. The simple moment estimate of q_av1 based on the ith subset is x_i = o_i/e_i.

In the example data, if the two trains are pooled, the total train exposure time for month 2 is e_2 = 720 + 720 = 1440 h, and the total train outage time is o_2 = 125.19 h. The estimate based on this one month is 125.19/1440 = 8.69 × 10^-2. If calendar quarters are pooled but trains are not pooled, the total train exposure time for Train 1 in quarter 3 is e_3 = 0 + 637 + 676 = 1313 h, and the corresponding train outage time is o_3 = 0 + 18.02 + 0 = 18.02 h. The estimate of q_av1 based on this one train-quarter is 18.02/1313 = 1.37 × 10^-2.

Whatever level of aggregation is used, this approach pools the numerators and denominators separately within each subset and then calculates the ratio.

The purpose of this aggregation is to produce multiple estimates o_i/e_i that are denoted generically as x_i. The x_i values must all come from a single distribution. Therefore, the pooling assumes that the parameter q_av1 does not change within the data set, and that the various subsets of the data (calendar quarters or years, etc.) have similar exposure times, so that the random x_i's all come from close to the same distribution.

The x_i values should come from an approximately normal distribution. The normal distribution would not generate repeated values, such as multiple observed values of zero. In addition, the normal distribution does not produce strongly skewed data. The data must be aggregated enough to have these characteristics.

Table 2 gives some sample statistics for x, based on various amounts of aggregation of the data of the example. The skewness is a measure of asymmetry. Positive skewness corresponds to a long tail on the right. Zero skewness corresponds to a symmetrical distribution.

The 28 values of x corresponding to train-months do not come from a normal distribution. They are too skewed, as is seen by the fact that the mean (1.63 × 10^-2) is very different from the median (0), and the skewness (2.79) is far from zero. Also, they have many occurrences of a single value, 0. Pooling the two trains into 14 subsets somewhat reduces the skewness and the percentage of zeros.

Pooling the three months for each train makes the distribution more symmetrical yet: the mean and median are within 30% of each other, and the skewness is down to 1.25. When the data are aggregated by pooling trains and by pooling months into quarters, multiple values of zero are finally eliminated, and the distribution appears to be nearly symmetrical: the mean and median are within 2% of each other, and the skewness is moderately small. This suggests that the five values of x_i may be treated as a random sample from a normal distribution.

To investigate this further, a Q–Q plot was constructed, plotting the order statistics against the inverse of the standard normal cumulative distribution function. That is, x_(i) was plotted against

Φ^-1[i/(n + 1)].

Here Φ^-1 is the inverse of the standard normal cumulative distribution function. It is evaluated by many software packages, including widely-used spreadsheets. The plot is shown in Fig. 3.

The points lie on a nearly straight line, indicating good agreement with a normal distribution. For this reason, the x values are treated as coming from a normal distribution when the data are aggregated to this degree. A goodness-of-fit test could be performed, but it would not be able to detect non-normality based on only five points.
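Both diagnostics described above (the sample skewness reported in Table 2 and the Q–Q ordinates Φ^-1[i/(n + 1)]) can be computed with standard library routines. In the sketch below, the five x values are hypothetical stand-ins for aggregated estimates o_i/e_i (the actual values come from the example data, not reproduced here), and the skewness uses the simple moment-ratio convention; software packages differ slightly in bias corrections.

```python
from statistics import NormalDist

def skewness(xs):
    """Moment-ratio skewness: average cubed standardized deviation.
    (Conventions differ slightly across packages; this is the simple version.)"""
    n = len(xs)
    m = sum(xs) / n
    s = (sum((x - m) ** 2 for x in xs) / (n - 1)) ** 0.5
    return sum(((x - m) / s) ** 3 for x in xs) / n

def qq_pairs(xs):
    """Pair each order statistic x_(i) with the normal quantile F^-1[i/(n+1)]."""
    n = len(xs)
    inv = NormalDist().inv_cdf
    return [(inv(i / (n + 1)), x) for i, x in enumerate(sorted(xs), start=1)]

# Hypothetical aggregated estimates x_i = o_i / e_i for five system-quarters:
xs = [0.004, 0.013, 0.018, 0.022, 0.032]
pairs = qq_pairs(xs)
# Plotting the second coordinate against the first gives the Q-Q plot;
# near-linearity suggests approximate normality.
```

For n = 5 the theoretical ordinates are symmetric about zero (approximately ±0.967, ±0.431, and 0), so a symmetric sample traces a straight line through the origin.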
C.L. Atwood, M. Engelhardt / Reliability Engineering and System Safety 84 (2004) 225–239 233
Fig. 3. Quantile–quantile plot, investigating normality of x when trains are pooled and months within quarters are pooled.

There is a problem with the third quarter, because it has smaller exposure time than the other quarters. That means that x corresponding to this quarter has larger variance than the others. This fact is ignored here, but if the exposure time for quarter 3 had been even smaller, it might have been necessary to drop that quarter from the analysis or to pool differently.

We repeat: the only purpose of the data aggregation is to eliminate the skewness and eliminate multiple values, thus permitting the use of normal methods. To the extent possible, the data should be pooled so that the aggregated subsets have the same number of periodic maintenance events and have similar exposure times, in order to have x values that come from a common distribution.

Appendix B performs simulations suggesting that the normal approximation would be improved if more aggregation could be performed in the example. However, the example data set is so small that little more can be done. In practice, it would be wise when possible to aggregate beyond the point where normality appears just barely to have been achieved.

5.2. Bayesian estimation of q_av1

Bayesian estimates are given here. Examples are worked out after the general formulas are given. Assume that the data have been aggregated enough so that {x_1, …, x_n} is a random sample from a normal(μ, σ²) distribution. Define

x̄ = (Σ_i x_i)/n

and

s_X² = [Σ_i (x_i − x̄)²]/(n − 1).

5.3. Noninformative prior

5.3.1. Formulas
The joint noninformative prior for (μ, σ²) is proportional to 1/σ². Lee [8, Sec. 2.13] presents this prior, as do Box and Tiao [9, Sec. 2.4]. The posterior distribution then results in

(μ − x̄)/(s_X/√n)

having a Student's t distribution with n − 1 degrees of freedom. Here μ is the quantity with the Bayesian distribution, and everything else in the expression is a calculated number. Then a 100(1 − α)% credible interval for μ is

x̄ ± t_{1−α/2}(n − 1) s_X/√n.

As shown in most beginning statistics courses, this formula is exactly the formula for a 100(1 − α)% confidence interval for μ. That is, the Bayesian credible intervals based on data and a noninformative prior are numerically equal to non-Bayesian confidence intervals.

5.3.2. Example data analysis
In the example, the mean μ is interpreted as the unavailability q_av1. Based on the values of x_i and calculations in Table 2, the expression

(q_av1 − 1.78 × 10^-2)/(5.19 × 10^-3)

has a Student's t distribution with 4 degrees of freedom. For example, the 95th percentile of Student's t with 4 degrees of freedom is 2.132. Therefore, a 90% credible interval is

1.78 × 10^-2 ± 2.132 × 5.19 × 10^-3 = 1.78 × 10^-2 ± 1.11 × 10^-2 = (7 × 10^-3, 2.9 × 10^-2).

Not all PRA software packages contain Student's t distribution. Software really ought to be constructed to implement the methods, but sometimes it is necessary to adjust the method temporarily to match the available software. Analysts who are working without Student's t distribution in their software package may be forced to use a normal distribution with the same 90% interval as the one given by the above calculation. (To be more conservative, match the 95% intervals or the 99% intervals.) If the degrees of freedom are not too small (≥ 3 as a bare minimum) the approximation of a Student's t by a normal is probably acceptable.

In the above example, a normal distribution with the same 90% interval would have 1.645σ = 1.11 × 10^-2. Therefore, a normal approximation for the posterior distribution of q_av1 is normal with mean 1.78 × 10^-2 and standard deviation 6.75 × 10^-3.

5.4. Informative conjugate priors

5.4.1. Formulas
Now consider informative priors. The conjugate priors and update formulas are presented by Lee [8, Sec. 2.13].
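Both summary-data analyses (the noninformative-prior interval of Section 5.3.2 and the conjugate update worked out below in Section 5.4.2) amount to a few lines of arithmetic. The following sketch uses only the summary statistics already given in the text; the Student's t percentile t_0.95(4) = 2.132 is hardcoded from the text rather than computed, so no special distribution library is needed.

```python
from math import sqrt

# Summary statistics of the five aggregated estimates (Table 2, system-quarter)
n, xbar, s_X = 5, 1.78e-2, 1.16e-2
t95_4 = 2.132      # 95th percentile of Student's t with 4 degrees of freedom

# --- Section 5.3.2: noninformative prior, 90% credible interval ---
half = t95_4 * s_X / sqrt(n)
lo, hi = xbar - half, xbar + half        # roughly (7e-3, 2.9e-2)

# --- Section 5.4.2: conjugate update with m0, s0 from generic data ---
m0, s0, n0, d0 = 6e-3, 3.5e-3, 3, -1
d1 = d0 + n                              # 4
n1 = n0 + n                              # 8
m1 = (n0 * m0 + n * xbar) / n1
s1_sq = (d0 * s0**2 + (n - 1) * s_X**2
         + (n0 * n1 / (n0 + n1)) * (m0 - xbar)**2) / d1
s1 = sqrt(s1_sq)
# Posterior: (q_av1 - m1)/(s1/sqrt(n1)) has a Student's t distribution
# with d1 degrees of freedom, so a 90% interval is m1 +/- t95_4*s1/sqrt(n1).
```

With these inputs, m1 is about 1.34 × 10^-2 and s1² is about 2.07 × 10^-4, matching the worked example in Section 5.4.2.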
There are two unknown parameters, μ and σ². Both parameters have Bayesian uncertainty distributions. Therefore, these distributions depend on four prior parameters, denoted here as d0, s0², n0, and m0. (This notation is not the same as Lee's.) Quantities with subscripts, such as s0² or d1, denote numbers here. Quantities without subscripts, μ and σ², have uncertainty distributions. The distributions of this section are a bit complicated, because there are four parameters to update. However, the update formulas are easy to use, as illustrated by the example below.

It is useful to think of having d0 degrees of freedom, corresponding to d0 + 1 prior observations for estimating the variance, and a prior estimate s0². More precisely, let the prior distribution for σ²/(d0 s0²) be inverted chi-squared with d0 degrees of freedom. This is another way of saying that d0 s0²/σ² has a chi-squared distribution with d0 degrees of freedom. Therefore d0 s0²/σ² has mean d0, and therefore the prior mean of 1/σ² is 1/s0².

An alternative notation for the above paragraph would define the precision τ = 1/σ², and the prior precision τ0 = 1/s0². Then the prior distribution of d0 τ/τ0 is chi-squared with d0 degrees of freedom. Although this paper does not use this parameterization, it has adherents. In particular, BUGS [10] uses τ instead of σ² as the second parameter of the normal distribution. This allows the parameter to have the familiar chi-squared distribution instead of an inverted chi-squared distribution.

Conditional on σ², let the prior distribution for μ be normal with mean m0 and variance σ²/n0. This says that the prior knowledge of μ is equivalent to n0 observations with variance σ². It is not necessary for n0 to have any relation to d0.

The Bayes update formulas are

d1 = d0 + n
n1 = n0 + n
m1 = (n0 m0 + n x̄)/n1
s1² = {d0 s0² + (n − 1)s_X² + [n0 n1/(n0 + n1)](m0 − x̄)²}/d1.

Here the subscript 1 identifies the posterior parameters. The posterior distributions are given as follows. First, σ²/(d1 s1²) has an inverted chi-squared distribution with d1 degrees of freedom. That is, the posterior mean of 1/σ² is 1/s1², and a two-sided 100(1 − α)% credible interval for σ² is

(d1 s1²/χ²_{1−α/2}(d1), d1 s1²/χ²_{α/2}(d1)).

Conditional on σ², the posterior distribution of μ is normal(m1, σ²/n1). Therefore, conditional on σ², a two-sided 100(1 − α)% credible interval for μ is

m1 ± z_{1−α/2} σ/√n1.

The present application to unavailability needs the marginal posterior distribution of μ, that is, the distribution that is not conditional on σ². It is based on the fact that

(μ − m1)/(s1/√n1)    (9)

has a Student's t distribution with d1 degrees of freedom. It follows that a 100(1 − α)% credible interval for μ is

m1 ± t_{1−α/2}(d1) s1/√n1.

Lee points out that when d0 = −1, n0 = 0, and s0² = 0, the conjugate prior distribution reduces to the noninformative prior, and the above distribution agrees with that of Section 5.3.1.

5.4.2. Example data analysis
Some generic data³ from seven plants give a mean unavailability for a CVC train of 6 × 10^-3, and a standard deviation of 3.5 × 10^-3. Therefore, we set m0 = 6 × 10^-3 and s0 = 3.5 × 10^-3. We could set n0 to 7 and d0 to 6, but instead let us be conservative and assume less prior information: set n0 = 3 and d0 = −1. The reason for this conservatism about n0 is that we are not sure how well the plants in the generic study match those of our example—in fact the example data are quite old, so may not match current practice well. We are even more conservative about d0 because there is little or no relation between the variance corresponding to calendar quarters at a single plant in Table 1 and the between-plant variance of the generic data. The value of −1 corresponds to no prior information about the variance.

³ S. A. Eide, personal communication, May 15, 2003.

Assuming the conjugate distribution, substitution into the update formulas gives

d1 = −1 + 5 = 4
n1 = 3 + 5 = 8
m1 = (3 × 6 × 10^-3 + 5 × 1.78 × 10^-2)/8 = 1.34 × 10^-2
s1² = {−1 × (3.5 × 10^-3)² + (5 − 1) × (1.16 × 10^-2)² + [3 × 8/(3 + 8)](6 × 10^-3 − 1.78 × 10^-2)²}/4 = 2.07 × 10^-4 = (1.44 × 10^-2)².

These updates appear reasonable, because m1 is between m0 and the mean of the data, and s1 is not far from the standard deviation of the data. (A comparison with s0 is meaningless in this example, because d0 was chosen to eliminate any information from s0.)

Let us now use the notation q_av1 for unavailability instead of the more generic symbol μ. Expression (9) becomes

(q_av1 − 1.34 × 10^-2)/[(1.44 × 10^-2)/2.83],

which has a posterior Student's t distribution with 4 degrees of freedom. A 90% posterior credible interval for
unavailability is

1.34 × 10^-2 ± 2.13 × (1.44 × 10^-2)/2.83 = 1.34 × 10^-2 ± 1.08 × 10^-2 = (3 × 10^-3, 2.4 × 10^-2).

Compared to the interval using the noninformative prior, this interval is pulled lower, toward the prior mean. It is not noticeably narrower; the benefit of the prior data is partly offset by the fact that the data and the prior are not very consistent.

5.5. Nonconjugate priors

Nonconjugate priors can also be considered. Then the posteriors cannot be found by simple algebraic updating. Instead, random sampling is the most promising tool. The freely available BUGS software [10] is one useful tool for performing such sampling.

6. Summary and conclusions

Two Bayesian approaches have been given for estimating unavailability.

One approach uses the individual outage times, 'detailed data.' It works most easily if the outage durations can be assumed to have exponential distributions and if, for unplanned maintenance, the up-time durations also have exponential distributions. A graphical tool was presented to help assess the correctness of these assumptions. The method is also not hard if the durations have gamma distributions with known shape parameters.

The other approach uses 'summary data,' totals of outage time and exposure time in various time periods. It requires aggregation of the data into subsets until the estimated unavailabilities for the various subsets appear to be approximately normal. This eliminates the need to assume any particular distribution in the underlying process. Tabular and graphical tools were presented to help assess whether normality has been achieved.

Both methods work easily with noninformative and informative priors, although the formulas for updating an informative prior with summary data are more intricate than those that use detailed data.

When noninformative priors are used with an example data set, the results are compared in Table 3. As can be seen, the two approaches give consistent results for the example.

Table 3
Comparison of analysis results using example data and noninformative priors

                                  Detailed data, assuming        Summary data
                                  exponential durations
Posterior distribution of q_av1   Based on F distribution        Based on Student's t distribution
Posterior mean                    1.87 × 10^-2                   1.78 × 10^-2
Posterior 90% interval            (1.1 × 10^-2, 3.0 × 10^-2)     (7 × 10^-3, 2.9 × 10^-2)

Acknowledgements

This work arose out of preparation for the Handbook of Parameter Estimation for Probabilistic Risk Assessment [11] under NRC contract JCN W6970. Thanks go to Steven A. Eide for advice and for all the data used here. Thanks also go to Robert F. Cavedo, Dana L. Kelly, and Jeffrey L. LaChance for valuable discussion on the subject of this paper. Finally, this paper benefited greatly from comments by George Apostolakis on an earlier version. The opinions expressed here are solely those of the authors.

Appendix A. Bayesian updates with detailed data for unplanned outages, under more complicated assumptions

A.1. Exact Bayesian update with exponential durations

The update formulas derived here are valid under the following assumptions:

1. the in-service or up-time durations (Y in Section 2) have an exponential(λ_up) distribution;
2. the outage durations (Z in Section 2) have an exponential(λ_down) distribution.

This differs from the assumptions in Section 4.1 in that MTTR is not required to be small.

For simplicity, use a gamma(α_up,0, β_up,0) prior for λ_up and a gamma(α_down,0, β_down,0) prior for λ_down. Let us also make the reasonable assumption that the prior distributions are independent; that is, if the mean duration of up times were to increase or decrease, this would provide no information about whether the mean outage duration increases or decreases.

Because MTTR is not necessarily small, this section must consider the possibility that the total exposure period may begin or end with the train in the down condition. Thus, define n as the number of outage onsets (or up-time completions) and r as the number of up-time onsets (or repair completions) in the observable data period. Because of the alternating nature of up-times and down-times, n and r can differ from each other by at most 1. The formula for the likelihood is now developed.

Let the observable data period be from a to b. Let outages begin at times t_1 through t_n, and up-times begin at times s_1 through s_r. There are four cases, depending on the condition
of the train at the start and end of the data period:

Case 1: Up at start, up at end.
Case 2: Up at start, down at end.
Case 3: Down at start, up at end.
Case 4: Down at start, down at end.

Consider Case 1 first. Then we have

a < t_1 < s_1 < t_2 < s_2 < ··· < t_n < s_r < b,    (A1)

with r = n. The up-durations are y_i = t_i − s_{i−1}, and the outage durations are z_i = s_i − t_i. The likelihood is the product

f_up(y_1 | T_1 > a) × f_down(z_1) × f_up(y_2) × f_down(z_2) × ··· × f_up(y_n) × f_down(z_r) × Pr(Y_{n+1} > b − s_r)    (A2)

with f denoting the probability density of the duration. The value of y_1 is not known, because the first up-time began at some unknown time s_0 before the start of the observable data. However, for an exponentially distributed random variable Y with density f, and any constant c ≥ 0, it is well known and easily shown that

f(y | Y > c) = f(y − c).

Therefore, with Y_1 = T_1 − s_0, we have

f(y_1 | T_1 > a) = f(y_1 | Y_1 > a − s_0) = f(y_1 + s_0 − a) = f(t_1 − a).

In conclusion, therefore, the likelihood is

λ_up e^{−λ_up(t_1 − a)} × Π_{i=2..n} [λ_up e^{−λ_up y_i}] × Π_{j=1..r} [λ_down e^{−λ_down z_j}] × e^{−λ_up(b − s_r)}

which equals

λ_up^n e^{−λ_up t_up} λ_down^r e^{−λ_down t_down}    (A3)

where t_up and t_down are the total time up and time down in the observable data period.

This is the likelihood for Case 1. The other cases are similar. It is not hard to show that they all result in exactly the same expression for the likelihood, Expression (A3).

The joint posterior density is proportional to the joint prior density multiplied by the likelihood. It follows that the posterior distributions are independent, with

λ_up ~ gamma(α_up,1, β_up,1) = χ²(2α_up,1)/(2β_up,1)
λ_down ~ gamma(α_down,1, β_down,1) = χ²(2α_down,1)/(2β_down,1),

where

α_up,1 = α_up,0 + n
β_up,1 = β_up,0 + t_up
α_down,1 = α_down,0 + r
β_down,1 = β_down,0 + t_down.

A.2. Updates when the durations have gamma distributions

Suppose that Z_1, Z_2, … are independently gamma(κ_down, λ_down) distributed, with κ_down assumed to be known and λ_down unknown. Let λ_down have a prior gamma(α_down,0, β_down,0) distribution. Let r′ be the number of outages whose durations are fully observable. It does not include outages that begin before the start of the observable data period or that end after the observable data period. One can write out the portion of the likelihood corresponding to the fully observed outages. This is the product of the densities of z_1 through z_{r′}. When this is multiplied by the prior density, the product is proportional to a gamma density. Therefore λ_down has a posterior gamma(α_down,1, β_down,1) distribution, with

α_down,1 = α_down,0 + r′ × κ_down
β_down,1 = β_down,0 + t_down.

One can apply the same approach to the up-times and imitate the exact method of Section 4.2.1. Alternatively, if MTTR ≪ MTTF, the approximate method of Section 4.1 can be imitated. This is done next.

Assume that (Y_1 + Z_1), (Y_2 + Z_2), … are independently distributed with a gamma(κ_cycle, λ_cycle) distribution. Assign a prior gamma(α_cycle,0, β_cycle,0) distribution to λ_cycle. Let n′ be the number of cycles that are fully observable. For example, if the up and down times occurred as in Inequality (A1), the first fully observable cycle would go from s_1 to s_2, and the last fully observable cycle would go from s_{n−1} to s_n; in all, there would be n′ = n − 1 fully observable cycles, and any information in the data before s_1 or after s_n would not be used. In the example in the body of this paper, the 21 train outages correspond to 19 fully observable cycles. It is the analyst's choice whether a cycle is defined as an up-time followed by an outage or as an outage followed by an up-time. The reason for discarding the end pieces of the data is that the last term of Expression (A2) makes simple update formulas impossible, and the first term of Expression (A2) cannot even be written when the durations are not exponentially distributed.

Then the posterior distribution of λ_cycle is gamma(α_cycle,1, β_cycle,1), with

α_cycle,1 = α_cycle,0 + n′ × κ_cycle
β_cycle,1 = β_cycle,0 + t_expos.

If λ_down and λ_cycle are assumed to have independent prior distributions (to good approximation), then they also have independent posterior distributions.

Appendix B. Simulations to investigate asymptotic normality and performance of aggregation method

The following simulations were carried out with alternating renewal processes. They showed that the aggregation
method does eventually lead to approximate normality for the cases investigated. However, convergence may be slower than suggested by simple inspection of any one data set. Therefore, the prudent analyst should get as large a data set as realistically possible, and aggregate the data into a few relatively large subsets. In addition, the analyst must be alert to lower limits that go close to zero or even below zero.

All the investigations were based on simulating up-time and down-time durations to construct many aggregated data subsets and corresponding fractions of down time x. In Section B.1, the distribution of the xs was compared to a normal distribution. In Section B.2, many intervals were constructed, each based on five xs. The (long-term frequentist) distribution properties of the intervals were compared with the corresponding properties if the xs were normal.

B.1. Asymptotic normality

The assumptions for the simulations were the following.

Assumption 1 (exponential distributions)
Down-time durations are exponentially distributed with mean 15 h.
Up-time durations are exponentially distributed with mean 750 h.

Assumption 2 (lognormal distributions)
Down-time durations are lognormally distributed with mean 15 h and standard deviation 20 h.
Up-time durations are lognormally distributed with mean 750 h and standard deviation 1000 h.

In every case, the process was started with the train up, and allowed to run for 10,000 h, the burn-in time. This was to allow the train to 'forget' the initial condition of starting while up. After the burn-in time, the train was run for 100,000 consecutive periods of a hours per period. The interpretation of a is the size of an aggregated data subset, and the subsets were chosen to be consecutive to mimic the data aggregation within an actual data set. Within each period the total down time was recorded. The ratio of down time to a is x, the estimate of q for an aggregated data subset. (The simple notation q is used here instead of q_av1 because the two quantities are equal for an alternating renewal process.) Based on the 100,000 values of x, the sample mean, variance, skewness, and kurtosis of X were calculated.

This was repeated 10 times, that is, 10 sets of 100,000 values were found, to give an idea of the accuracy of the above sample moments. The 10 sample means were averaged, and a 90% confidence interval was found, assuming that the sample means are normally distributed. Similarly, the sample variances and higher moments were averaged and confidence intervals were approximated.

The results are shown in Tables B1 and B2 for several values of a.

Table B1
Sample moments of X, for exponential durations and various amounts of aggregation

a (h)      Mean^a     St. dev.^b   Skewness^c   Kurtosis^d
3000       0.01961    0.01370      1.030        1.39
10,000     0.01960    0.00751      0.556        0.40
30,000     0.01961    0.004342     0.329        0.15
100,000    0.01961    0.002377     0.174        0.043

^a 90% confidence interval for mean is no wider than mean ± 0.16%. The true mean of X is 0.0196078.
^b 90% confidence interval for st. dev. is no wider than mean ± 0.2%.
^c 90% confidence interval for skewness is no wider than mean ± 2%. The normal distribution has zero skewness.
^d 90% confidence interval for kurtosis is no wider than mean ± 20%. The normal distribution has zero kurtosis. Some authors define kurtosis as the above value plus 3.

As can be seen, in both cases the skewness decreases by roughly a factor of 2 when the aggregation increases by a factor of 3.

We also wished to see whether the assumed independence of up-time durations and subsequent outage durations was crucial. Therefore, similar simulations were performed with lognormal distributions, when the underlying normal distributions for Y_i and Z_i had correlation either −0.8 or +0.8. The resulting tables are not shown but they resemble Table B2, only with somewhat different rates of convergence.

Table B2
Sample moments of X, for lognormal durations and various amounts of aggregation

a (h)      Mean^a     St. dev.^b   Skewness^c   Kurtosis^d
3000       0.01962    0.01696      2.02         11
10,000     0.01962    0.00971      0.97         2.8
30,000     0.01961    0.00571      0.485        0.88
100,000    0.01961    0.003161     0.249        0.32

^a 90% confidence interval for mean is no wider than mean ± 0.14%. The true mean of X is 0.0196078.
^b 90% confidence interval for st. dev. is no wider than mean ± 0.4%.
^c 90% confidence interval for skewness is no wider than mean ± 3%. The normal distribution has zero skewness.
^d 90% confidence interval for kurtosis is no wider than mean ± 13%. The normal distribution has zero kurtosis. Some authors define kurtosis as the above value plus 3.

B.2. Effectiveness of aggregation procedure with only approximate normality

The above tables suggest that the distribution of X is asymptotically normal as the aggregation level increases. They do not address the question of how large an aggregation is 'good enough', how close to normal the distribution needs to be. To investigate this question, the following studies were performed.
Table B3
Behavior of n intervals with nominal 90% confidence level, based on aggregation when durations are lognormal

           All data sets                          Data sets with no zeros
a (h)      n        Fraction    Fraction         n        Fraction    Fraction
                    too low     too high                  too low     too high
3000       20,000   0.133       0.021            14,828   0.090       0.028
10,000     20,000   0.086       0.029            19,514   0.083       0.030
30,000     20,000   0.068       0.037            19,980   0.068       0.037
100,000    20,000   0.057       0.043            20,000   0.057       0.043

B.2.1. Results based on probability that interval contains true q
One of the sequences of 100,000 x values was broken into 20,000 pieces, with five consecutive x values in each piece. These five consecutive values mimic the ones in the example data analysis of this paper, which also used five x values. (In addition, the use of 3000 h for each x is similar to the example data analysis, which had an average of 3570 train-hours per subset.) Based on these five values, a 90% confidence interval for q was calculated, equivalent to a 90% posterior credible interval based on a noninformative prior. This interval is supposed to contain q with high probability. The true value of q is known in these simulations, so the interval could be compared with q. The number of times when the interval lay entirely on one side of q was recorded. If the procedure really calculates 90% confidence intervals, then the interval should fail to contain q about 10% of the time.

In fact, a data analyst would not use the aggregation method when any of the aggregated subsets had zero observed outage time. Instead, the analyst would aggregate further. To mimic this behavior at least partially, a simulation was considered in which a data set was discarded if it contained any subset with zero outage time. (In addition, the data analyst might aggregate further if the lower limit of the 90% interval were near zero or negative. However, this is not part of the procedure given in Section 5.1, so it is not considered here.) Table B3 shows the resulting performance of the 90% intervals. The numbers of the table are based on 20,000 sets of five x_i's each.

Simple modifications to the procedure make little difference to the table. For example, when the same sort of simulations are carried out using three points per interval instead of five, the fractions too high and too low are virtually unchanged. However, using fewer points would allow the aggregation level a to be somewhat higher within a fixed total data set. Also, if data sets are discarded whenever any data point is zero or the lower end of the 90% interval is negative, the number of discarded data sets is only slightly larger than shown in the right portion of Table B3.

If the evaluation were based only on the probability that the interval covers the true q (the criterion on which confidence intervals are defined), one would conclude that the aggregation procedure of this paper seems to be biased low. However, this is not the whole story, as shown below.

B.2.2. Results based on distributions of interval end points and midpoint
Fig. B1 shows the cumulative distribution of the lower limit, point estimate, and upper limit, based on the simulations used for the right portion of row 1 of Table B3. Row 1 is of interest because it corresponds to the least convergence to normality.

In Fig. B1, the curve on the left represents the cumulative distribution of the lower end of the 90% interval. As can be seen from the figure, the lower end was negative in about 6% of the cases, and greater than q (marked by the dashed vertical line) in about 3% of the cases. Similarly, the middle curve represents the cumulative distribution of the midpoint, x̄. It was smaller than q in about 47% of the cases and larger in the other 53%. This slight bias results from having discarded any data sets that contain zeros. Finally, the curve on the right is the cumulative distribution of the upper end of the 90% interval.

It is tempting to draw a horizontal line at some elevation, extending from the left curve to the right, and to say that this horizontal line represents one particular 90% interval for q. This is an oversimplification, because the data sets can have very diverse standard deviations, more extreme than is indicated by the spacing between the curves. To construct Fig. B1, the lower limits were sorted and plotted. Then the point estimates were sorted, independently of the sorting for the lower limits, and

Fig. B1. Cumulative distributions of point estimate and 90% limits for unavailability, based on simulations with lognormal durations, and 3000 exposure hours in each aggregated subset.
plotted. Finally, the upper limits were also sorted independently of the other values and plotted. Any relation between lower and upper limits in a single interval was lost when the values were sorted.

To see how much bias is present, the above results were compared with the 'gold standard', the analogous curves if the xs actually came from a normal distribution. The normal distribution was constructed to match the data used for the simulation in row 1 of Table B3, with mean q = 0.01961 and standard deviation 0.01708. Sets of five such values were generated, and the point estimates and 90% intervals were calculated, 100,000 intervals in all. Many of the data sets contained negative values, but discarding them would destroy the normality of the data, so they were used. Fig. B2 shows the comparison of the three curves from Fig. B1 and the analogous three curves from the normal data.

Fig. B2. Curves from Fig. B1 and analogous curves based on normal data.

While it is true that the normal intervals are to the left of q in 5% of the cases and to the right of q in 5% of the cases, the intervals from normal data stray into the negative region about 30% of the time. If one tries to solve this problem by excluding any data sets that contain values of zero or smaller, the resulting curves are shifted to the right, i.e. biased high. Thus, the 'gold standard' also has its problems.

The problems seem to go away with increased aggregation. For example, Fig. B3 shows the same curves when the aggregation level a is increased from 3000 to 30,000 h. Here, the normal-approximation method seems to work very well.

Fig. B3. Curves from Fig. B2 but now using 30,000 h of aggregation in each subset instead of 3000 h.

B.2.3. Conclusions

The final conclusions seem to be the following. As long as the process involves small aggregated subsets, so that s/√n is comparable to the mean, there will be some data sets giving intervals that stray below zero, and a lack of full agreement between the approximation of this paper and results with truly normal data. The 'perfect' solution is to use more data, and to aggregate into larger subsets. If instead one must make do with an imperfect solution, the aggregation proposed in this paper may be acceptable. For example, halfway up Fig. B2 the width of the band using aggregated data is 86% of the width of the band using normal data. This may be good enough. The data analyst can help guard against problems by aggregating as much as possible, and by being alert to cases when the lower end point of the interval is near zero.

References

[1] Apostolakis G, Chu TL. The unavailability of systems under periodic test and maintenance. Nuclear Technol 1980;50:5–15.
[2] Ross S. Stochastic processes. New York: Wiley; 1983.
[3] Cox DR, Hinkley DV. Theoretical statistics, p. 371. London: Chapman & Hall; 1974.
[4] Bain LJ, Engelhardt M. Introduction to probability and mathematical statistics. Boston: PWS-KENT Publishing Company; 1992.
[5] Kuo W. Bayesian availability using gamma distributed priors. IIE Trans 1985;17:132–40.
[6] Wilk MB, Gnanadesikan R. Probability plotting methods for the analysis of data. Biometrika 1968;55:1–17.
[7] Microsoft Excel, Microsoft Corporation, Seattle, WA; 2001.
[8] Lee PM. Bayesian statistics: an introduction, 2nd ed. London: Arnold, a member of the Hodder Headline Group; 1997.
[9] Box GEP, Tiao GC. Bayesian inference in statistical analysis. Reading, MA: Addison-Wesley; 1973.
[10] Spiegelhalter DJ, Thomas A, Best NG, Gilks WR. BUGS: Bayesian inference using Gibbs sampling, Version 0.50. Cambridge, UK: MRC Biostatistics Unit; 1995. Available at http://www.mrc-bsu.cam.ac.uk/bugs/.
[11] Atwood CL, LaChance JL, Martz HF, Anderson DJ, Engelhardt M, Whitehead D, Wheeler T. Handbook of parameter estimation for probabilistic risk assessment, NUREG/CR-6823, SAND2003-3348P. 2003.
