
An integrated Tool Set for Software Safety Analysis

The document discusses the development of an integrated tool set for software safety analysis aimed at addressing the integration issues present in traditional safety assessment methods for safety-critical systems. It emphasizes the importance of systematic assessment of software failure modes and the creation of a comprehensive safety case. The authors describe their approach, which combines fault tree analysis and failure modes, effects, and criticality analysis, along with a prototype tool set designed to facilitate these analyses within the software development lifecycle.

J. SYSTEMS SOFTWARE
1993; 21:279-290

An Integrated Tool Set for Software Safety Analysis

Peter Fenelon and John A. McDermid


Department of Computer Science, University of York, Heslington, York, United Kingdom

Address correspondence to Peter Fenelon, Dept. of Computer Science, University of York, Heslington, York YO1 5DD, England.

© Elsevier Science Publishing Co., Inc., 655 Avenue of the Americas, New York, NY 10010. 0164-1212/93/$6.00

Traditional methods for assessing software safety suffer from poor integration (from methodological, operational, and semantic points of view) both with each other and with the rest of the development life cycle of safety-critical systems. Our goal is to develop a set of methods and tools that addresses these weaknesses; this article describes our current research in these areas. We describe an integrated approach to software safety analysis based on the techniques of fault tree analysis and failure modes, effects, and criticality analysis, together with a prototype tool set to implement these techniques. Issues pertaining to the integration of safety analysis into a broader development life cycle are also discussed. Our approach emphasizes pragmatism and simplicity: we aim to create a set of tools and methods that are robust, straightforward, and directly usable by industrial practitioners in the field of software safety.

1. INTRODUCTION

As the use of computer systems in safety-related applications becomes ever more widespread, we are increasingly faced with the task of deciding whether the software used in the systems is sufficiently “safe.” To avoid confusion, we introduce the term safety integrity to identify the properties of the design and implementation of software such that it shall not behave in a way that will endanger the safety of the system in its intended use. The concept of software safety integrity is relatively subtle: software failures alone cannot cause harm; only the interactions between “faulty” software and the rest of the system can do so. Correctness and safety integrity are not necessarily the same thing: it is possible for a system to be in a state that is incorrect with respect to a functional specification but that still meets defined safety criteria.

The task of the engineer involved in the development of the software components of safety-critical systems should therefore include the systematic assessment of the potential failure modes of the software and the analysis of the consequences of these failure modes. The results of these analyses must then be integrated into an overall safety case for the system. A safety case is a complete set of evidence produced to demonstrate that an operational system is safe to deploy for its intended use.

There are already a number of methods and tools that claim to assist in this process, several of which will be discussed in more detail later. However, it is our belief that, in their traditional forms, these methods suffer from poor integration into the system development life cycle at several levels:

Semantic level: Links between different methods used in safety assessment and links between assessment, design, and implementation are generally inadequate in that there are no common semantic frameworks (system or computational models) to link the techniques; hence (for example) the meaning of data derived from safety analysis of a piece of software may be open to misinterpretation in the context of the system design without a sound underlying model for integrating information from different views and levels of abstraction of a system.

Procedural level: There is often no sound methodological framework in which to use the safety assessment methods, so that, for instance, the effect of a design decision on the safety case is not properly considered at the time the decision is made; this can be addressed by integrating safety analysis procedures and their results into the review and change-management procedures.

Operational level: Integration between tools used for system development and safety assessment is traditionally poor.

This article describes our attempts within the Software Safety Assessment Procedures (SSAP)

project to design methods and tools that bridge these semantic and operational gaps and thus facilitate procedural integration, yet also retain a high degree of compatibility with traditional methods. Although our emphasis here is on semantic integration, we shall also briefly discuss our current implementation of a prototype tool set, built largely from standard software components, which aims to provide operational support for our methods.

2. BACKGROUND OF THE SSAP PROJECT

The SSAP project, funded by British Aerospace Defence (Military Aircraft Division), commenced in September 1990. Its objectives are to construct a flexible and powerful environment in which tools concerned with software safety can operate and to use this framework to assess the usefulness of these methods. We have worked with particular reference to existing standards used in large-scale industrial safety-critical software development.

Wherever possible, we have chosen to use existing methods as a basis for our analyses and simple software components as a basis for implementation. A pragmatic approach to design and implementation based on object-oriented principles has enabled us to propose and develop a loosely coupled set of tools sharing a common data base and offering the user a broad range of software safety analysis methods.

The two basic techniques at the heart of SSAP are fault tree analysis (FTA) and failure modes, effects, and criticality analysis (FMECA). Both methods are well understood in the systems engineering processes in the aerospace industry (and elsewhere); much of our effort has concentrated on developing idioms and extensions that will enable us to describe the failure properties of software systems more precisely.

We have also proposed a new notation which links FTA and FMECA in a simple and easily comprehensible fashion. This was required because we believe that software FTA and FMECA, although both powerful and useful methods, do not integrate well at present. We required a notation that offers the precise semantics of fault trees and the intuitive readability of FMECA and that can be used effectively as a bridge between the two. Our new failure propagation and transformation notation (FPTN) is somewhat analogous to traditional data flow-based design notations, although instead of showing normal data flow between elements in a system, it describes the propagation and transformation of failures. We will describe the relationship between this notation, FTA, and FMECA in a later section and explain what we mean by transformation of failures.

Note that SSAP is only one of a number of projects at the University of York concerned with safety-critical systems; we are already considering integration of SSAP with various other tools. We will discuss operational integration in the concluding section of the article.

We believe that the use of well-founded methods at all levels of the systems engineering process is important, but also that these methods should be presented to the engineer in as straightforward a fashion as possible. Our aim in SSAP is to provide a simple set of graphical or textual notations with an underlying set of formalisms and semantic bases to link these methods into the formal methods used in safety-critical software development. The basic methods used in SSAP have been selected because it is simple to link FMECA to FTA (via our new notation), and, since it has been demonstrated that it is also conceptually easy to link FTA to Dijkstra’s wp-calculus [1], we can provide a link between software FTA and formal methods.

We therefore hope to be able to provide a set of methods which have a real semantic basis and reflect the structure and meaning of software components at all levels. We are particularly interested in automatic derivation of fault trees from program code and from structured design notations, notably CORE (controlled requirements expression).

This article describes the principles underlying our work and the status of our tool set. We discuss FTA, FMECA, and our new FPTN notation before considering broader integration issues.

3. FTA IN SSAP

FTA has been in use since at least the 1960s as a tool for the assessment of system reliability [2] and has developed into a well-understood, standardized method with broad applications throughout the discipline of safety and reliability engineering. The best introduction to classic FTA is the extensive and authoritative Fault Tree Handbook [3].

Traditional FTA is a probabilistic method in which potential causes of some failure (top event) are recursively organized in a tree structure reflecting causality; causality is a crucial notion underlying all safety analysis techniques. Higher level events can be caused by various combinations of lower level events, with the principal connectives used in the tree being the AND and OR gates, which have meanings analogous to those traditionally used in electronic circuit design. Majority gates (where N

out of M inputs must be true before the output is true), exclusive-OR gates and INHIBIT gates (which generate a true output if some input representing an event in the system is true and some external “conditioning event” has occurred) are also available, although these tend not to be used when FTA is applied to software systems.

Much of the basic work on the application of fault trees to software systems has been carried out by Leveson and colleagues. In particular, Leveson and Harvey [4] demonstrated a template-based approach to software FTA in which programming language constructs are mapped onto instances of a set of templates describing the failure properties of the statement in question (for example, there are templates for IF, WHILE, assignment, and so on). These are composed in well-defined ways to derive a fault tree describing the failure properties of a software component. In Leveson’s work, the control flow structure of the system is assumed to reflect the causality of faults, provided the faults are such that control flow is not violated by the occurrence of a fault.

Note that software FTA is not probabilistic since the failure modes of software are (in theory) deterministic. Software FTA can, however, be interfaced to system-level FTA to provide a more complete view of system safety. It is also possible to consider a software fault tree in which the leaf nodes are representations of hardware events which can meaningfully be assigned failure probabilities; from such a tree, we can estimate the probability of a particular software failure mode arising by examining its external causes.

Leveson’s extensions to traditional methods of FTA are tied closely to traditional imperative languages with simple constructs. Later attempts to add support for the Ada concurrency model [5] are somewhat less convincing, since the templates are more complex and it is debatable whether they capture the full failure semantics of Ada’s complex tasking statements. Leveson-style fault trees are also structured in terms of the system in normal mode of operation rather than reflecting its failure properties; they emphasize a depth-first and bottom-up approach to analysis. This is counter to the traditional top-down application of fault trees. We have found that the bottom-up approach to software FTA is difficult to explain to traditional safety engineers. Software fault trees can also become unwieldy and difficult to manipulate, but our tool set handles decomposition and ‘folding’ of fault trees and alleviates this problem to some degree.

While this approach is acceptable when dealing with code that is already written, we believe that the bottom-up method of FTA has prevented its proactive use in the design and implementation phases of the software life cycle. This is particularly important as we believe that successful integration of safety assessment techniques such as FTA into the development life cycle is possible only if we have a strong link between safety assessment methods and the design process, for example, that would allow us to determine where safety features (defenses that prevent potentially harmful internal failures from propagating) need to be included in a system (this implies that there is a need for rapid iteration between design and safety analysis and strong semantic links between the methods used). We have therefore proposed a form of hierarchical FTA (HFTA) which, although compatible with Leveson’s methods, takes an almost diametrically opposed approach and is inherently a top-down and breadth-first method.

3.1 HFTA

Here we attempt to demonstrate some of the principles of our HFTA method. We show the broad techniques applicable in HFTA by illustrating the initial steps of a small case study of a piece of noncritical software. The same methods are currently being used on a more substantial case study with a critical component; initial results indicate that there is a substantial saving in effort and increase in comprehensibility over the traditional approach.

The case study we use here centers on a simple desk accessory program for a microcomputer which allows the user to format floppy disks with various nonstandard parameters. Obviously this example is not safety critical, but it demonstrates the principles of HFTA without requiring us to consider the more complex hazards inherent in “real-world” applications. The software under analysis was written in MODULA-2 as two modules. One of these (DskForm) contains the format procedure used to format the disks; the other (FopmAce) contains the procedure that dispatches work to the formatter (Do_Work) and the entry and main event loop of the accessory (as module-level initialization code).

A note on terminology: AES (application environment services) is a part of the Atari ST’s GEM windowing system, which deals with multitasking applications. It is impossible for a program which has not successfully negotiated with AES to make use of any high-level graphics functions such as menus, dialogue boxes, etc. TOS is the lowest level of the Atari OS, approximately equivalent to the BIOS on an IBM-compatible personal computer.

For a floppy disk to be successfully formatted, the program must gather the appropriate parameters (number of sides, tracks, and sectors per track) from the user, create the individual tracks by means of repeated calls to an operating system routine, and then write a “boot sector” to the disk containing information about the type of disk just formatted.
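
The three steps just described can be sketched in Python as a rough illustration of where the formatter can fail. This is not the authors' MODULA-2 code: the callables `format_track`, `make_boot_sector`, and `write_boot_sector` are hypothetical stand-ins for the operating system routines, and the fake drive is invented for the example. The negative-return error convention and the side/track loop nesting follow the description given later in the article; a `None` result for a failed boot sector is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class FormatParams:
    sides: int
    tracks: int
    sectors_per_track: int


def format_disk(
    params: FormatParams,
    format_track: Callable[[int, int, int], int],
    make_boot_sector: Callable[[FormatParams], Optional[bytes]],
    write_boot_sector: Callable[[bytes], int],
) -> str:
    """Return "ok", or the name of the first step that failed."""
    # Create every track on every side; a negative result signals an error.
    for side in range(params.sides):
        for track in range(params.tracks):
            if format_track(side, track, params.sectors_per_track) < 0:
                return "track_format_failed"
    # Build the boot sector describing the disk just formatted.
    boot_sector = make_boot_sector(params)
    if boot_sector is None:  # assumed failure convention
        return "boot_sector_creation_failed"
    # Write the boot sector to the disk.
    if write_boot_sector(boot_sector) < 0:
        return "boot_sector_write_failed"
    return "ok"


def make_fake_drive(max_tracks: int) -> Callable[[int, int, int], int]:
    """Hypothetical drive that rejects tracks beyond its physical limit."""
    def format_track(side: int, track: int, sectors: int) -> int:
        return 0 if track < max_tracks else -1
    return format_track


def fake_boot_sector(params: FormatParams) -> Optional[bytes]:
    """Hypothetical boot sector: just the three parameters as bytes."""
    return bytes([params.sides, params.tracks, params.sectors_per_track])


def fake_write(boot_sector: bytes) -> int:
    """Hypothetical disk write that always succeeds."""
    return 0


if __name__ == "__main__":
    drive = make_fake_drive(max_tracks=82)
    print(format_disk(FormatParams(2, 80, 9), drive, fake_boot_sector, fake_write))
    print(format_disk(FormatParams(2, 84, 10), drive, fake_boot_sector, fake_write))
```

Each early return corresponds to one way in which the formatter can fail to produce a usable disk, which is the kind of event the fault trees in the following sections are built from.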

3.2 Top-Level View (Systems Perspective)

We are concerned with a particular failure mode of the formatter: what can cause it to fail to format disks correctly?

From a systems point of view, we might see a top-level fault tree, as shown in Figure 1, considering the following as potential root causes of the failure of the formatter to work correctly:

• AES communication failure: the application has not successfully registered itself with the AES.
• Error caused inside formatter software (internal error).
• Fault on media: there is a genuine defect in either the floppy disk or the drive.

Figure 1. System view of fault tree (top event: disk fails to format).

3.3 Top-Level View (Software Perspective)

Without access to the underlying operating system source code, we can only investigate the failure of the software down to the level of failure of individual system calls involved in the successful formatting of a disk. We consider the top level of the system from a software point of view, looking at the modules which can fail (Figure 2). The software view is related to the systems view: the failure of the formatting software is equivalent to the “internal failure inside formatter” case in Figure 1. The obvious action here is to investigate the three potential causes of failure and discover what failure modes they can bring about.

Figure 2. Software view of fault tree.

3.4 Second-Level Decomposition: DskForm Initialization Code

The only failure in the initialization/main loop that can cause the formatter not to succeed leads to the accessory not responding to requests to format disks. There are three potential causes of this, each of which can be uniquely associated with a particular block of code:

• failure to initialize as an AES application (call to the OS routine appl_init);
• failure to register on the Desk menu (call to the OS routine menu_register);
• failure to respond to AES messages from the Desktop (desk accessory fails to return from a call to the OS routine evnt_mesag).

This leads to the fault tree for this subsystem shown in Figure 3.

Figure 3. Failure inside the DskForm module.

3.5 Second-Level Decomposition: Do_Work Code

The possibility exists that the user could select invalid parameters inside the Do_Work procedure, which pops up dialogue boxes with buttons inviting the user to select which drive and how many sides, tracks, and sectors per track should be used. (Although Atari and compatible floppy drives are nominally rated at 80 tracks, 9 sectors per track, many drives allow more; this program gives the user the choice of selecting up to 84 tracks in 10 sectors. Both single- and double-sided drives are available and selectable.) There is no guarantee that the parameters the user selects will work on the selected drive: this leads to the fault tree shown in Figure 4.

3.6 Second-Level Decomposition: Format Code

We now consider the consequences of failures inside the actual Format procedure. There are only two basic causes of failure to format a floppy disk correctly in this procedure: failure to format one of the tracks correctly or failure to generate a correct boot sector (written on the first sector of the disk and containing information on the capacity of the disk, number of sides, sectors, and tracks, and a unique volume identifier). However, we might find that bad parameters passed in by the Do_Work routine give rise to an anomalous situation (e.g., trying to format a disk to 84 sectors in a drive capable of only 82). This gives rise to the fault tree shown in Figure 5.

Figure 5. Failure inside the Format procedure (leaf events include bad drive parameters and bad physical media).

The “bad media” failure corresponds to a failure in physical formatting of the disk tracks. This is handled by a pair of nested FOR loops in the Format procedure, one loop iterating over the sides of the disk and the other over the tracks on each side. We can apply Leveson-style templates to this section of code if necessary; the format procedure will return an error code if the call to the operating system flopfmt (format track) routine fails on a bad sector or track.

We can also do some third-level decomposition: we can examine potential causes of the generation of an invalid boot sector.

3.7 Third-Level Decomposition: Boot Sector Generation

Two potential failures can prejudice the creation of a usable boot sector:

• failure of the call to the OS protobt (create prototype boot sector) routine (invalid parameters passed to it);
• failure to write a successfully created prototype boot sector to the disk: the call to the OS flopwr routine (floppy disk write) might fail.

The corresponding fault tree is shown in Figure 6.
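
As a rough illustration, the hierarchy built up in Sections 3.2 to 3.7 can be written down as a small tree and queried. This is a minimal sketch, not the SSAP tool set: the `Event` class, the helper names, and the event labels are assumptions made for the example, and since the figures are not reproduced here, the grouping of the three software-level causes (initialization code, Do_Work, Format) is inferred from the section headings.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import FrozenSet, List, Set


@dataclass
class Event:
    name: str
    gate: str = "LEAF"  # "LEAF", "OR", or "AND"
    children: List["Event"] = field(default_factory=list)


def occurs(node: Event, basic_events: Set[str]) -> bool:
    """True if the event occurs given the set of basic (leaf) events."""
    if node.gate == "LEAF":
        return node.name in basic_events
    results = [occurs(child, basic_events) for child in node.children]
    return any(results) if node.gate == "OR" else all(results)


def cut_sets(node: Event) -> Set[FrozenSet[str]]:
    """Minimal cut sets of an AND/OR tree, as sets of leaf names."""
    if node.gate == "LEAF":
        return {frozenset([node.name])}
    child_sets = [cut_sets(child) for child in node.children]
    if node.gate == "OR":
        combined = set().union(*child_sets)
    else:  # AND: one cut set from each child, merged
        combined = {frozenset().union(*combo) for combo in product(*child_sets)}
    # Absorption: drop any set that strictly contains another.
    return {s for s in combined if not any(other < s for other in combined)}


def leaf(name: str) -> Event:
    return Event(name)


def any_of(name: str, *children: Event) -> Event:
    return Event(name, "OR", list(children))


# Third level (Section 3.7)
boot = any_of("invalid boot sector",
              leaf("protobt call fails"), leaf("flopwr call fails"))
# Second level (Sections 3.4 to 3.6)
fmt = any_of("failure inside Format",
             leaf("bad drive parameters"), leaf("bad physical media"), boot)
do_work = any_of("failure inside Do_Work",
                 leaf("user selects invalid parameters"))
init = any_of("failure in initialization code",
              leaf("appl_init call fails"),
              leaf("menu_register call fails"),
              leaf("evnt_mesag call never returns"))
# Top level (Sections 3.2 and 3.3)
software = any_of("internal error in formatter software", init, do_work, fmt)
top = any_of("disk fails to format",
             leaf("AES communication failure"), software,
             leaf("fault on media"))

if __name__ == "__main__":
    for cut in sorted(cut_sets(top), key=sorted):
        print(sorted(cut))
```

Because every gate in this particular tree is an OR, each minimal cut set contains a single leaf: any one of the basic events is sufficient on its own to stop the disk being formatted. A tree containing AND gates would yield multi-element cut sets from the same functions.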

Figure 4. Failure inside the Do_Work procedure.

3.8 Further Decomposition

We have yet to decompose the system down to code level, yet we have gained a good understanding of what potential failures there are and how they contribute to the overall failure to format disks. Where necessary, we can apply Leveson-style template FTA to the blocks of code represented by the leaves of our second- and third-level fault trees. Note that throughout our analysis, we have identified faults that are meaningful in terms of the application, not

merely artifacts of the program structure. Were we to descend into the implementational aspects of such an analysis, we could then use Leveson-style templates to model the code fragments we discover. For example, the actual formatting of the tracks on the disk is carried out by repeated calls (nested For loops, one iterating over sides and the other over tracks) to the operating system flopfmt routine; we could construct a Leveson-style model by nesting the fault tree representation of the inner For loop inside that of the outer loop and appending a template representing procedure call with an exceptional return (flopfmt returns a negative number to the caller if an error has occurred).

Figure 6. Failure in Boot Sector generation.

3.9 Conclusions On Software FTA

The high-level failure modes relate to the semantics of the application. We can think of them as ways in which the system might violate its specification. In effect, they are an application-oriented causal abstraction from the details of the program. They are hypothetical (judgmental) until they are confirmed by detailed analysis in terms of the control flow structure of the program. In this example, as is normally the case, FTA leads us to focus on small parts of the actual code. This is at considerable variance with normal approaches to program verification, but it can lead to more focused and efficient analysis.

4. FMECA IN SSAP

FMECA is a quantitative and qualitative method used to analyze the effects of a single failure on a system. It is particularly useful as a tool for summarizing the failure behavior of a system and can be used proactively as a tool for reliability growth [6] if well integrated into the development life cycle. FMECA is also often used as the basis for analyzing maintainability and related requirements, although this is in the broader systems context; we do not discuss this further here. FMECA and related techniques are of great pragmatic importance, as one of the most common requirements in many safety-critical systems is that there should be no single point of failure that can lead to a hazard; FMECA can be used to analyze a design to verify that no single-point failure can propagate through the system to cause a hazard.

FMECA is well understood at the systems level; sound procedures and standards have existed for many years (see [7] for more information) and equipment suppliers and users (at least in the aerospace industry) have developed much experience using the method. Suppliers of all manner of equipment provide lists of potential failure modes for components and equipment, which can be used as data for FMECA at higher levels; standardized worksheets for constructing FMECAs of subsystems and systems exist and computer support (basically interfaces to traditional data bases) is often available. The principle behind FMECA is simple: it attempts to evaluate the effects of a single failure of a component on the system as a whole. The mapping from failure modes to effects may be a direct one or hierarchies of failure modes and effects may be necessary, with combinations of failures required to bring about particular output states. FMECA is traditionally a laborious process with no real effective procedure beyond the mechanical filling in of worksheets. In essence, these are an abstraction of a judgemental causal analysis of the design concerned with the propagation of a single-failure mode through the system (compare this to FTA, which is concerned with the conjunction of failure modes). There is little methodological support: the abstraction of failure modes from the design is largely a judgemental process and the method’s applicability to software systems has in the past often been doubted. FMECA is also largely an experience-based procedure: prior knowledge of similar systems often serves to structure the analysis and, at the systems level, FMECA is often closely related to similar experiential methods such as Zonal hazard analysis, which is concerned with the interaction between systems within particular physical regions.

FMECA has yet to be applied to software in a truly comprehensive and successful fashion. In part, this is due to the lack of readily identifiable software components with well-defined failure properties, and, we believe, because existing methods of software

FTA are not structured in terms of the failure behavior of the software but in terms of its logical structure. We believe that our hierarchical FTA method, with its emphasis on a failure-based approach to structuring the tree, is inherently more suited to integration with FMECA: every intermediate node, right down to the code level, reflects a real failure mode of some identifiable element within the system rather than corresponding to some artificial notion of failure of a tiny fragment of code. The failure modes at the “bottom” of FMECA worksheets should be the same as those that represent leaf nodes in corresponding fault trees. The FTAs then show how combinations of these events can lead to hazards. FMECAs show the effects of single-failure modes; thus, the approaches show complementary aspects of the consequences of failure modes. If a single-failure mode could lead to a hazard, then the fault tree and FMECA chart would be equivalent for that failure mode.

Thus, there are links between FTAs and FMECA, but there is a more fundamental relationship: we believe that FTA and FMECA are both abstractions of the same underlying causal model of the propagation of failures (cause and effect) through a system. These causal possibilities are properties of the design, but they do not necessarily correspond to the logical, physical, or functional form of the system. Traditionally, FTAs and FMECAs are developed in an intuitive fashion by experienced safety engineers. We propose a systematic model of failure propagation which can be used to reinforce or replace this intuitionistic reasoning. This is the domain of the FPTN discussed below.

5. FPTN

FPTN is a simple graphical method for expressing the failure behavior of systems with complex internal structures. It allows us to address many of the limitations inherent in both FTA and FMECA. In particular, our aims in creating FPTN were to provide a simple, clean notation which reflected both system architecture and the way in which failures within the system interact.

Since failure propagation is arguably a form of (abnormal) data flow, FPTN was initially intended to resemble data flow-based methods such as CORE and Mascot. It is a modular and hierarchical notation that allows decomposition based on system architecture but replaces the conventional concept of data flow by failure propagation between modules. Since we are concerned with all possible interactions between failure modes, FPTN diagrams are potentially infinite. In practice, though, we limit the modeling to failure modes and combinations thereof that are credible, i.e., that can plausibly occur during the system lifetime. We return to such methodological issues at the end of the section.

We use the term transformation of failures because a design can change the nature of a failure. Recovery mechanisms or other defensive programming strategies may prevent a failure from propagating at all. More subtly, a protection mechanism may, for example, detect a timeout and cause a function to return an approximate value instead of an exact one. Thus, a value domain failure may result from an earlier time domain failure. The notation requires that such transformations be illustrated.

A software module in FPTN is represented by a simple box with a set of input and output failure modes. Inside the box we list a set of predicates describing the relationship between the input and output failure modes of the module; in fact, these predicates correspond to the sum-of-products form of the minimal cutsets of the fault trees (a minimal cutset being any set of conditions necessary and sufficient to cause the loss event described at the top of the tree) for each output failure mode. In effect, the FPTN box contains a forest of fault trees laid on their sides.

We also provide simple representations for failure modes which can arise inside a module and for exceptions handled by modules. There is also a facility for attaching a criticality to a module which is not yet used in the current FPTN implementation. Some elements of the notation are depicted in Figure 7.

It is our intention that failure modes in FPTN are typed; failures may be classified into various broad categories, for example:

• timing failures
• value failures
• failures of commission
• failures of omission

This classification follows that of Ezhilchelvan and Shrivastava [8]. We further subdivide failures into

• internal failures: those due to the module being considered
• external failures: those due to other application modules
• infrastructural failures: those due to the underlying hardware, operating system, or external environment

Thus, for example, deadlock of processes would

lead to an external omission failure; loss of a physical link would be an infrastructural failure. This categorization gives us a basis for analyzing failure modes and determining what detection and recovery mechanisms are needed (an important part of the iterative design process; safety requirements are a driving force).

Figure 7. Elements of FPTN notation. A module box carries the header fields “Exc?”, “Module Name”, and “Crit”, a list of predicates (e.g., A := B | C; D := EF & GH), and its input and output failure modes. Exc? is a flag to indicate whether exception handlers or other recovery mechanisms are present; Crit is an indicator of the relative criticality of the module; shadowing indicates decomposability for compound modules.

FPTN enables us to observe the consequences of, for example, a value error in the inputs to a particular module causing a timing error in one of its outputs (perhaps through excessive iteration when using an input that had not been range checked as a loop counter), and so on. It also allows us to consider common-mode failures in a more tractable fashion than FMECA.

5.1 FPTN in Use

Here we introduce some of the notational ideas in FPTN by referring to a small example.

Suppose we have some hypothetical system S which consists of three communicating software modules, subsystem 1, subsystem 2, and subsystem 3, and we have initial sets of failure hypotheses for all of these modules. How can we relate the failure modes of the three modules to those of the system as a whole, and what would the resulting FPTN diagram look like?

We start by considering the modules in isolation. To reduce our initial hypothetical failure modes to a [...]

[...]ently hierarchical, results from FPTN analysis of one level of a system can be “folded away” into black boxes for use at higher levels.

5.2 Subsystem 1

Software FTA of subsystem 1 reveals that it is susceptible to input failure modes A (timing) and B (a value error). Failure mode A alone causes output failure mode Y, a communications error. A and B must both occur to cause failure mode X, a timing error. The subsystem generates no new failure modes unrelated to input failures (this is a rather unrealistic example because most real subsystems would generate some failure modes that would be propagated into the environment), but has an exception handler or recovery which prevents failure mode C alone from being propagated any further. The FPTN representation appears below in Figure 8. Note the typing of failure modes: in general, X:y refers to a failure mode X of type y; single-character abbreviations are used for timing (t), value (v), etc.

FPTN box for Subsystem1 (flagged EH):
Input failure modes: A:t, B:v
Output failure modes: X:t, Y:c
X:t == A:t AND B:v
Y:c == A:t AND NOT B:v
C:t HANDLED BY ErrorProc
realistic set, techniques such as FTA or FMECA can
be used. We consider the subsystems in turn and
integrate the FPTN views of them into a coherent
view of the system at a whole. Since FPTN is inher- Figure 8. Subsystem 1.
==
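As a reading aid, the Figure 8 equations can be evaluated as ordinary Boolean functions. The Python sketch below is an illustration only and is not part of the SSAP tool set; the function name subsystem1 and the dictionary encoding are hypothetical. Each argument states whether an input failure mode is present, and C:t drives no output because it is handled by the exception handler.

```python
from itertools import product


def subsystem1(a_t: bool, b_v: bool, c_t: bool) -> dict:
    """Hypothetical encoding of the Figure 8 predicates for subsystem 1."""
    return {
        "X:t": a_t and b_v,      # X:t == A:t AND B:v
        "Y:c": a_t and not b_v,  # Y:c == A:t AND NOT B:v
        # C:t is HANDLED BY ErrorProc, so it drives no output failure mode.
    }


# Enumerate every combination of input failure modes (a small truth table).
for a_t, b_v, c_t in product([False, True], repeat=3):
    active = [name for name, on in subsystem1(a_t, b_v, c_t).items() if on]
    print(f"A:t={a_t!s:5} B:v={b_v!s:5} C:t={c_t!s:5} -> {active or 'no output failure'}")
```

The table makes the prose reading explicit: A alone yields Y:c, A together with B yields X:t, and C never appears on the output side.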

5.3 Subsystem 2

FTA reveals that subsystem 2 is susceptible to failure modes A (timing) and Y (communications). Failure mode W occurs if A or Y occurs; in addition, a new failure mode Z (a timing error) is generated by an internal value error inside the subsystem. The FPTN representation appears in Figure 9.

5.4 Subsystem 3

Analysis of subsystem 3 shows that it is vulnerable to failure modes B, X, and Z, all of which we have met before. It generates the failure modes V (a value error) and A (a previously encountered timing error). This gives rise to the FPTN diagram shown in Figure 10.

[Figure 10. Subsystem 3. Module S.Subsystem3; output failure modes V:v and A:t. Predicates:
    V:v == B:v AND X:t
    A:t == Z:t AND (B:v OR X:t)]

5.5 Combining the Subsystems

We now take the three modules and combine them into a single diagram by connecting corresponding failure modes (Figure 11). Those failure modes propagating out of the system, or those generated externally and affecting it, can be thought of as the output and input failure modes of the system as a whole, respectively. We can simplify this view of the system and replace it with another FPTN diagram (Figure 12), which could be used as a component in a higher level analysis.

5.6 Commentary on FPTN

We commented above on the inadequate links between FMECA and FTA. Our FPTN is a generalization of both techniques: FMECA and FTA can both be regarded as special cases of (abstractions from) FPTN. However, in some senses we have merely moved the problems we alluded to above, not solved them: we still have to construct the FPTN diagrams (although we have reduced the number of judgemental decisions). There is some hope that we can do better.

The fault classification gives us a basis for a systematic analysis of the failure behavior of the system. As a minimal improvement, it seems feasible that this will yield a rigorous manual procedure. We are aware of a project (IFME) carried out at the Turing Institute in collaboration with British Aerospace which has partially automated the generation of FMECAs from design descriptions [9]; we believe it may be possible to link these ideas to generate FPTN diagrams from design data. We have yet to evaluate this possibility.

6. METHODOLOGICAL INTEGRATION

How can we use FPTN to structure a safety analysis of a software system? We envisage the use of FPTN in the early stages of a safety analysis as a simple architectural modeling tool: the system is decomposed into a series of FPTN modules reflecting the operational structure of the software. We then use hypothetical failure modes, which we believe exist in the system, and tentatively connect the modules that we believe cause and are affected by those failures. We will usually have some idea of the basic failure properties of a module if we know its function; for example, a procedure carrying out some calculation in a hard real-time system might return an imprecise result if forced to relinquish a timeslice, or (in a badly designed system) cause a timing error by overrunning its timeslice. Similarly, we might expect data acquisition routines, communications software, and so on to display characteristic patterns of failure modes which we can use as initial failure hypotheses.
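Connecting modules through their failure modes, as in the combination step of Section 5.5, can be mimicked mechanically. The Python sketch below is an illustration under stated assumptions, not the SSAP implementation. The names MODULE_INTERFACES, system_interface, and propagate are hypothetical. The rule that a failure mode is a system-level input when no module produces it, and a system-level output when no module consumes it, is a simplifying assumption. The predicates are transcribed from the subsystem descriptions in Sections 5.2 to 5.4, and the type suffix on W is taken from the Figure 12 label rather than from the text.

```python
# Each module maps to (input failure modes, output failure modes).
MODULE_INTERFACES = {
    "Subsystem1": ({"A:t", "B:v", "C:t"}, {"X:t", "Y:c"}),
    "Subsystem2": ({"A:t", "Y:c"}, {"W:c", "Z:t"}),
    "Subsystem3": ({"B:v", "X:t", "Z:t"}, {"V:v", "A:t"}),
}


def system_interface(interfaces):
    """Assumed rule: unproduced inputs are external; unconsumed outputs escape."""
    consumed = set().union(*(ins for ins, _ in interfaces.values()))
    produced = set().union(*(outs for _, outs in interfaces.values()))
    return consumed - produced, produced - consumed


def propagate(external, internal_fault_in_2=False):
    """Iterate the module predicates to a fixed point; return active failure modes."""
    all_modes = set().union(*(ins | outs for ins, outs in MODULE_INTERFACES.values()))
    state = {mode: mode in external for mode in all_modes}
    while True:
        new = dict(state)
        # Subsystem 1 (C:t is handled internally and drives nothing).
        new["X:t"] = state["A:t"] and state["B:v"]
        new["Y:c"] = state["A:t"] and not state["B:v"]
        # Subsystem 2 (Z:t arises from an internal value error).
        new["W:c"] = state["A:t"] or state["Y:c"]
        new["Z:t"] = internal_fault_in_2
        # Subsystem 3.
        new["V:v"] = state["B:v"] and state["X:t"]
        new["A:t"] = state["Z:t"] and (state["B:v"] or state["X:t"])
        if new == state:
            return {mode for mode, on in state.items() if on}
        state = new


inputs, outputs = system_interface(MODULE_INTERFACES)
print("system inputs: ", sorted(inputs))
print("system outputs:", sorted(outputs))
print("active modes:  ", sorted(propagate({"B:v"}, internal_fault_in_2=True)))
```

Under these assumptions the derived system-level interface has inputs B:v and C:t and outputs V:v and W:c. The feedback of A:t from subsystem 3 into subsystems 1 and 2 is why the sketch iterates rather than evaluating the modules once.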
[Figure 9. Subsystem 2.]

[Figure 11. FPTN diagram of system S, showing the subsystem modules connected by their corresponding failure modes; B:v and C:t are among the labelled failure modes.]

We then consider each module in turn, starting from the lowest level and progressing up the hierarchy. Note that this does not mean abandoning the top-down FTA method: inside the modules the trees will be developed according to the hierarchical approach sketched above. For each module, we construct a software fault tree and plug in each “output” failure mode we believe affects the module as a potential top event. Note that these are not necessarily hazardous failure modes; they may be handled at other levels of the system. Standard minimal cutset analysis is then carried out. We may also carry out more “semantic” analysis of a software fault tree. For example, a particular branch of a tree might represent a code fragment such as the following:

    type A_TYPE is new integer range 1..100;
    X : A_TYPE;
    ...
    if X > 200
    then
        (statements)
    else
        (other statements)
    end if;

Assume that a hazardous condition arises when X > 200 at the end of the block under consideration. A fault tree would be constructed with an AND gate with two main branches, one corresponding to the THEN part and the other to the ELSE branch.¹ It is clear that the THEN branch of the statement can never be executed; a fault tree system with sufficient links to the original semantics of the program (in this case, a system able to recognize the constraints on the types) could recognize this and prune the THEN branch from the Leveson-style representation of the code fragment.

From such analysis we can determine

• whether the top event (and therefore the hypothetical failure mode) occurs given the set of input failure modes
• which of the input failure modes are responsible for the generation of the output failure.

Therefore, we can modify the set of inputs and outputs connected to the module after this analysis by removing those inputs and outputs that, after analysis, clearly do not form part of the failure behavior of the module and by rewriting the sum-of-products form of the minimal cutsets giving rise to each output failure mode as the FPTN equational descriptions of the module. The eliminated failure modes would be retained in FMECA tables; they are failure modes with no effect.
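The rewriting of minimal cutsets into equational form can be sketched in a few lines. The Python fragment below is illustrative only and is not taken from the SSAP tools. The function names minimise, fptn_equation, and ineffective_inputs are hypothetical, cutsets are assumed to contain only positive failure modes (no NOT terms), and the example cutsets are invented for the demonstration.

```python
def minimise(cutsets):
    """Keep only minimal cutsets: drop any set that strictly contains another."""
    sets = {frozenset(c) for c in cutsets}
    return [c for c in sets if not any(other < c for other in sets)]


def fptn_equation(output, cutsets):
    """Render the minimal cutsets of one output as a sum-of-products equation."""
    minimal = sorted(minimise(cutsets), key=lambda c: (len(c), sorted(c)))
    if not minimal:
        return f"{output} == FALSE"  # assumed convention: output cannot occur
    products = [" AND ".join(sorted(c)) for c in minimal]
    if len(products) > 1:
        # Parenthesise multi-term products so the OR structure stays readable.
        products = [f"({p})" if " AND " in p else p for p in products]
    return f"{output} == " + " OR ".join(products)


def ineffective_inputs(hypothesised, cutsets_by_output):
    """Hypothesised input failure modes that appear in no minimal cutset."""
    used = set()
    for cutsets in cutsets_by_output.values():
        for cutset in minimise(cutsets):
            used |= cutset
    return set(hypothesised) - used


# Invented example data: cutsets per output failure mode.
cutsets_by_output = {
    "X:t": [{"A:t", "B:v"}, {"A:t", "B:v", "C:t"}],
    "W:c": [{"A:t"}, {"Y:c"}],
}
for output, cutsets in cutsets_by_output.items():
    print(fptn_equation(output, cutsets))
print(sorted(ineffective_inputs({"A:t", "B:v", "C:t", "Y:c"}, cutsets_by_output)))
```

In this invented data set C:t appears only in a non-minimal cutset, so it is reported as an input with no effect: the kind of failure mode that would be dropped from the module interface and retained in the FMECA tables.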
Thus, FPTN acts as both a causal model and a representation of the system architecture. The failure modes identified as outputs of a software system can be easily integrated into systems-level FMECA. They can be viewed as corresponding to failure modes of the processing node on which they are running.

¹ We ignore the potential for failure modes caused by the evaluation of the expression in the IF statement in this example.

[Figure 12. Abstract FPTN view of system S: a single module, System S, with input failure modes B:v and C:t and output failure modes V:v and W:c.]

7. IMPLEMENTATION ISSUES

The original design for the SSAP software was a purely object-oriented system using a persistent store and an object-oriented user interface management system. Data within the system were to be represented in a flexible and extensible class hierarchy, with the various analyses embedded in the classes as methods. We originally intended to build a “monolithic” integrated application which would be extended by adding new classes and methods into the system. This approach has had to be modified slightly: we were obliged to split the functionality of the system into distinct front and back ends, which at present run on different machines connected via our local area network. Our current architecture retains the persistent store (implemented on top of the ONTOS object-oriented data base), which contains instances of the classes defined in the hierarchy. All analyses are implemented in this back end, which runs on an IBM RS/6000 under AIX.

The original display and editing tool was custom built with the InterViews 3.0b tool kit [10] and ran on a Sun-3 workstation. This application, although user friendly and quite simple to use, was slow and inefficient. We abandoned it in favor of a set of simple interfaces built on a locally enhanced version of the Xsim tool from the University of Washington; Xsim is a generic application running under the X window system; it can be tailored to provide a user interface for any software tool that requires input of graphs. We have constructed editors for fault trees and for the FPTN using Xsim; these generate output in Xsim’s internal format. A translation program converts this into a simple language we term Common Safety Interchange Format (CSIF). CSIF output from the tool is then sent over TCP/IP to the data base/back end running on the IBM machine. Results of the analyses are transmitted in CSIF from the IBM to the Sun; a CSIF-to-Xsim translator then generates a new Xsim file and invokes the display tool. There is little distinction between the user’s input and the system’s output generated from it. Indeed, the user is free to modify system output and store it in the data base, thus enabling “what-if” exercises to be carried out.

The combination of Xsim, CSIF, and the class hierarchy we have implemented with ONTOS enables us to create tools for new notations very quickly. For example, our initial prototype editor for the FPTN notation took only a couple of days to build and integrate into the system, including the addition of some necessary basic data base support. Thus, we are free to concentrate on development of analysis methods rather than implementation details; the lower (infrastructural) levels of the class hierarchy provide most of the basic functionality.

We have also constructed a system that transforms annotated programs in an Ada subset (we are aiming at full compatibility with SPARK, an Ada subset for safety critical applications) into CSIF descriptions of their corresponding fault trees; an approach combining HFTA as a higher level structuring technique and Leveson-style templates at the lowest level has been adopted. The Leif syntax-directed editing and parsing tool kit [11] has been used in conjunction with GNU Emacs to provide a user interface with the Ada-specific and parsing functions of the system; the textual representation of the Leif parse tree is then processed to form a CSIF representation of a fault tree, which can be stored in the data base.

8. INTEGRATION OF SSAP WITH OTHER TOOLS

SSAP does not exist in isolation. Many research projects at the University of York are considering other interesting areas in the development of high-integrity software and are producing their own tools to assist in this work. These projects are complementary and there are significant synergistic benefits to be offered if we can achieve a level of integration between these tools.

Perhaps the project most closely related to SSAP is ASAM, which is concerned with the production of a series of prototype safety argument managers and a methodology for using them. A safety argument manager provides a framework for the generation of safety cases (as mentioned in our introduction). We believe that there is obvious scope for communication between SSAP and the SAMs produced by ASAM. ASAM is based on methods of argument representation developed in the 1950s by the English philosopher Stephen Toulmin [12]. It is a long-term goal to be able to use data derived from SSAP analyses as data, backing, or warrant (supporting material) in Toulmin-style arguments.

It may also prove interesting to link SSAP into the CADiZ tool set (a typechecker, browser, and (emerging) theorem prover for the Z notation). Predicates derived from FPTN diagrams or fault trees could be mechanically translated into Z and manipulated inside CADiZ and perhaps its theorem-proving capabilities could be used to simplify expressions and, therefore, the fault trees from which they are derived. Links to other verification systems such as HOL have also been considered. As we have mentioned, SFTA displays many of the same semantic properties as wp calculus; the use of SFTA as a user-friendly front end to traditionally difficult tasks would appear to be beneficial.

We have also briefly considered the possibility of linking SSAP to the tool set under construction by the Real Time Systems research group, thus enabling us to integrate failure analysis into the complex tasking problems in the latest generation of hard real-time systems.

9. CONCLUSION

We believe that SSAP tools and methods are evolving to a stage where successful integration at all three levels previously identified can be achieved. At the semantic level, we believe that the common causal model shared by FTA, FMECA, and FPTN allows data to be shared between these notations. At the procedural level, we have suggested how FPTN can be used to mirror the design process and how it links to FTA on software components. Finally, at the operational level, we have demonstrated that it is possible to construct a tool set that allows data to be shared among a range of notations and that has a sufficiently open architecture to facilitate extensibility and interoperability with other systems.

The SSAP project has identified several interesting areas of research into the development of safety-critical systems and has looked closely at the analysis of software failures. We have developed several new techniques and extensions to FTA; in particular, we have developed a notation (FPTN) that enables us to bridge the semantic gap between FTA and FMECA.

We have designed an open architecture that facilitates easy integration of new tools and methods into the SSAP system and, perhaps more significantly, should allow us to integrate SSAP with other tools being built at the University of York, for example, the ASAM safety argument manager, the CADiZ specification tool, and the STRESS real-time simulator.

The SSAP approach, which emphasizes pragmatism and simplicity in both methods and tools, seems to provide a sound framework for further research and a useful platform for software safety analysis in its own right.

ACKNOWLEDGMENTS

This research was funded by British Aerospace Defence (Military Aircraft Division), Warton, Lancs, U.K. We thank the Software Technology department at Warton for its support and technical input. In particular, discussions with Julian Johnson, Brian Jepson, and Lynn Spencer have influenced our thinking. In the Airworthiness group, Sandy Drysdale’s comments from the perspective of an engineer involved in the safety assessment of real systems have been invaluable. At York, Chris Higgins (ASAM), Ian Toyn (CADiZ), and Ken Tindell and Mike Richardson of the Real Time Systems group have also provided many interesting insights into the potential for linking SSAP with other tools. Andy Vickers’ work on enhancing Xsim has also contributed much to our work.

REFERENCES

1. S. J. Clarke and J. A. McDermid, Software Fault Trees and Weakest Preconditions: A Comparison and Analysis, in press.
2. P. O. Chelson, Reliability Computation Using Fault Trees, Technical Report, NASA-CR-124740, NASA Jet Propulsion Laboratory, 1971.
3. W. E. Veseley, Fault Tree Handbook, Division of the System Safety Office of Nuclear Reactor Regulation, US Nuclear Regulatory Commission, Washington, DC, 1981.
4. N. G. Leveson and P. R. Harvey, Software Fault Tree Analysis, J. Syst. Software 3, 173-181 (1983).
5. S. S. Cha, N. G. Leveson, and T. J. Shimeall, Safety verification in Murphy using fault tree analysis, in Software Risk Management (B. W. Boehm, ed.), IEEE, 1989.
6. D. Raheja, Software system failure mode and effects analysis (SSFMEA): A tool for reliability growth, in Proceedings of the International Symposium on Reliability and Maintainability, 1990, pp. IX-1 to IX-7.
7. Design Analysis Procedure For Failure Modes, Effects and Criticality Analysis (FMECA), Aerospace Recommended Practice (ARP) 926, Society of Automotive Engineers, Detroit, Michigan, 1967.
8. P. D. Ezhilchelvan and S. K. Shrivastava, A Characterisation of Faults in Systems, Technical Report, University of Newcastle upon Tyne, 1985.
9. J. Murdoch, D. Pearce, and G. Ward, Logic modelling of dependable systems, in Proceedings of the IFAC Symposium On Safety Of Computer Control Systems, 1992.
10. M. A. Linton, P. R. Calder, and J. M. Vlissides, The design and implementation of InterViews, in Proceedings of the USENIX C++ Conference, 1987.
11. W. W. Smith, R. Campbell, and D. LaLiberte, User Manual For Leif With GNU Emacs, University of Illinois, 1988.
12. S. Toulmin, The Uses Of Argument, Cambridge University Press, 1958.
