Defining Software Faults: Why It Matters
Abstract— Because faults are at the center of software quality concerns, they ought to be defined formally, by semantics-based criteria that enable us to reason about them. In this paper, we consider a semantics-based definition of a fault, which involves the program, the faulty feature (at the appropriate level of granularity) and the specification against which correctness and incorrectness are defined. We explore the implications of this definition for various aspects of software testing, software reliability, and software repair; and we argue that providing a formal, verifiable definition of faults is not a mere intellectual exercise, but has important practical applications.

Keywords— fault, fault removal, relative correctness, correctness enhancement, software testing, software reliability, software repair.

I. INTRODUCTION

In (Avizienis, Laprie, Randell, & Landwehr, 2004) Avizienis et al. define a fault as the adjudged or hypothesized cause of an error; an error, in turn, is defined as a deviation of the system state from the correct state. The IEEE Standard IEEE Std 7-4.3.2-2003 defines a software fault as an incorrect step, process or data definition in a computer program. Whereas these definitions fulfill their purpose as part of a broader ontology, we argue that they do little to support the engineering processes of identifying, inventorying, diagnosing, and removing faults in a program. Indeed, the definition of Avizienis et al. relies on adjudging and hypothesizing, two highly subjective criteria, and assumes, through its definition of an error, that we have means to judge the correctness of arbitrary system states (vs. initial or final states). Also, the IEEE definition is vague as to what constitutes a correct or an incorrect step, and who gets to decide what is or is not correct.

In this paper, we consider a semantics-based definition of a software fault, and discuss, through analytical and empirical arguments, how this definition can be deployed to support the engineering processes that we use routinely to deal with software faults. Specifically, the definition we use has the following attributes:
- It is based on an implicit level of granularity of the source code, which is determined according to the level of precision at which we want to localize faults; we use the term feature to refer to a segment of source code at the chosen level of granularity (e.g. statement, expression, variable reference, lexical token, etc.). For the sake of generality, we admit that a feature need not be contiguous, hence may involve two or more lexemes at different locations in the program.
- Our definition of a fault f in a program P with respect to specification R involves nothing other than f, P and R; and it is totally formal, modulo the definition of a feature. It does not involve any subjective value judgement (adjudging or hypothesizing), nor does it require that we know the expected state of the program at intermediate steps in its execution.

We have found, and we argue in this paper, that a formal definition of program faults enables us to achieve a number of capabilities, with concrete practical implications:
- A formal characterization of fault removal. Given a fault f in a program P, we formulate the condition under which a substitute f′ of f constitutes a certifiable fault removal.
- A distinction between a single multi-site fault and multiple single-site faults. This distinction is important in practice because it enables us to control combinatorial explosion when we attempt to generate patches: the only time we ever need to combine patches is when we are looking for a single multi-site fault; otherwise, we ought to remove faults one at a time, to avoid combinatorial explosion.
- A definition of a unitary increment of correctness enhancement. We introduce the concept of elementary fault removal, which represents an atomic, minimal program transformation that enhances the correctness of a program, for a given level of granularity.
- Insights into oracle design. The definition of fault removal enables us to design precise test oracles that characterize valid program repairs; we find in this paper that while non-regression is a sufficient condition for correctness enhancement, it is not a necessary condition, so that traditional regression testing is prone to cause a loss of recall.
- The distinction between removing a fault and remedying a failure. There is no one-to-one mapping between faults and failures; the same fault may cause several failures, and the same failure may be traced back to more than one fault. Hence focusing on faults and focusing on failures yield vastly different policies.
- Letting programs dictate the fault removal schedule. Programs do not expose all their faults at once; rather, some faults may have to be removed before others can be discovered; the order in which the faults of a program come to light depends on the test data we use, and on how each discovered fault is fixed. A given observed failure may be due to a combination of faults (some of which may be visible earlier than others), hence cannot be remedied until we have derived the set of patches that simultaneously remove all the relevant faults; this carries a significant risk of combinatorial explosion. By focusing on fault removal rather than failure remediation, we let the program expose its faults in the order it determines, and we remove them in sequence as they appear.
- Separating debugging from testing. Fault removal (aka debugging) is so inconceivable without testing that these two terms are used almost interchangeably. Yet, with a definition of fault removal, we can identify, remove and prove the removal of a fault by static analysis, without recourse to any testing; this is shown in (Ghardallou, Diallo, Frias, & Mili, 2016).
- Fault Density vs Fault Depth. Once we define what a fault is, we discover that there is a difference between the statement "Program P has N faults" and the statement "Program P requires N fault removals". This difference stems from the interdependence between faults; we discuss in this paper why we find that the latter is a more meaningful measure of faultiness than the former.

II. MATHEMATICAL BACKGROUND

Among the relational notations we use are the domain of a relation R (denoted dom(R)) and the pre-restriction of a relation R to a set T (denoted T\R). It is easy to see that the product RL (of relation R by the universal relation L) is: RL = {(s, s′) | s ∈ dom(R)}; we may, when this does not lead to confusion, use RL and dom(R) interchangeably. A relation R is said to be deterministic if and only if R̂R ⊆ I, reflexive if and only if I ⊆ R, symmetric if and only if R = R̂, and transitive if and only if RR ⊆ R. A relation is said to be a partial ordering if and only if it is reflexive, antisymmetric and transitive.

III. RELATIVE CORRECTNESS AND FAULTS

In order to define faults, we need to discuss relative correctness, which in turn requires that we discuss (absolute) correctness.

A. Programs and Specifications
Given a program P on space S, we let the function of program P be the set of pairs (s, s′) such that if execution of P starts in state s then it terminates in state s′. We may, when this does not lead to confusion, use a program and its function interchangeably. Given a space S, we let a specification R on space S be a relation on S.

B. Refinement and Absolute Correctness
Given two relations R and R′, we say that R′ refines R, or that R is refined by R′ (denoted by: R′ ≥ R, or R ≤ R′), if and only if RL ∩ R′L ∩ (R ∪ R′) = R. This relation (between relations) is a partial ordering. Interpretation: R′ refines R if and only if R′ has a larger domain and assigns fewer images than R to each element of the domain of R. See Figure 1.
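For finite relations, the refinement condition RL ∩ R′L ∩ (R ∪ R′) = R can be checked mechanically. The following Python sketch is ours, for illustration only (the function names and the toy relations are assumptions, not material from the paper); it models relations as sets of pairs over a small space:

# Illustrative sketch: relations over a finite space, modeled as sets of pairs.
def dom(rel):
    return {s for (s, _) in rel}

def refines(r_prime, r, space):
    # Check RL ∩ R'L ∩ (R ∪ R') = R, i.e. r_prime refines r.
    L = {(s, t) for s in space for t in space}            # universal relation
    RL = {(s, t) for (s, t) in L if s in dom(r)}          # R L
    RpL = {(s, t) for (s, t) in L if s in dom(r_prime)}   # R' L
    return (RL & RpL & (r | r_prime)) == r

space = {0, 1, 2}
R = {(0, 1)}                    # defined on 0 only
Rp = {(0, 1), (1, 2), (2, 2)}   # larger domain, same image on 0
print(refines(Rp, R, space))    # True: Rp refines R
print(refines(R, Rp, space))    # False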
Figure 2: P′ ≥_R P.
Figure 4: Reliability of P with respect to R and θ.
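The formal definitions of absolute and relative correctness are elided from this excerpt, but the sequel uses them through competence domains: the competence domain of P with respect to R is dom(R ∩ P); P is absolutely correct when that domain equals dom(R), and P′ is at least as correct as P when its competence domain contains that of P. A minimal Python sketch of these checks for a finite space (our illustration; the names and data are hypothetical):

# Deterministic programs modeled as dicts (input -> output), specifications as sets of pairs.
def dom(rel):
    return {s for (s, _) in rel}

def competence_domain(R, P):
    # dom(R ∩ P): the inputs on which P satisfies R
    return {s for (s, t) in R if s in P and P[s] == t}

def absolutely_correct(R, P):
    return competence_domain(R, P) == dom(R)

def at_least_as_correct(R, P_prime, P):
    # P' >=_R P: the competence domain of P' contains that of P
    return competence_domain(R, P) <= competence_domain(R, P_prime)

R = {(0, 10), (0, 11), (1, 20)}   # the specification may allow several outputs per input
P = {0: 10, 1: 99, 2: 2}          # correct on input 0 only
Pp = {0: 11, 1: 20, 2: 5}         # correct on inputs 0 and 1
print(at_least_as_correct(R, Pp, P), absolutely_correct(R, Pp))   # True True

Note that P and Pp choose different (both acceptable) outputs on input 0: correct behavior is not unique when the specification is non-deterministic.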
How do we know whether our definition is any good? We check it against some properties that we expect a definition of relative correctness to satisfy.

Relative Correctness is reflexive and transitive but not antisymmetric. It is clear why we want relative correctness to be reflexive and transitive. Figure 3 shows why we do not want it to be antisymmetric: we want to allow two programs to be equally correct yet distinct.

Relative Correctness culminates in absolute correctness. Of course, we want an (absolutely) correct program P′ to be more-correct than (or as correct as) any candidate program. According to Mills' Proposition (previous section), if P′ is (absolutely) correct with respect to R, then dom(R ∩ P′) = dom(R); on the other hand, dom(R) is an upper bound of dom(R ∩ P) for any candidate program P. QED.

Relative Correctness and Reliability. The reliability of a program P is defined by means of two parameters: a specification R and a probability distribution θ on the domain of R; the reliability of P with respect to R and θ, written R_θ(P), is the probability that P behaves according to R on an initial state drawn according to θ (Figure 4). We infer from this definition that for any probability distribution θ, if P′ is more-correct than P with respect to R, then P′ is also more reliable than P with respect to R and θ. More interestingly, we find that if P′ is more reliable than P with respect to R for any probability distribution θ, then P′ is more-correct than P with respect to R. To prove this, we assume that P′ is not more-correct than P, and we find a probability distribution θ on dom(R) for which P is more reliable than P′. If P′ is not more-correct than P, then there exists an element s0 of the competence domain of P that is not in the competence domain of P′. We let θ(s0) = 1 and θ(s) = 0 for all s ≠ s0, and we find that R_θ(P) = 1 and R_θ(P′) = 0, whence P is more reliable than P′. QED.

(P′ ≥_R P) ⇔ (∀θ : R_θ(P′) ≥ R_θ(P)).

This equation links relative correctness with reliability: to be more-correct means to be more-reliable with respect to any probability distribution of inputs.
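For a finite space, the reliability R_θ(P) is simply the θ-weighted measure of the competence domain, which makes the two claims above easy to experiment with. The Python sketch below is ours (the programs, specification and distributions are hypothetical): it exhibits a program that is more reliable than P under one distribution without being more-correct, and shows how a distribution concentrated on a single state reverses the comparison, as in the proof above.

# Reliability as the theta-weighted measure of dom(R ∩ P), for a finite space.
def competence_domain(R, P):
    return {s for (s, t) in R if s in P and P[s] == t}

def reliability(R, P, theta):
    # R_theta(P): probability that P behaves according to R on a state drawn from theta
    return sum(theta.get(s, 0.0) for s in competence_domain(R, P))

R = {(0, 1), (1, 2), (2, 3)}
P = {0: 1, 1: 0, 2: 0}      # competence domain {0}
Pp = {0: 1, 1: 2, 2: 0}     # competence domain {0, 1}: more-correct than P
Pq = {0: 0, 1: 2, 2: 3}     # competence domain {1, 2}: NOT more-correct than P

uniform = {0: 1/3, 1: 1/3, 2: 1/3}
print(reliability(R, Pp, uniform) >= reliability(R, P, uniform))   # True (holds for any theta)
print(reliability(R, Pq, uniform) > reliability(R, P, uniform))    # True under this theta...
theta0 = {0: 1.0}   # ...but all the mass on state 0 makes P more reliable than Pq
print(reliability(R, P, theta0) > reliability(R, Pq, theta0))      # True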
Relative Correctness and Refinement. If program P′ refines program P, it means that whatever P does, P′ can do better. We would expect this to imply that P′ refines P if and only if P′ is more-correct than P with respect to any specification. The proof of necessity is trivial: if P′ refines P, then P′ is a superset of P, hence it has a larger competence domain with respect to any specification, by monotonicity. The proof of sufficiency relies on a lemma to the effect that if two functions F and F′ are such that F ⊆ F′ and dom(F′) ⊆ dom(F), then F = F′. If P′ is more-correct than P for all R, then it is more-correct than P with respect to P, which can be written dom(P ∩ P′) ⊇ dom(P). This, in conjunction with the set-theoretic identity P ∩ P′ ⊆ P, yields P′ ∩ P = P, from which we infer P ⊆ P′. QED.

Whence we write:

(P′ ≥ P) ⇔ (∀R : P′ ≥_R P).

Figure 5 shows relative correctness as an intermediate property between being more-reliable (when we quantify with respect to θ) and being more-refined (when we quantify with respect to R). For the sake of completeness, we also show the quantifications in reverse order (R then θ).

Figure 5: Reliability, Relative Correctness, Refinement (relating P′ ≥ P, P′ ≥_R P and R_θ(P′) ≥ R_θ(P) through quantification over R and θ, in both orders).

As we noted in the introduction, non-regression is a sufficient condition of correctness enhancement, but it is not a necessary condition of fault removal; the use of an unnecessary condition causes a loss of recall (missing programs that are actually more-correct). Whereas regression testing checks the condition

s ∈ dom(R ∩ P) ⇒ P′(s) = P(s),

relative correctness mandates the (weaker) condition:

s ∈ dom(R ∩ P) ⇒ s ∈ dom(R ∩ P′).

Another issue that our definition highlights is the use of fitness functions in program repair (LeGoues, Dewey-Voigt, Forrest, & Weimer, 2012). Fitness functions are usually computed as the sum of weights associated to the test data on which the candidate repair runs successfully, where weights are assigned to test data according to their preponderance in some usage pattern; as such, the fitness function is an approximation of the candidate program's reliability. Yet, we saw in the previous section that for a given probability distribution θ, enhanced reliability is a necessary but not a sufficient condition of relative correctness; hence the use of fitness functions may cause a loss of precision (retrieving candidate repairs that are not actually more-correct than the original).

IV. IMPLICATIONS AND APPLICATIONS

A. Elementary Faults
We consider a program P and we let f1 and f2 be two features in the source code of P that admit substitutes, say f1′ and f2′, such that the program P′ obtained from P by replacing f1 by f1′ and f2 by f2′ is strictly more-correct than P with respect to some specification R. According to our definition of a fault, (f1, f2) is a fault in P with respect to R. The question we wish to ponder is whether we are looking at a single two-site fault (f1, f2) or two single-site faults (f1 and f2). The answer to this question depends, of course, on whether f1 alone is a fault and whether f2 alone is a fault. If neither f1 nor f2 is a fault, but (f1, f2) is a fault, then we say that (f1, f2) is an elementary fault; we also designate as an elementary fault any single-site fault.
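For small, enumerable spaces this classification can be carried out mechanically by comparing competence domains: the pair of substitutions is a fault removal if it strictly enlarges dom(R ∩ P), and it is a single elementary two-site fault if, in addition, neither substitution enlarges it on its own. The following Python sketch is ours (the helper names and the toy programs are hypothetical, not taken from the paper):

def competence_domain(R, P):
    return {s for (s, t) in R if s in P and P[s] == t}

def strictly_more_correct(R, P_new, P):
    # strict inclusion of competence domains
    return competence_domain(R, P) < competence_domain(R, P_new)

def classify(R, P, P1, P2, P12):
    # P1, P2: programs with one substitution each; P12: both substitutions applied to P
    if not strictly_more_correct(R, P12, P):
        return "the pair of substitutions is not a fault removal"
    alone1 = strictly_more_correct(R, P1, P)
    alone2 = strictly_more_correct(R, P2, P)
    if alone1 and alone2:
        return "two single-site faults"
    if not (alone1 or alone2):
        return "one elementary two-site fault"
    return "one of the substitutions is a fault removal by itself"

# Hypothetical example on the space {0, 1, 2} with the identity specification:
R = {(s, s) for s in range(3)}
P = {0: 0, 1: 0, 2: 1}      # competence domain {0}
P1 = {0: 1, 1: 1, 2: 1}     # competence domain {1}: not a superset of {0}
P2 = {0: 2, 1: 0, 2: 2}     # competence domain {2}: not a superset of {0}
P12 = {0: 0, 1: 1, 2: 2}    # competence domain {0, 1, 2}
print(classify(R, P, P1, P2, P12))   # one elementary two-site fault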
To illustrate this definition, consider the array sum program

P: {x=0; k=0; while (k!=N) {x=x+a[k]; k=k+1}}

and a specification R that mandates storing in x the sum a[1] + ... + a[N]; the competence domain of P with respect to R is CD = {s | a[0] = a[N]}. Indeed, if the specification mandates to compute the sum from 1 to N and the program mistakenly computes the sum from 0 to N-1, then the program behaves according to the specification for those arrays that satisfy the condition a[0] = a[N]. One way to fix this program is to change (k=0) into (k=1) and (k!=N) into (k!=N+1). This raises the question: are we dealing with a single two-site fault or two single-site faults? To answer this question, we consider programs P0, where we make the first substitution, and P1, where we make the second substitution:

P0: {x=0; k=1; while (k!=N) {x=x+a[k]; k=k+1}}.
P1: {x=0; k=0; while (k!=N+1) {x=x+a[k]; k=k+1}}.

By computing the functions of these programs and then their competence domains the same way we did above for P, we find:

CD0 = {s | a[N] = 0}.
CD1 = {s | a[0] = 0}.

Because there is no inclusion relation between CD and CD0, nor between CD and CD1, neither the substitution of (k=0) into (k=1) nor the substitution of (k!=N) into (k!=N+1) is a fault removal. But performing both substitutions simultaneously yields

P′: {x=0; k=1; while (k!=N+1) {x=x+a[k]; k=k+1}},

which is correct with respect to R; hence the aggregate (k=0, k!=N) is a single two-site elementary fault.

By contrast, consider the following specification and program on a space made up of an array a[0..N] and an index variable k:

R = {(s, s′) | a[0] = a′[0] ∧ a′[1..N] = 0},
P: {k=0; while (k!=N) {a[k]=0; k=k+1}}.

The function of program P can be written as:

P = {(s, s′) | k′ = N ∧ a′[0..N-1] = 0 ∧ a′[N] = a[N]}.

Indeed, P assigns N to k, puts 0 in a[] between indices 0 and N-1, and preserves (does not modify) a[N]. Taking the intersection of R and P, we find:

R ∩ P = {(s, s′) | a′[0] = a[0] ∧ a′[N] = a[N] ∧ k′ = N ∧ a′[0..N] = 0}.

Taking the domain of this relation, we find:

CD = {s | a[0] = 0 ∧ a[N] = 0}.

We let P0 and P1 be the programs obtained from P by changing (k=0) into (k=1) and (k!=N) into (k!=N+1), respectively:

P0: {k=1; while (k!=N) {a[k]=0; k=k+1}}.
P1: {k=0; while (k!=N+1) {a[k]=0; k=k+1}}.

We find the following functions of P0 and P1:

P0 = {(s, s′) | k′ = N ∧ a′[0] = a[0] ∧ a′[1..N-1] = 0 ∧ a′[N] = a[N]},
P1 = {(s, s′) | k′ = N+1 ∧ a′[0..N] = 0}.

Taking the intersection with R, we find:

R ∩ P0 = {(s, s′) | k′ = N ∧ a[0] = a′[0] ∧ a[N] = a′[N] ∧ a′[1..N] = 0},
R ∩ P1 = {(s, s′) | k′ = N+1 ∧ a[0] = a′[0] ∧ a′[0..N] = 0}.

From which we derive the competence domains of P0 and P1:

CD0 = dom(R ∩ P0) = {s | a[N] = 0},
CD1 = dom(R ∩ P1) = {s | a[0] = 0}.

By comparing these competence domains against that of P, we find that they are both supersets of CD, hence each individual transformation has improved the correctness of P. If we let P′ be the program obtained from P by performing both transformations, we find that it is correct with respect to R, hence its competence domain is the domain of R, which is all of S. Figure 7 shows the pattern of relative correctness relations that characterizes a situation where we have two single-site faults; we use the abbreviations δ1 and δ2 to represent the two program substitutions.

Figure 7: Relative correctness relations among P, P0, P1 and P′ (two single-site faults, substitutions δ1 and δ2).

B. Fault Density and Fault Depth
We consider the array sum program presented earlier, along with the specification it is supposed to satisfy:

R = {(s, s′) | x′ = a[1] + a[2] + ... + a[N]}.

We let the level of granularity at which we want to analyze faults be the expression; in other words, we restrict our attention to faults that stem from using the wrong expression (in an assignment statement, an array reference, a function call, etc.). Given this restriction, we see two faults in this program, i.e. two features that admit substitutions that would make the program strictly more-correct:
- The fault made up of the aggregate (k=0, k!=N), which we had discussed above; we refer to this fault as f12, and we refer to its corresponding substitution (into (k=1, k!=N+1)) as δ12.
- The fault f3 = (a[k]), which admits a substitution δ3 from (a[k]) to (a[k+1]); indeed substitution δ3 would also produce a correct program, hence enhance correctness.

Even though we have two faults, f12 and f3, we are only one fault removal away from a correct program, because when we apply substitution δ12 to remove fault f12, we find that f3 is no longer a fault; and when we apply substitution δ3 to remove fault f3, we find that f12 is no longer a fault. But if we apply both substitutions, we find a program Q that has two faults, just like P. See Figure 8, where thick black arrows represent relative correctness relations.

Figure 8: Relative correctness relations among P, Q and the programs P′ and P′′ obtained by applying δ12 and δ3.
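The claims made above about δ12, δ3 and Q can be checked by brute-force simulation on small arrays. The Python sketch below is ours, an illustration rather than the mutant-generation machinery discussed in the next section; it encodes the four loop variants directly and compares each one exhaustively against the specification x′ = a[1] + ... + a[N]:

from itertools import product

def spec_ok(a, x):
    # specification: x must equal a[1] + ... + a[N]
    return x is not None and x == sum(a[1:])

def run(a, k0, stop_offset, index_offset):
    # x=0; k=k0; while (k != N+stop_offset) {x = x + a[k+index_offset]; k = k+1}
    N = len(a) - 1
    x, k = 0, k0
    try:
        while k != N + stop_offset:
            x += a[k + index_offset]
            k += 1
    except IndexError:
        return None          # abnormal termination: no final state
    return x

variants = {
    "P   (k=0, k!=N,   a[k])":   (0, 0, 0),
    "P'  (k=1, k!=N+1, a[k])":   (1, 1, 0),   # substitution δ12
    "P'' (k=0, k!=N,   a[k+1])": (0, 0, 1),   # substitution δ3
    "Q   (k=1, k!=N+1, a[k+1])": (1, 1, 1),   # both substitutions
}

N = 3
for name, (k0, so, io) in variants.items():
    correct = all(spec_ok(a, run(a, k0, so, io))
                  for a in product(range(3), repeat=N + 1))
    print(name, "correct" if correct else "incorrect")
# Expected output: P incorrect, P' correct, P'' correct, Q incorrect.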
B. Empirical Observations
Figure 9 shows the graph that results from applying the above algorithm to the replace sample of the Siemens Benchmark. At the conclusion of the first four iterations, the algorithm produces m79.3.42.47 as the only maximal element of the graph; this node is not absolutely correct, and none of its simple mutants turned out to be strictly more-correct. When we deploy double mutation, however, we find two double mutants that are strictly more-correct than it, and they are both absolutely correct. One of them (m79.3.42.47.37.85) is actually the original replace program; the other maximal mutant is different from the original, but is absolutely correct with respect to T\R all the same.

Figure 9: Fault Removal Graph of replace

For the sake of argument, we assume that the mutant generation method used in this experiment is complete, in the sense that if a program P has a fault f and (f, f′) is a fault removal for P, then the generator will produce a mutant of P that has f′ in lieu of f in P; considering the modifications we have entered in replace, and the way we have parameterized the mutant generator, this appears to be a legitimate assumption. Under this assumption, the fault density of each program in this graph is the outgoing degree of the corresponding node. Also, the fault depth of a program is the minimal distance from its node to the top of the graph. We find, from Figure 9, that even though we have applied six modifications to produce P, P has only one fault, since the node of P has a single outgoing arc. One may ask: how can P have only one fault if we seeded six? The answer is that the other faults may be hidden by the first one, and can only be seen once the first one is removed. So in fact the fault density of P is one but its fault depth is five, which again shows that fault depth is a more meaningful measure than fault density. In this graph, fault depth decreases by one with each fault removal, but fault density does not.

C. Analysis and Lessons Learned
When we say that program P has six faults, we implicitly assume that these faults are fixed features of the source code of P, that they are all visible in P, that we can fix any one of the six we choose, that there is a unique way to fix each fault, and that the result would be a program with five faults. If this were the case, then the graph we would find by applying the algorithm of section V.B to the replace program would be the graph shown in Figure 10; note that in this ideal graph the density of each program (node) equals its depth. The vast difference between Figure 9 and Figure 10 reflects the extent to which this vision of faults is unrealistic. For a given test data set T, program P does not expose all its faults at once, and whenever it exposes more than one fault at a time, the choice of which fault to fix first and how to fix it does matter.
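Given a fault removal graph, the two measures used above are straightforward to compute: fault density is the out-degree of a node, and fault depth is the length of a shortest path from the node to a maximal (most-correct) element. The following Python sketch is ours, over a small hypothetical graph (the node names and edges are made up and do not reproduce Figure 9):

from collections import deque

# Edges point from a program to a strictly more-correct mutant obtained by one fault removal.
graph = {
    "P": ["m1"],
    "m1": ["m2", "m3"],
    "m2": ["top"],
    "m3": ["top"],
    "top": [],            # maximal element: no strictly more-correct mutant
}

def fault_density(node):
    return len(graph[node])            # out-degree

def fault_depth(node):
    # shortest number of fault removals needed to reach a maximal element
    seen, queue = {node}, deque([(node, 0)])
    while queue:
        n, d = queue.popleft()
        if not graph[n]:
            return d
        for m in graph[n]:
            if m not in seen:
                seen.add(m)
                queue.append((m, d + 1))

for n in graph:
    print(n, "density =", fault_density(n), "depth =", fault_depth(n))
# For "P": density 1, depth 3; density and depth need not coincide.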
Another observation we can infer from this example is that when we try to repair a program, it is necessary to focus on removing faults rather than remedying failures. When we select a specific failure of the program, defined by an initial state that leads to an execution that violates the specification, we have no way to tell whether the fault that causes it lies low or high in the fault removal graph (Figure 9). If the fault is high, say k arcs away from program P, and each application of the mutant generator produces N mutants, then we need to search a space of size N^k to find an adequate repair. By contrast, if we remove elementary faults one at a time, in the order in which the program exposes them, the search space is never larger than N^m, where m is the multiplicity of the highest-order multi-site fault (m=2 in our example), rather than the fault depth of the program (which is usually unknown and unbounded).

Another important lesson offered by this example is the distinction between fault density and fault depth. If programs behaved as shown in Figure 10, then density and depth would be identical: if we have six faults in a program, it takes six fault removals before we can turn it into a correct program. But in reality, density and depth are unrelated: in Figure 9, P has a fault density of 1 and a fault depth of five; also, as we climb the graph, fault depth decreases by one at each step, but fault density evolves in an unpredictable manner.

Finally, we point out how a purely semantic test, the test of strict relative correctness, was, with the help of adequate mutant generation, able to detect and remove all the faults of the program, one at a time. Note that if P′ is absolutely correct with respect to T\R, this does not mean that P′ is absolutely correct with respect to R, of course, though we can prove under some conditions that P′ is then more-correct than P with respect to R (not merely with respect to T\R), which is usually what we want to achieve in a program repair operation; we do not show the proof, due to lack of space.

VI. CONCLUSION
In this paper we revisit a definition of software faults (given in earlier work), and discuss its impact on routine software engineering processes such as software testing, software quality analysis, and program repair. Our definition of a fault assumes an implicit level of granularity (at which we want to isolate faults) and involves only the faulty feature, the program in which this feature appears, and the specification against which correctness of the program is judged. Our definition of a software fault is based on a formal definition of relative correctness for deterministic programs, which we have validated by analyzing its intrinsic properties and its relation to (traditional) absolute correctness, reliability, and refinement.

Several other authors have introduced and discussed approximations of absolute correctness that may be construed as definitions of relative correctness (Zhao, Littlewood, Povyakalo, Strigini, & Wright, 2016), (Littlewood & Rushby, 2012), (von Essen & Jobstmann, 2013), (Lahiri, McMillan, Sharma, & Hawblitzel, 2013), (Logozzo, Lahiri, Faehndrich, & Blackshear, 2014), (Logozzo & Ball, Modular and Verifiable Program Repair, 2014). Our approach can be characterized by the following distinguishing premises: we model programs as simple input/output mappings (rather than finite state automata); we model specifications as relations (rather than frames of assertions); we model relative correctness as a semantic property between two programs and a specification (rather than an empirical operational property about program executions); and we make provisions for the fact that correct program behavior is not unique (by virtue of the fact that specifications are usually vastly non-deterministic).