Black-Box Testing Using Flowgraphs:
an experimental assessment of effectiveness and automation potential
STEPHEN H. EDWARDS
DEPARTMENT OF COMPUTER SCIENCE, VIRGINIA TECH, 660 MCBRYDE HALL, BLACKSBURG, VA 24061-0106, U.S.A.
Agenda
Definitions
Flowgraphs
Definitions - 1
Idiom / Definition
Object-oriented testing
Adequacy criteria
Formal specification
A formal software specification is a specification expressed in a language whose vocabulary, syntax and semantics are formally defined. This need for a formal definition means that specification languages must be based on mathematical concepts whose properties are well understood.
Definitions - 2
Idiom / Definition
Programming by contract
Interface violation
A violation may not be discovered until system integration, and possibly not until after deployment.
Proposed Solution
Testing to the contract is at the heart of specification-based testing
Template / generic component: parameterized by the item Data Type
Initialize: automatically invoked when an object is declared
Finalize: automatically invoked when the object goes out of scope
Precondition: the requires clause
Post-condition: the ensures clause
#q: the incoming (initial) value of parameter q
<#x>: the String containing just the incoming value of x (the queue's mathematical model is a String)
Three exported operations: Enqueue, Dequeue, Is_Empty (plus automatic Initialize and Finalize)
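To make the template concrete, here is a minimal Python sketch of the queue component's interface with the contract written as comments. The class name, representation and wording of the clauses are illustrative assumptions, not the RESOLVE component itself.

class Queue:
    """FIFO queue of items; mathematical model: a String (sequence) of items."""

    def __init__(self):
        # Initialize: invoked automatically on declaration; ensures q is the empty string
        self.items = []

    def enqueue(self, x):
        # requires: (nothing)        ensures: q = #q * <#x>
        self.items.append(x)

    def dequeue(self):
        # requires: q is not empty   ensures: #q = <x> * q
        return self.items.pop(0)

    def is_empty(self):
        # requires: (nothing)        ensures: result = (q is the empty string), q unchanged
        return len(self.items) == 0

    # Finalize corresponds to the object going out of scope (handled by garbage collection here)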
Flow Graph -1
A use occurs if the incoming value may affect the behavior of the operation
Flow Graph -2
node coverage
branch coverage
definition coverage
use coverage
Unlike program-level testing, where a test case consists of input data for
the program, here a test case corresponds to a sequence of operation
invocations with associated parameter values.
Each edge (say, from v1 to v2) in the flowgraph indicates that there is
some legitimate object lifetime that includes v1 followed by v2; it
specifically does not imply that every lifetime containing v1 followed by
v2 is legitimate.
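As a concrete illustration (the operation names and edges are assumptions based on the queue example, not taken from the paper), a flowgraph can be represented as an adjacency map, and a test case is then simply a path through it from Initialize to Finalize:

flowgraph = {
    "Initialize": ["Enqueue", "Is_Empty", "Finalize"],
    "Enqueue":    ["Enqueue", "Dequeue", "Is_Empty", "Finalize"],
    "Dequeue":    ["Enqueue", "Dequeue", "Is_Empty", "Finalize"],  # Dequeue -> Dequeue is feasible only in some lifetimes
    "Is_Empty":   ["Enqueue", "Dequeue", "Is_Empty", "Finalize"],
    "Finalize":   [],
}

def is_path(seq, graph=flowgraph):
    # True if each consecutive pair of operations follows an edge of the flowgraph
    return all(b in graph[a] for a, b in zip(seq, seq[1:]))

print(is_path(["Initialize", "Enqueue", "Dequeue", "Finalize"]))   # True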
Example of operation preconditions:
Search component (precondition: Login)
Request Book component (precondition: Search)
Issue component (precondition: Request Book)
All nodes: every node in the flowgraph, constructed from the specification of the class to be tested, must be exercised at least once.
All branches: every edge in the flowgraph must be exercised at least once.
All uses: for every use of a variable, test cases must cover a path from a definition of that variable to that use.
All DU paths: the strongest data-flow testing strategy; every du-path from every definition of every variable to every use of that definition must be exercised.
All paths: all paths leading from the initial node to the final node must be exercised.
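A rough sketch of how the first two criteria can be measured against a set of generated test cases, reusing the adjacency-map flowgraph sketched earlier (the data-flow criteria additionally require definition/use annotations, omitted here for brevity):

def covered_nodes(test_set):
    return {op for seq in test_set for op in seq}

def covered_edges(test_set):
    return {(a, b) for seq in test_set for a, b in zip(seq, seq[1:])}

def satisfies_all_nodes(test_set, graph):
    # every node of the flowgraph appears in some test case
    return covered_nodes(test_set) >= set(graph)

def satisfies_all_branches(test_set, graph):
    # every edge of the flowgraph is traversed by some test case
    required = {(v, w) for v in graph for w in graph[v]}
    return covered_edges(test_set) >= required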
2- Generating flowgraphs
The primary issue is how to decide correctly and efficiently which edges should be included in the graph.
Experience with RESOLVE-specified components indicates that the vast majority of operations have relatively simple preconditions.
This allows one to identify a subset of the edges that are always feasible (or always infeasible) for every object lifetime.
2- Generating flowgraphs (cont.)
Omit all difficult edges >> ensures that no infeasible edges are included; the cost of this conservatism is the exclusion of some feasible edges, and hence of desirable test cases, and excluding desirable test cases rarely leads to desirable results.
Decide on an edge-by-edge basis >> cost + effort + human error.
Include all difficult edges >> inclusion of some infeasible edges, and hence of undesirable test cases that force operations to be exercised when their preconditions are false >> risky, but it is possible to screen out test cases that exercise infeasible edges automatically.
Experience with the prototype suggests that it is much easier to include more test cases than necessary at generation time and automatically weed out infeasible cases later.
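One way to automate this classification is to summarize each operation's simple precondition and effect over a crude abstraction of the object's state. The summaries below are assumptions chosen to match the queue example, not part of the paper's prototype:

# Abstract effect of each operation: the queue states it can leave behind.
POST_STATES = {
    "Initialize": {"empty"},
    "Enqueue":    {"nonempty"},
    "Dequeue":    {"empty", "nonempty"},   # cannot tell without more information
    "Is_Empty":   {"empty", "nonempty"},
}

# Abstract precondition of each operation: the states in which it may be called.
PRE_STATES = {
    "Enqueue":  {"empty", "nonempty"},
    "Dequeue":  {"nonempty"},
    "Is_Empty": {"empty", "nonempty"},
    "Finalize": {"empty", "nonempty"},
}

def classify_edge(v1, v2):
    after_v1 = POST_STATES[v1]
    allowed  = PRE_STATES[v2]
    if after_v1 <= allowed:
        return "always feasible"
    if not (after_v1 & allowed):
        return "always infeasible"
    return "difficult"          # feasible in some lifetimes, not in others

print(classify_edge("Enqueue", "Dequeue"))     # always feasible
print(classify_edge("Initialize", "Dequeue"))  # always infeasible
print(classify_edge("Dequeue", "Dequeue"))     # difficult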
3- Enumerating paths
All three of the criteria used in the prototype generate a set of easily identifiable paths for coverage.
One test frame can be generated for each such path P (from v1 to v2) by composing three subpaths (an initialization subpath, P itself and a finalization subpath) to form the sequence of operations in the test frame.
Solution 1: generate all test frames, and later filter out those containing infeasible paths.
Drawback: this is less than ideal because of the large number of infeasible test frames produced.
Example: Initialize → Enqueue(q, x) → Dequeue(q, x) → Dequeue(q, x) → Finalize is the test frame generated when v1 = v2 = Dequeue; it is infeasible because the queue is empty at the second Dequeue.
3- Enumerating paths (cont.)
Solution 2:
Compute the initialization subpath for v1: Initialize → v1,1 → … → v1,m → v1
Compute the initialization subpath for v2: Initialize → v2,1 → … → v2,n → v2
Use Initialize → v1,1 → … → v1,m → v2,1 → … → v2,n → v1 as the initialization subpath for P, provided that the edges v1,m → v2,1 and v2,n → v1 exist.
Solution 3: one can modify the method of selecting initialization and finalization subpaths by weighting difficult edges.
Finally, one should note that this enumeration strategy typically results in naive test frames: those with the minimal number of operations and the minimal number of distinct objects necessary. The result is a larger number of fairly small, directed test cases that use as little additional information as possible.
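A minimal sketch of the enumeration step for a single use edge v1 → v2, following Solution 2: find initialization subpaths by breadth-first search over the adjacency-map flowgraph sketched earlier, merge them, and append the path and a finalization subpath. The weighting of difficult edges from Solution 3 and the check that the two connecting edges exist are omitted for brevity; the helper names are assumptions.

from collections import deque

def shortest_path(graph, start, goal):
    # Shortest operation sequence from start to goal along flowgraph edges (BFS).
    frontier, seen = deque([[start]]), {start}
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(path + [nxt])
    return None

def test_frame(graph, v1, v2):
    init1 = shortest_path(graph, "Initialize", v1)          # Initialize ... v1
    init2 = shortest_path(graph, "Initialize", v2)[1:-1]    # interior of Initialize ... v2
    fini  = shortest_path(graph, v2, "Finalize")[1:]        # ... Finalize (without v2)
    # Merge the two initialization subpaths as in Solution 2, then append P and finalization.
    return init1[:-1] + init2 + [v1, v2] + fini

# test_frame(flowgraph, "Dequeue", "Dequeue")
# -> ['Initialize', 'Enqueue', 'Enqueue', 'Dequeue', 'Dequeue', 'Finalize']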
One must instantiate the test frames with specific parameter values:
Random selection
Boundary Value Analysis (BVA): added so that scalars with easily describable operational domains could be more effectively supported (see the sketch at the end of this slide)
Value selection is even more difficult when testing generic components that may have potentially complex data structures as parameters.
The side effect of this is more critical: legitimate test frames may be thrown out in practice because of the naive selection of parameter values.
One can liberally include edges in the flowgraph and then generate test cases
in the normal fashion.
This runs the risk of generating infeasible test cases, but automatic detection
and filtering make this option practical
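A sketch of the instantiation step, mixing random selection with boundary value analysis for a scalar item domain. The domain, the choice of which operations take an item parameter, and the 50/50 mix are all assumptions made for illustration:

import random

def boundary_values(lo, hi):
    # Classic BVA picks: the ends of the domain and their neighbours.
    return [lo, lo + 1, hi - 1, hi]

def pick_value(domain, use_bva=True):
    lo, hi = domain
    if use_bva and random.random() < 0.5:
        return random.choice(boundary_values(lo, hi))
    return random.randint(lo, hi)

def instantiate(frame, domain=(0, 100)):
    # Attach a concrete argument to each operation that consumes an item.
    needs_item = {"Enqueue"}
    return [(op, pick_value(domain)) if op in needs_item else (op, None)
            for op in frame]

# instantiate(['Initialize', 'Enqueue', 'Dequeue', 'Finalize'])
# -> e.g. [('Initialize', None), ('Enqueue', 99), ('Dequeue', None), ('Finalize', None)]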
Suggestion
I think one could use UML diagrams (sequence and component diagrams) instead of a formal specification; I think they are easier to use and could also be used to generate the test cases easily.
The effectiveness of the testing strategy described here hinges in great part on automatically detecting interface contract violations for the component under test.
After the underlying component completes its work, the detection wrapper
performs a run-time post-condition check on the results.
This is the primary means for addressing the satisfiability issues raised in Section 3.
Further, this indication will be raised in the operation or method where the failure
occurred, whether or not the failure would be detected by observing the top-level
output produced for the test case.
In addition, invariant checking ensures that internal faults that manifest themselves via inconsistencies in an object's state will be detected at the point where they occur.
Without invariant checking, such faults would require observable differences in
output produced by subsequent operations in order to be detected.
Finally, if the component under test is built on top of other components, one should
also encase those lower-level components in violation detection wrappers (at least
wrappers that check preconditions).
This is necessary to spot instances where the component under test violates its
client-level obligations in invoking the methods of its collaborators.
The use of violation detection wrappers can lead to an automated testing approach
that has a greater fault revealing capability than traditional black-box strategies.
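A minimal sketch of such a violation detection wrapper for the Queue class sketched earlier. The invariant and the exact form of the checks are assumptions, but the structure (check the precondition before delegating, the post-condition and invariant afterwards) follows the description above, so a violation is reported in the very operation where it occurs:

class QueueViolation(AssertionError):
    pass

class CheckedQueue:
    def __init__(self):
        self._q = Queue()

    def _invariant(self):
        # Representation invariant of this simple model: the item container is a list.
        if not isinstance(self._q.items, list):
            raise QueueViolation("invariant violated")

    def enqueue(self, x):
        old = list(self._q.items)              # snapshot of #q for the post-condition
        self._q.enqueue(x)                     # precondition of Enqueue is 'true'
        if self._q.items != old + [x]:         # ensures: q = #q * <#x>
            raise QueueViolation("Enqueue post-condition violated")
        self._invariant()

    def dequeue(self):
        if not self._q.items:                  # requires: q is not empty
            raise QueueViolation("Dequeue precondition violated by the caller")
        old = list(self._q.items)
        x = self._q.dequeue()
        if [x] + self._q.items != old:         # ensures: #q = <x> * q
            raise QueueViolation("Dequeue post-condition violated")
        self._invariant()
        return x

    def is_empty(self):
        return self._q.is_empty()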
5. An experimental assessment
As discussed in Section 3, a prototype test set generator for three of Zweben et al.'s adequacy criteria has been implemented.
The design of the prototype included several tradeoffs, some of which are
quite simplistic, that might adversely affect the usefulness of the approach.
At the same time, the only empirical test of the fault-detecting ability of test
sets following this approach is the original analysis reported by Zweben et al.
[2].
5.1. Method
1. select a group of subject components for the experiment;
2. create faulty versions of each subject component (here, by mutation);
3. generate a test set for each of the three criteria, for each of the components;
4. execute each buggy version of each subject component on each test set;
5. check the test outputs (only) to determine which faults were revealed without using violation detection wrappers;
6. check the violation detection wrapper outputs to determine which faults were revealed by precondition, post-condition and invariant checking.
5.1. Method (cont.)
Four RESOLVE-specified components were selected for this study: a queue, a stack, a one-way list and a partial map.
All are container data structures with implementations ranging from simple to fairly
complex.
The queue and stack components both use singly linked chains of dynamically
allocated nodes for storage.
The primary differences are that the queue uses a sentinel node at the beginning of its chain and maintains pointers to both ends of the chain, while the stack maintains a single pointer and does not use a sentinel node (the queue representation is sketched at the end of this slide).
The one-way list also uses a singly linked chain of nodes with a sentinel node at the
head of the chain.
Internally, it also maintains a pointer to the last node in the chain together with
a pointer to represent the current list position. The partial map is the most complex
data structure in the set
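For illustration, a Python sketch of the queue representation just described: a singly linked chain with a sentinel node at the front and pointers to both ends. This is an assumption-level rendering, not the RESOLVE implementation used in the study.

class Node:
    def __init__(self, item=None):
        self.item = item
        self.next = None

class LinkedQueue:
    def __init__(self):
        self.front = Node()        # sentinel node; real items start at front.next
        self.back = self.front     # points at the last node in the chain

    def enqueue(self, x):
        self.back.next = Node(x)   # append after the current last node
        self.back = self.back.next

    def dequeue(self):
        first = self.front.next    # first real node (requires: queue not empty)
        self.front.next = first.next
        if self.back is first:     # chain became empty; back must follow the sentinel
            self.back = self.front
        return first.item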
5.1. Method (cont.)
Internally, its implementation uses a fixed-size hash table with one-way lists
for buckets.
Although these components are all relatively small, the goal is to support the
testing of software components, including object-oriented classes.
5.1. Method (cont.)
This version of mutation testing uses only five mutation operators: ABS, AOR, LCR, ROR and UOI [10]; it dramatically reduces the number of mutants generated, but has been experimentally shown to achieve almost full mutation coverage [9].
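To show what these operators do, here is a hypothetical code fragment (not taken from the subject components) with two example mutants:

# Original fragment
def count_after_dequeue(length):
    if length > 0:              # guard corresponding to the precondition
        return length - 1       # one item removed
    return length

# ROR (relational operator replacement) mutant: '>' becomes '>='
#     if length >= 0:
# AOR (arithmetic operator replacement) mutant: '-' becomes '+'
#     return length + 1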
5.1. Method (cont.)
All remaining mutants were guaranteed to differ from the original program on some
legitimate object lifetime in a manner observable through the parameter values
returned by at least one method supported by the component.
Table I summarizes the characteristics of the subject components and the number of faulty versions generated.
5.2. Results
Because the goal was to assess test sets generated using the specific heuristics
described in this paper rather than the more general goal of assessing the adequacy
criteria themselves, the experiment was limited to test sets produced by the
prototype.
Since all of the subject components are data structures that are relatively insensitive
to the values of the items they contain, such superficial variations in generated test
sets were not explored. As a result, a total of 12 test sets were generated, one for
each of the chosen adequacy criteria for each of the subject components.
5.2. Results (cont.)
For each subject component, the three corresponding test sets were run against
every faulty version.
Table II summarizes the results. "Observed failures" indicates the number of mutants killed by the corresponding test set based solely on observable output (without considering violation detection wrapper checks).
"Detected violations" indicates the number of mutants killed solely by using the invariant and post-condition checking provided by the subject's detection wrapper. In theory, any failure identifiable from observable output will also be caught by these checks.
The rightmost column lists the number of infeasible test cases produced by the prototype in each test set.
5.2. Results (cont.)
5.3. Discussion
On the subject components, it is clear that all definitions reveals the fewest faults
of the three criteria studied, while all uses reveals the most, which is no surprise.
In all cases, the use of detection wrappers significantly increased the number of
mutants killed.
Further, the increase was more dramatic with weaker test sets.
In the extreme case of the all nodes test set for the one-way list, its fault-detection ability was doubled.
5.3. Discussion (cont.)
Also surprising is the fact that for three of the four subjects, the all uses test sets achieved a 100 per cent fault detection rate with the use of detection wrappers.
As a result, the all nodes and all uses test sets for the stack, queue and one-way list components actually achieved 100 per cent white-box statement-level coverage of all statements where mutation operators were applied.
5.3. Discussion (cont.)
Presumably, this is atypical of most components, so the 100 per cent fault
detection results for all uses should not be unduly generalized.
Nevertheless, the fact that a relatively weak adequacy criterion could lead to such
effective fault revelation is a promising sign for this technique.
5.3. Discussion (cont.)
In that study, all nodes revealed 6 out of 10 defects, all definitions revealed 6 out of
10, and all uses revealed 8 out of 10, all of which are comparable to the Observed
failures results here.
Although the results of this experiment are promising, there are also important threats
to the validity of any conclusions drawn from it.
The subjects were limited in size for practical reasons. Although well-designed classes
in typical OO designs are often similar in size, it is not clear how representative the
subjects are in size or logical complexity. Also, the question of how well mutation-based
fault injection models real-world faults is relevant, and has implications for any
interpretation of the results.
With respect to the adequacy criteria themselves, this experiment only aims to assess
test sets generated using the strategy described in this paper, rather than aspiring to a
more sweeping assessment of the fault detecting ability of the entire class of test sets
meeting a criterion
6. RELATED WORK
The test set generation approach described here has been incorporated into an end-to-end test automation strategy that also includes generation of component test drivers and partial to full automation of violation detection wrappers.
One key difference from this related work is that the current approach uses model-based specifications, while DAISTS and ASTOOT are based on algebraic specifications.
6. RELATED WORK (cont.)
Algebraic specifications often encourage the use of function-only operations and may
suppress any explicit view of the content stored in an object.
6. RELATED WORK (cont.)
An FSM model typically contains a subset of the states and transitions supported by
the actual component under consideration, and may be developed by identifying
equivalence classes of states that behave similarly.
Test coverage is gauged against the states and transitions of the model.
The work described here, by contrast, does not involve collapsing the state space of the component under test into a reduced model.
6. RELATED WORK (cont.)
Other work on test data adequacy, including both specification-based and white-box
criteria, is surveyed in detail by Zhu et al.
The focus of the current work is to develop and assess practical test set generation
strategies based on existing criteria, rather than describing new criteria. Similarly,
Edwards et al.
provide a more detailed discussion of interface violation detection wrappers and the
work related to them, including alternative approaches to run-time post-condition
checking
7. CONCLUSIONS
Although there are a number of very difficult issues related to satisfiability involved in
generating test data, a prototype test set generator was implemented using specific
design tradeoffs to overcome these obstacles.
The results of the experiment, together with experiences with the generator, indicate that there is the potential for practical automation of this strategy.