A Framework For The Automation of Testing Computer Vision Systems

Abstract—[…] assurance applications, e.g., for finding surface defects in products during manufacturing, surveillance, but also automated driving, requiring reliable behavior. Interestingly, there is only little work on quality assurance and especially testing of vision systems in general. In this paper, we contribute to the area of testing vision software and present a framework for the automated generation of tests for systems based on vision and image recognition, with a focus on easy usage, uniform usability, and expandability. The framework makes use of existing libraries for modifying the original images and for obtaining similarities between the original and modified images. We show how such a framework can be used for testing a particular industrial application for identifying defects on riblet surfaces, and we present preliminary results from the image classification domain.

Index Terms—test case generation; testing vision software; testing image classifiers

The research described in this paper has been carried out within the FFG Beyond Europe program as part of project 874163 RiSPECT – Riblet Inspection and Efficiency Assessment Technology funded by the Federal Ministry for Digital and Economic Affairs (BMDW). ∗Authors are listed in reverse alphabetical order.

[Fig. 1. The testing approach of the riblet surface quality inspection systems. The figure sketches the automated test case generation loop: tests target the correctness of functionality, robustness, and the meeting of requirements (e.g., false positive or negative rate), and are obtained by generating input images with different features, e.g., similar images with small disturbances, color modifications, brightness modifications, or single pixel modifications; the classifier under test then passes (✓) or fails (✗) on these inputs.]
I. INTRODUCTION

Within the last decades, vision systems and image recognition have gained importance due to their increasing use in practical applications, ranging from automating inspection tasks and surveillance to autonomous systems, including autonomous cars or mobile robots. In all these systems, the behavior relies heavily on the quality of the qualitative or quantitative observations extracted from images. Such vision systems should deliver a correct, i.e., expected, output for the widest range of images. The output should also be as robust as possible: small changes in the image should not lead to large deviations on the side of the output. Especially in the case of autonomous driving, the latter property has gained a lot of attention.

Unfortunately, and especially in cases where the vision system is based on the application of machine learning and in particular neural networks, robustness cannot be guaranteed. We may always find an adversarial example, i.e., an image where we change only one or a few pixels, leading to a complete misclassification. There has been a lot of research dealing with testing based on such adversarial examples, including [1] or [2]. In the context of autonomous driving, Tian et al.'s work on DeepTest [3], where the authors also address the important topic of metamorphic testing, has gained great attention. Other similar work includes [4] and, most recently, [5]. However, there has also been work on successfully preventing vision systems from being attacked using adversarial examples (see [6]). It is worth noting that testing vision systems does not necessarily only include testing the underlying vision algorithm. Such systems comprise other parts, including hardware, which have to work smoothly together for delivering the expected results. Hence, classical testing may also be required (see, e.g., [7] discussing the use of coverage tools for computer vision applications).

For our work, we do not rely on a certain underlying method of computer vision used in a particular application. However, we follow previous research on testing vision software, where it turned out that object detection and classification often deliver wrong results, e.g., when making use of adversarial
examples. Hence, we consider the use of image modifications for testing whether the system under test is able to detect and handle such modified images correctly, e.g., raise error messages if necessary. The application area of the vision system is surface quality inspection. In particular, we want to test a computer vision application that is currently under development, which returns a quality measure for any riblet surface. Riblets are used for reducing the drag of surfaces, relying on the shark skin effect, and thus saving energy. Riblets implement tiny structures with a height of about 50 µm. They are used in aviation for reducing fuel consumption. The task of riblet surface quality inspection is to gain information on whether the riblet surface is of good or bad quality in order to determine replacement.

To test the riblet surface quality inspection system, we suggest the generation of test cases like the ones depicted in Figure 1. In this approach, the system under test will be tested using riblet surface images and also images to which we apply modifications. Depending on the modification, the final system should react differently. In case of small deviations, the system should still deliver the same quality measure or classification. If we change too much, like replacing the original image with a black image, which may happen in practice due to a broken light in the vision system, the system should come up with a corresponding error message indicating the problem. The testing approach we are following does not depend on underlying computer vision methods but is generally applicable in this domain.

In this paper, we present the first steps towards implementing the automated test case generation method. This includes a description of how test cases can be generated and a Python framework that enables the usage of different image modifiers, which has been developed to provide a uniform interface for image modifications and which can be easily extended. Moreover, we discuss preliminary results considering a computer vision based object identifier, aiming at showing the practicability of the framework.

II. PRELIMINARIES

During development, we use a set of images together with their expected output, i.e., a test suite, for coming up with a model used to map images to an expected value. This modeling step can be performed manually or automatically. The latter relies on machine learning approaches, like, most recently, deep neural networks. In any case, the initial test suite is also used partially to validate (or evaluate) the model with respect to its capability of performing the correct mapping between images and values.

In testing, however, the aim is not only to validate how well the system under test performs the mapping, i.e., using metrics for quantifying differences between expectations and the real mapping, e.g., the root mean square error. It is also necessary to assure that (i) there are no other effects where the system would behave wrongly, e.g., interactions with the system that cause crashes or exceptions raised during operation, and (ii) the mapping of images to values is robust. The latter is important to guarantee that deviations between images used for obtaining the mapping do not have variations leading to completely wrong values. In addition, we have to test that the system also works well in case of faults in the overall system, for example, an image sensor that may break, or lenses that may be distorted due to dust and other effects. Such faults should either be detected by the system and lead to warnings or error messages, or be compensated.

In the following, we formalize the testing problem considering a more general view, where we also take care that the system reacts in an expected manner in case of image deviations, ranging from smaller deviations to more substantial ones.

Formally, we state that a system under test S implements a function θS mapping an image µI to a value from the domain D = DO ∪ DE, where DO denotes a set of possible values S returns as a result of image analysis, and DE a set of other pre-defined behavioral characterizations of S that may be revealed, e.g., a crash or an error message indicating that there is a problem with the image sensor or a lens. Note that DO may only comprise pass (✓) or fail (✗) in case of an image classifier, but may also comprise elements allowing to state the object's position in the image, as well as an indicator of the accuracy of the particular detection. An execution of the system under test always delivers back an element of D, i.e., θS(µI) ∈ DO ∪ DE.

We define a test case t as a pair (I, x), where I is an image and x ∈ DO ∪ DE the expected output. (I, x) is a passing test case if and only if θS(I) = x, and a failing test case otherwise. Furthermore, a test suite TS for a system under test S is a non-empty set of test cases t = (I, x). As already noted, there is always an initial test suite TSI available for a vision system under test S, used to come up with a model for mapping images to a particular domain, regardless of the underlying methodology, which may comprise machine learning, and the hardware setup comprising cameras and light sources.
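To make these definitions concrete, the following minimal Python sketch encodes θS as a callable and a test case as the pair (I, x). All names are illustrative and not part of the presented framework's API; we assume the system's outputs are discrete values (labels or error tokens) that can be compared with ==.

```python
from dataclasses import dataclass
from typing import Any, Callable

import numpy as np

# theta_S: the system under test viewed as a function from an image to an
# element of D = D_O ∪ D_E (e.g., a class label from D_O, or "err" from D_E).
SystemUnderTest = Callable[[np.ndarray], Any]

@dataclass(frozen=True)
class TestCase:
    image: np.ndarray  # I
    expected: Any      # x ∈ D_O ∪ D_E

def passes(theta_s: SystemUnderTest, t: TestCase) -> bool:
    """(I, x) is passing iff theta_S(I) = x; otherwise it is failing."""
    return theta_s(t.image) == t.expected
```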
The testing problem for vision systems S that we consider in our work is to generate test suites from the given system S and the initial test suite TSI that assure the robustness of the obtained results and of the system itself, requiring to: (i) come up with slightly modified images from TSI to which S responds similarly, and (ii) apply other (severe) modifications to images, showing that faults in the system, like distortions of lenses or a broken sensor or light, are correctly handled by the system. The presented testing framework has been developed considering both types of modifications.

Formally, the testing framework comprises modification operators m1, . . . , mk. The resulting test suite is generated by applying all modification operators to the initial test suite. Modification operators can be classified as returning similar images or different images, where for the latter we assume that the system under test S shall return the error message err. We use the predicate sim on a modification operator to check for similarity. To solve the testing problem for vision systems, we define the outcome of test suite generation, i.e., a
test suite TS≈ and TS≉ for the case of similar images and more severe modifications, respectively:

TS≈ = {(I′, x) | (I, x) ∈ TSI ∧ I′ = mi(I) ∧ sim(mi)}

TS≉ = {(I′, err) | (I, x) ∈ TSI ∧ I′ = mi(I) ∧ ¬sim(mi)}

The testing framework, which we discuss in more detail in the next section, implements modification operators that allow us to come up with both types of test suites.
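For illustration, the following Python sketch computes TS≈ and TS≉ by applying each modification operator mi to the initial test suite and branching on sim(mi). The concrete operators and the err token are placeholders and not the framework's actual API; test cases are represented as plain (image, expected) pairs for brevity.

```python
from typing import Callable, List, Tuple

import numpy as np

ERR = "err"  # error message expected for non-similar modifications

def invert(img: np.ndarray) -> np.ndarray:
    """A severe modification: invert all pixel values."""
    return 255 - img

def brighten(img: np.ndarray) -> np.ndarray:
    """A similarity-preserving modification: mild brightness increase."""
    return np.clip(img.astype(np.int16) + 25, 0, 255).astype(np.uint8)

# Modification operators m_1, ..., m_k together with their sim(m_i) flag.
MODIFIERS: List[Tuple[Callable[[np.ndarray], np.ndarray], bool]] = [
    (invert, False),
    (brighten, True),
]

def generate_test_suites(ts_initial):
    """Compute TS≈ and TS≉ from the initial test suite TS_I."""
    ts_sim, ts_err = [], []
    for image, expected in ts_initial:
        for m, sim in MODIFIERS:
            modified = m(image)
            if sim:
                ts_sim.append((modified, expected))  # (I', x)
            else:
                ts_err.append((modified, ERR))       # (I', err)
    return ts_sim, ts_err
```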
III. THE TESTING FRAMEWORK

We implemented the testing framework for testing computer vision systems in Python 3.8.3. The framework is open, relies on other frameworks that have been integrated, can easily be extended, and provides a simple-to-use interface to come up with different test suites. Its architecture comprises the following three main parts:

• Classifier/Object recognition. This part allows the integration of a deep neural network to recognize different objects in images. It produces a set of bounding boxes as output; each box contains one object together with a description of what the object is, for example, a road sign, person, animal, bus, or car. For our first evaluation we used Mask R-CNN [8] for object detection, which was pretrained on the MS COCO dataset [9]. This part of the framework takes any image as input and produces the same picture along with the objects recognized inside it. We added this part to the framework for evaluation purposes. (One possible realization of this part is sketched at the end of this section.)
• Modifier. This part implements modification operators allowing to modify any given image. Some of the modification methods we implemented are: inversion, pixel change, affine transformation, blur, adding snow/rain/fog/sun, darkening, and brightening. All modification operators take an image as input and produce a modification of this image according to the chosen method. After modifying the image, the modifier saves the image into a directory/folder specified by the user. (See the sketch following this list.)
• Image diff. We implemented a method for finding the difference between two given images. The diff tool uses an algorithm called the Structural Similarity Index Measure (SSIM) [10], which is used to compute the similarity between two images. It takes into account the structural information of the image and compares it accordingly. In contrast to other techniques like the mean squared error (MSE), SSIM finds a pattern and relation of the pixels of an image and defines a structure. Afterwards, it compares the structures of the two images. SSIM returns a value between 0 and 1, where 0 means that the two images are completely different, and 1 means that they are identical. (The sketch following this list also covers this part.)
different, and 1 means the two images are identical.
only one wrongly recognized object but having a SSIM of
The framework provides a command line interface and is 0.00. Therefore, further investigations are required to clarify
very simple to use. The user can classify any image that he or the relationship between SSIM, the modification operators
she prefers, can modify images, and also compare two images. used, and the classification results. In addition, we plan to
The framework also allows to implement a test case generator extend the study using more images where we apply all
as described in the previous section. modifications, and make use of other similarity metrics.
In the following, we provide an initial experimental study
making use of the implemented R-CNN.
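The paper integrates a Mask R-CNN pretrained on MS COCO but does not state the concrete implementation. As one possible realization of the classifier/object recognition part referenced above, the following sketch uses torchvision's COCO-pretrained Mask R-CNN; the function name and score threshold are our own choices, and depending on the torchvision version the weights argument may be preferred over pretrained=True.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# COCO-pretrained Mask R-CNN as the (exchangeable) object recognition part.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

def detect(path: str, min_score: float = 0.5):
    """Return (box, label_id, score) triples for objects above min_score."""
    img = to_tensor(Image.open(path).convert("RGB"))
    with torch.no_grad():
        out = model([img])[0]  # dict with "boxes", "labels", "scores", "masks"
    return [
        (box.tolist(), int(label), float(score))
        for box, label, score in zip(out["boxes"], out["labels"], out["scores"])
        if score >= min_score
    ]

# Example: compare detections on an original and a modified image, e.g.,
# detect("original.png") versus detect("modified/blur.png").
```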
IV. PRELIMINARY EXPERIMENTAL STUDY

The objective behind the experimental study was to indicate the usefulness of the proposed testing framework for computer vision applications. Since the riblet surface inspection system is currently under development and not ready for testing, we made use of the R-CNN, which is part of our framework. In our experiments, we considered five images and applied several different modification methods. We depict the original images and the considered modifications in Figure 2. It is worth noting that for a first proof only a selection of modifications was applied to the images. An extension where all modifications are applied to all kinds of different images is part of future research.

After applying the modification operators, we used the R-CNN on the modified images to compare its object detection and classification output to the expected output and to reveal possible differences. This is followed by the computation of the SSIM difference between the original image and the modified one to get an indication of the magnitude of the modification. We summarize the obtained results in Table I.

TABLE I
COMPUTED SSIM VALUES AFTER THE APPLIED IMAGE MODIFICATIONS AND OBJECT RECOGNITION RESULTS

image | modification | SSIM | object recognition results
1 | inverted   | 0.50 | Some objects like cars are misclassified as boats.
1 | pixels     | 0.93 | Most of the classifications are correct, but objects far away are not recognized anymore.
2 | blur       | 0.27 | Classification correct except for cars and the policeman.
2 | affine     | 0.00 | Classification correct except for one car recognized as a boat.
3 | darkened   | 0.55 | One of the dogs recognized as a kite.
3 | flipped    | 0.05 | Only one object recognized correctly.
4 | brightened | 0.84 | Classification correct.
4 | fog        | 0.41 | Classification correct.
4 | shadow     | 0.87 | Classification correct.
5 | snow       | 0.60 | Classification correct.
5 | rain       | 0.34 | Classification correct.
5 | sun        | 0.82 | Classification correct.

From the results, we see that in many cases the R-CNN works fine, only failing for some objects. We also see that there seems to be little correlation between the SSIM and the degradation of object recognition. For example, applying fog to image 4 causes the SSIM to drop to 0.41, yet the recognition of objects is not influenced. Almost the same happens for image 2 with the affine transformation, which results in only one wrongly recognized object but has an SSIM of 0.00. Therefore, further investigations are required to clarify the relationship between the SSIM, the modification operators used, and the classification results. In addition, we plan to extend the study using more images, to which we apply all modifications, and to make use of other similarity metrics.
[Fig. 2. The images used for our experiments; the panels show the original images and their modified variants (panel titles include, e.g., "Original Image 1", "colors inverted", "black pixels added"). The modifications applied to images 1, 2, and 3 represent mostly, but not exclusively, general modifications, whereas the modifications of images 4 and 5 represent more domain-specific modifications, i.e., from the automotive domain.]

V. CONCLUSIONS

In this paper, we introduced a framework for the automated generation of tests for systems utilizing vision and image recognition, since testing is often only performed by evaluating the system on an image test set without specific modifications. In particular, the framework architecture comprises three parts: the classifier or object recognition part, which allows the integration of different deep neural networks used for image recognition that should be validated; the modifier part, which applies various image modifications and can easily be extended by additional modifications; and a part for calculating the differences between the original image and the modified image based on the structural similarity index measure metric, which can also be extended by additional metrics of interest.

From the first results, it can be seen that there seems to be little correlation between the applied metric and the recognition of objects. Therefore, we intend to further work on the integration of additional metrics as well as the integration of different object detectors to investigate whether this is the case in general. Moreover, we may also consider the applied modification operators and their corresponding parameters, like the degree of blurring or the intensity of the rain, when investigating the relationship between metrics and object recognition results. In future work, we will make use of the framework for testing the mentioned riblet surface inspection system, which is currently under development.

REFERENCES

[1] M. Wicker, X. Huang, and M. Kwiatkowska, "Feature-guided black-box safety testing of deep neural networks," CoRR, vol. abs/1710.07859, 2017.
[2] X. Huang, M. Kwiatkowska, S. Wang, and M. Wu, "Safety verification of deep neural networks," in CAV (1), ser. Lecture Notes in Computer Science, vol. 10426. Springer, 2017, pp. 3–29.
[3] Y. Tian, K. Pei, S. Jana, and B. Ray, "DeepTest: Automated testing of deep-neural-network-driven autonomous cars," in ICSE. ACM, 2018, pp. 303–314.
[4] K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, C. Xiao, A. Prakash, T. Kohno, and D. Song, "Robust physical-world attacks on deep learning visual classification," in CVPR. IEEE Computer Society, 2018, pp. 1625–1634.
[5] P. Arcaini, A. Bombarda, S. Bonfanti, and A. Gargantini, "Dealing with robustness of convolutional neural networks for image classification," in AITest. IEEE, 2020, pp. 7–14.
[6] I. J. Goodfellow, P. D. McDaniel, and N. Papernot, "Making machine learning robust against adversarial inputs," Commun. ACM, vol. 61, no. 7, pp. 56–66, 2018.
[7] I. Nica, G. Jakob, K. Juhart, and F. Wotawa, "Results of a comparative study of code coverage tools in computer vision," in ICST Workshops. IEEE Computer Society, 2017, pp. 36–37.
[8] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," 2018.
[9] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár, "Microsoft COCO: Common objects in context," 2015.
[10] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.