Automatic Test Suite Generation for Key-Points Detection DNNs using Many-Objective Search

Fitash Ul Haq, Donghwan Shin, Lionel Briand, Thomas Stifter, Jun Wang
Date: 14/07/2021

2
Introduction
Automatic Test Suite Generation for Key-Points Detection DNNs using Many-Objective Search (Experience Paper)
• Automatically detecting key-points in an image or a video is a fundamental step
for many applications, such as face recognition and drowsiness detection
• With the recent advances in Deep Neural Networks (DNNs), Key-point Detection DNNs (KP-DNNs) are
widely used to detect key-points in an image
[Figure: car dash camera video → DNN → "Driver is awake"]

6
Motivation and Goal
• IEE developed a drowsiness detection system based on a Facial KP-DNN
• In the facial key-points detection problem, each Key-Point (KP) is important,
as even one incorrectly predicted KP can have a major impact on system
reliability and safety
• Hence, we should test KPs individually to properly test the DNN
• One test requirement for each KP: “the DNN should correctly predict the KP”
• For our subject DNN (IEE-DNN), we have 27 test requirements as we have 27 KPs
• Our goal is to find a test suite that causes IEE-DNN to severely mis-predict as
many key-points as possible
[Figure: example input, a reference image showing the 27 KPs]

7
Challenges and Idea
• Challenges
• The input space is too large to be exhaustively explored
• The number of KPs is typically large (e.g., our evaluation uses the IEE-DNN that detects 27 KPs)
• One should not simply consider average prediction errors across all KPs
• It may be infeasible to find a test image causing severe prediction errors for some KPs
• In such cases, it is essential to dynamically and efficiently redistribute the computational resources dedicated to
testing to the other KPs
• To address them, we apply many-objective search for test suite generation for the IEE-DNN
• State-of-the-art algorithms (i.e., MOSA* and FITEST**) aim to efficiently achieve each objective individually
• We set the misprediction of each KP as one objective
* Panichella, Annibale, Fitsum Meshesha Kifetew, and Paolo Tonella. "Reformulating branch coverage as a many-objective optimization problem." 2015 IEEE 8th International Conference on Software Testing, Verification and Validation (ICST). IEEE, 2015.
** Abdessalem, Raja Ben, et al. "Testing autonomous cars for feature interaction failures using many-objective search." 2018 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2018.
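A minimal sketch of treating each KP as its own objective and tracking which objectives are still open, so that effort can shift to them. The function name and the sample error values are hypothetical; the 0.05 threshold is the one IEE uses for an incorrect prediction.

```python
from typing import Dict, List, Set

THRESHOLD = 0.05  # a KP counts as mispredicted at or above this normalized error


def uncovered_objectives(errors_per_test: List[Dict[int, float]],
                         kp_ids: List[int]) -> Set[int]:
    """Return the KPs that no test so far has mispredicted (still-open objectives)."""
    covered = {
        kp
        for errors in errors_per_test
        for kp, err in errors.items()
        if err >= THRESHOLD
    }
    return set(kp_ids) - covered


# Hypothetical normalized errors for three KPs over two test images.
tests = [
    {1: 0.01, 2: 0.20, 3: 0.02},
    {1: 0.03, 2: 0.01, 3: 0.04},
]
remaining = uncovered_objectives(tests, [1, 2, 3])
print(sorted(remaining))  # KP2 is covered, so search effort goes to KP1 and KP3
```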

8
Overview: Automatic Test Suite Generation using Many-Objective Search

[Figure, built up over slides 9 to 14: the Search Engine produces an Input (vector); the Simulator turns it into a Test Image with Actual Key-points Positions; the DNN outputs Predicted Key-points Positions; the resulting Fitness Score (Error Value) is fed back to the search, which outputs the Most Critical Test Inputs]
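The evaluation step of this pipeline can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: `evaluate`, `per_keypoint_errors`, and the two stand-in functions are hypothetical names, and dividing the Euclidean distance by a constant `norm` is an assumed normalization.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]


def per_keypoint_errors(actual: Sequence[Point],
                        predicted: Sequence[Point],
                        norm: float) -> List[float]:
    """Euclidean distance between actual and predicted KP positions, divided by `norm`."""
    return [math.dist(a, p) / norm for a, p in zip(actual, predicted)]


def evaluate(vector: Sequence[float],
             simulator: Callable[[Sequence[float]], Tuple[object, List[Point]]],
             dnn: Callable[[object], List[Point]],
             norm: float) -> List[float]:
    """Input vector -> test image -> predicted positions -> one fitness score per KP."""
    image, actual = simulator(vector)  # image plus ground-truth KP positions
    predicted = dnn(image)
    return per_keypoint_errors(actual, predicted, norm)


# Stand-ins so the sketch runs; they are NOT IEE-SIM or IEE-DNN.
def fake_simulator(vector):
    return "image", [(10.0, 10.0), (20.0, 20.0)]


def fake_dnn(image):
    return [(13.0, 14.0), (20.0, 20.0)]


scores = evaluate([9, 0.0, 0.0, 0.0], fake_simulator, fake_dnn, norm=100.0)
```

Returning one score per KP, rather than a single average, is what lets the search treat each KP as a separate objective.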

15
Research Questions
• RQ1: How do alternative many-objective search algorithms fare in terms of test effectiveness?
• Check whether using many-objective search is indeed a suitable solution for the problem
• RQ2: Can we further distinguish search algorithms using the degree of mispredictions caused by
the test suites they generate?
• Compare how severely key-points are mispredicted by test suites generated across different search algorithms
• RQ3: Can we explain individual key-point mispredictions in terms of image characteristics?
• Investigate whether it is possible to provide accurate and interpretable explanations of mispredictions based on
image characteristics

16
Subject DNN and Simulator
• IEE-DNNv1.0
• Architecture: Stacked hourglass*
• Training set: 18,120 synthetic images generated by Blender
using make-human and 4D faces models
• Test set: 2,738 synthetic images
• Input: a 256 × 256 pixel image
• Output: locations of 27 key-points
• NME: 0.018
• IEE-SIMv1.0
• Input: Model ID, roll, pitch and yaw (Range: -30 to +30; defined
by IEE)
• Output: Image and ground truth for locations of key-points
• Number of models available: 10
* Newell, Alejandro, Kaiyu Yang, and Jia Deng. "Stacked hourglass networks for human pose estimation." European conference on computer vision. Springer, Cham, 2016.
[Figure: sample images from the 3D models]

17
RQ1: Effectiveness of Test Suites
• Objective
• Find the best search algorithm for generating test suites with maximum Effectiveness Score (ES)
• Search algorithms
• Random Search (RS), MOSA, FITEST
• MOSA+ and FITEST+: identical to MOSA and FITEST, but use a different crossover strategy (i.e.,
using dynamic distribution index) to better guide new test data towards uncovered objectives
• Experiment Parameters
• Search budget: 2 hours
• Repetition: 20 times
$$ES = \frac{\text{Number of Incorrectly Predicted Key-points}}{\text{Total Number of Key-points}}$$
• Statistical analysis
• Significance: Mann–Whitney U test
• Effect Size: Vargha and Delaney’s effect size
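As a rough illustration of the two quantities on this slide, the sketch below computes ES for one test suite and the Vargha and Delaney effect size between two samples. The function names and all numbers are made up; the input to `effectiveness_score` is assumed to be the worst error per key-point over the suite, with 0.05 as the misprediction threshold.

```python
from typing import Sequence


def effectiveness_score(max_errors: Sequence[float], threshold: float = 0.05) -> float:
    """ES: fraction of key-points whose error reaches the threshold."""
    mispredicted = sum(1 for e in max_errors if e >= threshold)
    return mispredicted / len(max_errors)


def vargha_delaney_a12(a: Sequence[float], b: Sequence[float]) -> float:
    """Probability that a value drawn from `a` exceeds one from `b` (ties count half)."""
    wins = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
    return wins / (len(a) * len(b))


# Hypothetical worst errors for four key-points: three reach the threshold.
es = effectiveness_score([0.10, 0.02, 0.07, 0.05])

# Hypothetical ES values from repeated runs of two algorithms.
a12 = vargha_delaney_a12([0.9, 0.8, 0.95], [0.5, 0.6, 0.8])
```

An effect size of 0.5 means neither sample tends to be larger; a value of 1 means every value in the first sample exceeds every value in the second.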

18
RQ1: Results
• MOSA and FITEST families outperform RS
• Overall, MOSA+ is the best in terms of maximizing the number of severely mispredicted KPs
Statistical Analysis
A         B         p-value   Effect Size
MOSA      RS        0         1
MOSA+     RS        0         1
FITEST    RS        0         1
FITEST+   RS        0         1
MOSA      FITEST    0         0.837
MOSA+     FITEST    0         0.945
FITEST+   FITEST    0.7502    0.53
MOSA      FITEST+   0.0045    0.7575
MOSA+     FITEST+   0         0.86
MOSA+     MOSA      0.1091    0.6375
[Figure: average ES over 20 runs of the different search algorithms]

19
RQ1: Implications
• Our approach is effective in generating test suites that cause IEE-DNN to severely mispredict
more than 93% of all key-points on average
• MOSA and MOSA+ are significantly better than FITEST and FITEST+ in terms of ES
• There is no significant difference between MOSA and MOSA+, or between FITEST and FITEST+. This
shows that dynamically controlling the similarity between parents and children in crossover does
not significantly improve effectiveness
• Because RQ1 only considers the number of severely mispredicted key-points, differences in effectiveness
across search algorithms may not appear clearly and completely
• For example, two test suites generated by different algorithms may cause the same number of severely
mispredicted key-points while differing in how severely each key-point is mispredicted

20
RQ2: Misprediction Severity for Individual Key-points
• Objective
• Find the best search algorithm for generating test suites with maximum Misprediction Severity (maximum error) for
each key-point
• Search algorithms (Same as RQ1)
• Random Search (RS), MOSA, FITEST, MOSA+, FITEST+
• Experiment Parameters (Same as RQ1)
• Search budget: 2 hours
• Repetition: 20 times
• Statistical analysis
• Significance: Wilcoxon signed-rank test
• Effect Size: Vargha and Delaney’s effect size
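A small sketch of Misprediction Severity, read here as the maximum normalized error per key-point over all tests in a suite. The function name, key-point labels, and error values are illustrative only.

```python
from typing import Dict, List


def misprediction_severity(suite_errors: List[Dict[str, float]]) -> Dict[str, float]:
    """For each key-point, the maximum normalized error over all tests in the suite."""
    severity: Dict[str, float] = {}
    for errors in suite_errors:
        for kp, err in errors.items():
            severity[kp] = max(severity.get(kp, 0.0), err)
    return severity


# Hypothetical suite of two tests, each with errors for two key-points.
suite = [
    {"KP1": 0.02, "KP26": 0.71},
    {"KP1": 0.06, "KP26": 0.36},
]
ms = misprediction_severity(suite)
```

Two suites with the same ES can still differ here, which is what RQ2 sets out to measure.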

21
RQ2: Results
• MOSA and FITEST families subsume RS
• We found that there are specific KPs that are more severely mis-predicted than others
[Figure: MS for individual key-points for the search algorithms]
Statistical Analysis
A         B         p-value   Effect Size
MOSA      RS        0         0.897
MOSA+     RS        0         0.902
FITEST    RS        0         0.876
FITEST+   RS        0         0.873
MOSA      FITEST    0.57      0.541
MOSA+     FITEST    0.594     0.543
FITEST+   FITEST    0.052     0.507
MOSA      FITEST+   0.177     0.545
MOSA+     FITEST+   0.009     0.556
MOSA+     MOSA      0.78      0.5

22
RQ2: Implications
• Some KPs are more severely mis-predicted than others,
mainly because:
• Under-representation of some KPs in the training data (e.g., KP7 is
only present in 79% of training data)
• Large variation in the shape and size of the mouth across different
3D models: KP24, KP25, KP26, and KP27 are located on the mouth,
which shows the largest variation among facial features
• There is no statistically significant difference in MS
between MOSA and MOSA+, and between FITEST and
FITEST+
• This implies that, consistent with RQ1, dynamically adjusting the
distribution index in crossover does not increase misprediction
severity for individual key-points
[Figure: sample images showing different variations of the mouth]

24
RQ3: Explaining Mispredictions
• Objective
• Investigate whether it is possible to provide accurate and
interpretable explanations of mispredictions based on image
characteristics used by the simulator to generate test images
• Approach
• Build a regression tree for each KP using test results
• Dataset: test images generated during the execution of our approach
• Input variables: roll, pitch, yaw, and 3D model ID
• Target variable: normalized prediction error (NE) of the IEE-DNN
• Evaluate the (predictive) error of generated regression trees
using 10-fold CV
Example Regression Tree
[Figure: tree with a split on Model-ID (= 9 / ≠ 9), a split on Pitch (< 18.41 / ≥ 18.41), and a leaf with NE = 0.04; remaining branches elided (…)]
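To illustrate how a regression tree arrives at a split such as a threshold on pitch, here is a generic sketch that picks the threshold minimizing the summed squared error of the two sides. It is not necessarily the tree learner used in the study, and the data are made up.

```python
from typing import List, Tuple


def sse(values: List[float]) -> float:
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Return (threshold, total SSE) of the split `x < threshold` that minimizes SSE."""
    distinct = sorted(set(xs))
    best = (distinct[0], sse(ys))  # fallback: no split improves on the whole set
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2  # midpoint between neighbouring values
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best


# Made-up data: NE jumps once pitch passes a value between 16 and 20.
pitch = [0.0, 5.0, 10.0, 16.0, 20.0, 25.0, 30.0]
ne = [0.04, 0.04, 0.05, 0.04, 0.30, 0.35, 0.32]
threshold, error = best_split(pitch, ne)
```

A full tree repeats this choice recursively over all input variables (roll, pitch, yaw, and model ID) within each resulting subset.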

25
RQ3: Results
Representative rules derived from the decision tree for KP26
(M: Model-ID, P: Pitch, R: Roll, Y: Yaw)
Image Characteristics Condition                      NE
M = 9 ∧ P < 18.41                                    0.04
M = 9 ∧ P ≥ 18.41 ∧ R < −22.31 ∧ Y < 17.06           0.26
M = 9 ∧ P ≥ 18.41 ∧ R < −22.31 ∧ 17.06 ≤ Y < 19      0.71
M = 9 ∧ P ≥ 18.41 ∧ R < −22.31 ∧ Y ≥ 19              0.36
[Figure: (A) a test image satisfying the first condition, NE = 0.013; (B) a test image satisfying the third condition, NE = 0.89]
• Using these conditions, we performed a detailed analysis of the root causes of high NE values and found
that a shadow on the location of KP26 causes the high NE value
• The average MAE over all the trees is 0.01 (far below the IEE threshold of 0.05), with an average tree size of 25.7
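The four representative rules for KP26 can be read as a small lookup, sketched below. This assumes the four condition variables are, in order, Model-ID, Pitch, Roll, and Yaw as given in the legend; the function name is hypothetical, and inputs not covered by the four listed rules return `None` rather than a guessed value.

```python
from typing import Optional


def predicted_ne_kp26(model_id: int, pitch: float, roll: float, yaw: float) -> Optional[float]:
    """NE predicted by the four representative KP26 rules; None if no listed rule applies."""
    if model_id != 9:
        return None
    if pitch < 18.41:
        return 0.04
    if roll < -22.31:
        if yaw < 17.06:
            return 0.26
        if yaw < 19:
            return 0.71
        return 0.36
    return None


# Hypothetical inputs falling under the first and third rules.
low = predicted_ne_kp26(9, 10.0, 0.0, 0.0)
high = predicted_ne_kp26(9, 20.0, -25.0, 18.0)
```

Rules in this form are what make it possible to ask the simulator for more images in the high-NE region.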

26
RQ3: Implications
Knowing under what conditions severe mispredictions are occurring can help engineers in two
ways:
• Helps to assess the risks associated with individual key-points for specific conditions, in the
context of a specific application
• Enables the generation of specific test images, using the simulator, that are expected to cause
particularly severe mispredictions and can be used for retraining the DNN

27
Lessons Learned
• Automated test suite generation is indeed useful in practice
• Testing results helped IEE assess and improve the IEE-DNN
• They continuously enriched the dataset by adding more training images from diverse 3D face models
• They improved the IEE-DNN’s architecture by doubling the number of hidden layers to drastically increase its accuracy
• The results also helped IEE improve the simulator
• The detailed analysis of the testing results showed the labeled KP positions were not accurate; this was later fixed
• Understanding mispredictions is critical
• Such findings led IEE to better target their development resources to improve the driver’s gaze detection system rather
than just focusing on the IEE-DNN itself
• Simulation-based testing brings key benefits
• We can effectively generate as many different test images as needed, with ground truth

28
Conclusion
• We formalize the problem definition of KP-DNN testing and present an approach to automatically
generate test data for KP-DNNs with many independent outputs
• We empirically compare state-of-the-art, many-objective search algorithms and their variants
tailored for test suite generation
• We further investigate and demonstrate a way, based on regression trees, to learn the conditions,
in terms of image characteristics, that cause severe mispredictions for individual key-points

31
Pseudo-code: Many-Objective Search Algorithm
[Flowchart: Initialization → Calculating Objectives → Updating Archive and Objectives → (while budget remains: Offspring Generation → Calculating Objectives → Updating Archive and Objectives → Generating Next Generation) → Archive]
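The loop in this flowchart can be sketched in a few lines. This is a simplified illustration, not MOSA or FITEST: the Gaussian mutation, the ranking by best score on a still-uncovered objective, the generation-count budget, and every name here (including `toy_fitness`) are assumptions made for the example.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]


def many_objective_search(fitness: Callable[[Vector], Sequence[float]],
                          bounds: Sequence[Tuple[float, float]],
                          n_objectives: int,
                          threshold: float,
                          generations: int = 50,
                          pop_size: int = 10,
                          seed: int = 0) -> Dict[int, Tuple[Vector, float]]:
    """Return an archive mapping each objective to the best (input, score) found."""
    rng = random.Random(seed)
    archive: Dict[int, Tuple[Vector, float]] = {}
    uncovered = set(range(n_objectives))

    def update(population: List[Vector]) -> List[Tuple[Vector, Sequence[float]]]:
        scored = [(ind, fitness(ind)) for ind in population]  # calculate objectives
        for ind, scores in scored:
            for k, s in enumerate(scores):
                if k not in archive or s > archive[k][1]:
                    archive[k] = (ind, s)                     # update archive
                if s >= threshold:
                    uncovered.discard(k)                      # update objectives
        return scored

    # Initialization: random vectors within the given bounds.
    population = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scored = update(population)
    for _ in range(generations):                              # remaining budget
        if not uncovered:
            break
        # Offspring generation: Gaussian mutation, clamped to the bounds.
        offspring = [
            [min(hi, max(lo, x + rng.gauss(0, 0.1 * (hi - lo))))
             for x, (lo, hi) in zip(ind, bounds)]
            for ind in population
        ]
        scored = scored + update(offspring)
        # Next generation: rank by the best score on any still-uncovered objective.
        scored.sort(key=lambda t: max((t[1][k] for k in uncovered), default=0.0),
                    reverse=True)
        scored = scored[:pop_size]
        population = [ind for ind, _ in scored]
    return archive


def toy_fitness(v: Vector) -> List[float]:
    """Hypothetical stand-in: objective 0 grows with +x, objective 1 with -x."""
    x = v[0]
    return [max(0.0, x) / 30.0, max(0.0, -x) / 30.0]


archive = many_objective_search(toy_fitness, bounds=[(-30.0, 30.0)],
                                n_objectives=2, threshold=0.5)
```

The archive is kept separate from the population so that the best input found for an objective is never lost when selection moves on to other objectives.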

32
Applying Meta-heuristic Search Algorithm
Automatic test data generation using meta-heuristic search algorithms is widely studied in software testing
• Transform test data generation into an optimization problem and apply
meta-heuristics to cost-effectively solve it
[Figure: a Search Algorithm takes a fitness function f (prediction error for the IEE-DNN) and an initial solution xi (vector representation of facial image features, e.g., head posture) and returns the best solution x such that f(x) is the maximum (best facial image (features) that maximizes f)]

33
Incorrectly Predicted Key-Points
Originally, IEE defines a test input (image) to be unsafe if its NME (Normalized Mean Error) is greater than or equal to 0.05
• NME is the average error over all key-points
We define a key-point as “incorrectly predicted” if its normalized error is greater than or equal to 0.05
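A short worked example of why the per-key-point definition matters: an image can pass the NME check while one key-point is badly mispredicted. The error values and function names are hypothetical.

```python
from typing import List

THRESHOLD = 0.05


def nme(errors: List[float]) -> float:
    """Normalized Mean Error: the average normalized error over all key-points."""
    return sum(errors) / len(errors)


def incorrectly_predicted(errors: List[float]) -> List[int]:
    """1-based indices of key-points whose normalized error reaches the threshold."""
    return [i for i, e in enumerate(errors, start=1) if e >= THRESHOLD]


# Hypothetical image: 26 KPs predicted well, one KP badly off.
errors = [0.01] * 26 + [0.50]
image_nme = nme(errors)                  # (26 * 0.01 + 0.50) / 27, below 0.05
bad_kps = incorrectly_predicted(errors)  # only the last key-point is flagged
```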
