Revisiting the Notion of Diversity in Software Testing

Lionel Briand
SBFT 2023 Keynote
http://www.lbriand.info

Why Diversity?
• Diverse test cases
• Exercising the system to the largest extent possible within a
budget
• Increase probability of fault detection
• While working with incomplete knowledge
• Cost of acquiring information
• Missing information

Example: Fuzzing with AFL
Diversity mechanisms: Mutation, coverage
Credits: Antonio Morales, https://github.com/antonio-morales/Fuzzing101

Aspects of Diversity
Inputs → SUT → Outputs
Execution (internal):
- Structural coverage
- Model coverage (e.g., states)

Questions
• What aspects of diversity to focus on?
• Information access
• Information cost, e.g., execution time
• Context-dependent
• How to measure diversity?
• Representation (e.g., inputs)
• Distance measure, e.g., cosine, edit
• Computational cost
• Guidance, e.g., in search
• How to maximize diversity?
• Mutation, metaheuristic search, symbolic execution …
• Issues: cost, scalability, bias, effectiveness

Aspects of Diversity
• Inputs: No instrumentation, does not require the execution of
the SUT
• Outputs: No instrumentation, execution required but directly
characterizes the behavior of the SUT
• Internal SUT structure: Instrumentation, possibly modeling,
additional execution cost and significant data storage

Example: Testing DNNs
• Redundant or invalid inputs
• Labeling cost is high
• Domain-specific knowledge is
required to manually label test
inputs
• Cost of test execution can be
high
• Coverage ineffective
• Test selection based on inputs
Aghababaeyan et al., 2023

We want to test a DNN model with a fixed test budget.
• How can we automatically select a subset of candidate test inputs with high fault-revealing power to test DNNs?
• Black-box test selection based on input diversity.
Test inputs T → Black-box test selection method → Subset S ⊆ T

• No model execution
• No access to model internals or training set
• Studies show that proposed coverage measures for DNNs are
not associated with faults
• Solution: Geometric diversity of image features

Extracting Image Features
• VGG16 is a convolutional neural network trained on a
subset of the ImageNet dataset, a collection of over 14
million images belonging to 22,000 categories.
Features:
- Activation values after the last convolutional layer
- Characterize semantic elements such as shapes and colors

Geometric Diversity (GD)
• Given a dataset X and its corresponding feature vectors V,
the geometric diversity of a subset S ⊆ X is defined as the
hyper-volume of the parallelepiped spanned by the rows of
V_S, i.e., the feature vectors of the items in S. The larger the
volume, the more diverse the feature space of S.
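
The definition above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the feature vectors are plain Python lists, computes the volume from the Gram matrix V_S V_S^T (whose determinant is the squared volume), and returns the logarithm of the volume to avoid overflow. The function names are hypothetical.

```python
import math

def gram_matrix(rows):
    """Return G = V V^T for a list of equal-length feature vectors."""
    return [[sum(a * b for a, b in zip(u, v)) for v in rows] for u in rows]

def log_det(matrix, eps=1e-12):
    """Log of |det| via Gaussian elimination with partial pivoting.
    Returns -inf when the matrix is numerically singular."""
    a = [row[:] for row in matrix]
    n = len(a)
    total = 0.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < eps:
            return float("-inf")
        a[col], a[pivot] = a[pivot], a[col]
        total += math.log(abs(a[col][col]))
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return total

def geometric_diversity(features, subset):
    """Log hyper-volume of the parallelepiped spanned by the selected rows.
    volume = sqrt(det(V_S V_S^T)), so log volume = 0.5 * log det."""
    rows = [features[i] for i in subset]
    return 0.5 * log_det(gram_matrix(rows))
```

Two orthogonal feature vectors span a larger volume than two nearly parallel ones, and a subset containing the same vector twice collapses to zero volume (log volume of -inf), which is why redundant inputs are penalized.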

Measuring Diversity
• Representation and measure: Construct validity?
• Cost of computing diversity
• Guidance provided by diversity, e.g., test selection search

Example: Test Minimization
• Permanently remove redundant test cases in a test suite that are
unlikely to detect new faults
• Black-box versus white-box techniques
• FAST-R: Quick and black-box, but low fault detection rates
• ATM: Abstract Syntax Tree (AST)-based Test case Minimizer
• Motivation: Achieve a better trade-off between effectiveness and
efficiency than FAST-R
• Context: Minimization only applied to major releases

Example: ATM
• Representation: AST of pre-processed test code
• Tree similarity measures: top-down, bottom-up, combined, edit distance
• Common subtree isomorphism algorithms
• Top-down and bottom-up emphasize different aspects of similarity
between ASTs
Pipeline: Test suite → Pre-process test code → Transform test code to ASTs → Measure test case similarity (4 tree-based similarity measures) → Run search algorithms (GA & NSGA-II) → Minimized test suite
Pan et al., 2023

Example: ATM
• Alternatives evaluated in terms of Fault Detection Rate (FDR)
• Edit distance is expensive but offers good guidance
• Combined similarity not significantly different
• Multi-objective search more expensive
• Much higher fault detection than FAST-R, at the cost of significantly higher
execution time, though still practical up to a point
Search    Similarity measure                  FDR    Time
GA        Top-Down                            0.78   70.87
GA        Bottom-Up                           0.74   67.05
GA        Combined                            0.80   72.75
GA        Tree Edit Distance                  0.81   82.23
NSGA-II   Top-Down & Bottom-Up                0.78   235.41
NSGA-II   Combined & Tree Edit Distance       0.82   258.44

Example: Input Diversity in DNNs
• Alternative diversity measures: Geometric Diversity (GD),
Normalized Compression Distance (NCD), standard deviation (STD)
• Construct validity?
• Analysis:
• We study how diversity scores change while varying the number of classes or
concepts inside the images of the input sets.
• We assume that diversity scores should increase with the number of classes or
concepts that are present in an input set.
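
As an illustration of one of the measures listed above, the following is a minimal sketch of the standard pairwise Normalized Compression Distance, using zlib as the compressor. It is an assumption that a pairwise form and zlib are suitable here; the study cited on these slides applies NCD to whole input sets, which this sketch does not reproduce.

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data at maximum compression level."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Pairwise NCD: (C(xy) - min(C(x), C(y))) / max(C(x), C(y)).
    Values near 0 suggest x and y share most of their information."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

The measure needs no feature extraction, but its cost grows with input size and its sensitivity depends on the compressor, which is one reason its construct validity has to be checked empirically.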

Example: Input Diversity in DNNs
• Geometric Diversity shows a clear monotonic relationship
with the number of classes in the input set
Figure 8: Evolution of the diversity scores (GD, STD, NCD) for input sets from Cifar-10 and MNIST. Each boxplot shows the distribution of diversity scores of 20 input sets of size 100.
Source: IEEE Transactions on Software Engineering, DOI 10.1109/TSE.2023.3243522

Applications
• Test minimization, selection, prioritization
• Mutation analysis
• Identify boundaries in the input space, e.g., safe vs unsafe

MASS: CPS Mutation Testing
Figure (MASS pipeline, Cornejo et al., 2021): collect test data (code coverage); create mutants; compile mutants; remove equivalent/duplicate mutants based on compiler optimizations; sample mutants; execute prioritized subset of test cases; evaluate mutation score's confidence. Results: killed mutants, live mutants.
• Selection and prioritization of test cases based on statement coverage
• Test suite prioritization:
• Greedy algorithm
• Select next the test case that differs most from its most similar, already selected test case
• Test suite reduction: exclude test cases with perfect similarity
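
The greedy prioritization and the reduction rule above can be sketched as follows. This is an illustrative reading of the slide, not the MASS implementation: the function names are hypothetical, the distance function is supplied by the caller, and seeding the order with the first test case is an assumption.

```python
def prioritize(tests, distance):
    """Order tests so that each pick is the one farthest from its
    nearest (most similar) already selected test."""
    remaining = list(tests)
    if not remaining:
        return []
    ordered = [remaining.pop(0)]  # assumption: seed with the first test
    while remaining:
        best = max(remaining,
                   key=lambda t: min(distance(t, s) for s in ordered))
        remaining.remove(best)
        ordered.append(best)
    return ordered

def reduce_suite(tests, distance):
    """Keep a test only if it is not perfectly similar (distance 0)
    to a test that has already been kept."""
    kept = []
    for t in tests:
        if all(distance(t, k) > 0 for k in kept):
            kept.append(t)
    return kept
```

Each greedy step scans all remaining tests against all selected ones, so the ordering costs on the order of n^2 distance lookups per full pass; caching pairwise distances is the usual remedy when the distance itself is expensive.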

MASS: CPS Mutation Testing
• Compare the sets of source code statements covered by test cases: Jaccard and Ochiai
• Compare the number of times each statement has been covered by test cases: Euclidean, cosine
• Focus on functions in the source file where the mutated statement is located
• Best: Cosine distance
• Difference in mutation score < 5%
Reduction in mutation
analysis time > 70%
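
The four coverage-based measures named on this slide can be written down directly. The sketch below uses their textbook definitions and assumes coverage is given either as a set of covered statement ids (Jaccard, Ochiai) or as a dict mapping statement id to execution count (Euclidean, cosine); the exact variants used in MASS may differ.

```python
import math

def jaccard_distance(a: set, b: set) -> float:
    """1 - |A ∩ B| / |A ∪ B| over sets of covered statements."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def ochiai_distance(a: set, b: set) -> float:
    """1 - |A ∩ B| / sqrt(|A| * |B|) over sets of covered statements."""
    if not a or not b:
        return 0.0 if a == b else 1.0
    return 1.0 - len(a & b) / math.sqrt(len(a) * len(b))

def euclidean_distance(u: dict, v: dict) -> float:
    """Euclidean distance between per-statement execution counts."""
    keys = u.keys() | v.keys()
    return math.sqrt(sum((u.get(k, 0) - v.get(k, 0)) ** 2 for k in keys))

def cosine_distance(u: dict, v: dict) -> float:
    """1 - cosine similarity between per-statement execution counts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return 1.0 - dot / norm if norm else 0.0
```

The set-based measures ignore how often a statement is executed, while cosine distance ignores the overall scale of the counts: two tests that exercise the same statements in the same proportions are at distance 0 even if one loops twice as long.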

Explanations for DNN Errors (SEDE)
Can we explain DNN failures of real-world images
using simulator parameters?
Figure: A DNN is first trained and tested on simulator images (training set, test set). The trained DNN is then fine-tuned on a training set of real-world images, and the fine-tuned DNN is tested on a test set of real-world images, yielding real-world error-inducing images.

SEDE
Diagram elements: real-world error-inducing images, HUDD, RCCs, RCC prototype images, evolutionary algorithms, PaiR, simulator, configuration parameters, simulator images
Step 1. Identify root-cause clusters (RCCs)
Step 2. Generate images associated to RCCs
Step 2.1. Identify RCC Prototype Images
Step 2.2. Generate a set of unsafe images belonging to the cluster
Step 2.3. Generate one safe image for each unsafe image
HUDD (Fahmy et al., 2021):
Step 1. Heatmap-based clustering of error-inducing test set images into root cause clusters (C1, C2, C3)
Step 2. Inspection of a subset of cluster elements
Cluster 1 (angle ~157.5): borderline cases
Cluster 2 (near closed eyes): incomplete training set

SEDE
Real-world images → Cluster → Synthetic images → Parameters-based description (e.g., HeadPose_x > 10 & HeadPose_y > 50.34)
Retraining: +18.6% (improved DNN model)
S1: Diverse simulator images, within the cluster
S2: Diverse failing simulator images, close to these images
S3: Passing simulator images, close to failing ones in cluster
Process: Simulator-based Explanations for DNN Errors (SEDE)
Fahmy et al. 2022

WCET for Critical Tasks
• Real-time systems
• Schedulability analysis verifies time constraints for critical
tasks
• Early schedulability analysis and design decisions require
early task Worst Case Execution Time (WCET) estimates
• Challenges: Tasks not fully implemented, worst case
inputs unknown
• Goal: Estimating Probabilistic Safe WCET Ranges at Design
Stages

SAFE: WCET boundaries
• Safe WCET boundaries: implementation objectives, evaluate design options
• Iterative, distance-based sampling of WCET values within ranges
Phase 1. Worst-case task arrivals analysis (search): task descriptions → worst-case sequences of task arrivals
Phase 2. Safe WCET computation (learning): training dataset → safe and unsafe regions over WCET T1 and WCET T2
SAFE: Safe WCET Analysis method For real-time task schEdulability (Lee et al., 2022)

Use of Language Models
• Code (e.g., test) or trace vector representation (encoding)
• Benefit from pre-trained language models, e.g., CodeBERT
• Example: test case prioritization (Test2Vec, Jabbar et al., 2022), minimization
• Embedding test execution traces with fine-tuned CodeBERT
• Fine-tuned with pass/fail labels for past test cases in a system
• Test2Vec maps test execution traces, i.e., sequences of method calls with their
inputs and return values, to fixed-length, numerical vectors
• Heuristic: Similarity to previous failing test cases in the same project
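
The heuristic in the last bullet can be sketched as a ranking over embedding vectors. This is only an illustration under stated assumptions: the embeddings are taken as given (here, plain lists of floats rather than CodeBERT outputs), cosine similarity is assumed as the similarity measure, and the function names are hypothetical rather than part of Test2Vec.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (math.sqrt(sum(a * a for a in u))
            * math.sqrt(sum(b * b for b in v)))
    return dot / norm if norm else 0.0

def prioritize_by_failure_similarity(embeddings, failing):
    """Rank test names by their highest similarity to any embedding
    of a previously failing test case (most similar first)."""
    def score(name):
        return max((cosine_similarity(embeddings[name], f) for f in failing),
                   default=0.0)
    return sorted(embeddings, key=score, reverse=True)
```

Note that this use of similarity runs in the opposite direction to diversity-based selection: tests are ranked by closeness to known failures rather than by distance from each other.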

Test2Vec Architecture
Preprocessing (abstraction) → Embedding → Prediction (prioritization)

Conclusions
• Many applications of diversity in testing
• Various aspects warrant different solutions: information access,
execution cost, instrumentation cost
• Trade-off between representations (test cases) and
distance/similarity measures: computation cost, guidance
• Determining the best solution can only be done empirically, in a
well-defined (application) context
• Check assumptions and properties of distance/similarity
measures, e.g., desired sensitivity to change
• Scalability is usually the stumbling block for many applications

Selected References
• Cornejo et al., "Mutation Analysis for Cyber-Physical Systems: Scalable Solutions and Results in the Space Domain", IEEE Transactions on Software Engineering, 2021
• Fahmy et al., "Supporting DNN Safety Analysis and Retraining through Heatmap-based Unsupervised Learning", IEEE Transactions on Reliability, Special Section on Quality Assurance of Machine Learning Systems, 2021
• Fahmy et al., "Simulator-based Explanation and Debugging of Hazard-triggering Events in DNN-based Safety-critical Systems", ACM TOSEM, 2022
• Attaoui et al., "Black-box Safety Analysis and Retraining of DNNs based on Feature Extraction and Clustering", ACM TOSEM, 2022
• Pan et al., "ATM: Black-box Test Case Minimization based on Test Code Similarity and Evolutionary Search", IEEE/ACM ICSE, 2023
• Lee et al., "Estimating Probabilistic Safe WCET Ranges of Real-Time Systems at Design Stages", ACM TOSEM, 2022
• Jabbar et al., "Test2Vec: An Execution Trace Embedding for Test Case Prioritization", arXiv, 2022
• Aghababaeyan et al., "Black-Box Testing of Deep Neural Networks through Test Case Diversity", IEEE Transactions on Software Engineering, 2023
• Aghababaeyan et al., "DeepGD: A Multi-Objective Black-Box Test Selection Approach for Deep Neural Networks", https://arxiv.org/abs/2303.04878

Looking for Postdocs!