nanda-lab.ca
FlakyFix: Using Large Language Models for Predicting Flaky Test Fix Categories and Test Code Repair
Sakina Fatima, Hadi Hemmati, Lionel Briand
02-04-2025
Background: Software Testing
Software Testing is an essential activity that ensures software dependability. Test cases are
executed to detect bugs in the code under test.
Problem: Flaky Tests
Flaky tests intermittently pass and fail even for the same version of the source code
(i.e., non-deterministic test results)
Why Detect and Repair Flaky Tests?
❖ Test failures caused by flaky tests can be hard to reproduce, since repeated re-runs are required (computationally expensive)
❖ Flaky tests might hide real bugs in the source code
❖ Tests become unreliable
❖ Software releases might be delayed
❖ Flaky tests are hard to detect and fix manually, so developers tend to ignore them
Proposed Solution
FlakyFix: Black-Box Automated Repair of Flaky Tests
[Diagram: Flaky Test → Language Model → A Fix for Flaky Test]
*Black box: uses test case code only, with no access to the code under test. This research focuses on those flaky tests where part of the flakiness lies in the test code (10% of the overall flaky tests dataset).
1. Sakina Fatima, Taher A. Ghaleb, and Lionel Briand. Flakify: A black-box, language model-based predictor for flaky tests. IEEE Transactions on Software Engineering, 2022.
Proposed Approach
→ Definition of flaky test fix categories and labeling of flaky tests accordingly
• A set of heuristics
• First labeled dataset categorizing flaky tests by the type of fix needed
• Open-source script* to automatically label flaky tests based on their fixes
→ Prediction of flaky test fix category using pre-trained code language models
• Suggest to developers a type of fix they can implement to repair flaky tests
• Aid conventional large language models, e.g., GPT, in automatically generating the fix
→ Generation of a fully repaired flaky test using the predicted fix category and LLMs
• Attempt to generate a fully or semi-automated repair of flaky tests
Prediction of Flaky Test Fix Category
[Diagram: Flaky Test (cause of flakiness highlighted) → Model → predicts "Change Data Structure"]
HashMap should be replaced with LinkedHashMap to maintain the order in which elements are stored,
regardless of how many times the code is executed.
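To make the "Change Data Structure" category concrete, here is a minimal Java sketch. It is a hypothetical illustration and is not taken from the IDoFT dataset; the class name, method name, and keys are assumptions. It shows why an order-dependent check is only safe on a LinkedHashMap: HashMap makes no guarantee about iteration order, while LinkedHashMap iterates in insertion order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the "Change Data Structure" fix category.
final class ChangeDataStructureDemo {

    // Inserts three keys and returns them in the map's iteration order.
    static List<String> keysInIterationOrder(Map<String, Integer> map) {
        map.put("zeta", 1);
        map.put("alpha", 2);
        map.put("mid", 3);
        return new ArrayList<>(map.keySet());
    }

    public static void main(String[] args) {
        // HashMap: iteration order is unspecified, so a test asserting a
        // particular order may pass or fail depending on the environment.
        List<String> unspecified = keysInIterationOrder(new HashMap<>());

        // LinkedHashMap: iteration order is insertion order, so an
        // order-dependent assertion becomes deterministic.
        List<String> stable = keysInIterationOrder(new LinkedHashMap<>());

        System.out.println("HashMap order:       " + unspecified);
        System.out.println("LinkedHashMap order: " + stable);
    }
}
```

In a test, the repair amounts to swapping the map implementation; the assertions themselves stay unchanged.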
Generation of a Repaired Flaky Test using the Predicted Fix Category
Dataset
Used the International Dataset of Flaky Tests (IDoFT)
→ Largest available dataset of flaky tests where the cause of flakiness is in the test code
→ 562 flaky tests in Java and their developer-repaired fixes
→ Flaky tests belong to 123 different projects, which helps the generalizability of prediction models
Defining Flaky Test Fix Categories and Labeling the Dataset
→ Identified 13 different fix categories
Fix Category Prediction
Fine-tune pre-trained code models (CodeBERT and UniXcoder) for the task of flaky test
fix classification with two different techniques:
✓ Feed Forward Neural Network (FNN)
✓ Few-Shot Learning (recommended for smaller datasets)
Fix Category Prediction: Technique #1
Fine-tune pre-trained code models (CodeBERT and UniXcoder) for the task of
flaky test fix classification using a Feed Forward Neural Network (FNN).
Fix Category Prediction: Technique #2
Since the dataset is small (562 tests), we used Few-Shot Learning (FSL), which is popular for
smaller datasets.
Step 1: Fine-tune code models (UniXcoder and CodeBERT) using a Siamese network for
flaky test fix category classification
Fix Category Prediction: Technique #2
Step 2: Evaluate the trained Model through Few-Shot Learning:
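As a generic illustration only (not the authors' pipeline), few-shot evaluation with a Siamese-trained encoder is often done by embedding a small labeled support set and giving each query test the label of its most similar support example. The sketch below assumes hypothetical 3-dimensional embeddings, one support example per category, and cosine similarity; the class name, the vectors, and the label "Placeholder Category" are all made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Generic one-shot classification sketch over precomputed embeddings.
// All vectors and the "Placeholder Category" label are hypothetical.
final class FewShotSketch {

    // Cosine similarity between two vectors of equal length.
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns the label of the support embedding most similar to the query.
    static String predict(Map<String, double[]> support, double[] query) {
        String bestLabel = null;
        double bestSimilarity = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, double[]> entry : support.entrySet()) {
            double similarity = cosine(entry.getValue(), query);
            if (similarity > bestSimilarity) {
                bestSimilarity = similarity;
                bestLabel = entry.getKey();
            }
        }
        return bestLabel;
    }

    // Toy support set: one made-up embedding per fix category.
    static Map<String, double[]> toySupport() {
        Map<String, double[]> support = new LinkedHashMap<>();
        support.put("Change Data Structure", new double[] {1.0, 0.0, 0.1});
        support.put("Placeholder Category", new double[] {0.0, 1.0, 0.2});
        return support;
    }

    public static void main(String[] args) {
        double[] query = {0.9, 0.1, 0.0}; // made-up embedding of a flaky test
        System.out.println(predict(toySupport(), query));
    }
}
```

In practice the embeddings would come from the fine-tuned code model rather than being written by hand.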
Prediction Results for Fix Categories Using Code Models
Example of Flaky Test Fix Category Prediction
[Figure: Flaky Test → Model → "Change Data Structure"; panels: Flaky Test, Original Fixed Flaky Test]
Generating Fully Repaired Flaky Tests from GPT using Predicted Fix Categories
[Figure panels: Prompt 1; Prompt 2 (In-Context Learning)]
Example of Repaired Flaky Test from GPT
[Figure panels: Flaky Test (cause of flakiness highlighted); Generated Fixed Flaky Test without Fix Category (incorrect repair suggested)]
Example of Repaired Flaky Test from GPT (2)
[Figure panels: Fix Generation with Fix Category Label (repaired by GPT); Original Fix for Flaky Test (repaired by developer)]
Generation Results from GPT using Predicted Fix Category Labels
Generation Results from GPT using In-Context Learning
GPT-Generated Tests: Execution Results
• We ran a sample of 35 generated tests: 24 passed, 11 failed
• We conducted a series of analyses:
→ Overall, the average CodeBLEU score among passing tests is 94%; tests with higher
CodeBLEU scores are more likely to pass.
→ 16 GPT-fixed tests have a 100% CodeBLEU score, indicating an exact match
with the developer-repaired versions.
→ Bootstrapping: with a 95% confidence interval, 51% to 83% of GPT-fixed tests are
estimated to pass (helpful for testers).
→ Logistic regression: trained on the 35 executable tests to estimate the
passing rate among the non-executable tests; 80% accuracy.
→ Edit distance is calculated to assess the manual fixing effort for the 11 failed
tests; on average, 16% of tokens need to be replaced.
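The bootstrapped interval above can be illustrated with a short sketch. The exact bootstrap procedure is not given on the slide, so this assumes a plain percentile bootstrap over the 35 observed pass/fail outcomes (24 passed, 11 failed); the number of resamples, the seed, and all names are assumptions.

```java
import java.util.Arrays;
import java.util.Random;

// Percentile-bootstrap sketch for a pass rate; an assumed procedure,
// not necessarily the one used to produce the reported interval.
final class BootstrapPassRate {

    // Returns {lower, upper} bounds of a 95% percentile bootstrap interval.
    static double[] percentileInterval(int passed, int total, int resamples, long seed) {
        Random rng = new Random(seed);
        double[] rates = new double[resamples];
        for (int r = 0; r < resamples; r++) {
            int hits = 0;
            // Draw `total` outcomes with replacement from the observed sample;
            // indices below `passed` stand for the passing tests.
            for (int i = 0; i < total; i++) {
                if (rng.nextInt(total) < passed) {
                    hits++;
                }
            }
            rates[r] = (double) hits / total;
        }
        Arrays.sort(rates);
        int lowerIndex = (int) Math.floor(0.025 * (resamples - 1));
        int upperIndex = (int) Math.ceil(0.975 * (resamples - 1));
        return new double[] {rates[lowerIndex], rates[upperIndex]};
    }

    public static void main(String[] args) {
        double[] interval = percentileInterval(24, 35, 10_000, 42L);
        System.out.printf("95%% interval for pass rate: %.3f to %.3f%n",
                interval[0], interval[1]);
    }
}
```

Because the sample has only 35 tests, each resampled rate is a multiple of 1/35, so the interval bounds move in coarse steps.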
GPT-Generated Tests: Execution Results
Passing estimates for the non-executable tests in both the 181-test and 131-test datasets,
based on the trained logistic regression model:
Practical Implications
How do we envision our approach to be used in practice?
• Deploy in Continuous Integration (CI) environments to repair a flaky test without the developer's
explicit command.
• Guide developers on possible causes of flakiness that need to be addressed through test
smell and fix label information.
• Reduce the manual effort to fix tests even when the GPT repair is not fully correct
(semi-automated repair approach).
Summary
Avast Premium Security Crack FREE Latest Version 2025
mu394968
 
Adobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage Dashboards
Adobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage DashboardsAdobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage Dashboards
Adobe Marketo Engage Champion Deep Dive - SFDC CRM Synch V2 & Usage Dashboards
BradBedford3
 
Landscape of Requirements Engineering for/by AI through Literature Review
Landscape of Requirements Engineering for/by AI through Literature ReviewLandscape of Requirements Engineering for/by AI through Literature Review
Landscape of Requirements Engineering for/by AI through Literature Review
Hironori Washizaki
 
Why Orangescrum Is a Game Changer for Construction Companies in 2025
Why Orangescrum Is a Game Changer for Construction Companies in 2025Why Orangescrum Is a Game Changer for Construction Companies in 2025
Why Orangescrum Is a Game Changer for Construction Companies in 2025
Orangescrum
 
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Andre Hora
 
How to Optimize Your AWS Environment for Improved Cloud Performance
How to Optimize Your AWS Environment for Improved Cloud PerformanceHow to Optimize Your AWS Environment for Improved Cloud Performance
How to Optimize Your AWS Environment for Improved Cloud Performance
ThousandEyes
 
Adobe Master Collection CC Crack Advance Version 2025
Adobe Master Collection CC Crack Advance Version 2025Adobe Master Collection CC Crack Advance Version 2025
Adobe Master Collection CC Crack Advance Version 2025
kashifyounis067
 
Exploring Code Comprehension in Scientific Programming: Preliminary Insight...
Exploring Code Comprehension  in Scientific Programming:  Preliminary Insight...Exploring Code Comprehension  in Scientific Programming:  Preliminary Insight...
Exploring Code Comprehension in Scientific Programming: Preliminary Insight...
University of Hawai‘i at Mānoa
 
Not So Common Memory Leaks in Java Webinar
Not So Common Memory Leaks in Java WebinarNot So Common Memory Leaks in Java Webinar
Not So Common Memory Leaks in Java Webinar
Tier1 app
 
Societal challenges of AI: biases, multilinguism and sustainability
Societal challenges of AI: biases, multilinguism and sustainabilitySocietal challenges of AI: biases, multilinguism and sustainability
Societal challenges of AI: biases, multilinguism and sustainability
Jordi Cabot
 
Revolutionizing Residential Wi-Fi PPT.pptx
Revolutionizing Residential Wi-Fi PPT.pptxRevolutionizing Residential Wi-Fi PPT.pptx
Revolutionizing Residential Wi-Fi PPT.pptx
nidhisingh691197
 
PDF Reader Pro Crack Latest Version FREE Download 2025
PDF Reader Pro Crack Latest Version FREE Download 2025PDF Reader Pro Crack Latest Version FREE Download 2025
PDF Reader Pro Crack Latest Version FREE Download 2025
mu394968
 
Ad

FlakyFix: Using Large Language Models for Predicting Flaky Test Fix Categories and Test Code Repair

  • 1. nanda-lab.ca FlakyFix: Using Large Language Models for Predicting Flaky Test Fix Categories and Test Code Repair. Sakina Fatima, Hadi Hemmati, Lionel Briand. 02-04-2025
  • 2. Background: Software Testing. Software testing is an essential activity that ensures software dependability. Test cases are executed to detect bugs in the code under test.
  • 3. Problem: Flaky Tests. Flaky tests intermittently pass and fail even for the same version of the source code (i.e., non-deterministic test results). Why detect and repair flaky tests? ❖ Failures caused by flaky tests are hard to reproduce because repeated re-runs are required, which is computationally expensive ❖ Flaky tests might hide real bugs in the source code ❖ Tests become unreliable ❖ Software releases might be delayed ❖ Flaky tests are hard to detect and fix manually, so developers ignore them
  • 4. FlakyFix: Black-Box Automated Repair of Flaky Tests. Proposed solution: Flaky Test → Language Model → A Fix for Flaky Test. *Black box: uses the test case code only, with no access to the code under test. This research focuses on flaky tests where part of the flakiness lies in the test code (10% of the overall flaky tests dataset). 1. Sakina Fatima, Taher A Ghaleb, and Lionel Briand. Flakify: A black-box, language model-based predictor for flaky tests. IEEE Transactions on Software Engineering, 2022.
  • 5. Proposed Approach → Definition of flaky test fixes and labeling of flaky tests accordingly • A set of heuristics • First labeled dataset categorizing flaky tests by the type of fix needed • Open-source script* to automatically label flaky tests based on their fixes → Prediction of the flaky test fix category using pre-trained code language models • Suggest to developers a type of fix they can implement to repair flaky tests • Aid conventional large language models, e.g., GPT, in automatically generating the fix → Generation of a fully repaired flaky test using the predicted fix category and LLMs • Attempt a fully or semi-automated repair of flaky tests
  • 6. Prediction of Flaky Test Fix Category. The model predicts the fix category "Change Data Structure". Cause of flakiness: HashMap should be replaced with LinkedHashMap to maintain the order in which elements are stored, regardless of how many times the code is executed.
  • 7. Generation of a Repaired Flaky Test using the Predicted Fix Category
  • 8. Dataset Used: International Dataset of Flaky Tests (IDoFT) → Largest available dataset of flaky tests where the cause of flakiness is in the test code → 562 flaky tests in Java and their developer-repaired fixes → The flaky tests belong to 123 different projects, which helps the generalizability of prediction models
  • 9. Defining Flaky Test Fix Categories and Labeling the Dataset → Detected 13 different categories of fix
  • 10. Fix Category Prediction. Fine-tune pre-trained code models, i.e., CodeBERT and UniXcoder, for the task of flaky test fix classification with two different techniques: ✓ Feed Forward Neural Network (FNN) ✓ Few-Shot Learning (recommended for smaller datasets)
  • 11. Fix Category Prediction: Technique #1. Fine-tune pre-trained code models, i.e., CodeBERT and UniXcoder, for the task of flaky test fix classification using a Feed Forward Neural Network (FNN).
  • 12. Fix Category Prediction: Technique #2. Since the dataset is small (562 tests), we used Few-Shot Learning (FSL), which is popular for smaller datasets. Step 1: Fine-tune the code models (UniXcoder and CodeBERT) using a Siamese network for flaky test fix category classification
  • 13. Fix Category Prediction: Technique #2. Step 2: Evaluate the trained model through Few-Shot Learning
  • 14. Prediction Results for Fix Categories Using Code Models
  • 15. Example of Flaky Test Fix Category Prediction: the model predicts "Change Data Structure" for the flaky test (shown alongside the original fixed flaky test)
  • 16. Generating Fully Repaired Flaky Tests from GPT using Predicted Fix Categories: Prompt 1 and Prompt 2 (In-Context Learning)
  • 17. Example of Repaired Flaky Test from GPT: flaky test, cause of flakiness, and the fixed flaky test generated without a fix category (incorrect repair suggested)
  • 18. Example of Repaired Flaky Test from GPT (2): fix generation with the fix category label, comparing the original fix for the flaky test (repaired by developer) with the version repaired by GPT
  • 19. Generation Results from GPT using Predicted Fix Category Labels
  • 20. Generation Results from GPT using In-Context Learning
  • 21. GPT-Generated Tests: Execution Results • We ran a sample of 35 generated tests: 24 passed, 11 failed • We conducted a series of analyses: → Overall, the average CodeBLEU score among passing tests is 94%; tests with higher CodeBLEU scores are more likely to pass. → 16 GPT-fixed tests have a 100% CodeBLEU score, indicating an exact match with the developer-repaired versions. → Bootstrapping: with a 95% confidence interval, 51% to 83% of GPT-fixed tests are estimated to pass (helpful for testers). → Logistic regression: trained on the 35 executable tests to estimate the pass rate among the non-executable tests; 80% accuracy. → Edit distance is calculated to assess the manual fixing effort for the 11 failed tests; on average, 16% of tokens need to be replaced.
  • 22. GPT-Generated Tests: Execution Results. Based on the trained logistic regression model, passing estimates for non-executable tests for both the 181-test and 131-test datasets
  • 23. Practical Implications. How do we envision our approach being used in practice? • Deploy in Continuous Integration (CI) environments to repair a flaky test without the developer's explicit command. • Guide developers on possible causes of flakiness that need to be addressed, through test smells and fix label information. • Reduce the manual effort to fix tests even when the GPT repair is not fully correct (a semi-automated repair approach).
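
The "Change Data Structure" fix described on slide 6 can be illustrated with a minimal Java sketch. The class name, method names, and map contents below are hypothetical and are not taken from the IDoFT dataset. The sketch relies only on the documented behavior that LinkedHashMap iterates in insertion order, whereas HashMap makes no ordering guarantee, so a test asserting on HashMap iteration order may pass or fail depending on the runtime.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical example: names and data are illustrative only.
class OrderedMapExample {

    // Builds the map the test inspects. Using LinkedHashMap (instead of
    // HashMap) makes iteration follow insertion order on every run.
    static Map<String, Integer> buildScores() {
        Map<String, Integer> scores = new LinkedHashMap<>();
        scores.put("zeta", 3);
        scores.put("alpha", 1);
        scores.put("mid", 2);
        return scores;
    }

    // Returns the keys in the order the map iterates over them.
    static List<String> keysInIterationOrder(Map<String, Integer> map) {
        return new ArrayList<>(map.keySet());
    }

    public static void main(String[] args) {
        // An order-sensitive assertion on this list is deterministic.
        System.out.println(keysInIterationOrder(buildScores()));
        // prints [zeta, alpha, mid]
    }
}
```

Whether swapping the data structure is the right repair for a given test still depends on where the map is created; this sketch assumes the map is built inside the test code, which is the black-box setting the slides describe.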
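
The bootstrapping step on slide 21 can be sketched as follows. The slides do not say which bootstrap variant was used, so this is an assumption: a percentile bootstrap over the 35 executed tests (24 passed, 11 failed), with a hypothetical class name, resample count, and random seed. It is not the authors' script, and the exact bounds it prints depend on the seed and number of resamples.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical sketch of a percentile bootstrap for a pass rate.
class BootstrapPassRate {

    // Resamples the outcomes with replacement, computes the pass rate of
    // each resample, and returns the 2.5th and 97.5th percentiles.
    static double[] confidenceInterval95(boolean[] outcomes, int resamples, long seed) {
        Random rng = new Random(seed);
        int n = outcomes.length;
        double[] rates = new double[resamples];
        for (int r = 0; r < resamples; r++) {
            int passed = 0;
            for (int i = 0; i < n; i++) {
                if (outcomes[rng.nextInt(n)]) {
                    passed++;
                }
            }
            rates[r] = (double) passed / n;
        }
        Arrays.sort(rates);
        int lo = (int) Math.floor(0.025 * (resamples - 1));
        int hi = (int) Math.ceil(0.975 * (resamples - 1));
        return new double[] { rates[lo], rates[hi] };
    }

    // The sample reported on the slide: 35 executed tests, 24 passing.
    static boolean[] reportedSample() {
        boolean[] outcomes = new boolean[35];
        Arrays.fill(outcomes, 0, 24, true);
        return outcomes;
    }

    public static void main(String[] args) {
        // Resample count and seed are arbitrary choices for this sketch.
        double[] ci = confidenceInterval95(reportedSample(), 10_000, 42L);
        System.out.printf("95%% CI for pass rate: %.2f to %.2f%n", ci[0], ci[1]);
    }
}
```

The observed pass rate is 24/35, about 69%, and the interval brackets that value; the width reflects how small the executed sample is.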