A Hybrid Machine Learning Approach for Enhanced Software Defect Prediction Through Optimized Feature Selection
Abstract – This project aims to improve software defect prediction (SDP) using machine
learning methods that enhance the accuracy and efficiency of predicting software defects.
The major goal is to create a powerful model that uses advanced algorithms for feature
selection, which is essential for optimizing predictive performance. We present a
hybrid that combines the Arithmetic Optimization Algorithm (AOA) with a multilayer
perceptron (MLP), a form of artificial neural network (ANN) well recognized for its
ability to learn intricate data patterns. The new AOA-MLP model
addresses critical issues in existing SDP techniques, including excessive time
complexity and the curse of dimensionality, by efficiently selecting meaningful features
out of a large set of potential predictors. We assess the performance of the model through
rigorous experimentation on real software defect datasets and attain high training
and testing accuracies, together with a high ROC-AUC score that demonstrates its ability
to distinguish between faulty and non-faulty software components. This research
contributes to the field of software quality assurance by providing empirical evidence
supporting the use of machine learning techniques, particularly feature reduction
strategies, to enhance the accuracy and relevance of software defect prediction. The
findings are further illustrated through confusion matrices, which show a reduction in
false positives and false negatives and thereby improve the overall reliability of the
prediction model. This work aims to advance quality assurance practices in software development,
ultimately resulting in higher software performance and reduced maintenance costs.
1. Introduction
This section highlights the growing importance of software defect prediction (SDP) and
the transformative role that machine learning (ML) is playing in
improving software system performance. Identifying defects early is a crucial part of
software development, as undetected issues can lead to serious consequences such as
security vulnerabilities, costly fixes, or poor user experience. Early detection helps us to
ensure that software is both reliable and user-friendly, and industry reports suggest that
failing to catch these issues early can result in significant financial losses.
As software becomes more complex, the need for accurate and efficient defect prediction
becomes even more critical. Traditional methods rely heavily on manual efforts, which
are not only time-consuming but also prone to human error. In contrast, machine learning
techniques, particularly those that incorporate automation, have shown promise
in identifying potential defects more accurately and efficiently. These approaches can
lead to improved software quality while also reducing development costs. However, the
effectiveness of these models depends largely on selecting the right features and
achieving high classification accuracy. This makes the careful selection of relevant
software metrics a key factor in building robust and reliable prediction models.
The Role of Feature Selection in Defect Prediction
Feature selection (FS) is a very important part of defect prediction, since useless or
redundant features can lower the accuracy of the model. The aim is to concentrate on the
features that contribute the most to defect prediction. IEEE defines a defect as any
departure from the normal behavior of a software program due to faulty action or
information. Prediction techniques have been developed over time based on quantitative
factors such as the internal structure of software and past defect history. Yet the
community has not agreed on which characteristics are best suited for defect prediction,
and current research stresses that irrelevant characteristics should be eliminated using
FS methods. Feature selection methods can substantially improve model performance by
eliminating such features, which in turn improves classification efficiency and decreases
computational complexity.
The AOA-MLP Model: A Hybrid Method for Feature Selection and Prediction
In light of the shortcomings of existing models, this article introduces a novel
integrated method that merges the Arithmetic Optimization Algorithm (AOA) with the
Multilayer Perceptron (MLP) classifier for better defect prediction. The Arithmetic
Optimization Algorithm is a metaheuristic based on arithmetic operators that balances
exploration and exploitation to find the best subset of features. This balance makes
AOA especially suitable for feature selection, as it can search large search
spaces effectively to find relevant features.
By applying AOA for feature selection, the proposed model seeks to reduce the
dimensionality of the feature space while retaining the most essential
metrics for software defect prediction. This results in more precise and effective
predictions, especially for complicated datasets that could otherwise be hard to handle.
The paper also demonstrates the efficacy of the developed AOA-MLP model through its
evaluation on four real-world datasets: CM1 (Component Module), PC1 (Project
Component), PC2, and JDT (Java Development Tools). These datasets represent various
software development scenarios and offer a solid test bed for the model. The findings
show that the AOA-MLP model performs
better than other methods on important performance measures such as precision,
recall, and accuracy. The computational complexity of the model is also reduced
because of the effective feature selection process facilitated by AOA.
Key Contributions
1. Hybrid Model Design: A major contribution of this work is the introduction of a hybrid
AOA-MLP model for defect prediction, which addresses the limitations of earlier
approaches by combining two powerful techniques.
2. Optimized Feature Selection: By using the Arithmetic Optimization Algorithm (AOA),
the model is able to select the most relevant software metrics. This not only boosts
prediction accuracy but also avoids unnecessary computational load.
3. Real-World Validation: The model’s performance has been tested across multiple real-
world datasets, demonstrating its reliability and consistency in a variety of software
development environments.
4. Enhanced Accuracy and Efficiency: Compared to existing methods, the proposed
model delivers improved precision, recall, and overall accuracy, making it a strong
candidate for practical defect prediction applications.
In summary, the AOA-MLP model presents an innovative approach to a long-standing
challenge in software engineering. By blending the global optimization capabilities of the
AOA algorithm with the predictive power of the Multilayer Perceptron (MLP), this
hybrid model significantly enhances the performance of defect prediction systems. It
offers a promising tool for software developers and QA teams aiming to build more
reliable and high-quality software.
2. Literature Review
Author(s) Name | Area of work / Focus | Methodology used / Approach | Key findings / Contributions | Drawbacks
3. Modeling
The following part describes the NASA MDP repository datasets CM1, PC1,
PC2, and KC1 and summarizes the techniques and methods used for training the
model.
3.1 Dataset
The NASA MDP repository contains datasets such as CM1, PC1, PC2, and KC1, which consist
of metrics for spacecraft flight software written in C. These metrics, obtained with the
Halstead and McCabe approaches, measure the characteristics and quality of the software
and, on further analysis, are used to forecast defects in the software. The datasets label
modules as either Defective (D) or Non-defective (ND), with proportions established in the
documentation.
The Arithmetic Optimization Algorithm is a meta-heuristic optimization strategy inspired
by arithmetic operators. AOA emulates the behavior of the
arithmetic operations, namely addition, subtraction, multiplication, and division, to
efficiently explore and exploit the solution space. AOA is used in optimization problems,
such as software defect prediction, to leverage global search abilities and randomization
to arrive at near-optimal solutions. AOA has been reported to balance exploration and
exploitation more efficiently than other algorithms. For defect prediction on the NASA
datasets, AOA can be hybridized with machine learning algorithms such as the Multilayer
Perceptron (MLP). This hybrid solution can improve the classification of modules as
defective or non-defective and help solve other
sophisticated optimization problems associated with the datasets.
Initialization phase
In AOA, the optimization process starts with an initial set of candidate
solutions (X), which is generated randomly. The best candidate solution
found in each iteration is carried into the subsequent iteration as the best,
or nearly optimal, solution obtained so far.
● UB and LB in the above formula stand for the upper and lower boundaries of
the search space.
● R1 is a random number in the range [0,1].
● MOA (Math Optimizer Acceleration) dynamically controls the transition from
exploration to exploitation, computed as:
MOA = (t/T)^λ (2)
where t is the current iteration, T is the maximum number of iterations, and λ is a
parameter that adjusts the convergence behavior.
Exploitation Phase
Xi(t+1) = Xbest(t)+r2×(Xbest(t)−Xi(t))×MOA (3)
Foraging Energy Concept in AOA: By analogy with the foraging energy decay concept, AOA
uses the math optimizer probability (MOP), which controls the balance
between exploration and exploitation:
MOP = sin(π/2 × (1 − t/T)) (4)
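As a minimal illustration of equations (2) to (4), the following Python sketch evaluates the two control schedules and applies one exploitation update. The function names, the use of a single scalar r2 per update, and the clipping of the result to [lb, ub] are assumptions made for illustration; they are not taken from the implementation used in this work.

```python
import math
import random


def moa(t, T, lam):
    """Math Optimizer Acceleration, equation (2): (t/T)^lambda."""
    return (t / T) ** lam


def mop(t, T):
    """Math Optimizer Probability, equation (4): sin(pi/2 * (1 - t/T))."""
    return math.sin(math.pi / 2 * (1 - t / T))


def exploit_step(x_i, x_best, t, T, lam, rng, lb=0.0, ub=1.0):
    """One exploitation update following equation (3), dimension by dimension.

    Assumptions: r2 is drawn once per update, and the new position is
    clipped to the search bounds [lb, ub].
    """
    r2 = rng.random()
    a = moa(t, T, lam)
    return [min(ub, max(lb, xb + r2 * (xb - xi) * a))
            for xi, xb in zip(x_i, x_best)]


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the run is repeatable
    T, lam = 10, 2.0         # hypothetical settings
    for t in (1, 5, 10):
        print(t, round(moa(t, T, lam), 3), round(mop(t, T), 3))
    print(exploit_step([0.2, 0.8], [0.6, 0.4], t=5, T=T, lam=lam, rng=rng))
```

With λ greater than 1, MOA stays small in early iterations and rises toward 1 as t approaches T, while MOP decays from 1 to 0 over the same range.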
4. Multilayer perceptron
Machine learning problems are generally divided into three core categories: supervised,
semi-supervised, and unsupervised learning. These categories are distinguished by
the type of problem being addressed and the nature of the data. In the context of
software defect prediction (SDP), the focus is primarily on classification, specifically
identifying whether a software component is defective (D) or non-defective (ND). Over
the years, numerous artificial intelligence (AI) techniques have been introduced to tackle
classification tasks, including decision trees, logistic regression, and random forests.
Among these, supervised learning approaches have shown strong performance in SDP.
One popular method within this group is the Artificial Neural Network (ANN), which has
been applied effectively to classification tasks in previous studies [1].
To develop and validate classification models in SDP, datasets such as PC1 and CM1 are
frequently used. A typical approach involves splitting the available data into two sets for
training and testing to evaluate how well the model generalizes to unseen data. Before
training begins, it's important to set key parameters—such as the learning rate, weight
initialization, and the number of hidden layers—to ensure effective model learning. In
classification problems like this, the sigmoid activation function is often used due to its
ability to map outputs to probabilities between 0 and 1. As displayed in Fig. 2, the
sigmoid function helps the model estimate the likelihood that a given software
instance belongs to the defective or non-defective class.
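The mapping from a raw network output to a class probability can be sketched as follows. This is an illustrative example only: the helper names and the 0.5 decision threshold are assumptions and are not stated in this work.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_label(z, threshold=0.5):
    """Return ("D" or "ND", probability) for a raw output z.

    The threshold of 0.5 is an assumed default, not a value from the paper.
    """
    p = sigmoid(z)
    return ("D" if p >= threshold else "ND"), p


if __name__ == "__main__":
    for z in (-3.0, 0.0, 2.5):
        print(z, predict_label(z))
```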
The final layer, used for output, is Y = [y1, y2], which corresponds to classification
categories such as defective or non-defective. The model relies on various parameters,
including weight values, defined as:
W = [w1, w2, w3, …, wn] (6)
Key components include the learning rate (β) and the biases associated with different
neurons: for example, b₀ⱼ represents the bias for the jᵗʰ hidden unit, while v₀ₖ is the
bias for the kᵗʰ output unit. The net input to the jᵗʰ hidden node is calculated as:
associated weights:
The model adjusts its output using an error correction mechanism, where the error is
calculated based on the difference between the predicted output and the actual class
(either Non-Defective (ND) or Defective (D)):
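The net-input and error computations described above can be sketched in plain Python. The sketch uses the standard textbook forms (net input equals bias plus the weighted sum of inputs, error equals target minus output) as an assumption, since the corresponding equations are not reproduced here; the variable names W, V, b0 and v0 mirror the notation of this section, and the layer sizes are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W, b0, V, v0):
    """Forward pass of a one-hidden-layer MLP.

    x      : list of input features
    W[i][j]: weight from input i to hidden unit j
    b0[j]  : bias of hidden unit j
    V[j][k]: weight from hidden unit j to output unit k
    v0[k]  : bias of output unit k
    """
    hidden = [sigmoid(b0[j] + sum(x[i] * W[i][j] for i in range(len(x))))
              for j in range(len(b0))]
    output = [sigmoid(v0[k] + sum(hidden[j] * V[j][k] for j in range(len(hidden))))
              for k in range(len(v0))]
    return hidden, output


def output_error(target, output):
    """Per-output error: actual class indicator minus predicted value."""
    return [t - y for t, y in zip(target, output)]


if __name__ == "__main__":
    x = [0.2, 0.7, 0.1]                           # three hypothetical metrics
    W = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]    # 3 inputs -> 2 hidden units
    b0 = [0.0, 0.1]
    V = [[0.3, -0.3], [-0.2, 0.2]]                # 2 hidden -> 2 outputs (D, ND)
    v0 = [0.0, 0.0]
    _, y = forward(x, W, b0, V, v0)
    print(y, output_error([1, 0], y))
```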
The sequence of tasks of the AOA-MLP model is displayed in Fig. 4, showcasing its
application to datasets from the NASA MDP repository. This iterative process enables
the model to continuously learn from and adapt to the intricacies of the software data,
eventually leading to a more accurate and robust software defect prediction (SDP)
system. The flowchart emphasizes the adaptive and cyclic nature of the AOA-MLP
framework, which effectively combines the Arithmetic Optimization Algorithm
(AOA) with a Multilayer Perceptron (MLP). By leveraging AOA’s adaptive search
strategy, the model enhances feature selection, thereby improving the predictive
performance of the MLP classifier.
Fig. 4: Workflow of AOA MLP model
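The cyclic workflow can be illustrated with a simplified wrapper-style search over feature subsets. This is a sketch under stated assumptions, not the implementation used in this work: candidate positions in [0, 1] are thresholded at 0.5 to obtain a feature mask, the exploration move is replaced by uniform resampling because the exploration equation is not reproduced here, the switch between the two phases compares a random number with MOA, and `fitness` is a hypothetical callback (in AOA-MLP it would typically be the validation score of an MLP trained on the selected columns).

```python
import random


def aoa_select(fitness, n_features, pop_size=10, T=30, lam=2.0, seed=0):
    """Simplified AOA-style feature selection (illustrative only).

    fitness: callable taking a list of booleans (feature mask) and
             returning a score to maximize.
    Returns the best mask found and its score.
    """
    rng = random.Random(seed)

    def to_mask(position):
        return [v > 0.5 for v in position]

    pop = [[rng.random() for _ in range(n_features)] for _ in range(pop_size)]
    best = list(max(pop, key=lambda p: fitness(to_mask(p))))
    best_fit = fitness(to_mask(best))

    for t in range(1, T + 1):
        moa = (t / T) ** lam                      # equation (2)
        for i, x in enumerate(pop):
            if rng.random() > moa:
                # Assumed stand-in for exploration: uniform resampling.
                new = [rng.random() for _ in range(n_features)]
            else:
                # Exploitation update, equation (3), clipped to [0, 1].
                r2 = rng.random()
                new = [min(1.0, max(0.0, b + r2 * (b - v) * moa))
                       for v, b in zip(x, best)]
            pop[i] = new
            score = fitness(to_mask(new))
            if score > best_fit:
                best, best_fit = list(new), score
    return to_mask(best), best_fit


if __name__ == "__main__":
    # Toy fitness: features 0 and 2 are useful, every other feature costs 0.5.
    def toy_fitness(mask):
        useful = sum(1 for i in (0, 2) if mask[i])
        noise = sum(1 for i, m in enumerate(mask) if m and i not in (0, 2))
        return useful - 0.5 * noise

    print(aoa_select(toy_fitness, n_features=5))
```

Keeping the classifier behind a callback separates the search logic from the model, so the same loop can be reused with any scoring function.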
5. Metrics Evaluation
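1. Accuracy: Accuracy is the proportion of correctly classified instances among all
instances evaluated by the model. It is calculated as:
Accuracy = (TP + TN) / (TP + FN + FP + TN) (15)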
Here, TP denotes true positives, TN true negatives, FP false positives, and FN false
negatives. Improving accuracy is critical in
reducing the cost and effort associated with software testing and in enhancing
decision-making in software defect detection projects.
2. Precision: Precision is the fraction of correctly predicted positive events among
all events that are predicted positive. It is calculated using the formula:
Precision = TP / (TP + FP) (16)
3. Recall: Recall is the fraction of actual positive events that are correctly
predicted as positive. It is calculated using the formula:
Recall = TP / (TP + FN) (17)
This metric is essential for ensuring that the AOA-MLP model does not miss true defects,
even if some defects may not be classified perfectly.
4. F1-Score: The F1-score integrates precision and recall into one combined measure,
providing a balanced view of model performance. It has a value between 0 and 1,
with the ideal score being 1. It is given by:
F1-Score = 2 × (Precision × Recall) / (Precision + Recall) (18)
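The metrics in this section can all be computed directly from the four confusion-matrix counts. The following sketch uses the standard definitions; the function name and the counts in the usage example are hypothetical and are not results from this study. Returning 0.0 when a denominator is zero is a convention chosen here for illustration.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    print(classification_metrics(tp=30, tn=50, fp=10, fn=10))
```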
2. Experimental Evaluation
The experiments for the software fault prediction project were implemented using Python
3, with all code developed and executed within the Spyder Integrated Development
Environment (IDE) [1]. The machine learning pipeline made extensive use of well-
established Python libraries, including imbalanced-learn, scikit-learn, pandas, and
matplotlib. All tests were carried out on a system running Windows 10 Pro, powered by
an Intel Core i7 vPro (7th Generation) processor, with 8 GB of RAM, a 1 TB hard disk,
and a 64-bit architecture. This setup offered adequate computational resources to support
the training, evaluation, and visualization phases of the model development process.
This section reports the performance of the AOA-MLP model on the
CM1 dataset of the NASA Metrics Data Program (MDP) repository. Both statistical
measures and visual inspection are used to ascertain the effectiveness of the
model's classification. The confusion matrix gives a complete picture of the model's
ability to classify faulty and non-faulty software modules. Here, rows signify the
actual class labels and columns indicate the predicted classes. The AOA-MLP
model was usually accurate in locating faulty instances, producing a high number of true
positives and few false positives. However, it struggled to mark some non-faulty
instances correctly, and a number of them were falsely predicted as faulty. This
implies that classification performance was good but could be improved slightly with
further fine-tuning. To gain deeper insight into learning, training and validation
accuracy were plotted over the training epochs. These plots show how well the model
generalizes as training proceeds and reveal any evidence of overfitting or underfitting.
Additionally, training and validation loss curves were examined, giving better insight
into how the model optimizes its internal parameters as it learns. Together, these
visualizations provide a clear picture of how the model trains and converges.
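The layout described above, with rows for the actual class and columns for the predicted class, can be built from two label lists as in the following sketch. The function name, the label order and the example labels are hypothetical and serve only to illustrate the convention.

```python
def confusion_matrix(actual, predicted, labels=("D", "ND")):
    """Count matrix with rows = actual class, columns = predicted class."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


if __name__ == "__main__":
    # Hypothetical labels for illustration only.
    actual = ["D", "D", "ND", "ND", "ND"]
    predicted = ["D", "ND", "ND", "D", "ND"]
    for label, row in zip(("D", "ND"), confusion_matrix(actual, predicted)):
        print(label, row)
```

With the label order ("D", "ND"), the first row holds true positives and false negatives, and the second row holds false positives and true negatives.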
Additional assessment was carried out using the ROC curve, an essential
tool for classification tasks that involve imbalanced datasets. The ROC curve for
the CM1 dataset is one such instance. The AUC is notably high, indicating
that the model differentiates well between the defective and non-defective
classes and that the AOA-MLP model predicts well. According to the experimental
results, the AUC is a robust measure of the model's discriminative ability, and it was
used to determine the effectiveness of combining AOA with the Multilayer Perceptron.
The evaluation shows that AOA-MLP can improve classification performance
through feature selection. The experimental results demonstrate that AOA-MLP is a
promising tool for software defect prediction, with stable performance metrics and orderly
learning.
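For reference, the AUC can be computed without plotting the curve, using its rank interpretation: it equals the probability that a randomly chosen defective module receives a higher score than a randomly chosen non-defective one, with ties counted as one half. The sketch below is a simple quadratic-time illustration with hypothetical scores; it is not the evaluation code used in this study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for defective, 0 for non-defective.
    scores: predicted probability (or any score) of being defective.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical scores for illustration only.
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.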
Figure 7
Figure 8
Figure 9
Real Non-Faulty (N)    8    360
Table 2.
This section discusses how the AOA-MLP model performs on the KC1 dataset, which is
sourced from the NASA repository. The confusion matrix in Table 2 provides a
clear overview of the model's ability to distinguish between faulty and non-faulty
instances. In the matrix, rows indicate the actual class labels, while columns show the
classes predicted by the model.
The ROC curve is a significant tool in our evaluation that complements the performance
analysis of the model. The ROC curve showed a high AUC value, which indicates
robust performance of the classification model on the KC1 dataset. This suggests that
the AOA-MLP model gives accurate and reliable predictions on this dataset. The AUC
value also shows how the AOA-MLP model's predictions of the actual class compare with
those of a random model, and the values obtained on the experimental dataset
confirmed our hypothesis.
If the AOA-MLP model had no discriminative ability, its AUC would match that of a random
classifier (0.5). The AUC value obtained in this research is higher, which confirms that
the AOA-MLP model performs better than chance when predicting the actual class value.
Figure 12
Figure 13
Real Non-Faulty (N)    1    35
Table 3
2.3 PC1 DATASET
The following section describes how well the AOA-MLP model performs on the PC1 dataset
from the NASA repository. The confusion matrix shows how the
model classifies faulty and non-faulty instances. The model does a very good job of
classifying the faulty examples, correctly labeling most of them with only a few false
positives. The model has a harder time with the non-faulty cases and occasionally
groups them with the faulty examples. This gives a good idea of
the model's strengths and the areas where it can improve.
The learning process of the model has been visualized to understand it better by
plotting the accuracy trend for both training and validation. The graph indicates that
the performance of the model increases over several epochs. The loss curve likewise
shows the error decreasing with epochs. This visualization is
crucial for pinpointing potential problems such as overfitting and underfitting.
The ROC curve adds a further dimension to our assessment. It yields a high AUC
value, which means that the model is highly effective at separating defective and
non-defective instances in the PC1 dataset. This is a good indication of both the model's
reliability and its potential for real-world applications in software defect prediction.
Figure 15 Confusion Matrix - Fine-tuned Model
Figure 12 ROC Curve - Highly Optimized Model
Figure 13 AOA Convergence - Baseline Model
Real Non-Faulty (N)    1    62
We now look at the performance of the AOA-MLP model on the PC2 dataset
from the NASA repository. The confusion matrix shows very few false positives,
indicating that the model is good at catching defective instances. However, it struggles
with non-defective instances, some of which it has labelled as defective. This provides a
balanced view of both the strengths and weaknesses of the model. The accuracy trends
for training and validation, which show how the model performs over time, are also
shown. More information on how the model reduces errors during learning can be found in
the training and validation loss curves. These visualizations are crucial for understanding
how the model behaves and for making sure it adapts well to new inputs.
The ROC curve underpins our evaluation, since it provides a robust AUC value that
gives insight into how well the model can differentiate between faulty and non-faulty
cases in the PC2 dataset. It is a clear indicator of correctness and dependability and
shows the potential of the model as a technique for software defect prediction tasks.
Figure 11 Confusion Matrix - Fine-tuned Model
Figure 14 Confusion Matrix - Poor Recall (Variant)
Figure 15 ROC Curve - Variant Model
Figure 16 Precision Recall Curve
Real Non-Faulty (N)    26    842
This section presents an experimental assessment on the CM1, PC1, PC2 and
KC1 datasets to evaluate the performance of the proposed AOA-MLP model compared to
some state-of-the-art software defect prediction (SDP) mechanisms. Using a variety of
evaluation metrics and performance indicators, we compare the results of our
model against those from previous works to show the effectiveness and potential
improvements offered by the AOA-MLP framework and its ability to enhance
classification performance across multiple benchmark datasets within the SDP domain.
REFERENCES
1. Zhang, L., Li, Z. and Xu, W. (2023), “A novel hybrid model based on feature
selection for software defect prediction”, Journal of Software: Evolution and
Process, Vol. 35 No. 3, e2416, doi: 10.1002/smr.2416.
2. Cai, X., Niu, Y., Geng, S., Zhang, J., Cui, Z., Li, J. and Chen, J. (2020), “An
under-sampled software defect prediction method based on hybrid multi-
objective cuckoo search”, Concurrency and Computation: Practice and
Experience, Vol. 32 No. 5, p. e5478, doi: 10.1002/cpe.5478.
3. Alam, M., Haidri, R.A. and Shahid, M. (2020), “Resource-aware load balancing
model for batch of tasks (BoT) with best fit migration policy on heterogeneous
distributed computing systems”, International Journal of Pervasive Computing
and Communications, Vol. 16 No. 2, pp. 113-141, doi: 10.1108/ijpcc-10-2019-
0081.
4. Alrezaamiri, H., Ebrahimnejad, A. and Motameni, H. (2019), “Software
requirement optimization using a fuzzy artificial chemical reaction optimization
algorithm”, Soft Computing, Vol. 23 No. 20, pp. 9979-9994, doi:
10.1007/s00500-018-3553-7.
5. Arar, Ö.F. and Ayan, K. (2015), “Software defect prediction using cost-sensitive
neural network”, Applied Soft Computing, Vol. 33, pp. 263-277, doi:
10.1016/j.asoc.2015.04.045.
6. Li, J., He, P., Zhu, J. and Lyu, M.R. (2017), “Software defect prediction via
convolutional neural network”, 2017 IEEE international conference on
software quality, reliability and security (QRS), IEEE, pp. 318-328.
7. Goyal, S. (2022), “Handling class-imbalance with KNN (neighbourhood)
under-sampling for software defect prediction”, Artificial Intelligence Review,
Vol. 55 No. 3, pp. 2023-2064, doi: 10.1007/s10462-021-10044-w.
8. Pandey, S.K., Mishra, R.B. and Tripathi, A.K. (2020), “BPDET: an effective
software bug prediction model using deep representation and ensemble learning
techniques”, Expert Systems with Applications, Vol. 144, 113085, doi:
10.1016/j.eswa.2019.113085.
9. Xu, Z., Liu, J., Yang, Z., An, G. and Jia, X. (2016), “The impact of feature
selection on defect prediction performance: an empirical comparison”, 2016
IEEE 27th International Symposium on Software Reliability Engineering
(ISSRE), IEEE, pp. 309-320.
10. Wahono, R.S., Herman, N.S. and Ahmad, S. (2014), “Neural network
parameter optimization based on genetic algorithm for software defect
prediction”, Advanced Science Letters, Vol. 20 Nos 10-11, pp. 1951-1955, doi:
10.1166/asl.2014.5641.
11. Manjula, C. and Florence, L. (2019), “Deep neural network based hybrid
approach for software defect prediction using software metrics”, Cluster
Computing, Vol. 22, Suppl 4, pp. 9847-9863, doi: 10.1007/s10586-018-1696-z.
12. Gao, K., Khoshgoftaar, T.M., Wang, H. and Seliya, N. (2011), “Choosing
software metrics for defect prediction: an investigation on feature selection
techniques”, Software: Practice and Experience, Vol. 41 No. 5, pp. 579-606,
doi: 10.1002/spe.1043.
13. Fenton, N., & Neil, M. (2012). Software metrics: Successes, failures, and new
directions. Journal of Systems and Software, 85(8), 1933-1940.
14. Arora, S., & Singh, S. (2019). A conceptual comparison of firefly algorithm,
bat algorithm, and cuckoo search. Artificial Intelligence Review, 52(3), 1813-
1863.
15. Malhotra, R. (2015). A systematic review of machine learning techniques for
software fault prediction. Applied Soft Computing, 27, 504-518.
16. Song, Q., Jia, Z., Shepperd, M., Ying, S., & Liu, J. (2011). A general software
defect-proneness prediction framework. IEEE Transactions on Software
Engineering, 37(3), 356-370.
17. Wang, S., & Yao, X. (2013). Using class imbalance learning for software defect
prediction. IEEE Transactions on Reliability, 62(2), 434-443.
18. He, P., Shu, F., Yang, Q., Li, X., Ma, Y., & Qu, Y. (2015). An empirical study
on software defect prediction with a simplified metric set. Information and
Software Technology, 59, 170-190.
19. Hall, T., Beecham, S., Bowes, D., Gray, D., & Counsell, S. (2012). A
systematic literature review on fault prediction performance in software
engineering. IEEE Transactions on Software Engineering, 38(6), 1276-1304.
20. Jureczko, M., & Madeyski, L. (2010). Towards identifying software project
clusters with similar defect patterns. Computer Science, 11(4), 399-407.
21. Hosseini, R., Turhan, B., & Mendes, E. (2017). A systematic literature review
and meta-analysis on cross project defect prediction. IEEE Transactions on
Software Engineering, 43(11), 1239-1263.
22. Jiang, Y., Cukic, B., & Menzies, T. (2008). Can data transformation help in the
detection of fault-prone modules? Proceedings of the International Symposium
on Software Reliability Engineering, 200-209.
23. Boetticher, G. D. (2005). Improving credibility of machine learning models in
software engineering. Proceedings of the International Workshop on Predictor
Models in Software Engineering, 17-24.
24. Rodriguez, P., Herraiz, I., & German, D. (2012). An empirical study on the
relation between community structure and software defects. Empirical Software
Engineering, 17(3), 438-461.
25. Gondra, I. (2008). Applying machine learning to software fault-proneness
prediction. Journal of Systems and Software, 81(2), 186-195.
26. Menzies, T., Greenwald, J., & Frank, A. (2007). Data mining static code
attributes to learn defect predictors. IEEE Transactions on Software
Engineering, 33(1), 2-13.
27. Lessmann, S., Baesens, B., Mues, C., & Pietsch, S. (2008). Benchmarking
classification models for software defect prediction: A proposed framework and
novel findings. IEEE Transactions on Software Engineering, 34(4), 485-496.
28. Zhang, H., & Zhang, X. (2007). Comments on "Data Mining Static Code
Attributes to Learn Defect Predictors". IEEE Transactions on Software
Engineering, 33(9), 635-637.
29. Shivaji, S., White, R., Radlinski, F., & Shavlik, J. (2009). Reducing features to
improve code change-based bug prediction. IEEE Transactions on Software
Engineering, 39(4), 552-569.
30. Kamei, Y., Shihab, E., Adams, B., Hassan, A. E., Mockus, A., Sinha, A., &
Ubayashi, N. (2013). A large-scale empirical study of just-in-time quality
assurance. IEEE Transactions on Software Engineering, 39(6), 757-773.
31. Zhang, F., Hall, T., & Harman, M. (2011). Predicting fault-prone software
modules: A systematic review of performance and validation techniques.
Software Testing, Verification & Reliability, 21(3), 291-325.
32. Rahman, F., & Devanbu, P. (2011). Ownership, experience and defects: A fine-
grained study of authorship. Proceedings of the International Conference on
Software Engineering, 491-500.