Ensembles Based Combined Learning For Improved Software Fault Prediction: A Comparative Study
Abstract—Software Fault Prediction (SFP) research has made enormous endeavors to accurately predict the fault proneness of software modules in order to maximize precious software test resources, reduce maintenance costs, and help deliver software products on time and satisfy customers, all of which ultimately contribute to producing quality software products. In this regard, Machine Learning (ML) has been successfully applied to solve classification problems for SFP. Moreover, within ML, Ensemble Learning Algorithms (ELA) are known to improve the performance of single learning algorithms. However, no ELA alone handles the challenges created by redundant and irrelevant features and by the class imbalance problem in software defect datasets. Therefore, the objective of this paper is to independently examine and compare prominent ELA and to improve their performance by combining them with Feature Selection (FS) and Data Balancing (DB) techniques, in order to identify more efficient ELA that better predict the fault proneness of software modules. Accordingly, a new framework that efficiently handles these challenges in a combined form is proposed. The experimental results confirm the robustness of the combined techniques: in particular, the framework performs best when the bagging ELA is combined with DB on selected features. Therefore, as shown in this study, ensemble techniques used for SFP must be carefully examined and combined with both FS and DB in order to obtain robust performance.

Keywords-Software Fault Prediction, Ensemble Learning Algorithms, Feature Selection, Data Balancing.

I. INTRODUCTION

The growing demand for quality software in different industries has been igniting the Software Fault Prediction (SFP) research area, whereby quality can be cautiously inspected and managed before releasing the software. SFP aims to inspect and detect the fault proneness of software modules and to help testers focus on those specific modules predicted as faulty, so as to manage resources efficiently and reduce the number of faults occurring during operation. In this regard, statistical and Machine Learning (ML) techniques have been employed for SFP in most studies [2-14]. Among ML techniques, Ensemble Learning Algorithms (ELA) have been demonstrated to be useful in different areas of research [7, 15-18], all confirming that ELA can effectively solve classification problems with better performance than an individual classifier. As illustrated in the literature [6, 14, 32, 35], to make wise decisions, people may consult many experts in the area and take their opinions into consideration rather than depend only on their own judgment. In fault prediction, a predictive model generated by ML can likewise be considered an expert. Therefore, a good approach to making decisions more accurately is to combine the outputs of different predictive models, so that the combination improves on, or at least equals, the predictive performance of an individual model [6, 14, 32, 35]. Therefore, in this study, we develop a new framework to compare eminent ELA, namely bagging [16, 30, 32, 35] and AdaBoost.M1 [16, 31, 32, 35], with the J48 Decision Tree (DT) as the base classifier. In addition, we use McCabe and Halstead static code metrics [19, 22] datasets for the experimental analysis.

In ML, ELA are known to improve the predictive performance of individual classifiers, but neither of these ensemble techniques alone solves the data skewness (class imbalance) problem [16, 26] or the existence of redundant and irrelevant features, which are common in defect datasets [26]. Thus, to deal with these issues, an ensemble-based combined framework has to be designed specifically. Therefore, in this study, we combine ELA with Feature Selection (FS) [9, 12-14, 20, 21] and Data Balancing (DB) [11, 20, 23-25] techniques. FS removes less important and redundant features, so that only important features are left for training the predictive models and the performance of ELA can be improved. Moreover, as software defect datasets are composed mostly of Not Fault Prone (NFP) instances with only a small percentage of Fault Prone (FP) instances, DB is carried out to resolve this skewed nature of defect datasets, so that building SFP models on balanced data can improve ELA performance.

Therefore, this paper aims to independently examine and compare ELA and to realize their performance improvement when combined with FS and DB, in order to identify efficient techniques that perform better for SFP. Hence, the main contribution of this study is the empirical analysis of multiple ELA in combination with FS and DB. Interestingly, the proposed framework has exhibited the robustness of the combined techniques; in particular, it performs best when combining ensemble techniques with DB on selected features, which constitutes a primary contribution of this study.
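To make the combined design concrete, the following minimal sketch assembles the framework's three stages (FS, DB, ELA) into one cross-validatable model. It is an illustrative Python analogue using scikit-learn and imbalanced-learn, not the authors' WEKA implementation: mutual information stands in for Information Gain, a bagged CART decision tree stands in for Bagging with J48, and the parameter values (feature count, tree count, fold count) are assumptions.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # sampler-aware pipeline
from sklearn.ensemble import BaggingClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def build_combined_model(k_features=8, n_trees=10, seed=0):
    """FS + DB + ELA, analogous to the paper's SMOTEIGDTBagging setup."""
    return Pipeline([
        # FS: mutual information plays the role of Information Gain here.
        ("fs", SelectKBest(mutual_info_classif, k=k_features)),
        # DB: SMOTE synthesizes minority (FP) instances on training folds only.
        ("db", SMOTE(random_state=seed)),
        # ELA: bagged decision trees stand in for WEKA's Bagging + J48.
        ("ela", BaggingClassifier(DecisionTreeClassifier(),
                                  n_estimators=n_trees, random_state=seed)),
    ])

# Usage (X: module-metric matrix, y: FP/NFP labels), assumed 10-fold CV:
# auc = cross_val_score(build_combined_model(), X, y, cv=10,
#                       scoring="roc_auc").mean()
```

Placing SMOTE inside a sampler-aware pipeline ensures the synthetic instances are generated only from training folds, so the evaluation folds stay untouched.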
A. Comparison: ELA Performance Combined with IG

The performance comparison of bagging and AdaBoost.M1 using IG feature selection is given in Figure 2 (a) and (b) and Table II. In terms of both indexes used in this study, AdaBoost.M1 tends to perform lower while bagging demonstrates the highest values. Thus, the results reflect the better performance of bagging over AdaBoost.M1, except that, out of the eight datasets, AdaBoost.M1 achieves better accuracy on MC1" and KC2 and a better AUC on MC1". However, considering the average performance over all datasets, bagging still outperforms AdaBoost.M1.

TABLE II. CLASSIFICATION RESULTS OF ELA COMBINED WITH IG

Dataset    IGDTBagging          IGDTAdaBoost.M1
           Accuracy    AUC      Accuracy    AUC
JM1'       81.766      0.720    80.568      0.696
MC1"       97.712      0.800    98.305      0.821
MW1'       88.799      0.678    86.373      0.669
PC3'       86.587      0.803    85.024      0.777
PC4"       88.874      0.908    87.887      0.894
ar1        90.404      0.755    87.929      0.744
ar4        84.545      0.833    80.709      0.794
KC2        81.934      0.833    82.718      0.801
Average    87.580      0.791    86.190      0.775
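For intuition about the IG step used throughout this section, the sketch below computes textbook Information Gain, IG(class; feature) = H(class) - H(class | feature), and ranks features by it. It assumes discrete feature values (numeric code metrics would first need supervised discretization, not shown), and the top_k_features helper and its cut-off are hypothetical illustrations, not taken from the paper.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy H of a class-label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(feature, labels):
    """IG(class; feature) = H(class) - H(class | feature)."""
    h_cond = 0.0
    for v in np.unique(feature):
        mask = feature == v
        # P(feature = v) times the entropy of the class within that slice.
        h_cond += mask.mean() * entropy(labels[mask])
    return entropy(labels) - h_cond

def top_k_features(X, y, k):
    """Rank columns of X by IG against y and keep the k best."""
    gains = [information_gain(X[:, j], y) for j in range(X.shape[1])]
    return np.argsort(gains)[::-1][:k]
```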
B. Comparison: ELA Performance Combined with IG and SMOTE

In Figure 3 (a) and (b) and Table III, the performance comparison of bagging and AdaBoost.M1 combined with both IG and SMOTE is given. In terms of both indexes, AdaBoost.M1 appears on average to perform lower while bagging demonstrates the highest values. However, for the MC1" (97.303, 0.992), MW1' (85.595), PC3' (83.834, 0.911), and PC4" (90.439, 0.963) datasets, AdaBoost.M1 outperforms bagging in both accuracy and AUC (accuracy only in the case of MW1'). Nevertheless, considering the average performance over all datasets, bagging still outperforms AdaBoost.M1. Thus, the results reflect the better performance of combined bagging, closely followed by combined AdaBoost.M1, on software defect datasets. On the other hand, based on these results, we can say that, after resolving the class imbalance problem, AdaBoost.M1 shows competitively good performance on some datasets, which clearly needs further investigation with more datasets built from other software metrics.

TABLE III. CLASSIFICATION RESULTS OF ELA COMBINED WITH BOTH IG AND SMOTE

Dataset    SMOTEIGDTBagging     SMOTEIGDTAdaBoost.M1
           Accuracy    AUC      Accuracy    AUC
JM1'       80.773      0.855    78.926      0.835
MC1"       96.596      0.988    97.303      0.992
MW1'       84.966      0.916    85.595      0.907
PC3'       83.266      0.905    83.834      0.911
PC4"       90.158      0.962    90.439      0.963
ar1        82.092      0.901    81.931      0.896
ar4        77.401      0.854    76.923      0.846
KC2        80.390      0.871    79.078      0.836
Average    84.460      0.907    84.250      0.898

Figure 3. Comparison of ELA Combined with both IG and SMOTE using Accuracy and AUC
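Since the gains in this subsection come from SMOTE, the following sketch shows the core of the technique for intuition: each synthetic fault-prone instance is a random interpolation between a minority instance and one of its k nearest minority neighbours [11]. The function name and default values are illustrative; in experiments one would rely on a tested library implementation such as imbalanced-learn's SMOTE.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_minority(X_min, n_synthetic, k=5, seed=0):
    """Generate n_synthetic interpolated samples from minority-class rows X_min."""
    X_min = np.asarray(X_min, dtype=float)
    rng = np.random.default_rng(seed)
    # k+1 neighbours because each point is returned as its own nearest neighbour.
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X_min).kneighbors(X_min)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))        # pick a random minority instance
        j = rng.choice(idx[i][1:])          # one of its k minority neighbours
        gap = rng.random()                  # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.asarray(synthetic)
```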
C. Comparison: IGDTBagging with SMOTEIGDTBagging

As expected, selecting useful features and resolving the class imbalance problem has proved useful and improves ELA performance. In this regard, based on our proposed framework, the experimental results in Sections V (A) and (B) show the achieved performance improvements, and the most efficient ELA on average is found to be combined bagging under both strategies (combining with IG alone, and with both IG and SMOTE). Therefore, this section points out the performance improvement achieved by the combined bagging ELA when combined with IG as well as with both IG and SMOTE, using AUC as the performance evaluation measure. Accordingly, as shown in Figure 4, the combined bagging algorithm gives better results on all datasets when combined with both IG and SMOTE than when combined with IG only. This affirms the contribution of combined preprocessing, removing irrelevant and redundant features as well as resolving the class imbalance problem, and its power to improve the performance of ELA.

Figure 4. Comparison of Bagging ELA Combined with IG and both IG and SMOTE using AUC
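The comparison in this subsection can be reproduced in spirit with the sketch below, which scores the bagging pipeline with and without the balancing step under cross-validated AUC. As before, this is a scikit-learn/imbalanced-learn analogue of the WEKA setup, with assumed parameter values rather than the paper's exact configuration.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import BaggingClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def auc_with_and_without_db(X, y, k=8, cv=10, seed=0):
    """Cross-validated AUC of IGDTBagging vs. SMOTEIGDTBagging analogues."""
    def bagging_pipeline(balance):
        steps = [("fs", SelectKBest(mutual_info_classif, k=k))]
        if balance:
            steps.append(("db", SMOTE(random_state=seed)))
        steps.append(("ela", BaggingClassifier(DecisionTreeClassifier(),
                                               random_state=seed)))
        return Pipeline(steps)
    ig_only = cross_val_score(bagging_pipeline(False), X, y, cv=cv,
                              scoring="roc_auc").mean()
    ig_smote = cross_val_score(bagging_pipeline(True), X, y, cv=cv,
                               scoring="roc_auc").mean()
    return ig_only, ig_smote
```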
VI. THREATS TO VALIDITY

There are threats that may affect our experimental results. The proposed prediction models were created without changing the default parameter settings, except that the DT algorithm was set as the base classifier of both ensemble techniques, which is not the default in either case. Thus, no investigation was made into how varying the default parameter settings affects model performance. In addition, as many software metrics are defined in the literature, different software metrics might be better indicators of module defectiveness; however, we used the static code software metrics that were available in the selected datasets. Conclusions were drawn based on the important features selected using IG. In terms of the total number of instances and the class distribution, the datasets may not be good representatives; however, this practice is common in the fault prediction research area.

VII. CONCLUSION AND FUTURE WORKS

This study made an empirical evaluation of the capability of ELA in predicting FP software modules and compared their performance when combined with FS, and with both FS and DB, using eight NASA software defect datasets. Our objective in using FS and DB was that, by combining these filtering techniques with ELA, we would be able to prune non-relevant features and balance the classes, and then learn an ELA that performs better than one learned on the whole feature set with imbalanced classes. Accordingly, the experimental results reveal that our combined technique assures performance improvement. Thus, by dealing with the challenges of SFP mentioned in this study, our proposed framework achieves remarkable classification performance and lays a pathway to software quality assurance.

As future work, we plan to explore more ELA, including vote and stacking, and more data preprocessing techniques, with more defect datasets consisting of different software metrics, and to realize how the proposed framework helps to identify more efficient combined ensemble techniques and improve their classification performance to accurately predict FP software modules.

ACKNOWLEDGEMENT

This work is supported by the Fundamental Research Funds for the Central Universities (No. 2682015QM02).
REFERENCES

[1] T. Menzies, R. Krishna, and D. Pryor, The Promise Repository of Empirical Software Engineering Data, 2016. Available: http://openscience.us/repo
[2] E. Arisholm, L. C. Briand, and E. B. Johannessen, "A systematic and comprehensive investigation of methods to build and evaluate fault prediction models," Journal of Systems and Software, vol. 83, pp. 2–17, 2010.
[3] K. O. Elish and M. O. Elish, "Predicting defect-prone software modules using support vector machines," Journal of Systems and Software, vol. 81, pp. 649–660, 2008.
[4] I. Gondra, "Applying machine learning to software fault-proneness prediction," Journal of Systems and Software, vol. 81, pp. 186–195, 2008.
[5] T. M. Khoshgoftaar, C. Seiffert, J. V. Hulse, A. Napolitano, and A. Folleco, "Learning with limited minority class data," in the Sixth International Conference on Machine Learning and Applications, Cincinnati, OH, 2007.
[6] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., 2011.
[7] R. Kohavi, "A study of cross-validation and bootstrap for accuracy estimation and model selection," in the International Joint Conference on Artificial Intelligence, 1995.
[8] I. H. Laradji, M. Alshayeb, and L. Ghouti, "Software defect prediction using ensemble learning on selected features," Information and Software Technology, vol. 58, pp. 388–402, 2015.
[9] R. Malhotra, "A systematic review of machine learning techniques for software fault prediction," Applied Soft Computing, vol. 27, pp. 504–518, 2015.
[10] T. Menzies, J. Greenwald, and A. Frank, "Data mining static code attributes to learn defect predictors," IEEE Transactions on Software Engineering, vol. 33, pp. 2–13, 2007.
[11] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
[12] S. Shivaji, E. J. Whitehead, R. Akella, and S. Kim, "Reducing features to improve code change-based bug prediction," IEEE Transactions on Software Engineering, vol. 39, pp. 552–569, 2013.
[13] H. Wang, T. M. Khoshgoftaar, and A. Napolitano, "A comparative study of ensemble feature selection techniques for software defect prediction," in the Ninth International Conference on Machine Learning and Applications, IEEE, Washington, DC, 2010.
[14] E. Frank, M. A. Hall, and I. H. Witten, The WEKA Workbench. Online Appendix for "Data Mining: Practical Machine Learning Tools and Techniques," 4th ed. Morgan Kaufmann, 2016.
[15] A. Shanthini and R. M. Chandrasekaran, "Analyzing the effect of bagged ensemble approach for software fault prediction in class level and package level metrics," in the IEEE International Conference on Information Communication and Embedded Systems (ICICES), India, 2014.
[16] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, and F. Herrera, "A review on ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches," IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 42, pp. 463–484, 2012.
[17] S. K. Mathanker, P. R. Weckler, T. J. Bowser, N. Wang, and N. O. Maness, "AdaBoost classifiers for pecan defect classification," Computers and Electronics in Agriculture, vol. 77, pp. 60–68, 2011.
[18] T. M. Khoshgoftaar, K. Gao, and A. Napolitano, "Improving software quality estimation by combining feature selection strategies with sampled ensemble learning," in the IEEE 15th International Conference on Information Reuse and Integration (IRI), San Francisco, California, USA, 2014.
[19] D. Radjenovic, M. Hericko, R. Torkar, and A. Zivkovic, "Software fault prediction metrics: A systematic literature review," Information and Software Technology, vol. 55, pp. 1397–1418, 2013.
[20] H. Liu, H. Motoda, and L. Yu, "A selective sampling approach to active feature selection," Artificial Intelligence, vol. 159, pp. 49–74, 2004.
[21] S. Liu, X. Chen, W. Liu, J. Chen, Q. Gu, and D. Chen, "FECAR: A feature selection framework for software defect prediction," in the 38th Annual International Computers, Software and Applications Conference, Vasteras, 2014.
[22] T. J. McCabe, "A complexity measure," IEEE Transactions on Software Engineering, vol. SE-2, pp. 308–320, 1976.
[23] V. García, J. S. Sánchez, and R. A. Mollineda, "On the effectiveness of preprocessing methods when dealing with different levels of class imbalance," Knowledge-Based Systems, vol. 25, pp. 13–21, 2012.
[24] H. He and E. A. Garcia, "Learning from imbalanced data," IEEE Transactions on Knowledge and Data Engineering, vol. 21, pp. 1263–1284, 2009.
[25] P. Sarakit, T. Theeramunkong, and C. Haruechaiyasak, "Improving emotion classification in imbalanced YouTube dataset using SMOTE algorithm," in the 2nd International Conference on Advanced Informatics: Concepts, Theory and Applications, Chonburi, 2015.
[26] W. Y. Chubato and T. Li, "A combined-learning based framework for improved software fault prediction," International Journal of Computational Intelligence Systems, vol. 10, pp. 647–662, 2017.
[27] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: An update," SIGKDD Explorations, vol. 11, pp. 10–18, 2009. Retrieved 01 Sep. 2017.
[28] C. Catal, "Software fault prediction: A literature review and current trends," Expert Systems with Applications, vol. 38, pp. 4626–4636, 2011.
[29] T. Hall, S. Beecham, D. Bowes, D. Gray, and S. Counsell, "A systematic literature review on fault prediction performance in software engineering," IEEE Transactions on Software Engineering, vol. 38, pp. 1276–1304, 2012.
[30] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, pp. 123–140, 1996.
[31] Y. Freund and R. E. Schapire, "Experiments with a new boosting algorithm," in the Thirteenth International Conference on Machine Learning, San Francisco, 1996, pp. 148–156.
[32] R. Polikar, "Ensemble learning," Scholarpedia, 2009.
[33] F. Provost and T. Fawcett, "Robust classification for imprecise environments," Machine Learning, vol. 42, pp. 203–231, 2001.
[34] C. Catal, "Performance evaluation metrics for software fault prediction studies," Acta Polytechnica Hungarica, vol. 9, pp. 193–206, 2012.
[35] R. Polikar, "Ensemble based systems in decision making," IEEE Circuits and Systems Magazine, vol. 6, pp. 21–45, 2006.