A Meta-Stacked Software Bug Prognosticator Classifier
2Department of CSE, Aurora’s Technological and Research Institute, Uppal, Hyderabad, Telangana, India
ABSTRACT
Defect prediction is the proactive process of classifying the defects that can be found across a software product's entire code base, in both within-project and cross-project code, in order to produce a high-quality product at optimized cost. Error prediction in open source software is more crucial because of its inherent complexity and its large base of contributors. In this paper we present the meta-stacked regression model (MSRM), which improves on the Rayleigh probability distribution for feature selection estimates. First, a heuristic bug mining approach is adopted to mine the parameters reported by developers and contributors in the activity logs of various open source projects (Bugzilla, Eclipse, Mozilla). Second, Stacked Regression is compared with Neural Network and Linear Support Vector Machine models in terms of bug prediction performance, using feature importance and correlation amongst parameters. The results show that the ensemble-based Stacked Regression has better precision and F-measure than simple machine learning models. The MSRM model predicts and classifies bugs with an accuracy of 96.8% and reduces the impact of false positives, with a recall of 71.2%.
Keywords: Stacked Regression, Bug Prediction, Cost Estimation, Rayleigh defect density, Software Project Bugs
I. INTRODUCTION
The occurrence of bugs is a deterrent in any software project cost estimation and directly affects the quality and success of software management. It can affect macroscopic elements such as resource allocation, project planning and bidding, as well as micro-level phase-wise design and execution; hence bug estimation at an early stage of software testing has attracted extensive research effort. Bugs can reduce the reliability of a software system, affecting its estimation and model accuracy. Boehm's constructive cost model COCOMO and COCOMO II [3], Albrecht's function point method [2] and Putnam's software life cycle management (SLIM) [15] are the initial algorithmic estimation methods used for software estimation. Of these, the Constructive Cost Model is by far the most commonly used because of its simplicity in estimating the effort in person-months for a project at different stages.

The most fundamental calculation in the COCOMO model is the use of the Effort Equation to estimate the number of person-months required to develop a project. The other results, including the estimates for Requirements and Maintenance, are derived from LOC and the effort equation. However, the model estimates the cost and schedule of the project only from the design phase to the end of the integration phase. For the remaining phases a separate estimation model should be used. Also, the cost estimate may vary due to changes in the requirements, staff size, and the environment in which the software is being developed. This paper hence focuses on feature selection of reported bug attributes by proposing an Integrated Meta-Stacked Regression Model (MSRM) that improves the cost of bug estimation and the prediction accuracy.
IJSRST184517 | Received : 01 March 2018 | Accepted : 10 March 2018 | March-April-2018 [ (4) 5 : 111-117 ]
Ajay Kumar Shrivastava et al. Int J S Res Sci. Tech. 2018 Mar-Apr;4(5) : 111-117
Barry Boehm et al. [3] (2000) presented an overview of a variety of software estimation models, indicating that neural-net and dynamics-based techniques are less mature than the other classes of techniques and are challenged by the rapid changes in software technology. The key to arriving at sound estimates is to have a grasp of the factors driving the costs of the project at hand, in order to support the project planning and control functions performed by management.

S. Wang et al. [2] (2016) proposed to bridge the gap between programs' semantics and defect prediction features with a representation-learning algorithm. A Deep Belief Network (DBN) was adopted to learn semantic features from token vectors extracted from programs' Abstract Syntax Trees (ASTs) and was evaluated on ten open source projects. For within-project defect prediction (WPDP), the results showed improvements of 14.7% in precision, 11.5% in recall, and 14.2% in F1. For cross-project defect prediction (CPDP), the semantic-features-based approach outperformed the state-of-the-art technique TCA+ with traditional features by 8.9% in F1 score.

3.1 Logistic Regression Model
The regression model considered for classification of bug occurrence can be written as follows. Consider n independent pairs $(x_i, y_i)$, $i = 1, 2, \ldots, n$, where $x_i' = (x_{0i}, x_{1i}, \ldots, x_{ni})$. For the outcome $Y_i$, the logistic regression model assumes that
$$P(Y_i = 1 \mid x_i) = \mu(x_i) \tag{i}$$
where
$$\mu(x_i) = \frac{e^{g(x_i)}}{1 + e^{g(x_i)}}, \quad g(x_i) = x_i'\beta \tag{ii}$$
The parameter estimates are obtained by maximum likelihood, and the fitted values are specified as
$$\hat{\mu}(x_i) = \mu(x_i, \hat{\beta}), \quad \text{i.e.} \quad \operatorname{logit}[\mu(x)] = x'\beta \tag{iii}$$

3.2 Linear Support Vector Model
Linear classifiers define the margin as the width by which the boundary could be widened before hitting a data point. Support vectors are the data points that the margin pushes up against. The linear classifier with the maximum margin is called the Linear Support Vector Machine (LSVM).
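The margin and support-vector idea of Section 3.2 can be illustrated with a minimal sketch. It assumes a separating hyperplane w.x + b = 0 is already known; the data points, the weights `w` and the bias `b` below are hypothetical values chosen for illustration, and the helper names are not from the paper. The code measures each point's distance from the boundary and reports the nearest points, which play the role of support vectors.

```python
import math

def signed_distance(x, w, b):
    """Signed distance of point x from the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wj * wj for wj in w))
    return (sum(wj * xj for wj, xj in zip(w, x)) + b) / norm

def classify(x, w, b):
    """Assign class +1 or -1 according to the side of the boundary."""
    return 1 if signed_distance(x, w, b) >= 0 else -1

def nearest_points(points, w, b, tol=1e-9):
    """Return the smallest distance to the boundary and the points attaining it."""
    dists = [abs(signed_distance(x, w, b)) for x in points]
    d_min = min(dists)
    return d_min, [x for x, d in zip(points, dists) if d - d_min <= tol]

# Hypothetical 2-D data and hyperplane x1 + x2 - 3 = 0 (illustrative only).
points = [(1.0, 1.0), (2.0, 2.0), (0.0, 0.0), (3.0, 3.0)]
w, b = (1.0, 1.0), -3.0

margin, support = nearest_points(points, w, b)
print("distance to nearest point:", margin)
print("nearest points (support vectors):", support)
print("labels:", [classify(x, w, b) for x in points])
```

A trained LSVM would choose `w` and `b` so that this nearest distance is as large as possible; the sketch only evaluates a given boundary and does not perform that optimisation.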
The corresponding weights and bias are given as
$$w = \sum_i \alpha_i y_i x_i, \quad b = y_k - w^T x_k \tag{v}$$
for any $x_k$ such that $\alpha_k \neq 0$. Here, each non-zero $\alpha_i$ indicates that the corresponding $x_i$ is a support vector, and the classifying function has the form
$$\mathcal{F}(x) = \sum_i \alpha_i y_i x_i^T x + b \tag{vi}$$

3.3 Radial Basis Function Neural Network Model
The objective of a neural network is to find a set of weights that yields meaningful output values for given instances of its input. Each node in the hidden layer receives input from the input layer; these inputs are multiplied by the appropriate weights and summed.

3.4 Rayleigh's Density Distribution
The Rayleigh distribution [6] is a special case of the Weibull distribution, which provides a population model useful in several areas of statistics, including life testing and reliability study. The Rayleigh distribution RD($\theta$) is employed in parameter estimation using different types of censored and non-censored data. Its probability density function (PDF) is
$$f(x) = 2\theta x e^{-\theta x^2}, \quad x > 0,\ \theta > 0 \tag{vii}$$
and its cumulative distribution function (CDF) is
$$F(x) = 1 - e^{-\theta x^2} \tag{viii}$$

IV. PROPOSED FRAMEWORK

In this paper the following steps were taken in the process of building the MSRM model:

1) Data Extraction
The bug reports of different products of the Eclipse, Mozilla and Bugzilla [5] open source software were retrieved from the CVS repository and saved in .csv format from source: http://bugzilla.mozilla.org

2) Data Pre-Processing & Preparation
Remove missing and noisy data by setting all non-zero values to 1 for the depends-on and duplicate-count attributes. Divide the dataset into training data for model creation and testing data for validation (60:40).

Figure 3. A Sample Bug report for Open Office Project with BugID 1425569

3) Modeling
Build three models based on the LSVM and RBNN classifiers and the Regression model.
Calculate the summary weight of the bug attributes using the information gain criterion. Train the model using the twelve most relevant attributes to predict the bug severity.
Apply the Rayleigh probability density to predict defect density for the different phases of the project life cycle.

4) Testing and Validation
Test the model on the remaining dataset. Evaluate and assess the performance of the prediction models.
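As a worked illustration of the Rayleigh density and distribution functions in equations (vii) and (viii), the sketch below evaluates both and uses CDF differences to apportion defects across life-cycle phases. The parameter value, the phase end times and the function names are hypothetical, and the use of CDF differences per phase is an assumption made for illustration; the paper does not spell out how the density is applied to phases.

```python
import math

def rayleigh_pdf(x, theta):
    """Equation (vii): f(x) = 2 * theta * x * exp(-theta * x^2), for x > 0, theta > 0."""
    if x <= 0 or theta <= 0:
        raise ValueError("x and theta must be positive")
    return 2.0 * theta * x * math.exp(-theta * x * x)

def rayleigh_cdf(x, theta):
    """Equation (viii): F(x) = 1 - exp(-theta * x^2)."""
    if x < 0 or theta <= 0:
        raise ValueError("x must be non-negative and theta positive")
    return 1.0 - math.exp(-theta * x * x)

def phase_fractions(phase_ends, theta):
    """Fraction of all defects falling in each phase, as F(t_k) - F(t_{k-1}).

    Assumption: phases are consecutive and phase k ends at time phase_ends[k].
    """
    fractions, previous = [], 0.0
    for end in phase_ends:
        current = rayleigh_cdf(end, theta)
        fractions.append(current - previous)
        previous = current
    return fractions

# Hypothetical parameter and phase boundaries (illustrative only).
theta = 0.5
phase_ends = [1.0, 2.0, 3.0, 4.0]
fractions = phase_fractions(phase_ends, theta)
for k, share in enumerate(fractions, start=1):
    print(f"phase {k}: {share:.4f} of total defects")
```

The fractions sum to F(4.0) rather than 1, since the Rayleigh tail beyond the last phase boundary is not assigned to any phase.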
Several experiments were conducted on the recorded bug repository to study the performance of the proposed stacked model in comparison with existing classifiers. The experiments were run on a 2.5GHz i5-3210M machine with 4GB RAM.

To measure defect prediction results, we use four evaluation metrics: Correlation, Precision, Recall, and F1 score [3].

Pearson's Correlation: It is used as a measure for quantifying linear dependence between two continuous variables X and Y. Its value varies from -1 to +1. Pearson's correlation is given as:
$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} \tag{ix}$$

F-measure: The weighted (by class size) average F-measure was obtained from the classification using the feature subset with the same three classifiers as above. It describes the harmonic mean of Precision and Recall. The general formula for positive real $\beta$ is given by
$$F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{(\beta^2 \cdot \text{precision}) + \text{recall}} \tag{x}$$

$$\text{precision} = \frac{TP}{TP + FP} \tag{xi}$$

$$\text{recall} = \frac{TP}{TP + FN} \tag{xii}$$

Here Precision is the ratio of all relevant, correctly classified bugs to all retrieved bugs.

Recall is measured as the fraction of relevant items retrieved out of all relevant items, including those that were not correctly classified (false negatives).

V. RESULTS AND DISCUSSION

The features are filtered according to the importance derived from the feature importance graph of the LSVM. The positive and negative values in the graph show the role of each feature in classifying positive and negative values. Therefore, in the case of the Linear SVM, we select the features at the extremities for both classes.
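The evaluation metrics of equations (ix) to (xii) can be checked with a short sketch. The confusion-matrix counts and the two numeric series below are hypothetical values chosen for illustration, not results from the paper, and the function names are assumptions.

```python
import math

def precision(tp, fp):
    """Equation (xi): TP / (TP + FP)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Equation (xii): TP / (TP + FN)."""
    return tp / (tp + fn)

def f_beta(p, r, beta=1.0):
    """Equation (x): (1 + beta^2) * p * r / (beta^2 * p + r)."""
    b2 = beta * beta
    return (1.0 + b2) * p * r / (b2 * p + r)

def pearson(xs, ys):
    """Equation (ix): cov(X, Y) / (sigma_X * sigma_Y)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical counts: 80 true positives, 20 false positives, 20 false negatives.
p = precision(80, 20)
r = recall(80, 20)
print("precision:", p)
print("recall:", r)
print("F1:", f_beta(p, r))
print("Pearson:", pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

With `beta=1.0` the F-measure reduces to the plain harmonic mean of precision and recall; larger values of `beta` weight recall more heavily.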
Next, stacking was performed by applying the Rayleigh defect density when generating the mean probabilities of the classifiers; further performance tuning then led to improved accuracy on the testing dataset compared with the individual classifiers. Figures 7-9 depict the results of the proposed MSRM model applied to 204545 bugs, including the F1 score.

Figure 7. Bug Reports on Severity Parameter for Eclipse

Root mean squared error (RMSE): RMSE is a quadratic scoring rule that also measures the average magnitude of the error. It is the square root of the average of the squared differences between prediction and actual observation:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2} \tag{xiii}$$
where $\hat{y}_i$ is the prediction and $y_i$ the actual observation for sample $i$.

The RMSE values calculated for classification and prediction on the bug dataset are given in Table 2.

Table 2. Comparative Error Estimates
Model                     RMSE Value
Neural Network            0.0184908700285
Linear SVM                0.0136835156595
Linear Regression         0.0115793611054
MetaStacked Regression    0.00989779285401
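The RMSE comparison can be illustrated with a minimal sketch that averages the probabilities of several base classifiers and scores each against the true labels. All probabilities and labels below are hypothetical, the helper names are assumptions, and the plain averaging is a simplification: it omits the Rayleigh defect density step, whose exact form the text does not specify, so it should not be read as the MSRM itself.

```python
import math

def rmse(predicted, actual):
    """Square root of the mean squared difference between prediction and observation."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

def mean_probabilities(per_model_probs):
    """Average, sample by sample, the probabilities produced by each base model."""
    return [sum(column) / len(column) for column in zip(*per_model_probs)]

# Hypothetical predicted probabilities from three base models on four samples.
base_probs = [
    [0.9, 0.2, 0.7, 0.1],
    [0.8, 0.3, 0.6, 0.2],
    [0.7, 0.1, 0.8, 0.3],
]
actual = [1, 0, 1, 0]  # hypothetical true labels

stacked = mean_probabilities(base_probs)
for name, probs in zip(("model A", "model B", "model C"), base_probs):
    print(name, round(rmse(probs, actual), 4))
print("mean of models", round(rmse(stacked, actual), 4))
```

Averaging tends to cancel the individual models' errors when they disagree in different directions, which is the usual motivation for combining classifiers before scoring.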
VI. CONCLUSION
Future enhancements include adopting the MSRM model for web-based and component-based software data sets.

VII. REFERENCES

[1]. X. Huo, M. Li, and Z.-H. Zhou, "Learning unified features from natural and programming languages for locating buggy source code," in Proceedings of IJCAI'2016.
[2]. S. Wang, T. Liu, and L. Tan, "Automatically learning semantic features for defect prediction," in ICSE'16: Proc. of the International Conference on Software Engineering, 2016.
[3]. V. Raychev, M. Vechev, and E. Yahav, "Code completion with statistical language models," in ACM SIGPLAN Notices, vol. 49, no. 6. ACM, 2014, pp. 419-428.
[4]. J. Li, P. He, J. Zhu, and M. R. Lyu, "Software Defect Prediction via Convolutional Neural Network," in IEEE International Conference on Software Quality, Reliability and Security, 2017.
[5]. Z. He, F. Peters, T. Menzies, and Y. Yang, "Learning from open-source projects: An empirical study on defect prediction," in ESEM'13: Proc. of the International Symposium on Empirical Software Engineering and Measurement, 2013.
[6]. N. A. Abou-Elheggag, "Estimation for Rayleigh distribution using progressive first-failure censored data," Journal of Statistics Applications and Probability, 2(2) (2013) 171-182.
[7]. J. Wang, B. Shen, and Y. Chen, "Compressed C4.5 models for software defect prediction," in QSIC'12: Proc. of the International Conference on Quality Software, 2012.
[8]. T. Gyimothy, R. Ferenc, I. Siket, "Empirical Validation of Object-Oriented Metrics on Open Source Software for Fault Prediction," IEEE Transactions on Software Engineering, vol. 31, no. 10, pp. 897-910, 2005.
[9]. S.W. Haider, J.W. Cangussu, K.M.L. Cooper, R. Dantu, "Estimation of Defects Based on Defect Decay Model: ED3M," IEEE Transactions on Software Engineering, vol. 34, no. 3, pp. 336-356, 2008.
[10]. R. M. El-Sagheer, "Inferences using type-II progressively censored data with binomial removals," Arabian Journal of Mathematics, 4(1) (2015) 127-139.
[11]. K. Herzig, S. Just, and A. Zeller, "It's not a bug, it's a feature: how misclassification impacts bug prediction," in ICSE'13, pp. 392-401.