© 2018 IJSRST | Volume 4 | Issue 5 | Print ISSN: 2395-6011 | Online ISSN: 2395-602X

Themed Section: Science and Technology

A Meta-Stacked Software Bug Prognosticator Classifier


Ajay Kumar Shrivastava*1, Dr. Ekbal Rashid2
*1 M.Tech Research Scholar, Jharkhand Rai University, Ranchi, Jharkhand, India
2 Department of CSE, Aurora’s Technological and Research Institute, Uppal, Hyderabad, Telangana, India

ABSTRACT

Defect prediction is the proactive process of classifying the defects that can be found in a software product, in both within-project and cross-project code, in order to produce a high-quality product at optimized cost. Error prediction in open source software is especially important because of its inherent complexity and its large pool of contributors. In this paper we present the meta-stacked regression model (MSRM), which builds on the Rayleigh probability distribution for feature selection estimates. First, a heuristic bug mining approach is adopted to mine the parameters reported by developers and contributors in the activity logs of various open source projects (Bugzilla, Eclipse, Mozilla). Second, stacked regression is compared with neural network and linear support vector machine models in terms of bug prediction performance, feature importance and correlation amongst parameters. The results show that the ensemble-based stacked regression has better precision and F-measure than the simple machine learning models. The MSRM model predicts and classifies bugs with an accuracy of 96.8% and achieves a recall of 71.2%.
Keywords: Stacked Regression, Bug Prediction, Cost Estimation, Rayleigh defect density, Software Project Bugs

I. INTRODUCTION

The occurrence of bugs hampers software project cost estimation and directly affects the quality and success of software project management. Bugs can affect macroscopic elements such as resource allocation, project planning and bidding, as well as micro-level, phase-wise design and execution; hence bug estimation at an early stage of software testing has attracted extensive research effort. Bugs can also reduce the reliability of a software system, affecting its estimation and model accuracy. Boehm’s constructive cost models COCOMO and COCOMO II [3], Albrecht’s function point method [2] and Putnam’s software life cycle management (SLIM) [15] were the initial algorithmic methods used for software estimation. The Constructive Cost Model is by far the most commonly used because of its simplicity in estimating the effort in person-months for a project at different stages.

The most fundamental calculation in the COCOMO model is the use of the Effort Equation to estimate the number of person-months required to develop a project. The other results, including the estimates for requirements and maintenance, are derived from LOC and the effort equation. However, the model estimates the cost and schedule of the project only from the design phase to the end of the integration phase; for the remaining phases a separate estimation model should be used. The cost estimate may also vary with changes in the requirements, the staff size, and the environment in which the software is being developed. This paper hence focuses on feature selection of reported bug attributes by proposing an Integrated Meta-Stacked Regression Model (MSRM) that improves the cost of bug estimation and the prediction accuracy.

IJSRST184517 | Received : 01 March 2018 | Accepted : 10 March 2018 | March-April-2018 [ (4) 5 : 111-117 ]
Ajay Kumar Shrivastava et al. Int J S Res Sci. Tech. 2018 Mar-Apr;4(5) : 111-117

The remainder of this paper is organized as follows: Section II discusses the related literature and its shortcomings. Section III discusses the methods involved in implementing the various meta-classifiers. Section IV discusses the proposed MSRM process flow on various classifiers. Section V discusses the results and paves the way for further research.

II. RELATED WORK

Barry Boehm et al. [3] (2000) gave an overview of a variety of software estimation models, indicating that neural-net and dynamics-based techniques are less mature than the other classes of techniques and are challenged by the rapid changes in software technology. The key to arriving at sound estimates is to have a grasp of the factors driving the costs of the project at hand, in order to support the project planning and control functions performed by management.

S. Wang et al. [2] (2016) proposed to bridge the gap between programs’ semantics and defect prediction features with a representation-learning algorithm. A Deep Belief Network (DBN) was adopted to learn semantic features from token vectors extracted from programs’ Abstract Syntax Trees (ASTs) and was evaluated on ten open source projects. For within-project defect prediction (WPDP), the results showed improvements of 14.7% in precision, 11.5% in recall, and 14.2% in F1. For cross-project defect prediction (CPDP), the semantic-feature-based approach outperforms the state-of-the-art technique TCA+ with traditional features by 8.9% in F1 score.

The first contribution of our work is to find the set of priority attributes that affect the cost of bug estimation the most. The second is to evaluate the stacked classifiers (Regression, Neural Network, Linear SVM) with density distribution in terms of prediction accuracy and F1 score.

III. METHODOLOGY

Feature selection aims to find a Q-dimensional subset of features, Set Q ($Q \subseteq F$), that optimizes classifier performance and optionally minimizes the feature set size. The primary reasons to use feature selection are that it enables the machine learning algorithm to train faster, improves the accuracy of a model given the right subset, and substantially reduces overfitting. With a carefully chosen set $A \subset \mathbb{R}$, we can classify a new data point $x \in \mathbb{R}^d$ by checking whether $f(x) \in A$.

3.1 Logistic Regression Model
The regression model considered for classification of bug occurrence can be written in the following form. For n independent pairs $(x_i, y_i)$, $i = 1, 2, \ldots, n$, where $x'_i = (x_{0i}, x_{1i}, \ldots, x_{ni})$, and for the outcome $Y_i$, the logistic regression model assumes that

$$P(Y_i = 1 \mid x_i) = \mu(x_i) \tag{i}$$

where

$$\mu(x_i) = \frac{e^{g(x_i)}}{1 + e^{g(x_i)}} \quad \text{with} \quad g(x_i) = x'_i \beta \tag{ii}$$

The parameter estimates are obtained by maximum likelihood, and the fitted values are specified as

$$\hat{\mu}(x_i) = \mu(x_i, \hat{\beta}), \quad \text{i.e.} \quad \operatorname{logit}[\mu(x)] = x' \beta \tag{iii}$$

3.2 Linear Support Vector Model
Linear classifiers define the margin as the width by which the boundary could be increased before hitting a data point. Support vectors are those data points that the margin pushes up against. The linear classifier with the maximum margin is called the Linear Support Vector Machine (LSVM).

Figure 1. Linear Support Vectors as function of weight and bias

The linear objective is expressed as

$$\mathcal{F}(x_i, x_j) = x_i^{T} x_j \tag{iv}$$
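As a small illustration of the two base scoring functions of this section, the sketch below evaluates the logistic probability of equation (ii) and the linear SVM quantities of equations (iv) to (vi). The function names, the toy support vectors and the multipliers are assumptions made for this example; they are not values or code from the paper.

```python
import math

def dot(u, v):
    """Linear kernel of equation (iv): the inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

def logistic_probability(x, beta):
    """Equation (ii): mu(x) = e^g / (1 + e^g) with g = x' beta.

    Written as 1 / (1 + e^-g), which is algebraically the same expression.
    """
    g = dot(x, beta)
    return 1.0 / (1.0 + math.exp(-g))

def svm_weights(alphas, labels, points):
    """First part of equation (v): w = sum_i alpha_i * y_i * x_i."""
    w = [0.0] * len(points[0])
    for alpha, y, x in zip(alphas, labels, points):
        for j, value in enumerate(x):
            w[j] += alpha * y * value
    return w

def svm_bias(w, x_k, y_k):
    """Second part of equation (v): b = y_k - w^T x_k for a support vector x_k."""
    return y_k - dot(w, x_k)

def svm_decision(x, alphas, labels, points, b):
    """Equation (vi): F(x) = sum_i alpha_i * y_i * x_i^T x + b."""
    return sum(a * y * dot(p, x) for a, y, p in zip(alphas, labels, points)) + b

# Hypothetical toy problem: two support vectors on opposite sides of the origin.
points = [(1.0, 1.0), (-1.0, -1.0)]
labels = [1, -1]
alphas = [0.25, 0.25]  # assumed multipliers for this toy problem

w = svm_weights(alphas, labels, points)
b = svm_bias(w, points[0], labels[0])
score = svm_decision((2.0, 0.0), alphas, labels, points, b)
```

The sign of `score` gives the predicted class, while `logistic_probability` returns a value between 0 and 1 that can be thresholded or averaged with the outputs of other classifiers.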

International Journal of Scientific Research in Science and Technology (www.ijsrst.com) 112



The corresponding weights and bias are given as

$$w = \sum_i \alpha_i y_i x_i; \qquad b = y_k - w^{T} x_k \tag{v}$$

for any $x_k$ such that $\alpha_k \neq 0$. Here, each non-zero $\alpha_i$ indicates that the corresponding $x_i$ is a support vector, and the classifying function has the form

$$\mathcal{F}(x) = \sum_i \alpha_i y_i x_i^{T} x + b \tag{vi}$$

3.3 Radial Basis Function Neural Network Model
The objective of a neural network is to find a set of weights that produces meaningful values at the output when it is presented with particular instances of its input. Each node in the hidden layer receives input from the input layer, which is multiplied by the appropriate weights and summed.

Figure 2. The mathematical analogy of ANN with synaptic structure of Neural Systems with output f(x)

3.4 Rayleigh’s Density Distribution
The Rayleigh distribution [6] is a special case of the Weibull distribution, which provides a population model useful in several areas of statistics, including life testing and reliability study. The Rayleigh distribution RD($\theta$) is employed in parameter estimation using different types of censored and non-censored data. Its probability density function (PDF) is

$$f(x; \theta) = 2 \theta x \, e^{-\theta x^{2}}, \quad x > 0, \ \theta > 0 \tag{vii}$$

and its cumulative distribution function (CDF) is

$$F(x; \theta) = 1 - e^{-\theta x^{2}} \tag{viii}$$

IV. PROPOSED FRAMEWORK

In this paper the following steps were taken in the process of building the MSRM model:

1) Data Extraction
The bug reports of different products of the Eclipse, Mozilla and Bugzilla [5] open source software were retrieved from the CVS repository and saved in .csv format from the source http://bugzilla.mozilla.org

2) Data Pre-Processing & Preparation
Remove missing and noisy data, setting all non-zero values to 1 for the depends-on and duplicate-count attributes. Divide the dataset into training data for model creation and testing data for validation (60:40).

Figure 3. A Sample Bug report for Open Office Project with BugID 1425569

3) Modeling
Build three models based on the LSVM and RBNN classifiers and the Regression model.
Calculate the summary weight of bug attributes using the information gain criterion. Train the model using the twelve most relevant attributes to predict the bug severity.
Apply the Rayleigh probability density to predict defect density for different phases of the project life cycle.

4) Testing and Validation
Test the model on the remaining dataset. Evaluate and assess the performance of the prediction models.
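The pre-processing step and the Rayleigh functions used in the modeling step can be sketched in plain Python. This is a minimal illustration under stated assumptions: the field names `depends_on`, `duplicate_count` and `severity`, the fixed random seed and all function names are hypothetical, and the paper does not say how the 60:40 split was drawn.

```python
import math
import random

def binarize(record, fields=("depends_on", "duplicate_count")):
    """Return a copy of the record with non-zero values of `fields` set to 1."""
    cleaned = dict(record)
    for name in fields:
        cleaned[name] = 1 if cleaned.get(name) else 0
    return cleaned

def prepare(records, required=("severity",)):
    """Drop records with a missing required field, then binarize the rest."""
    return [binarize(r) for r in records
            if all(r.get(key) is not None for key in required)]

def split_60_40(records, seed=0):
    """Shuffle reproducibly and split into 60% training and 40% testing data."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.6 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def rayleigh_pdf(x, theta):
    """Equation (vii): f(x; theta) = 2 * theta * x * exp(-theta * x^2)."""
    return 2.0 * theta * x * math.exp(-theta * x * x)

def rayleigh_cdf(x, theta):
    """Equation (viii): F(x; theta) = 1 - exp(-theta * x^2)."""
    return 1.0 - math.exp(-theta * x * x)
```

Calling `prepare` on the raw rows and then `split_60_40` on the result yields the training and testing sets; `rayleigh_cdf(t, theta)` can be read as the fraction of defects expected to have surfaced by time `t` under this distribution.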


Figure 4. Overview of our proposed MSRM for defect prediction

Several experiments were conducted on the recorded bug repository to study the performance of the proposed stacked model in comparison with existing classifiers. The experiments were run on a 2.5GHz i5-3210M machine with 4GB RAM.

To measure defect prediction results, we use four evaluation metrics: Correlation, Precision, Recall, and F1 score [3].

Pearson’s Correlation: It is used as a measure for quantifying linear dependence between two continuous variables X and Y. Its value varies from -1 to +1. Pearson’s correlation is given as:

$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} \tag{ix}$$

F-measure: The weighted (by class size) average F-measure was obtained from the classification using the feature subset with the same three classifiers as above. It is the harmonic mean of precision and recall. The general formula for positive real β is given by

$$F_\beta = (1 + \beta^{2}) \cdot \frac{\text{precision} \cdot \text{recall}}{(\beta^{2} \cdot \text{precision}) + \text{recall}} \tag{x}$$

$$\text{precision} = \frac{TP}{TP + FP} \tag{xi}$$

$$\text{recall} = \frac{TP}{TP + FN} \tag{xii}$$

Here precision is the ratio of correctly classified relevant bugs (TP) to all retrieved bugs (TP + FP).

Recall is the fraction of relevant bugs retrieved out of all relevant bugs, including those that were not correctly classified (FN).

V. RESULTS AND DISCUSSION

The features are filtered according to the importance derived from the feature importance graph of the LSVM. The positive and negative values in the graph show the role of each feature in classifying the positive and negative classes. Therefore, in the case of the Linear SVM, we select the features at the extremities for both classes.
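The evaluation metrics of equations (ix) to (xii), together with the RMSE that Section V uses to compare the models, can be made concrete with a short sketch. The function names are assumptions for illustration, and any counts passed in are example inputs rather than results from the paper.

```python
import math

def precision(tp, fp):
    """Equation (xi): TP / (TP + FP)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Equation (xii): TP / (TP + FN)."""
    return tp / (tp + fn)

def f_beta(p, r, beta=1.0):
    """Equation (x); beta = 1 gives the F1 score (harmonic mean of p and r)."""
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)

def pearson(xs, ys):
    """Equation (ix): cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    # The 1/n factors of the covariance and standard deviations cancel.
    return cov / (spread_x * spread_y)

def rmse(predicted, actual):
    """Square root of the mean squared difference between prediction and observation."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)
```

For example, `f_beta(precision(tp, fp), recall(tp, fn))` returns the F1 score for one class; the class-size weighting mentioned in the text would average these per-class scores by class frequency.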


Table 1. Bug Attribute Description

| Attribute | Short Description |
|---|---|
| * Severity | The response categorical variable. It indicates how severe the problem is. |
| Bug Id | The unique numeric id of a bug. |
| Priority | Describes the importance and order in which a bug should be fixed compared to other bugs. P1 is considered the highest and P5 the lowest. |
| Resolution | Indicates what happened to this bug. |
| Status | Indicates the current state of the bug (New, Resolved, Progress). |
| Number of Comments | Bugs have comments added to them by users. Number of comments made to a bug report. |
| Create Date | When the bug was filed. |
| Dependency | If this bug cannot be fixed unless other bugs are fixed (depends on), or this bug stops other bugs being fixed (blocks), their numbers are recorded here. |
| Summary | A one-sentence summary of the problem. |
| Date of close | When the bug was closed. |
| Keywords | The administrator can define keywords which can be used to tag and categorize bugs, e.g. the Mozilla project has keywords like crash and regression. |
| Version | Defines the version of the software the bug was found in. |
| CC List | A list of people who get mail when the bug changes. Number of people in the CC list. |
| Platform and OS | These indicate the computing environment where the bug was found. |
| Number of Attachment | Number of attachments for a bug. |
| Bug Fix Time | Last resolved time minus opened time. Time to fix a bug. |

Figure 5. Bug Attributes Selected by Feature Importance Graph

Figure 6. Correlation Graph for Most relevant Feature Selection
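Selecting features at the extremities of a linear SVM importance graph can be sketched as follows. The helper name, the value of `k` and the weights in the example are invented for illustration; they are not the values behind Figure 5.

```python
def select_extreme_features(weights, k):
    """Return the k most negative and the k most positive features by weight.

    `weights` maps a feature name to its signed linear-model coefficient.
    """
    ranked = sorted(weights.items(), key=lambda item: item[1])
    negatives = [name for name, w in ranked[:k] if w < 0]
    positives = [name for name, w in ranked[::-1][:k] if w > 0]
    return negatives, positives

# Hypothetical coefficients for a handful of the attributes in Table 1.
example_weights = {
    "priority": 0.9,
    "num_comments": 0.4,
    "cc_list": 0.05,
    "version": -0.02,
    "platform": -0.3,
    "bug_fix_time": -0.7,
}
negatives, positives = select_extreme_features(example_weights, 2)
```

Features with weights close to zero, such as `cc_list` and `version` in this made-up example, fall in the middle of the ranking and are left out of both lists.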


Next, stacking was performed by applying the Rayleigh defect density when generating the mean probabilities of the classifiers; this further performance tuning improved the accuracy on the testing dataset compared with the individual classifiers. Figures 7-9 depict the results of the proposed MSRM model applied to 204545 bugs, in terms of the F1 score.

Figure 7. Bug Reports on Severity Parameter for Eclipse

Figure 8. Severity Classification for Mozilla Product

Figure 9. No. of bugs Correctly classified for Bugzilla with Severity Prediction.

The final evaluation metric is the prediction error of the stacked regression model (MSRM) compared with the NN, LSVM and Linear Regression models.

Root mean squared error (RMSE): RMSE is a quadratic scoring rule that also measures the average magnitude of the error. It is the square root of the average of squared differences between prediction and actual observation:

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^{2}} \tag{xiii}$$

where $\hat{y}_i$ is the prediction and $y_i$ the actual observation for the i-th of n bugs.

The RMSE values calculated on the bug dataset for classification and prediction were:

Table 2. Comparative Error Estimates

| Model | RMSE Value |
|---|---|
| Neural Network | 0.0184908700285 |
| Linear SVM | 0.0136835156595 |
| Linear Regression | 0.0115793611054 |
| MetaStacked Regression | 0.00989779285401 |

Figure 10. Comparative Predictive error of MSRM Model.

VI. CONCLUSION

The results demonstrate that meta-stacked regression analysis can be successfully applied to formulate a prediction model for bug classification and hence effort estimation. It is feasible to incorporate and implement defect prediction as part of the software development process, particularly the test process. The approach has been tested on datasets from open source projects (Mozilla, Eclipse, Bugzilla) and has scope for extension to other software development projects with their respective metrics.


Future enhancements include adopting the MSRM model for web-based and component-based software data sets.

VII. REFERENCES

[1]. X. Huo, M. Li, and Z.-H. Zhou, "Learning unified features from natural and programming languages for locating buggy source code," in Proceedings of IJCAI'2016.
[2]. S. Wang, T. Liu, and L. Tan, "Automatically learning semantic features for defect prediction," in ICSE'16: Proc. of the International Conference on Software Engineering, 2016.
[3]. V. Raychev, M. Vechev, and E. Yahav, "Code completion with statistical language models," in ACM SIGPLAN Notices, vol. 49, no. 6. ACM, 2014, pp. 419-428.
[4]. Jian Li, Pinjia He, Jieming Zhu, and Michael R. Lyu, "Software Defect Prediction via Convolutional Neural Network," in IEEE International Conference on Software Quality, Reliability and Security, 2017.
[5]. Z. He, F. Peters, T. Menzies, and Y. Yang, "Learning from open-source projects: An empirical study on defect prediction," in ESEM'13: Proc. of the International Symposium on Empirical Software Engineering and Measurement, 2013.
[6]. N. A. Abou-Elheggag, "Estimation for Rayleigh distribution using progressive first-failure censored data," Journal of Statistics Applications and Probability, vol. 2, no. 2, pp. 171-182, 2013.
[7]. J. Wang, B. Shen, and Y. Chen, "Compressed C4.5 models for software defect prediction," in QSIC'12: Proc. of the International Conference on Quality Software, 2012.
[8]. T. Gyimothy, R. Ferenc, I. Siket, "Empirical Validation of Object-Oriented Metrics on Open Source Software for Fault Prediction," IEEE Transactions on Software Engineering, vol. 31, no. 10, pp. 897-910, 2005.
[9]. S.W. Haider, J.W. Cangussu, K.M.L. Cooper, R. Dantu, "Estimation of Defects Based on Defect Decay Model: ED3M," IEEE Transactions on Software Engineering, vol. 34, no. 3, pp. 336-356, 2008.
[10]. R. M. El-Sagheer, "Inferences using type-II progressively censored data with binomial removals," Arabian Journal of Mathematics, vol. 4, no. 1, pp. 127-139, 2015.
[11]. K. Herzig, S. Just, and A. Zeller, "It's not a bug, it's a feature: how misclassification impacts bug prediction," in ICSE'13, pp. 392-401.
