A Developer Centered Bug Prediction Model
Abstract—Several techniques have been proposed to accurately predict software defects. These techniques generally exploit
characteristics of the code artefacts (e.g., size, complexity, etc.) and/or of the process adopted during their development and
maintenance (e.g., the number of developers working on a component) to spot components likely containing bugs. While these bug
prediction models achieve good levels of accuracy, they mostly ignore the major role played by human-related factors in the introduction
of bugs. Previous studies have demonstrated that focused developers are less prone to introduce defects than non-focused developers.
According to this observation, software components changed by focused developers should also be less error prone than components
changed by less focused developers. We capture this observation by measuring the scattering of changes performed by developers
working on a component and use this information to build a bug prediction model. Such a model has been evaluated on 26 systems
and compared with four competitive techniques. The achieved results show the superiority of our model, and its high complementarity
with respect to predictors commonly used in the literature. Based on this result, we also show the results of a “hybrid” prediction model
combining our predictors with the existing ones.
Index Terms—Scattering Metrics, Bug Prediction, Empirical Study, Mining Software Repositories
The "structural distance" between two code components is measured as the number of subsystems one needs to cross in order to reach one component from the other. The second measure (i.e., the semantic scattering) is instead meant to capture how spread, in terms of implemented responsibilities, the code components modified by a developer in a given time period are.
The conjecture behind the proposed metrics is that high levels of structural and semantic scattering make the developer more error-prone. To verify this conjecture, we built two predictors exploiting the proposed measures, and we used them in a bug prediction model. The results achieved on five software systems showed the superiority of our model with respect to (i) the Basic Code Change Model (BCCM) built using the entropy of changes [8] and (ii) a model using the number of developers working on a code component as predictor [9], [10]. Most importantly, the two scattering measures showed a high degree of complementarity with the measures exploited by the baseline prediction models.
In this paper, we extend our previous work [23] to further investigate the role played by scattered changes in bug prediction. In particular, we:
1) Extend the empirical evaluation of our bug prediction model by considering a set of 26 systems.
2) Compare our model with two additional competitive approaches, i.e., a prediction model based on the focus metrics proposed by Posnett et al. [22] and a prediction model based on structural code metrics [24], which, together with the previously considered models, i.e., the BCCM proposed by Hassan [8] and the one proposed by Ostrand et al. [9], [10], lead to a total of four different baselines considered in our study.
3) Devise and discuss the results of a hybrid bug prediction model, based on the best combination of predictors exploited by the five prediction models experimented in the paper.
4) Provide a comprehensive replication package [25] including all the raw data and working data sets of our studies.
The achieved results confirm the superiority of our model, which achieves an F-Measure 10.3% higher, on average, than the change entropy model [8], 53.7% higher, on average, than the model exploiting the number of developers working on a code component as predictor [9], 13.3% higher, on average, than the F-Measure obtained by using the developers' focus metric by Posnett et al. [22] as predictor, and 29.3% higher, on average, than the prediction model built on top of product metrics [1]. The two scattering measures confirmed their complementarity with the metrics used by the alternative prediction models. Thus, we devised a "hybrid" model providing an average boost in prediction accuracy (i.e., F-Measure) of +5% with respect to the best performing model (i.e., the one proposed in this paper).
Structure of the paper. Section 2 discusses the related literature, while Section 3 presents the proposed scattering measures. Section 4 presents the design of our empirical study and provides details about the data extraction process and analysis method. Section 5 reports the results of the study, while Section 6 discusses the threats that could affect their validity. Section 7 concludes the paper.

2 RELATED WORK
Many bug prediction techniques have been proposed in the literature in the last decade. Such techniques mainly differ in the specific predictors they use, and can be roughly classified into those exploiting product metrics (e.g., lines of code, code complexity, etc.), those relying on process metrics (e.g., change- and fault-proneness of code components), and those exploiting a mix of the two. Table 1 summarizes the related literature by grouping the proposed techniques on the basis of the metrics they exploit as predictors.
The Chidamber and Kemerer (CK) metrics [36] have been widely used in the context of bug prediction. Basili et al. [1] investigated the usefulness of the CK suite for predicting the probability of detecting faulty classes. They showed that five of the experimented metrics are actually useful in characterizing the bug-proneness of classes. The same set of metrics has been successfully exploited in the context of bug prediction by El Emam et al. [26] and Subramanyam et al. [27]. Both works reported the ability of the CK metrics in predicting buggy code components, regardless of the size of the system under analysis.
Still in terms of product metrics, Nikora et al. [28] showed that by measuring the evolution of structural attributes (e.g., number of executable statements, number of nodes in the control flow graph, etc.) it is possible to predict the number of bugs introduced during system development. Later, Gyimothy et al. [2] performed a new investigation on the relationship between CK metrics and bug proneness. Their results showed that the Coupling Between Objects metric is the best in predicting the bug-proneness of classes, while other CK metrics are untrustworthy.
Ohlsson et al. [3] focused their attention on the use of design metrics to identify bug-prone modules. They performed a study on an Ericsson industrial system showing that at least four different design metrics can be used with equivalent results, and that their performance is not statistically worse than that achieved using a model based on the project size. Zhou et al. [29] confirmed these results, showing that size-based models seem to perform as well as those based on CK metrics, except for the Weighted Methods per Class on some releases of the Eclipse system. Thus, although Bell et al. [35] showed that more complex metric-based models have more predictive power than size-based models, the latter seem to be generally useful for bug prediction.
TABLE 1
Prediction models proposed in the literature
Nagappan and Ball [4] exploited two static analysis tools to early predict the pre-release bug density. The results of their study, conducted on the Windows Server system, show that it is possible to perform a coarse-grained classification between high and low quality components with a high level of accuracy. Nagappan et al. [14] analyzed several complexity measures on five Microsoft software systems, showing that there is no evidence that a single set of measures can act universally as bug predictor. They also showed how to methodically build regression models based on similar projects in order to achieve better results. Complexity metrics in the context of bug prediction are also the focus of the work by Zimmerman et al. [5]. Their study reports a positive correlation between code complexity and bugs.
Differently from the previously discussed techniques, other approaches try to predict bugs by exploiting process metrics. Khoshgoftaar et al. [6] analyzed the contribution of debug churns (defined as the number of lines of code added or changed to fix bugs) to a model based on product metrics in the identification of bug-prone modules. Their study, conducted on two subsequent releases of a large legacy system, shows that modules exceeding a defined threshold of debug churns are often bug-prone. The reported results show a misclassification rate of just 21%.
Nagappan et al. [30] proposed a technique for early bug prediction based on the use of relative code churn measures. These metrics relate the number of churns to other factors such as LOC or file count. An experiment performed on the Windows Server system showed that relative churns are better predictors than absolute values.
Hassan and Holt [31] conjectured that a chaotic development process has bad effects on source code quality and introduced the concept of entropy of changes. Later, they also presented the top-10 list [32], a methodology to highlight to managers the top ten subsystems most likely to present bugs. The set of heuristics behind their approach includes a number of process metrics, such as considering the most recently modified, the most frequently modified, the most recently fixed, and the most frequently fixed subsystems.
Bell et al. [12] pointed out that although code churns are very effective bug predictors, they cannot improve a simpler model based on the code components' change-proneness. Kim et al. [33] presumed that faults do not occur in isolation but in bursts of related faults. They proposed the bug cache algorithm that predicts future faults considering the location of previous faults. Similarly, Nagappan et al. [34] defined a change burst as a set of consecutive changes over a period of time and proposed new metrics based on change bursts. The evaluation of the prediction capabilities of the models was performed on Windows Vista, achieving high accuracy.
Graves et al. [7] experimented with both product and process metrics for bug prediction. They observed that history-based metrics are more powerful than product metrics (i.e., change-proneness is a better indicator than LOC). Their best results were achieved using a combination of a module's age and number of changes, while combining product metrics had no positive effect on the bug prediction. They also saw no benefits provided by the inclusion of a metric based on the number of developers modifying a code component.
Moser et al. [13] performed a comparative study between product-based and process-based predictors. Their study, performed on Eclipse, highlights the superiority of process metrics in predicting buggy code components. Later, they performed a deeper study [11] on the bug prediction accuracy of process metrics, reporting that the past number of bug-fixes performed on a file (i.e., bug-proneness), the maximum changeset size occurred in a given period, and the number of changes involving a file in a given period (i.e., change-proneness) are the process metrics ensuring the best performances in bug prediction.
D'Ambros et al. [15] performed an extensive comparison of bug prediction approaches relying on process and product metrics, showing that no technique based on a single metric works better in all contexts.
Hassan [8] analyzed the complexity of the development process. In particular, he defined the entropy of changes as the scattering of code changes across time. He proposed three bug prediction models, namely the Basic Code Change Model (BCCM), the Extended Code Change Model (ECCM), and the File Code Change Model (FCCM). These models mainly differ in the choice of the temporal interval where the bug-proneness of components is studied. The reported study indicates that the proposed techniques have a stronger prediction capability than a model purely based on the amount of changes applied to code components or on the number of prior faults. Differently from our work, all these predictors consider neither the number of developers who performed changes to a component, nor how many components they changed at the same time.
Ostrand et al. [9], [10] proposed the use of the number of developers who modified a code component in a given time period as a bug predictor. Their results show that combining developers' information poorly, but positively, impacts the detection accuracy of a prediction model. Our work does not use a simple count of the developers who worked on a file, but also takes into consideration the change activities they carry out.
Bird et al. [20] investigated the relationship between different ownership measures and pre- and post-release failures. Specifically, they analyzed the developers' contribution network by means of social network analysis metrics, finding that developers having low levels of ownership tend to increase the likelihood of introducing defects. Our scattering metrics are not based on code ownership, but on the "distance" between the code components modified by a developer in a given time period.
Posnett et al. [22] investigated factors related to the one we aim at capturing in this paper, i.e., the developer's scattering. In particular, the "focus" metrics presented by Posnett et al. [22] are based on the idea that a developer performing most of her activities on a single module (a module could be a method, a class, etc.) has a higher focus on the activities she is performing and is less likely to introduce bugs. Following this conjecture, they defined two symmetric metrics, namely the Module Activity Focus metric (shortly, MAF) and the Developer Attention Focus metric (shortly, DAF) [22]. The former captures to what extent a module receives focused attention by developers. The latter measures how focused the activities of a specific developer are. As will become clearer later, our scattering measures not only take into account the frequency of changes made by developers over the different system's modules, but also consider the "distance" between the modified modules. This means that, for example, the contribution of a developer working on a high number of files all closely related to a specific responsibility might not be as "scattered" as the contribution of a developer working on few unrelated files.

Fig. 1. Example of two developers having different levels of "scattering"

3 COMPUTING DEVELOPER'S SCATTERING CHANGES
We conjecture that the developer's effort in performing maintenance and evolution tasks is proportional to the number of involved components and their spread across different subsystems. In other words, we believe that a developer working on different components scatters her attention due to continuous changes of context. This might lead to an increase of the developer's "scattering" with a consequent higher chance of introducing bugs.
To get a better idea of our conjecture, consider the situation depicted in Figure 1, where two developers, d1 (black point) and d2 (grey point), are working on the same system, during the same time period, but on different code components. The tasks performed by d1 are very focused on a specific part of the system (she mainly works on the system's GUI) and on a very targeted topic (she is mainly in charge of working on GUIs related to the users' registration and login features). On the contrary, d2 performs tasks scattered across different parts of the system (from GUIs to database management) and on different topics (users' accounts, payslips, warehouse stocks).
Our conjecture is that during the time period shown in Figure 1, the contribution of d2 might have been more "scattered" than the contribution of d1, thus having a higher likelihood of introducing bugs in the system.
To verify our conjecture we define two measures, named the structural and the semantic scattering measures, aimed at assessing the scattering of a developer d in a given time period p. Note that both measures are meant to work in object-oriented systems at the class level granularity. In other words, we measure how scattered the changes performed by developer d in the given time period p are.

3.1 Structural scattering
Let CH_{d,p} be the set of classes changed by a developer d during a time period p. We define the structural scattering measure as:

StrScat_{d,p} = |CH_{d,p}| \times \mathop{\mathrm{average}}_{\forall c_i, c_j \in CH_{d,p}} \big[ \mathrm{dist}(c_i, c_j) \big] \quad (1)

where dist is the number of packages to traverse in order to go from class c_i to class c_j; dist is computed by applying the shortest path algorithm on the graph representing the system's package structure. For example, the dist between two classes it.user.gui.c1 and it.user.business.db.c2 is three, since in order to reach c1 from c2 we need to traverse it.user.business.db, it.user.business, and it.user.gui. We (i) use the average operator for normalizing the distances between the code components modified by the developer during the time period p, and (ii) assign a higher scattering to developers working on a higher number of code components in the given time period (see |CH_{d,p}|). Note that the choice of the average to normalize the distances is driven by the fact that other central operators, such as the median, are not affected by outliers. Indeed, suppose that a developer performs a change (i.e., commit) C, modifying four files F1, F2, F3, and F4. The first three files are in the same package, while the fourth one (F4) is in a different subsystem. When computing the structural scattering for C, the median would not reflect the scattering of the change performed on F4, since half of the six pairs of files involved in the change (and in particular, F1-F2, F1-F3, F2-F3) have zero as structural distance (i.e., they are in the same package). Thus, the median would not capture the fact that C was, at least in part, a scattered change. This is instead captured by the mean, which is influenced by outliers.
To better understand how the structural scattering measure is computed and how it is possible to use it in order to estimate the developer's scattering in a time period, Figure 2 provides a running example based on a real scenario we found in Apache Ant (http://ant.apache.org/), a tool to automate the building of software projects. The tree shown in Figure 2 depicts the activity of a single developer in the time period between 2012-03-01 and 2012-04-30.

Fig. 2. Example of structural scattering

In particular, the leaves of the tree represent the classes modified by the developer in the considered time period, while the internal nodes (as well as the root node) illustrate the package structure of the system. In this example, the developer worked on the classes Target and UpToDate, both contained in the package org.apache.tools.ant.taskdefs, which groups together classes managing the definition of new commands that the Ant's user can create for customizing her own building process. In addition, the developer also modified FilterMapper, a class containing utility methods (e.g., mapping a Java String into an array), and the class ProjectHelper, responsible for parsing the build file and creating Java instances representing the build workflow. To compute the structural scattering we compute the distance between every pair of classes modified by the developer. If two classes are in the same package, as in the case of the classes Target and UpToDate, then the distance between them will be zero. Instead, if they are in different packages, like in the case of ProjectHelper and Target, their distance is the minimum number of packages one needs to traverse to reach one class from the other. For example, the distance is one between ProjectHelper and Target (we need to traverse the package taskdefs), and three between UpToDate and FilterMapper (we need to traverse the packages taskdefs, types and mappers).
After computing the distance between every pair of classes, we can compute the structural scattering. Table 2 shows the structural distances between every pair of classes involved in our example as well as the value of the structural scattering. Note that, if the developer had modified only the Target and UpToDate classes in the considered time period, then her structural scattering would have been zero (the lowest possible), since her changes were focused on just one package. By also considering the change performed to ProjectHelper, the structural scattering rises to 2.01. This is due to the number of classes involved in the change set (3) and the average of the distances among them (0.67).
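To make the computation of Equation (1) concrete, the following sketch (illustrative Python, not part of our replication package) assumes that the package structure forms a tree, so that the shortest-path distance between two packages reduces to counting the packages separating them from their closest common ancestor; the helper names are hypothetical:

from itertools import combinations

def package_distance(pkg_a, pkg_b):
    # Number of packages to traverse between two packages, assuming the
    # package hierarchy is a tree (shortest path over the hierarchy).
    a, b = pkg_a.split("."), pkg_b.split(".")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)

def structural_scattering(changed_classes):
    # changed_classes: fully qualified names of the classes changed by
    # developer d in period p (the set CH_{d,p} of Equation 1).
    if len(changed_classes) < 2:
        return 0.0
    packages = [c.rsplit(".", 1)[0] for c in changed_classes]
    dists = [package_distance(p1, p2) for p1, p2 in combinations(packages, 2)]
    return len(changed_classes) * (sum(dists) / len(dists))

# Running example of Figure 2: pairwise distances 0, 1, 1, average 0.67,
# hence 3 x 0.67 = 2.0 (reported above as 2.01 after rounding).
ant = "org.apache.tools.ant"
print(structural_scattering([ant + ".taskdefs.Target",
                             ant + ".taskdefs.UpToDate",
                             ant + ".ProjectHelper"]))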
The predictors are defined as follows:

StrScatPred_{c,p} = \sum_{d \in developers_{c,p}} StrScat_{d,p} \quad (3)

SemScatPred_{c,p} = \sum_{d \in developers_{c,p}} SemScat_{d,p} \quad (4)

where developers_{c,p} is the set of developers that worked on the component c during the time period p.
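As a small illustration (not the actual implementation) of how Equations (3) and (4) aggregate the per-developer scattering values at the component level, assuming the StrScat and SemScat values of Section 3 have already been computed:

def scattering_predictors(component, period, changes, str_scat, sem_scat):
    # changes: iterable of (developer, component, period) change records;
    # str_scat / sem_scat: dicts mapping (developer, period) to the
    # StrScat and SemScat values defined in Section 3.
    devs = {d for d, c, p in changes if c == component and p == period}
    str_pred = sum(str_scat[(d, period)] for d in devs)   # Equation (3)
    sem_pred = sum(sem_scat[(d, period)] for d in devs)   # Equation (4)
    return str_pred, sem_pred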
4 EVALUATING SCATTERING METRICS IN THE CONTEXT OF BUG PREDICTION
The goal of the study is to evaluate the usefulness of the developer's scattering measures in the prediction of bug-prone components, with the purpose of improving the allocation of resources in the verification & validation activities by focusing on components having a higher bug-proneness. The quality focus is on the detection accuracy and completeness of the proposed technique as compared to competitive approaches. The perspective is that of researchers, who want to evaluate the effectiveness of using information about developers' scattered changes in identifying bug-prone components.
The context of the study consists of 26 Apache software projects having different size and scope. Table 4 reports the characteristics of the analyzed systems, and in particular (i) the software history we investigated, (ii) the mined number of commits, (iii) the size of the active developer base (those who performed at least one commit in the analyzed time period), (iv) the system's size in terms of KLOC and number of classes, and (v) the percentage of buggy files identified (as detailed later) during the entire change history. All data used in our study are publicly available [25].

4.1 Research Questions and Baseline Selection
In the context of the study, we formulated the following research questions:
• RQ1: What are the performances of a bug prediction model based on developer's scattering measures, and how does it compare to baseline techniques proposed in the literature?
• RQ2: What is the complementarity between the proposed bug prediction model and the baseline techniques?
• RQ3: What are the performances of a "hybrid" model built by combining developer's scattering measures with baseline predictors?
In the first research question we quantify the performances of a prediction model based on developer's scattering measures (DCBM). Then, we compare its performances with respect to four baseline prediction models, one based on product metrics and the other three based on process metrics.
The first model exploits as predictor variables the CK metrics [1], and in particular size metrics (i.e., the Lines of Code—LOC—and the Number of Methods—NOM), cohesion metrics (i.e., the Lack of Cohesion of Methods—LCOM), coupling metrics (i.e., the Coupling Between Objects—CBO—and the Response for a Class—RFC), and complexity metrics (i.e., the Weighted Methods per Class—WMC). We refer to this model as CM.
We also compared our approach with three prediction models based on process metrics. The first is the one based on the work by Ostrand et al. [10], exploiting the number of developers that work on a code component in a specific time period as predictor variable (from now on, we refer to this model as DM).
The second is the Basic Code Change Model (BCCM) proposed by Hassan, using code change entropy information [8]. This choice is justified by the superiority of this model with respect to other techniques exploiting change-proneness information [11], [12], [13]. While such a superiority has already been demonstrated by Hassan [8], we also compared these techniques before choosing BCCM as one of the baselines for evaluating our approach. We found that the BCCM works better than a model that simply counts the number of changes. This is because it filters the changes that differ from the code change process (i.e., fault repairing and general maintenance modifications), considering only the Feature Introduction modifications (FI), namely the changes related to adding or enhancing features. However, we observed a high overlap between the BCCM and the model that uses the number of changes as predictor (almost 84%) on the dataset used for the comparison, probably due to the fact that the nature of the information exploited by the two models is similar. The interested reader can find the comparison between these two models in our online appendix [25].
Finally, the third baseline is a prediction model based on the Module Activity Focus metric proposed by Posnett et al. [22]. It relies on the concept of predator-prey food web existing in ecology (from now on, we refer to this model as MAF). The metric is based on the measurement of the degree to which a code component receives focused attention by developers. It can be considered as a form of ownership metric of the developers on the component. It is worth noting that we do not consider the other Developer Attention Focus metric proposed by Posnett et al., since (i) the two metrics are symmetric, and (ii) in order to provide a probability that a component is buggy, we need to qualify to what extent the activities on a file are focused, rather than measuring how focused developers' activities are. Even if Posnett et al. have not proposed a prediction model based on their metric, the results of this comparison will provide insights on the usefulness of developer's scattering measures for detecting bug-prone components.
Note that our choice of the baselines is motivated by the desire to (i) consider both models based on product and process metrics, and (ii) cover a good number of different process metrics (since our model exploits process metrics), including approaches exploiting information similar to the one used by our scattering metrics.
TABLE 4
Characteristics of the software systems used in the study
Project | Period | #Commits | #Dev. | #Classes | KLOC | % buggy classes
AMQ Dec 2005 - Sep 2015 8,577 64 2,528 949 54
Ant Jan 2000 - Jul 2014 13,054 55 1,215 266 72
Aries Sep 2009 - Sep 2015 2,349 24 1,866 343 40
Camel Mar 2007 - Sep 2015 17,767 128 12,617 1,552 30
CXF Apr 2008 - Sep 2015 10,217 55 6,466 1,232 26
Drill Sep 2012 - Sep 2015 1,720 62 1,951 535 63
Falcon Nov 2011 - Sep 2015 1,193 26 581 201 25
Felix May 2007 - May 2015 11,015 41 5,055 1,070 18
JMeter Sep 1998 - Apr 2014 10,440 34 1,054 192 37
JS2 Feb 2008 - May 2015 1,353 7 1,679 566 34
Log4j Nov 2000 - Feb 2014 3,274 21 309 59 58
Lucene Mar 2010 - May 2015 13,169 48 5,506 2,108 12
Oak Mar 2012 - Sep 2015 8,678 19 2,316 481 43
OpenEJB Oct 2011 - Jan 2013 9,574 35 4,671 823 36
OpenJPA Jun 2007 - Sep 2015 3,984 25 4,554 822 38
Pig Oct 2010 - Sep 2015 1,982 21 81,230 48,360 16
Pivot Jan 2010 - Sep 2015 1,488 8 11,339 7,809 22
Poi Jan 2002 - Aug 2014 5,742 35 2,854 542 62
Ranger Aug 2014 - Sep 2015 622 18 826 443 37
Shindig Feb 2010 - Jul 2015 2,000 27 1,019 311 14
Sling Jun 2009 - May 2015 9,848 29 3,951 1,007 29
Sqoop Jun 2011 - Sep 2015 699 22 667 134 14
Sshd Dec 2008 - Sep 2015 629 8 658 96 33
Synapse Aug 2005 - Sep 2015 2,432 24 1,062 527 13
Whirr Jun 2010 - Apr 2015 569 17 275 50 21
Xerces-J Nov 1999 - Feb 2014 5,471 34 833 260 6
In the second research question we aim at evaluating the complementarity of the different models, while in the third one we build and evaluate a "hybrid" prediction model exploiting as predictor variables the scattering measures we propose as well as the measures used by the four experimented competitive techniques (i.e., DM, BCCM, MAF, and CM). Note that we do not limit our analysis to the experimentation of a model including all predictor variables, but we exercise all 2,036 possible combinations of predictor variables to understand which is the one achieving the best performances.

4.2 Experimental process and oracle definition
To evaluate the performances of the experimented bug prediction models we need to define the machine learning classifier to use. For each prediction technique, we experimented with several classifiers, namely ADTree [39], Decision Table Majority [40], Logistic Regression [41], Multilayer Perceptron [42] and Naive Bayes [43]. We empirically compared the results achieved by the five different models on the software systems used in our study (more details on the adopted procedure later in this section). For all the prediction models the best results were obtained using the Majority Decision Table (the comparison among the classifiers can be found in our online appendix [25]). Thus, we exploit it in the implementation of the five models. This classifier can be viewed as an extension of one-valued decision trees [40]. It is a rectangular table where the columns are labeled with predictors and the rows are sets of decision rules. Each decision rule of a decision table is composed of (i) a pool of conditions, linked through and/or logical operators, which are used to reflect the structure of the if-then rules, and (ii) an outcome which mirrors the classification of a software entity respecting the corresponding rule as bug-prone or non bug-prone. The Majority Decision Table uses an attribute reduction algorithm to find a good subset of predictors with the goal of eliminating equivalent rules and reducing the likelihood of overfitting the data.
To assess the performance of the five models, we split the change history of the object systems into three-month time periods and we adopt a three-month sliding window to train and test the bug prediction models. Starting from the first time period TP1 (i.e., the one starting at the first commit), we train each model on it, and test its ability in predicting buggy classes on TP2 (i.e., the subsequent three-month time period). Then, we move the sliding window three months forward, training the classifiers on TP2 and testing their accuracy on TP3. This process is repeated until the end of the analyzed change history (see Table 4) is reached. Note that our choice of considering three-month periods is based on (i) choices made in previous work, like the one by Hassan et al. [8], and (ii) the results of an empirical assessment we performed on such a parameter, showing that the best results for all experimented techniques are achieved by using three-month periods. In particular, we experimented with time windows of one, two, three, and six months. The complete results are available in our replication package [25].
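The sliding-window procedure can be sketched as follows (illustrative Python; a scikit-learn decision tree is used here only as a stand-in for the Weka Majority Decision Table actually adopted, and build_features/build_labels are hypothetical helpers returning the predictor matrix and the buggy/non-buggy labels of each class for a period):

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_recall_fscore_support

def sliding_window_evaluation(periods, build_features, build_labels):
    # periods: chronologically ordered three-month time periods TP1, TP2, ...
    # Labels are expected to be binary (1 = buggy class, 0 = non-buggy class).
    results = []
    for train_p, test_p in zip(periods, periods[1:]):
        clf = DecisionTreeClassifier()            # stand-in classifier
        clf.fit(build_features(train_p), build_labels(train_p))
        predicted = clf.predict(build_features(test_p))
        p, r, f1, _ = precision_recall_fscore_support(
            build_labels(test_p), predicted, average="binary")
        results.append((test_p, p, r, f1))
    return results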
Finally, to evaluate the performances of the five experimented models we need an oracle reporting the presence of bugs in the source code.
Although the PROMISE repository collects a large dataset of bugs in open source systems [44], it provides oracles at release-level. Since the proposed measures work at time period-level, we had to build our own oracle. Firstly, we identified the bug fixing commits that happened during the change history of each object system by mining regular expressions containing issue IDs in the change log of the versioning system (e.g., "fixed issue #ID" or "issue ID"). After that, for each identified issue ID, we downloaded the corresponding issue report from the issue tracking system and extracted the following information: product name; issue's type (i.e., whether an issue is a bug, enhancement request, etc.); issue's status (i.e., whether an issue was closed or not); issue's resolution (i.e., whether an issue was resolved by fixing it, or it was a duplicate bug report, or a "works for me" case); issue's opening date; issue's closing date, if available.
Then, we checked each issue's report to be correctly downloaded (e.g., the issue's ID identified from the versioning system commit note could be a false positive). After that, we used the issue type field to classify the issue and distinguish bug fixes from other issue types (e.g., enhancements). Finally, we only considered bugs having Closed status and Fixed resolution. Basically, we restricted our attention to (i) issues that were related to bugs, as we used them as a measure of fault-proneness, and (ii) issues that were neither duplicate reports nor false alarms.
Once collected the set of bugs fixed in the change history of each system, we used the SZZ algorithm [45] to identify when each fixed bug was introduced. The SZZ algorithm relies on the annotation/blame feature of versioning systems. In essence, given a bug-fix identified by the bug ID k, the approach works as follows:
1) For each file f_i, i = 1...m_k, involved in the bug-fix k (m_k is the number of files changed in the bug-fix k), and fixed in its revision rel-fix_{i,k}, we extract the file revision just before the bug fixing (rel-fix_{i,k} − 1).
2) Starting from the revision rel-fix_{i,k} − 1, for each source line in f_i changed to fix the bug k, the blame feature of Git is used to identify the file revision where the last change to that line occurred. In doing that, blank lines and lines that only contain comments are identified using an island grammar parser [46]. This produces, for each file f_i, a set of n_{i,k} fix-inducing revisions rel-bug_{i,j,k}, j = 1...n_{i,k}. Thus, more than one commit can be indicated by the SZZ algorithm as responsible for inducing a bug.
By adopting the process described above we are able to approximate the periods of time during which each class of the subject systems was affected by one or more bugs (i.e., was a buggy class). In particular, given a bug-fix BF_k performed on a class c_i, we consider c_i buggy from the date in which the bug fixed in BF_k was introduced (as indicated by the SZZ algorithm) to the date in which BF_k (i.e., the patch) was committed in the repository.

4.3 Metrics and Data Analysis
Once we have defined the oracle and obtained the predicted buggy classes for every three-month period, we answer RQ1 by using three widely-adopted metrics, namely accuracy, precision and recall [38]:

accuracy = \frac{TP + TN}{TP + FP + TN + FN} \quad (5)

precision = \frac{TP}{TP + FP} \quad (6)

recall = \frac{TP}{TP + FN} \quad (7)

where TP is the number of classes containing bugs that are correctly classified as bug-prone; TN denotes the number of bug-free classes classified as non bug-prone classes; FP and FN measure the number of classes for which a prediction model fails to identify bug-prone classes by declaring bug-free classes as bug-prone (FP) or identifying actually buggy classes as non-buggy ones (FN). As an aggregate indicator of precision and recall, we also report the F-measure, defined as the harmonic mean of precision and recall:

F\text{-}measure = 2 \times \frac{precision \times recall}{precision + recall} \quad (8)

Finally, we also report the Area Under the Curve (AUC) obtained by the prediction model. The AUC quantifies the overall ability of a prediction model to discriminate between buggy and non-buggy classes. The closer the AUC to 1, the higher the ability of the classifier to discriminate classes affected and not affected by a bug. On the other hand, the closer the AUC to 0.5, the lower the accuracy of the classifier. To compare the performances obtained by DCBM with the competitive techniques, we performed the bug prediction using the four baseline models BCCM, DM, MAF, and CM on the same systems and the same periods on which we ran DCBM.
To answer RQ2, we analyzed the orthogonality of the different measures used by the five experimented bug prediction models using Principal Component Analysis (PCA). PCA is a statistical technique able to identify various orthogonal dimensions (principal components) from a set of data. It can be used to evaluate the contribution of each variable to the identified components. Through the analysis of the principal components and the contributions (scores) of each predictor to such components, it is possible to understand whether different predictors contribute to the same principal components. Two models are complementary if the predictors they exploit contribute to capturing different principal components. Hence, the analysis of the principal components provides insights on the complementarity between models. Such an analysis is necessary to assess whether the exploited predictors assign the same bug-proneness to the same set of classes.
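A possible way to carry out this kind of analysis is sketched below (illustrative Python based on scikit-learn; the squared-loading scoring shown here is one common way of deriving per-predictor contributions and is not necessarily the exact procedure adopted in our study):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_orthogonality(X, predictor_names):
    # X: one row per (class, period) observation, one column per predictor.
    pca = PCA()
    pca.fit(StandardScaler().fit_transform(X))
    variance = pca.explained_variance_ratio_       # proportion of variance per component
    loadings = pca.components_.T ** 2              # squared loadings: contribution of each
    loadings = loadings / loadings.sum(axis=0)     # predictor to each principal component
    return variance, dict(zip(predictor_names, loadings))

Predictors whose contributions concentrate on different principal components provide evidence of complementary prediction models.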
However, PCA does not tell the whole story. Indeed, using PCA it is not possible to identify to what extent a prediction model complements another and vice versa. This is the reason why we complemented the PCA by analyzing the overlap of the five prediction models. Specifically, given two prediction models m_i and m_j, we computed:

corr_{m_i \cap m_j} = \frac{|corr_{m_i} \cap corr_{m_j}|}{|corr_{m_i} \cup corr_{m_j}|}\% \quad (9)

corr_{m_i \setminus m_j} = \frac{|corr_{m_i} \setminus corr_{m_j}|}{|corr_{m_i} \cup corr_{m_j}|}\% \quad (10)

where corr_{m_i} represents the set of bug-prone classes correctly classified by the prediction model m_i, corr_{m_i \cap m_j} measures the overlap between the sets of true positives correctly identified by both models m_i and m_j, and corr_{m_i \setminus m_j} measures the percentage of bug-prone classes correctly classified by m_i only and missed by m_j. Clearly, the overlap metrics are computed by considering each combination of the five experimented detection techniques (e.g., we compute corr_{BCCM \cap DM}, corr_{BCCM \cap DCBM}, corr_{BCCM \cap CM}, corr_{DM \cap DCBM}, etc.). In addition, given the five experimented prediction models m_i, m_j, m_k, m_p, m_z, we computed:

corr_{m_i \setminus (m_j \cup m_k \cup m_p \cup m_z)} = \frac{|corr_{m_i} \setminus (corr_{m_j} \cup corr_{m_k} \cup corr_{m_p} \cup corr_{m_z})|}{|corr_{m_i} \cup corr_{m_j} \cup corr_{m_k} \cup corr_{m_p} \cup corr_{m_z}|}\% \quad (11)

which represents the percentage of bug-prone classes correctly identified only by the prediction model m_i. In the paper, we discuss the results obtained when analyzing the complementarity between our model and the baseline ones. The other results concerning the complementarity between the baseline approaches are available in our online appendix [25].
Finally, to answer RQ3 we build and assess the performances of a "hybrid" bug prediction model exploiting different combinations of the predictors used by the five experimented models (i.e., DCBM, BCCM, DM, MAF, and CM). Firstly, we assess the boost in performances (if any) provided by our scattering metrics when plugged into the four competitive models, similarly to what has been done by Bird et al. [20], who explained the relationship between ownership metrics and bugs by building regression models in which the metrics are added incrementally in order to evaluate their impact on increasing/decreasing the likelihood of developers to introduce bugs.
Then, we create a "comprehensive baseline model" featuring all predictors exploited by the four competitive models and, again, we assess the possible boost in performances provided by our two scattering metrics when added to such a comprehensive model. Clearly, simply combining together the predictors used by the five models could lead to sub-optimal results, due for example to model overfitting.
Thus, we also investigate the subset of predictors actually leading to the best prediction accuracy. To this aim, we use the wrapper approach proposed by Kohavi and John [47]. Given a training set built using all the features available, the approach systematically exercises all the possible subsets of features against a test set, thus assessing their accuracy. Also in this case we used the Majority Decision Table [40] as machine learner.
In our study, we considered as training set the penultimate three-month period of each subject system, and as test set the last three-month period of each system. Note that this analysis has not been run on the whole change history of the software systems due to its high computational cost. Indeed, experimenting all possible combinations of the eleven predictors means running 2,036 different prediction models across each of the 26 systems (52,936 overall runs). This required approximately eight weeks on four Linux laptops having two dual-core 3.10 GHz CPUs and 4 GB of RAM.
Once obtained all the accuracy metrics for each combination, we analyzed these data in two steps. Firstly, we plot the distribution of the average F-measure obtained by the 2,036 different combinations over the 26 software systems. Then we discuss the performances of the top five configurations, comparing the results with the ones achieved by (i) each of the five experimented models, (ii) the models built by plugging in the scattering metrics as additional features in the four baseline models, and (iii) the comprehensive prediction models that include all the metrics exploited by the four baseline models plus our scattering metrics.

5 ANALYSIS OF THE RESULTS
In this section we discuss the results achieved aiming at answering the formulated research questions.

5.1 RQ1: On the Performances of DCBM and Its Comparison with the Baseline Techniques
Table 5 reports the results—in terms of AUC-ROC, accuracy, precision, recall, and F-measure—achieved by the five experimented bug prediction models, i.e., our model exploiting the developer's scattering metrics (DCBM), the BCCM proposed by Hassan [8], a prediction model that uses as predictor the number of developers that work on a code component (DM) [9], [10], the prediction model based on the degree to which a module receives focused attention by developers (MAF) [22], and a prediction model exploiting product metrics capturing size, cohesion, coupling, and complexity of code components (CM) [1].
The achieved results indicate that the proposed prediction model (i.e., DCBM) ensures better prediction accuracy as compared to the competitive techniques. Indeed, the area under the ROC curve of DCBM ranges between 62% and 91%, outperforming the competitive models.
TABLE 5
AUC-ROC, Accuracy, Precision, Recall, and F-Measure of the five bug prediction models
System | DCBM (AUC-ROC, Accuracy, Precision, Recall, F-measure) | DM (AUC-ROC, Accuracy, Precision, Recall, F-measure) | BCCM (AUC-ROC, Accuracy, Precision, Recall, F-measure)
AMQ 83% 53% 42% 53% 47% 58% 24% 18% 19% 19% 61% 52% 33% 49% 39%
Ant 88% 69% 66% 72% 69% 69% 26% 28% 37% 31% 67% 63% 67% 68% 68%
Aries 86% 56% 51% 54% 52% 65% 23% 23% 25% 24% 58% 50% 34% 45% 39%
Camel 81% 55% 51% 55% 53% 51% 27% 18% 28% 22% 50% 39% 39% 46% 42%
CFX 79% 94% 88% 94% 91% 54% 25% 19% 25% 21% 71% 79% 86% 84% 85%
Drill 63% 53% 45% 48% 46% 58% 23% 22% 39% 28% 52% 39% 14% 25% 18%
Falcon 75% 98% 96% 98% 97% 50% 25% 20% 21% 21% 75% 89% 86% 90% 88%
Felix 88% 70% 69% 67% 68% 50% 25% 17% 30% 22% 59% 61% 60% 65% 63%
JMeter 91% 77% 72% 68% 70% 50% 29% 24% 53% 33% 69% 65% 65% 63% 64%
JS2 62% 87% 83% 86% 84% 50% 26% 22% 17% 19% 58% 81% 70% 74% 72%
Log4j 89% 71% 62% 66% 64% 50% 19% 13% 26% 17% 52% 43% 36% 78% 49%
Lucene 77% 84% 79% 83% 81% 54% 27% 22% 30% 26% 63% 72% 61% 86% 71%
Oak 67% 97% 95% 97% 96% 52% 27% 15% 29% 19% 66% 95% 92% 80% 86%
OpenEJB 82% 98% 97% 98% 98% 50% 22% 25% 20% 22% 78% 95% 81% 91% 85%
OpenJPA 83% 79% 71% 77% 74% 51% 20% 20% 38% 26% 78% 72% 61% 68% 64%
Pig 79% 89% 79% 89% 84% 50% 22% 21% 37% 27% 73% 71% 64% 75% 69%
Pivot 78% 86% 75% 86% 80% 53% 26% 19% 24% 21% 68% 69% 71% 79% 75%
Poi 87% 68% 88% 59% 71% 50% 25% 34% 16% 22% 66% 60% 74% 49% 59%
Ranger 77% 95% 90% 95% 93% 50% 28% 18% 19% 19% 76% 92% 83% 91% 87%
Shindig 73% 66% 50% 65% 56% 50% 24% 23% 23% 23% 58% 58% 43% 61% 50%
Sling 62% 85% 76% 84% 80% 57% 21% 17% 18% 18% 61% 80% 62% 68% 65%
Sqoop 78% 98% 96% 98% 97% 55% 26% 19% 32% 23% 77% 97% 90% 89% 90%
Sshd 86% 70% 59% 70% 64% 55% 24% 19% 36% 25% 69% 52% 49% 54% 52%
Synapse 67% 62% 50% 62% 56% 53% 23% 17% 24% 20% 64% 49% 48% 56% 52%
Whirr 76% 98% 95% 98% 97% 52% 26% 20% 24% 21% 74% 96% 84% 88% 86%
Xerces-J 83% 94% 94% 88% 91% 52% 49% 28% 35% 31% 71% 74% 59% 80% 68%
System | CM (AUC-ROC, Accuracy, Precision, Recall, F-measure) | MAF (AUC-ROC, Accuracy, Precision, Recall, F-measure)
AMQ 55% 43% 37% 41% 39% 56% 56% 38% 45% 41%
Ant 58% 38% 28% 33% 30% 60% 59% 60% 62% 61%
Aries 56% 38% 28% 33% 30% 51% 45% 30% 43% 35%
Camel 41% 42% 44% 41% 42% 50% 38% 35% 38% 36%
CFX 53% 52% 55% 46% 50% 76% 75% 82% 73% 77%
Drill 50% 34% 26% 32% 29% 52% 32% 22% 29% 25%
Falcon 51% 52% 45% 54% 49% 71% 81% 70% 81% 75%
Felix 53% 55% 53% 51% 52% 67% 56% 62% 65% 63%
JMeter 50% 43% 44% 43% 43% 68% 58% 61% 59% 60%
JS2 50% 43% 44% 43% 43% 62% 80% 72% 78% 75%
Log4j 50% 35% 38% 31% 34% 52% 51% 44% 58% 52%
Lucene 50% 35% 38% 31% 34% 65% 66% 66% 76% 70%
Oak 52% 46% 54% 55% 54% 64% 88% 89% 78% 83%
OpenEJB 61% 62% 66% 57% 61% 80% 78% 77% 79% 78%
OpenJPA 51% 55% 59% 45% 51% 67% 70% 57% 67% 62%
Pig 57% 62% 58% 52% 55% 69% 68% 62% 68% 65%
Pivot 50% 41% 47% 40% 43% 66% 64% 65% 69% 67%
Poi 58% 65% 61% 65% 63% 61% 55% 58% 56% 57%
Ranger 60% 65% 61% 65% 63% 76% 81% 77% 82% 79%
Shindig 52% 48% 36% 46% 40% 54% 55% 39% 59% 47%
Sling 55% 38% 35% 41% 38% 61% 76% 59% 63% 61%
Sqoop 53% 59% 59% 64% 61% 78% 92% 89% 84% 87%
Sshd 50% 28% 26% 31% 28% 67% 48% 46% 52% 49%
Synapse 56% 43% 47% 52% 49% 61% 47% 47% 53% 50%
Whirr 54% 61% 55% 63% 59% 69% 82% 82% 82% 82%
Xerces-J 58% 61% 55% 63% 59% 65% 71% 68% 75% 72%
In particular, the Developer Model achieves an AUC between 50% and 69%, the Basic Code Change Model between 50% and 78%, the MAF model between 50% and 78%, and the CM model between 41% and 61%. Also in terms of accuracy, precision and recall (and, consequently, of F-measure) DCBM achieves better results. In particular, across all the different object systems, DCBM achieves a higher F-measure with respect to DM (mean=+53.7%), BCCM (mean=+10.3%), MAF (mean=+13.3%), and CM (mean=+29.3%). The higher values achieved for precision and recall indicate that DCBM produces fewer false positives (i.e., non-buggy classes indicated as buggy ones) while also being able to identify more classes actually affected by a bug as compared to the competitive models. Moreover, when considering the AUC, we observed that DCBM reaches higher values with respect to the competitive bug prediction approaches. This result highlights how the proposed model performs better in discriminating between buggy and non-buggy classes.
Interesting is the case of Xerces-J, where DCBM is able to identify buggy classes with 94% accuracy (see Table 5), as compared to the 74% achieved by BCCM, 49% of DM, 71% of MAF, and 59% of CM. We looked into this project to understand the reasons behind such a strong result. We found that the Xerces-J's buggy classes are often modified by few developers that, on average, perform a small number of changes on them. As an example, the class XSSimpleTypeDecl of the package org.apache.xerces.impl.dv.xs has been modified only twice between May 2008 and July 2008 (one of the three-month periods considered in our study) by two developers. However, the sum of their structural and semantic scattering in that period was very high (161 and 1,932, respectively). It is worth noting that if a low number of developers work on a file, they have higher chances to be considered as the owners of that file. This means that, in the case of the MAF model, the probability that the class is bug-prone decreases. At the same time, models based on the change entropy (BCCM) or on the number of developers modifying a class (DM) experience difficulties in identifying this class as buggy due to the low number of changes it underwent and to the low number of involved developers, respectively. Conversely, our model does not suffer from such a limitation thanks to the exploited developers' scattering information.
Finally, the CM model relying on product metrics fails in the prediction since the class has code metrics comparable with the average metrics of the system (e.g., the CBO of the class is 12, while the average CBO of the system is 14).
Looking at the other prediction models, we can observe that the model based only on the number of developers working on a code component never achieves an accuracy higher than 49%. This result confirms what was previously demonstrated by Ostrand et al. [10], [9] on the limited impact of individual developer data on bug prediction.
Regarding the other models, we observe that the information about the ownership of a class as well as the code metrics and the entropy of changes have a stronger predictive power compared to the number of developers. However, they still exhibit a lower prediction accuracy with respect to what is allowed by the developer scattering information.
In particular, we observed that the MAF model has good performances when it is adopted on well-modularized systems, i.e., systems grouping in the same package classes implementing related responsibilities. Indeed, MAF achieved the highest accuracy on the Apache CFX, Apache OpenEJB, and Apache Sqoop systems, where the average modularization quality (MQ) [48] is 0.84, 0.79, and 0.88, respectively. The reason behind this result is that a high modularization quality often corresponds to a good distribution of developers' activities. For instance, the average number of developers per package working on Apache CFX is 5. As a consequence, the focus of developers on specific code entities is high. The same happens on Apache OpenEJB and Apache Sqoop, where the average number of developers per package is 3 and 7, respectively. However, even if the developers mainly focus their attention on few packages, in some cases they also apply changes to classes contained in other packages, increasing their chances of introducing bugs. This is the reason why our prediction model still continues to work better in such cases. A good example is the one of the class HBaseImportJob, contained in the package org.apache.sqoop.mapreduce of the project Apache Sqoop. Only two developers worked on this class over the time period between July 2013 and September 2013; however, the same developers have been involved in the maintenance of the class HiveImport of the package com.cloudera.sqoop.hive. Even if the two classes share the goal of importing data from other projects into Sqoop, they implement significantly different mechanisms for importing data. This results in a higher proneness of introducing bugs. The sum of the structural and semantic scattering in that period for the two developers reached 86 and 92, respectively, causing the correct prediction of the buggy file by our model, and an error in the prediction of the MAF model.
The BCCM [8] often achieves a good prediction accuracy. This is due to the higher change-proneness of components being affected by bugs. As an example, in the JS2 project, the class PortalAdministrationImpl of the package org.apache.jetspeed.administration has been modified 19 times between January and March 2010. Such a high change frequency led to the introduction of a bug. However, such a conjecture is not always valid. Let us consider the Apache Aries project, in which BCCM obtained a low accuracy (recall=45%, precision=34%). Here we found several classes with high change-proneness that were not subject to any bug. For instance,
TABLE 7
Results achieved by applying Principal Component Analysis
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11
Proportion of Variance 0.39 0.16 0.11 0.10 0.06 0.05 0.03 0.03 0.03 0.02 0.02
Cumulative Variance 0.39 0.55 0.66 0.76 0.82 0.87 0.90 0.92 0.95 0.97 1.00
Structural scattering predictor 0.69 - - 0.08 0.04 - - - - - -
Semantic scattering predictor - 0.51 0.33 0.16 0.03 - - - - - -
Change entropy 0.07 0.34 0.45 0.25 0.11 0.22 - 0.01 - - -
Number of Developers - - 0.05 0.02 0.29 - 0.04 0.05 0.01 - 0.07
MAF 0.04 0.11 - 0.38 0.45 - 0.21 0.04 0.06 - 0.1
LOC 0.04 - 0.01 - 0.03 0.07 0.18 0.21 0.11 0.09 0.33
CBO 0.1 0.04 0.05 0.07 - 0.56 0.2 0.33 0.21 0.44 0.12
LCOM 0.01 - 0.04 - 0.01 - 0.24 0.1 0.06 0.09 0.05
NOM 0.03 - 0.01 0.01 - 0.11 - 0.12 0.43 0.22 0.1
RFC 0.01 - 0.04 0.01 0.03 - 0.13 0.06 0.12 0.1 0.09
WMC 0.01 - 0.02 0.02 0.01 0.04 - 0.08 - 0.06 0.14
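A table like Table 7 can be reproduced, under our own assumptions about the data layout, with a standard PCA implementation. The sketch below assumes a pandas DataFrame with one row per class and time period and one column per predictor (the file name and column names are hypothetical); it prints the proportion and cumulative proportion of variance for each principal component.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical predictor matrix: one row per (class, time period).
columns = ["structural_scattering", "semantic_scattering", "change_entropy",
           "n_developers", "maf", "loc", "cbo", "lcom", "nom", "rfc", "wmc"]
predictors = pd.read_csv("predictors.csv")[columns]

# The predictors live on very different scales, so standardize before PCA.
pca = PCA()
pca.fit(StandardScaler().fit_transform(predictors))

explained = pd.Series(pca.explained_variance_ratio_,
                      index=[f"PC{i + 1}" for i in range(len(columns))])
print(explained.round(2))           # proportion of variance per component
print(explained.cumsum().round(2))  # cumulative variance, as in the first two rows of Table 7
```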
TABLE 10
Overlap Analysis between DCBM and CM
TABLE 11
Overlap Analysis between DCBM and MAF
TABLE 12
Overlap Analysis considering each Model independently
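Tables 10-12 compare the sets of buggy classes correctly identified by pairs of models. A common way to quantify such an overlap, and the one we assume here purely for illustration, is to report the percentage of actual buggy classes caught by both models, by only the first, and by only the second:

```python
def overlap_analysis(buggy: set[str], pred_a: set[str], pred_b: set[str]) -> dict[str, float]:
    """Percentage of actual buggy classes correctly identified by both models,
    by only model A, and by only model B (illustrative formulation)."""
    hits_a, hits_b = buggy & pred_a, buggy & pred_b
    total = len(buggy) or 1
    return {
        "both":   100 * len(hits_a & hits_b) / total,
        "only_a": 100 * len(hits_a - hits_b) / total,
        "only_b": 100 * len(hits_b - hits_a) / total,
    }

# Hypothetical example: the classes actually buggy in a period versus the
# classes each model predicted as buggy.
print(overlap_analysis(
    buggy={"A", "B", "C", "D"},
    pred_a={"A", "B", "X"},   # e.g., predictions of DCBM
    pred_b={"B", "C", "Y"},   # e.g., predictions of CM
))
```

A high share of classes caught by only one of the two models indicates complementarity, which is the motivation for the "hybrid" models discussed below.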
models we discuss in the following.
The second part of Table 13 (i.e., Boost provided by our scattering metrics to each baseline model) reports the performances of the four competitive bug prediction models when augmented with our predictors. The boost provided by our metrics is evident in all the baseline models. Such a boost goes from a minimum of +8% in terms of F-Measure (for the model based on change entropy) up to +49% for the model exploiting the number of developers as predictor. However, it is worth noting that the combined models do not seem to improve on the performances of our DCBM model.
The third part of Table 13 (i.e., Boost provided by our scattering metrics to a comprehensive baseline model) seems to tell a different story. In this case, we combined all predictors belonging to the four baseline models into a single, comprehensive bug prediction model and assessed its performances. Then, we added our scattering metrics to such a comprehensive baseline model and assessed its performances again. As can be seen from Table 13, the performances of the two models (i.e., the one with and the one without our scattering metrics) are almost the same (F-measure=71% for both of them). This suggests the absence of any impact (positive or negative) of our metrics on the model's performances, which is unexpected considering the previously performed analyses. Such a result might be due to the high number of predictor variables exploited by the model (eleven in this case), possibly causing the model to overfit the training sets, with consequent bad performances on the test sets. Again, the combination of predictors does not seem to improve on the performances of our DCBM model. Thus, as explained in Section 4.3, to verify the possibility of building an effective hybrid model, we exhaustively investigated the combinations of predictors leading to the best prediction accuracy by using the wrapper approach proposed by Kohavi and John [47].
Figure 4 plots the average F-measure obtained by each of the 2,036 combinations of predictors experimented with. The first thing that leaps to the eye is the very high variability of the performances obtained by the different combinations of predictors, ranging between a minimum of 62% and a maximum of 79% (mean=70%, median=71%). The bottom part of Table 13 (i.e., Top-5 predictor combinations obtained from the wrapper selection algorithm) reports the performances of the top five predictor combinations. The best configuration, achieving an average F-Measure of 79%, exploits as predictors the CBO coupling metric [1], the change entropy by Hassan [8], the structural and semantic scattering defined in this paper, and the module activity focus by Posnett et al. [22]. Such a configuration also exhibits a very high AUC (90%) and represents a substantial improvement in prediction accuracy over the best model used in isolation (i.e., DCBM with an average F-Measure of 74% and an AUC of 76%) as well as over the comprehensive model exploiting all the baselines' predictors
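The exhaustive search over predictor combinations can be approximated with a short script. The sketch below is our own simplification of a wrapper-style selection [47]: it enumerates every subset of at least two of the eleven predictors (2^11 - 11 - 1 = 2,036 subsets), evaluates each with a classifier, and keeps the combinations with the best F-measure. The file name, column names, classifier, and hold-out split are assumptions for illustration; the paper's own setup (three-month sliding windows and its specific learner) is not reproduced here.

```python
from itertools import combinations

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

ALL_PREDICTORS = ["structural_scattering", "semantic_scattering", "change_entropy",
                  "n_developers", "maf", "loc", "cbo", "lcom", "nom", "rfc", "wmc"]

data = pd.read_csv("predictors.csv")  # hypothetical dataset, one row per class and period
X_all, y = data[ALL_PREDICTORS], data["buggy"]
X_train, X_test, y_train, y_test = train_test_split(X_all, y, test_size=0.3, random_state=42)

results = []
for size in range(2, len(ALL_PREDICTORS) + 1):        # subsets of at least two predictors
    for subset in combinations(ALL_PREDICTORS, size):
        clf = RandomForestClassifier(n_estimators=100, random_state=42)
        clf.fit(X_train[list(subset)], y_train)
        results.append((f1_score(y_test, clf.predict(X_test[list(subset)])), subset))

for f1, subset in sorted(results, reverse=True)[:5]:  # top-5 combinations, as in Table 13
    print(f"F-measure={f1:.2f}  predictors={subset}")
```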
TABLE 13
RQ3: Performances of “hybrid” prediction models (all values in %)
Model Avg. AUC-ROC Avg. Accuracy Avg. Precision Avg. Recall Avg. F-measure
Performances of each experimented model
DM 51 24 19 25 21
BCCM 63 70 61 69 64
CM 52 46 44 45 44
MAF 62 65 59 64 61
DCBM 76 77 72 77 74
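As a quick consistency check on the DCBM row (our own arithmetic; note that an F-measure averaged across systems need not coincide exactly with the F-measure computed from the averaged precision and recall), the standard formula yields a value close to the reported 74%:

```latex
F = \frac{2 \cdot P \cdot R}{P + R}
  = \frac{2 \cdot 0.72 \cdot 0.77}{0.72 + 0.77}
  \approx 0.74
```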
• Imprecision in issue classification made by issue-tracking systems [19]: while we cannot exclude the misclassification of issues (e.g., an enhancement classified as a bug), at least all the systems considered in our study used Bugzilla as issue tracking system, explicitly pointing to bugs in the issue type field;
• Undocumented bugs present in the system: while we relied on the issue tracker to identify the bugs fixed during the change history of the object systems, it is possible that undocumented bugs were present in some classes, leading to wrong classifications of buggy classes as “clean” ones.
• Approximations due to identifying fix-inducing changes using the SZZ algorithm [45]: at least we used heuristics to limit the number of false positives, for example excluding blank and comment lines from the set of fix-inducing changes.
Threats to internal validity concern external factors we did not consider that could affect the variables being investigated. We computed the developer's scattering measures by analyzing the developers' activity on a single software system. However, it is well known that, especially in open source communities and ecosystems, developers contribute to multiple projects in parallel [53]. This might negatively influence the “developer's scattering” assessment made by our metrics. Still, the results of our approach can only improve when considering more sophisticated ways of computing our metrics.
Threats to conclusion validity concern the relation between the treatment and the outcome. The metrics used to evaluate our defect prediction approach (i.e., accuracy, precision, recall, F-Measure, and AUC) are widely used in the evaluation of the performances of defect prediction techniques [15]. Moreover, we used appropriate statistical procedures (i.e., PCA [54]) and the computation of overlap metrics to study the orthogonality between our model and the competitive ones. Since we needed to exploit change-history information to compute the scattering metrics we proposed, the evaluation design adopted in our study differs from the k-fold cross validation [55] generally exploited when evaluating bug prediction techniques. In particular, we split the change history of the object systems into three-month time periods and adopted a three-month sliding window to train and test the experimented bug prediction models (an illustrative sketch of this sliding-window setup is provided after Section 7). This type of validation is typically adopted when using process metrics as predictors [8], although it might be penalizing when using product metrics, which are typically assessed using a ten-fold cross validation. Furthermore, although we selected a model exploiting a set of product metrics previously shown to be effective in the context of bug prediction [1], the poor performances of the CM model might be due to the fact that the model relies on too many predictors, resulting in model overfitting. This conjecture is supported by the results achieved in the context of RQ3, where we found that the top five “hybrid” prediction models include only a subset of code metrics.
Threats to external validity concern the generalization of results. We analyzed 26 Apache systems from different application domains and with different characteristics (number of developers, size, number of classes, etc.). However, systems from different ecosystems should be analyzed to corroborate our findings.
7 CONCLUSION AND FUTURE WORK
A lot of effort has been devoted in the last decade to analyzing the influence of the development process on the likelihood of introducing bugs. Several empirical studies have been carried out to assess under which circumstances and during which coding activities developers tend to introduce bugs. In addition, bug prediction techniques built on top of process metrics have been proposed. However, changes in source code are made by developers who often work under stressful conditions due to the need to deliver their work as soon as possible.
The role of developer-related factors in the bug prediction field is still a partially explored area. This paper makes a further step ahead by studying the role played by the developer's scattering in bug prediction. Specifically, we defined two measures that consider the amount of code components a developer modifies in a given time period and how these components are spread structurally (structural scattering) and in terms of the responsibilities they implement (semantic scattering). The defined measures have been evaluated as bug predictors in an empirical study performed on 26 open source systems. In particular, we built a prediction model exploiting our measures and compared its prediction accuracy with four baseline techniques exploiting process metrics as predictors. The achieved results showed the superiority of our model and its high level of complementarity with respect to the considered competitive techniques. We also built and experimented with a “hybrid” prediction model on top of the eleven predictors exploited by the five competitive techniques. The achieved results show that (i) the “hybrid” model is able to achieve a higher accuracy with respect to each of the five models taken in isolation, and (ii) the predictors proposed in this paper play a major role in the best performing “hybrid” prediction models.
Our future research agenda includes a deeper investigation of the factors causing developers' scattering and negatively impacting their ability to deal with code change tasks. We plan to reach such an objective by performing a large survey with industrial and open source developers. We also plan to apply our technique at different levels of granularity, to verify whether we can point out buggy code components at a finer level of granularity (e.g., methods).
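As anticipated in the discussion of conclusion validity, the sliding-window evaluation can be summarized by the short sketch below. It is our own simplification: the file name, period labels, feature layout, and classifier are hypothetical, and the actual study may pair training and test periods differently.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Hypothetical dataset: one row per (class, three-month period), with the
# predictor values computed in that period and a boolean 'buggy' label.
data = pd.read_csv("class_periods.csv")
periods = sorted(data["period"].unique())
features = [c for c in data.columns if c not in ("class", "period", "buggy")]

scores = []
for train_p, test_p in zip(periods, periods[1:]):   # train on one period, test on the next
    train, test = data[data["period"] == train_p], data[data["period"] == test_p]
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(train[features], train["buggy"])
    scores.append(f1_score(test["buggy"], clf.predict(test[features])))

print(f"average F-measure across windows: {sum(scores) / len(scores):.2f}")
```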
[42] F. Rosenblatt, Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books, 1961.
[43] G. H. John and P. Langley, “Estimating continuous distributions in Bayesian classifiers,” in Eleventh Conference on Uncertainty in Artificial Intelligence. San Mateo: Morgan Kaufmann, 1995, pp. 338–345.
[44] T. Menzies, B. Caglayan, Z. He, E. Kocaguneli, J. Krall, F. Peters, and B. Turhan. (2012, June) The promise repository of empirical software engineering data. [Online]. Available: http://promisedata.googlecode.com
[45] J. Sliwerski, T. Zimmermann, and A. Zeller, “When do changes induce fixes?” in Proceedings of the 2005 International Workshop on Mining Software Repositories, MSR 2005. ACM, 2005.
[46] L. Moonen, “Generating robust parsers using island grammars,” in Reverse Engineering, 2001. Proceedings. Eighth Working Conference on, 2001, pp. 13–22.
[47] R. Kohavi and G. H. John, “Wrappers for feature subset selection,” Artif. Intell., vol. 97, no. 1-2, pp. 273–324, Dec. 1997. [Online]. Available: http://dx.doi.org/10.1016/S0004-3702(97)00043-X
[48] S. Mancoridis, B. S. Mitchell, C. Rorres, Y.-F. Chen, and E. R. Gansner, “Using automatic clustering to produce high-level system organizations of source code,” in Proceedings of the 6th International Workshop on Program Comprehension. Ischia, Italy: IEEE CS Press, 1998.
[49] W. J. Conover, Practical Nonparametric Statistics, 3rd ed. Wiley, 1998.
[50] R. J. Grissom and J. J. Kim, Effect Sizes for Research: A Broad Practical Approach, 2nd ed. Lawrence Erlbaum Associates, 2005.
[51] C. Bird, A. Bachmann, E. Aune, J. Duffy, A. Bernstein, V. Filkov, and P. Devanbu, “Fair and balanced?: Bias in bug-fix datasets,” in Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on The Foundations of Software Engineering, ser. ESEC/FSE ’09. New York, NY, USA: ACM, 2009, pp. 121–130. [Online]. Available: http://doi.acm.org/10.1145/1595696.1595716
[52] K. Herzig and A. Zeller, “The impact of tangled code changes,” in Proceedings of the 10th Working Conference on Mining Software Repositories, MSR ’13, San Francisco, CA, USA, May 18-19, 2013, pp. 121–130.
[53] G. Bavota, G. Canfora, M. Di Penta, R. Oliveto, and S. Panichella, “The evolution of project inter-dependencies in a software ecosystem: The case of Apache,” in Software Maintenance (ICSM), 2013 29th IEEE International Conference on, Sept 2013, pp. 280–289.
[54] I. Jolliffe, Principal Component Analysis. John Wiley & Sons, Ltd, 2005. [Online]. Available: http://dx.doi.org/10.1002/0470013192.bsa501
[55] P. A. Devijver and J. Kittler, Pattern Recognition: A Statistical Approach, 1982.