
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TSE.2017.2659747, IEEE Transactions on Software Engineering. 0098-5589 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

A Developer Centered Bug Prediction Model


Dario Di Nucci¹, Fabio Palomba¹, Giuseppe De Rosa¹, Gabriele Bavota², Rocco Oliveto³, Andrea De Lucia¹
¹ University of Salerno, Fisciano (SA), Italy; ² Università della Svizzera italiana (USI), Switzerland; ³ University of Molise, Pesche (IS), Italy

[email protected], [email protected], [email protected]


[email protected], [email protected], [email protected]

Abstract—Several techniques have been proposed to accurately predict software defects. These techniques generally exploit characteristics of the code artefacts (e.g., size, complexity, etc.) and/or of the process adopted during their development and maintenance (e.g., the number of developers working on a component) to spot components likely to contain bugs. While these bug prediction models achieve good levels of accuracy, they mostly ignore the major role played by human-related factors in the introduction of bugs. Previous studies have demonstrated that focused developers are less prone to introducing defects than non-focused developers. According to this observation, software components changed by focused developers should also be less error-prone than components changed by less focused developers. We capture this observation by measuring the scattering of the changes performed by the developers working on a component and use this information to build a bug prediction model. Such a model has been evaluated on 26 systems and compared with four competitive techniques. The achieved results show the superiority of our model, and its high complementarity with respect to predictors commonly used in the literature. Based on this result, we also report the results of a “hybrid” prediction model combining our predictors with the existing ones.

Index Terms—Scattering Metrics, Bug Prediction, Empirical Study, Mining Software Repositories

1 INTRODUCTION

Bug prediction techniques are used to identify areas of software systems that are more likely to contain bugs. These prediction models represent an important aid when the resources available for testing are scarce, since they can indicate where to invest such resources. The scientific community has developed several bug prediction models that can be roughly classified into two families, based on the information they exploit to discriminate between “buggy” and “clean” code components. The first set of techniques exploits product metrics (i.e., metrics capturing intrinsic characteristics of the code components, like their size and complexity) [1], [2], [3], [4], [5], while the second one focuses on process metrics (i.e., metrics capturing specific aspects of the development process, like the frequency of changes performed to code components) [6], [7], [8], [9], [10], [11], [12]. While some studies highlighted the superiority of the latter with respect to product-metric-based techniques [7], [13], [11], there is a general consensus that no technique is the best in all contexts [14], [15]. For this reason, the research community is still spending effort in investigating under which circumstances and during which coding activities developers tend to introduce bugs (see e.g., [16], [17], [18], [19], [20], [21], [22]). Some of these studies have highlighted the central role played by developer-related factors in the introduction of bugs.

In particular, Eyolfson et al. [17] showed that more experienced developers tend to introduce fewer faults in software systems. Rahman and Devanbu [18] partly contradicted the study by Eyolfson et al. by showing that the experience of a developer has no clear link with bug introduction. Bird et al. [20] found that high levels of ownership are associated with fewer bugs. Finally, Posnett et al. [22] showed that focused developers (i.e., developers focusing their attention on a specific part of the system) introduce fewer bugs than unfocused developers.

Although such studies showed the potential of human-related factors in bug prediction, this information is not captured in state-of-the-art bug prediction models based on process metrics extracted from version history. Indeed, previous bug prediction models exploit predictors based on (i) the number of developers working on a code component [9], [10]; (ii) the analysis of change-proneness [13], [11], [12]; and (iii) the entropy of changes [8]. Thus, despite the previously discussed finding by Posnett et al. [22], none of the proposed bug prediction models considers how focused the developers performing changes are and how scattered these changes are.

In our previous work [23] we studied the role played by scattered changes in bug prediction. We defined two measures, namely the developer's structural and semantic scattering. The first assesses how “structurally far” in the software project the code components modified by a developer in a given time period are.




The “structural distance” between two code components is measured as the number of subsystems one needs to cross in order to reach one component from the other. The second measure (i.e., the semantic scattering) is instead meant to capture how spread, in terms of implemented responsibilities, the code components modified by a developer in a given time period are. The conjecture behind the proposed metrics is that high levels of structural and semantic scattering make the developer more error-prone. To verify this conjecture, we built two predictors exploiting the proposed measures, and we used them in a bug prediction model. The results achieved on five software systems showed the superiority of our model with respect to (i) the Basic Code Change Model (BCCM) built using the entropy of changes [8] and (ii) a model using the number of developers working on a code component as predictor [9], [10]. Most importantly, the two scattering measures showed a high degree of complementarity with the measures exploited by the baseline prediction models.

In this paper, we extend our previous work [23] to further investigate the role played by scattered changes in bug prediction. In particular we:

1) Extend the empirical evaluation of our bug prediction model by considering a set of 26 systems.
2) Compare our model with two additional competitive approaches, i.e., a prediction model based on the focus metrics proposed by Posnett et al. [22] and a prediction model based on structural code metrics [24], which, together with the previously considered models, i.e., the BCCM proposed by Hassan [8] and the one proposed by Ostrand et al. [9], [10], lead to a total of four different baselines considered in our study.
3) Devise and discuss the results of a hybrid bug prediction model, based on the best combination of predictors exploited by the five prediction models experimented in the paper.
4) Provide a comprehensive replication package [25] including all the raw data and working data sets of our studies.

The achieved results confirm the superiority of our model, which achieves an F-Measure 10.3% higher, on average, than the change entropy model [8], 53.7% higher, on average, than what is achieved by exploiting the number of developers working on a code component as predictor [9], 13.3% higher, on average, than the F-Measure obtained by using the developers' focus metric by Posnett et al. [22] as predictor, and 29.3% higher, on average, than the prediction model built on top of product metrics [1]. The two scattering measures confirmed their complementarity with the metrics used by the alternative prediction models. Thus, we devised a “hybrid” model providing an average boost in prediction accuracy (i.e., F-Measure) of +5% with respect to the best performing model (i.e., the one proposed in this paper).

Structure of the paper. Section 2 discusses the related literature, while Section 3 presents the proposed scattering measures. Section 4 presents the design of our empirical study and provides details about the data extraction process and analysis method. Section 5 reports the results of the study, while Section 6 discusses the threats that could affect their validity. Section 7 concludes the paper.

2 RELATED WORK

Many bug prediction techniques have been proposed in the literature in the last decade. Such techniques mainly differ in the specific predictors they use, and can roughly be classified into those exploiting product metrics (e.g., lines of code, code complexity, etc.), those relying on process metrics (e.g., change- and fault-proneness of code components), and those exploiting a mix of the two. Table 1 summarizes the related literature, grouping the proposed techniques on the basis of the metrics they exploit as predictors.

The Chidamber and Kemerer (CK) metrics [36] have been widely used in the context of bug prediction. Basili et al. [1] investigated the usefulness of the CK suite for predicting the probability of detecting faulty classes. They showed that five of the experimented metrics are actually useful in characterizing the bug-proneness of classes. The same set of metrics has been successfully exploited in the context of bug prediction by El Emam et al. [26] and Subramanyam et al. [27]. Both works reported the ability of the CK metrics to predict buggy code components, regardless of the size of the system under analysis.

Still in terms of product metrics, Nikora et al. [28] showed that by measuring the evolution of structural attributes (e.g., number of executable statements, number of nodes in the control flow graph, etc.) it is possible to predict the number of bugs introduced during system development. Later, Gyimothy et al. [2] performed a new investigation on the relationship between CK metrics and bug proneness. Their results showed that the Coupling Between Objects metric is the best in predicting the bug-proneness of classes, while other CK metrics are untrustworthy.

Ohlsson et al. [3] focused their attention on the use of design metrics to identify bug-prone modules. They performed a study on an Ericsson industrial system showing that at least four different design metrics can be used with equivalent results. The metrics' performance is not statistically worse than that achieved using a model based on the project size. Zhou et al. [29] confirmed their results, showing that size-based models seem to perform as well as those based on CK metrics, except for the Weighted Methods per Class on some releases of the Eclipse system. Thus, although Bell et al. [35] showed that more complex metric-based models have more predictive power than size-based models, the latter seem to be generally useful for bug prediction.


TABLE 1
Prediction models proposed in the literature

Product metrics:
  Basili et al. [1]: CK metrics
  El Emam et al. [26]: CK metrics
  Subramanyam et al. [27]: CK metrics
  Nikora et al. [28]: CFG metrics
  Gyimothy et al. [2]: CK metrics, LOC
  Ohlsson et al. [3]: CFG metrics, complexity metrics, LOC
  Zhou et al. [29]: CK metrics, OO metrics, complexity metrics, LOC
  Nagappan et al. [14]: CK metrics, CFG metrics, complexity metrics

Process metrics:
  Khoshgoftaar et al. [6]: debug churn
  Nagappan et al. [30]: relative code churn
  Hassan and Holt [31]: entropy of changes
  Hassan and Holt [32]: entropy of changes
  Kim et al. [33]: previous fault location
  Hassan [8]: entropy of changes
  Ostrand et al. [10]: number of developers
  Nagappan et al. [34]: consecutive changes
  Bird et al. [20]: social network analysis on developers' activities
  Ostrand et al. [9]: number of developers
  Posnett et al. [22]: module activity focus, developer attention focus

Product and process metrics:
  Graves et al. [7]: various code and change metrics
  Nagappan and Ball [4]: LOC, past defects
  Bell et al. [35]: LOC, age of files, number of changes, program type
  Zimmerman et al. [5]: complexity metrics, CFG metrics, past defects
  Moser et al. [13]: various code and change metrics
  Moser et al. [11]: various code and change metrics
  Bell et al. [12]: various code and change metrics
  D'Ambros et al. [15]: various code and change metrics

Nagappan and Ball [4] exploited two static analysis tools to early predict the pre-release bug density. The results of their study, conducted on the Windows Server system, show that it is possible to perform a coarse-grained classification between high and low quality components with a high level of accuracy. Nagappan et al. [14] analyzed several complexity measures on five Microsoft software systems, showing that there is no evidence that a single set of measures can act universally as bug predictor. They also showed how to methodically build regression models based on similar projects in order to achieve better results. Complexity metrics in the context of bug prediction are also the focus of the work by Zimmerman et al. [5]. Their study reports a positive correlation between code complexity and bugs.

Differently from the previously discussed techniques, other approaches try to predict bugs by exploiting process metrics. Khoshgoftaar et al. [6] analyzed the contribution of debug churn (defined as the number of lines of code added or changed to fix bugs) to a model based on product metrics in the identification of bug-prone modules. Their study, conducted on two subsequent releases of a large legacy system, shows that modules exceeding a defined threshold of debug churn are often bug-prone. The reported results show a misclassification rate of just 21%.

Nagappan et al. [30] proposed a technique for early bug prediction based on the use of relative code churn measures. These metrics relate the number of churns to other factors such as LOC or file count. An experiment performed on the Windows Server system showed that relative churns are better than absolute values.

Hassan and Holt [31] conjectured that a chaotic development process has bad effects on source code quality and introduced the concept of entropy of changes. Later they also presented the top-10 list [32], a methodology to highlight to managers the top ten subsystems most likely to present bugs. The set of heuristics behind their approach includes a number of process metrics, such as considering the most recently modified, the most frequently modified, the most recently fixed and the most frequently fixed subsystems.

Bell et al. [12] pointed out that although code churns are very effective bug predictors, they cannot improve a simpler model based on the code components' change-proneness. Kim et al. [33] presumed that faults do not occur in isolation but in bursts of related faults. They proposed the bug cache algorithm, which predicts future faults considering the location of previous faults. Similarly, Nagappan et al. [34] defined a change burst as a set of consecutive changes over a period of time and proposed new metrics based on change bursts. The evaluation of the prediction capabilities of the models was performed on Windows Vista, achieving high accuracy.

Graves et al. [7] experimented with both product and process metrics for bug prediction. They observed that history-based metrics are more powerful than product metrics (i.e., change-proneness is a better indicator than LOC). Their best results were achieved using a combination of module's age and number of changes, while combining product metrics had no positive effect on bug prediction. They also saw no benefits provided by the inclusion of a metric based on the number of developers modifying a code component.


Moser et al. [13] performed a comparative study between product-based and process-based predictors. Their study, performed on Eclipse, highlights the superiority of process metrics in predicting buggy code components. Later, they performed a deeper study [11] on the bug prediction accuracy of process metrics, reporting that the past number of bug-fixes performed on a file (i.e., bug-proneness), the maximum changeset size occurring in a given period, and the number of changes involving a file in a given period (i.e., change-proneness) are the process metrics ensuring the best performances in bug prediction.

D'Ambros et al. [15] performed an extensive comparison of bug prediction approaches relying on process and product metrics, showing that no technique based on a single metric works better in all contexts.

Hassan [8] analyzed the complexity of the development process. In particular, he defined the entropy of changes as the scattering of code changes across time. He proposed three bug prediction models, namely the Basic Code Change Model (BCCM), the Extended Code Change Model (ECCM), and the File Code Change Model (FCCM). These models mainly differ in the choice of the temporal interval in which the bug-proneness of components is studied. The reported study indicates that the proposed techniques have a stronger prediction capability than a model purely based on the amount of changes applied to code components or on the number of prior faults. Differently from our work, all these predictors consider neither the number of developers who performed changes to a component, nor how many components they changed at the same time.

Ostrand et al. [9], [10] proposed the use of the number of developers who modified a code component in a given time period as a bug predictor. Their results show that combining developers' information poorly, but positively, impacts the detection accuracy of a prediction model. Our work does not use a simple count of the developers who worked on a file, but also takes into consideration the change activities they carry out.

Bird et al. [20] investigated the relationship between different ownership measures and pre- and post-release failures. Specifically, they analyzed the developers' contribution network by means of social network analysis metrics, finding that developers having low levels of ownership tend to increase the likelihood of introducing defects. Our scattering metrics are not based on code ownership, but on the “distance” between the code components modified by a developer in a given time period.

Posnett et al. [22] investigated factors related to the one we aim at capturing in this paper, i.e., the developer's scattering. In particular, the “focus” metrics presented by Posnett et al. [22] are based on the idea that a developer performing most of her activities on a single module (a module could be a method, a class, etc.) has a higher focus on the activities she is performing and is less likely to introduce bugs.

Following this conjecture, they defined two symmetric metrics, namely the Module Activity Focus metric (shortly, MAF) and the Developer Attention Focus metric (shortly, DAF) [22]. The former captures to what extent a module receives focused attention by developers. The latter measures how focused the activities of a specific developer are. As will be clearer later, our scattering measures not only take into account the frequency of changes made by developers over the different system's modules, but also consider the “distance” between the modified modules. This means that, for example, the contribution of a developer working on a high number of files all closely related to a specific responsibility might not be as “scattered” as the contribution of a developer working on few unrelated files.

Fig. 1. Example of two developers having different levels of “scattering” [figure: changes performed between January and February 2015 on classes such as it.gui.login, it.gui.addUser, it.gui.logout, it.gui.confirmRegistration, it.whouse.showStocks, it.db.insertPayslip, and it.db.deleteUserAccount]

3 COMPUTING DEVELOPER'S SCATTERING CHANGES

We conjecture that the developer's effort in performing maintenance and evolution tasks is proportional to the number of involved components and their spread across different subsystems. In other words, we believe that a developer working on different components scatters her attention due to continuous changes of context. This might lead to an increase of the developer's “scattering” with a consequent higher chance of introducing bugs.

To get a better idea of our conjecture, consider the situation depicted in Figure 1, where two developers, d1 (black point) and d2 (grey point), are working on the same system, during the same time period, but on different code components. The tasks performed by d1 are very focused on a specific part of the system (she mainly works on the system's GUI) and on a very targeted topic (she is mainly in charge of working on GUIs related to the users' registration and login features). On the contrary, d2 performs tasks scattered across different parts of the system (from GUIs to database management) and on different topics (users' accounts, payslips, warehouse stocks).

Our conjecture is that during the time period shown in Figure 1, the contribution of d2 might have been more “scattered” than the contribution of d1, thus having a higher likelihood of introducing bugs in the system.


To verify our conjecture we define two measures, named the structural and the semantic scattering measures, aimed at assessing the scattering of a developer d in a given time period p. Note that both measures are meant to work in object oriented systems at the class level granularity. In other words, we measure how scattered the changes performed by developer d during the time period p are across the different classes of the system. However, our measures can be easily adapted to work at other granularity levels.

3.1 Structural scattering

Let CH_{d,p} be the set of classes changed by a developer d during a time period p. We define the structural scattering measure as:

StrScat_{d,p} = |CH_{d,p}| × average_{∀ c_i, c_j ∈ CH_{d,p}} [ dist(c_i, c_j) ]    (1)

where dist is the number of packages to traverse in order to go from class c_i to class c_j; dist is computed by applying the shortest path algorithm on the graph representing the system's package structure. For example, the dist between the two classes it.user.gui.c1 and it.user.business.db.c2 is three, since in order to reach c1 from c2 we need to traverse it.user.business.db, it.user.business, and it.user.gui. We (i) use the average operator for normalizing the distances between the code components modified by the developer during the time period p and (ii) assign a higher scattering to developers working on a higher number of code components in the given time period (see |CH_{d,p}|). Note that the choice to use the average to normalize the distances is driven by the fact that other central operators, such as the median, are not affected by outliers. Indeed, suppose that a developer performs a change (i.e., commit) C, modifying four files F1, F2, F3, and F4. The first three files are in the same package, while the fourth one (F4) is in a different subsystem. When computing the structural scattering for C, the median would not reflect the scattering of the change performed on F4, since half of the six pairs of files involved in the change (in particular, F1-F2, F1-F3, F2-F3) have zero as structural distance (i.e., they are in the same package). Thus, the median would not capture the fact that C was, at least in part, a scattered change. This is instead captured by the mean, which is influenced by outliers.

To better understand how the structural scattering measure is computed and how it is possible to use it to estimate the developer's scattering in a time period, Figure 2 provides a running example based on a real scenario we found in Apache Ant¹, a tool to automate the building of software projects. The tree shown in Figure 2 depicts the activity of a single developer in the time period between 2012-03-01 and 2012-04-30.

¹ http://ant.apache.org/

Fig. 2. Example of structural scattering [figure: package tree of org.apache.tools.ant with the modified classes ProjectHelper (root package), Target and UpToDate (package taskdefs), and FilterMapper (package types.mappers) as leafs]

In particular, the leafs of the tree represent the classes modified by the developer in the considered time period, while the internal nodes (as well as the root node) illustrate the package structure of the system. In this example, the developer worked on the classes Target and UpToDate, both contained in the package org.apache.tools.ant.taskdefs, which groups together classes managing the definition of new commands that the Ant user can create for customizing her own building process. In addition, the developer also modified FilterMapper, a class containing utility methods (e.g., mapping a Java String into an array), and the class ProjectHelper, responsible for parsing the build file and creating Java instances representing the build workflow. To compute the structural scattering we compute the distance between every pair of classes modified by the developer. If two classes are in the same package, as in the case of the classes Target and UpToDate, then the distance between them will be zero. Instead, if they are in different packages, like in the case of ProjectHelper and Target, their distance is the minimum number of packages one needs to traverse to reach one class from the other. For example, the distance is one between ProjectHelper and Target (we need to traverse the package taskdefs), and three between UpToDate and FilterMapper (we need to traverse the packages taskdefs, types and mappers).

After computing the distance between every pair of classes, we can compute the structural scattering. Table 2 shows the structural distances between every pair of classes involved in our example as well as the value of the structural scattering. Note that, if the developer had modified only the Target and UpToDate classes in the considered time period, then her structural scattering would have been zero (the lowest possible), since her changes were focused on just one package. By also considering the change performed to ProjectHelper, the structural scattering raises to 2.01. This is due to the number of classes involved in the change set (3) and the average of the distances among them (0.67).
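To make the computation concrete, the following sketch (ours, not part of the original paper) shows one way the structural scattering of Eq. (1) could be computed. For brevity it derives package distances from the fully qualified class names, which coincides with the shortest path on the package tree when the package hierarchy mirrors the qualified names (as in Java); a full implementation would build the package graph from the project structure instead.

```python
# Illustrative sketch: structural scattering of Eq. (1) for one developer
# in one time period.
from itertools import combinations

def package_distance(class_a: str, class_b: str) -> int:
    """Number of packages to traverse to reach one class from the other."""
    pkg_a = class_a.split(".")[:-1]
    pkg_b = class_b.split(".")[:-1]
    common = 0                       # length of the shared package prefix
    for x, y in zip(pkg_a, pkg_b):
        if x != y:
            break
        common += 1
    # climb up from one package, then descend into the other
    return (len(pkg_a) - common) + (len(pkg_b) - common)

def structural_scattering(changed_classes: set[str]) -> float:
    """StrScat_{d,p} = |CH_{d,p}| * average pairwise package distance."""
    if len(changed_classes) < 2:
        return 0.0
    pairs = list(combinations(changed_classes, 2))
    avg_dist = sum(package_distance(a, b) for a, b in pairs) / len(pairs)
    return len(changed_classes) * avg_dist

# Running example of Figure 2 / Table 2
changed = {
    "org.apache.tools.ant.ProjectHelper",
    "org.apache.tools.ant.taskdefs.Target",
    "org.apache.tools.ant.taskdefs.UpToDate",
    "org.apache.tools.ant.types.mappers.FilterMapper",
}
print(structural_scattering(changed))  # ~6.67, as in Table 2
```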


TABLE 2
Example of structural scattering computation

Changed components – Distance
org.apache.tools.ant.ProjectHelper – org.apache.tools.ant.taskdefs.Target: 1
org.apache.tools.ant.ProjectHelper – org.apache.tools.ant.taskdefs.UpToDate: 1
org.apache.tools.ant.ProjectHelper – org.apache.tools.ant.types.mappers.FilterMapper: 2
org.apache.tools.ant.taskdefs.Target – org.apache.tools.ant.taskdefs.UpToDate: 0
org.apache.tools.ant.taskdefs.Target – org.apache.tools.ant.types.mappers.FilterMapper: 3
org.apache.tools.ant.taskdefs.UpToDate – org.apache.tools.ant.types.mappers.FilterMapper: 3
Structural developer scattering: 6.67

Finally, the structural scattering reaches the value of 6.67 when also considering the change to the FilterMapper class. In this case the change set is composed of 4 classes and the average of the distances among them is 1.67. Note that the structural scattering is a direct scattering measure: the higher the measure, the higher the estimated developer's scattering.

3.2 Semantic scattering

Considering the package structure might not be an effective way of assessing the similarity of the classes (i.e., to what extent the modified classes implement similar responsibilities). Because of software “aging” or wrong design decisions, classes grouped in the same package may have completely different responsibilities [37]. In such cases, the structural scattering measure might provide a wrong assessment of the level of developer's scattering, by considering classes implementing different responsibilities as similar only because they are grouped inside the same package. For this reason, we propose the semantic scattering measure, based on the textual similarity of the changed software components. Textual similarity between documents is computed using the Vector Space Model (VSM) [38]. In our application of VSM we (i) used the tf-idf weighting scheme [38], (ii) normalized the text by splitting the identifiers (we also maintained the original identifiers), (iii) applied stop word removal, and (iv) stemmed the words to their root (using the well known Porter stemmer). The semantic scattering measure is computed as:

SemScat_{d,p} = |CH_{d,p}| × 1 / average_{∀ c_i, c_j ∈ CH_{d,p}} [ sim(c_i, c_j) ]    (2)

where the sim function returns the textual similarity between the classes c_i and c_j as a value between zero (no textual similarity) and one (the textual content of the two classes is identical). Note that, as for the structural scattering, we adopt the average operator and assign a higher scattering to developers working on a higher number of code components in the given time period.

Figure 3 shows an example of computation for the semantic scattering measure. Also in this case the figure depicts a real scenario we identified in Apache Ant for a single developer in the time period between 2004-04-01 and 2004-06-30. The developer worked on the classes Path, Resource and ZipScanner, all contained in the package org.apache.tools.ant.types.

Fig. 3. Example of semantic scattering measure [figure: the classes Path, Resource, and ZipScanner in the package org.apache.tools.ant.type]

TABLE 3
Example of semantic scattering computation

Changed components – Text. sim.
org.apache.tools.ant.type.Path – org.apache.tools.ant.type.Resource: 0.22
org.apache.tools.ant.type.Path – org.apache.tools.ant.type.ZipScanner: 0.05
org.apache.tools.ant.type.Resource – org.apache.tools.ant.type.ZipScanner: 0.10
Semantic developer scattering: 24.32

Path and Resource are two data types and have some code in common, while ZipScanner is an archives scanner. While the structural scattering is zero for the example depicted in Figure 3 (all classes are from the same package), the semantic scattering is quite high (24.32) due to the low textual similarity between the pairs of classes contained in the package (see Table 3). To compute the semantic scattering we first calculate the textual similarity between every pair of classes modified by the developer, as reported in Table 3. Then we calculate the average of the textual similarities (≈ 0.12) and apply the inverse operator (≈ 8.11). Finally, the semantic scattering is calculated by multiplying the obtained value by the number of elements in the change set, that is 3, achieving the final result of ≈ 24.32.

3.3 Applications of Scattering Measures

The scattering measures defined above could be adopted in different areas concerned with monitoring maintenance and evolution activities. As an example, a project manager could use the scattering measures to estimate the workload of a developer, as well as to re-allocate resources. In the context of this paper, we propose the use of the defined measures for class-level bug prediction (i.e., to predict which classes are more likely to be buggy). The basic conjecture is that developers having a high scattering are more likely to introduce bugs during code change activities.

To exploit the defined scattering measures in the context of bug prediction, we built a new prediction model called Developer Changes Based Model (DCBM) that analyzes the components modified by developers in a given time period. The model exploits a machine learning algorithm built on top of two predictors. The first, called structural scattering predictor, is defined starting from the structural scattering measure, while the second one, called semantic scattering predictor, is based on the semantic scattering measure.
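As an illustration of the semantic scattering measure underlying this second predictor (Eq. (2)), the following sketch (ours, not from the paper) uses scikit-learn's TfidfVectorizer as one possible Vector Space Model implementation and cosine similarity as the sim function; identifier splitting, stop-word removal, and Porter stemming are omitted for brevity.

```python
# Illustrative sketch: semantic scattering of Eq. (2) via tf-idf and cosine
# similarity. Assumes the source text of each changed class is available.
from itertools import combinations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def semantic_scattering(class_texts: dict[str, str]) -> float:
    """SemScat_{d,p} = |CH_{d,p}| * 1 / average pairwise textual similarity."""
    names = list(class_texts)
    if len(names) < 2:
        return 0.0
    docs = [class_texts[n] for n in names]
    sim = cosine_similarity(TfidfVectorizer().fit_transform(docs))
    pairs = list(combinations(range(len(names)), 2))
    avg_sim = sum(sim[i, j] for i, j in pairs) / len(pairs)
    return len(names) * (1.0 / avg_sim) if avg_sim > 0 else float("inf")

# Hypothetical usage with the classes of Figure 3 (contents not shown here):
# texts = {"Path": path_source, "Resource": resource_source,
#          "ZipScanner": zipscanner_source}
# semantic_scattering(texts)
```

With the pairwise similarities of Table 3 (0.22, 0.05, 0.10), the average is ≈ 0.12, its inverse ≈ 8.11, and multiplying by the three changed classes yields ≈ 24.32, matching the example above.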


The predictors are defined as follows:

StrScatPred_{c,p} = Σ_{d ∈ developers_{c,p}} StrScat_{d,p}    (3)

SemScatPred_{c,p} = Σ_{d ∈ developers_{c,p}} SemScat_{d,p}    (4)

where developers_{c,p} is the set of developers that worked on the component c during the time period p.

4 EVALUATING SCATTERING METRICS IN THE CONTEXT OF BUG PREDICTION

The goal of the study is to evaluate the usefulness of the developer's scattering measures in the prediction of bug-prone components, with the purpose of improving the allocation of resources in the verification & validation activities by focusing on components having a higher bug-proneness. The quality focus is on the detection accuracy and completeness of the proposed technique as compared to competitive approaches. The perspective is that of researchers, who want to evaluate the effectiveness of using information about developers' scattered changes in identifying bug-prone components.

The context of the study consists of 26 Apache software projects having different size and scope. Table 4 reports the characteristics of the analyzed systems, and in particular (i) the software history we investigated, (ii) the mined number of commits, (iii) the size of the active developer base (those who performed at least one commit in the analyzed time period), (iv) the system's size in terms of KLOC and number of classes, and (v) the percentage of buggy files identified (as detailed later) during the entire change history. All data used in our study are publicly available [25].

4.1 Research Questions and Baseline Selection

In the context of the study, we formulated the following research questions:

• RQ1: What are the performances of a bug prediction model based on developer's scattering measures and how does it compare to baseline techniques proposed in the literature?
• RQ2: What is the complementarity between the proposed bug prediction model and the baseline techniques?
• RQ3: What are the performances of a “hybrid” model built by combining developer's scattering measures with baseline predictors?

In the first research question we quantify the performances of a prediction model based on developer's scattering measures (DCBM). Then, we compare its performances with respect to four baseline prediction models, one based on product metrics and the other three based on process metrics.

The first model exploits as predictor variables the CK metrics [1], and in particular size metrics (i.e., the Lines of Code—LOC—and the Number of Methods—NOM), cohesion metrics (i.e., the Lack of Cohesion of Methods—LCOM), coupling metrics (i.e., the Coupling Between Objects—CBO—and the Response for a Class—RFC), and complexity metrics (i.e., the Weighted Methods per Class—WMC). We refer to this model as CM.

We also compared our approach with three prediction models based on process metrics. The first is the one based on the work by Ostrand et al. [10], exploiting the number of developers that work on a code component in a specific time period as predictor variable (from now on, we refer to this model as DM).

The second is the Basic Code Change Model (BCCM) proposed by Hassan, using code change entropy information [8]. This choice is justified by the superiority of this model with respect to other techniques exploiting change-proneness information [11], [12], [13]. While such superiority has already been demonstrated by Hassan [8], we also compared these techniques before choosing BCCM as one of the baselines for evaluating our approach. We found that the BCCM works better than a model that simply counts the number of changes. This is because it filters out the changes that differ from the code change process (i.e., fault repairing and general maintenance modifications), considering only the Feature Introduction modifications (FI), namely the changes related to adding or enhancing features. However, we observed a high overlap between the BCCM and the model that uses the number of changes as predictor (almost 84%) on the dataset used for the comparison, probably due to the fact that the nature of the information exploited by the two models is similar. The interested reader can find the comparison between these two models in our online appendix [25].

Finally, the third baseline is a prediction model based on the Module Activity Focus metric proposed by Posnett et al. [22]. It relies on the concept of predator-prey food web existing in ecology (from now on, we refer to this model as MAF). The metric is based on the measurement of the degree to which a code component receives focused attention by developers. It can be considered as a form of ownership metric of the developers on the component. It is worth noting that we do not consider the other metric proposed by Posnett et al., the Developer Attention Focus, since (i) the two metrics are symmetric, and (ii) in order to provide a probability that a component is buggy, we need to qualify to what extent the activities on a file are focused, rather than measuring how focused developers' activities are. Even if Posnett et al. have not proposed a prediction model based on their metric, the results of this comparison will provide insights on the usefulness of developer's scattering measures for detecting bug-prone components.

Note that our choice of the baselines is motivated by the goal of (i) considering both models based on product and process metrics, and (ii) covering a good number of different process metrics (since our model exploits process metrics), including approaches exploiting information similar to that used by our scattering metrics.
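Going back to the DCBM predictors of Eqs. (3) and (4), the sketch below (ours, with purely illustrative names and data structures) shows how the two class-level predictors could be assembled for a time period, given the per-developer scattering values and the set of developers who changed each class in that period.

```python
# Illustrative sketch: class-level DCBM predictors of Eqs. (3) and (4).
# str_scat[d] / sem_scat[d]: scattering of developer d in the period.
# touched[c]: set of developers who changed class c in the same period.
def dcbm_predictors(str_scat: dict[str, float],
                    sem_scat: dict[str, float],
                    touched: dict[str, set[str]]) -> dict[str, tuple[float, float]]:
    features = {}
    for cls, devs in touched.items():
        str_pred = sum(str_scat.get(d, 0.0) for d in devs)   # Eq. (3)
        sem_pred = sum(sem_scat.get(d, 0.0) for d in devs)   # Eq. (4)
        features[cls] = (str_pred, sem_pred)
    return features
```

Each class in a period is thus described by the pair (StrScatPred, SemScatPred), which is the feature vector the machine learner is trained on.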


TABLE 4
Characteristics of the software systems used in the study

Project    Period    #Commits    #Dev.    #Classes    KLOC    % buggy classes
AMQ Dec 2005 - Sep 2015 8,577 64 2,528 949 54
Ant Jan 2000 - Jul 2014 13,054 55 1,215 266 72
Aries Sep 2009 - Sep 2015 2,349 24 1,866 343 40
Camel Mar 2007 - Sep 2015 17,767 128 12,617 1,552 30
CXF Apr 2008 - Sep 2015 10,217 55 6,466 1,232 26
Drill Sep 2012 - Sep 2015 1,720 62 1,951 535 63
Falcon Nov 2011 - Sep 2015 1,193 26 581 201 25
Felix May 2007 - May 2015 11,015 41 5,055 1,070 18
JMeter Sep 1998 - Apr 2014 10,440 34 1,054 192 37
JS2 Feb 2008 - May 2015 1,353 7 1,679 566 34
Log4j Nov 2000 - Feb 2014 3,274 21 309 59 58
Lucene Mar 2010 - May 2015 13,169 48 5,506 2,108 12
Oak Mar 2012 - Sep 2015 8,678 19 2,316 481 43
OpenEJB Oct 2011 - Jan 2013 9,574 35 4,671 823 36
OpenJPA Jun 2007 - Sep 2015 3,984 25 4,554 822 38
Pig Oct 2010 - Sep 2015 1,982 21 81,230 48,360 16
Pivot Jan 2010 - Sep 2015 1,488 8 11,339 7,809 22
Poi Jan 2002 - Aug 2014 5,742 35 2,854 542 62
Ranger Aug 2014 - Sep 2015 622 18 826 443 37
Shindig Feb 2010 - Jul 2015 2,000 27 1,019 311 14
Sling Jun 2009 - May 2015 9,848 29 3,951 1,007 29
Sqoop Jun 2011 - Sep 2015 699 22 667 134 14
Sshd Dec 2008 - Sep 2015 629 8 658 96 33
Synapse Aug 2005 - Sep 2015 2,432 24 1,062 527 13
Whirr Jun 2010 - Apr 2015 569 17 275 50 21
Xerces-J Nov 1999 - Feb 2014 5,471 34 833 260 6

In the second research question we aim at evaluating the complementarity of the different models, while in the third one we build and evaluate a “hybrid” prediction model exploiting as predictor variables the scattering measures we propose as well as the measures used by the four experimented competitive techniques (i.e., DM, BCCM, MAF, and CM). Note that we do not limit our analysis to the experimentation of a model including all predictor variables, but we exercise all 2,036 possible combinations of predictor variables to understand which is the one achieving the best performances.

4.2 Experimental process and oracle definition

To evaluate the performances of the experimented bug prediction models we need to define the machine learning classifier to use. For each prediction technique, we experimented with several classifiers, namely ADTree [39], Decision Table Majority [40], Logistic Regression [41], Multilayer Perceptron [42] and Naive Bayes [43]. We empirically compared the results achieved by the five different models on the software systems used in our study (more details on the adopted procedure later in this section). For all the prediction models the best results were obtained using the Majority Decision Table (the comparison among the classifiers can be found in our online appendix [25]). Thus, we exploit it in the implementation of the five models. This classifier can be viewed as an extension of one-valued decision trees [40]. It is a rectangular table where the columns are labeled with predictors and the rows are sets of decision rules. Each decision rule of a decision table is composed of (i) a pool of conditions, linked through and/or logical operators, which are used to reflect the structure of the if-then rules; and (ii) an outcome which mirrors the classification of a software entity respecting the corresponding rule as bug-prone or non bug-prone. Majority Decision Table uses an attribute reduction algorithm to find a good subset of predictors with the goal of eliminating equivalent rules and reducing the likelihood of overfitting the data.

To assess the performance of the five models, we split the change history of the object systems into three-month time periods and we adopt a three-month sliding window to train and test the bug prediction models. Starting from the first time period TP1 (i.e., the one starting at the first commit), we train each model on it, and test its ability in predicting buggy classes on TP2 (i.e., the subsequent three-month time period). Then, we move the sliding window three months forward, training the classifiers on TP2 and testing their accuracy on TP3. This process is repeated until the end of the analyzed change history (see Table 4) is reached; a sketch of this procedure is shown below. Note that our choice of considering three-month periods is based on: (i) choices made in previous work, like the one by Hassan et al. [8]; and (ii) the results of an empirical assessment we performed on such a parameter, showing that the best results for all experimented techniques are achieved by using three-month periods. In particular, we experimented with time windows of one, two, three, and six months. The complete results are available in our replication package [25].

Finally, to evaluate the performances of the five experimented models we need an oracle reporting the presence of bugs in the source code.
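The sliding-window procedure described above can be sketched as follows (our illustration; the paper uses the Majority Decision Table, for which a decision tree from scikit-learn stands in here as a placeholder, and `periods` is assumed to be a chronologically ordered list of per-period feature/label tables with 1 marking buggy classes).

```python
# Illustrative sketch: train on period TP_i, test on TP_{i+1}, then slide the
# window until the end of the analyzed change history is reached.
from sklearn.tree import DecisionTreeClassifier            # stand-in classifier
from sklearn.metrics import precision_recall_fscore_support

def sliding_window_eval(periods):
    """periods: list of (X, y) pairs, one per three-month window, in order."""
    results = []
    for (X_train, y_train), (X_test, y_test) in zip(periods, periods[1:]):
        clf = DecisionTreeClassifier().fit(X_train, y_train)
        pred = clf.predict(X_test)
        p, r, f1, _ = precision_recall_fscore_support(
            y_test, pred, average="binary", zero_division=0)
        results.append((p, r, f1))
    return results
```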


Although the PROMISE repository collects a large dataset of bugs in open source systems [44], it provides oracles at release level. Since the proposed measures work at time-period level, we had to build our own oracle. Firstly, we identified the bug fixing commits that happened during the change history of each object system by mining regular expressions containing issue IDs in the change log of the versioning system (e.g., “fixed issue #ID” or “issue ID”). After that, for each identified issue ID, we downloaded the corresponding issue report from the issue tracking system and extracted the following information: product name; issue's type (i.e., whether an issue is a bug, enhancement request, etc.); issue's status (i.e., whether an issue was closed or not); issue's resolution (i.e., whether an issue was resolved by fixing it, or it was a duplicate bug report, or a “works for me” case); issue's opening date; issue's closing date, if available.

Then, we checked each issue report to be correctly downloaded (e.g., the issue ID identified from the versioning system commit note could be a false positive). After that, we used the issue type field to classify the issue and distinguish bug fixes from other issue types (e.g., enhancements). Finally, we only considered bugs having Closed status and Fixed resolution. Basically, we restricted our attention to (i) issues that were related to bugs, as we used them as a measure of fault-proneness, and (ii) issues that were neither duplicate reports nor false alarms.

Once collected the set of bugs fixed in the change history of each system, we used the SZZ algorithm [45] to identify when each fixed bug was introduced. The SZZ algorithm relies on the annotation/blame feature of versioning systems. In essence, given a bug-fix identified by the bug ID, k, the approach works as follows:

1) For each file f_i, i = 1...m_k involved in the bug-fix k (m_k is the number of files changed in the bug-fix k), and fixed in its revision rel-fix_{i,k}, we extract the file revision just before the bug fixing (rel-fix_{i,k} − 1).
2) Starting from the revision rel-fix_{i,k} − 1, for each source line in f_i changed to fix the bug k, the blame feature of Git is used to identify the file revision where the last change to that line occurred. In doing that, blank lines and lines that only contain comments are identified using an island grammar parser [46]. This produces, for each file f_i, a set of n_{i,k} fix-inducing revisions rel-bug_{i,j,k}, j = 1...n_{i,k}. Thus, more than one commit can be indicated by the SZZ algorithm as responsible for inducing a bug.

By adopting the process described above we are able to approximate the periods of time in which each class of the subject systems was affected by one or more bugs (i.e., was a buggy class). In particular, given a bug-fix BF_k performed on a class c_i, we consider c_i buggy from the date in which the bug fixed in BF_k was introduced (as indicated by the SZZ algorithm) to the date in which BF_k (i.e., the patch) was committed in the repository.

4.3 Metrics and Data Analysis

Once defined the oracle and obtained the predicted buggy classes for every three-month period, we answer RQ1 by using three widely-adopted metrics, namely accuracy, precision and recall [38]:

accuracy = (TP + TN) / (TP + FP + TN + FN)    (5)

precision = TP / (TP + FP)    (6)

recall = TP / (TP + FN)    (7)

where TP is the number of classes containing bugs that are correctly classified as bug-prone; TN denotes the number of bug-free classes classified as non bug-prone classes; FP and FN measure the number of classes for which a prediction model fails to identify bug-prone classes, by declaring bug-free classes as bug-prone (FP) or identifying actually buggy classes as non buggy ones (FN). As an aggregate indicator of precision and recall, we also report the F-measure, defined as the harmonic mean of precision and recall:

F-measure = 2 × (precision × recall) / (precision + recall)    (8)

Finally, we also report the Area Under the Curve (AUC) obtained by the prediction model. The AUC quantifies the overall ability of a prediction model to discriminate between buggy and non-buggy classes. The closer the AUC to 1, the higher the ability of the classifier to discriminate classes affected and not affected by a bug. On the other hand, the closer the AUC to 0.5, the lower the accuracy of the classifier. To compare the performances obtained by DCBM with the competitive techniques, we performed the bug prediction using the four baseline models BCCM, DM, MAF, and CM on the same systems and the same periods on which we ran DCBM.

To answer RQ2, we analyzed the orthogonality of the different measures used by the five experimented bug prediction models using Principal Component Analysis (PCA). PCA is a statistical technique able to identify various orthogonal dimensions (principal components) from a set of data. It can be used to evaluate the contribution of each variable to the identified components. Through the analysis of the principal components and the contributions (scores) of each predictor to such components, it is possible to understand whether different predictors contribute to the same principal components. Two models are complementary if the predictors they exploit contribute to capturing different principal components. Hence, the analysis of the principal components provides insights on the complementarity between models. Such an analysis is necessary to assess whether the exploited predictors assign the same bug-proneness to the same set of classes.
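As an illustration of this analysis (ours, not from the paper), the sketch below runs PCA over a table whose columns are the predictors used by the five models and reports how strongly each predictor loads on each principal component; the predictor names are placeholders.

```python
# Illustrative sketch: PCA over the predictors of the five models to see which
# predictors load on which principal components (complementary models load on
# different components).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def predictor_loadings(X: np.ndarray, predictor_names: list[str],
                       n_components: int = 5):
    """Return the explained variance ratios and, per component, the loadings."""
    X_std = StandardScaler().fit_transform(X)        # PCA is scale-sensitive
    pca = PCA(n_components=n_components).fit(X_std)
    loadings = {f"PC{i + 1}": dict(zip(predictor_names, component))
                for i, component in enumerate(pca.components_)}
    return pca.explained_variance_ratio_, loadings

# Hypothetical usage: one row per class/period, one column per predictor
# ratios, loads = predictor_loadings(X, ["StrScatPred", "SemScatPred",
#                                        "entropy", "num_devs", "MAF", "LOC"])
```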



However, PCA does not tell the whole story. Indeed, using PCA it is not possible to identify to what extent a prediction model complements another and vice versa. This is the reason why we complemented the PCA by analyzing the overlap of the five prediction models. Specifically, given two prediction models m_i and m_j, we computed:

corr_{m_i ∩ m_j} = |corr_{m_i} ∩ corr_{m_j}| / |corr_{m_i} ∪ corr_{m_j}| %   (9)

corr_{m_i \ m_j} = |corr_{m_i} \ corr_{m_j}| / |corr_{m_i} ∪ corr_{m_j}| %   (10)

where corr_{m_i} represents the set of bug-prone classes correctly classified by the prediction model m_i, corr_{m_i ∩ m_j} measures the overlap between the sets of true positives correctly identified by both models m_i and m_j, and corr_{m_i \ m_j} measures the percentage of bug-prone classes correctly classified by m_i only and missed by m_j. Clearly, the overlap metrics are computed by considering each combination of the five experimented detection techniques (e.g., we compute corr_{BCCM ∩ DM}, corr_{BCCM ∩ DCBM}, corr_{BCCM ∩ CM}, corr_{DM ∩ DCBM}, etc.). In addition, given the five experimented prediction models m_i, m_j, m_k, m_p, m_z, we computed:

corr_{m_i \ (m_j ∪ m_k ∪ m_p ∪ m_z)} = |corr_{m_i} \ (corr_{m_j} ∪ corr_{m_k} ∪ corr_{m_p} ∪ corr_{m_z})| / |corr_{m_i} ∪ corr_{m_j} ∪ corr_{m_k} ∪ corr_{m_p} ∪ corr_{m_z}| %   (11)

that represents the percentage of bug-prone classes correctly identified only by the prediction model m_i. In the paper, we discuss the results obtained when analyzing the complementarity between our model and the baseline ones. The other results concerning the complementarity between the baseline approaches are available in our online appendix [25].
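To make the computation of these overlap metrics concrete, the following minimal sketch (illustrative Python, not taken from the replication package; function names and the example class names are ours) shows how Eq. (9), (10), and (11) can be derived from the sets of correctly classified bug-prone classes:

```python
def pairwise_overlap(corr_i, corr_j):
    """Eq. (9) and (10): overlap between two models' sets of
    correctly classified bug-prone classes, as percentages."""
    union = corr_i | corr_j
    both = 100.0 * len(corr_i & corr_j) / len(union)    # Eq. (9)
    only_i = 100.0 * len(corr_i - corr_j) / len(union)  # Eq. (10)
    return both, only_i

def unique_contribution(corr_i, *others):
    """Eq. (11): percentage of bug-prone classes correctly
    identified only by model i and missed by all the others."""
    union_all = corr_i.union(*others)
    caught_only_by_i = corr_i - set().union(*others)
    return 100.0 * len(caught_only_by_i) / len(union_all)

# Example usage with hypothetical sets of class names:
corr_dcbm = {"Exit", "DrillConfig", "LuceneIndexer"}
corr_dm = {"AbstractSlingRepository", "LuceneIndexer"}
print(pairwise_overlap(corr_dcbm, corr_dm))
```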
Finally, to answer RQ3 we build and assess the performances of a "hybrid" bug prediction model exploiting different combinations of the predictors used by the five experimented models (i.e., DCBM, BCCM, DM, MAF, and CM). Firstly, we assess the boost in performances (if any) provided by our scattering metrics when plugged into the four competitive models, similarly to what has been done by Bird et al. [20], who explained the relationship between ownership metrics and bugs by building regression models in which the metrics are added incrementally in order to evaluate their impact on increasing/decreasing the likelihood of developers to introduce bugs.

Then, we create a "comprehensive baseline model" featuring all predictors exploited by the four competitive models and, again, we assess the possible boost in performances provided by our two scattering metrics when added to such a comprehensive model. Clearly, simply combining together the predictors used by the five models could lead to sub-optimal results, due for example to model overfitting.

Thus, we also investigate the subset of predictors actually leading to the best prediction accuracy. To this aim, we use the wrapper approach proposed by Kohavi and John [47]. Given a training set built using all the features available, the approach systematically exercises all the possible subsets of features against a test set, thus assessing their accuracy. Also in this case we used the Majority Decision Table [40] as machine learner.

In our study, we considered as training set the penultimate three-month period of each subject system, and as test set the last three-month period of each system. Note that this analysis has not been run on the whole change history of the software systems due to its high computational cost. Indeed, experimenting all possible combinations of the eleven predictors means running 2,036 different prediction models across each of the 26 systems (52,936 overall runs). This required approximately eight weeks on four Linux laptops having two dual-core 3.10 GHz CPUs and 4 GB of RAM.
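As an illustration of the wrapper-style exhaustive search described above, the sketch below (illustrative Python; scikit-learn's decision tree is only a stand-in for the Majority Decision Table learner actually used in the study, and the variable names are ours) enumerates feature subsets, trains on the penultimate quarter, and evaluates on the last quarter:

```python
from itertools import combinations
import pandas as pd
from sklearn.tree import DecisionTreeClassifier  # stand-in for Weka's Majority Decision Table
from sklearn.metrics import f1_score

def best_feature_subset(train_X: pd.DataFrame, train_y, test_X: pd.DataFrame, test_y,
                        feature_names):
    """Exhaustively evaluate every subset of predictors (wrapper approach).

    Subsets of at least two of the eleven predictors give exactly
    2^11 - 11 - 1 = 2,036 combinations, matching the number reported above."""
    results = []
    for size in range(2, len(feature_names) + 1):
        for subset in combinations(feature_names, size):
            cols = list(subset)
            model = DecisionTreeClassifier().fit(train_X[cols], train_y)
            score = f1_score(test_y, model.predict(test_X[cols]))
            results.append((score, subset))
    # Highest F-measure on the test quarter wins.
    return max(results)
```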
Once we obtained all the accuracy metrics for each combination, we analyzed these data in two steps. Firstly, we plot the distribution of the average F-measure obtained by the 2,036 different combinations over the 26 software systems. Then we discuss the performances of the top five configurations, comparing the results with the ones achieved by (i) each of the five experimented models, (ii) the models built by plugging in the scattering metrics as additional features in the four baseline models, and (iii) the comprehensive prediction model that includes all the metrics exploited by the four baseline models plus our scattering metrics.

5 ANALYSIS OF THE RESULTS

In this section we discuss the results achieved, aiming at answering the formulated research questions.

5.1 RQ1: On the Performances of DCBM and Its Comparison with the Baseline Techniques

Table 5 reports the results, in terms of AUC-ROC, accuracy, precision, recall, and F-measure, achieved by the five experimented bug prediction models, i.e., our model exploiting the developer's scattering metrics (DCBM), the BCCM proposed by Hassan [8], a prediction model that uses as predictor the number of developers that work on a code component (DM) [9], [10], the prediction model based on the degree to which a module receives focused attention by developers (MAF) [22], and a prediction model exploiting product metrics capturing size, cohesion, coupling, and complexity of code components (CM) [1].

The achieved results indicate that the proposed prediction model (i.e., DCBM) ensures better prediction accuracy as compared to the competitive techniques. Indeed, the area under the ROC curve of DCBM ranges between 62% and 91%, outperforming the competitive models.
TABLE 5
AUC-ROC, Accuracy, Precision, Recall, and F-Measure of the five bug prediction models

System | DCBM: AUC-ROC Accuracy Precision Recall F-measure | DM: AUC-ROC Accuracy Precision Recall F-measure | BCCM: AUC-ROC Accuracy Precision Recall F-measure
AMQ 83% 53% 42% 53% 47% 58% 24% 18% 19% 19% 61% 52% 33% 49% 39%
Ant 88% 69% 66% 72% 69% 69% 26% 28% 37% 31% 67% 63% 67% 68% 68%
Aries 86% 56% 51% 54% 52% 65% 23% 23% 25% 24% 58% 50% 34% 45% 39%
Camel 81% 55% 51% 55% 53% 51% 27% 18% 28% 22% 50% 39% 39% 46% 42%
CFX 79% 94% 88% 94% 91% 54% 25% 19% 25% 21% 71% 79% 86% 84% 85%
Drill 63% 53% 45% 48% 46% 58% 23% 22% 39% 28% 52% 39% 14% 25% 18%
Falcon 75% 98% 96% 98% 97% 50% 25% 20% 21% 21% 75% 89% 86% 90% 88%
Felix 88% 70% 69% 67% 68% 50% 25% 17% 30% 22% 59% 61% 60% 65% 63%
JMeter 91% 77% 72% 68% 70% 50% 29% 24% 53% 33% 69% 65% 65% 63% 64%
JS2 62% 87% 83% 86% 84% 50% 26% 22% 17% 19% 58% 81% 70% 74% 72%
Log4j 89% 71% 62% 66% 64% 50% 19% 13% 26% 17% 52% 43% 36% 78% 49%
Lucene 77% 84% 79% 83% 81% 54% 27% 22% 30% 26% 63% 72% 61% 86% 71%
Oak 67% 97% 95% 97% 96% 52% 27% 15% 29% 19% 66% 95% 92% 80% 86%
OpenEJB 82% 98% 97% 98% 98% 50% 22% 25% 20% 22% 78% 95% 81% 91% 85%
OpenJPA 83% 79% 71% 77% 74% 51% 20% 20% 38% 26% 78% 72% 61% 68% 64%
Pig 79% 89% 79% 89% 84% 50% 22% 21% 37% 27% 73% 71% 64% 75% 69%
Pivot 78% 86% 75% 86% 80% 53% 26% 19% 24% 21% 68% 69% 71% 79% 75%
Poi 87% 68% 88% 59% 71% 50% 25% 34% 16% 22% 66% 60% 74% 49% 59%
Ranger 77% 95% 90% 95% 93% 50% 28% 18% 19% 19% 76% 92% 83% 91% 87%
Shindig 73% 66% 50% 65% 56% 50% 24% 23% 23% 23% 58% 58% 43% 61% 50%
Sling 62% 85% 76% 84% 80% 57% 21% 17% 18% 18% 61% 80% 62% 68% 65%
Sqoop 78% 98% 96% 98% 97% 55% 26% 19% 32% 23% 77% 97% 90% 89% 90%
Sshd 86% 70% 59% 70% 64% 55% 24% 19% 36% 25% 69% 52% 49% 54% 52%
Synapse 67% 62% 50% 62% 56% 53% 23% 17% 24% 20% 64% 49% 48% 56% 52%
Whirr 76% 98% 95% 98% 97% 52% 26% 20% 24% 21% 74% 96% 84% 88% 86%
Xerces-J 83% 94% 94% 88% 91% 52% 49% 28% 35% 31% 71% 74% 59% 80% 68%
System | CM: AUC-ROC Accuracy Precision Recall F-measure | MAF: AUC-ROC Accuracy Precision Recall F-measure
AMQ 55% 43% 37% 41% 39% 56% 56% 38% 45% 41%
Ant 58% 38% 28% 33% 30% 60% 59% 60% 62% 61%
Aries 56% 38% 28% 33% 30% 51% 45% 30% 43% 35%
Camel 41% 42% 44% 41% 42% 50% 38% 35% 38% 36%
CFX 53% 52% 55% 46% 50% 76% 75% 82% 73% 77%
Drill 50% 34% 26% 32% 29% 52% 32% 22% 29% 25%
Falcon 51% 52% 45% 54% 49% 71% 81% 70% 81% 75%
Felix 53% 55% 53% 51% 52% 67% 56% 62% 65% 63%
JMeter 50% 43% 44% 43% 43% 68% 58% 61% 59% 60%
JS2 50% 43% 44% 43% 43% 62% 80% 72% 78% 75%
Log4j 50% 35% 38% 31% 34% 52% 51% 44% 58% 52%
Lucene 50% 35% 38% 31% 34% 65% 66% 66% 76% 70%
Oak 52% 46% 54% 55% 54% 64% 88% 89% 78% 83%
OpenEJB 61% 62% 66% 57% 61% 80% 78% 77% 79% 78%
OpenJPA 51% 55% 59% 45% 51% 67% 70% 57% 67% 62%
Pig 57% 62% 58% 52% 55% 69% 68% 62% 68% 65%
Pivot 50% 41% 47% 40% 43% 66% 64% 65% 69% 67%
Poi 58% 65% 61% 65% 63% 61% 55% 58% 56% 57%
Ranger 60% 65% 61% 65% 63% 76% 81% 77% 82% 79%
Shindig 52% 48% 36% 46% 40% 54% 55% 39% 59% 47%
Sling 55% 38% 35% 41% 38% 61% 76% 59% 63% 61%
Sqoop 53% 59% 59% 64% 61% 78% 92% 89% 84% 87%
Sshd 50% 28% 26% 31% 28% 67% 48% 46% 52% 49%
Synapse 56% 43% 47% 52% 49% 61% 47% 47% 53% 50%
Whirr 54% 61% 55% 63% 59% 69% 82% 82% 82% 82%
Xerces-J 58% 61% 55% 63% 59% 65% 71% 68% 75% 72%

In particular, the Developer Model achieves an AUC between 50% and 69%, the Basic Code Change Model between 50% and 78%, the MAF model between 50% and 78%, and the CM model between 41% and 61%. Also in terms of accuracy, precision and recall (and, consequently, of F-measure) DCBM achieves better results. In particular, across all the different object systems, DCBM achieves a higher F-measure with respect to DM (mean=+53.7%), BCCM (mean=+10.3%), MAF (mean=+13.3%), and CM (mean=+29.3%). The higher values achieved for precision and recall indicate that DCBM produces fewer false positives (i.e., non-buggy classes indicated as buggy ones) while also being able to identify more classes actually affected by a bug as compared to the competitive models. Moreover, when considering the AUC, we observed that DCBM reaches higher values with respect to the competitive bug prediction approaches. This result highlights how the proposed model performs better in discriminating between buggy and non-buggy classes.

An interesting case is Xerces-J, where DCBM is able to identify buggy classes with 94% accuracy (see Table 5), as compared to the 74% achieved by BCCM, 49% of DM, 71% of MAF, and 59% of CM. We looked into this project to understand the reasons behind such a strong result. We found that the Xerces-J's buggy classes are often modified by few developers that, on average, perform a small number of changes on them. As an example, the class XSSimpleTypeDecl of the package org.apache.xerces.impl.dv.xs has been modified only twice between May 2008 and July 2008 (one of the three-month periods considered in our study) by two developers. However, the sum of their structural and semantic scattering in that period was very high (161 and 1,932, respectively). It is worth noting that if a low number of developers work on a file, they have higher chances to be considered as the owners of that file. This means that, in the case of the MAF model, the probability that the class is bug-prone decreases. At the same time, models based on the change entropy (BCCM) or on the number of developers modifying a class (DM) experience difficulties in identifying this class as buggy due to the low number of changes it underwent and to the low number of involved developers, respectively. Conversely, our model does not suffer from such a limitation thanks to the exploited developers' scattering information. Finally, the CM model relying on product metrics fails in the prediction since the class has code metrics comparable with the average metrics of the system (e.g., the CBO of the class is 12, while the average CBO of the system is 14).

Looking at the other prediction models, we can observe that the model based only on the number of developers working on a code component never achieves an accuracy higher than 49%. This result confirms what was previously demonstrated by Ostrand et al. [10], [9] on the limited impact of individual developer data on bug prediction.

Regarding the other models, we observe that the information about the ownership of a class, as well as the code metrics and the entropy of changes, has a stronger predictive power compared to the number of developers. However, they still exhibit a lower prediction accuracy with respect to what is allowed by the developer scattering information.

In particular, we observed that the MAF model has good performances when it is adopted on well-modularized systems, i.e., systems grouping in the same package classes implementing related responsibilities. Indeed, MAF achieved the highest accuracy on the Apache CFX, Apache OpenEJB, and Apache Sqoop systems, where the average modularization quality (MQ) [48] is of 0.84, 0.79, and 0.88, respectively. The reason behind this result is that a high modularization quality often corresponds to a good distribution of developers' activities. For instance, the average number of developers per package working on Apache CFX is 5. As a consequence, the focus of developers on specific code entities is high. The same happens on Apache OpenEJB and Apache Sqoop, where the average number of developers per package is 3 and 7, respectively. However, even if the developers mainly focus their attention on few packages, in some cases they also apply changes to classes contained in other packages, increasing their chances of introducing bugs. This is the reason why our prediction model still continues to work better in such cases. A good example is the one of the class HBaseImportJob, contained in the package org.apache.sqoop.mapreduce of the project Apache Sqoop. Only two developers worked on this class over the time period between July 2013 and September 2013; however, the same developers have been involved in the maintenance of the class HiveImport of the package com.cloudera.sqoop.hive. Even if the two classes share the goal of importing data from other projects into Sqoop, they implement significantly different mechanisms for importing data. This results in a higher proneness to introducing bugs. The sum of the structural and semantic scattering in that period for the two developers reached 86 and 92, respectively, causing the correct prediction of the buggy file for our model, and an error in the prediction of the MAF model.

The BCCM [8] often achieves a good prediction accuracy. This is due to the higher change-proneness of components being affected by bugs. As an example, in the JS2 project, the class PortalAdministrationImpl of the package org.apache.jetspeed.administration has been modified 19 times between January and March 2010. Such a high change frequency led to the introduction of a bug. However, such a conjecture is not always valid. Let us consider the Apache Aries project, in which BCCM obtained a low accuracy (recall=45%, precision=34%). Here we found several classes with high change-proneness that were not subject to any bug. For instance,
the class AriesApplicationResolver of the package org.apache.aries.application.managament has been changed 27 times between November 2011 and January 2012. It was the class with the highest change-proneness in that time period, but this never led to the introduction of a bug. It is worth noting that all the changes to the class were applied by only one developer.

The model based on structural code metrics (CM) obtains fluctuating performance, with quite low F-measure achieved on some of the systems, like the Sshd project (28%). Looking more in depth into such results, we observed that the structural metrics achieve good performances in systems where the developers tend to repeatedly perform evolution activities on the same subset of classes. Such a subset of classes generally centralizes the system behavior, is composed of complex classes, and exhibits a high fault-proneness. As an example, in the AMQ project the class activecluster.impl.StateServiceImpl controls the state of the services provided by the system, and it experienced five changes during the time period between September 2009 and November 2009. In this period, developers heavily worked on this class, increasing its size from 40 to 265 lines of code. This sudden growth of the class size resulted in the introduction of a bug, correctly predicted by the CM model.

We also statistically compare the F-measure achieved by the five experimented prediction models. To this aim, we exploited the Mann-Whitney test [49] (results are intended as statistically significant at α = 0.05). We also estimated the magnitude of the measured differences by using the Cliff's Delta (or d), a non-parametric effect size measure [50] for ordinal data. We followed well-established guidelines to interpret the effect size values: negligible for |d| < 0.10, small for |d| < 0.33, medium for 0.33 ≤ |d| < 0.474, and large for |d| ≥ 0.474 [50].
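A minimal sketch of this statistical comparison (illustrative Python, assuming SciPy is available; the per-system F-measure values below are made-up numbers, not the study's data) could look as follows:

```python
from scipy.stats import mannwhitneyu

def cliffs_delta(xs, ys):
    """Cliff's delta: (#(x > y) - #(x < y)) / (len(xs) * len(ys))."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / float(len(xs) * len(ys))

def magnitude(d):
    """Interpretation thresholds reported above."""
    d = abs(d)
    if d < 0.10:
        return "negligible"
    if d < 0.33:
        return "small"
    if d < 0.474:
        return "medium"
    return "large"

# Hypothetical per-system F-measures for two models:
f_dcbm = [0.47, 0.69, 0.52, 0.53, 0.91]
f_dm = [0.19, 0.31, 0.24, 0.22, 0.21]
stat, p = mannwhitneyu(f_dcbm, f_dm, alternative="greater")
d = cliffs_delta(f_dcbm, f_dm)
print(p, d, magnitude(d))
```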
Table 6 reports the results of this analysis. The proposed DCBM model obtains a significantly higher F-measure with respect to the other baselines (p-value<0.05), with the only exception of the model proposed by Hassan [8], for which the p-value is only marginally significant (p-value=0.07). At the same time, the magnitude of the differences is large in the comparison with the model proposed by Ostrand et al. [9] and the one based on product metrics [24], medium in the comparison with the model based on the Posnett et al. metric [22], and small when our model is compared with the model based on the entropy of changes [8].

TABLE 6
Mann-Whitney test p-values of the hypothesis: F-Measure achieved by DCBM > F-Measure of the compared model. Statistically significant results are reported in bold face. Cliff's Delta d values are also shown.

Compared models | p-value | Cliff Delta | Magnitude
DCBM - CM | < 0.01 | 0.81 | large
DCBM - BCCM | 0.07 | 0.29 | small
DCBM - DM | < 0.01 | 0.96 | large
DCBM - MAF | < 0.01 | 0.44 | medium

Summary for RQ1. Our approach showed quite high accuracy in identifying buggy classes. Among the 26 object systems its accuracy ranges between 53% and 98%, while the F-measure ranges between 47% and 98%. Moreover, DCBM performs better than the baseline approaches, demonstrating its superiority in correctly predicting buggy classes.

5.2 RQ2: On the Complementarity between DCBM and Baseline Techniques

Table 7 reports the results of the Principal Component Analysis (PCA), aimed at investigating the complementarity between the predictors exploited by the different models. The different columns (PC1 to PC11) represent the components identified by the PCA as those describing the phenomenon of interest (in our case, bug-proneness). The first row (i.e., the proportion of variance) indicates on a scale between zero and one how much each component contributes to the phenomenon description (the higher the proportion of variance, the higher the component's contribution). The identified components are sorted on the basis of their "importance" in describing the phenomenon (e.g., PC1 in Table 7 is the most important, capturing 39% of the phenomenon as compared to the 2% brought by PC11). Finally, the value reported at row i and column j indicates how much the predictor i contributes in capturing the PC j (e.g., structural scattering captures 69% of PC1). The structural scattering predictor is mostly orthogonal with respect to the other ten, since it is the one capturing most of PC1, the most important component. As for the other predictors, the semantic scattering and the change entropy information seem to be quite related, since they capture the same components (i.e., PC2 and PC3), while the MAF predictor is the one better capturing PC4 and PC5. The number of developers is only able to partially capture PC5, while the product metrics are the most important to capture the remaining components (PC6 to PC11). From these results, we can firstly conclude that the information captured by our predictors is strongly orthogonal with respect to the competitive ones. Secondly, we also observe a high complementarity between the MAF predictor and the others, while the predictor based on the number of developers working on a code component only partially captures the phenomenon, demonstrating again its limited impact in the context of bug prediction. Finally, the code metrics capture portions of the phenomenon that none of the other (process) metrics is able to capture. Such results highlight the possibility to achieve even better bug prediction models by combining predictors capturing orthogonal information (we investigate this possibility in RQ3).
TABLE 7
Results achieved applying the Principal Component Analysis

PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11
Proportion of Variance 0.39 0.16 0.11 0.10 0.06 0.05 0.03 0.03 0.03 0.02 0.02
Cumulative Variance 0.39 0.55 0.66 0.76 0.82 0.87 0.90 0.92 0.95 0.97 1.00
Structural scattering predictor 0.69 - - 0.08 0.04 - - - - - -
Semantic scattering predictor - 0.51 0.33 0.16 0.03 - - - - - -
Change entropy 0.07 0.34 0.45 0.25 0.11 0.22 - 0.01 - - -
Number of Developers - - 0.05 0.02 0.29 - 0.04 0.05 0.01 - 0.07
MAF 0.04 0.11 - 0.38 0.45 - 0.21 0.04 0.06 - 0.1
LOC 0.04 - 0.01 - 0.03 0.07 0.18 0.21 0.11 0.09 0.33
CBO 0.1 0.04 0.05 0.07 - 0.56 0.2 0.33 0.21 0.44 0.12
LCOM 0.01 - 0.04 - 0.01 - 0.24 0.1 0.06 0.09 0.05
NOM 0.03 - 0.01 0.01 - 0.11 - 0.12 0.43 0.22 0.1
RFC 0.01 - 0.04 0.01 0.03 - 0.13 0.06 0.12 0.1 0.09
WMC 0.01 - 0.02 0.02 0.01 0.04 - 0.08 - 0.06 0.14
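As a side note, the proportion-of-variance rows and the per-predictor contributions of the kind reported in Table 7 can be obtained with a standard PCA implementation; the following is a minimal sketch (illustrative Python, assuming scikit-learn; the data matrix is random placeholder data, and expressing contributions as squared loadings is our illustrative choice, since the paper does not detail the exact normalization used):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# X: rows = classes, columns = the eleven predictors (placeholder data).
X = np.random.rand(500, 11)

pca = PCA(n_components=11)
pca.fit(StandardScaler().fit_transform(X))

print(pca.explained_variance_ratio_)             # "Proportion of Variance" row
print(np.cumsum(pca.explained_variance_ratio_))  # "Cumulative Variance" row

# One common way to express how much each predictor contributes to each
# component is the squared loading matrix (predictors on rows, PCs on columns).
loadings_squared = pca.components_.T ** 2
print(loadings_squared)
```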

As a next step toward understanding the complementarity of the five prediction models, Tables 8, 9, 10, and 11 report the overlap metrics computed between DCBM-DM, DCBM-BCCM, DCBM-CM, and DCBM-MAF, respectively. In addition, Table 12 shows the percentage of buggy classes correctly identified only by each of the single bug prediction models (e.g., identified by DCBM and not by DM, BCCM, CM and MAF). While in this paper we only discuss in detail the overlap between our model and the alternative ones, the interested readers can find the analysis of the overlap among the other models in our online appendix [25].

Regarding the overlap between our predictor (DCBM) and the one built using the number of developers (DM), it is interesting to observe that there is high complementarity between the two models, with an overall 73% of buggy classes correctly identified only by our model, 13% only by DM, and 14% of instances correctly classified by both models. This result is consistent on all the object systems (see Table 8).

TABLE 8
Overlap analysis between DCBM and DM

System | DCBM ∩ DM % | DCBM \ DM % | DM \ DCBM %
AMQ 14 81 5
Ant 9 74 17
Aries 12 65 23
Camel 16 67 17
CXF 12 66 22
Drill 27 72 1
Falcon 12 84 4
Felix 14 65 21
JMeter 8 89 3
JS2 22 75 3
Log4j 13 75 12
Lucene 18 75 7
Oak 19 81 0
OpenEJB 17 80 3
OpenJPA 22 71 7
Pig 16 74 10
Pivot 18 80 2
Poi 11 72 17
Ranger 11 76 13
Shindig 20 61 18
Sling 16 62 21
Sqoop 19 71 10
Sshd 22 64 14
Synapse 12 79 9
Whirr 19 66 15
Xerces 32 55 13
Overall 14 73 13

An example of buggy class identified only by our model is represented by LuceneIndexer, contained in the package org.apache.camel.component.lucene of the Apache Lucene project. This class, between February 2012 and April 2012, has been modified by one developer that in the same time period worked on five other classes (the sum of structural and semantic scattering reached 138 and 192, respectively). This is the reason why our model correctly identified this class as buggy, while DM was not able to detect it due to the single developer who worked on the class. On the other hand, DM was able to detect a few instances of buggy classes not identified by DCBM. This generally happens when developers working on a code component apply less scattered changes over the other parts of the system, as in the case of the Apache Sling project, where the class AbstractSlingRepository of the package org.apache.sling.jrc.base was modified by four developers between March 2011 and May 2011. Such developers did not apply changes to other classes, thus having a low structural and semantic scattering. DM was instead able to correctly classify the class as buggy.

A similar trend is shown in Table 9, when analyzing the overlap between our model and BCCM. In this case, our model correctly classified 42% of buggy classes that are not identified by BCCM, which is, however, able to capture 29% of buggy classes missed by our approach (the remaining 29% of buggy classes are correctly identified by both models). Such complementarity is mainly due to the fact that the change-proneness of a class does not always correctly suggest buggy classes, even if it is a good indicator. Often it is important to discriminate in which situations such changes are done. For example, the class PropertyIndexLookup of the package oak.plugins.index.property in the Apache Oak project, during the time period between April 2013 and June 2013, has been changed 4 times by 4 developers that worked, in the same period, on other 6 classes. This
caused a high scattering (both structural and semantic) for all the developers, and our model correctly marked the class as buggy. Instead, BCCM did not classify the component as buggy since the number of changes applied on it is not high enough to allow the model to predict a bug. However, the model proposed by Hassan [8] is able to capture several buggy files that our model does not identify. For example, in the Apache Pig project the class SenderHome, contained in the package com.panacya.platform.service.bus.sender, experienced 27 changes between December 2011 and February 2012. Such changes were made by two developers that touched a limited number of related classes of the same package. Indeed, the sum of structural and semantic scattering was quite low (13 and 9, respectively), thus not allowing our model to classify the class as buggy. Instead, in this case the number of changes represents a good predictor.

TABLE 9
Overlap Analysis between DCBM and BCCM

System | DCBM ∩ BCCM % | DCBM \ BCCM % | BCCM \ DCBM %
AMQ 23 32 45
Ant 39 37 24
Aries 24 39 37
Camel 19 43 38
CXF 20 44 36
Drill 27 47 26
Falcon 34 40 26
Felix 29 38 34
JMeter 28 45 27
JS2 21 40 39
Log4j 16 67 17
Lucene 16 45 39
Oak 29 37 34
OpenEJB 36 35 28
OpenJPA 19 36 45
Pig 31 39 30
Pivot 34 46 20
Poi 37 33 30
Ranger 40 44 16
Shindig 31 33 36
Sling 16 31 53
Sqoop 32 49 19
Sshd 18 36 46
Synapse 20 31 49
Whirr 40 48 12
Xerces 22 43 35
Overall 29 42 29

Regarding the overlap between our model and the code metrics-based model (Table 10), also in this case the set of code components correctly predicted by both models represents only a small percentage (13% on average). This means that the two models are able to predict the bug-proneness of different code components. Moreover, the DCBM model captures 78% of buggy classes missed by the code metrics model, which is able to correctly predict 9% of code components missed by our model. For example, the DCBM model is able to correctly classify the pivot.serialization.JSONSerializer class of the Apache Pivot project, having low (good) values of size, complexity, and coupling, but modified by four developers in the quarter going from January 2013 to March 2013.

As for the overlap between MAF and our model, DCBM was able to capture 45% of buggy classes not identified by MAF. On the other hand, MAF correctly captured 29% of buggy classes missed by DCBM, while 26% of the buggy classes were correctly classified by both models. An example of class correctly classified by DCBM and missed by MAF can be found in the package org.apache.drill.common.config of the Apache Drill project, where the class DrillConfig was changed by three developers during the time period between November 2014 and January 2015. Such developers mainly worked on this and other classes of the same package (they can be considered as owners of the DrillConfig class), but they also applied changes to components structurally distant from it. For this reason, the sum of structural and semantic scattering increased and our model was able to correctly classify DrillConfig as buggy. On the other hand, an example of class correctly classified by MAF and missed by DCBM is LogManager of the package org.apache.log4j from the Log4j project. Here the two developers working on the component between March 2006 and May 2006 applied several changes to this class, as well as to related classes belonging to different packages. Such related updates decreased the semantic scattering accumulated by the developers. Thus, DCBM did not classify the instance as buggy, while MAF correctly detected less focused attention on the class and marked the class as buggy.

Finally, looking at Table 12, we can see that our approach identifies 43% of buggy classes missed by the other four techniques, as compared to 24% of BCCM, 8% of DM, 18% of MAF, and 7% of CM. This confirms that (i) our model captures something missed by the competitive models, and (ii) by combining our model with BCCM/DM/MAF/CM (RQ3) we could further improve the detection accuracy of our technique. An example of a buggy class detected only by DCBM can be found in the Apache Ant system. The class Exit, belonging to the package org.apache.tools.ant.taskdefs, has been modified just once by a single developer in the time period going from January 2004 to April 2004. However, the sum of the structural and semantic scattering in that period was very high for the involved developer (461.61 and 5,603.19, respectively), who modified a total of 38 classes spread over 6 subsystems. In the considered time period the DM does not identify Exit as buggy given the single developer who worked on it, and the BCCM fails too due to the single change Exit underwent between January and April 2004. Similarly, the CM model is not able to identify this class as buggy due to its low complexity and small size.

Conversely, an example of buggy class not detected by DCBM is represented by the class
AbstractEntityManager, belonging to the package org.apache.ivory.resource of the Apache Falcon project. Here we found 49 changes occurring on the class in the time period going from October 2012 to January 2013, applied by two developers. The sum of the structural and semantic scattering metrics in this time period was very low for both the involved developers (14.77 is the sum for the first developer, 18.19 for the second one). Indeed, the developers in that period only applied changes to another subsystem. This is the reason why our prediction model is not able to mark this class as buggy. On the other hand, the BCCM and MAF prediction models successfully identify the bugginess of the class by exploiting the information about the number of changes and ownership, respectively. DM fails due to the low number of developers involved in the change process of the class. Finally, CM is not able to correctly classify this class as buggy because of the low complexity of the class.

TABLE 10
Overlap Analysis between DCBM and CM

System | DCBM ∩ CM % | DCBM \ CM % | CM \ DCBM %
AMQ 10 65 25
Ant 8 68 24
Aries 12 58 30
Camel 22 53 25
CXF 7 84 9
Drill 5 73 22
Falcon 18 79 3
Felix 15 68 17
JMeter 15 78 7
JS2 6 88 6
Log4j 11 87 2
Lucene 11 77 12
Oak 14 83 3
OpenEJB 6 88 6
OpenJPA 18 67 15
Pig 16 75 9
Pivot 13 78 9
Poi 14 75 11
Ranger 21 75 5
Shindig 7 82 11
Sling 7 82 11
Sqoop 9 86 5
Sshd 15 72 13
Synapse 18 63 19
Whirr 8 85 7
Xerces2-j 39 59 2
Overall 13 78 9

TABLE 11
Overlap Analysis between DCBM and MAF

System | DCBM ∩ MAF % | DCBM \ MAF % | MAF \ DCBM %
AMQ 24 47 29
Ant 23 46 31
Aries 32 47 21
Camel 19 51 29
CXF 20 50 31
Drill 23 43 34
Falcon 19 42 39
Felix 24 56 20
JMeter 25 53 22
JS2 23 40 37
Log4j 26 45 30
Lucene 31 41 28
Oak 26 46 28
OpenEJB 28 49 24
OpenJPA 22 46 32
Pig 25 44 31
Pivot 26 55 19
Poi 27 44 29
Ranger 27 41 32
Shindig 27 46 27
Sling 21 37 42
Sqoop 33 43 24
Sshd 19 40 41
Synapse 21 56 23
Whirr 27 53 20
Xerces2-j 30 42 28
Overall 26 45 29

Fig. 4. Boxplot of the average F-Measure achieved by the 2,036 combinations of predictors experimented in our study (boxplot not reproduced here; the F-Measure axis ranges from 60 to 80).

Summary for RQ2. The analysis of the complementarity between our approach and the four competitive techniques showed that the proposed scattering metrics are highly complementary with respect to the metrics exploited by the baseline approaches, paving the way to "hybrid" models combining multiple predictors.

5.3 RQ3: A "Hybrid" Prediction Model

Table 13 shows the results obtained while investigating the creation of a "hybrid" bug prediction model, exploiting a combination of predictors used by the five experimented models. The top part of Table 13 (i.e., Performances of each experimented model) reports the average performances, in terms of AUC-ROC, accuracy, precision, recall, and F-measure, achieved by each of the five experimented bug prediction models. As already discussed in the context of RQ1, our DCBM model substantially outperforms the competitive ones. Such values only serve as a reference to better interpret the results of the different hybrid models we discuss in the following.
TABLE 12
Overlap Analysis considering each Model independently

System | DCBM \ (BCCM ∪ DM ∪ CM ∪ MAF) % | BCCM \ (DCBM ∪ DM ∪ CM ∪ MAF) % | DM \ (DCBM ∪ BCCM ∪ CM ∪ MAF) % | MAF \ (DCBM ∪ BCCM ∪ DM ∪ CM) % | CM \ (DCBM ∪ BCCM ∪ DM ∪ MAF) %
AMQ 44 24 9 17 6
Ant 40 25 8 20 7
Aries 41 22 10 19 8
Camel 39 21 6 22 12
CXF 45 25 9 14 7
Drill 44 25 8 18 5
Falcon 46 27 8 18 2
Felix 43 21 5 19 12
JMeter 42 23 7 17 11
JS2 45 26 10 15 4
Log4j 43 20 8 19 10
Lucene 44 23 8 20 5
Oak 39 26 9 19 7
OpenEJB 43 24 8 16 9
OpenJPA 41 26 9 18 6
Pig 44 25 9 20 2
Pivot 45 25 8 19 3
Poi 39 23 9 17 12
Ranger 48 19 10 14 9
Shindig 46 24 6 17 7
Sling 41 25 9 16 9
Sqoop 41 26 7 19 7
Sshd 44 22 10 19 5
Synapse 41 22 7 20 10
Whirr 40 23 8 18 11
Xerces 47 23 9 12 9
Overall 43 24 8 18 7

The second part of Table 13 (i.e., Boost provided by our scattering metrics to each baseline model) reports the performances of the four competitive bug prediction models when augmented with our predictors. The boost provided by our metrics is evident in all the baseline models. Such a boost goes from a minimum of +8% in terms of F-Measure (for the model based on change entropy) up to +49% for the model exploiting the number of developers as predictor. However, it is worth noting that the combined models do not seem to improve the performances of our DCBM model.

The third part of Table 13 (i.e., Boost provided by our scattering metrics to a comprehensive baseline model) seems to tell a different story. In this case, we combined all predictors belonging to the four baseline models into a single, comprehensive bug prediction model, and assessed its performances. Then, we added our scattering metrics to such a comprehensive baseline model and assessed again its performances. As it can be seen from Table 13, the performances of the two models (i.e., the one with and the one without our scattering metrics) are almost the same (F-measure=71% for both of them). This suggests the absence of any type of impact (positive or negative) of our metrics on the model's performances, which is something unexpected considering the previously performed analyses. Such a result might be due to the high number of predictor variables exploited by the model (eleven in this case), possibly causing model overfitting on the training sets with consequent bad performances on the test set. Again, the combination of predictors does not seem to improve the performances of our DCBM model. Thus, as explained in Section 4.3, to verify the possibility to build an effective hybrid model we investigated in an exhaustive way the combination of predictors that leads to the best prediction accuracy by using the wrapper approach proposed by Kohavi and John [47].

Figure 4 plots the average F-measure obtained by each of the 2,036 combinations of predictors experimented. The first thing that stands out is the very high variability of the performances obtained by the different combinations of predictors, ranging between a minimum of 62% and a maximum of 79% (mean=70%, median=71%). The bottom part of Table 13 (i.e., Top-5 predictors combinations obtained from the wrapper selection algorithm) reports the performances of the top five predictor combinations. The best configuration, achieving an average F-Measure of 79%, exploits as predictors the CBO coupling metric [1], the change entropy by Hassan [8], the structural and semantic scattering defined in this paper, and the module activity focus by Posnett et al. [22]. Such a configuration also exhibits a very high AUC (90%) and represents a substantial improvement in prediction accuracy over the best model used in isolation (i.e., DCBM, with an average F-Measure of 74% and an AUC=76%) as well as over the comprehensive model exploiting all the baselines' predictors in combination (+8% in terms of F-Measure).
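For concreteness, the following sketch (illustrative Python; scikit-learn's decision tree is only a stand-in for the Majority Decision Table learner actually used in the study, and the column names are ours) shows how such a hybrid model built on the best-performing predictor combination could be trained on the penultimate quarter and evaluated on the last one:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier  # stand-in learner
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score

# Best combination found by the wrapper search (per Table 13); column names are ours.
BEST_PREDICTORS = ["CBO", "change_entropy",
                   "structural_scattering", "semantic_scattering", "MAF"]

def evaluate_hybrid(train: pd.DataFrame, test: pd.DataFrame):
    """Train on the penultimate three-month period, test on the last one."""
    model = DecisionTreeClassifier().fit(train[BEST_PREDICTORS], train["buggy"])
    pred = model.predict(test[BEST_PREDICTORS])
    prec, rec, f1, _ = precision_recall_fscore_support(
        test["buggy"], pred, average="binary")
    auc = roc_auc_score(test["buggy"],
                        model.predict_proba(test[BEST_PREDICTORS])[:, 1])
    return {"precision": prec, "recall": rec, "f-measure": f1, "auc": auc}
```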

TABLE 13
RQ3 : Performances of “hybrid” prediction models
Avg. AUC-ROC Avg. Accuracy Avg. Precision Avg. Recall Avg. F-measure
Performances of each experimented model
DM 51 24 19 25 21
BCCM 63 70 61 69 64
CM 52 46 44 45 44
MAF 62 65 59 64 61
DCBM 76 77 72 77 74

Boost provided by our scattering metrics to each baseline model


DM + Struct-scattering + Seman-scattering 78 71 73 68 70
BCCM + Struct-scattering + Seman-scattering 77 70 76 69 72
CM + Struct-scattering + Seman-scattering 76 70 73 70 71
MAF + Struct-scattering + Seman-scattering 77 70 73 70 71

Boost provided by our scattering metrics to a comprehensive baseline model


# Developers, Entropy, LOC, CBO, LCOM, NOM, RFC, WMC, MAF 78 69 73 68 71
# Developers, Entropy, LOC, CBO, LCOM, NOM, RFC, WMC, MAF, Struct-scattering, Seman-scattering 76 71 72 71 71

Top-5 predictors combinations obtained from the wrapper selection algorithm


CBO, Change Entropy, Struct-scattering, Seman-scattering, MAF 90 85 77 81 79
LOC, LCOM, Change Entropy, Seman-scattering, # Developers, MAF 78 72 77 77 77
LOC, NOM, WMC, Change Entropy, Struct-scattering 78 70 77 75 76
LOC, LCOM, NOM, Seman-scattering 77 70 75 75 75
LOC, CBO, LCOM, NOM, RFC, Struct-scattering, Seman-scattering 77 71 76 73 75

Such a result supports our conjecture that blindly combining predictors (as we did in the comprehensive model) could result in sub-optimal performances, likely due to model overfitting.

Interestingly, the best combination of baselines' predictors (i.e., all predictors from the four competitive models) obtained as a result of the wrapper approach is composed of BCCM (i.e., entropy of changes), MAF, and the RFC and WMC metrics from the CM model, and achieves 70% in terms of F-Measure (9% less with respect to the best combination of predictors, which also exploits our scattering metrics).

We also statistically compare the prediction accuracy obtained across the 26 subject systems by the best-performing "hybrid" configuration and the best performing model. Also in this case, we exploited the Mann-Whitney test [49] for this statistical test, as well as the Cliff's Delta [50] to estimate the magnitude of the measured differences. We observed a statistically significant difference (p-value=0.03) with a medium effect size (d = 0.36).

Looking at the predictors more frequently exploited in the five most accurate prediction models, we found that:
1) Semantic-scattering, LOC. Our semantic predictor and the LOC are present in 4 out of the 5 most accurate prediction models. This confirms the well-known bug prediction power of size metrics (LOC) and suggests the importance for developers to work on semantically related code components in the context of a given maintenance/evolution activity.
2) Change entropy, LCOM, Structural-scattering. These predictors are present in 3 out of the 5 most accurate prediction models. This confirms that (i) the change entropy is a good predictor for buggy code components [8], (ii) classes exhibiting low cohesion can be challenging to maintain for developers [1], and (iii) scattered changes performed across different subsystems can increase the chances of introducing bugs.

In general, the results of all our three research questions seem to confirm the observations made by D'Ambros et al. [15]: no technique based on a single metric works better in all contexts. This is why the combination of multiple predictors can provide better results. We are confident that plugging other orthogonal predictors in the "hybrid" prediction model could further increase the prediction accuracy.

Summary for RQ3. By combining the eleven predictors exploited by the five prediction models subject of our study it is possible to obtain a boost of prediction accuracy up to +5% with respect to the best performing model (i.e., DCBM) and +9% with respect to the best combination of baseline predictors. Also, the top five "hybrid" prediction models include at least one of the predictors proposed in this work (i.e., the structural and semantic scattering of changes), and the best model includes both.

6 THREATS TO VALIDITY

This section describes the threats that can affect the validity of our study. Threats to construct validity concern the relation between the theory and the observation, and in this work are mainly due to the measurements we performed. This is the most important type of threat for our study and it is related to:
• Missing or wrong links between bug tracking systems and versioning systems [51]: although not much can be done for missing links, as explained in the design we verified that links between commit notes and issues were correct;
• Imprecision due to tangled code changes [52]: we cannot exclude that some commits we identified as bug-fixes grouped together tangled code changes, of which just a subset represented the committed patch;
• Imprecision in issue classification made by issue-tracking systems [19]: while we cannot exclude misclassification of issues (e.g., an enhancement classified as a bug), at least all the systems considered in our study used Bugzilla as issue tracking system, explicitly pointing to bugs in the issue type field;
• Undocumented bugs present in the system: while we relied on the issue tracker to identify the bugs fixed during the change history of the object systems, it is possible that undocumented bugs were present in some classes, leading to wrong classifications of buggy classes as "clean" ones;
• Approximations due to identifying fix-inducing changes using the SZZ algorithm [45]: at least we used heuristics to limit the number of false positives, for example excluding blank and comment lines from the set of fix-inducing changes.

Threats to internal validity concern external factors we did not consider that could affect the variables being investigated. We computed the developer's scattering measures by analyzing the developers' activity on a single software system. However, it is well known that, especially in open source communities and ecosystems, developers contribute to multiple projects in parallel [53]. This might negatively influence the "developer's scattering" assessment made by our metrics. Still, the results of our approach can only improve by considering more sophisticated ways of computing our metrics.

Threats to conclusion validity concern the relation between the treatment and the outcome. The metrics used in order to evaluate our defect prediction approach (i.e., accuracy, precision, recall, F-Measure, and AUC) are widely used in the evaluation of the performances of defect prediction techniques [15]. Moreover, we used appropriate statistical procedures (i.e., PCA [54]) and the computation of overlap metrics to study the orthogonality between our model and the competitive ones. Since we needed to exploit change-history information to compute the scattering metrics we proposed, the evaluation design adopted in our study is different from the k-fold cross validation [55] generally exploited while evaluating bug prediction techniques. In particular, we split the change-history of the object systems into three-month time periods and we adopted a three-month sliding window to train and test the experimented bug prediction models. This type of validation is typically adopted when using process metrics as predictors [8], although it might be penalizing when using product metrics, which are typically assessed using a ten-fold cross validation. Furthermore, although we selected a model exploiting a set of product metrics previously shown to be effective in the context of bug prediction [1], the poor performances of the CM model might be due to the fact that the model relies on too many predictors, resulting in model overfitting. This conjecture is supported by the results achieved in the context of RQ3, where we found that the top five "hybrid" prediction models include only a subset of code metrics.

Threats to external validity concern the generalization of results. We analyzed 26 Apache systems from different application domains and with different characteristics (number of developers, size, number of classes, etc.). However, systems from different ecosystems should be analyzed to corroborate our findings.

7 CONCLUSION AND FUTURE WORK

A lot of effort has been devoted in the last decade to analyzing the influence of the development process on the likelihood of introducing bugs. Several empirical studies have been carried out to assess under which circumstances and during which coding activities developers tend to introduce bugs. In addition, bug prediction techniques built on top of process metrics have been proposed. However, changes in source code are made by developers that often work under stressful conditions due to the need of delivering their work as soon as possible.

The role of developer-related factors in the bug prediction field is still a partially explored area. This paper makes a further step ahead by studying the role played by the developer's scattering in bug prediction. Specifically, we defined two measures that consider the amount of code components a developer modifies in a given time period and how these components are spread structurally (structural scattering) and in terms of the responsibilities they implement (semantic scattering). The defined measures have been evaluated as bug predictors in an empirical study performed on 26 open source systems. In particular, we built a prediction model exploiting our measures and compared its prediction accuracy with four baseline techniques exploiting process metrics as predictors. The achieved results showed the superiority of our model and its high level of complementarity with respect to the considered competitive techniques. We also built and experimented with a "hybrid" prediction model on top of the eleven predictors exploited by the five competitive techniques. The achieved results show that (i) the "hybrid" model is able to achieve a higher accuracy with respect to each of the five models taken in isolation, and (ii) the predictors proposed in this paper play a major role in the best performing "hybrid" prediction models.

Our future research agenda includes a deeper investigation of the factors causing scattering to developers and negatively impacting their ability to deal with code change tasks. We plan to reach such an objective by performing a large survey with industrial and open source developers. We also plan to apply our technique at different levels of granularity, to verify if we can point out buggy code components at a finer granularity level (e.g., methods).
REFERENCES

[1] V. Basili, L. Briand, and W. Melo, "A validation of object-oriented design metrics as quality indicators," IEEE Transactions on Software Engineering, vol. 22, no. 10, pp. 751–761, Oct 1996.
[2] T. Gyimóthy, R. Ferenc, and I. Siket, "Empirical validation of object-oriented metrics on open source software for fault prediction," IEEE Transactions on Software Engineering, vol. 31, no. 10, pp. 897–910, 2005.
[3] N. Ohlsson and H. Alberg, "Predicting fault-prone software modules in telephone switches," IEEE Transactions on Software Engineering, vol. 22, no. 12, pp. 886–894, 1996.
[4] N. Nagappan and T. Ball, "Static analysis tools as early indicators of pre-release defect density," in Proceedings of the 27th International Conference on Software Engineering, ser. ICSE '05. New York, NY, USA: ACM, 2005, pp. 580–586. [Online]. Available: http://doi.acm.org/10.1145/1062455.1062558
[5] T. Zimmermann, R. Premraj, and A. Zeller, "Predicting defects for eclipse," in Proceedings of the Third International Workshop on Predictor Models in Software Engineering, ser. PROMISE '07. Washington, DC, USA: IEEE Computer Society, 2007, pp. 9–. [Online]. Available: http://dx.doi.org/10.1109/PROMISE.2007.10
[6] A. N. Taghi M. Khoshgoftaar, Nishith Goel and J. McMullan, "Detection of software modules with high debug code churn in a very large legacy system," in Software Reliability Engineering. IEEE, 1996, pp. 364–371.
[7] J. S. M. Todd L. Graves, Alan F. Karr and H. P. Siy, "Predicting fault incidence using software change history," IEEE Transactions on Software Engineering, vol. 26, no. 7, pp. 653–661, 2000.
[8] A. E. Hassan, "Predicting faults using the complexity of code changes," in ICSE. Vancouver, Canada: IEEE Press, 2009, pp. 78–88.
[9] R. Bell, T. Ostrand, and E. Weyuker, "The limited impact of individual developer data on software defect prediction," Empirical Software Engineering, vol. 18, no. 3, pp. 478–505, 2013. [Online]. Available: http://dx.doi.org/10.1007/s10664-011-9178-4
[10] T. J. Ostrand, E. J. Weyuker, and R. M. Bell, "Programmer-based fault prediction," in Proceedings of the 6th International Conference on Predictive Models in Software Engineering, ser. PROMISE '10. New York, NY, USA: ACM, 2010, pp. 19:1–19:10. [Online]. Available: http://doi.acm.org/10.1145/1868328.1868357
[11] R. Moser, W. Pedrycz, and G. Succi, "Analysis of the reliability of a subset of change metrics for defect prediction," in Proceedings of the Second ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, ser. ESEM '08. New York, NY, USA: ACM, 2008, pp. 309–311. [Online]. Available: http://doi.acm.org/10.1145/1414004.1414063
[12] R. M. Bell, T. J. Ostrand, and E. J. Weyuker, "Does measuring code change improve fault prediction?" in Proceedings of the 7th International Conference on Predictive Models in Software Engineering, ser. Promise '11. New York, NY, USA: ACM, 2011, pp. 2:1–2:8. [Online]. Available: http://doi.acm.org/10.1145/2020390.2020392
[13] W. P. Raimund Moser and G. Succi, "A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction," in International Conference on Software Engineering (ICSE), ser. ICSE '08, 2008, pp. 181–190.
[14] N. Nagappan, T. Ball, and A. Zeller, "Mining metrics to predict component failures," in Proceedings of the 28th International Conference on Software Engineering, ser. ICSE '06. New York, NY, USA: ACM, 2006, pp. 452–461. [Online]. Available: http://doi.acm.org/10.1145/1134285.1134349
[15] M. D'Ambros, M. Lanza, and R. Robbes, "Evaluating defect prediction approaches: a benchmark and an extensive comparison," Empirical Software Engineering, vol. 17, no. 4, pp. 531–577, 2012.
[16] J. Sliwerski, T. Zimmermann, and A. Zeller, "Don't program on fridays! how to locate fix-inducing changes," in Proceedings of the 7th Workshop Software Reengineering, May 2005.
[17] L. T. Jon Eyolfso and P. Lam, "Do time of day and developer experience affect commit bugginess?" in Proceedings of the 8th Working Conference on Mining Software Repositories, ser. MSR '11, 2011, pp. 153–162.
[18] F. Rahman and P. Devanbu, "Ownership, experience and defects: a fine-grained study of authorship," in Proceedings of the 33rd International Conference on Software Engineering, ser. ICSE '11, 2011, pp. 491–500.
[19] E. J. W. J. Sunghun Kim and Y. Zhang, "Classifying software changes: Clean or buggy?" IEEE Transactions on Software Engineering, vol. 34, no. 2, pp. 181–196, 2008.
[20] C. Bird, N. Nagappan, B. Murphy, H. Gall, and P. Devanbu, "Don't touch my code!: Examining the effects of ownership on software quality," in Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, ser. ESEC/FSE '11. ACM, 2011, pp. 4–14.
[21] G. Bavota, B. De Carluccio, A. De Lucia, M. Di Penta, R. Oliveto, and O. Strollo, "When does a refactoring induce bugs? an empirical study," in Proceedings of the 12th International Working Conference on Source Code Analysis and Manipulation, ser. SCAM '12, 2012, pp. 104–113.
[22] D. Posnett, R. D'Souza, P. Devanbu, and V. Filkov, "Dual ecological measures of focus in software development," in Proceedings of the 2013 International Conference on Software Engineering, ser. ICSE '13. IEEE Press, 2013, pp. 452–461.
[23] D. D. Nucci, F. Palomba, S. Siravo, G. Bavota, R. Oliveto, and A. D. Lucia, "On the role of developer's scattered changes in bug prediction," in Proceedings of the 31st International Conference on Software Maintenance and Evolution, ICSME '15, Bremen, Germany, 2015, pp. 241–250.
[24] V. Basili, G. Caldiera, and D. H. Rombach, The Goal Question Metric Paradigm. John Wiley and Sons, 1994.
[25] D. D. Nucci, F. Palomba, G. D. Rosa, G. Bavota, R. Oliveto, and A. D. Lucia. (2016) A developer centered bug prediction model - replication package - https://figshare.com/articles/A Developer Centered Bug Prediction Model/3435299.
[26] W. M. Khaled El Emam and J. C. Machado, "The prediction of faulty classes using object-oriented design metrics," Journal of Systems and Software, vol. 56, no. 1, pp. 63–75, 2001.
[27] R. Subramanyam and M. S. Krishnan, "Empirical analysis of ck metrics for object-oriented design complexity: Implications for software defects," IEEE Transactions on Software Engineering, vol. 29, no. 4, pp. 297–310, 2003.
[28] A. P. Nikora and J. C. Munson, "Developing fault predictors for evolving software systems," in Proceedings of the 9th IEEE International Symposium on Software Metrics. IEEE CS Press, 2003, pp. 338–349.
[29] Y. Zhou, B. Xu, and H. Leung, "On the ability of complexity metrics to predict fault-prone classes in object-oriented systems," Journal of Systems and Software, vol. 83, no. 4, pp. 660–674, 2010.
[30] N. Nagappan and T. Ball, "Use of relative code churn measures to predict system defect density," in Software Engineering, 2005. ICSE 2005. Proceedings. 27th International Conference on. IEEE, 2005, pp. 284–292.
[31] A. E. Hassan and R. C. Holt, "Studying the chaos of code development," in Proceedings of the 10th Working Conference on Reverse Engineering, 2003.
[32] A. E. Hassan and R. C. Holt, "The top ten list: dynamic fault prediction," in Proceedings of the 21st IEEE International Conference on Software Maintenance, 2005, ser. ICSM '05. IEEE Computer Society, 2005, pp. 263–272.
[33] S. Kim, T. Zimmermann, E. J. Whitehead Jr, and A. Zeller, "Predicting faults from cached history," in Proceedings of the 29th International Conference on Software Engineering. IEEE Computer Society, 2007, pp. 489–498.
[34] N. Nagappan, A. Zeller, T. Zimmermann, K. Herzig, and B. Murphy, "Change bursts as defect predictors," in Software Reliability Engineering (ISSRE), 2010 IEEE 21st International Symposium on. IEEE, 2010, pp. 309–318.
[35] R. M. Bell, T. J. Ostrand, and E. J. Weyuker, "Looking for bugs in all the right places," in Proceedings of the 2006 International Symposium on Software Testing and Analysis. ACM, 2006, pp. 61–72.
[36] S. R. Chidamber and C. F. Kemerer, "A metrics suite for object oriented design," IEEE Transactions on Software Engineering, vol. 20, no. 6, pp. 476–493, June 1994.
[37] G. Bavota, A. D. Lucia, A. Marcus, and R. Oliveto, "Using structural and semantic measures to improve software modularization," Empirical Software Engineering, vol. 18, no. 5, pp. 901–932, 2013.
[38] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. Addison-Wesley, 1999.
[39] L. M. Y. Freund, "The alternating decision tree learning algorithm," in Proceedings of the Sixteenth International Conference on Machine Learning, 1999, pp. 124–133.
[40] R. Kohavi, "The power of decision tables," in 8th European Conference on Machine Learning. Springer, 1995, pp. 174–189.
[41] S. le Cessie and J. van Houwelingen, "Ridge estimators in logistic regression," Applied Statistics, vol. 41, no. 1, pp. 191–201, 1992.
[42] F. Rosenblatt, Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books, 1961.
[43] G. H. John and P. Langley, “Estimating continuous distributions in Bayesian classifiers,” in Eleventh Conference on Uncertainty in Artificial Intelligence. San Mateo: Morgan Kaufmann, 1995, pp. 338–345.
[44] T. Menzies, B. Caglayan, Z. He, E. Kocaguneli, J. Krall, F. Peters, and B. Turhan. (2012, June) The PROMISE repository of empirical software engineering data. [Online]. Available: http://promisedata.googlecode.com
[45] J. Sliwerski, T. Zimmermann, and A. Zeller, “When do changes induce fixes?” in Proceedings of the 2005 International Workshop on Mining Software Repositories, MSR 2005. ACM, 2005.
[46] L. Moonen, “Generating robust parsers using island grammars,” in Reverse Engineering, 2001. Proceedings. Eighth Working Conference on, 2001, pp. 13–22.
[47] R. Kohavi and G. H. John, “Wrappers for feature subset selection,” Artificial Intelligence, vol. 97, no. 1-2, pp. 273–324, Dec. 1997. [Online]. Available: http://dx.doi.org/10.1016/S0004-3702(97)00043-X
[48] S. Mancoridis, B. S. Mitchell, C. Rorres, Y.-F. Chen, and E. R. Gansner, “Using automatic clustering to produce high-level system organizations of source code,” in Proceedings of the 6th International Workshop on Program Comprehension. Ischia, Italy: IEEE CS Press, 1998.
[49] W. J. Conover, Practical Nonparametric Statistics, 3rd ed. Wiley, 1998.
[50] R. J. Grissom and J. J. Kim, Effect Sizes for Research: A Broad Practical Approach, 2nd ed. Lawrence Erlbaum Associates, 2005.
[51] C. Bird, A. Bachmann, E. Aune, J. Duffy, A. Bernstein, V. Filkov, and P. Devanbu, “Fair and balanced?: Bias in bug-fix datasets,” in Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on The Foundations of Software Engineering, ser. ESEC/FSE ’09. New York, NY, USA: ACM, 2009, pp. 121–130. [Online]. Available: http://doi.acm.org/10.1145/1595696.1595716
[52] K. Herzig and A. Zeller, “The impact of tangled code changes,” in Proceedings of the 10th Working Conference on Mining Software Repositories, MSR ’13, San Francisco, CA, USA, May 18-19, 2013, pp. 121–130.
[53] G. Bavota, G. Canfora, M. Di Penta, R. Oliveto, and S. Panichella, “The evolution of project inter-dependencies in a software ecosystem: The case of Apache,” in Software Maintenance (ICSM), 2013 29th IEEE International Conference on, Sept 2013, pp. 280–289.
[54] I. Jolliffe, Principal Component Analysis. John Wiley & Sons, Ltd, 2005. [Online]. Available: http://dx.doi.org/10.1002/0470013192.bsa501
[55] P. A. Devijver and J. Kittler, Pattern Recognition: A Statistical Approach, 1982.
