Themistoklis Diamantopoulos
Andreas L. Symeonidis
Mining Software Engineering Data for Software Reuse
Advanced Information and Knowledge Processing
Editors-in-Chief
Lakhmi C. Jain, Bournemouth University, Poole, UK, and University of South Australia, Adelaide, Australia
Xindong Wu, University of Vermont, USA
Series Editors
Sheryl Brahnam, Missouri State University, Springfield, USA
Diane J. Cook, Washington State University, Pullman, WA, USA
Josep Domingo-Ferrer, Universitat Rovira i Virgili, Tarragona, Spain
Bogdan Gabrys, School of Design, Bournemouth University, Poole, UK
Francisco Herrera, ETS de Ingenierías Informática y de Telecomunicación, University of Granada, Granada, Spain
Hiroshi Mamitsuka, School of Pharmaceutical Sciences, Kyoto University, Kyoto, Japan
Vir V. Phoha, Department of Electrical Engineering and Computer Science, Syracuse University, Syracuse, NY, USA
Arno Siebes, Utrecht, The Netherlands
Philippe de Wilde, Office of the Vice Chancellor, University of Kent, Canterbury, UK
Information systems and intelligent knowledge processing are playing an increasing
role in business, science and technology. Recently, advanced information systems
have evolved to facilitate the co-evolution of human and information networks
within communities. These advanced information systems use various paradigms
including artificial intelligence, knowledge management, and neural science as well
as conventional information processing paradigms.
The aim of this series is to publish books on new designs and applications of advanced information and knowledge processing paradigms in areas including but not limited to aviation, business, security, education, engineering, health, management, and science.
Books in the series should have a strong focus on information processing, preferably combined with, or extended by, new results from adjacent sciences.
Proposals for research monographs, reference books, coherently integrated
multi-author edited books, and handbooks will be considered for the series and
each proposal will be reviewed by the Series Editors, with additional reviews from
the editorial board and independent reviewers where appropriate. Titles published
within the Advanced Information and Knowledge Processing Series are included in
Thomson Reuters’ Book Citation Index and Scopus.
Themistoklis Diamantopoulos, Thessaloniki, Greece
Andreas L. Symeonidis, Thessaloniki, Greece
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
List of Figures
Fig. 5.13 Query for a file reader on the Advanced Search page of AGORA
Fig. 5.14 Query for an edit distance method on the Advanced Search page of AGORA
Fig. 5.15 Query for a pair class on the Advanced Search page of AGORA
Fig. 5.16 Example sequence of queries for GitHub
Fig. 5.17 Example modified test case for the result of a code search engine
Fig. 5.18 Evaluation diagram depicting the average number of compilable and tested results for the three code search engines
Fig. 5.19 Evaluation diagrams depicting (a) the search length for finding compilable results and (b) the search length for finding results that pass the tests, for the three code search engines
Fig. 6.1 Architecture of PARSEWeb [9]
Fig. 6.2 Example screen of Code Conjurer for a loan calculator component [13]
Fig. 6.3 Example screen of S6 for a converter of numbers to roman numerals [35]
Fig. 6.4 The architecture/data flow of Mantissa
Fig. 6.5 Example signature for a class “Stack” with methods “push” and “pop”
Fig. 6.6 Example AST for the source code example of Fig. 6.5
Fig. 6.7 Example sequence of queries for Searchcode
Fig. 6.8 Example sequence of queries for GitHub
Fig. 6.9 Regular expression for extracting method declarations from snippets
Fig. 6.10 Algorithm that computes the similarity between two sets of elements, U and V, and returns the best possible matching as a set of pairs MatchedPairs
Fig. 6.11 String preprocessing steps
Fig. 6.12 Example signature for a “Stack” component with two methods for adding and deleting elements from the stack
Fig. 6.13 Example vector space representation for the query “Stack” and the retrieved result “MyStack”
Fig. 6.14 Example signature for a class “Customer” with methods “setAddress” and “getAddress”
Fig. 6.15 Example retrieved result for component “Customer”
Fig. 6.16 Example modified result for the file of Fig. 6.15
Fig. 6.17 Example test case for the “Customer” component of Fig. 6.14
Fig. 6.18 Screenshot depicting the search page of Mantissa
Fig. 6.19 Screenshot depicting the results page of Mantissa
Fig. 8.3 Example sequence extracted from a source code snippet
Fig. 8.4 Example code snippet from Stack Overflow post about “how to delete file from ftp server using Java?”
Fig. 8.5 Percentage of relevant results in the first 20 results of each query, evaluated for questions with snippet sequence length larger than or equal to 1
Fig. 8.6 Percentage of relevant results in the first 20 results of each query, evaluated for questions with snippet sequence length larger than or equal to (a) 3 and (b) 5
Fig. 9.1 The architecture of QualBoa
Fig. 9.2 Example signature for class “Stack” with methods “push” and “pop”
Fig. 9.3 Boa query for components and metrics
Fig. 9.4 Evaluation results of QualBoa depicting (a) the average precision and (b) the average reusability score for each query
Fig. 10.1 Diagram depicting the number of stars versus the number of forks for the 100 most popular GitHub repositories
Fig. 10.2 Package-level distribution of API Documentation for all repositories
Fig. 10.3 Package-level distribution of API Documentation for two repositories
Fig. 10.4 API Documentation versus star-based reusability score
Fig. 10.5 Overview of the reusability evaluation system
Fig. 10.6 Average NRMSE for all three machine learning algorithms
Fig. 10.7 Reusability score distribution at (a) class and (b) package level
Fig. 10.8 Boxplots depicting reusability distributions for three human-generated and two auto-generated projects, (a) at class level and (b) at package level (color online)
List of Tables
Table 5.7 Compiled results and results that passed the tests for the three code search engines, for each query and on average, where the format for each value is number of tested results/number of compiled results/total number of results
Table 5.8 Search length for the compiled and passed results for the three code search engines, for each result and on average, where the format for each value is passed/compiled
Table 6.1 Feature comparison of popular test-driven reuse systems and Mantissa
Table 6.2 String preprocessing examples
Table 6.3 Dataset for the evaluation of Mantissa against code search engines
Table 6.4 Compiled and passed results out of the total returned results for the three code search engines and the two Mantissa implementations, for each result and on average, where the format is passed/compiled/total results
Table 6.5 Search length for the passed results for the three code search engines and the two Mantissa implementations, for each result and on average
Table 6.6 Dataset for the evaluation of Mantissa against FAST
Table 6.7 Compiled and passed results out of the total returned results for FAST and the two Mantissa implementations, for each result and on average, where the format for each value is passed/compiled/total results
Table 6.8 Dataset for the evaluation of Mantissa against Code Conjurer
Table 6.9 Results that passed the tests and response time for the two approaches of Code Conjurer (Interface-based and Adaptation) and the two implementations of Mantissa (using AGORA and using GitHub), for each result and on average
Table 7.1 API clusters of CodeCatch for query “How to upload file to FTP”
Table 7.2 Statistics of the queries used as evaluation dataset
Table 8.1 Example lookup table for the snippet of Fig. 8.1
Table 8.2 Example Stack Overflow questions that are similar to “uploading to FTP using java”
Table 9.1 The reusability metrics of QualBoa
Table 9.2 The reusability model of QualBoa
Table 9.3 Dataset for the evaluation of QualBoa
Table 9.4 Evaluation results of QualBoa
Table 10.1 Categories of static analysis metrics related to reusability
1.1 Overview
According to the 1994 CHAOS report of the Standish Group [3], companies and government agencies in the United States were spending approximately $250 billion each year on software development. This figure alone is alarming; the staggering fact, however, stems from the conclusions of the report,
which estimates that $81 billion out of these company/government funds were spent
on (ultimately) canceled software projects, and another $59 billion were spent on
challenged projects that may be completed; however, they exceeded their initial cost
and time estimates. Even now, more than 20 years later, the software industry faces
more or less the same challenges; the updated report of 2015 by the same organiza-
tion, which is publicly available [4], suggests that 19% of software projects usually
fail, while 52% are challenged (cost more than originally estimated and/or delivered
late). This means that only 29%, i.e., less than one-third of the projects, are delivered
successfully in a cost-effective and timely manner.1 The latest CHAOS report at the time of writing was released in late 2018 [5] and still indicates that only 36% of projects succeed, i.e., almost 2 out of 3 projects fail or surpass their original estimates.
The reasons for these unsettling statistics are diverse. A recent study [6] identified 26 such reasons, including the inadequate identification of software requirements, inaccurate time/cost estimates, badly defined specifications by the development team, the lack of quality checks, the unfamiliarity of the team with the technologies involved, etc. Although some of these causes can be avoided under certain circumstances, the real challenge is to manage all, or at least most, of them while keeping the required cost and effort to a minimum. This can only be accomplished by making proper decisions at all stages of the software development life cycle.
Decision-making, however, is not trivial; it is a challenging process that requires
proper analysis of all relevant data. The set of methodologies used to support decision-making in the area of software engineering is commonly described by the term software intelligence.2 In particular, as noted by Hassan and Xie [8], software intelligence “offers software practitioners up-to-date and pertinent information to support their daily decision-making processes”. Software practitioners are not only developers, but anyone actively participating in the software life cycle, i.e., product owners, software maintainers, managers, and even end users.
is to provide the means to optimize the software development processes, to facilitate
all software engineering phases, including requirements elicitation, code writing,
quality assurance, etc., to allow for unlocking the potential of software teams, and
generally to support decision-making based on the analysis of data.
As already mentioned, though, the main notion of using data for decision-making has been pursued for quite some time.
1 https://www.infoq.com/articles/standish-chaos-2015.
2 Software intelligence can actually be seen as a specialization of business intelligence, which, according to Gartner analyst Howard Dresner [7], is an umbrella term used to describe “concepts and methods to improve business decision-making by using fact-based support systems”. Although this contemporary definition can be traced to the late 1980s, the term business intelligence is actually more than a century old and is first attributed to Richard M. Devens, who used it (in his book “Cyclopaedia of Commercial and Business Anecdotes: Comprising Interesting Reminiscences and Facts, Remarkable Traits and Humors ... of Merchants, Traders, Bankers ... etc. in All Ages and Countries”, published by D. Appleton & Company in 1868) to describe the ability of the banker Sir Henry Furnese to profit by making successful decisions based on information received from his environment.
Today, however, the potential is greater than ever, simply because the data are now abundantly available. The introduction of new collaborative development paradigms and open-source software initiatives, such as the Open Source Initiative,3 has led to various online services with different scopes, including code hosting services, question-answering communities, bug tracking systems, etc. The data residing in these services can be harnessed to build intelligent systems that can in turn be employed to support decision-making for software engineering.
The answer to how these data sources can be effectively parsed and utilized is actually a problem of reuse. Software reuse is a term first coined by the Bell Labs engineer Douglas McIlroy, who suggested using already existing components when developing new products [9]. More generally, software reuse can be defined as “the use of existing assets in some form within the software product development process” [10], where the assets can be software engineering artifacts (e.g., requirements, source code, etc.) or knowledge derived from data (e.g., software metrics, semantic information, etc.) [11]. Obviously, reuse is already practiced to a certain degree, e.g., software libraries are a good example of reusing existing functionality.
However, the challenge is to fully unlock the potential of the data in order to truly take advantage of the existing assets and subsequently improve the practices of the software industry. Thus, one may safely argue that the data are there; the challenge is to analyze them in order to extract the required knowledge and to reshape that knowledge so that it can be successfully integrated into the software development process [12].
The data of a software project comprise its source code as well as all the files accompanying it, i.e., documentation, readme files, etc. If some type of version control is used, the data may also include information about commits, release cycles, etc., while an integrated bug tracking system may additionally offer issues, bugs, links to the commits that resolve them, etc. In certain cases, there may also be access to the requirements
of the project, which may come in various forms, e.g., text, UML diagrams, etc.
Finally, at a slightly higher level, the raw data of these sources can serve to produce
processed data in the form of software metrics [13], used to measure various quality
aspects of the software engineering process (e.g., quality metrics, people metrics,
etc.). So, there is no doubt that the data are there; the challenge is to utilize them
effectively toward achieving certain goals.
The ambition of this book is to employ these software engineering data in order to facilitate the actions and decisions of the various stakeholders involved in a software project. In particular, the scope of the book can be summarized as follows:
3 http://opensource.org/.
Apply data mining techniques on software engineering data in order to facilitate the different
phases of the software development life cycle
This purpose is obviously broad enough to cover different types of data and different
results; however, the aim is not to exhaust all possible methodologies in the area
of software engineering. The focus is instead on producing a unified approach and
illustrating that data analysis can potentially improve the software process.
We view the whole exercise through the prism of software reuse and illustrate how already existing data can be harnessed to assist in the development process. The proposed solution advances research in the requirements elicitation phase, continues with the phases of design and specification extraction, where focus is given to software development and testing, and also contributes to software quality evaluation. Using the products of our research, we aspire that development teams will be able to save considerable time and effort, and navigate more easily through the process of creating software.
Source code mining is an area that has lately gained considerable traction, as faster
high-quality development usually translates to faster time to market with minimized
cost. The idea of directly aiding developers in writing code was initially addressed mostly by the appropriate tools (e.g., Integrated Development Environments, IDEs) and frameworks (languages, libraries, etc.). In recent years, a clear trend toward a component-based philosophy of software can be observed. As such, developers require ready-to-use solutions for covering the desired software functionality. These solutions can be reusable components of software projects, API functionalities of software libraries, responses to queries in search engines and/or question-answering systems, etc. Thus, we focus our work on different areas in order to cover a large part of what we call source code or functionality reuse.
Our first contribution is a specialized search engine for source code, commonly
called a Code Search Engine (CSE). Several CSEs have been developed, mainly during the last 15 years; however, most of them have more or less suffered from the same
issues: they do not offer advanced queries according to source code syntax, they are
not integrated with code hosting services and thus can easily become outdated, and
they do not offer complete well-documented APIs. By contrast, our CSE performs
extensive document indexing, allowing for several different types of queries, while
it is also continuously updated from code hosting services. Furthermore, its API
allows it to serve as a source of input for the construction of several tools; in fact, we
ourselves use it as the core/input for building different services on top.
Possibly the most straightforward integration scenario for a CSE is a connection with a Recommendation System in Software Engineering (RSSE) [14]. Once again, we identify an abundance of such systems, most of them following the same path of actions: receiving a query for a component from the developer, searching for useful components using a search engine, performing some type of scoring to rank the components, and possibly integrating a component into the source code of the developer. To achieve progress beyond the state of the art, we have designed and developed a novel RSSE, which first of all receives queries as component interfaces, a form that is usually clear to the developer. Our RSSE downloads components on demand
and ranks them using a syntax-aware mining model, while it also automatically
transforms them to allow seamless integration with the source code of the developer.
Finally, the results can be further assessed using tests supplied by the developer in
order to ensure that the desired functionality is covered.
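As an illustration of what such an interface query may look like, consider the sketch below, written in the spirit of the “Stack” signature used later in the book (Fig. 6.5); the exact query format accepted by the system may differ, so this is an assumption for illustration purposes only.

```java
// Hypothetical interface-style query (illustrative only): the developer describes
// the desired component by its signature; the RSSE then retrieves, ranks, and
// adapts matching implementations found in online repositories.
public interface Stack {
    void push(Object item);   // add an element to the top of the stack
    Object pop();             // remove and return the element at the top of the stack
}
```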
A similar and equally useful contribution is offered by our research in snippet and API mining. In particular, we develop a methodology to improve the focused retrieval of API snippets for common programming queries. We rely on the source code syntax in order to incorporate information that can be successfully used to provide useful, ready-to-use snippets. With respect to current approaches, our progress in this case can be measured quantitatively, as we show that our methodology can offer snippets of API calls for different queries. As another contribution in this aspect, we also provide a methodology for validating the usefulness and the correctness of snippets using information from question-answering services.
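To give a feel for the kind of ready-to-use snippet such a methodology targets, the fragment below shows a typical answer to the recurring “upload a file to FTP” query, written against the Apache Commons Net library; it is an illustrative sketch assembled here rather than an output of the system, and the host, credentials, and file names are placeholders.

```java
import java.io.FileInputStream;
import org.apache.commons.net.ftp.FTP;
import org.apache.commons.net.ftp.FTPClient;

public class FtpUploadSnippet {
    public static void main(String[] args) throws Exception {
        FTPClient ftp = new FTPClient();
        ftp.connect("ftp.example.com");                 // placeholder host
        ftp.login("username", "password");              // placeholder credentials
        ftp.enterLocalPassiveMode();                    // passive mode avoids most firewall issues
        ftp.setFileType(FTP.BINARY_FILE_TYPE);          // transfer the file as binary
        try (FileInputStream in = new FileInputStream("local.txt")) {
            ftp.storeFile("remote.txt", in);            // upload the local file under the remote name
        }
        ftp.logout();
        ftp.disconnect();
    }
}
```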
Initially, we focus on the code reuse RSSEs described in the previous subsec-
tion. As already noted, several of these systems may be effective from a functional
perspective, as they recommend components that are suitable for the query of the
developer. However, they do not assess the reusability of the recommended compo-
nents, i.e., the extent to which a component is reusable. As a result, we design and
develop an RSSE that assesses development solutions not only from a functional,
but also from a non-functional perspective. Apart from recommending useful com-
ponents, our system further employs an adaptable model based on static analysis
metrics for assessing the reusability of each component.
In an effort to better evaluate the quality and the reusability of software components, we also conduct further research in the area of quality and reusability assessment using static analysis metrics. There is an abundance of approaches concerning the construction of quality evaluation models, i.e., models that receive as input the values of source code metrics and output a quality score based on whether the metrics exceed certain thresholds. These approaches, however, usually rely on expert knowledge, either for specifying metric thresholds directly or for providing a set of ground truth quality values required for automatically determining those thresholds. Our contribution in this aspect is a methodology for determining the quality of a software component as perceived by developers, which refrains from expert representations of quality; instead, we employ popularity as expressed by online services and show that a highly reused (or favorited) component is usually of high quality. Given our findings, we design and develop a system that uses crowdsourcing information as ground truth and trains models capable of estimating the degree of reusability of a software project as a whole, as well as of its individual components. Our system focuses on the main quality attributes and properties (e.g., complexity, coupling, etc.) that should be assessed in order to determine whether a software component is reusable.
As already mentioned in Sects. 1.2 and 1.3 of this chapter, the scope of this book
lies in a reusability-oriented perspective of software which tackles three different
software phases. As a result, the text is hopefully a comfortable and compact read
from top to bottom. However, as each software engineering phase on its own may
be of interest to certain readers, extra effort has been made to split the text into self-
contained parts, which therefore allow one to focus on a specific set of contributions,
as these were outlined in Sect. 1.3. Thus, after the first part of this book, one could
skip to any of the three subsequent parts and finish with the conclusions of the last
part. The book consists of five distinct parts, each of which includes a set of chapters. An overview of the book is provided as follows:
Chapter 1. Introduction
The current chapter, which includes all the information required to understand the scope and purpose of this book, as well as a high-level overview of its contributions.
Chapter 2. Theoretical Background and State of the Art
This chapter contains all the background knowledge required to proceed
with the rest of the parts and chapters of the book. All necessary defini-
tions about software engineering, software reuse, software quality, and
any other relevant areas are laid out and discussed.
References
1. Curley R (2011) Architects of the Information Age. Britannica Educational Publishing
2. Sommerville I (2010) Software Engineering, 9th edn. Addison-Wesley, Harlow, England
3. Standish Group (1994) The CHAOS Report (1994). Technical report, Standish Group
4. Standish Group (2015) The CHAOS Report (2015). Technical report, Standish Group
5. Standish Group (2018) The CHAOS Report (2018): Decision Latency Theory: It’s All About the Interval. Technical report, Standish Group
6. Montequín VR, Fernández SC, Fernández FO, Balsera JV (2016) Analysis of the success factors and failure causes in projects: comparison of the Spanish information and communication technology (ICT) sector. Int J Inf Technol Proj Manag 7(1):18–31
7. Power DJ (2007) A Brief History of Decision Support Systems. http://dssresources.com/history/dsshistory.html. Retrieved November 2017
8. Hassan AE, Xie T (2010) Software intelligence: the future of mining software engineering data. In: Proceedings of the FSE/SDP Workshop on Future of Software Engineering Research, FoSER ’10. ACM, New York, NY, USA, pp 161–166
9. McIlroy MD (1968) Mass produced software components. In: Naur P, Randell B (eds) Software Engineering: Report of a Conference Sponsored by the NATO Science Committee. NATO Scientific Affairs Division, Brussels, Belgium, pp 138–155
10. Lombard Hill Group (2017) What is Software Reuse? http://www.lombardhill.com/what_reuse.htm. Retrieved November 2017
11. Frakes WB, Kang K (2005) Software reuse research: status and future. IEEE Trans Softw Eng 31(7):529–536
12. Krueger CW (1992) Software Reuse. ACM Comput Surv 24(2):131–183
13. Fenton N, Bieman J (2014) Software Metrics: A Rigorous and Practical Approach, 3rd edn. CRC Press Inc, Boca Raton, FL, USA
14. Robillard M, Walker R, Zimmermann T (2010) Recommendation systems for software engineering. IEEE Softw 27(4):80–86
15. Pressman R, Maxim B (2019) Software Engineering: A Practitioner’s Approach, 9th edn. McGraw-Hill Inc, New York, NY, USA
16. Pfleeger SL, Kitchenham B (1996) Software quality: the elusive target. IEEE Softw 13(1):12–21
17. ISO/IEC 25010:2011 (2011) https://www.iso.org/standard/35733.html. Retrieved November 2017
Chapter 2
Theoretical Background and State-of-the-Art
Ever since the dawn of software engineering, which is placed in the late 1960s,1 it has been clear that a structured way of creating software is imperative; individual approaches to development were simply unable to scale to proper software systems. As a result, the field has been continuously expanding with the introduction of different methodologies for developing software. Although there are several different methodologies (e.g., waterfall, iterative, agile), all designed to create better software, the main activities of the software development process are common to all of them.
An analytical view of the typical activities/stages of software engineering, which
is consistent with current literature [1, 3, 4], can be found in Fig. 2.1. These activities
are more or less present in all software systems. For the development of any project,
one initially has to identify its requirements, translate them to specifications, and
subsequently design its architecture. After that, the project is implemented and tested,
so that it can be deployed and sustained by the required operations. Even after its
completion, the project has to be maintained in order to continue to be usable. Note
that non-maintainable projects tend to become easily deprecated, for a number of
reasons, such as the lack of security updates that may make the system vulnerable to
certain threats.
Thus, we focus on the different activities defined in Fig. 2.1 (e.g., requirements
elicitation, source code implementation, etc.), and on the connections between sub-
sequent activities (e.g., translating requirements to specifications).
1 According to Sommerville [1], the term was first proposed in 1968 at a conference held by NATO.
Nowadays, we may use the definition by IEEE [2], which describes software engineering as “the
application of a systematic, disciplined, quantifiable approach to the development, operation, and
maintenance of software; that is, the application of engineering to software”.
Arguably, one can minimize the risks of developing software in a number of ways.
In this book, we focus on two such risk mitigation approaches, that of (indexing and)
reusing existing solutions and that of continuously assessing the selected processes,
the software engineering activities, and their outcome. These two ways correspond
roughly to the areas of software reuse and software quality assessment, which are
analyzed in the following sections.
Software reuse2 is a notion first introduced in the late 1960s, and, since then, has
dictated a new philosophy of developing software. In this section, we shall first
focus on the benefits of reuse and then indicate possible reuse sources and areas of
application.
In rather simplistic terms, reuse means using already existing knowledge and
data instead of reinventing the wheel. Obviously, this bears several advantages [6].
The first and possibly the most important benefit of reuse is that of time and effort
reduction. When reusing existing components, instead of building software from
scratch, a development team can produce software faster and the end product will
then have a better time to market. The developers themselves will have more time on
their hands to dedicate on other aspects of the project, which are also influenced by
reuse, such as the documentation. The second important gain from software reuse is
that of ensuring the final product is of high quality. Development can be quite hard,
as creating and thoroughly testing a component may be almost impossible within
certain time frames. Reusing an existing component, however, provides a way of
2 In the context of this book, software reuse can be defined as “the systematic application of existing
software artifacts during the process of building a new software system, or the physical incorporation
of existing software artifacts in a new software system” [5]. Note that the term “artifact” corresponds
to any piece of software engineering-related data and covers not only source code components, but
also software requirements, documentation, etc.
separating one’s own software from the component at hand, as the production and maintenance of the latter is practically outsourced. Even if the component is not third party, it is always preferable to maintain a set of clearly separated components rather than a single large, complex software system. That said, the reuse of improperly constructed components may pose risks to the quality of the end product; however, in this section, we focus on proper reuse methodologies and discuss whether components are suitable for reuse in the next section.
As we dive in deeper to the concept of reuse, a question that arises is what are the
methodologies of reuse, or, in other words, what is considered reuse and what not. An
interesting point of view is provided by Krueger [7] in this aspect, who maintains that
reuse is practiced everywhere, from source code compilers and generators to libraries
and software design templates. The point, however, that Krueger made more than
20 years ago is that we have to consider reuse as a standard software engineering
practice. Today, we indeed have accomplished to develop certain methodologies
toward this direction. As Sommerville interestingly comments [1], informal reuse
actually takes place regardless of the selected software development process.3 The
methodologies that have been developed practically facilitate the incorporation of
software reuse with software engineering as a whole. They can be said to form a
so-called reuse landscape [1], which is shown in Fig. 2.2. This landscape involves
options to consider both prior and during the development of new software. It includes
methodologies such as component-based software engineering, reusable patterns
(design or architectural) for constructing new software, and existing functionality
wrapped in application frameworks or program libraries.
Another more recent viewpoint on reuse is given by Capilla et al. [8]. The authors note that reuse was initially practiced on a domain-specific basis. Domain analysis was used to understand software artifacts and processes, and to produce reusable information for software engineering [9–11]. In this context, reuse has also been
the target of several Model-Driven Engineering (MDE) approaches that transform
existing components to make them compatible with different frameworks or even
languages [12]. However, the major evolution of the reuse landscape can be traced to component reuse. As Capilla et al. [8] note, component-based software engineering has its roots in the development of object-oriented languages and frameworks that allowed for components with increased modularity and interoperability.
Finally, one cannot dismiss the importance of services and open data in forming the current reuse landscape [8, 13]. Deploying reusable components as services
(or, lately, microservices) has allowed for off-the-shelf reuse, while achieving not
only high modularity, but also improved scalability. Open data, on the other hand,
have provided multiple opportunities for building applications with increased inter-
operability in certain domains (e.g., transportation). Albeit not in the narrow scope
of this book, open data are today abundant4 and can provide testbeds for several
interesting research questions about software reuse.
In the context of this book, we plan to facilitate reuse as applied to the different phases of the software development process. We focus on reuse from a data-oriented
perspective and argue that one can at any time exercise reuse based on the data that
are available to harness. Data in software engineering come from various sources
[14]. An indicative view of data that can be classified as software engineering-related
is shown in Fig. 2.3, where the major sources are shown in bold. Note also that the
format for the data of Fig. 2.3 can be quite diverse. For example, requirements may
be given in free text, as structured user stories, as UML diagrams, etc. Also, for
each project, different pieces of data may or may not be present, e.g., there may be
projects without bug tracking systems or without mailing lists. Another important
remark concerns the nature of the data: some types of data, such as the source code or
the requirements, may be available directly from the software project, whereas others,
such as the quality metrics, may be the result of some process. In any case, these
information sources are more or less what developers must be able to comprehend,
4 There are several open data repositories, one of the most prominent being the EU Open Data Portal
utilize, and take advantage of, in order to practice the reuse paradigm and, in a way,
improve on their development process and their product.5
At first glance, all the above sources may seem hard to gather. Moreover, even though a development team can maintain a database of such data for its own projects, enriching it with third-party data, which is admittedly very helpful for software reuse, is hard. However, in modern software engineering, the necessary tools for collecting these data are available, while the data themselves are abundant in online repositories. As a result, the field of software engineering has changed, as developers nowadays use the web for all challenges that may come up during the development process [16]. A recent survey has shown that almost one-fifth of developers’ time is spent trying to find answers to their problems online [17]. Further analyzing the sources where developers look for their answers, Li et al. [18] found that there is a rather large variety of them, including technical blogs, question-answering communities, forums, code hosting sites, API specifications, code example sites, etc.
Several services have been developed in order to accommodate this new way of
creating software, and we are today in a position where we can safely say that “social
coding” is the new trend when it comes to developing software. An indicative set of
services that have been developed to support this new paradigm are analyzed in the
following paragraphs.
5 Note that we focus here mainly on software engineering data that are available online. There are
also other resources worth mining, which form what we may call the landscape of information that
a developer is faced with when joining a project, a term introduced in [15]. In particular, apart from
the source code of the project, the developer has to familiarize himself/herself with the architecture
of the project, the software process followed, the dependent APIs, the development environment,
etc.
Arguably, the most important type of services is that of code hosting services, as they
are the ones that allow developers to host their source code online. As such, they
offer the required raw source code information, which is usually linked to by online
communities. An indicative set of popular source code hosting services is shown in
Table 2.1.
As we may see in this table, the peak for the establishment of code hosting
services can be placed around 2008, when GitHub and Bitbucket joined the already
large SourceForge and the three together nowadays host the projects of more than
30 million developers. If we consider the statistics for the total number of projects
in these services, we can easily claim that there are currently millions of projects
hosted online in some source code hosting facility.
Several online communities have lately been established so that developers can communicate their problems and find specific solutions. Even when developers do not use these services directly, it is common for them to be redirected there from search engines, as their queries, and subsequently their problems, may match those already posted by other developers. The most popular of these services is Stack Overflow,6
6 https://stackoverflow.com/.
Up until now, we have outlined services that allow developers to index their source
code and find online possible solutions to their problems. Another quite important
type of services that is widely used today is that of software project indexes. The
term is rather broad, covering different types of software; however, the common point
among all is that they support the component-based development paradigm by allow-
ing the developer to reuse certain components in an intuitive and semi-automated
7 The term is attributed to Jeff Atwood, one of the two creators of Stack Overflow, the other being Joel Spolsky [19, 20].
8 https://stackexchange.com/sites.
9 https://softwareengineering.stackexchange.com/.
10 https://pm.stackexchange.com/.
11 https://sqa.stackexchange.com/.
12 https://www.quora.com/.
13 https://www.reddit.com/.
14 https://www.reddit.com/r/programming/.
15 https://www.reddit.com/r/softwaredevelopment/.
16 https://www.reddit.com/r/softwarearchitecture/.
17 https://www.codeproject.com/.
manner. Typical services in this category include the Maven Central repository,18 the Node Package Manager (npm) registry,19 and the Python Package Index (PyPI).20
All these services comprise an online index of software libraries as well as a tool for retrieving them at will. Maven is used primarily for Java projects and has indexed, at the time of writing, more than 300,000 unique artifacts.21 The developer can browse or search the index for useful libraries, which can then be easily integrated by declaring them as dependencies in a configuration file. Similar processes are followed by the npm registry. The index hosts more than 1 million packages for the JavaScript programming language, allowing developers to easily download and integrate what they need in their source code. The corresponding repository for the Python programming language, PyPI, contains more than 120,000 Python libraries, which can be installed in one’s distribution using the pip package management system.
Software directories are, however, not limited to libraries. There are, for instance, directories for web services, such as the index of ProgrammableWeb,22 which contains the APIs of more than 22,000 web services. Developers can navigate through this directory to find a web service that can be useful in their case and integrate it using its API.
There is no doubt that the sources of data discussed above comprise huge informa-
tion pools. Most of them contain raw data, i.e., source code, forum posts, software
libraries, etc. A different category of systems is that of systems hosting software
metadata. These include data from bug tracking systems, continuous integration
facilities, etc. Concerning bug tracking systems, one of the most popular ones is the
integrated issues service of GitHub. Apart from that, there are also several online services that facilitate the management of software development. One example of such a service is Bugzilla,23 which functions as an issue/bug tracking system.
The service is used by several open-source projects, including the Mozilla Firefox
browser,24 the Eclipse IDE,25 and even the Linux Kernel,26 and the hosted informa-
tion is available to anyone for posting/resolving issues, or for any other use that may
possibly be relevant to one’s own project (e.g., when developing a plugin for Eclipse
or Firefox).
18 http://central.sonatype.org/.
19 https://www.npmjs.com/.
20 https://pypi.python.org/pypi.
21 https://search.maven.org/stats.
22 https://www.programmableweb.com/.
23 https://www.bugzilla.org/.
24 https://bugzilla.mozilla.org/.
25 https://bugs.eclipse.org/bugs/.
26 https://bugzilla.kernel.org/.
As for other sources of metadata, one can refer to online indexes that have produced
data based on source code, on information from version control systems, etc. Such
services are the Boa Language and Infrastructure27 or the GHTorrent project,28 which
index GitHub projects and further extract information, such as the Abstract Syntax
Tree of the source code. The scope of these services, however, is mostly addressed in other sections of this book, as they are usually research products that operate upon some form of primary, unprocessed data (in this case, for instance, Boa employs the data of GitHub).
One can easily note that the amount of software engineering data that is publicly available online at any time is huge. This new data-oriented software engineering paradigm means that developers can find anything they need, from tools to support the management and/or the requirements specification and design phases, to reusable components, test methodologies, etc. As a result, the main challenges reside in (a) finding an appropriate artifact (either a tool or a component or, generally, any data-driven solution), (b) understanding its semantics, and (c) integrating it into one’s project (source code or even process).
When these steps are followed effectively, the benefits in software development time and effort (and subsequently cost) are actually impressive. The problem, however, is that various difficulties arise when developers attempt to exercise this type of reuse in practice. The first and probably most important difficulty is that of localizing what the developer needs; given the vastness of the information available online, developers can easily be intimidated and/or unable to localize the optimal solution for their case. Furthermore, even when the required components are found and retrieved, there may be several issues with understanding how they function, e.g., certain software library APIs may not be sufficiently documented. This may also make the integration of a solution quite difficult. Finally, this paradigm of search and reuse can easily lead to quality issues, as the problems of third-party components may propagate to one’s own product. Even if a component is properly designed, there is still some risk involved, as poor integration may also lead to quality decline. Thus, as software quality is an important aspect of reuse, its background is analyzed in the following section, while the overall problem described here is discussed again, with all relevant background in mind, in Sect. 2.4.
27 http://boa.cs.iastate.edu/.
28 http://ghtorrent.org/.
Software quality is one of the most important areas of software engineering, yet also one of the hardest to define. As David Garvin effectively states [21], “quality is a
complex and multifaceted concept”. Although quality is not a new concept, the effort
to define it is somewhat notorious29 and its meaning has always been different for
different people [3]. According to the work of Garvin, which was later adopted by
several researchers [23, 24], the quality of a product is relevant to its performance, to
whether its features are desirable, to whether it is reliable, to its conformance to certain
standards, to its durability and serviceability, and to its aesthetics and perception by the customer. Another early definition of the factors that influence software quality is
that of McCall et al. [25–27]. The authors focus on aspects relevant to (a) the operation
of the product, measured by the characteristics of correctness, reliability, usability,
integrity, and efficiency; (b) its ability to change, measured by the characteristics
of maintainability, flexibility, and testability; and (c) its adaptiveness, measured by
the characteristics of portability, reusability, and interoperability. Though useful as
definitions, these characteristics can be hard to measure [3].
Nowadays, the most widely accepted categorization of software quality is that of the latest ISO quality standard, which was developed to identify the key characteristics of quality; these can be further split into sub-characteristics that are to some extent
measurable. The ISO/IEC 9126 [28], which was issued in 1991, included the quality
characteristics of functionality, reliability, usability, efficiency, maintainability, and
portability. An interesting note for this standard is that these characteristics were
further split into sub-characteristics, which were in turn divided into attributes that, in
theory at least, are actually measurable/verifiable. Of course, the hard part, for which no commonly accepted solution has been provided to date, is defining the metrics used to measure each quality attribute, which are also in most cases product- and technology-specific. The most significant update of the standard is ISO/IEC 25010 [29], which was first introduced in 2011 and comprises eight product quality characteristics, depicted in Fig. 2.4 along with their sub-characteristics.
Each of the proposed ISO characteristics refers to different aspects of the software
product. These aspects are the following:
• Functional suitability: the extent to which the product meets the posed require-
ments;
• Performance efficiency: the performance of the product with respect to the avail-
able resources;
• Portability: the capability of a product to be transferred to different hardware or
software environments;
29 Indicatively, we may refer to an excerpt of the rather abstract “definition” provided by Robert Pirsig [22]: “Quality... you know what it is, yet you don’t know what it is”.
• Compatibility: the degree to which the product can exchange information with
other products;
• Usability: the degree to which the product can be used effectively and satisfactorily
by the users;
• Reliability: the efficiency with which the product performs its functions under
specified conditions and time;
• Security: the degree to which the data of the product are protected and accessed
only when required by authorized parties; and
• Maintainability: the effectiveness and efficiency with which the product can be
modified, either for corrections or for adaptions to new environments or require-
ments.
As already noted, there are several difficulties when trying to measure the qual-
ity of a software product. Hence, there is an ongoing effort in designing metrics
that would effectively correspond to quality attributes, which, as already mentioned,
can be mapped to the sub-characteristics and the characteristics of Fig. 2.4. Several
metrics have been defined in the past [30, 31], which can be classified according to
the scope of the quality management technique, or in other words according to the
type of data that is measured. According to the SWEBOK [32],30 there are four categories of software quality management techniques: static analysis, dynamic analysis, testing, and software process measurement techniques. Each of these categories has its own purpose, as defined by current research and practice, while different metrics have been proposed over time as software methodologies change. We analyze these quality management techniques and certain indicative metrics in the following subsections.
30 The SWEBOK, short for Software Engineering Body of Knowledge, is an international ISO standard that has been created by the cooperative efforts of software engineering professionals and is published by the IEEE Computer Society. The standard aspires to summarize the basic knowledge of software engineering and includes reference lists for its different concepts.
The first category of metrics that we are going to analyze refers to metrics derived
by performing some type of analysis on the source code of an application, without
executing it. The introduction of this category of metrics can be traced back to the
1970s, with size and complexity being the two most important properties measured
at the time. Widely applied static analysis metrics were the Lines of Code (LoC), the
McCabe cyclomatic complexity [33], and the Halstead complexity measures [34].
Though rather simplistic, the LoC metric was quite useful at the time of its introduction, as in certain commonly used languages, like FORTRAN, each non-empty line represented a single specific command. Today, the metric is still useful, however mostly for getting an order of magnitude of the size of source code components. The cyclomatic complexity was defined in 1976 as an effort to measure how complex a software system is; the metric is computed as the number of linearly independent paths through the source code of the software. The set of metrics developed in 1977 by Maurice Halstead [34] also had the purpose of measuring complexity; it was, however, based on the counts of source code operators and operands.
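For reference, the standard formulations of these two metrics, as given in the original papers [33, 34] (they are not spelled out in this excerpt), can be written as follows:

```latex
\begin{align*}
  V(G) &= E - N + 2P
    && \text{(McCabe: $E$ edges, $N$ nodes, $P$ connected components of the control flow graph, typically $P=1$)} \\
  \mathit{Volume} &= (N_1 + N_2)\,\log_2(n_1 + n_2)
    && \text{(Halstead: $n_1, n_2$ distinct operators/operands; $N_1, N_2$ their total occurrences)}
\end{align*}
```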
Although the aforementioned metrics are more than 40 years old, they are still
used widely (and in most cases correctly31 ) as indicators of program size and com-
plexity. However, the introduction of the object-oriented programming paradigm and
especially its establishment in the 1990s has raised new measurement needs, as the
source code quality of an object-oriented program is further relevant to the proper-
ties of coupling and inheritance. Hence, several new metrics have been proposed for
covering these properties. Indicatively, we may refer to the C&K metrics [35] that
are named after their creators, Chidamber and Kemerer, and are shown in Table 2.2.
WMC is usually computed as the sum of the complexities of all methods included
in the class under analysis. DIT and NOC refer to the number of superclasses and
subclasses of the class, respectively. RFC and CBO are used to model relationships
between classes; RFC is typically defined as the number of methods in the class
and all external methods that are called by them, while CBO refers to the number
of classes that the class under analysis depends on, i.e., their methods/variables that
are accessed by the class. The inverse metric also exists and is usually referred to
as Coupling Between Objects Inverse (CBOI). Finally, LCOM measures the lack of cohesion among the methods of a class and is related to the number of method pairs that share no variables, having subtracted the number of method pairs that do share variables.32
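To make these definitions more tangible, the small Java class below is annotated with the C&K values one would typically compute for it. This is an illustrative sketch rather than an example taken from the book, and the exact values depend on the counting conventions of the measurement tool used.

```java
import java.util.ArrayList;
import java.util.List;

public class Stack {                         // DIT = 1 (only Object above it), NOC = 0 (no subclasses here)

    private List<Object> items = new ArrayList<>();
    private int maxSize = 100;

    public void push(Object item) {          // uses 'items' and 'maxSize'
        if (items.size() < maxSize) {
            items.add(item);
        }
    }

    public Object pop() {                    // uses only 'items'
        return items.isEmpty() ? null : items.remove(items.size() - 1);
    }

    public boolean isFull() {                // uses 'items' and 'maxSize'
        return items.size() >= maxSize;
    }

    // WMC  = 3, assuming unit complexity for each of push, pop, and isFull
    // RFC  = 3 local methods + the external calls size(), add(), isEmpty(), remove() = 7
    // CBO  = 2, as the class depends on the List interface and the ArrayList implementation
    // LCOM = pairs sharing no field (0) minus pairs sharing at least one field (3), floored at zero = 0
}
```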
Another quite popular set of object-oriented metrics proposed in 1994 [37] is
the MOOD metrics suite, short for “Metrics for Object-Oriented Design”. The
authors constructed metrics for modeling encapsulation, i.e., the Method Hiding
Factor (MHF) and the Attribute Hiding Factor (AHF), for modeling inheritance, i.e.,
Method Inheritance Factor (MIF) and Attribute Inheritance Factor (AIF), as well as
31 A rather important note is that any metric should be used when it fits the context and when
its value indeed is meaningful. A popular quote, often attributed to Microsoft co-founder and
former chairman Bill Gates, denotes that “Measuring programming progress by lines of code is like
measuring aircraft building progress by weight”.
32 This original definition of LCOM has been considered rather hard to compute, as its maximum
value depends on the number of method pairs. An alternative computation formula was later proposed
by Henderson-Sellers [36], usually referred to as LCOM3 (with the original metric then referred to as
LCOM1), which defines LCOM as (M − A/V)/(M − 1), where M is the number of methods, V is the
number of class variables, and A is the total number of method–variable accesses (i.e., the sum over
all variables of the number of methods accessing each).
for modeling polymorphism and coupling, i.e., the Polymorphism Factor (POF) and
the Coupling Factor (COF), respectively. The metric proposals, however, do not end
here; there are several other popular efforts including the suite proposed by Lorenz
and Kidd [38]. Nevertheless, the rationale behind all of these metrics is similar: they
all measure the main quality properties of object-oriented design using the source
code of a project as input.33
33 The interested reader is further referred to [39] for a comprehensive review of object-oriented
static analysis metrics.
34 There are several other metrics in this field; for an outline of popular metrics, the reader is referred
to [40].
35 On the other hand, static analysis is usually focused on the characteristics of maintainability and
usability.
the frequency of failures,36 and the Availability (AVAIL) that refers to the probability
that the system is available for use at a given time.
As with the two previous categories of metrics, software process metrics
cover a wide range of possible applications. This category measures the dynamics
of the process followed by a software development team. Some of the first notions
in this field are attributed to the IBM researcher Allan Albrecht. In 1979, Albrecht
attempted to measure application development productivity by using a metric that
he invented, known as the Function Point (FP) [44]. As opposed to the size-oriented
metrics discussed earlier in this section, the FP, which is determined by an empirical
relationship based on the domain and the complexity of the software, gave rise to
what we call function-oriented metrics.37 Once the FP count of a project has been determined,
it can be used as a normalization value to assess the performance of the software development
process. Out of the several metrics that have been proposed, we indicatively refer to
the number of errors (or defects) per FP, the cost per FP, the amount of documentation
per FP, the number of FPs per person-month of work, etc.
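As a minimal illustration with made-up numbers (the FP count itself would come from the empirical estimation mentioned above), the following sketch shows how an FP value, once determined, normalizes common process measurements:

public class FunctionPointMetrics {
    public static void main(String[] args) {
        // Hypothetical project figures; in practice the FP count is estimated
        // empirically from inputs, outputs, inquiries, files, and interfaces.
        double functionPoints = 320.0;
        double defects = 48.0;
        double costInEuros = 96_000.0;
        double documentationPages = 640.0;
        double personMonths = 16.0;

        System.out.printf("Defects per FP:       %.3f%n", defects / functionPoints);
        System.out.printf("Cost per FP:          %.2f%n", costInEuros / functionPoints);
        System.out.printf("Documentation per FP: %.2f pages%n", documentationPages / functionPoints);
        System.out.printf("FPs per person-month: %.2f%n", functionPoints / personMonths);
    }
}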
Lately, the increased popularity of version control systems has also made avail-
able another set of metrics, the so-called “change metrics”, which are used for
multiple purposes, such as predicting defects or indicating bad decisions in the
development process. An indicative set of change metrics was proposed by Moser
et al. [46], who defined the number of revisions of a
file, the code churn (number of added lines minus number of deleted lines) computed
per revision or in total, the change set (number of files committed together to the
repository) computed per revision or in total, etc. These metrics can be computed
using information from the commits and the issues (bugs) present in version control
systems, while their main purpose is to predict defects in software projects. The ratio-
nale for the aforementioned metrics is quite intuitive; for instance, files (components)
with many changes from many people are expected to be more prone to bugs. The
change set metrics are also quite interesting, as they indicate how many files are
affected in each revision. Building on the intuition that changes affecting multiple
files can more easily lead to bugs than single-file commits, Hassan further proposed
the entropy of code changes [47], computed for a time interval as H = −∑_{k=1}^{n} p_k · log2(p_k),
where p_k is the probability that the kth file has changed within that interval.
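The following self-contained sketch (with hypothetical change counts, as they would be extracted from a version control system) computes two of the change metrics discussed above: the code churn of a revision and the entropy of code changes over an interval.

import java.util.Map;

public class ChangeMetrics {

    // Code churn for one revision: lines added minus lines deleted.
    public static int codeChurn(int addedLines, int deletedLines) {
        return addedLines - deletedLines;
    }

    // Entropy of code changes over a time interval, following the formula above:
    // H = -sum(p_k * log2(p_k)), where p_k is the share of changes touching file k.
    public static double changeEntropy(Map<String, Integer> changesPerFile) {
        double total = changesPerFile.values().stream().mapToInt(Integer::intValue).sum();
        double entropy = 0.0;
        for (int changes : changesPerFile.values()) {
            if (changes > 0) {
                double p = changes / total;
                entropy -= p * (Math.log(p) / Math.log(2));
            }
        }
        return entropy;
    }

    public static void main(String[] args) {
        // Hypothetical change counts extracted from a version control system.
        Map<String, Integer> changes = Map.of("A.java", 6, "B.java", 2, "C.java", 2);
        System.out.println("Churn of a 30+/12- revision: " + codeChurn(30, 12));
        System.out.printf("Entropy of changes: %.3f%n", changeEntropy(changes));
    }
}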
As a final note, the metrics analyzed in the previous paragraphs can be applied
for several purposes, such as assessing the reusability of components, detecting bugs in
source code, and determining which developer is best suited for each part of the
source code. It is important, however, to understand that the metrics themselves are
37 Function-oriented and size-oriented metrics were originally considered to be highly correlated; however, the rise of different programming languages and models
quickly proved that their differences are quite substantial [45].
not adequate when trying to assess software quality or possibly to detect defects. That
is because, on its own, each metric does not describe a component from a holistic
perspective. For instance, a class with a large size does not necessarily contain
complex code that is hard to maintain. Equivalently, a rather complex class with
several connections to other classes may be totally acceptable if it is rather small in
size and well designed. As a result, the challenge with these metrics usually lies in
defining a model [48], i.e., a set of thresholds that, when applied together, properly
estimate software quality, detect defects, etc.
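A minimal sketch of such a threshold-based model is shown below; the threshold values are purely illustrative and would in practice be derived from benchmark data or expert knowledge, while the rule that at least two metrics must be violated together reflects the point made above, namely that a single metric is not decisive on its own.

import java.util.Map;

public class ThresholdQualityModel {

    // Illustrative thresholds only; real models derive them from benchmarks or experts.
    private static final Map<String, Double> THRESHOLDS = Map.of(
            "WMC", 20.0,    // weighted methods per class
            "CBO", 10.0,    // coupling between objects
            "LCOM", 0.8,    // lack of cohesion (Henderson-Sellers variant)
            "LOC", 750.0    // lines of code per class
    );

    // Flag a class only if several metrics exceed their thresholds together.
    public static boolean isPotentiallyProblematic(Map<String, Double> metrics) {
        long violations = THRESHOLDS.entrySet().stream()
                .filter(e -> metrics.getOrDefault(e.getKey(), 0.0) > e.getValue())
                .count();
        return violations >= 2;
    }

    public static void main(String[] args) {
        Map<String, Double> largeButSimpleClass =
                Map.of("WMC", 8.0, "CBO", 3.0, "LCOM", 0.2, "LOC", 900.0);
        Map<String, Double> complexCoupledClass =
                Map.of("WMC", 35.0, "CBO", 14.0, "LCOM", 0.9, "LOC", 400.0);
        System.out.println(isPotentiallyProblematic(largeButSimpleClass));  // false: only LOC exceeds
        System.out.println(isPotentiallyProblematic(complexCoupledClass));  // true: WMC, CBO, LCOM exceed
    }
}

Note how the large but otherwise simple class is not flagged, whereas the complex and highly coupled one is, which is exactly the holistic behavior that a single-metric check cannot provide.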
According to Sommerville [1], the testing process has two goals: (a) to demonstrate
that the software meets the posed requirements and (b) to discover errors in the
software. These goals are also well correlated with the view that testing belongs
in a wider spectrum, namely, that of verification and validation [3]. If we use the
definitions of Boehm [49], verification ensures that we are “building the product
right”, while validation ensures that we are “building the right product”. We shall
prefer this interpretation of testing, which is widely accepted,38 as it allows us to
relate it not only to the final product, but also to the requirements of the software.
When testing a software product, one has to make several decisions, two of the most
important being what to test and in how much detail each individual test case, and
subsequently the enclosing test suite, should go. Given that testing may require
significant time and effort [3], the challenge of optimizing one’s resources toward
designing a proper testing strategy is crucial. An effective strategy (that of course
depends on each particular case) may comprise test cases in four different levels:
unit, integration, system, and acceptance testing [52]. Unit testing is concerned with
individual components, i.e., classes or methods, and is focused on assessing their
offered functionality. Integration testing is used to assess whether the components
are also well assembled together and is focused mainly on component interfaces.
System testing assesses the integration of system components at the highest level
and is focused on defects that arise at this level. Finally, acceptance testing evaluates
the software from the point of view of the customer, indicating whether the desired
functionality is correctly implemented. Thus, depending on the components of the
product at hand, one can select to design different test cases at different granularity.
An even more radical approach is to design the test cases even before constructing
the software itself. This approach to writing software is known as Test-Driven Development (TDD),
38 There are of course other more focused interpretations, such as the one provided in [50], which
defines testing as “the process of executing a program with the intent of finding errors”. As, however,
times change, testing has grown to be a major process that can add actual value to the software, and
thus it is not limited to error discovery. An interesting view on this matter is provided by Beizer
[51], who claims that different development teams have different levels of maturity concerning the
purpose of testing, with some defining it as equivalent to debugging, others viewing it as a
way to show that the software works, others using it to reduce product quality risks, etc.
and although it was introduced in relation to extreme programming
and agile concepts [53, 54], it is also fully applicable to traditional software process
models. TDD is generally considered one of the most beneficial software
practices, as it has multiple advantages [55]. First, writing the test cases before writing
the code implicitly forces the developer to clarify the requirements and determine
the desired functionality for each piece of code. Along the same lines, connecting
test cases with requirements can further help in the early discovery of problems in
requirements, thus avoiding the propagation of these problems to later stages. Finally,
TDD can improve the overall quality of source code as defects are generally detected
earlier.
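As a brief illustration of the practice (assuming JUnit 5 and a hypothetical UserAccount class), the test below would be written first, and the class would then be implemented until the test passes.

import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.Test;

// Written before UserAccount exists: the test pins down the expected behavior.
class UserAccountTest {

    @Test
    void loginSucceedsOnlyWithTheCorrectPassword() {
        UserAccount account = new UserAccount("tdiam", "s3cret");
        assertTrue(account.login("s3cret"));
        assertFalse(account.login("wrong"));
        assertTrue(account.isLoggedIn());
    }
}

// A minimal implementation that makes the test pass.
class UserAccount {
    private final String username;
    private final String password;
    private boolean loggedIn;

    UserAccount(String username, String password) {
        this.username = username;
        this.password = password;
    }

    String getUsername() {
        return username;
    }

    boolean login(String attempt) {
        boolean success = password.equals(attempt);
        if (success) {
            loggedIn = true;
        }
        return success;
    }

    boolean isLoggedIn() {
        return loggedIn;
    }
}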
This introduction, though somewhat high level, covers the main elements, or rather
building blocks, of software development and serves two purposes: (a) to explain the
methodologies followed to build reusable high-quality software products and (b) to
identify the data that can be harnessed to further support the software development
life cycle. In this section, the aim is to restate the challenge posed by this book and,
given the background, discuss how it can be addressed while also making significant
progress beyond the current state of the art.
Up until now, we have explained which areas of software development can
be improved, and we have also indicated the potential derived from the
vastness of available software engineering data. The challenge that we now wish to
address is that of analyzing these data to practically improve the different phases of
the software development life cycle. The relevant research area that was established to
address this challenge is widely known as “Data Mining for Software Engineering”.
We have already discussed software engineering, so in the following subsection we
will focus on data mining, while in Sect. 2.4.2, we will explain how these fields are
combined so that data mining methods can assist the software development process.
Finally, in Sect. 2.4.3, we outline the potential of employing data mining to support
reuse and indicate our contributions in this aspect.
The concept of analyzing data is not new; people have always tried to find some
logic in the available data and make decisions accordingly, regardless of the relevant
practical or research field. Lately, however, with the rise of the information age
during the second half of the previous century, we have put ourselves in a challenging
position where the gap between the generation of new data and our understanding of it
is constantly growing [56]. As a result, the effective analysis of data has become more
important than ever. In principle, understanding and effectively using data to extract
knowledge is the result of a process called data analysis. However, data analysis
is mainly associated with scientific fields such as statistics, which can of course
prove valuable under certain scenarios, yet whose applicability is usually limited.
Specifically, as noted by Tan et al. [57], traditional data analysis techniques cannot easily
handle heterogeneous and complex data, they do not scale to large amounts of data
(or even data with high dimensionality), and, most importantly, they are confined to
the traditional cycle of constructing and (dis)proving hypotheses manually, which is
not adequate for analyzing real data and discovering knowledge.
Thus, what differentiates data mining from traditional data analysis techniques is
its usage of algorithms from other fields to support knowledge extraction from data.
The interdisciplinary nature of the field inclines us to follow the definition proposed
by Tan et al. [57]:
Data mining is a technology that blends traditional data analysis methods with sophisticated
algorithms for processing large volumes of data.
39 An alternative definition provided by Hand et al. [58] involves the notions of “finding unsuspected
relationships” and “summarizing data in novel ways”. These phrases indicate the real challenges
that differentiate data mining from traditional data analysis.
These categories of data mining tasks are also shown in Fig. 2.5 along with several
indicative types of techniques for each category. The techniques shown in the left part
of Fig. 2.5 are only a fraction of those that can be useful for data mining. Although
we do not aspire (and do not find it relevant) to analyze all of these techniques here,
we will at least provide an abstract description for certain interesting methods, and
point to further reading where possible. First, concerning descriptive modeling,
the main point is to try to extract knowledge from the data. As such, an initial thought
would be to visualize the data, which is quite useful, yet not always as trivial a task as
it may seem. One has to take into account the dimensionality of the data and produce
visualizations that depict useful information, while discarding redundant or
meaningless information. Thus, another quite important challenge in
descriptive modeling is to find out what is worth showing to the analyst. This is
accomplished by pattern discovery, which is considered one of the most crucial
aspects of data mining [56]. In this context, we may also place unsupervised machine
learning methods, such as clustering [65], which is a technique that can organize the
data into groups and thus again provide intuition as to (possibly hidden) relations
between them. Finally, several modeling approaches can also be used for descriptive
knowledge extraction, including methods for fitting distributions to data, mixture
models, etc. [60].
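For instance, a self-contained sketch of k-means, one of the simplest clustering techniques, is given below; it groups software classes described by two metric values (here LoC and WMC, purely as an example) into k clusters, revealing, e.g., a group of small, simple classes and a group of large, complex ones.

import java.util.Arrays;
import java.util.Random;

public class MetricsClustering {

    // Assigns each point (a vector of metric values) to one of k clusters.
    public static int[] kMeans(double[][] points, int k, int iterations, long seed) {
        Random rnd = new Random(seed);
        double[][] centroids = new double[k][];
        for (int i = 0; i < k; i++) {
            centroids[i] = points[rnd.nextInt(points.length)].clone();
        }
        int[] assignment = new int[points.length];
        for (int it = 0; it < iterations; it++) {
            // Assignment step: attach each point to its nearest centroid.
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (dist(points[p], centroids[c]) < dist(points[p], centroids[best])) {
                        best = c;
                    }
                }
                assignment[p] = best;
            }
            // Update step: move each centroid to the mean of its members.
            for (int c = 0; c < k; c++) {
                double[] sum = new double[points[0].length];
                int count = 0;
                for (int p = 0; p < points.length; p++) {
                    if (assignment[p] == c) {
                        for (int d = 0; d < sum.length; d++) sum[d] += points[p][d];
                        count++;
                    }
                }
                if (count > 0) {
                    for (int d = 0; d < sum.length; d++) centroids[c][d] = sum[d] / count;
                }
            }
        }
        return assignment;
    }

    // Squared Euclidean distance; sufficient for nearest-centroid comparisons.
    private static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }

    public static void main(String[] args) {
        // Each row is a class described by (LoC, WMC); values are illustrative.
        double[][] metrics = { {120, 4}, {150, 5}, {900, 40}, {980, 35}, {130, 6}, {940, 42} };
        System.out.println(Arrays.toString(kMeans(metrics, 2, 10, 42)));
    }
}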
Distribution fitting is more often used as part of a predictive modeling task and is
relevant to the supervised regression algorithms. Similar to regression, which can be
defined as the process of fitting a model to certain data, one can also define forecast-
ing, with the latter implying that the data are sequential [60]. Another
quite large category of techniques that are relevant to this task are the classifica-
tion techniques. In both regression (and forecasting) and classification, the initial
goal is to construct a model based on existing data, so that certain decisions can be
made for new data points. The output of regression, however, lies in a continuous
range of values, whereas in a classification problem the new data points need to
be assigned to one of a predefined set of classes [66]. Finally, prescriptive
modeling includes techniques similar to the ones already discussed, which, however,
further involve the element of decision-making. As such, the machine learning algorithms
selected in this case may have as output sets of rules or heuristics. Moreover, they are
mainly focused on model optimization. Thus, prescriptive modeling usually involves
comprehensive classification/regression techniques (e.g., decision trees), rule-based
decision support systems (e.g., Learning Classifier Systems) [67], computational
intelligence algorithms (e.g., fuzzy inference models) [62], etc.
Upon having a basic understanding of the fields of software engineering and data
mining, the next step is to combine these two fields to produce added value with
respect to the software engineering process. This subsection provides a definition
of this broad research area, indicating its different axes and current limitations, along
with some discussion on the points that have been considered for improvement
in this book. This area involves the application of data mining techniques on soft-
ware engineering data, with the purpose of extracting knowledge that can improve
the software engineering process. This definition practically means that there are
countless applications in this area. According to Xie et al. [68], we can distinguish
research work along three axes: the goal, the input data used, and the employed
mining technique. This decomposition is better visualized by Monperrus [69] as in
Fig. 2.6.
As one may notice, Fig. 2.6 involves almost everything that was discussed so far
in this chapter. It involves the different sources and types of software engineering
data, the employed algorithms that lie in the interdisciplinary field of data mining,
and of course a set of tasks that are all part of the software development process.
To further analyze the field, we have chosen to follow a categorization similar to
that in [70], which distinguishes the work on this research area according to the
software engineering activity that is addressed each time.40 From a rather high-level
view, we may focus on the activities of requirements elicitation and specification
extraction, software development, and testing and quality assurance. The current
state of research in these categories is analyzed in the following paragraphs in order
to better define the research field as a whole and indicate the focus of this book.
We view each research area separately and illustrate what one can (or cannot) do
today by using the appropriate data along with the methods already proposed, as
well as the methods that we propose in the following chapters.
Note, however, that the following paragraphs do not serve as exhaustive reviews of
40 Another possible distinction is that of the type of the input data used, which, according to Xie
et al. [68], fall into three large categories: sequences (e.g., execution traces), graphs (e.g., call
graphs), and text (e.g., documentation).
Fig. 2.6 Data mining for software engineering: goals, input data used, and employed mining
techniques
the state of the art in each of the discussed areas. Instead, they provide what we may
call the current research practice; for each area, we explain its purpose and describe
a representative fraction of its current research directions, while we focus on how
the underlying challenges should be addressed. For an extensive review of the state
of the art in each of these areas, the reader is referred to each corresponding chapter
of this book.
Requirements elicitation can be seen as the first step of the software development
process; knowing what to build is actually necessary before starting to build it. The
research work in this area is focused on optimizing the process of identifying require-
ments and subsequently on extracting specifications. As such, it usually involves the
process of creating a model for storing/indexing requirements as well as a method-
ology for mining them for various purposes. Given a sufficient set of requirements
and an effective model that would capture their structure and semantics (through
careful annotation), one could employ mining techniques to perform validation [71,
72], recommend new requirements [11, 73, 74], detect dependencies among require-
ments [75], etc. If we further add data from other sources, e.g., stakeholder prefer-
ence, software components, documentation, etc., then additional interesting tasks
The development of a product is arguably the phase that adds the most value;
therefore, it is important that it is performed correctly. The mining methodologies in
this phase are numerous, span quite diverse directions, and have several appli-
cations.41 The input data are mainly in the form of source code and possibly project-
related information (e.g., documentation, readme files), while the inner models used
for mining involve various formats, such as Abstract Syntax Trees (ASTs), Control
Flow Graphs (CFGs), and Program Dependency Graphs (PDGs). As we move toward
a reuse-oriented component-based software development paradigm, one of the most
popular challenges in this research area is that of finding reusable solutions online.
Thus, there are several research works on designing Code Search Engines (CSEs)
[78–80] and their more specialized counterparts, Recommendation Systems in Soft-
ware Engineering (RSSEs) [81–83], that can be used for locating and retrieving
reusable components. These approaches perform some type of syntax-aware min-
ing on source code and rank the retrieved components according to certain criteria,
so that the developer can select the one that is most useful in his/her case. Finally,
there are also systems designed with the test-driven development paradigm in mind,
which further test the retrieved components to confirm that they meet the provided
requirements [84–86] and which are usually referred to as Test-Driven Reuse (TDR)
systems [87].
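To illustrate the idea behind test-driven reuse, the sketch below (assuming JUnit 5; all class names are hypothetical) shows how a test case can act as the query itself: it fixes the desired signature and behavior, and a TDR system would run it against each candidate retrieved by the search engine, keeping only the candidates that pass. The CandidateStack class stands in for one such retrieved and adapted candidate.

import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

// The test is the query: it fixes the signature (push/pop/size) and the expected
// LIFO behavior of the component we want to retrieve and reuse.
class DesiredStackTest {

    @Test
    void behavesLikeALifoStack() {
        CandidateStack stack = new CandidateStack();
        stack.push("a");
        stack.push("b");
        assertEquals(2, stack.size());
        assertEquals("b", stack.pop());
        assertEquals("a", stack.pop());
    }
}

// Stand-in for a retrieved candidate; in a real TDR system this binding would be
// generated by adapting the candidate's interface to the one used in the test.
class CandidateStack {
    private final java.util.Deque<String> items = new java.util.ArrayDeque<>();

    void push(String item) { items.push(item); }
    String pop() { return items.pop(); }
    int size() { return items.size(); }
}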
Although these systems may be effective when examined through the prism of
generic software reuse, they also have several limitations. Most CSEs, for example,
do not offer advanced syntax-aware queries and are not always integrated with code
hosting services, and thus their potential to retrieve useful components is limited by
the size of their index. As for RSSEs, their pitfalls mainly lie in the lack of
semantics when searching for components, the lack (or inadequacy) of code trans-
formation capabilities, and their purely functional, one-sided assessment of results [88].
41 We analyze here certain popular directions; however, we obviously do not exhaust all of them.
An indicative list of applications of source code analysis for the interested reader can be found at
[77].
In this book, we design and develop a CSE and two RSSEs, in an effort to overcome
these issues. Our systems employ syntax-aware mining and code transformation
capabilities and further provide information about the reusability of components, so
that the developer can select the optimal one for his/her case and easily integrate it
into his/her source code.
Research in this field can also be divided with respect to different levels of gran-
ularity. Apart from higher level components, there is a growing interest in finding
snippets that may be used as solutions to common programming questions, or snip-
pets that illustrate the usage of an API. In this area, there are again several valuable
contributions, which can be distinguished by the task at hand as well as by the source
of the data used. There are approaches focused on formulating useful examples for
library APIs [89, 90], others discovering useful snippets from question-answering
services [91], and others retrieving snippets for common programming queries [92,
93]. The main challenge of all of these approaches is whether they can indeed be
effective and provide added value to the developer. As such, in this book, we focus
on certain metrics that can quantitatively assess whether the retrieved solutions are
correct (e.g., measuring relevance in snippet mining), effective (e.g., how one can
easily search in question-answering systems to improve on existing source code), and
practically useful (e.g., indicating the best way to present results in a snippet
recommendation system).
The third and final research area analyzed as part of the broad software mining field is
that of quality assurance. Given that quality is a concern relevant to all phases
of a software project, the research work also spans all these phases, as well as
software evolution [94]. We are going to discuss current approaches with
respect to the static analysis described in Sect. 2.3.2. Although static analysis can be
used for different purposes, such as assessing the quality of software components and
localizing bugs, in our case we focus on the former.
To better highlight the potential of this challenge, we initially construct a component
reuse system that employs a quality model to assess the recommended components
not only from a functional but also from a quality perspective.
In this context, the challenge is to construct a model that determines a set
of thresholds for the given static analysis metrics. The model can then be used
for assessing the quality of components [95–97] or for finding potentially defect-
prone locations [98, 99], by checking whether their metrics’ values exceed those
thresholds. In this context, several researchers have involved expert help in order to
define acceptable thresholds [100, 101]. However, expert help is usually subjective,
context-specific, and may not always be available, thus preventing a quality
assessment system from adapting [102]. To this end, there have been multiple efforts toward
automatically defining metrics’ thresholds from some ground truth quality values
[95–97]. As, however, it is hard to obtain an objective measure of software quality,
these approaches also typically resort to expert help. To avoid relying on expert aid,
in this book, we use as ground truth the popularity and the reusability of software
components, as this is measured by online services, which thus provide an indicator
of quality as perceived by developers. Given this ground truth, we construct a quality
assessment system that provides a reusability score for software components.
The potential of employing data mining techniques to support reuse with respect
to the contributions of this book is illustrated in Fig. 2.7. At the top of this figure,
we place all information that is available to the developers, which may come from
multiple online and offline sources. The data from these sources have been outlined
in Sect. 2.2.2 and may include source code, documentation, software requirements,
data from version control or quality assurance systems, etc. Though useful, the infor-
mation from these sources is primary and may not be directly suited for employing
mining techniques with the purpose of software reuse. Hence, we may define the
so-called knowledge extraction layer, which is placed at the middle of Fig. 2.7. This
layer involves intermediate systems that receive the raw primary information and
process it to provide data that are suitable for mining.
In the context of software requirements, the challenge is usually a modeling prob-
lem. Given a pool of software requirements that may be expressed in different formats
(e.g., user stories, UML diagrams, etc.), the first step is to effectively extract their
elements (e.g., extract the main actors or actions), and store them in a model that
Fig. 2.7 Overview of data mining for software reuse and book contributions
will be suitable for reuse. This is explored in Chap. 3 of this book. Having such a
knowledge extraction scheme and model, one can then employ data mining tech-
niques in order to facilitate reuse among different requirements artifacts. We provide
two such examples of reuse-oriented mining, one for functional requirements and
one for UML models (Chap. 4).
The next area of interest is that of source code reuse. This area can actually be
seen as the product of the former one; software requirements practically define the
functionality of the system and source code development is used to meet it. This
link is also bidirectional as good software should of course comply with the defined
requirements, while the requirements should also be up to date and thus conform
to the developed software. For an overall exploration of this link, we may direct
the interested reader to the broad field of automated software engineering, which
explores the connection between software requirements/specifications and source
code/application development, while also attempting to automate it as a process. In
the context of this book, we discuss this connection in Chap. 3, where we refer to a
web service scenario and indicate how requirements and source code can form the
discussed bidirectional and traceable link.
The area of source code mining is quite broad, involving the proposal of several
tools for knowledge extraction from source code, documentation, etc. The primary
information in this case resides in code repositories, version control systems, etc.,
while the relevant tools may be CSEs or other indexing facilities, analysis tools for
extracting syntax-aware information from source code (e.g., API call sequences), etc.
Our contribution in this aspect is discussed in Chap. 5, where we design and develop
a CSE that extracts syntactic information from source code and facilitates reuse at
different levels (component level, snippet level, and project/multi-component level).
Given our proposed tool as well as similar tools that are available online (e.g., other
CSEs, QA engines, etc.), we are able to produce valuable contributions in different
aspects of source code reuse. As such, we construct three systems, one for test-driven
reuse at component level (Chap. 6), one for recommending API usage examples for
different scenarios (Chap. 7), and one for further improving on existing snippets using
information from question-answering services (Chap. 8).
Up until this point, we have mainly discussed the functional aspects of software
development. To ensure, however, that reusing a component does not threaten the
overall quality of the project under development, one must have certain indications
that the component is of sufficient quality. We illustrate this need by means of a
recommendation system for software components, which are assessed both from a
functional and from a quality perspective (Chap. 9). The knowledge extraction layer
in this case involves tools for extracting metrics from source code, documentation,
version control systems, etc. Our focus is on static analysis metrics, for which we
propose a reusability assessment approach that is constructed using as ground truth
the popularity and the preference of software components, as measured by online
services (Chap. 10).
As one may notice in the previous subsection, the contributions of the book can
be seen as a series of different reuse options in each phase of software engineering.
Although we do not claim to fully cover the reuse spectrum for software engineering,
we argue that our work offers a useful set of solutions when it comes to building
software products. In this subsection, we discuss this set of solutions and provide
an integrated example of how we aspire our contributions to be used in the reuse
context.
Assuming a team of engineers decides to build a new product, they would initially
craft the requirements in cooperation with the relevant stakeholders. As an example,
let’s consider that the team wants to build a social network for academics. The
reusable artifacts in this phase would most probably include functional requirements
and UML diagrams. The requirements would include the actors of the system (e.g.,
academics), the main objects (e.g., publications, universities, etc.), and several action
scenarios (e.g., academic logs in to the system, academic publishes paper, academic
follows academic, etc.). The methodology presented in Part II of this book covers
modeling this information (Chap. 3) and providing recommendations (in Chap. 4,
several forgotten requirements can be recommended based on existing ones; e.g.,
given a scenario where an academic uploads a preprint, we could further suggest
providing an option for editing the uploaded document).
Upon finalizing requirements elicitation and specification extraction, the devel-
opment team should start developing the application. Source code development is of
course a time-consuming process; they would initially have to design the required
components, write code to develop the components, and integrate them with one
another. Insights on source code component reuse are provided in Part III. In our
social network example, the team would probably first have to develop some objects,
such as a user account object and a publication object. Should they use our set
of tools, they could find ready-to-use code for these objects. For instance, as the
reader will see in Chaps. 5 and 6, the team could easily find a user account object
by specifying its possible structure (i.e., variables, such as username or password,
and/or methods, such as login, etc.). Issues related to source code, however, do not
end here. The development of the project would require integrating the components,
writing algorithms, connecting to external sources, debugging, etc. For instance, the
engineers may need to find out how to connect a database to store the data of their
social network (and also decide on which API to use for this task), or they may even
have to determine whether the source code implementations of an algorithm could
be improved. These are all queries that we aspire to be answered by the rest of Part
III of this book (Chaps. 7 and 8).
Finally, before reusing (or after adopting and adapting) any components, the devel-
opment team may also need to know if the components at hand are appropriate from
a quality perspective. For instance, given a set of different components for a query
about a user account object that are functionally equivalent, they may have to deter-
mine which one is the most reusable. This is a challenge for Part IV of this book,
which proposes not only retrieving functionally useful components but also assessing
their reusability using an estimation model (Chaps. 9 and 10). Using our solutions in
this case, the team would not only save valuable time and effort, but they would also
be more confident that the quality of the end product is acceptable.
References
21. Garvin DA (1984) What does ‘product quality’ really mean? MIT Sloan Manag Rev 26(1)
22. Pirsig RM (1974) Zen and the art of motorcycle maintenance: an inquiry into values. Harper-
Torch, New York
23. Suryn W (2014) Software quality engineering: a practitioner’s approach. Wiley-IEEE Press,
Hoboken
24. Pfleeger SL, Kitchenham B (1996) Software quality: the elusive target. IEEE Softw:12–21
25. Mccall JA, Richards PK, Walters GF (1977) Factors in software quality. Volume I. Concepts and
definitions of software quality. Technical report ADA049014, General Electric Co, Sunnyvale,
CA
26. Mccall JA, Richards PK, Walters GF (1977) Factors in software quality. Volume II. Metric data
collection and validation. Technical report ADA049014, General Electric Co, Sunnyvale, CA
27. Mccall JA, Richards PK, Walters GF (1977) Factors in software quality. Volume III. Prelimi-
nary handbook on software quality for an acquisition manager. Technical report ADA049014,
General Electric Co, Sunnyvale, CA
28. ISO/IEC (1991) ISO/IEC 9126:1991. Technical report, ISO/IEC
29. ISO/IEC 25010:2011 (2011) https://www.iso.org/standard/35733.html. Accessed Nov 2017
30. Kan SH (2002) Metrics and models in software quality engineering, 2nd edn. Addison-Wesley
Longman Publishing Co, Inc, Boston
31. Fenton N, Bieman J (2014) Software metrics: a rigorous and practical approach, 3rd edn. CRC
Press Inc, Boca Raton
32. Bourque P, Fairley RE (eds) (2014) SWEBOK: guide to the software engineering body of
knowledge, version 3.0 edition. IEEE Computer Society, Los Alamitos, CA
33. McCabe TJ (1976) A complexity measure. In: Proceedings of the 2nd international conference
on software engineering (ICSE’76), Los Alamitos, CA, USA. IEEE Computer Society Press,
p 407
34. Halstead MH (1977) Elements of software science (operating and programming systems series).
Elsevier Science Inc, New York
35. Chidamber SR, Kemerer CF (1994) A metrics suite for object oriented design. IEEE Trans
Softw Eng 20(6):476–493
36. Henderson-Sellers B (1996) Object-oriented metrics: measures of complexity. Prentice-Hall
Inc, Upper Saddle River
37. e Abreu FB, Carapuça R (1994) Candidate metrics for object-oriented software within a tax-
onomy framework. J Syst Softw 26(1):87–96
38. Lorenz M, Kidd J (1994) Object-oriented software metrics: a practical guide. Prentice-Hall
Inc, Upper Saddle River
39. Srinivasan KP, Devi T (2014) A comprehensive review and analysis on object-oriented software
metrics in software measurement. Int J Comput Sci Eng 6(7):247
40. Chhabra JK, Gupta V (2010) A survey of dynamic software metrics. J Comput Sci Technol
25(5):1016–1029
41. Yacoub SM, Ammar HH, Robinson T (1999) Dynamic metrics for object oriented designs. In:
Proceedings of the 6th international symposium on software metrics (METRICS’99), Wash-
ington, DC, USA. IEEE Computer Society, p 50
42. Arisholm E, Briand LC, Foyen A (2004) Dynamic coupling measurement for object-oriented
software. IEEE Trans Softw Eng 30(8):491–506
43. Schneidewind N (2009) Systems and software engineering with applications. Institute of Elec-
trical and Electronics Engineers, New York
44. Albrecht AJ (1979) Measuring application development productivity. In: I. B. M. Press (ed)
Proceedings of the IBM application development symposium, pp 83–92
45. Jones TC (1998) Estimating software costs. McGraw-Hill, Inc, Hightstown
46. Moser R, Pedrycz W, Succi G (2008) A comparative analysis of the efficiency of change
metrics and static code attributes for defect prediction. In: Proceedings of the 30th international
conference on software engineering (ICSE’08), New York, NY, USA. ACM, pp 181–190
47. Hassan AE (2009) Predicting faults using the complexity of code changes. In: Proceedings of
the 31st international conference on software engineering (ICSE’09), Washington, DC, USA.
IEEE Computer Society, pp 78–88
48. Spinellis D (2006) Code quality: the open source perspective. Effective software development
series. Addison-Wesley Professional, Boston
49. Boehm BW (1981) Software engineering economics, 1st edn. Prentice Hall PTR, Upper Saddle
River
50. Myers GJ, Sandler C, Badgett T (2011) The art of software testing, 3rd edn. Wiley Publishing,
Hoboken
51. Beizer B (1990) Software testing techniques, 2nd edn. Van Nostrand Reinhold Co, New York
52. Copeland L (2003) A practitioner’s guide to software test design. Artech House Inc, Norwood
53. Beck K (2002) Test driven development: by example. Addison-Wesley Longman Publishing
Co, Inc, Boston
54. Beck K (2000) Extreme programming explained: embrace change. Addison-Wesley Longman
Publishing Co, Inc, Boston
55. McConnell S (2004) Code complete, 2nd edn. Microsoft Press, Redmond
56. Witten IH, Frank E, Hall MA (2011) Data mining: practical machine learning tools and tech-
niques, 3rd edn. Morgan Kaufmann Publishers Inc, San Francisco
57. Tan P-N, Steinbach M, Kumar V (2005) Introduction to data mining, 1st edn. Addison-Wesley
Longman Publishing Co, Inc, Boston
58. Hand DJ, Smyth P, Mannila H (2001) Principles of data mining. MIT Press, Cambridge
59. Mitchell TM (1997) Machine learning, 1st edn. McGraw-Hill Inc, New York
60. Bishop CM (2006) Pattern recognition and machine learning. Springer-Verlag New York Inc,
Secaucus
61. Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge
University Press, New York
62. Engelbrecht AP (2007) Computational intelligence: an introduction, 2nd edn. Wiley Publishing,
Hoboken
63. Han J, Pei J, Kamber M (2011) Data mining: concepts and techniques. Morgan Kaufmann
Publishers Inc, San Francisco
64. Delen D, Demirkan H (2013) Data, information and analytics as services. Decis Support Syst
55(1):359–363
65. Hastie T, Tibshirani R, Friedman J (2001) The elements of statistical learning. Springer series
in statistics. Springer New York Inc, New York
66. Webb AR (2002) Statistical pattern recognition, 2nd edn. Wiley, Hoboken
67. Urbanowicz RJ, Browne WN (2017) Introduction to learning classifier systems. Springerbriefs
in intelligent systems. Springer New York Inc, New York
68. Xie T, Thummalapenta S, Lo D, Liu C (2009) Data mining for software engineering. Computer
42(8):55–62
69. Monperrus M (2013) Data-mining for software engineering. https://www.monperrus.net/
martin/data-mining-software-engineering. Accessed Nov 2017
70. Halkidi M, Spinellis D, Tsatsaronis G, Vazirgiannis M (2011) Data mining in software engi-
neering. Intell Data Anal 15(3):413–441
71. Kumar M, Ajmeri N, Ghaisas S (2010) Towards knowledge assisted agile requirements evo-
lution. In: Proceedings of the 2nd international workshop on recommendation systems for
software engineering (RSSE’10), New York, NY, USA. ACM, pp 16–20
72. Ghaisas S, Ajmeri N (2013) Knowledge-assisted ontology-based requirements evolution. In:
Maalej W, Thurimella AK (eds) Managing requirements knowledge, pp 143–167. Springer,
Berlin
73. Chen K, Zhang W, Zhao H, Mei H (2005) An approach to constructing feature models based
on requirements clustering. In: Proceedings of the 13th IEEE international conference on
requirements engineering (RE’05), Washington, DC, USA. IEEE Computer Society, pp 31–40
74. Alves V, Schwanninger C, Barbosa L, Rashid A, Sawyer P, Rayson P, Pohl C, Rummler A (2008)
An exploratory study of information retrieval techniques in domain analysis. In: Proceedings
of the 2008 12th international software product line conference (SPLC’08), Washington, DC,
USA. IEEE Computer Society, pp 67–76
75. Felfernig A, Schubert M, Mandl M, Ricci F, Maalej W (2010) Recommendation and decision
technologies for requirements engineering. In: Proceedings of the 2nd international workshop
on recommendation systems for software engineering (RSSE’10), New York, NY, USA. ACM,
pp 11–15
76. Maalej W, Thurimella AK (2009) Towards a research agenda for recommendation systems in
requirements engineering. In: Proceedings of the 2009 2nd international workshop on managing
requirements knowledge (MARK’09), Washington, DC, USA. IEEE Computer Society, pp 32–
39
77. Binkley D (2007) Source code analysis: a road map. In: 2007 Future of software engineering
(FOSE’07), Washington, DC, USA. IEEE Computer Society, pp 104–119
78. Janjic W, Hummel O, Schumacher M, Atkinson C (2013) An unabridged source code dataset
for research in software reuse. In: Proceedings of the 10th working conference on mining
software repositories (MSR’13), Piscataway, NJ, USA. IEEE Press, pp 339–342
79. Bajracharya S, Ngo T, Linstead E, Dou Y, Rigor P, Baldi P, Lopes C (2006) Sourcerer: a search
engine for open source code supporting structure-based search. In: Companion to the 21st ACM
SIGPLAN symposium on object-oriented programming systems, languages, and applications
(OOPSLA’06), New York, NY, USA. ACM, pp 681–682
80. Linstead E, Bajracharya S, Ngo T, Rigor P, Lopes C, Baldi P (2009) Sourcerer: mining and
searching internet-scale software repositories. Data Min Knowl Discov 18(2):300–336
81. Holmes R, Murphy GC (2005) Using structural context to recommend source code examples.
In: Proceedings of the 27th international conference on software engineering (ICSE’05), New
York, NY, USA. ACM, pp 117–125
82. Mandelin D, Lin X, Bodík R, Kimelman D (2005) Jungloid mining: helping to navigate the
API jungle. SIGPLAN Not 40(6):48–61
83. Thummalapenta S, Xie T (2007) PARSEWeb: a programmer assistant for reusing open source
code on the web. In: Proceedings of the 22nd IEEE/ACM international conference on automated
software engineering (ASE’07), New York, NY, USA. ACM, pp 204–213
84. Hummel O, Janjic W, Atkinson C (2008) Code conjurer: pulling reusable software out of thin
air. IEEE Softw 25(5):45–52
85. Lazzarini Lemos OA, Bajracharya SK, Ossher J (2007) CodeGenie: a tool for test-driven
source code search. In: Companion to the 22nd ACM SIGPLAN conference on object-oriented
programming systems and applications companion (OOPSLA’07), New York, NY, USA. ACM,
pp 917–918
86. Reiss SP (2009) Semantics-based code search. In: Proceedings of the 31st international con-
ference on software engineering (ICSE’09), Washington, DC, USA. IEEE Computer Society,
pp 243–253
87. Hummel O, Atkinson C (2004) Extreme harvesting: test driven discovery and reuse of software
components. In: Proceedings of the 2004 IEEE international conference on information reuse
and integration (IRI 2004), pp 66–72
88. Nurolahzade M, Walker RJ, Maurer F (2013) An assessment of test-driven reuse: promises
and pitfalls. In: Favaro J, Morisio M (eds) Safe and secure software reuse. Lecture notes in
computer science, vol 7925. Springer, Berlin, pp 65–80
89. Xie T, Pei J (2006) MAPO: mining API usages from open source repositories. In: Proceedings
of the 2006 international workshop on mining software repositories (MSR’06), New York, NY,
USA. ACM, pp 54–57
90. Sahavechaphan N, Claypool K (2006) XSnippet: mining for sample code. SIGPLAN Not
41(10):413–430
91. Zagalsky A, Barzilay O, Yehudai A (2012) Example overflow: using social media for code rec-
ommendation. In: Proceedings of the 3rd international workshop on recommendation systems
for software engineering (RSSE’12), Piscataway, NJ, USA. IEEE Press, pp 38–42
92. Wightman D, Ye Z, Brandt J, Vertegaal R (2012) SnipMatch: using source code context to
enhance snippet retrieval and parameterization. In: Proceedings of the 25th annual ACM sym-
posium on user interface software and technology (UIST’12), New York, NY, USA. ACM, pp
219–228
93. Wei Y, Chandrasekaran N, Gulwani S, Hamadi Y (2015) Building bing developer assistant.
Technical report MSR-TR-2015-36, Microsoft Research
94. Kagdi H, Collard ML, Maletic JI (2007) A survey and taxonomy of approaches for mining
software repositories in the context of software evolution. J Softw Maint Evol 19(2):77–131
95. Alves TL, Ypma C, Visser J (2010) Deriving metric thresholds from benchmark data. In:
Proceedings of the IEEE international conference on software maintenance (ICSM). IEEE, pp
1–10
96. Ferreira KAM, Bigonha MAS, Bigonha RS, Mendes LFO, Almeida HC (2012) Identifying
thresholds for object-oriented software metrics. J Syst Softw 85(2):244–257
97. Zhong S, Khoshgoftaar TM, Seliya N (2004) Unsupervised learning for expert-based software
quality estimation. In: Proceedings of the 8th IEEE international conference on high assurance
systems engineering (HASE’04), pp 149–155
98. Hovemeyer D, Spacco J, Pugh W (2005) Evaluating and tuning a static analysis to find null
pointer bugs. SIGSOFT Softw Eng Notes 31(1):13–19
99. Ayewah N, Hovemeyer D, Morgenthaler JD, Penix J, Pugh W (2008) Using static analysis to
find bugs. IEEE Softw 25(5):22–29
100. Le Goues C, Weimer W (2012) Measuring code quality to improve specification mining. IEEE
Trans Softw Eng 38(1):175–190
101. Washizaki H, Namiki R, Fukuoka T, Harada Y, Watanabe H (2007) A framework for measuring
and evaluating program source code quality. In: Proceedings of the 8th international conference
on product-focused software process improvement (PROFES). Springer, pp 284–299
102. Cai T, Lyu MR, Wong K-F, Wong M (2001) ComPARE: a generic quality assessment envi-
ronment for component-based software systems. In: Proceedings of the 2001 international
symposium on information systems and engineering (ISE’2001)
Part II
Requirements Mining
Chapter 3
Modeling Software Requirements
3.1 Overview
Typically, during the early stages of the software development life cycle, developers
and customers discuss and agree on the functionality of the system to be developed.
The functionality is recorded as a set of functional requirements, which form the basis
for the corresponding work implementation plan, cost estimations, and follow-up
directives [1]. Functional requirements can be expressed in various ways, including
UML diagrams, storyboards, and, most often, natural language [2].
Deriving formal specifications from functional requirements is one of the most
important steps of the software development process. The source code of a project
usually depends on an initial model (e.g., described by a class diagram) that has to
be designed very thoroughly in order to be functionally complete. Designing such
a model from scratch is not a trivial task. While requirements expressed in natural
language have the advantage of being intelligible to both clients and developers,
they can also be ambiguous, incomplete, and inconsistent. Several formal languages
have been proposed to eliminate some of these problems; however, customers rarely
possess the technical expertise for constructing and understanding highly formal-
ized requirements. Thus, investing effort into automating the process of translating
requirements to specifications can be highly cost-effective, as it removes the need
for customers to understand complex design models.
Automating the requirements translation process allows detecting ambiguities
and errors at an early stage of the development process (e.g., via logical inference
and verification tools), thus avoiding the costs of finding and fixing problems at a
later, more expensive stage [3]. In the field of requirements modeling, there are sev-
eral techniques for mapping requirements to specifications; however, they mostly
depend on domain-specific heuristics and/or controlled languages. In this chapter,
we propose a methodology that allows developers to design their envisioned system
through software requirements in multimodal formats. Specifically, our system effec-
tively models the static and dynamic views of a software project, and employs NLP
Fig. 3.2 Example scenario and code skeleton using JBehave [7]
candidate relationship between a library class and a book class is shown in Fig. 3.3.
In this example, the user has selected the relationship library-contains-book.
Finally, there are several commercial solutions for requirements authoring and
validation. For example, the Reuse Company1 offers solutions for modeling require-
ments using structured patterns and further assists in maintaining a common ter-
minology throughout the domain of the project. A similar solution is offered by
Innoslate,2 which further provides version control for requirements. These tools are
usually domain-specific, enabling reuse within a project, and allow extracting entities
for validation purposes.
Concerning model representations, most of the aforementioned systems use either
well-known standards from the literature, such as UML models, or ontologies, which have been
used extensively in Requirements Engineering (RE) for storing and validating infor-
mation extracted from requirements [15, 16]. Ontologies are also useful for storing
domain-specific knowledge and are known to integrate well with software modeling
1 https://www.reusecompany.com/.
2 https://www.innoslate.com/.
languages [17].3 They provide a structured means of organizing information, are par-
ticularly useful for storing linked data, and allow reasoning over implied relationships
between data items.
Though effective, these methodologies are usually restricted to formalized lan-
guages or depend on heuristics and domain-specific information. To this end, we
propose to employ semantic role labeling techniques in order to learn to map func-
tional requirements concepts to formal specifications. Note that although extracting
semantic relations between entities is one of the most well-known problems of NLP,4
applying these methods to the analysis of user requirements is a rather new idea.
Furthermore, with respect to current methodologies, we also aspire to provide a
unified model for storing requirements from multimodal formats, taking into account
both the static and dynamic views of software projects. Our model may be instantiated
from functional requirements written in natural language, UML use case and activity
diagrams, and graphical storyboards, for which we have constructed appropriate
tools. The extracted information is stored into ontologies, for analysis and reuse
purposes.
Our approach comprises the Reqs2Specs module, and its conceptual architecture is
shown in Fig. 3.4. The Reqs2Specs module includes a set of tools for developers
to enter the multimodal representations (functional requirements, storyboards) of
the envisaged software application as well as a methodology for transforming these
representations to specifications, i.e., models of the envisioned software project.
We have developed tools that allow developers to insert requirements in the form
of semi-structured text, UML use case and activity diagrams, and storyboards, i.e.,
diagrams depicting the flow of actions in the system. Using our approach, these
artifacts are syntactically and semantically annotated and stored in two ontologies, the
static ontology and the dynamic ontology, which practically store the static elements
and the dynamic elements of the system.
Having finalized the requirements elicitation process, the Reqs2Specs module
parses the ontologies into an aggregated ontology and extracts detailed system
specifications, which are suitable for use either directly by developers or even as
input in an automated source code generation engine (as performed in S-CASE [27]).
The ReqsMining module is used to provide recommendations about reusable infor-
mation present in functional requirements and UML diagrams. The methodologies
employed in this module are analyzed in detail in the next chapter. Note also that our
modeling methodology is fully traceable; any changes to the ontologies (e.g., as part
of a mining process) can be propagated back to the functional requirements, UML
diagrams, and storyboards. As a result, the stakeholders are always provided with a
clear requirements view of the system, and the specifications are well defined. The
following subsections outline the developed ontologies and the process of instanti-
ating them from functional requirements, UML diagrams, and storyboards.
This section discusses the design of the two ontologies that store information cap-
turing the static view (functional requirements, UML use case diagrams) and the
dynamic view (storyboards, UML activity diagrams) of software projects. These
ontologies provide a structured means of organizing information, underpin methods
for retrieving stored data via queries, and allow reasoning over implied relationships
between data items.
The Resource Description Framework (RDF)5 was selected as the formal frame-
work for representing requirements-related information. The RDF data model has
three object types: resources, properties, and statements. A resource is any “thing”
that can be described by the language, while properties are binary relations. An RDF
statement is a triple consisting of a resource (the subject), a property, and either a
5 http://www.w3.org/TR/1999/REC-rdf-syntax-19990222.
resource or a string as object of the property. Since the RDF data model defines
no syntax for the language, RDF models can be expressed in different formats; the
two most prevalent are XML and Turtle.6 RDF models are often accompanied by an
RDF schema (RDFS)7 that defines a domain-specific vocabulary of resources and
properties. Although RDF is an adequate basis for representing and storing informa-
tion, reasoning over the information and defining an explicit semantics for it is hard.
RDFS provides some reasoning capabilities but is intentionally inexpressive, and
fails to support the logical notion of negation. For this purpose, the Web Ontology
Language (OWL)8 was designed as a more expressive formalism that allows classes
to be defined axiomatically and supports consistency reasoning over classes. OWL
is built on top of RDF but includes a richer syntax with features such as cardinality
or inverse relations between properties. In the context of our work, we employ OWL
for representing information, since it is a widely used standard within the research
and industry communities.9
An Ontology for the Static View of Software Projects Concerning the static aspects
of requirements elicitation, i.e., functional requirements and use case diagrams, the
design of the ontology revolves around acting units (e.g., user, administrator, guest)
performing some action(s) on some object(s). The ontology was designed to sup-
port information extracted from Subject-Verb-Object (SVO) sentences. The class
hierarchy of the ontology is shown in Fig. 3.5.
Anything entered in the ontology is a Concept. Instances of Concept
are further classified into Project, Requirement, ThingType, and
OperationType. The Project and Requirement classes are used to store
the requirements of the system, so that any instance can be traced back to the orig-
inating requirement. ThingType and OperationType are the main types of
objects found in any requirement. ThingType refers to acting units and units
acted upon, while OperationType involves the types of actions performed by
the acting units. Each ThingType instance can be an Actor, an Object, or
a Property. Actor refers to the actors of the project, including users, the sys-
tem itself, or any external systems. Instances of type Object include any object
or resource of the system that receives some action, while Property instances
include all modifiers of objects or actions. OperationType includes all possible
operations, including Ownership that expresses possession (e.g., “each user has
an account”), Emergence that implies passive transformation (e.g., “the posts are
6 http://www.w3.org/TR/turtle/.
7 http://www.w3.org/TR/1999/PR-rdf-schema-19990303/.
8 http://www.w3.org/TR/2004/REC-owl-guide-20040210/.
9 Although we don’t rely on OWL inference capabilities explicitly, they are useful for expressing
integrity conditions over the ontology, such as ensuring that certain properties have inverse properties
(e.g., owns/owned_by).
sorted"), State that describes the status of an Actor (e.g., "the user is logged in"), and Action that describes an operation performed by an Actor on some Object (e.g., "the user creates a profile").
The possible interactions between the different concepts of the ontology, i.e., the
relations between the ontology (sub)classes, are defined using properties. The prop-
erties of the static ontology are shown in Fig. 3.6. Note that each property also has its
inverse one (e.g., has_actor has the inverse is_actor_of). Figure 3.6 includes
only one of the inverse properties and excludes properties involving Project and
Requirement for simplicity.
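As a schematic illustration of how such a class hierarchy and its inverse properties could be declared, the sketch below uses rdfs:subClassOf and owl:inverseOf via rdflib. The namespace is hypothetical and only a fragment of the hierarchy of Figs. 3.5 and 3.6 is shown; the actual static ontology is a full OWL model.

```python
# Schematic fragment of the static ontology: part of the class hierarchy
# and one inverse property pair (has_actor / is_actor_of).
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

ONT = Namespace("http://example.org/static-ontology#")  # hypothetical
g = Graph()

for cls in ("Concept", "ThingType", "OperationType",
            "Actor", "Object", "Property", "Action"):
    g.add((ONT[cls], RDF.type, OWL.Class))

# Concept > ThingType > {Actor, Object, Property}; Concept > OperationType > Action
g.add((ONT.ThingType, RDFS.subClassOf, ONT.Concept))
g.add((ONT.OperationType, RDFS.subClassOf, ONT.Concept))
for cls in ("Actor", "Object", "Property"):
    g.add((ONT[cls], RDFS.subClassOf, ONT.ThingType))
g.add((ONT.Action, RDFS.subClassOf, ONT.OperationType))

# Each property is declared together with its inverse
g.add((ONT.has_actor, RDF.type, OWL.ObjectProperty))
g.add((ONT.is_actor_of, RDF.type, OWL.ObjectProperty))
g.add((ONT.has_actor, OWL.inverseOf, ONT.is_actor_of))
```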
An instance of Project can have many instances of Requirement, while each Requirement connects to several ThingType and OperationType instances.
Manually annotating several requirements can be tiresome for the user. To this end, the Requirements Editor can be con-
nected to an external NLP parser in order to support semi-automatically annotating
requirements. We have constructed such an example parser (described in Sect. 3.3.3)
that operates in two stages. The first stage is the syntactic analysis for each require-
ment. Input sentences are initially split into tokens, the grammatical category of
these tokens is identified, and their base types are extracted, to finally identify the
grammatical relations between the words. The second stage involves the semantic
analysis of the parsed sentences. This stage extracts semantic features for the terms
(e.g., part of speech, relation to parent lemma, etc.) and employs a classification
algorithm to classify each term to the relevant concept or operation of the ontology.
Upon annotation, the user can select the option to export the annotations to
the static ontology. The mapping from the annotation scheme to the concepts
of the static ontology is straightforward. Actor, Action, Object, and Property
annotations correspond to the relevant OWL classes. Concerning relations,
IsActorOf instantiates the is_actor_of and has_actor properties, ActsOn
instantiates acts_on and receives_action, and HasProperty maps to the
has_property and is_property_of properties. The rest of the ontology
properties (e.g., project_has, consists_of) are instantiated using the infor-
mation of each functional requirement and of the name of the software project.
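In code, this mapping amounts to a small lookup from annotation labels and relation types to ontology classes and property pairs. The sketch below is hypothetical (it is not the exporter of the Requirements Editor) but mirrors the correspondences listed above:

```python
# Hypothetical mapping from annotation labels/relations to ontology terms.
CLASS_MAP = {
    "Actor": "Actor", "Action": "Action",
    "Object": "Object", "Property": "Property",
}

# Each annotated relation instantiates a property and its inverse.
RELATION_MAP = {
    "IsActorOf":   ("is_actor_of", "has_actor"),
    "ActsOn":      ("acts_on", "receives_action"),
    "HasProperty": ("has_property", "is_property_of"),
}

def relation_triples(source, relation, target):
    """Return the two (subject, property, object) triples for a relation."""
    prop, inverse = RELATION_MAP[relation]
    return [(source, prop, target), (target, inverse, source)]

# e.g. the annotation ActsOn("add", "bookmark") yields:
#   ("add", "acts_on", "bookmark") and ("bookmark", "receives_action", "add")
print(relation_triples("add", "ActsOn", "bookmark"))
```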
The relations between the classes of the dynamic ontology form the main flow derived from the diagram elements. Thus, Activity instances are connected with
each other via instances of type Transition. Any Transition has a source
and a target Activity (properties has_source and has_target, respec-
tively), and it may also have a GuardCondition (property has_condition).
Moreover, any Activity is related to instances of type Property, while any GuardCondition has an opposite one, the two being connected via the bidirectional property is_opposite_of. Finally, any Activity is connected to an
Actor, an Action, and an Object via the corresponding properties has_actor,
activity_has_action, and activity_has_object.
Extracting Dynamic Flow from Storyboards Storyboards are dynamic system
scenarios that describe flows of actions in software systems. A storyboard diagram is
structured as a flow from a start node to an end node. Between the start and the end
nodes, there are actions with their properties and conditions. All nodes are connected
with edges/paths. We have designed and implemented Storyboard Creator, a tool for
creating and editing storyboard diagrams, as an Eclipse plugin using the Graphical
Modeling Framework (GMF). A screenshot of the tool including a storyboard is
shown in Fig. 3.10.
The Storyboard Creator includes a canvas for drawing storyboards, a palette that
can be used to create nodes and edges, and an outline view that can be used to scroll
when the diagram does not fit into the canvas. It also supports validating diagrams
using several rules, e.g., to ensure that each diagram has at least one action node,
each property connects to exactly one action, etc.
The storyboard of Fig. 3.10 includes an action "Login to account", which also has two properties that define the elements of "account": a "username" and a "password". Conditions have two possible paths. For instance, condition "Credentials are
Table 3.1 Example instantiated OWL classes for the storyboard of Fig. 3.10
OWL class OWL instances
Activity Login_to_account
Property Username, Password
Transition FROM__StartNode__TO__Login_to_account,
FROM__Login_to_account__TO__EndNode,
FROM__Login_to_account__TO__Login_to_account
GuardCondition Credentials_are_correct__PATH__Yes,
Credentials_are_correct__PATH__No
Action login
Object account
correct?” has the “Yes” path that leads to the end node and the “No” path that leads
back to the login action requiring new credentials.
Mapping storyboard diagrams to the dynamic ontology is straightforward. Sto-
ryboard actions are mapped to the OWL class Activity, and they are further
split into instances of Action and Object. Properties become instances of the
Property class, and they are connected to the respective Activity instances
via the has_property relation. The paths and the condition paths of story-
boards become instances of Transition, while the storyboard conditions split
into two opposite GuardConditions. An example instantiation for the story-
board of Fig. 3.10 is shown in Table 3.1.
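The instance names of Table 3.1 follow a simple naming convention for transitions (FROM__source__TO__target) and guard conditions (condition__PATH__Yes/No). A minimal sketch of assumed helper functions that derive such names from a storyboard's edges and conditions:

```python
# Sketch (assumed helpers, not the Storyboard Creator export code): derive
# dynamic-ontology instance names following the convention of Table 3.1.
def transition_name(source, target):
    return f"FROM__{source}__TO__{target}"

def guard_names(condition):
    return [f"{condition}__PATH__Yes", f"{condition}__PATH__No"]

edges = [("StartNode", "Login_to_account"),
         ("Login_to_account", "EndNode"),
         ("Login_to_account", "Login_to_account")]
conditions = ["Credentials_are_correct"]

transitions = [transition_name(s, t) for s, t in edges]
guards = [g for c in conditions for g in guard_names(c)]
print(transitions)
# ['FROM__StartNode__TO__Login_to_account',
#  'FROM__Login_to_account__TO__EndNode',
#  'FROM__Login_to_account__TO__Login_to_account']
print(guards)
# ['Credentials_are_correct__PATH__Yes', 'Credentials_are_correct__PATH__No']
```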
Based on the modeling effort described in Sect. 3.3.2, we have developed a parser that
learns to automatically map software requirements written in natural language texts
to concepts and relations defined in the ontology. We identify instances of the con-
cepts Actor, Action, Object, and Property, and the relations among them
(is_actor_of, acts_on, and has_property). These concepts are selected
since they align well with key syntactic elements of the sentences, and are intuitive
enough to make it possible for non-linguists to annotate them in texts. Our parser,
however, could also identify more fine-grained concepts, given more detailed anno-
tations. In practice, the parsing task involves several steps: first, concepts need to
be identified and then mapped to the correct class and, second, relations between
concepts need to be identified and labeled accordingly. We employ a feature-based
relation extraction method, since we are focusing on a specific problem domain and
type of input, i.e., textual requirements of software projects. This allows us to imple-
ment the parsing pipeline shown in Fig. 3.11. In the following paragraphs, we provide
a summary of constructing, training, and evaluating our parser, while the interested
reader is further referred to [28] and [29] for the details of our methodology.
Our parser comprises two subsystems, one for syntactic and one for semantic
analysis. Syntactic analysis allows us to identify the grammatical category of each
word in a requirement sentence, while semantic analysis is used to map words and
constituents in a sentence to instances of concepts and relations from the ontology.
The syntactic analysis stage of our pipeline architecture performs the following
steps: tokenization, part-of-speech tagging, lemmatization, and dependency parsing.
Given an input sentence, this means that the pipeline separates the sentence into
word tokens, identifies the grammatical category of each word (e.g., “user” → noun,
Fig. 3.12 Example annotated instance using the hierarchical annotation scheme
“create” → verb), and determines their uninflected base forms (e.g., “users” →
“user”). Finally, the pipeline identifies the grammatical relations that hold between
two words (e.g., “user”, “must” → subject-of, “create”, “account” → object-of).
We construct the pipeline using the Mate Tools [30, 31],10 which is a system that
includes parsing components and pre-trained models for syntactic analysis.
The semantic analysis subsystem also adopts a pipeline architecture to receive
the output of syntactic preprocessing and extract instances of ontology concepts
and relations. The analysis is carried out in four steps: (1) identifying instances of
Action and Object; (2) allocating these to the correct concept (either Action or
Object); (3) identifying instances of related concepts (i.e., Actor and
Property), and (4) determining their relationships to concept instances identi-
fied in step (1). These four steps correspond to the predicate identification, predicate
disambiguation, argument identification, and argument classification phases of Mate
Tools [30]. Our method is based on the semantic role labeler from Mate Tools and
uses the built-in re-ranker to find the best joint output of steps (3) and (4). We extend
Mate Tools with respect to continuous features and arbitrary label types. Each step
in our pipeline is implemented as a logistic regression model using the LIBLINEAR
toolkit [32] that employs linguistic properties as features, for which appropriate fea-
ture weights are learned based on annotated training data (see [28] and [29] for more
information on building this example parser).
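To give a flavor of a single classification step, the toy sketch below maps tokens to ontology concepts from a few linguistic features using scikit-learn, whose LogisticRegression with the liblinear solver wraps the same LIBLINEAR library [32]. The features and training data are invented for illustration; this is not the actual Mate Tools extension.

```python
# Toy illustration of one classification step (mapping a token to an
# ontology concept) from simple linguistic features; data is invented.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Features per token: part of speech, dependency relation, lemma of the head
train_features = [
    {"pos": "NN", "deprel": "SBJ", "head_lemma": "create"},   # "user"    -> Actor
    {"pos": "VB", "deprel": "ROOT", "head_lemma": "<root>"},  # "create"  -> Action
    {"pos": "NN", "deprel": "OBJ", "head_lemma": "create"},   # "account" -> Object
]
train_labels = ["Actor", "Action", "Object"]

vec = DictVectorizer()
X = vec.fit_transform(train_features)

clf = LogisticRegression(solver="liblinear")  # liblinear, as in the original setup
clf.fit(X, train_labels)

# An unseen subject noun; the prediction is most likely "Actor" given its features
test = vec.transform([{"pos": "NN", "deprel": "SBJ", "head_lemma": "delete"}])
print(clf.predict(test))
```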
Training the parser requires a dataset of annotated functional requirements. Our
annotation procedure is quite simple and involves marking instances of Actor,
Object, Action, and Property that are explicitly expressed in a given require-
ment. Consider the example of Fig. 3.12. In this sentence, “user” is annotated as
Actor, “create” as Action, “account” as Object, and “username” as
Property.11
Since annotating can be a hard task, especially for inexperienced users, we cre-
ated an annotation tool, named Requirements Annotation Tool, to ease the process.
The Requirements Annotation Tool is a web platform that allows users to create an
account, import one or more of their projects, and annotate them. Practically, the
tool is similar to the Requirements Editor presented in Sect. 3.3.2; however, the main
scope of the Requirements Annotation Tool is annotation for parser training, so it
provides an intuitive drag-and-drop environment. The two tools are compatible, as they
allow importing/exporting annotations using the same file formats.
10 http://code.google.com/p/mate-tools/.
11 More specific or implicit annotations could also be added using the hierarchical annotation scheme
defined in [29].
Table 3.3 Properties for the ontology instances of requirement FR4 of Restmarks
OWL individual OWL property OWL individual
user is_actor_of add
add has_actor user
add acts_on bookmark
bookmark receives_action add
12 The majority of requirements collected in this way were provided by a software development
once for testing while training on the remaining folds. Using this setting, our
model achieves a precision and recall of 77.9% and 74.5%, respectively. As these
metrics illustrate, our semantic parsing module is quite effective in terms of software
engineering automation. In particular, the recall value indicates that our module cor-
rectly identifies roughly three out of four of all annotated instances and relations.
Additionally, the precision of our module indicates that approximately four out of
five annotations are correct. Thus, our parser significantly reduces the effort (and
time) required to identify the concepts that are present in functional requirements
and annotate them accordingly.
The ontologies developed in Sect. 3.3.2 by themselves constitute a model that is suitable for storing requirements, as well as for tasks such as requirements validation and
mining (see following chapter). Their aggregation can lead to a final model that
would describe the system under development both from a static and from a dynamic
perspective. This model can be subsequently used to design the final system or even
automatically develop it (as performed in [27]). Without loss of generality, we build
an aggregated model for RESTful web services and construct a YAML representation
that functions as the final model of the service under development.
Table 3.5 Classes mapping from static and dynamic ontologies to aggregated ontology
OWL class of static ontology | OWL class of dynamic ontology | OWL class of aggregated ontology
Project | Project | Project
Requirement | - | Requirement
- | ActivityDiagram | ActivityDiagram
OperationType | Action | Activity
- | GuardCondition | Condition
Object | Object | Resource
Property | Property | Property
Table 3.6 Properties mapping from static and dynamic ontologies to aggregated ontology
OWL property of static ontology | OWL property of dynamic ontology | OWL property of aggregated ontology
project_has | - | has_requirement
is_of_project | - | is_requirement_of
- | project_has_diagram | has_activity_diagram
- | is_diagram_of_project | is_activity_diagram_of
consists_of | diagram_has | contains_element
consist | is_of_diagram | element_is_contained_in
receives_action | is_object_of | has_activity
acts_on | has_object | is_activity_of
has_property | has_property∗ | has_property
is_property_of | is_property_of∗ | is_property_of
- | has_action | has_action
- | is_action_of | is_action_of
- | has_condition∗ | has_condition
- | is_condition_of∗ | is_condition_of
- | has_target | has_next_activity
- | has_source | has_previous_activity
∗ derived property
added. However, any properties of this instance are also added (again if they do not
exist); this ensures that the ontology is fully descriptive, yet without any redundant
information. The mapping for OWL properties is shown in Table 3.6.
Note that some properties are not directly mapped among the ontologies. In
such cases, properties can be derived from intermediate instances. For example,
the aggregated ontology property has_property is directed from Resource
to Property. In the case of the dynamic ontology, however, properties are con-
nected to activities. Thus, for any Activity, e.g., "Create bookmark", we have to first find the respective Object ("bookmark") and then, upon adding it to the aggregated ontology as a Resource, attach the properties of the activity to it.
Upon instantiating the aggregated ontology, the next step is the transformation from
the ontology to the specifications of the envisioned service. Specifications can be
defined and stored in a variety of different formats, including Computationally Inde-
pendent Models (CIMs), UML artifacts, etc. In our case, we design a representation
in YAML,14 which effectively describes the conceptual elements of the service under development and provides the user with the ability to modify (refine) the
model. After that, our YAML model can be used as a design artifact or even to directly
produce a CIM, thus allowing developers to have a clear view of the specifications
and in some cases even automate the construction of the service (e.g., as in [27]).
YAML supports several well-known data structures, since it is designed to be easily
mapped to programming languages. In our case, we use lists and associative arrays
(i.e., key-value structures) to create a structure for resources, their properties, and the
different types of information that have to be stored for each resource. The schema
of our representation is shown in Fig. 3.16, where the main element is the RESTful
resource (Resource).
A project consists of a list of resources. Several fields are defined for each
resource, each with its own data type and allowed values. At first, each resource
14 http://yaml.org/.
must have a name, which also has to be unique. Additionally, every resource may
be either algorithmic (i.e., requiring some algorithm-related code to be written or
some external service to be called) or non-algorithmic. This is represented using
the IsAlgorithmic Boolean field. CRUD verbs/actions (or synonyms that can
be translated to a CRUD verb) are usually not applied on algorithmic resources.
There are four types of activities that can be applied to resources, in compliance with
the CRUD actions (Create, Read, Update, and Delete). Each resource may support
one or more of these actions, i.e., any combination of them, represented as a list
(CRUDActivities).
Resources also have properties, which are defined as a list of objects. Each property
has a Name, which is alphanumerical, as well as a Type, which corresponds to
the common data types of programming languages, i.e., integers, floats, strings,
and booleans. Furthermore, each property has two Boolean fields: Unique and
NamingProperty. The former denotes whether the property has a unique value
for each instance of the resource, while the latter denotes whether the resource is
named after the value of this property. For example, a resource “user” could have
the properties “username” and “email account”. In this case, the “username” would
possibly be unique, while “email account” could or could not be unique (e.g., a user
may be allowed to declare more than one email account). Any instance of "user",
however, should also be uniquely identified in the system. Thus, if we do not allow two
users to have the same username, we could declare “username” as a naming property.
Finally, each resource may have related resources. The field RelatedResources
is a list of alphanumerical values corresponding to the names of other resources.
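To illustrate the schema, the sketch below builds one resource as a Python dictionary and dumps it with PyYAML. The field names follow the description above (Name, IsAlgorithmic, CRUDActivities, Properties with Name/Type/Unique/NamingProperty, RelatedResources), while the concrete property names and the top-level key are illustrative assumptions; the exact layout of Fig. 3.16 may differ.

```python
# Sketch of a single Restmarks-like resource in the YAML representation.
import yaml  # PyYAML

bookmark = {
    "Name": "bookmark",
    "IsAlgorithmic": False,
    "CRUDActivities": ["Create", "Read", "Update", "Delete"],
    "Properties": [
        {"Name": "url", "Type": "string", "Unique": True, "NamingProperty": True},
        {"Name": "private", "Type": "boolean", "Unique": False, "NamingProperty": False},
    ],
    "RelatedResources": ["tag"],
}

print(yaml.safe_dump({"Resources": [bookmark]}, sort_keys=False))
```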
Extracting information from the aggregated ontology and creating the correspond-
ing YAML file is a straightforward procedure. At first, instances of the OWL class
Resource can directly be mapped to YAML objects of type Resource. Each
resource is initially considered non-algorithmic. The flow of activities and condi-
tions is used to find the types of verbs that are used on any resource as well as the
related resources. Thus, for example, given an activity “Add bookmark” followed
by an activity “Add tag”, one may identify two resources, “bookmark” and “tag”,
where “tag” is also a related resource for “bookmark”. Additionally, both “book-
mark” and “tag” must have the “Create” CRUD activity enabled, since the verb
“add” implies creating a new instance. The type for each verb is recognized using
a lexicon. Whenever an action verb cannot be classified as any of the four CRUD
types, a new algorithmic resource is created. Thus, for example, an activity "Search user" would spawn the new algorithmic resource "userSearch" and connect it as a related resource of the resource "user". Finally, the properties of the resources are
mapped to the Properties list field.
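The verb classification step can be pictured as a lexicon lookup that either enables a CRUD activity on a resource or spawns a new algorithmic resource. The sketch below is a rough approximation; the lexicon entries and the in-memory model layout are assumptions.

```python
# Hypothetical CRUD lexicon: action verbs (or synonyms) mapped to CRUD types.
CRUD_LEXICON = {
    "add": "Create", "create": "Create",
    "get": "Read", "show": "Read", "retrieve": "Read",
    "update": "Update", "edit": "Update",
    "delete": "Delete", "remove": "Delete",
}

def new_resource():
    return {"IsAlgorithmic": False, "CRUDActivities": set(), "RelatedResources": set()}

def apply_activity(verb, resource, model):
    """Enable a CRUD activity on a resource, or spawn an algorithmic resource."""
    model.setdefault(resource, new_resource())
    crud = CRUD_LEXICON.get(verb.lower())
    if crud is not None:
        model[resource]["CRUDActivities"].add(crud)
    else:
        # e.g. the activity "Search user" spawns the algorithmic resource "userSearch"
        algo = resource + verb.capitalize()
        model.setdefault(algo, new_resource())["IsAlgorithmic"] = True
        model[resource]["RelatedResources"].add(algo)

model = {}
apply_activity("add", "bookmark", model)    # enables "Create" on "bookmark"
apply_activity("search", "user", model)     # creates "userSearch", related to "user"
print(model)
```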
3.4 Case Study
In this section, we provide a case study for the example project Restmarks, which
was introduced in Sect. 3.3.3. Consider Restmarks as a service that allows users to
store and retrieve online their bookmarks, share them with other users, and search
for bookmarks by using tags. One could think of it as a social service for bookmarks.
In the following paragraphs, we illustrate how one can use our system step by step,
as in Fig. 3.4, to create such a product easily, while at the same time ensuring that the produced service is fully functional and traceable.
The first step of the process includes entering and annotating functional require-
ments. An illustrative part of the requirements of Restmarks, annotated and refined
using the Requirements Editor, is depicted in Fig. 3.17.
Upon adding the requirements, the user has to enter information about the dynamic
view of the system. In this example, dynamic system representation is given in the
form of storyboards. Let us assume Restmarks has the following dynamic scenarios:
1. Add Bookmark: The user adds a bookmark to his/her collection and optionally
adds a tag to the newly added bookmark.
2. Create Account: The user creates a new account.
3. Delete Bookmark: The user deletes one of his/her bookmarks.
4. Login to Account: The user logs in to his/her account.
5. Search Bookmark by Tag System Wide: The user searches for bookmarks by
giving the name of a tag. The search involves all public bookmarks.
6. Search Bookmark by Tag User Wide: The user searches for bookmarks by giving
the name of a tag. The search involves the user’s public and private bookmarks.
7. Show Bookmark: The system shows a specific bookmark to the user.
8. Update Bookmark: The user updates the information on one of his/her bookmarks.
Table 3.7 Instantiated classes Resource, Property, and Activity for Restmarks
OWL class OWL instances
Resource bookmark, tag, account, password
Property username, password, private, public, user
Activity search_bookmark, add_tag, update_bookmark, delete_bookmark,
mark_bookmark, retrieve_bookmark, delete_tag, add_bookmark,
update_tag, update_password, login_account, create_account,
get_bookmark
The storyboards of the above scenarios are created using the Storyboard Creator.
An example storyboard diagram using the tool is shown in Fig. 3.18. The storyboard
of this figure refers to the first scenario of adding a new bookmark. The rest of the
requirements and storyboards of the project are inserted accordingly.
Upon having composed the requirements and storyboards, the next step involves
creating the static and dynamic ontologies. The two ontologies are combined to
provide the instances of the aggregated ontology. The instantiation of the classes
Resource, Property, and Activity of the aggregated ontology is shown in
Table 3.7.
Finally, the YAML representation that describes the project is exported from the
aggregated ontology. For Restmarks, the YAML file is shown in Fig. 3.19.
Once the YAML representation of the envisioned system is produced, the devel-
oper has complete and traceable specifications that can be subsequently used for
various purposes, including requirements validation, system development, and soft-
ware reuse. Upon demonstrating how one can instantiate our models, the tasks of
requirements mining and reuse are illustrated in the following chapter, while for more
information about the potential model-to-code transformation the interested reader
is referred to [27].
3.5 Conclusion
In this chapter, we presented a methodology for extracting software specifications using software requirements from multimodal formats. Our system receives input
in the form of functional requirements and action flows in the form of storyboards,
thus covering both the structural and the behavioral views of the envisioned system.
The use of NLP and semantics facilitates the extraction of specifications from these
requirements, while the designed ontologies produce a traceable model.
To further illustrate the advantages of our approach, we focus on RESTful web
service development and thus illustrate how the specifications of a service can be
easily produced. The produced model (YAML file) comprises all the required ele-
ments of the service, including resources, CRUD actions and properties, and support
for hypermedia links. Future work on our methodology may lie in several directions.
The continuous improvement of the tools by receiving feedback from users is in our
immediate plans, while future research includes further assessing our methodology
in industrial settings.
References
1. van Lamsweerde A (2009) Requirements engineering: from system goals to UML models to
software specifications. Wiley, Hoboken
2. Luisa M, Mariangela F, Pierluigi NI (2004) Market research for requirements analysis using
linguistic tools. Requir Eng 9(1):40–56
3. Boehm B, Basili VR (2001) Software defect reduction top 10 list. Computer 34:135–137
4. Kaindl H, Smialek M, Svetinovic D, Ambroziewicz A, Bojarski J, Nowakowski W, Straszak T,
Schwarz H, Bildhauer D, Brogan JP, Mukasa KS, Wolter K, Krebs T (2007) Requirements spec-
ification language definition: defining the ReDSeeDS languages, Deliverable D2.4.1. Public
deliverable, ReDSeeDS (Requirements driven software development system) project
5. Smialek M. (2012) Facilitating transition from requirements to code with the ReDSeeDS tool.
In: Proceedings of the 2012 IEEE 20th international requirements engineering conference (RE),
RE ’12. IEEE Computer Society, Washington, pp 321–322
6. Wynne M, Hellesoy A (2012) The cucumber book: behaviour-driven development for testers
and developers. Pragmatic bookshelf
7. North D (2003) JBehave: a framework for behaviour driven development. http://jbehave.org/.
Accessed March 2013
8. Mylopoulos J, Castro J, Kolp M (2000) Tropos: a framework for requirements-driven software
development. Information systems engineering: state of the art and research themes. Springer,
Berlin, pp 261–273
9. Yu ES-K (1995) Modelling strategic relationships for process reengineering. PhD thesis, Uni-
versity of Toronto, Toronto, Ont., Canada. AAINN02887
10. Abbott RJ (1983) Program design by informal english descriptions. Commun ACM
26(11):882–894
11. Booch G (1986) Object-oriented development. IEEE Trans Softw Eng 12(1):211–221
12. Saeki M, Horai H, Enomoto H (1989) Software development process from natural language
specification. In: Proceedings of the 11th international conference on software engineering,
ICSE ’89. ACM, New York, pp 64–73,
13. Mich L (1996) NL-OOPS: from natural language to object oriented requirements using the
natural language processing system LOLITA. Nat Lang Eng 2(2):161–187
14. Harmain HM, Gaizauskas R (2003) CM-Builder: a natural language-based CASE tool for
object-oriented analysis. Autom Softw Eng 10(2):157–181
15. Castañeda V, Ballejos L, Caliusco ML, Galli MR (2010) The use of ontologies in requirements
engineering. Glob J Res Eng 10(6)
16. Siegemund K, Thomas EJ, Zhao Y, Pan J, Assmann U (2011) Towards ontology-driven require-
ments engineering. In: Workshop semantic web enabled software engineering at 10th interna-
tional semantic web conference (ISWC), Bonn
17. Happel H-J, Seedorf S (2006) Applications of ontologies in software engineering. In: Pro-
ceedings of the 2nd international workshop on semantic web enabled software engineering
(SWESE 2006), held at the 5th international semantic web conference (ISWC 2006), pp 5–9
18. Dermeval D, Vilela J, Bittencourt I, Castro J, Isotani S, Brito P, Silva A (2015) Applications of
ontologies in requirements engineering: a systematic review of the literature. Requir Eng 1–33
19. Kambhatla N (2004) Combining lexical, syntactic, and semantic features with maximum
entropy models for extracting relations. In: Proceedings of the ACL 2004 on interactive poster
and demonstration sessions, ACLdemo’04. ACL, Stroudsburg, pp 178–181
20. Zhao S, Grishman R (2005) Extracting relations with integrated information using kernel meth-
ods. In: Proceedings of the 43rd annual meeting on association for computational linguistics,
ACL ’05. Association for Computational Linguistics, Stroudsburg, pp 419–426
21. GuoDong Z, Jian S, Jie Z, Min Z (2005) Exploring various knowledge in relation extraction.
In: Proceedings of the 43rd annual meeting on association for computational linguistics, ACL
’05. Association for Computational Linguistics, Stroudsburg, pp 427–434
22. Bunescu R, Mooney RJ (2005) Subsequence kernels for relation extraction. In: Advances in
Neural Information Processing Systems, vol. 18: proceedings of the 2005 conference (NIPS),
pp 171–178
23. Zelenko D, Aone C, Richardella A (2003) Kernel methods for relation extraction. J Mach Learn
Res 3:1083–1106
24. Culotta A, Sorensen J (2004) Dependency tree kernels for relation extraction. In: Proceedings
of the 42nd annual meeting on association for computational linguistics, ACL ’04. Association
for Computational Linguistics, Stroudsburg, pp 423–429
25. Bunescu RC, Mooney RJ (2005) A shortest path dependency kernel for relation extraction. In:
Proceedings of the conference on human language technology and empirical methods in natural
language processing. Association for Computational Linguistics, Stroudsburg, pp 724–731
26. Bach N, Badaskar S (2007) A review of relation extraction. Carnegie Mellon University, Lan-
guage Technologies Institute
27. Zolotas C, Diamantopoulos T, Chatzidimitriou K, Symeonidis A (2016) From requirements to
source code: a model-driven engineering approach for RESTful web services. Autom Softw
Eng 1–48
28. Roth M, Diamantopoulos T, Klein E, Symeonidis AL (2014) Software requirements: a new
domain for semantic parsers. In: Proceedings of the ACL 2014 workshop on semantic parsing,
SP14, Baltimore, pp 50–54
29. Diamantopoulos T, Roth M, Symeonidis A, Klein E (2017) Software requirements as an appli-
cation domain for natural language processing. Lang Resour Eval 51(2):495–524
30. Björkelund A, Hafdell L, Nugues P (2009) Multilingual semantic role labeling. In: Proceedings
of the 13th conference on computational natural language learning: shared task, CoNLL ’09.
Association for Computational Linguistics, Stroudsburg, pp 43–48
31. Bohnet B (2010) Top accuracy and fast dependency parsing is not a contradiction. In: Proceed-
ings of the 23rd international conference on computational linguistics, Beijing, pp 89–97
32. Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: a library for large linear
classification. J Mach Learn Res 9:1871–1874
Chapter 4
Mining Software Requirements
4.1 Overview
determine project domains and identify missing entities and relations at require-
ments’ level. DARE [19] is such a tool that utilizes multiple sources of informa-
tion, including the requirements, the architecture, and the source code of a project.
DARE extracts entities and relations from several projects and then uses clustering
to identify common entities and thus recommend similar artifacts and architectural
schemata for each project. Figure 4.1 depicts an example mockup screen of the tool
for clustering terms relevant to a library management domain. Kumar et al. [15], on
the other hand, make use of ontologies in order to store requirements and project
domains and extract software specifications. Ghaisas and Ajmeri [16] further develop
a Knowledge-Assisted Ontology-Based Requirements Evolution (K-RE) repository
in order to facilitate requirements elicitation and resolve any conflicts between change
requests.
There are also several approaches that aspire to identify the features of a system
through its requirements and recommend new ones. Chen et al. [17] used require-
ments from several projects and constructed relationship graphs among requirements.
The authors then employed clustering techniques to extract domain information.
Their system can identify features that can be used to create a feature model of
projects that belong to the same domain. Similar work was performed by Alves
et al. [18], who employed the vector space model to construct a domain feature
model and used latent semantic analysis to find similar requirements by clustering
Fig. 4.2 Example clustering of requirements for a smart home system [18]
them into domains. An example for this line of work is shown in Fig. 4.2, which
depicts the hierarchical clustering of five textual requirements extracted from a doc-
ument describing a smart home system. The first two requirements (from top to
bottom) are quite similar as they are both related to devices and their integration.
The third requirement shares similar terms with the fourth one, while the fifth one is also close to the fourth, as both requirements describe actions for detecting danger.
On the other hand, Dumitru et al. [28] employed association rule mining to analyze
requirements and subsequently cluster them into feature groups. These groups can
be used to find similar projects or to recommend a new feature for a project.
From a different perspective, there are also approaches that explore the relation
between requirements and stakeholders [29–31]; the related systems are given as
input ratings for the requirements of individual stakeholders and use collaborative
filtering [32] to provide recommendations and prioritize requirements according to
stakeholder preferences. However, these approaches, and any approach focusing on
non-functional requirements [33], deviate from the scope of this work.
Based on the above discussion, one easily notices that most approaches are
domain-centered [15–19], which is actually expected as domain-specific information
can improve the understanding of the software project under analysis. Still, as noted
by Dumitru et al. [28], these domain analysis techniques are often not applicable due
to the lack of annotated requirements for a specific domain. And although feature-
based techniques [28, 33] do not face such issues, their scope is too high level and
they are not applied on fine-grained requirements.
In the context of our work, we build a system that focuses on functional require-
ments and provides low-level recommendations, maintaining a semantic domain-
agnostic outlook. To do so, we employ the static ontology and the semantic role
labeler of the previous chapter to extract actors, actions, objects, and properties
from functional requirements, and effectively index them. The descriptive power of
our model allows us to recommend requirements in a comprehensive and domain-
agnostic manner.
Early efforts for UML model mining employed information retrieval techniques and
were directed toward use case scenarios [24, 25]. Typically, such scenarios can be
defined as sets of events triggered by actions of actors, and they also include authors
and goals. Thus, research on scenario reuse is mainly directed toward representing
event flows and comparing them to find similar use cases [25]. Other approaches
involve representing UML diagrams as graphs and detecting similar graphs (dia-
grams) using graph matching techniques. In such graph representations, the vertices
comprise the object elements of UML use case, class, and sequence diagrams, while
the edges denote the associations among these elements. The employed graph match-
ing techniques can either be exact [34–36] or inexact [37]. Similar work by Park and
Bae [38] involves using Message Object Order Graphs (MOOGs) to store sequence
diagrams, where the nodes and the edges represent the messages and the flow among
them.
Despite the effectiveness of graph-based methods under certain scenarios (e.g.,
structured models), applying them to UML diagrams lacks semantics. As noted by
Kelter et al. [26], UML model mining approaches should focus on creating a seman-
tic model, rather than arbitrarily applying similarity metrics. For this reason, UML
diagrams (and XML structures) are often represented as ordered trees [26, 39, 40].
The main purpose of these efforts is to first design a data model that captures the
structure of UML diagrams and then apply ordered tree similarity metrics. Indica-
tively, the model of Kelter et al. [26] is shown in Fig. 4.3. It consists of the following
elements: a Document, a set of Elements, each one with its type (ElementType),
a set of Attributes for each Element, and a set of References between elements.
The model supports several types of data, including not only UML models, but also
more generic ones, such as XML documents. Furthermore, the Elements have hier-
archy, which is quite useful for UML Class diagrams, where examples of Elements
would include packages, classes, methods, etc., which are indeed hierarchical (e.g.,
packages include classes).
Fig. 4.3 Data model of the UML differencing technique of Kelter et al. [26]
These approaches are effective for static diagram types (use case, class); how-
ever, they cannot efficiently represent data flows or action flows, and thus they do not
adhere to dynamic models (activity, sequence). Furthermore, they usually employ
string difference techniques, and thus they are not applicable to scenarios with mul-
tiple sources of diagrams. To confront this challenge, researchers have attempted to
incorporate semantics through the use of domain-specific ontologies [22, 23, 41].
Matching diagram elements (actors, use cases, etc.) according to their semantic dis-
tance has indeed proven effective as long as domain knowledge is available; most of
the time, however, domain-specific information is limited.
We focus on the requirements elicitation phase and propose two mining method-
ologies, one for use case diagrams and one for activity diagrams. For use case diagram
matching, we handle use cases as discrete nodes and employ semantic similarity
metrics, thus combining the advantages of graph-based and information retrieval
techniques and avoiding their potential limitations. For activity diagram matching,
we construct a model that represents activity diagrams as sequences of action flows.
This way, the dynamic nature of the diagrams is modeled effectively, without using
graph-based methods that could result in over-engineering. Our algorithm is similar
to ordered tree methods [26, 39, 40], which are actually a golden mean between
unstructured (e.g., Information Retrieval) and heavily structured (e.g., graph-based)
methods. Our matching methodologies further use a semantic scheme for comparing
strings, which is, however, domain-agnostic and thus can be used in scenarios with
diagrams originating from various projects.
$$\mathrm{sim}(C_1, C_2) = \frac{2 \cdot \log P(C_0)}{\log P(C_1) + \log P(C_2)} \qquad (4.1)$$
1 http://users.sussex.ac.uk/~drh21/.
where C0 is the most specific class that contains both C1 and C2. For example, the most specific class that describes the terms account and profile is the class record, as shown also in the relevant WordNet fragment of Fig. 4.5.
Fig. 4.5 Fragment of the WordNet hierarchy covering the terms indication, evidence, record, account, history, biography, and profile
Upon having found C0 , we compute the similarity between the two terms using
Eq. (4.1), which requires the information content for each of the three WordNet
classes. The information content of a WordNet class is defined as the log prob-
ability that a corpus term belongs in that class.2 Thus, for instance, record,
account, and profile have information content values equal to 7.874, 7.874,
and 11.766, respectively, and the similarity between account and profile is
2 · 7.874/(7.874 + 11.766) = 0.802. Finally, two terms are assumed to be seman-
tically similar if their similarity value (as determined by Eq. (4.1)) is higher than a
threshold t. We set t to 0.5, and thus we now have 1512 terms for the 30 projects,
out of which 1162 are distinct.
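Equation (4.1) is essentially Lin's information-content similarity. A comparable score can be computed with NLTK's WordNet interface instead of the Perl WordNet::Similarity library used here; the sketch below assumes NLTK with the wordnet and wordnet_ic data installed, and the chosen senses are illustrative, so the resulting values need not match the figures above.

```python
# Lin similarity between two WordNet senses, analogous to Eq. (4.1).
# Requires: nltk.download("wordnet"); nltk.download("wordnet_ic")
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic("ic-brown.dat")  # precomputed information content

account = wn.synset("account.n.04")  # illustrative sense choices
profile = wn.synset("profile.n.01")

score = account.lin_similarity(profile, brown_ic)
print(round(score, 3))

t = 0.5  # similarity threshold used in the text
print("semantically similar" if score > t else "not similar")
```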
Having performed this type of analysis, we now have a dataset with one set of
items per software project and can extract useful association rules using association
rule mining [46]. Let P = {p_1, p_2, ..., p_m} be the set of m software projects and I = {i_1, i_2, ..., i_n} be the set of all n items. Itemsets are defined as subsets of I. Given an itemset X, its support is defined as the number of projects in which all of its items appear:

$$\sigma(X) = |\{\, p_i \mid X \subset p_i,\; p_i \in P \,\}| \qquad (4.2)$$
Association rules are expressed in the form X → Y , where X and Y are disjoint
itemsets. An example rule that can be extracted from the items of the require-
ment of Fig. 4.4 is {account_HasProperty_username} → {account_
HasProperty_password}.
The two metrics used to determine the strength of a rule are its support and its confidence. Given an association rule X → Y, its support is the fraction of projects that contain all items of both X and Y:

$$\sigma(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{|P|} \qquad (4.3)$$

2 We used the precomputed information content data of the Perl WordNet similarity library [44], available at http://www.d.umn.edu/~tpederse/.
The confidence of the rule indicates how frequently items in Y appear in projects that contain X, and it is given as

$$c(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)} \qquad (4.4)$$
We use the Apriori association rule mining algorithm [47] in order to extract associ-
ation rules with support and confidence above certain thresholds. For our dataset, we
set the minimum support to 0.1, so that any rule has to be contained in at least 10% of
the projects. We also set the minimum confidence to 0.5, so that the extracted rules
are confirmed at least half of the time that their antecedents are found. The execution
of Apriori resulted in 1372 rules, a fragment of which is shown in Table 4.1.
Several of these rules are expected. For example, rule 2 indicates that in order for
a user to login, the system must first validate the account. Also, rule 5 implies that
logout functionality should co-occur in a system with login functionality.
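For illustration, the support/confidence computation behind such rules can be reproduced with a brute-force enumeration over toy item sets, as in the sketch below. This is a deliberately simplified stand-in, restricted to single-item antecedents and consequents; the actual system runs the Apriori algorithm over the full dataset.

```python
# Simplified, brute-force illustration of support/confidence-based rule
# extraction over per-project item sets (the item strings are toy examples).
from itertools import combinations

projects = [
    {"account_HasProperty_username", "account_HasProperty_password",
     "user_IsActorOf_login"},
    {"account_HasProperty_username", "account_HasProperty_password"},
    {"user_IsActorOf_login", "user_IsActorOf_logout"},
]

def support(itemset):
    """Fraction of projects containing every item of the itemset."""
    return sum(1 for p in projects if itemset <= p) / len(projects)

MIN_SUPPORT, MIN_CONFIDENCE = 0.1, 0.5
items = sorted(set().union(*projects))

# Enumerate single-antecedent/single-consequent rules for brevity.
for a, c in combinations(items, 2):
    for antecedent, consequent in ((a, c), (c, a)):
        sup = support({antecedent, consequent})
        if sup < MIN_SUPPORT:
            continue
        conf = sup / support({antecedent})
        if conf >= MIN_CONFIDENCE:
            print(f"{{{antecedent}}} -> {{{consequent}}}  "
                  f"support={sup:.2f} confidence={conf:.2f}")
```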
Upon having extracted the rules, we use them to recommend new requirements for
software projects. At first, we define a new set of items p for the newly added software
project. Given this set of items and the rules, we extract the activated rules R. A rule
X → Y is activated for the project with set of items p if all items present in X are also
contained in p (i.e., X ⊂ p). The set of activated rules R is then flattened by creating
a new rule for each combination of antecedents and consequents of the original rule,
so that the new rules contain single items as antecedents and consequents. Given,
e.g., the rule X → Y where the itemsets X and Y contain the items {i 1 , i 2 , i 3 } and
{i 4 , i 5 }, respectively, the new flattened rules are i 1 → i 4 , i 1 → i 5 , i 2 → i 4 , i 2 → i 5 ,
i 3 → i 4 , and i 3 → i 5 . We also propagate the support and the confidence of the original
rules to these new flattened rules, so that they are used as important criteria. Finally,
given the set of items of a project p and the flattened activated rules for this project,
our system provides recommendations of new requirements using the heuristics of
Table 4.2.
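Flattening the activated rules, as described above, simply enumerates every antecedent–consequent item pair and carries over the support and confidence of the original rule. A minimal sketch (the tuple-based rule representation is an assumption):

```python
from itertools import product

def flatten_rules(activated_rules):
    """Turn rules with itemsets into single-item rules, keeping their metrics.

    Each input rule is (antecedent_items, consequent_items, support, confidence).
    """
    flattened = []
    for antecedent, consequent, support, confidence in activated_rules:
        for a, c in product(antecedent, consequent):
            flattened.append((a, c, support, confidence))
    return flattened

rules = [({"i1", "i2", "i3"}, {"i4", "i5"}, 0.12, 0.75)]
for rule in flatten_rules(rules):
    print(rule)
# six flattened rules: i1->i4, i1->i5, i2->i4, i2->i5, i3->i4, i3->i5
```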
Concerning the heuristics for the consequent [Actor2, Action2], which corresponds to an Actor2_IsActorOf_Action2 item, the recommended requirement includes the actor and the action of the consequent as well as an Object that is determined by the antecedent. Given, e.g., an antecedent [create, bookmark] and a consequent [user, edit], the new recommended requirement will be [user, edit, bookmark]. Concerning the consequent [Action2, Object2], which corresponds to an Action2_ActsOn_Object2 item, the recommended requirement includes the action and the object of the consequent as well as the actor that is determined by the antecedent. Given, e.g., an antecedent [user, profile] and a consequent [create, profile], the new recommended requirement will be [user, create, profile]. Finally, any rule with a HasProperty consequent (and any antecedent) leads to new recommended requirements of the form [Any, Property]. An example requirement would be [user, profile]. Using the static ontology, we are also able to reconstruct requirements, in the formats "The Actor must be able to Action Object" and "The Any must have Property".
Fig. 4.6 Example use case diagram of Restmarks, containing 3 actors (User, Registered User, Guest User) and 10 use cases (Create Account, Login to Account, Add Bookmark, Update Bookmark, Delete Bookmark, Show Bookmark, List Bookmarks, Search by Tag, Add Tag, Update Account)
4.4 UML Models Mining
To design our UML diagrams mining model, we use a dataset of use case and activity
diagrams, which includes the diagrams of project Restmarks, the demo social service
that was used in the previous chapter. An example use case diagram from Restmarks
that contains 10 use cases and 3 actors is shown in Fig. 4.6.
As use cases refer to static information, they are stored in the static ontology with
a model that is mostly flat. Specifically, the model comprises the actors and the use
cases of the diagram. For example, the diagram D of Fig. 4.6 has a model that consists
of two sets: the set of actors A = {User, Registered User, Guest User} and the set of use cases UC = {Add Bookmark, Update Bookmark, Update Account, Show Bookmark, Search by Tag, Add Tag, Login to Account, Delete Bookmark}. Given two diagrams D1 and D2, our matching scheme involves two
sets for each diagram: one set for the actors A1 and A2, respectively, and one set for the use cases UC1 and UC2, respectively. The similarity between the diagrams is
computed by the following equation:

$$\mathrm{sim}(D_1, D_2) = \alpha \cdot s(A_1, A_2) + (1 - \alpha) \cdot s(UC_1, UC_2) \qquad (4.5)$$

where s denotes the similarity between two sets (either of actors or of use cases) and α denotes the importance of the similarity of actors for the diagrams. α was set to the ratio of the number of actors to the number of use cases of the queried diagram. Given, e.g., a diagram with 3 actors and 10 use cases, α is set to 0.3.
The similarity between two sets, either actors or use cases, is given by the best possible one-to-one matching between their elements, i.e., the combination of matched elements with the highest total score. Given, e.g., two sets {user, administrator, guest} and {administrator, user}, the best possible combination is {(user, user), (administrator, administrator), (guest, null)}, and the matching would return a score of 2/3 = 0.66. We employ the semantic mea-
sure of the previous section to provide a similarity score between strings. Given
two strings S1 and S2 , we first split them into tokens, i.e., tokens(S1 ) = {t1 , t2 },
tokens(S2 ) = {t3 , t4 }, and then determine the combination of tokens with the maxi-
mum token similarity scores. The final similarity score between the strings is deter-
mined by averaging over all tokens. For example, given the strings “Get bookmark”
and “Retrieve bookmarks”, the best combination is (“get”, “retrieve”) and (“book-
mark”, “bookmarks”). Since the semantic similarity between “get” and “retrieve”
is 0.677, and the similarity of “bookmark” with “bookmarks” is 1.0, the similarity
between the strings is (0.677 + 1)/2 = 0.8385.
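Finding the best combination of matched elements (or tokens) is an assignment problem: elements are paired so that the total similarity is maximal, and unmatched elements contribute zero. The sketch below solves it with SciPy's linear_sum_assignment; the similarity function is stubbed with plain string equality, and the normalization by the size of the larger set is an assumption that reproduces the 2/3 score of the example above.

```python
# Best one-to-one matching between two element sets, maximizing total similarity.
import numpy as np
from scipy.optimize import linear_sum_assignment

def set_similarity(set1, set2, sim):
    elems1, elems2 = list(set1), list(set2)
    score = np.array([[sim(a, b) for b in elems2] for a in elems1])
    rows, cols = linear_sum_assignment(-score)  # negate to maximize
    # Unmatched elements score 0; normalize by the larger set (assumption that
    # matches the 2/3 example in the text).
    return score[rows, cols].sum() / max(len(elems1), len(elems2))

# Stub similarity: exact string match; a WordNet-based measure (Sect. 4.3)
# would be plugged in here in practice.
equal = lambda a, b: 1.0 if a == b else 0.0

print(set_similarity({"user", "administrator", "guest"},
                     {"administrator", "user"}, equal))  # 2/3 ≈ 0.667
```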
Given, for example, the diagram of Fig. 4.6 and the diagram of Fig. 4.7, the match-
ing between the diagram elements is shown in Table 4.3, while the final score (using
Eq. (4.5)) is 0.457.
Concerning recommendations, the engineer of the second diagram could consider
adding a guest user. Furthermore, he/she could consider adding use cases for listing
or updating bookmarks, adding tags to bookmarks, or updating account data.
Concerning activity diagrams, we require a representation that would view the
diagram as a flow model. An example diagram of Restmarks is shown in Fig. 4.8.
Activity diagrams consist mostly of sequences of activities and possible
conditions. Concerning conditions (and forks/joins), we split the flow of the dia-
gram. Hence, an activity diagram is actually treated as a set of sequences, each
of which involves the activities required to traverse the diagram from its start
node to its end node. For instance, the diagram of Fig. 4.8 spawns a sequence
StartNode > Logged In? > Login to account > Provide bookmark URL > Create Bookmark > Add tag > User wants to add tag? > EndNode. Upon hav-
ing parsed two diagrams and having extracted one set of sequences per diagram, we
compare the two sets. In this case and in contrast with use case diagram matching,
we set a threshold t_ACT for the semantic string similarity metric. Thus, two strings are considered similar if their similarity score is larger than this threshold. We set t_ACT to 0.5.
Fig. 4.7 Example use case diagram for matching with the one of Fig. 4.6
Table 4.3 Matching between the diagrams of Figs. 4.6 and 4.7
Diagram 1 | Diagram 2 | Score
User | User | 1.00
Registered user | Registered user | 1.00
Guest user | null | 0.00
Delete bookmark | Remove bookmark | 0.86
Show bookmark | Retrieve bookmark | 0.50
Add bookmark | Add bookmark | 1.00
Create account | Create new account | 0.66
Search by tag | Search | 0.33
Login to account | Login | 0.33
List bookmarks | null | 0.00
Update bookmark | null | 0.00
Update account | null | 0.00
Add tag | null | 0.00
We use the Longest Common Subsequence (LCS) [48] in order to determine the
similarity score between two sequences. The LCS of two sequences S1 and S2
is defined as the longest subsequence of common (not necessarily consecutive) elements
between them. Given, e.g., the sequences [A, B, D, E, G] and [A, B, E, H ], their
LCS is [A, B, E]. Finally, the similarity score between the sequences is defined as

$$\mathrm{sim}(S_1, S_2) = \frac{2 \cdot |LCS(S_1, S_2)|}{|S_1| + |S_2|} \qquad (4.6)$$
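The LCS length of Eq. (4.6) can be computed with standard dynamic programming, as sketched below. In the actual matching, two sequence elements are considered equal when their semantic similarity exceeds t_ACT; plain equality is used here for brevity.

```python
def lcs_length(s1, s2):
    """Length of the longest common (not necessarily consecutive) subsequence."""
    table = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i, a in enumerate(s1, start=1):
        for j, b in enumerate(s2, start=1):
            table[i][j] = (table[i - 1][j - 1] + 1 if a == b
                           else max(table[i - 1][j], table[i][j - 1]))
    return table[len(s1)][len(s2)]

def sequence_similarity(s1, s2):
    """Eq. (4.6): LCS-based similarity, normalized in [0, 1]."""
    return 2 * lcs_length(s1, s2) / (len(s1) + len(s2))

s1 = ["A", "B", "D", "E", "G"]
s2 = ["A", "B", "E", "H"]
print(lcs_length(s1, s2))                      # 3, i.e. [A, B, E]
print(round(sequence_similarity(s1, s2), 3))   # 2*3/(5+4) = 0.667
```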
The similarity score is normalized in [0, 1]. Given that we have two sets of sequences
(one for each of the two diagrams), their similarity is given by the best possi-
ble combination between the sequences, i.e., the combination that results in the
highest score. For example, given two sets {[A, B, E], [A, B, D, E], [A, B, C, E]}
and {[A, B, E], [A, C, E]}, the combination that produces the highest possible
score is {([A, B, E], [A, B, E]), ([A, B, C, E], [A, C, E]), ([A, B, D, E], null)}.
Finally, the similarity score between the diagrams is the average of their LCS scores.
For example, let us consider matching the diagrams of Figs. 4.8 and 4.9. Our
system returns the matching between the sequences of the diagrams, as shown in
Table 4.4, while the total score, which is computed as the mean of these scores, is
(0.833 + 0.714 + 0 + 0)/4 = 0.387.
The matching process between the sequences indicates that the engineer of the
second diagram could add a new flow that would include the option to add a tag to
the newly created bookmark.
Fig. 4.9 Example activity diagram for matching with the one of Fig. 4.8
Table 4.4 Matching between the diagrams of Figs. 4.8 and 4.9
Diagram 1 | Diagram 2 | Score
StartNode > Logged In? > Provide bookmark URL > Create Bookmark > Add tag > User wants to add tag? > EndNode | StartNode > Logged In? > Provide URL > Create Bookmark > EndNode | 0.833
StartNode > Logged In? > Login to account > Provide bookmark URL > Create Bookmark > Add tag > User wants to add tag? > EndNode | StartNode > Logged In? > Login > Provide URL > Create Bookmark > EndNode | 0.714
StartNode > Logged In? > Provide bookmark URL > Create Bookmark > Add tag > User wants to add tag? > Provide tag text > Add tag to bookmark > EndNode | null | 0.000
StartNode > Logged In? > Login to account > Provide bookmark URL > Create Bookmark > Add tag > User wants to add tag? > Provide tag text > Add tag to bookmark > EndNode | null | 0.000
4.5 Evaluation
Fig. 4.10 Example depicting (a) the functional requirements of Restmarks, (b) the recommended requirements, and (c) the recommended requirements visualized
of which he/she would have selected 6 to add to the project. Most recommended
requirements are extracted from rules with low support, which is actually expected
since our dataset is largely domain-agnostic. However, low support rules do not
necessarily result in low-quality recommendations, as long as their confidence is
large enough. Indicatively, 2 out of 3 recommendations extracted from rules with
confidence values equal to 0.5 may not be useful. However, setting the confidence
Fig. 4.11 Visualization of recommended requirements including the percentage of the correctly
recommended requirements given support and confidence
Fig. 4.12 Classification results of our approach and the approach of Kelter et al. for (a) use case diagrams and (b) activity diagrams
value to 0.75 ensures that more than 2 out of 3 recommended requirements will be
added to the project.
To evaluate our UML mining model, we use a dataset of 65 use case diagrams and
72 activity diagrams, originating from software projects with different semantics.
Our methodology involves finding similar diagrams and subsequently providing
recommendations, and thus we initially split diagrams into six categories, includ-
ing health/mobility, traffic/transportation, social/p2p networks, account/product ser-
vices, business process, and generic diagrams. We construct all possible pairs of use
case and activity diagrams, which are 2080 and 2556 pairs, respectively, and mark
each pair as relevant or non-relevant according to their categories.
We compare our approach to that of Kelter et al. [26]. Given that the two
approaches are structurally similar, their execution on use case diagrams provides an
assessment of the semantic methodology. Concerning activity diagrams, the approach
of Kelter et al. uses a static data model, and thus our evaluation should reveal how
using a dynamic flow model can be more effective. Upon executing the approaches
on the sets of use case and activity diagrams, we normalized the matching scores
according to their average (so that pairs are defined when their score is higher than
0.5) and computed the precision, the recall, and the F-measure for each approach
and for each diagram type. The results are shown in Fig. 4.12.
It is clear that our approach outperforms that of Kelter et al. when it comes to finding relevant diagram pairs, both for use case and for activity diagrams.
The higher recall values indicate that our methodology can effectively retrieve more
relevant diagram pairs, while high precision values indicate that the retrieved pairs are
indeed relevant and false positives (non-relevant pairs that are considered relevant)
are fewer. The combined F-measure also indicates that our approach is more effective
for extracting semantically similar diagram pairs.
Examining the results indicates that our approach is much more effective when it comes to identifying diagram pairs that are semantically similar. For example,
out of the 2080 use case diagram pairs, 1278 were identified correctly as semantically
similar (or not similar), while the approach of Kelter et al. [26] identified correctly
1167 pairs. The corresponding numbers for the 2556 activity diagram pairs are 1428
and 1310, for our approach and the approach of Kelter et al. [26], respectively.
4.6 Conclusion
Although several research efforts have focused on the area of requirements min-
ing, most of these efforts are oriented toward high-level requirement representations
and/or are dependent on domain-specific knowledge. In this chapter, we presented
a mining methodology that can aid requirement engineers and stakeholders in the
elicitation of functional requirements and the UML models for the system under
development.
Concerning functional requirements, our system employs association rule mining
to extract useful rules and uses these rules to create recommendations for the project
under development. As we have shown in the evaluation section, our system can
provide a set of recommendations out of which approximately 60% are rational.
Thus, given that the requirements engineer (or a stakeholder) has compiled a set
of requirements, he/she could effectively check whether he/she has omitted any
important ones.
Our methodology for UML model mining employs domain-agnostic semantics
and uses practical representations for UML use case and activity diagrams. As such, it
covers both the static and the dynamic views of software projects, while the specifics
of use case and activity diagrams, i.e., set-based and flow-based structure, are taken
into account. The results of our evaluation indicate that our mining approach outper-
forms current state-of-the-art approaches.
Concerning future work, an interesting problem in the area of requirements elici-
tation is that of combining the preferences of different stakeholders. In the context of
functional requirements, an extension of our system may be used in a multiple stake-
holder scenario, where the different stakeholder requirements can be combined and
a more project-specific set of association rules can be extracted. Another interesting
future direction would be to improve the semantic matching module by incorporating
information from online software databases (e.g., GitHub). Finally, our UML model mining methodology could be extended to parse different diagram types, such as UML class or sequence diagrams.
References
1. Montequin VR, Cousillas S, Ortega F, Villanueva J (2014) Analysis of the success factors and
failure causes in Information & Communication Technology (ICT) projects in Spain. Procedia
Technol 16:992–999
2. Leffingwell D (1997) Calculating your return on investment from more effective requirements
management. Am Program 10(4):13–16
3. Thummalapenta S, Xie T (2007) PARSEWeb: a programmer assistant for reusing open source
code on the web. In: Proceedings of the 22nd IEEE/ACM international conference on automated
software engineering, ASE’07. ACM, New York, pp 204–213
4. Sahavechaphan N, Claypool K (2006) XSnippet: mining for sample code. SIGPLAN Not
41(10):413–430
5. Hummel O, Janjic W, Atkinson C (2008) Code conjurer: pulling reusable software out of thin
air. IEEE Softw 25(5):45–52
6. Diamantopoulos T, Thomopoulos K, Symeonidis AL (2016) QualBoa: reusability-aware rec-
ommendations of source code components. In: Proceedings of the IEEE/ACM 13th working
conference on mining software repositories, MSR’16, pp 488–491
7. Papamichail M, Diamantopoulos T, Symeonidis AL (2016) User-perceived source code quality
estimation based on static analysis metrics. In: Proceedings of the 2016 IEEE international
conference on software quality, reliability and security, QRS. Vienna, Austria, pp 100–107
8. Dimaridou V, Kyprianidis A-C, Papamichail M, Diamantopoulos T, Symeonidis A (2017)
Towards modeling the user-perceived quality of source code using static analysis metrics. In:
Proceedings of the 12th international conference on software technologies - volume 1, ICSOFT.
INSTICC, SciTePress, Setubal, Portugal, pp 73–84
9. Luisa M, Mariangela F, Pierluigi NI (2004) Market research for requirements analysis using
linguistic tools. Requir Eng 9(1):40–56
10. Kaindl H, Smialek M, Svetinovic D, Ambroziewicz A, Bojarski J, Nowakowski W, Straszak
T, Schwarz H, Bildhauer D, Brogan JP, Mukasa KS, Wolter K, Krebs T (2007) Requirements
specification language definition: defining the ReDSeeDS languages, deliverable D2.4.1. Public
deliverable, ReDSeeDS (Requirements driven software development system) project
11. Smialek M (2012) Facilitating transition from requirements to code with the ReDSeeDS tool.
In: Proceedings of the 2012 IEEE 20th international requirements engineering conference (RE),
RE’12. IEEE Computer Society, Washington, pp 321–322
12. Wynne M, Hellesoy A (2012) The cucumber book: behaviour-driven development for testers
and developers. Pragmatic Bookshelf, Raleigh
13. Mylopoulos J, Castro J, Kolp M (2000) Tropos: a framework for requirements-driven software
development. Information systems engineering: state of the art and research themes. Springer,
Berlin, pp 261–273
14. Boehm B, Basili VR (2001) Software defect reduction top 10 list. Computer 34:135–137
15. Kumar M, Ajmeri N, Ghaisas S (2010) Towards knowledge assisted agile requirements evo-
lution. In: Proceedings of the 2nd international workshop on recommendation systems for
software engineering, RSSE’10. ACM, New York, pp 16–20
16. Ghaisas S, Ajmeri N (2013) Knowledge-assisted ontology-based requirements evolution. In:
Maalej W, Thurimella AK (eds) Managing requirements knowledge. Springer, Berlin, pp 143–
167
17. Chen K, Zhang W, Zhao H, Mei H (2005) An approach to constructing feature models based
on requirements clustering. In: Proceedings of the 13th IEEE international conference on
requirements engineering, RE’05. IEEE Computer Society, Washington, pp 31–40
18. Alves V, Schwanninger C, Barbosa L, Rashid A, Sawyer P, Rayson P, Pohl C, Rummler A (2008)
An exploratory study of information retrieval techniques in domain analysis. In: Proceedings
of the 2008 12th international software product line conference, SPLC’08. IEEE Computer
Society, Washington, pp 67–76
19. Frakes W, Prieto-Diaz R, Fox C (1998) DARE: domain analysis and reuse environment. Ann
Softw Eng 5:125–141
20. Felfernig A, Schubert M, Mandl M, Ricci F, Maalej W (2010) Recommendation and decision
technologies for requirements engineering. In: Proceedings of the 2nd international workshop
on recommendation systems for software engineering, RSSE’10. ACM, New York, pp 11–15
21. Maalej W, Thurimella AK (2009) Towards a research agenda for recommendation systems in
requirements engineering. In: Proceedings of the 2009 2nd international workshop on managing
requirements knowledge, MARK’09. IEEE Computer Society, Washington, pp 32–39
22. Gomes P, Gandola P, Cordeiro J (2007) Helping software engineers reusing UML class diagrams.
In: Proceedings of the 7th international conference on case-based reasoning: case-based rea-
soning research and development, ICCBR’07. Springer, Berlin, pp 449–462
23. Robles K, Fraga A, Morato J, Llorens J (2012) Towards an ontology-based retrieval of UML class diagrams. Inf Softw Technol 54(1):72–86
24. Alspaugh TA, Antón AI, Barnes T, Mott BW (1999) An integrated scenario management
strategy. In: Proceedings of the 4th IEEE international symposium on requirements engineering,
RE’99. IEEE Computer Society, Washington, pp 142–149
25. Blok MC, Cybulski JL (1998) Reusing UML specifications in a constrained application domain.
In: Proceedings of the 5th Asia Pacific software engineering conference, APSEC’98. IEEE
Computer Society, Washington, p 196
26. Kelter U, Wehren J, Niere J (2005) A generic difference algorithm for UML models. In: Ligges-
meyer P, Pohl K, Goedicke M (eds) Software engineering, vol 64 of LNI. GI, pp 105–116
27. Diamantopoulos T, Symeonidis A (2017) Enhancing requirements reusability through semantic
modeling and data mining techniques. Enterprise information systems, pp 1–22
28. Dumitru H, Gibiec M, Hariri N, Cleland-Huang J, Mobasher B, Castro-Herrera C, Mirakhorli
M (2011) On-demand feature recommendations derived from mining public product descrip-
tions. In: Proceedings of the 33rd international conference on software engineering, ICSE’11.
ACM, New York, pp 181–190
29. Lim SL, Finkelstein A (2012) StakeRare: using social networks and collaborative filtering for
large-scale requirements elicitation. IEEE Trans Softw Eng 38(3):707–735
30. Castro-Herrera C, Duan C, Cleland-Huang J, Mobasher B (2008) Using data mining and recom-
mender systems to facilitate large-scale, open, and inclusive requirements elicitation processes.
In: Proceedings of the 2008 16th IEEE international requirements engineering conference,
RE’08. IEEE Computer Society, Washington, pp 165–168
31. Mobasher B, Cleland-Huang J (2011) Recommender systems in requirements engineering. AI
Mag 32(3):81–89
32. Konstan JA, Miller BN, Maltz D, Herlocker JL, Gordon LR, Riedl J (1997) GroupLens: applying
collaborative filtering to usenet news. Commun ACM 40(3):77–87
33. Romero-Mariona J, Ziv H, Richardson DJ (2008) SRRS: a recommendation system for security
requirements. In: Proceedings of the 2008 international workshop on recommendation systems
for software engineering, RSSE’08. ACM, New York, pp 50–52
34. Woo HG, Robinson WN (2002) Reuse of scenario specifications using an automated relational
learner: a lightweight approach. In: Proceedings of the 10th anniversary IEEE joint international
conference on requirements engineering, RE’02. IEEE Computer Society, Washington, pp 173–
180
35. Robinson WN, Woo HG (2004) Finding reusable UML sequence diagrams automatically. IEEE
Softw 21(5):60–67
36. Bildhauer D, Horn T, Ebert J (2009) Similarity-driven software reuse. In: Proceedings of the
2009 ICSE workshop on comparison and versioning of software models, CVSM’09. IEEE
Computer Society, Washington, pp 31–36
37. Salami HO, Ahmed M (2013) Class diagram retrieval using genetic algorithm. In: Proceedings
of the 2013 12th international conference on machine learning and applications - volume 02,
ICMLA’13. IEEE Computer Society, Washington, pp 96–101
38. Park W-J, Bae D-H (2011) A two-stage framework for UML specification matching. Inf Softw
Technol 53(3):230–244
39. Chawathe SS, Rajaraman A, Garcia-Molina H, Widom J (1996) Change detection in hierar-
chically structured information. SIGMOD Rec 25(2):493–504
40. Wang Y, DeWitt DJ, Cai JY (2003) X-Diff: an effective change detection algorithm for
XML documents. In: Proceedings 19th international conference on data engineering (Cat.
No.03CH37405), pp 519–530
41. Bonilla-Morales B, Crespo S, Clunie C (2012) Reuse of use cases diagrams: an approach based
on ontologies and semantic web technologies. Int J Comput Sci 9(1):24–29
42. Miller GA (1995) WordNet: a lexical database for English. Commun ACM 38(11):39–41
43. Finlayson M (2014) Java libraries for accessing the Princeton WordNet: comparison and evaluation. In: Orav H, Fellbaum C, Vossen P (eds) Proceedings of the 7th Global WordNet Conference.
Tartu, Estonia, pp 78–85
44. Pedersen T, Patwardhan S, Michelizzi J (2004) WordNet::Similarity: measuring the
relatedness of concepts. In: Demonstration papers at HLT-NAACL 2004, HLT-NAACL–
Demonstrations’04. Association for Computational Linguistics, Stroudsburg, pp 38–41
45. Lin D (1998) An information-theoretic definition of similarity. In: Proceedings of the 15th
international conference on machine learning, ICML’98. Morgan Kaufmann Publishers Inc,
San Francisco, pp 296–304
46. Agrawal R, Imieliński T, Swami A (1993) Mining association rules between sets of items
in large databases. In: Proceedings of the 1993 ACM SIGMOD international conference on
management of data, SIGMOD’93. ACM, New York, pp 207–216
47. Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In:
Proceedings of the 20th international conference on very large data bases, VLDB’94. Morgan
Kaufmann Publishers Inc, San Francisco, pp 487–499
48. Cormen TH, Leiserson CE, Rivest RL, Stein C (2009) Introduction to algorithms, 3rd edn. The
MIT Press, Cambridge, pp 390–396
Part III
Source Code Mining
Chapter 5
Source Code Indexing for Component
Reuse
5.1 Overview
The introduction of the open-source development paradigm has changed the way soft-
ware is developed and distributed. Several researchers and practitioners now develop
software as open source and store their projects online. This trend has acquired
momentum during the last decade, leading to the development of numerous online
software repositories, such as GitHub1 and Sourceforge.2 One of the main arguments
in favor of open-source software development is that it can be improved by harnessing
the available software repositories for useful code and development practices.
Another strong argument is that software developers can greatly benefit from soft-
ware reuse, in order to reduce the time spent for development, as well as to improve
the quality of their projects. Nowadays, the development process involves search-
ing for reusable code in general-purpose search engines (e.g., Google or Yahoo),
in open-source software repositories (e.g., GitHub), or even in question-answering
communities (e.g., Stack Overflow3 ). Indeed, the need for searching and reusing
software components has led to the design of search engines specialized in code
search. Code Search Engines (CSEs) improve search in source code repositories,
using keywords, or even using syntax-aware queries (e.g., searching for a method
with specific parameters). Another category of more specialized systems is that of
Recommendation Systems in Software Engineering (RSSEs). In the context of code
reuse, RSSEs are intelligent systems, capable of extracting the query from the source
code of the developer and processing the results of CSEs to recommend software
components, accompanied by information that would help the developer understand
and reuse the code. In this chapter, we focus on CSEs, while in the next we focus on
RSSEs.
1 https://github.com/.
2 https://sourceforge.net/.
3 http://stackoverflow.com/.
Although contemporary CSEs are certainly gaining their share in the developer
market, the proper design of a system that addresses the needs of developers directly
is still an open research question. Specifically, current CSEs fall short in one or more
of the following three important aspects: (a) they do not offer flexible options to
allow performing advanced queries (e.g., syntax-aware queries, regular expressions,
snippet, or project search), (b) they are not updated frequently enough to address the
ad hoc needs of developers, since they are not integrated with hosting sites, and (c)
they do not offer complete well-documented APIs, to allow their integration with
other tools. The latter offering is also very important for RSSEs; their aim of offering recommendations to developers in a more personalized context is strongly tied to their need for specialized, open-API CSEs in the backend, as has been highlighted by several researchers [1–4].
In this chapter, we present AGORA,4 a CSE that satisfies the aforementioned
requirements [5]. AGORA indexes all source code elements, while the code is also
indexed at token level and the hierarchy of projects and files is preserved. Thus, it
enables syntax-aware search, i.e., searching for classes with specific methods, etc.,
and allows searching using regular expressions, searching for snippets or for multiple
objects (at project level). The stored projects are continuously updated owing to the
integration of AGORA with GitHub. Finally, AGORA offers a REST API that allows
integration with other tools.
Given the status quo in software engineering practices and the research on search
engines for source code (see next subsection), CSEs should abide by the following
criteria:
• Syntax-awareness: The system must support queries given specific elements of
a file, e.g., class name, method name, return type of method, etc.
• Regular expressions: Regular expressions have to be supported so that the user
can search for a particular pattern.
• Snippet search: Although most CSEs support free text search, this is usually
ineffective for multi-line queries. A CSE should allow searching for small blocks
of code, known as snippets.
• Project search: The system must allow the user to search for projects given a file query. In other words, the system must store the data with some kind of structure (schema) so that complex queries of the type “find project that has file” can be answered.
4 The term “agora” refers to the central spot of ancient Greek city-states. Although it is roughly translated to the English term “market”, “agora” can be better viewed as an assembly, a place where people met not only to trade goods, but also to exchange ideas. It is a good fit for our search engine since we envision it as a place where developers can (freely) distribute and exchange their source code, and subsequently their ideas.
• Integration with hosting websites: Importing projects into the CSE repository
and keeping the index up-to-date should be straightforward. Hence, CSEs must
integrate well with source code hosting websites (e.g., GitHub).
• API: The system must have a public API. Although this criterion may seem redun-
dant, it is crucial because a comprehensive public API allows different ways of
using the system, e.g., by creating a UI, by some IDE plugin, etc.
Note that the list of desired features/criteria outlined above covers the most com-
mon aspects of CSEs; there are several other features that could be considered in a
code search scenario, such as search for documentation. However, we argue that the
selected list of criteria outlines a reasonable set of features for code search scenarios
and especially for integrating with RSSEs, a view also supported by current research
[3, 6]. In the following subsections, we provide a historical review for the area of
CSEs and discuss the shortcomings of well-known CSEs with respect to the above
criteria. This way we justify our rationale for creating a new CSE.
Given the vastness of code in online repositories (as already analyzed in Chap. 2),
the main problem is how to search for and successfully retrieve code according to
one’s needs. Some of the most popular CSEs are shown in Table 5.1.
An important question explored in this chapter is whether these CSEs are actually
as effective and useful as they can be for the developer. Codase, one of the first engines
with support for syntax-aware search (e.g., search for class or method name), seemed
promising; however, it didn’t go past the beta phase. Krugle Open Search, on the
other hand, is one of the first CSEs that is still active. The main advantage of Krugle
is its high-quality results; the service offers multiple search options (e.g., syntax-
aware queries) on a small but well-chosen number of projects, including Apache
Ant, Jython, etc. Although Krugle does indeed cover criteria 1 and 3, it does not
integrate well with hosting sites, and it does not offer a public API.
In 2006, one of the big players, Google, decided to enter the CSE arena. The
Google Code Search Engine was fast, supported regular expressions, and had over 250000 projects in which to search for source code. Its few shortcomings concerned criteria 3 and 4, since the engine did not allow snippet search or complex intra-project queries. However, in 2011 Google announced the forthcoming shutdown of the service,5 and finally shut it down in 2013. Despite its shutdown, the service increased the interest in CSEs, leading developers to search for alternatives.
One such early alternative was MeroBase [6]. The index of MeroBase contained
projects as software artifacts and supported searching and retrieving source code and
5 Source: http://googleblog.blogspot.gr/2011/10/fall-sweep.html.
projects according to several criteria, including source code criteria (e.g., class and
method names), licensing criteria, the author of the artifact, etc. As noted by the
authors [6], the main weakness of MeroBase is its flat structure, which is hard to
change given the underlying indexing. Due to this, MeroBase does not support the
advanced search capabilities described in criteria 2 and 3, while its index, though
large, is not updatable using a hosting site. A similar proposal is Sourcerer [7], which
employs a schema with multiple syntax-aware features extracted from source code
and thus supports different types of queries. However, Sourcerer is also not integrated
with code repositories, and thus its index has to be manually updated.
GitHub, the most popular source code hosting site today, has its own search engine
for its open-source repositories. GitHub Search is fast and reliable, providing up-to-
date results not only for source code, but also for repository statistics (commits, forks,
etc.). Its greatest strength when it comes to source code is its huge project index,
while its API is well defined. However, its search options are rather outdated; there is no support for regular expressions or for syntax-aware queries in general. As a result, it covers none of the first four criteria.
BlackDuck Open Hub is also one of the most popular CSEs with a quite large
index. BlackDuck acquired the former CSE of Koders in 2008 and merged it with the
index of Ohloh (obtained in 2010) to finally provide Ohloh Code in 2012. Although
Table 5.2 Feature comparison of popular source code search engines and AGORA
Search engine Criterion 1 Criterion 2 Criterion 3 Criterion 4 Criterion 5 Criterion 6
syntax- regular snippet project repositories API
awareness expressions search search integration support
AGORA
Codase × × × × –
Krugle × × × ×
Google CSE – × ×
MeroBase × × – ×
Sourcerer × – ×
GitHub CSE × × × ×
BlackDuck × × × ×
Searchcode × × ×
✓ Feature supported; × Feature not supported; – Feature partially supported
the service initially aimed at filling the gap left by Google Code Search,6 it later grew
to become a comprehensive directory of source code projects, providing statistics,
license information, etc. Ohloh Code was renamed to BlackDuck Open Hub Code
Search in 2014. The service has support for syntax-aware queries and offers plugins
for known IDEs (Eclipse, Microsoft Visual Studio). However, it does not support
snippet and project search (criteria 3 and 4).
Finally, a promising newcomer in the field of CSEs is Searchcode. Under its
original name (search[co.de]) the service offered a documentation search for popular
libraries. Today, it crawls several known repositories (GitHub, Sourceforge, etc.)
for source code, and thus its integration with source code hosting sites is quite
advantageous. However, the engine does not support syntax-aware queries, snippet,
or project search, and hence criteria 1, 3, and 4 are not met. In conclusion, none of the CSEs presented in Table 5.1 (nor any other, to the best of our knowledge) adequately covers all of the above criteria.
The effectiveness of a CSE depends on several factors, such as the quality of the
indexed projects and the way these projects are indexed. We compare AGORA with
other known CSEs and draw useful conclusions. In particular, since the scope of
this chapter lies in creating a CSE that meets the criteria defined in Sect. 5.2.1, we
evaluate whether AGORA fulfills these criteria compared to other state-of-the-art
engines. This comparison is shown in Table 5.2.
6 Source: http://techcrunch.com/2012/07/20/ohloh-wants-to-fill-the-gap-left-by-google-code-
search/.
Concerning the search-related features, i.e., criteria 1–4, AGORA is the CSE that
allows the most flexible queries. It supports syntax-aware queries, regular expres-
sions, even searching for snippets and projects. Other CSEs, such as Codase or Black-
Duck, cover the syntax-awareness and regular expression criteria, without, however,
support for snippet or project search. Krugle supports searching for snippets but
does not allow using regular expressions or searching for projects. MeroBase and
Sourcerer are the only other CSEs (apart from AGORA) to partially support complex project search, aiming, however, mostly at the documentation of the projects.
Integration with hosting websites is another neglected feature of modern CSEs.
GitHub Search and the discontinued Google Code Search obviously support it, since the hosting sites run on their own servers (GitHub and Google Code, respectively). Other than these services, though, Searchcode is the only other service to offer an index that is integrated with hosting sites, namely, GitHub, Sourceforge, etc.
The sixth criterion is also quite important since CSEs with comprehensive APIs
can easily be extended by building web UIs or IDE plugins. Most CSEs fulfill this
criterion, with GitHub and Searchcode having two of the most comprehensive APIs.
The API of AGORA, which actually wraps the API provided by Elasticsearch, is
quite robust and allows performing different types of queries.
We do not claim to have created an overall stronger CSE; this would also require
evaluating several aspects, such as the size of its index or how many languages
it supports; AGORA is still experimental. However, its architecture is mostly agnostic both to the programming language and to the hosting facilities; adding new languages and/or integrating with new code hosting services (other than GitHub) is feasible.
Practically, AGORA is a solution suitable for supporting research in software reuse.
5.3 The AGORA Code Search Engine
5.3.1 Overview
The overall architecture of AGORA is shown in Fig. 5.1. Elasticsearch [8] was selected as our search server since it abstracts Lucene well, and allows harnessing its power, while at the same time overcoming any difficulties coming from Lucene’s “flat” structure. Storage in Elasticsearch is quite similar to the NoSQL document-
based paradigm. Objects are stored in documents (the equivalent of records in a rela-
tional schema), documents are stored inside collections (the equivalent of relational
tables), and collections are stored in indexes (the equivalent of relational databases).
We define two collections: projects and files. Projects contain information about
software projects, while files contain information about source code files.
As depicted in the figure, the Elasticsearch Index is accessible through the Elas-
ticsearch Server. In addition, an Apache Server is instantiated in order to control
all communication to the server. The Apache server allows only authorized input
to the index, thus ensuring that only the administrator can perform specific actions
(e.g., delete a document, backup the index, etc.). For the end users of AGORA, the
server is accessible using the comprehensible JSON API offered by Elasticsearch.
The controller shown in Fig. 5.1 is one of the most important components of
AGORA. Built in Python, the controller provides an interface for adding GitHub projects to Elasticsearch. Specifically, given a GitHub project address, the controller
provides the address to the downloader. The downloader checks whether the source
code of the project is already downloaded, and downloads or updates it accordingly. The downloader comprises two components, the GitHub API Downloader and the
Git Downloader. The GitHub API Downloader fetches information about a GitHub
project using the GitHub API [9], including the git address of the project and its
source code tree. The Git Downloader is actually a wrapper for the Git tool; it clones
GitHub projects or pulls them if they already exist. Given the tree provided by the
API, the downloader can provide the controller with a list of files and sha ids. After
that, the sha ids are used by the controller in order to determine which of the files
need to be added, updated, or deleted from the index.
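To illustrate the idea, here is a minimal Python sketch of this sha-based bookkeeping; the function and the data structures are hypothetical and only demonstrate how the GitHub tree can be diffed against the index.

```python
# Illustrative sketch (not the actual AGORA controller): decide which files
# must be added, updated, or deleted, by comparing the sha ids reported by
# the GitHub API with the sha ids already stored in the index.
def diff_project_files(github_tree, indexed_files):
    """Both arguments are dicts mapping file path -> sha id."""
    to_add = [path for path in github_tree if path not in indexed_files]
    to_update = [path for path in github_tree
                 if path in indexed_files and github_tree[path] != indexed_files[path]]
    to_delete = [path for path in indexed_files if path not in github_tree]
    return to_add, to_update, to_delete

# Hypothetical usage
tree = {"src/Main.java": "a1b2", "src/Util.java": "c3d4"}
index = {"src/Main.java": "a1b2", "src/Old.java": "e5f6"}
print(diff_project_files(tree, index))
# (['src/Util.java'], [], ['src/Old.java'])
```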
Another important step is performed before adding the new documents to the
index. The source code of the project is analyzed using the Source Code Parser. The
Source Code Parser receives as input a source code file and extracts its signature.
The signature of a file is represented as a JSON document containing all elements
defined or used in the file, e.g., class names, method names, etc.
The JSON documents are added to the projects and the files index. The schemata
of the fields are dictated by the mappings. AGORA has two mappings, one for the
JSON documents of the projects index and one for those of the files index. When
a document is added to the index, its fields (JSON entries) are analyzed using an
analyzer. This step is important since it allows Elasticsearch to perform fast queries
over the analyzed fields.
The index of AGORA can obviously be used to store projects written in different
programming languages. Specifically, adding source code files of a project written
in a specific programming language involves three tasks: (a) developing a Source
Code Parser (to extract information from the source code), (b) creating appropriate
analyzers that conform to the specifics of the “new” programming language, and (c)
designing a structure that shall store the elements of the source code included in each
file. Without loss of generality, in the following paragraphs, we illustrate how these
tasks can be performed for the Java programming language.
The Source Code Parser for Java projects is built using the Java Compiler Tree
API [10] offered by Oracle. It receives as input one or more files and traverses
the Abstract Syntax Tree (AST) of each file in order to output all the Java-specific
elements defined or used in the file: packages, imports, classes, methods, variables,
etc. Additionally, it contains information about these elements, such as the scope of
variables and the return type of methods. Concerning the analyzers of Elasticsearch,
these also have to conform to the characteristics of Java files (e.g., support indexing
camel case terms). The analyzers that we have developed for the Java language are
presented in Sect. 5.3.2, while Sect. 5.3.3 illustrates the mappings used for storing
projects and files.
5.3.2 Analyzers
Field analysis is one of the most important features of indexing servers, since it
affects the quality and efficiency of the returned results. In Elasticsearch, analyzers
comprise a single tokenizer and zero or more token filters. Given a field, the tokenizer
splits its value into tokens and the token filters accept a stream of tokens which they
can modify, delete, or even add. For example, in a text search engine scenario, a
tokenizer would split a document into words according to spaces and punctuation,
and then it would probably apply a token filter that makes all words lowercase and
another one that would remove stop words. Then all fields would be stored in the
index in tokens, thus allowing fast and accurate queries. The same approach holds
for source code analysis.
Elasticsearch has several default analyzers and even allows creating custom ones
given its collection of tokenizers and token filters. For our scenario, we used the
standard analyzer offered by Elasticsearch and created also three custom analyzers:
the Filepath analyzer, which is a custom analyzer for file paths, the CamelCase
analyzer, which handles camel case tokens, and the Javafile analyzer, for analyzing
java files.
The standard analyzer comprises the Standard Tokenizer, the Lowercase Token Filter, and a Stop Token Filter.7 The standard analyzer is the default analyzer of Elasticsearch, since it is effective for most fields.
The filepath analyzer is used to analyze file or Internet paths. It comprises the Path
Hierarchy Tokenizer and the Lowercase Token Filter of Elasticsearch. The analyzer
receives input in the form “/path/to/file” and outputs tokens “/path”, “/path/to”, and
“/path/to/file”.
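The behavior of this analyzer can be illustrated with a small Python sketch that emulates the Path Hierarchy Tokenizer followed by the Lowercase Token Filter; this is only an approximation, not the actual Elasticsearch implementation.

```python
# Minimal emulation of the Filepath analyzer: emit every path prefix
# (Path Hierarchy Tokenizer) and lowercase it (Lowercase Token Filter).
def filepath_tokens(path):
    parts = [p for p in path.split("/") if p]
    tokens, prefix = [], ""
    for part in parts:
        prefix += "/" + part
        tokens.append(prefix.lower())
    return tokens

print(filepath_tokens("/Path/To/File"))
# ['/path', '/path/to', '/path/to/file']
```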
Although the standard analyzer is effective for most fields, the fields derived from
source code files usually conform to the standards of the language. Since our tar-
geted language is Java, we created an analyzer that parses camel case text. This is
accomplished using a pattern tokenizer, which uses a regular expression pattern in
order to split the text into tokens. For camel case text, we used the tokenizer shown
in [12]. The pattern is shown in Fig. 5.2.
Upon splitting the text into tokens, our CamelCase analyzer transforms the tokens
to lowercase using the Lowercase Token Filter of Elasticsearch. No stop words are
removed.
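As an illustration, the following Python sketch approximates the behavior of the CamelCase analyzer; the regular expression below is a simplified stand-in for the actual pattern of Fig. 5.2.

```python
import re

# Illustrative approximation of the CamelCase analyzer: split identifiers on
# camel-case boundaries and on non-alphanumeric characters, then lowercase
# the resulting tokens (no stop words are removed).
def camelcase_tokens(text):
    # insert a space before an uppercase letter that follows a lowercase letter
    # or digit, or that starts a new word at the end of an acronym
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", " ", text)
    return [tok.lower() for tok in re.split(r"[^A-Za-z0-9]+", spaced) if tok]

print(camelcase_tokens("DateFormatter.formatXMLDate"))
# ['date', 'formatter', 'format', 'xml', 'date']
```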
The Javafile analyzer is used to analyze Java source code files. The analyzer employs
the standard tokenizer and the Stop Token Filter of Elasticsearch. It splits the text
7 Elasticsearch supports removing stop words for several languages. In our case, the default choice
of English is adequate.
into tokens and the most common Java source file tokens are removed. The Java stop
words used are shown in Fig. 5.3.
The stop word list of Fig. 5.3 contains all the terms that one would want to omit
when searching for snippets. Note, however, that we decided to keep the types (e.g.,
“int”, “float”, “double”) since they are useful for finding similar code snippets. The
type “void”, however, is removed since it is so frequent that it would practically add
nothing to the final result.
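The effect of the Javafile analysis step can be sketched as follows; the stop word set below is only a representative subset of the list of Fig. 5.3 (note that primitive types are kept, whereas “void” is removed).

```python
import re

# Sketch of the Javafile analysis step: tokenize Java source text and drop
# common Java stop words. The set below is a representative subset only.
JAVA_STOP_WORDS = {"public", "private", "protected", "static", "final",
                   "class", "import", "package", "return", "new", "void"}

def javafile_tokens(source):
    tokens = [tok for tok in re.split(r"[^A-Za-z0-9_]+", source) if tok]
    return [tok for tok in tokens if tok not in JAVA_STOP_WORDS]

print(javafile_tokens("public int push(int value) { return this.size; }"))
# ['int', 'push', 'int', 'value', 'this', 'size']
```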
As noted in Sect. 5.3.1, Elasticsearch stores data in documents. The way each doc-
ument is stored (i.e., its schema) and the way each field is analyzed by one of the
analyzers of Sect. 5.3.2 are determined by its mapping. For AGORA and for the
Java language, we defined two mappings, one for project documents and one for
file documents. Additionally, we connected these two mappings to each other in
a parent-child relationship, where each file document maps to one parent project
document. We present the two mappings in the following paragraphs.
Projects are identified uniquely in AGORA by their name and username joined with
a slash (/). Thus, the _id field of the mapping has values of the form “user/repo”.
The fields for project documents are shown in Table 5.3, along with an example
instantiation of the mapping for the egit Eclipse plugin.
Most of the project’s fields are useful to be stored, yet analyzing all of them does
not provide any added value. Hence, URLs (the “url”, the GitHub API “git_url”)
and “default_branch” are not analyzed. The field “fullname”, however, which is the
same as the “_id”, is analyzed using the standard analyzer so that it is searchable.
In addition, one can search by the “user” and “name” fields, corresponding to the
owner and the project name, which are also analyzed using the standard analyzer.
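For illustration, a project document for the egit Eclipse plugin might look roughly as follows; the field names follow the mapping described above, while the values shown here are only indicative (the exact example is the one given in Table 5.3).

```python
# Hypothetical project document of the "projects" collection.
project_doc = {
    "_id": "eclipse/egit",                           # user/repo
    "fullname": "eclipse/egit",                      # copy of _id, standard analyzer
    "user": "eclipse",                               # standard analyzer
    "name": "egit",                                  # standard analyzer
    "url": "https://github.com/eclipse/egit",        # stored, not analyzed
    "git_url": "git://github.com/eclipse/egit.git",  # stored, not analyzed
    "default_branch": "master",                      # stored, not analyzed
}
```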
As already mentioned, we have structured the project and file mappings using the
parent-child relation provided by Elasticsearch. Each file has a “_parent” field, which
is a link to the respective document of the project. This field can be used to perform
has_child queries (see Sect. 5.4.2.3) or retrieve all the files of a project. The “_id”
of each file has to be unique; it is formed by the full path of the file including also the
project, i.e., user/project/path/to/file. Table 5.4 contains the rest of the fields as well
as an example for the DateFormatter.java file of the Eclipse egit plugin of Table 5.3.
As in projects, the “_id” is also copied to the analyzed field “fullpathname” so
that it is searchable. Both this and the “path” field, which refers to the path without
the project prefix, are analyzed using the filepath analyzer. The filename of each file
(“name”) and the name of the project it belongs to (“project”) are analyzed using the
standard analyzer.8
Again, some fields do not need to be searchable, yet are useful to be stored. The
fields “mode” and “type”, which refer to the git file mode9 and the git type,10 as well
as the “extension” and the “url” of each file are not analyzed. The field “sha” is also
not analyzed. However, it should be mentioned that “sha” is very important since it
maintains an id for the content of the file. This field is used to check whether a file
needs to be updated when updating a project.
Furthermore, since files also contain source code, several important fields are
defined to cover the source code search. At first, the content of each file is analyzed
with two analyzers, the standard analyzer and the Javafile analyzer, storing the results
in the fields “content” and “analyzedcontent”, respectively. This analysis allows
performing strict text matching on the “content” field as well as loose search on
“analyzedcontent”, given that the stop words are removed in the latter.
After that, the instances for the subfields of the field “code” are extracted using
the Source Code Parser (see Fig. 5.1). Two such fields that are present in Java files are
“package” and “imports” (which are stored as a list since they can be more than one).
The two fields are analyzed using the CamelCase analyzer, since both the package
of a Java file and its imports follow the Java coding conventions. Since each Java file
has one public class and possibly other non-public classes, the mapping also contains
the fields “class” and “otherclasses”.
Note that “otherclasses” is a list of type “nested”. This is a special type of field
offered by Elasticsearch that allows querying subfields in a coherent way. Specifically,
given that a class has two subfields, “name” and “type” (class or interface), if we
want to search for a specific name and type both for the same class, then the list must
be “nested”. Otherwise, our search may return a document that has two different
classes, one having only the name and one having only the type that we searched
for (more about nested search in Sect. 5.4.2.1). The subfields of a class are shown
in Table 5.5, including an example for the DateFormatter class corresponding to the
DateFormatter.java file of Table 5.4.
As depicted in Table 5.5, the “extends” and “implements” fields are analyzed using
the CamelCase analyzer. Note that “implements” is of type “array”, since in Java
a class may implement more than one interface. A Java class can also have many
methods, with each of them having its own name, parameters, etc. So the “methods”
field is of type “nested” so that methods can be searched using an autonomous
query, while also allowing multi-method queries. The “name” and the “returntype”
of a method are string fields analyzed using the CamelCase analyzer, while the
8 Note that the standard analyzer initially splits the filename into two parts, the filename without
extension and the extension since it splits according to punctuation.
9 One of 040000, 100644, 100664, 100755, 120000, or 160000 which correspond to directory,
regular non-executable file, regular non-executable group-writeable file, regular executable file,
symbolic link, or gitlink, respectively.
10 One of blob, tree, commit, or tag.
Table 5.5 The inner mapping of a class for the files documents of AGORA

Name | Type | Analyzer | Example
extends | string | CamelCase | –
implements | array | CamelCase | [JsonSerializer<Date>, ...]
methods | nested | |
  modifiers | array | – | [public]
  name | string | CamelCase | serialize
  parameters | nested | |
    name | string | CamelCase | Date
    type | string | CamelCase | Date
  returntype | string | CamelCase | JsonElement
  throws | array | CamelCase | []
modifiers | array | – | [public]
name | string | CamelCase | DateFormatter
type | string | CamelCase | Class
variables | nested | |
  modifiers | array | – | [private, final]
  name | string | CamelCase | Formats
  type | string | CamelCase | DateFormat[]
innerclasses | (1-level) recursive of the class mapping | |
exceptions (“throws”) are stored in arrays and also indexed using the CamelCase
analyzer. The field “modifiers” is also an array; however, it is not analyzed since
its values are specific (one of public, private, final, etc.). The subfield “parameters”
holds the parameters of the method, including the “name” and the “type” of each
parameter, both analyzed using the CamelCase analyzer.11
The class “modifiers” are not analyzed, whereas the class “name” is analyzed. The
field “type” is also not analyzed since its values are specific (one of class, interface,
type). The field “variables” holds the variables of classes in a “nested” type, and
the subfields corresponding to the “name” and “type” of each variable are analyzed
using the CamelCase analyzer. Finally, inner classes have the same mapping as in
Table 5.5, however without containing an “innerclasses” field, thus allowing only
one level of inner classes. Allowing more levels would be ineffective.
11 The CamelCase analyzer is quite effective for fields including Java types; the types are conven-
tionally in camelCase, while the primitives are not affected by the CamelCase tokenizer, i.e., the
text “float” results in the token “float”.
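Putting the above together, the “code” part of a file document might look roughly as follows. This is a hedged sketch based on the example values of Table 5.5; the package and import values are hypothetical.

```python
# Hypothetical "code" signature of a file document (structure per Sect. 5.3.3,
# example values taken from Table 5.5; package/imports are made up).
code_signature = {
    "package": "org.example.formatting",            # hypothetical
    "imports": ["com.google.gson.JsonSerializer"],  # hypothetical
    "class": {
        "name": "DateFormatter",
        "type": "Class",
        "modifiers": ["public"],
        "implements": ["JsonSerializer<Date>"],
        "methods": [{                               # nested
            "name": "serialize",
            "modifiers": ["public"],
            "returntype": "JsonElement",
            "parameters": [{"name": "Date", "type": "Date"}],
            "throws": [],
        }],
        "variables": [{                             # nested
            "name": "Formats",
            "type": "DateFormat[]",
            "modifiers": ["private", "final"],
        }],
        "innerclasses": [],                         # at most one level deep
    },
    "otherclasses": [],
}
```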
The scoring scheme of Elasticsearch [13] is used to rank the retrieved results for a
query according to their relevance. When forming a query, one may select any of the
fields defined in the mappings of AGORA (see Sect. 5.3.3) or combinations among
them. As already mentioned, all analyzed fields (using any one of the four defined
analyzers) are indexed as split tokens, disregarding punctuation and stop words.
At first, the text of the query is also analyzed using the analyzer that corresponds
to the addressed field. After that, the similarity between the terms/tokens of the query
and the terms/tokens of the document field is computed using the vector space model
[14]. In the vector space model, any term is a dimension, thus each document (or
the query) can be represented as a vector. Given a collection of documents D, the
relevance weight of a term t in a document d is computed using the term frequency-
inverse document frequency (tf-idf):
|D| 1
wt,d = f (t, d) · 1 + log ·√ (5.1)
|d ∈ D : t ∈ d| + 1 |d|
The first term of the above product is the frequency of the term in the document (tf ),
computed as the square root of the number of times the term appears in the document
f (t, d). The second term is the inverse document frequency (idf ), computed as the
logarithm of the total number of documents |D| divided by the number of docu-
ments that contain the term |d ∈ D : t ∈ d|. Finally, the third term is a field-length
normalizer that accounts for the number of terms in the document |d|, and effectively
penalizes larger documents. In the case of non-analyzed fields, the content of the field
has to be exactly specified and the system will return a binary match-no match deci-
sion, i.e., the result of Eq. (5.1) will be either 1 or 0. Array fields are treated like
already split tokens.
Thus, the document vectors are represented using the term weights defined by the
above equation. The final score between a query vector q and a document vector d
is computed using the cosine similarity between them:
$$ score(q,d) = \frac{q \cdot d}{|q| \cdot |d|} = \frac{\sum_{i=1}^{n} w_{t_i,q} \cdot w_{t_i,d}}{\sqrt{\sum_{i=1}^{n} w_{t_i,q}^2} \cdot \sqrt{\sum_{i=1}^{n} w_{t_i,d}^2}} \qquad (5.2) $$
where $t_i$ is the $i$th term of the query and $n$ is the total number of terms in the query.
Finally, when a query addresses more than one field, an “and” relationship among the fields is implied. Then, assuming that the score of Eq. (5.2) is defined per field, i.e., $score(q, f)$ is the score for query $q$ and field $f$, the score for each document $d$ is computed using the scoring function:
$$ score(q,d) = norm(q) \cdot coord(q,d) \cdot \sum_{f \in q} score(q,f) \qquad (5.3) $$
where $norm(q)$ is the query normalization factor, defined as the inverse square root of the sum of the squared idf weights of each term in the query, i.e., $1/\sqrt{\sum_{i=1}^{n} w_{t_i,q}^2}$, and $coord(q,d)$ is the coordination factor, defined as the number of matching query terms divided by the total number of terms in the query, i.e., $|t \in q : t \in d| / |q|$.
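For intuition, the scoring scheme of Eqs. (5.1)–(5.3) can be transcribed in a few lines of Python; this is a sketch of the formulas as given above, not of the actual Lucene/Elasticsearch implementation.

```python
import math

# Straightforward transcription of Eqs. (5.1)-(5.3) for intuition only.
def term_weight(freq, num_docs, doc_freq, doc_len):
    """Eq. (5.1): weight of a term in a document (or in the query)."""
    tf = math.sqrt(freq)                               # term frequency
    idf = 1.0 + math.log(num_docs / (doc_freq + 1.0))  # inverse document frequency
    norm = 1.0 / math.sqrt(doc_len)                    # field-length normalizer
    return tf * idf * norm

def cosine_score(query_weights, doc_weights):
    """Eq. (5.2): cosine similarity between the query and document vectors."""
    dot = sum(wq * wd for wq, wd in zip(query_weights, doc_weights))
    qnorm = math.sqrt(sum(wq ** 2 for wq in query_weights))
    dnorm = math.sqrt(sum(wd ** 2 for wd in doc_weights))
    return dot / (qnorm * dnorm) if qnorm and dnorm else 0.0

def document_score(field_scores, query_weights, matching_terms, total_terms):
    """Eq. (5.3): combine per-field scores with the query norm and coord factor."""
    norm_q = 1.0 / math.sqrt(sum(wq ** 2 for wq in query_weights))
    coord = matching_terms / total_terms
    return norm_q * coord * sum(field_scores)
```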
AGORA has a comprehensive REST API as well as a web UI, shown in Fig. 5.4.
Although the API of AGORA allows performing any complex request supported by
Elasticsearch, in practice we may distinguish among four types of searches, which
are also the ones supported by the web interface: Simple Search, Advanced Search,
Snippet Search, and Project Search.
The Simple Search feature involves the Elasticsearch field “_all”, which allows
searching for all the fields of a file (e.g., filename, content, etc.). This type of search
further supports the use of regular expressions.
The Advanced Search functionality allows searching for a class with specific
attributes including imports, variable names and types, method names, return types
and their parameters’ names and types, etc. Using this functionality, developers can
easily find reusable components with specific inputs/outputs to integrate to their
source code or even find out how to write code in context.
The Snippet Search allows providing a source code snippet as input in order to
find similar code. Although using syntax-aware queries yields satisfactory results
at class or method level, when it comes to finding API usage examples, searching
for snippets can be quite helpful, as most API usage scenarios involve method calls
within a few lines of code.
Finally, the Project Search functionality allows searching for software projects
that involve certain components or follow a specific structure. Thus, one may search
for projects that conform to specific design patterns [15] or architectures (e.g., MVC).
Currently, the index is populated given a list of the 3000 most popular GitHub
projects, as determined by the number of stars, which, upon removing forked projects
and invalid projects (e.g., without java code), resulted in 2709 projects. Preliminary
research has indicated that highly rated projects (i.e., with a large number of stars/forks) also exhibit high quality [16], have sufficient documentation/readme files [17, 18],
and involve frequent maintenance releases, rewrites and releases related to both func-
tional and non-functional requirements [19]. Upon further measuring the reusability
of software components with static analysis metrics [20, 21], we could argue that
these projects include reusable code; indicatively, we have shown that roughly 90%
of the source code of the 100 most popular GitHub repositories is reusable and of
high quality [20]. Given also that the projects are highly diverse, spanning from
known large-scale projects (e.g., the Spring framework or the Clojure programming
language) to small-scale projects (e.g., android apps), we may conclude that our
index contains a variety of reusable solutions.
In Sect. 5.4.2, we provide example queries performed on the index of AGORA,
indicating the corresponding usage scenarios and illustrating the syntax required to
search in the AGORA API. Furthermore, in Sect. 5.4.3, we provide an example end-
to-end scenario for using AGORA, this time through its web interface, to construct
a project with three connected components.
One of the most common scenarios studied by current literature [3, 4, 6, 22] is that of
searching for a reusable software component. Components can be seen as individual
entities with specific inputs/outputs that can be reused. The simplest form of a Java
component is a class. Thus, a simple scenario includes finding a class with specific
functionality. Using the API of AGORA, finding a class with specific methods and/or
variables is straightforward. For example, assume a developer wants to use a data
structure similar to a stack with the known push/pop functionality. Then the query
would be formed as shown in Fig. 5.5. The query is a bool query that allows searching
using multiple conditions that must hold at the same time. In this case, the query
involves three conditions: one for the name of the class which must be “stack”, a
nested one for a method with name “push” and return type “void”, and another nested
one for a method with name “pop” and return type “int”. The response for the query
is shown in Fig. 5.6.
On our installed snapshot of AGORA, the query took 109 ms and 412275 docu-
ments (“hits”) were found. For each document, the “_score” is determined according
to the scoring scheme defined in Sect. 5.3.4. In this case, the developer can examine
Fig. 5.5 Example query for a class “stack” with two methods, one named “push” with return type
“void” and one named “pop” with return type “int”
Fig. 5.7 Example query for a class with name containing “Export”, which extends a class named
“WizardPage” and the Java file has at least one import with the word “eclipse” in it
the results and select the one that fits best his/her scenario. Note that Elasticsearch
documents also have a special field named “_source”, which contains all the fields
of the document. So, one can iterate over the results and access the source code of
each file by accessing the field “source.content”.
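As an indication of how such a search could be issued programmatically, the sketch below posts a bool query with nested clauses, in the spirit of Fig. 5.5, to the Elasticsearch endpoint wrapped by AGORA; the endpoint URL, the index/collection names, and the dotted field paths are assumptions based on the mappings of Sect. 5.3.3.

```python
import requests

# Hedged sketch: a bool query with two nested method clauses.
query = {
    "query": {"bool": {"must": [
        {"match": {"code.class.name": "stack"}},
        {"nested": {"path": "code.class.methods", "query": {"bool": {"must": [
            {"match": {"code.class.methods.name": "push"}},
            {"match": {"code.class.methods.returntype": "void"}},
        ]}}}},
        {"nested": {"path": "code.class.methods", "query": {"bool": {"must": [
            {"match": {"code.class.methods.name": "pop"}},
            {"match": {"code.class.methods.returntype": "int"}},
        ]}}}},
    ]}}
}

response = requests.post("http://localhost:9200/agora/files/_search", json=query)
for hit in response.json()["hits"]["hits"]:
    # the source code of each result is available in hit["_source"]["content"]
    print(hit["_score"], hit["_id"])
```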
Another scenario that is often handled using RSSEs [23] is that of finding out how to
write code given some context. This scenario occurs when the developer wants to use
a specific library or extend some functionality and wants to find relevant examples.
An example query for this scenario is shown in Fig. 5.7.
In this case, the developer wants to find out how to create a page for an Eclipse
wizard in order to export some element of his/her plugin project. Given the Eclipse
Plugin API, it is easy to see that wizard pages extend the “WizardPage” class. So
the query includes classes with name containing the word “Export” that extend the
“WizardPage” class, while also containing “eclipse” imports. The results of this
query are shown in Fig. 5.8.
The query retrieved 16681 documents in 16 ms. Note that not all results conform
to all the criteria of the query; e.g., a result may extend “WizardPage” while its name does not contain the word “Export”. However, the results that conform to most criteria
appear at the top positions.
The third scenario demonstrated here is particularly novel in the area of CSEs.
Often developers would like to find similar projects to theirs in order to understand
how other developer teams (probably experienced ones or ones involved in large-
scale projects) structure their code. Performing a generalization, one could argue that
structure can be interpreted as a design pattern. In the context of a search, however,
it could also mean a simple way to structure one’s code. Such a query is shown in
Fig. 5.9.
The query is formed using the “has_child” functionality of Elasticsearch so that
the returned projects have (at least) one class that implements an interface with
name containing “Model”, and another class that implements an interface with name
similar to “View”, and finally a third one that implements an interface with name con-
taining “Controller”. In short, projects following the Model-View-Controller (MVC)
architecture are sought. Additionally, the query requests that the projects also have
classes that extend “JFrame”, so that they use the well-known Swing GUI library. In
fact, this query is quite reasonable since GUI frameworks are known to follow such
architectures. The response for this query is shown in Fig. 5.10.
The query took 16 ms and returned 679 project documents. The files of these
projects can be found using the “_parent” field of the files mapping.
Fig. 5.9 Example query for project with files extending classes “Model”, “View”, and “Controller”
and also files extending the “JFrame” class
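A hedged sketch of how such a project query could be composed with the “has_child” functionality is shown below; the child type, the collection names, and the field paths are assumptions based on the mappings of Sect. 5.3.3.

```python
# Hedged sketch of an MVC-style project query in the spirit of Fig. 5.9.
def has_child_clause(field, value):
    return {"has_child": {"type": "files",
                          "query": {"match": {field: value}}}}

mvc_query = {"query": {"bool": {"must": [
    has_child_clause("code.class.implements", "Model"),
    has_child_clause("code.class.implements", "View"),
    has_child_clause("code.class.implements", "Controller"),
    has_child_clause("code.class.extends", "JFrame"),
]}}}
# mvc_query would then be posted to the projects collection, e.g.,
# requests.post("http://localhost:9200/agora/projects/_search", json=mvc_query)
```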
Fig. 5.11 Example query for a snippet that can be used to read XML files
The snippet is provided as plain text, with its lines separated by the “\n” character. This particular example is a snippet for reading XML files. Possibly,
the developer would want to know if the code is preferable or how others perform
XML parsing.
This query took 78 ms and returned several documents (522327). Obviously not
all of these perform XML parsing; several matches may include only the first line
of the query. In any case, the results most similar to the query are the ones with the largest “_score”, placed at the top of the list.
Fig. 5.13 Query for a file reader on the Advanced Search page of AGORA
Fig. 5.14 Query for an edit distance method on the Advanced Search page of AGORA
however, the field could be set to “Scanner” or any other reader, or to “apache” to use
the Apache file utilities, or even to “eclipse” to read the files as Eclipse resources.
Upon having read a collection of strings, we use the query of Fig. 5.14 to find a
component that receives two strings and computes their edit distance. In this case,
the developer searches for an “EditDistance” component that would return the edit
distance of two strings (of which the variable names are not given since they are not
necessary) as an integer. Similar results are obtained by setting “EditDistance” as a
method (or even by searching for it in the simple search page of Fig. 5.4).
Storing the computed edit distances requires a pair that would contain the two
strings (or possibly two string ids) and a Java HashMap that would have the pair
as the key and the edit distance of the two strings as a value. A query for a class
“Pair” with two variables “first” and “second” is shown in Fig. 5.15. Furthermore,
Fig. 5.15 Query for a pair class on the Advanced Search page of AGORA
given that the pairs should be HashMap keys, the query involves implementing the
“hashCode” method.
The scenario of this section illustrates how the developer can use AGORA to find
reusable components and thus reduce the effort required for his/her project. Similar
use cases can be composed for the other types of AGORA queries (e.g., the ones
described in the previous subsection).
5.5 Evaluation
Although the high-level comparison of AGORA with other CSEs is interesting (see
Sect. 5.2.3), we need a quantitative measure to assess the effectiveness of our CSE.
The measure of choice should reflect the relevance of the results retrieved by the
engine. Thus, we need to examine whether the results for a query accurately cover
the desired functionality. Therefore, in this section, we evaluate AGORA against two
known CSEs: GitHub Search and BlackDuck Open Hub. We selected these two CSEs
since they are representative of simple search and advanced search functionalities. Specifically, BlackDuck supports searching using syntax-aware queries, while GitHub is
used with text search. GitHub was actually chosen over all other simple search engines
(like Searchcode) because of the size of its index. Most other engines of Table 5.1
are either not operational (e.g., MeroBase) or provided only for local installation (e.g., Krugle).
We evaluate the three CSEs on a simple component reuse scenario, as analyzed
in the previous section. Typically, the developer wanting to find a specific compo-
nent will submit the relevant query, and the CSE will return several results. Then,
the developer would have to examine these results and determine whether they are
relevant and applicable to his/her code. In our case, however, relying on examining
the results for the CSEs ourselves would pose serious threats to validity. Instead, we
determine whether a result is relevant and usable by using test cases (a logic also used
in the next chapter). Our evaluation mechanism and the used dataset are presented
in Sect. 5.5.1, while the results for the three CSEs are presented in Sect. 5.5.2.
to construct the queries in the webpage of the service; however, it did not support any
API. Therefore, we did it by hand and downloaded the first results for each query.
Upon constructing the queries for the CSEs we also created test cases for each
component. In each test case, for a specific component, the names of the objects,
variables, and methods are initially written with respect to the component. When
a specific result is retrieved from a CSE, it is assigned a Java package, and the
code is saved in a Java class file. After that, the test case is modified to reflect
the class, variable, and method names of the class file. A modified test case for a
class “Customer1” with functions “setTheAddress” and “getTheAddress”, which is a
result of the “Customer” component query is shown in Fig. 5.17. Variable “m_index”
controls the order of the classes in the file. Ordering the methods of the class file
in the same order as the test case (a procedure that, of course, does not change
the functionality of the class file) can be performed by iterating over all their possible orderings.
Fig. 5.17 Example modified test case for the result of a code search engine
Hence, our evaluation framework determines whether each result returned from
a CSE is compilable and whether it passes the test. Note that compilability is also
evaluated with respect to the test case, thus ensuring that the returned result is func-
tionally equivalent to the desiderata of the developer. This suggests that the developer
would be able to make adjustments to any results that do not pass the test.
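A high-level sketch of this evaluation step is given below; it assumes that each retrieved result is saved as a Java file next to its modified JUnit test, and that javac/java and the JUnit jars are available. It is only an illustration, not the actual evaluation framework.

```python
import subprocess

# Illustrative evaluation step: compile a retrieved class together with its
# modified test and run the test, recording compilability and test passing.
def evaluate_result(result_file, test_file):
    compiled = subprocess.run(
        ["javac", "-cp", "junit.jar:.", result_file, test_file]).returncode == 0
    if not compiled:
        return {"compilable": False, "passed": False}
    # hypothetical convention: the test class is named after the test file
    test_class = test_file.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    passed = subprocess.run(
        ["java", "-cp", "junit.jar:hamcrest.jar:.",
         "org.junit.runner.JUnitCore", test_class]).returncode == 0
    return {"compilable": True, "passed": passed}
```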
In this subsection, we present the results of our evaluation for the components
described in Sect. 5.5.1. Table 5.7 presents the number of results for each query,
how many of them are compilable and how many passed the tests, for each CSE. We
kept the first 30 results for each CSE, since keeping more results did not improve
the compilability and test passing score for any CSE, and 30 results are usually more
than a developer would examine in such a scenario.
Both AGORA and GitHub returned at least 30 results for each query. BlackDuck,
on the other hand, did not return any results for three queries, while there are also
two more queries for which the returned results are fewer than 10. If we examine the
queries that returned few results, we conclude that it is due to differently defined
method names. For example, although BlackDuck returned numerous results for a
“Stack” with “push” and “pop” methods, it returned no results for a “Stack” with
“pushObject” and “popObject”. This is actually one of the strong points of AGORA,
Table 5.7 Compiled results and results that passed the tests for the three code search engines, for each query and on average, where the format for each value is (number of tested results)/(number of compiled results)/(total number of results)

Query | GitHub | AGORA | BlackDuck
Account | 0/10/30 | 1/4/30 | 2/9/30
Article | 1/6/30 | 0/5/30 | 2/3/15
Calculator | 0/5/30 | 0/5/30 | 1/3/22
ComplexNumber | 2/2/30 | 2/12/30 | 1/1/3
CreditCardValidator | 0/1/30 | 1/4/30 | 2/3/30
Customer | 2/3/30 | 13/15/30 | 7/7/30
Gcd | 1/3/30 | 5/9/30 | 12/16/30
Md5 | 3/7/30 | 2/10/30 | 0/0/0
Mortgage | 0/6/30 | 0/4/30 | 0/0/0
Movie | 1/3/30 | 2/2/30 | 3/7/30
Spreadsheet | 0/1/30 | 1/2/30 | 0/0/6
Stack | 0/3/30 | 7/8/30 | 8/12/30
Stack(2) | 0/0/30 | 7/8/30 | 0/0/0
Average number of results | 30.0 | 30.0 | 17.4
Average number of compiled results | 3.846 | 6.769 | 4.692
Average number of tested results | 0.769 | 3.154 | 2.923
which is, in fact, due to its CamelCase analyzer (see Sect. 5.3.2.3). GitHub would also
have returned very few results, were it not for the multi-query scheme we defined in Sect. 5.5.1.
Concerning the compilability of the results, and subsequently their compatibility
with the test cases, the results of AGORA are clearly better than the results of the
other two CSEs. Specifically, AGORA is the only engine that returns compilable results
for all 13 queries, while on average it returns more than 6.5 compilable results per
query. GitHub and BlackDuck return roughly 4 and 4.7 compilable results per query,
respectively, indicating that our CSE provides more useful results in most cases.
The number of results that pass the test cases for each CSE is also an indicator
that AGORA is quite effective. Given the average number of these results, shown in
Table 5.7, it seems that AGORA is more effective than BlackDuck, while GitHub is
less effective than both. Although the difference between AGORA and BlackDuck
does not seem as large as in the compilability scenario, it is important to note that
AGORA contains far fewer projects than the other two CSEs, and thus it is possible
that certain queries do not have test-compliant results in the index of AGORA.
Finally, the average number of compilable results and results that passed the tests
are visualized in Fig. 5.18, where it is clear that AGORA is the most effective of the
three CSEs.
Upon examining whether the retrieved results for each CSE are useful for the
component reuse scenario, it is also reasonable to determine whether these results
are ranked in an effective manner by each engine. For this reason, we use the
search length as a metric for the ranking of a CSE. The search length is defined as
the number of non-relevant results that the user must examine in order to find the
first relevant result. One may also define subsequent values for the second relevant
result, the third one, etc. In our case, we define two variants of this measure, one for
the results that pass compilation and one for the results that pass the tests. Table 5.8
presents the search length for the two cases with regard to finding the first relevant
(i.e., compilable or passed) result, for all the queries and for the three CSEs.
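To make the metric concrete, the following snippet (an illustrative implementation, not part of the evaluation framework) encodes a ranked result list as booleans and counts the non-relevant results that precede the k-th relevant one; if fewer than k relevant results exist, all non-relevant results in the list are counted, mirroring the values reported below.

```python
def search_length(relevance, k=1):
    """Number of non-relevant results examined before the k-th relevant one.

    `relevance` is the ranked result list encoded as booleans
    (True = relevant, e.g. compilable or test-passing)."""
    irrelevant, found = 0, 0
    for is_relevant in relevance:
        if is_relevant:
            found += 1
            if found == k:
                return irrelevant
        else:
            irrelevant += 1
    return irrelevant  # fewer than k relevant results in the list

# Example: the third result is the first compilable one, so two results are "wasted".
print(search_length([False, False, True, False, True], k=1))  # 2
```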
For the case of finding compilable results, AGORA clearly outperforms the other
two CSEs, requiring on average to examine roughly 5 irrelevant results to find the first
relevant one. By contrast, a search on GitHub or BlackDuck would require examining
approximately 6.8 and 10 irrelevant results, respectively. The average search length
for finding more than one compilable results is shown in Fig. 5.19a. In this figure, it
is also clear that using AGORA requires examining fewer irrelevant results to find
the first relevant ones. Given that a developer wants to find the first three compilable
results, then on average he/she would have to examine 9.7 of the results of AGORA,
13.3 of the results of BlackDuck, and 16.5 of the results of GitHub.
In Table 5.8, the search length for the results that pass the tests is also shown for
the three CSEs. In this case, BlackDuck seems to require examining fewer irrelevant
results (i.e., results that failed the test) on average than AGORA, nearly 10.5 and 12.5,
respectively, while GitHub is clearly outperformed, requiring the examination of more than
22 irrelevant results. However, it is important to note that the first relevant result
of AGORA is included in the first 30 results for 10 out of 13 queries, whereas for
BlackDuck for 9 out of 13 queries. It is also clear that the average of the search length
for the CSEs is greatly influenced by extreme values. For example, BlackDuck is
very effective for the “Account”, “Calculator”, and “ComplexNumber” queries. This
is expected since we compare CSEs that contain different projects and, consequently,
Table 5.8 Search length for the compiled and passed results for the three code search engines, for
each query and on average, where the format for each value is compiled/passed
Query Code search engines
GitHub AGORA BlackDuck
Account 0/30 2/2 0/0
Article 0/21 0/30 0/2
Calculator 0/30 8/30 0/0
ComplexNumber 0/0 0/0 0/0
CreditCardValidator 3/30 8/15 0/2
Customer 0/0 0/0 1/1
Gcd 25/25 1/8 5/5
Md5 1/9 0/0 30/30
Mortgage 0/30 4/30 30/30
Movie 23/23 21/21 0/1
Spreadsheet 4/30 20/23 30/30
Stack 2/30 1/2 4/4
Stack(2) 30/30 1/2 30/30
Average compiled 6.769 5.077 10.000
Average passed 22.154 12.538 10.385
Fig. 5.19 Evaluation diagrams depicting a the search length for finding compilable results and b
the search length for finding results that pass the tests, for the three code search engines
different source code components. Hence, AGORA is actually quite effective, con-
sidering that it contains less than 3000 projects, while GitHub and BlackDuck contain
roughly 14 million and 650 thousand projects, respectively.
The average search length for finding more than one tested result is shown in
Fig. 5.19b. As shown in this figure, both AGORA and BlackDuck are considerably
more effective than GitHub. BlackDuck requires examining slightly fewer results for
finding the first three results that pass the tests, whereas AGORA is clearly better
when more results are required.
5.6 Conclusion
In this chapter, we have presented AGORA, a CSE with advanced searching capabili-
ties, a comprehensive REST API and seamless integration with GitHub. The strength
of AGORA lies in its architecture and its powerful indexing mechanism. We have
evaluated AGORA against other CSEs and we have shown how it can be used for
finding reusable components, using an example usage scenario. In the provided sce-
nario, the developer would only need to issue three queries to AGORA, and then
he/she would be presented with ready-to-use components (i.e., the ones ranked high
in the result set). Compared to building the required components from scratch, using
AGORA would save valuable development time and effort.
Concerning future work, we consider conducting developer studies in realistic
scenarios to determine whether AGORA facilitates development in a reuse context
and whether the features of AGORA are well received by the developer community.
The index could also be extended by including quality metrics (e.g., static analysis
metrics) for each file. Finally, we may employ query expansion strategies and/or
incorporate semantics to further improve the relevance and/or the presentation of the
retrieved results.
References
1. Thummalapenta S, Xie T (2007) PARSEWeb: a programmer assistant for reusing open source
code on the web. In: Proceedings of the 22nd IEEE/ACM international conference on automated
software engineering, ASE ’07, New York, NY, USA. ACM, pp 204–213
2. Xie T, Pei J (2006) MAPO: mining API usages from open source repositories. In: Proceedings
of the 2006 international workshop on mining software repositories, MSR ’06, New York, NY,
USA. ACM, pp 54–57
3. Hummel O, Janjic W, Atkinson C (2008) Code conjurer: pulling reusable software out of thin
air. IEEE Softw 25(5):45–52
4. Lazzarini Lemos OA, Bajracharya SK, Ossher J (2007) CodeGenie: a tool for test-driven
source code search. In: Companion to the 22nd ACM SIGPLAN conference on object-oriented
programming systems and applications companion, OOPSLA ’07, New York, NY, USA. ACM,
pp 917–918
5. Diamantopoulos T, Symeonidis AL (2018) AGORA: a search engine for source code reuse.
SoftwareX, page under review
6. Janjic W, Hummel O, Schumacher M, Atkinson C (2013) An unabridged source code dataset
for research in software reuse. In: Proceedings of the 10th working conference on mining
software repositories, MSR ’13, Piscataway, NJ, USA. IEEE Press, pp 339–342
7. Linstead E, Bajracharya S, Ngo T, Rigor P, Lopes C, Baldi P (2009) Sourcerer: mining and
searching internet-scale software repositories. Data Min Knowl Discov 18(2):300–336
8. Elasticsearch: RESTful, distributed search & analytics (2016). https://www.elastic.co/
products/elasticsearch. Accessed April 2016
9. GitHub API, GitHub Developer (2016). https://developer.github.com/v3/. Accessed April 2016
10. Java Compiler Tree API (2016). http://docs.oracle.com/javase/8/docs/jdk/api/javac/tree/index.
html. Accessed April 2016
11. Unicode Standard Annex #29 (2016) Unicode text segmentation. In: Davis M (ed) An integral
part of The Unicode Standard. http://www.unicode.org/reports/tr29/. Accessed April 2016
12. CamelCase tokenizer, pattern analyzer, analyzers, Elasticsearch analysis (2016). http://
www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-analyzer.html#_
camelcase_tokenizer. Accessed April 2016
13. Lucene’s practical scoring function, controlling relevance, search in depth, elasticsearch: The
definitive guide (2016). http://www.elastic.co/guide/en/elasticsearch/guide/current/practical-
scoring-function.html. Accessed April 2016
14. Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge
University Press, New York
15. Gamma E, Vlissides J, Johnson R, Helm R (1998) Design patterns: elements of reusable object-
oriented software. Addison-Wesley Longman Publishing Co. Inc, Boston
16. Papamichail M, Diamantopoulos T, Symeonidis AL (2016) User-perceived source code quality
estimation based on static analysis metrics. In: Proceedings of the 2016 IEEE international
conference on software quality, reliability and security, QRS, Vienna, Austria, pp 100–107
17. Aggarwal K, Hindle A, Stroulia E (2014) Co-evolution of project documentation and popularity
within github. In: Proceedings of the 11th working conference on mining software repositories,
MSR 2014, New York, NY, USA. ACM, pp 360–363
18. Weber S, Luo J (2014) What makes an open source code popular on GitHub? In: 2014 IEEE
international conference on data mining workshop, ICDMW, pp 851–855
19. Borges H, Hora A, Valente MT (2016) Understanding the factors that impact the popularity
of GitHub repositories. In: 2016 IEEE international conference on software maintenance and
evolution (ICSME), ICSME, pp 334–344
20. Dimaridou V, Kyprianidis A-C, Papamichail M, Diamantopoulos T, Symeonidis A (2017)
Towards modeling the user-perceived quality of source code using static analysis metrics. In:
Proceedings of the 12th international conference on software technologies - volume 1, ICSOFT,
Setubal, Portugal, 2017. INSTICC, SciTePress, pp 73–84
21. Diamantopoulos T, Thomopoulos K, Symeonidis AL (2016) QualBoa: reusability-aware rec-
ommendations of source code components. In: Proceedings of the IEEE/ACM 13th working
conference on mining software repositories, MSR ’16, pp 488–491
22. Reiss SP (2009) Semantics-based code search. In: Proceedings of the 31st international con-
ference on software engineering, ICSE ’09, Washington, DC, USA. IEEE Computer Society,
pp 243–253
23. Sahavechaphan N, Claypool K (2006) XSnippet: mining for sample code. SIGPLAN Not.
41(10):413–430
Chapter 6
Mining Source Code for Component
Reuse
6.1 Overview
6.2.1 Overview
The recommendation of McIlroy [3] for component reuse in the field of software
engineering, dating back to 1968, is more relevant than ever. Living in a period where
vast amounts of data are accessible through the Internet and time to market plays
a vital role in industry, software reuse can result in time savings, as well as in the
development of high-quality and reliable software.
Aiming to facilitate code reuse, RSSEs have made their appearance in recent
years [4, 5]. According to Robillard et al. [6], an RSSE is “a software application
that provides information items estimated to be valuable for a software engineering
task in a given context”. In our case, the “software engineering task” refers to finding
useful software components, while the “given context” is the source code of the
developer.2
As stated by Walker [1], “recommendations for a software engineer play much
the same role as recommendations for any user”. In the case of software engineering,
we could say that a general concept for a recommendation is as follows: an engineer
would like to solve a problem for which there are several possible solutions available.
The RSSE identifies the problem (extracting useful information from the given source
code), decides on relevant results according to some metrics, and presents them to
the engineer, in an appropriate manner.
1 The term “Mantissa” refers to a sacred woman in ancient Greece, who interpreted the signs sent by
the gods and gave oracles. In a similar way, our system interprets the query of the user and provides
suitable results.
2 See [7] for a systematic literature review on software engineering tasks that are facilitated using
recommendation systems.
Although there are several code reuse recommendation systems, their functional
steps are more or less similar. Specifically, many contemporary code reuse recom-
mendation systems [8–14] follow a five-stage process:
• Stage 1—Data Collection: Local repositories, CSEs, or even open-source code
repositories are used for source code retrieval.
• Stage 2—Data Preprocessing: Upon extracting information from the retrieved
source code, data are transformed in some representation so that they can later be
mined.
• Stage 3—Data Analysis and Mining: Involves all the techniques used for extracting
useful information from the retrieved components. Common techniques in this stage
include clone detection, frequent sequences mining, etc.
• Stage 4—Data Postprocessing: Determines the recommendations of the system,
usually in the form of a ranking of possible solutions.
• Stage 5—Results Presentation: The results are presented along with a relevance
score and metrics concerning the functional or non-functional aspects of each
result.
In the following subsections, we discuss the main approaches that have grown to
be popular in the RSSE community.
One of the first code reuse RSSEs is CodeFinder, a system developed by Henninger
[15]. CodeFinder does not make use of any special query languages; it is based on
keywords, category labels, and attributes to conduct keyword matching. Its methodol-
ogy is based on associative networks and is similar to that of the PageRank algorithm
[16] that was published a few years later. CodeWeb [17] is another early developed
system for the KDE application framework that takes into account the structure of
a query to recommend results, which are accompanied by documentation. Although
it is one of the first RSSEs to use association rules, it is actually a browsing system
rather than a search engine, thus limiting its results.
A promising system named CodeBroker was published in 2002 by Ye and Fischer
[18]. CodeBroker was developed as an add-on for the Emacs editor. The tool forms
the query from documentation comments or signatures and is the first to utilize a
local repository using Javadoc comments. CodeBroker was also largely based on
user feedback, enabling retrieval by reformulation, while it also presented results
accompanied by evaluation metrics (e.g., recall and precision) and documentation.
However, the use of only a local database and its dependency on Javadoc comments
restricted the search results of CodeBroker.
From 2004 onward, when the Eclipse Foundation3 was established, many RSSEs
were developed as plugins for the Eclipse IDE. Two of the first such systems are
3 https://eclipse.org/org/foundation/.
Strathcona [19] and Prospector [20], both published in 2005. Strathcona extracts
queries from code snippets and performs structure matching, promoting the most
frequent results to the user, including also UML diagrams. However, it also uses a
local repository, thus the results are limited. Prospector introduced several novel fea-
tures, including the ability to generate code. The main scenario of this tool involves
the problem of finding a path between an input and an output object in source code.
Such paths are called jungloids and the resulting program flow is called a jungloid
graph. Prospector also requires maintaining a local database, while the main draw-
back of the tool is that API information may be incomplete or even irrelevant, so the
results are limited.
By 2006, the first CSEs made their appearance, and most RSSEs started to rely
on them for finding source code, thus ensuring that their results are up-to-date. One
of the first systems to employ a CSE was MAPO [10], a system developed by Xie
and Pei, with a structure that resembles most modern RSSEs. Upon downloading
source code from open-source repositories, MAPO analyzes the retrieved snippets
and extracts sequences from the code. These sequences are then mined to provide
a ranking according to the query of the developer. XSnippet, published in 2006
by Sahavechaphan and Claypool [8], is another popular RSSE which is similar to
MAPO. However, it uses structures such as acyclic graphs, B+ trees, and an extension
of the Breadth-First Search (BFS) algorithm for mining.
PARSEWeb is another quite popular system developed by Thummalapenta and
Xie [9]. Similarly to MAPO, PARSEWeb recommends API usages using code snip-
pets from the Google Code Search Engine.4 The service uses both Abstract Syntax
Trees (AST) and Directed Acyclic Graphs (DAGs) as source code representations and
clusters similar sequences. PARSEWeb employs heuristics for extracting sequences,
while it also enables query splitting in case no adequate results have been found.
All aforementioned systems are similar in terms of data flow. Considering the ar-
chitecture of PARSEWeb (shown in Fig. 6.1), a typical RSSE has a downloader that
receives the query of the user and retrieves relevant code (phase 1). After that, a code
analyzer extracts representations that are then mined to produce the results (phase 2),
which, before presented to the user, often undergo some postprocessing (phase 3).
In the case of PARSEWeb, the system extracts method invocation sequences, while
the mining part (sequence postprocessor) clusters the sequences. If the clustering
process produces very few sample sequences, PARSEWeb also allows splitting the
query and issuing two consecutive queries to the system.
Lately, several of the systems that have been developed (such as Portfolio [21],
Exemplar [22], or Snipmatch [23], the snippet retrieval recommender of the Eclipse
IDE) accept queries in natural language. Bing Code Search [11] further introduces
a multi-parameter ranking system, which uses source code and semantics originat-
ing from results of the Bing search engine. Finally, certain RSSEs have harnessed
the power of question-answering systems for recommending source code, such as
Example Overflow [24] which recommends Stack Overflow snippets, while there is
4 As already mentioned in the previous chapter, the Google Code Search Engine was discontinued
in 2013.
also a preference for dynamic and interactive systems, such as the recently developed
CodeHint [12].
Fig. 6.2 Example screen of Code Conjurer for a loan calculator component [13]
clustering similar results and accompanying the results with complexity metrics. The
tool also supports a proactive mode, meaning that queries can be created dynamically
whenever the developer adds, modifies, or deletes a piece of code. Although Code
Conjurer is an effective RSSE, its matching mechanism is not flexible enough to deal
with small differences between the retrieved code and the tests, while the adaptation
(advanced) version of the tool is too computationally complex. Additionally, its CSE,
Merobase, has been discontinued, so the tool is no longer functional.
Another interesting approach is CodeGenie, implemented by Lemos et al. [14, 30,
31]. Similarly to Code Conjurer, CodeGenie employs its own CSE, Sourcerer [32,
33], in order to retrieve source code components. A major advantage of CodeGenie
is the ability to extract the query from the test code of the developer, thus the latter
has to write only a JUnit test. However, as pointed out in [34], the extraction of the
query from the test case could lead to failure as this is a non-trivial task. Additionally,
the tool does not detect duplicate results, while it also suffers from modifications in
the code of the components.
Reiss follows a slightly different concept in S6 [35, 36]. S6 is a web-based appli-
cation that integrates several CSEs and repositories, including Krugle5 and GitHub.
Although the developer has to compose more complex queries (e.g., semantic key-
words), the retrieved results are refactored and simplified, making S6 possibly the
only RSSE that integrates the code refactoring aspect of TDD. An example screen of
S6 is shown in Fig. 6.3, where it is clear that the returned results are indeed quite use-
ful; however, the tool offers some options that, at first sight, may be overwhelming.
5 http://opensearch.krugle.org/.
Fig. 6.3 Example screen of S6 for a converter of numbers to roman numerals [35]
Furthermore, the service often returns duplicates, while most results are incomplete
snippets that confuse the developer, instead of helping him/her.
Another interesting system is FAST, an RSSE in the form of an Eclipse plugin
developed by Krug [34]. The FAST plugin makes use of the Merobase CSE and
emphasizes automated query extraction; the plugin extracts the signature from a
given JUnit test and forms a query for the Merobase CSE. The retrieved results are
then evaluated using automated testing techniques. A drawback of this system is that
the developer has to examine the query extracted from the test before executing the
search. Additionally, the service does not make use of growing software repositories,
while it employs the deprecated Merobase CSE, so it is also no longer functional.
6.2.4 Motivation
Although several RSSEs have been developed in recent years, most of them
have important limitations. For instance, not all of them are integrated with updatable soft-
ware repositories, while their syntax-aware features are limited to exact matching
between method and parameter names. We argue that an RSSE should exhibit the
following features:
• Query composition: One of the main drawbacks of CSEs is that they require
forming queries in a specific format. An RSSE should form the required query
using a language that the developers understand, or optimally extract it from the
code of the developer.
• Syntax-aware mining model: Implementing and integrating a syntax-aware min-
ing model focused on code search, instead of a general-purpose one (e.g., lexical),
guarantees more effective results.
• Automated code transformation: Employing heuristics to modify elements of
the retrieved code could lead to more compilable results. There are cases where
the required transformations are straightforward, e.g., when the code has generics.
• Results assessment: It is important for the system to provide information about
the retrieved results, both in functional and non-functional characteristics. Thus,
the results should be ranked according to whether they are relevant to the query of
the developer.
• Search in dynamic repositories: Searching in growing repositories instead of
using local ones or even static CSEs results in more effective up-to-date imple-
mentations. Thus, RSSEs must integrate with updatable indexes.
Note that the list of features outlined above is not necessarily exhaustive. We
argue, however, that they reasonably cover a component reuse scenario in a test-
driven reuse context, as supported also by current research efforts in this area [13,
14, 34, 35].
Upon presenting the current state of the art on TDR systems, we compare these
systems to Mantissa with regard to the features defined in Sect. 6.2.4. The compar-
ison, which is shown in Table 6.1, indicates that Mantissa covers all the required
features/criteria, while the other systems are mostly limited to full coverage of two
or three of these features.
Table 6.1 Feature comparison of popular test-driven reuse systems and Mantissa
Code reuse system    Feature 1            Feature 2           Feature 3               Feature 4             Feature 5
                     Query composition    Syntax-awareness    Code transformation     Results assessment    Search in dynamic repos
Mantissa ✓ ✓ ✓ ✓ ✓
Code Conjurer – – ×
CodeGenie × – ×
S6 × –
FAST – × – ×
✓ Feature supported, × Feature not supported, – Feature partially supported
6.3 Mantissa: A New Test-Driven Reuse Recommendation System
In this section, we present Mantissa [37], a TDR system designed to support the
features of Sect. 6.2.4.
6.3.1 Overview
tract its AST. The downloader uses the AST to compose the queries for the connected
CSEs. After that, it issues the queries, retrieves the results, and stores them in a local
repository.
The retrieved results are then analyzed by the Parser, which extracts the AST
for every result. The Miner ranks the results using a syntax-aware scoring model
based on information retrieval metrics. The retrieved files are then passed to the
transformer module, which performs source code transformations on them in order
to syntactically match them to the query. The tester compiles the files in order to
execute them against the given test cases. The ranked list of results involves the
score for each component, along with these elements, as well as information as to
whether the result is compilable and whether it has passed the tests. Finally, the
control flow and the dependencies are extracted for each result by the flow extractor,
while links to grepcode6 are provided for each dependency/import.
Note that the architecture of Mantissa is language-agnostic, as it can retrieve func-
tional software components written in different programming languages. Specifically,
supporting a programming language requires developing a Parser module in order to
extract the ASTs of source code components, and a Tester module so as to test the
retrieved results. In the following paragraphs, we provide an implementation of the
modules of Mantissa for the Java programming language.
At first, the input signature of Mantissa is given in a form similar to a Java interface
and any test cases are written in JUnit format. An example signature for a Java “Stack”
component with two methods “push” and “pop” is shown in Fig. 6.5.
Note also that a wildcard character (_) can be used for cases where the user would
like to ignore the class/method name, while Java generic types can also be used.
6 http://grepcode.com/.
6.3.2 Parser
The parser module is used for extracting information from the source code of the
developer as well as for extracting the ASTs of the retrieved results. The module
employs srcML,7 a tool which extracts the AST in XML format, where mark-up
tags denote elements of the abstract syntax for the language. The tool also allows
refactoring by transforming XML trees back to source code. The AST for a file
includes all the elements that may be defined in the file: packages, imports, classes,
methods, etc. An example of stripped-down (without modifiers) AST for the file of
Fig. 6.5 is shown in Fig. 6.6.
6.3.3 Downloader
The downloader module is used to search in the integrated CSEs of the system,
retrieve relevant files, and form a temporary local repository so that the files are
further analyzed by the miner module. The module provides the developer with the
ability to search either in GitHub Search or in our CSE AGORA, which was analyzed
in the previous chapter. We present the integration of both CSEs in the following
subsection.
7 http://www.srcml.org/.
Concerning the integration of our system with GitHub, the latter provides an API for
code search; however, it has two major limitations: (a) it allows only a limited number
of requests per hour and (b) it restricts its code search feature to specific reposito-
ries/users, thus not allowing searching of the whole index. For this reason, we decided
to implement a custom downloader in order to overcome the latter limitation.
Our implementation uses the Searchcode CSE,8 in order to initially find GitHub
users whose repositories are possibly relevant to the query. Specifically, the downloader
uses the AST of the queried component in order to compose a number of queries for
Searchcode. These queries are issued one after another until a specified number of
results from different GitHub users are returned (tests have shown that retrieving the
first 30 users is quite adequate). The downloader issues three types of queries:
• the name of the class and a combination of the methods, including their names
and return types;
• the name of the class on its own, as is and tokenized if it can be split into more than
one token; and
• the combination of methods on their own, including the name and return type of
each method.
Figure 6.7 depicts a list of queries for the “Stack” of Fig. 6.5.
Upon issuing the queries to Searchcode, the downloader identifies 30 user ac-
counts that are a good fit for the query. Since, however, GitHub allows searching for
more than 30 users, we decided to exploit this by also adding the most popular users
to this list (popularity is determined given the number of stars of their repositories).
Specifically, GitHub allows performing queries in the source code of up to 900 user
accounts per minute, i.e., 30 requests per minute each including 30 users.
Similarly to Searchcode, GitHub does not support regular expressions, hence the
downloader formulates a list of queries in order to retrieve as many relevant results as
possible. The list of queries is slightly different from what was used in the Searchcode
CSE, since in the case of GitHub the required results are the files themselves and not
only the repositories of GitHub users. Specifically, the queries to GitHub include the
following:
8 https://searchcode.com/.
• the methods of the component, including their type, their name, and the type of
their parameters;
• the names and the parameter types of the methods;
• the names of the methods;
• the tokenized names of the methods; and
• the tokenized name of the class.
The relevant example of a list of queries for the “Stack” component of Fig. 6.5 is
shown in Fig. 6.8.
Finally, if the query is longer than 128 characters, then its terms are removed one
at a time, starting from the last one until this limitation of GitHub is satisfied.
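The query-generation logic for GitHub can be sketched as follows. The exact query strings of Fig. 6.8 are not reproduced here, so the query formatting, the helper names, and the tokenizer below are illustrative assumptions; only the overall structure (five query variants plus truncation to 128 characters) follows the description above.

```python
import re

def tokenize(identifier):
    """Split a camelCase identifier into lowercase tokens (an approximation)."""
    return [t.lower() for t in re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", identifier)]

def truncate(query, limit=128):
    """Drop trailing terms one at a time until the query fits the length limit."""
    terms = query.split()
    while terms and len(" ".join(terms)) > limit:
        terms.pop()
    return " ".join(terms)

def build_github_queries(class_name, methods):
    """Build the five query variants for a component; `methods` is a list of
    hypothetical (return_type, name, [parameter_types]) tuples."""
    queries = [
        " ".join(f"{rtype} {name}({', '.join(params)})" for rtype, name, params in methods),
        " ".join(f"{name}({', '.join(params)})" for _, name, params in methods),
        " ".join(name for _, name, _ in methods),
        " ".join(" ".join(tokenize(name)) for _, name, _ in methods),
        " ".join(tokenize(class_name)),
    ]
    return [truncate(q) for q in queries]

print(build_github_queries("Stack", [("void", "push", ["Object"]), ("Object", "pop", [])]))
```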
As indicated, GitHub imposes several restrictions on its API. Thus, we also use
AGORA, which is better oriented toward component reuse. The syntax-aware search
of AGORA allows finding Java objects with specific method names, parameter
names, etc., while its API does not impose any limitations to the extent of the index
to be searched. In our case, the downloader takes advantage of the syntax-aware
features of AGORA to create queries such as the one shown in the component reuse
search scenario of AGORA (see Sect. 5.4.2.1).
For bandwidth reasons, most CSEs usually restrict the users with respect to the
number of results that can be downloaded within a time window. For example, in
the case of GitHub, downloading all the results one by one is rather ineffective.
However, CSEs usually return a list of results including snippets that correspond
to the matching elements of the query. To accelerate our system and ensure that
a result is at least somewhat relevant before downloading it, we have constructed
a Snippets Miner within the downloader module. The Snippets Miner component
parses the snippets provided by the CSE and determines which are relevant to the
query, ensuring that the downloader retrieves only these and ignores any remaining
files.
Note that the Snippets Miner component can be adjusted to any CSE that returns
snippets information. It employs syntax-aware features; thus, in our case, it was not
required for AGORA, which already has these matching capabilities. The
component, however, had to be activated for the results of GitHub.
At first, the component works by eliminating duplicates using the full filename
of the result file (including the project name), as well as the SHA hash returned by the
GitHub API for each result. After that, the component extracts method declarations
from each result and checks whether they comply with the original query. Note that
using the parser is not possible since the snippets are incomplete; thus, the method
declarations are extracted using the regular expression of Fig. 6.9.
Upon extracting method declarations, the Snippets Miner performs matching be-
tween the methods that were extracted by the currently examined snippet and the
methods requested in the query of the developer. The matching mechanism involves
checking whether there are similar names and return types of methods, as well as
parameter names and types. For the return types, only exact matches are checked,
while for parameters the possible permutations are considered. A total score is given
to each result using the assigned score of each of these areas. The scoring scheme
used is explained in detail in Sect. 6.3.4, since it is the same as the one used by the miner
module of Mantissa. The Snippets Miner keeps results that have a score
greater than 0 and discards any remaining files. If, however, the retrieved results are
fewer than a threshold (in our case 100), then results that scored 0 are also downloaded
until that threshold is reached.
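A rough sketch of this filtering step is shown below; the regular expression is a stand-in for the one of Fig. 6.9, and the result representation and the scoring function are placeholders, since the actual scoring scheme is described in Sect. 6.3.4.

```python
import re

# Illustrative stand-in for the regular expression of Fig. 6.9: it roughly
# captures "ReturnType methodName(parameters) {" declarations in a snippet.
METHOD_DECL = re.compile(r"([\w<>\[\]]+)\s+(\w+)\s*\(([^)]*)\)\s*\{")

def filter_snippets(results, query_methods, score_fn, threshold=100):
    """Keep snippet results that match the queried methods.

    `results` are dicts with 'path', 'sha' and 'snippet' keys (our naming);
    `score_fn` scores a list of extracted declarations against the query."""
    seen, kept, zero_scored = set(), [], []
    for res in results:
        key = (res["path"], res["sha"])              # eliminate duplicates
        if key in seen:
            continue
        seen.add(key)
        decls = METHOD_DECL.findall(res["snippet"])  # (type, name, params) triples
        if score_fn(decls, query_methods) > 0:
            kept.append(res)
        else:
            zero_scored.append(res)
    # If too few results matched, fall back to zero-scored ones up to the threshold.
    if len(kept) < threshold:
        kept.extend(zero_scored[:threshold - len(kept)])
    return kept
```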
6.3.4 Miner
Upon forming a local repository with the retrieved results, the parser extracts their
ASTs as described in Sect. 6.3.2, while the miner ranks them according to the query of
the developer. The miner consists of three components: the preprocessor, the scorer,
and the postprocessor. These are analyzed in the following paragraphs.
6.3.4.1 Preprocessor
The first step before ranking the results involves detecting and removing any duplicate
files. There are several algorithms for detecting duplicate files while current literature
also includes various sophisticated code clone detection techniques. However, these
methods are usually not oriented toward code reuse; when searching for a software
component, finding similar components is actually encouraged, and what one would consider
excess information is exact duplicates. Thus, the preprocessor eliminates duplicates
using the MD5 algorithm [38] in order to ensure that the process is as fast as possible.
Initially, the files are scanned and an MD5 hash is created for each of them and after
that the hashes are compared to detect duplicates.
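A minimal sketch of this step, assuming the retrieved results are stored as .java files in a local folder, could look as follows.

```python
import hashlib
from pathlib import Path

def remove_exact_duplicates(directory):
    """Hash the contents of each downloaded file with MD5 and keep only the
    first file seen for every hash; returns the list of unique file paths."""
    unique = {}
    for path in sorted(Path(directory).glob("*.java")):
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        unique.setdefault(digest, path)
    return list(unique.values())
```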
6.3.4.2 Scorer
The scorer is the core component of the miner module, which is responsible for
scoring and ranking the files. It receives the ASTs of the downloaded files as input
and computes the similarity between the AST of each result and the AST of the query.
Our scoring methodology is similar to the Vector Space Model (VSM) [39], where
the documents are source code files, while the terms are the values of the AST nodes
(XML tags) as defined in Sect. 6.3.2. The scorer practically creates a vector for each
result file and the query and compares these vectors.9
Since each component has a class, initially a class vector is created for each file.
The class vector has the following form:
$$\vec{c} = [score(name), score(\vec{m}_1), score(\vec{m}_2), \ldots, score(\vec{m}_n)] \quad (6.1)$$
where $name$ is the name of the class and $\vec{m}_i$ is the vector of the $i$th method of the
class, out of n methods in total. The score of the name is a metric of its similarity
to the name of the requested component, while the score of each method vector is
a metric of its similarity to the method vector of the requested component, which
includes the name of the method, the return type, and the type of its parameters. A
method vector $\vec{m}$ is defined as follows:
$$\vec{m} = [score(name), score(type), score(p_1), \ldots, score(p_m)] \quad (6.2)$$
where $name$ is the name of the method, $type$ is the return type of the method, and
$p_j$ is the type of the $j$th parameter of the method, out of $m$ parameters in total.
Given Eqs. (6.1) and (6.2), we are able to represent the full signature of a compo-
nent as a class vector. Thus, comparing two components comes down to comparing
these vectors, which contain numeric values in the range [0, 1]. The score functions
are computed for each document (result file) given the signature of the query. Com-
puting the vector values of Eqs. (6.1) and (6.2) requires a function for computing
the similarity between sets, for the methods and the parameters, and a function for
computing the similarity between strings, for class names, method names and return
types, and parameter types.
Note that using only a string matching method is not adequate, since the maximum
score between two vectors such as the ones of Eqs. (6.1) and (6.2) also depends on
the order of the elements. Methods, for example, are not guaranteed to have the
same order in two files, as parameters do not necessarily have the same order in
methods. Thus, the scorer has to find the best possible matching between two sets
of methods or parameters, respectively. Finding the best possible matching between
two sets is a problem similar to the Stable Marriage Problem (SMP) [40], where
9 Specifically, the main difference between our scoring model and the VSM is that our model is
hierarchical, i.e., in our case a vector can be composed of other vectors. Our model has two levels: at
the class level it is practically a VSM for class vectors, while class vectors are further analyzed at
the method level.
Fig. 6.10 Algorithm that computes the similarity between two sets of elements, U and V , and
returns the best possible matching as a set of pairs Matched Pair s
two sets of elements have to be matched in order to maximize the profit given the
preferences of the elements of the first set to the elements of the second and vice
versa.10 However, in our case, the problem is simpler, since the preference lists for
both sets are symmetrical.
Hence, we have constructed an algorithm that finds the best possible matching
among the elements of two sets. Figure 6.10 depicts the Sets Matching algorithm.
The algorithm receives sets U and V as input and outputs the set MatchedPairs,
i.e., the best matching between the sets. At first, two sets are defined, Matched_U
and Matched_V, to keep track of the matched elements of the two initial sets. After
that, all possible scores for the combinations of the elements of the two sets are
computed and stored in the Pairs set in the form (u, v, score(u, v)), where u and v
are elements of the sets U and V, respectively. The Pairs set is sorted in descending
order according to the scores, and then the algorithm iterates over each pair of the
set. For each pair, if neither of the elements is already present in the Matched_U and
Matched_V sets, then the pair is matched and added to the MatchedPairs set.
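A compact rendering of the algorithm of Fig. 6.10 is given below; the elements are assumed to be hashable (e.g., method or parameter names) and the score function is left abstract.

```python
def match_sets(U, V, score):
    """Best-effort matching of Fig. 6.10: score every (u, v) pair, sort the
    pairs by descending score, and accept a pair only if neither element has
    been matched yet. Returns a list of (u, v, score) triples."""
    pairs = sorted(((u, v, score(u, v)) for u in U for v in V),
                   key=lambda p: p[2], reverse=True)
    matched_u, matched_v, matched_pairs = set(), set(), []
    for u, v, s in pairs:
        if u not in matched_u and v not in matched_v:
            matched_u.add(u)
            matched_v.add(v)
            matched_pairs.append((u, v, s))
    return matched_pairs
```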
Finally, having addressed the problem of determining the matching between sets of
methods and parameters, the only remaining issue is devising a string similarity scheme.
This scheme is used to compute the similarity between Java names and types.
Note that the string similarity scheme has to comply with the main characteristics of
the Java language, e.g., handling camelCase strings, while at the same time employing
NLP semantics (e.g., stemming). Thus, the first step before comparing two strings is
preprocessing. Figure 6.11 depicts the preprocessing steps.
At first, the string is tokenized using a tokenizer that is consistent with the naming
conventions of Java. The tokenizer splits the string into tokens on uppercase charac-
ters (for camelCase strings), numbers, and the underscore (_) character. Additionally,
10 According to the original definition of the SMP, there are N men and N women, and every person
ranks the members of the opposite sex in a strict order of preference. The problem is to find a
matching between men and women so that there are no two people of opposite sex who would both
rather be matched to each other than their current partners.
it removes numbers and common symbols used in types (< and >) and arrays ([ and ])
and converts the text to lowercase.
The next step involves removing stop words and terms with fewer than three char-
acters, since they would not offer any value to the comparison. We used the stop
words list of NLTK [41]. Finally, the Porter stemmer of NLTK [41] is used to stem
the words, applying both inflectional and derivational morphology heuristics. Table
6.2 illustrates these steps using different examples. The leftmost column shows the
original string while subsequent columns show the result after performing each step,
so the rightmost column shows the resulting tokens.
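The preprocessing pipeline can be sketched as follows; the tokenization regular expression is our own approximation of the described tokenizer, and the NLTK stopwords corpus has to be downloaded beforehand (nltk.download('stopwords')).

```python
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STEMMER = PorterStemmer()
STOPWORDS = set(stopwords.words("english"))

def preprocess(identifier):
    """Tokenize a Java identifier (camelCase, digits, underscores), drop the
    symbols used in generics/arrays, lowercase the tokens, remove stop words
    and tokens shorter than three characters, and stem what remains."""
    cleaned = re.sub(r"[<>\[\]0-9]", " ", identifier).replace("_", " ")
    tokens = [t.lower() for t in re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", cleaned)]
    tokens = [t for t in tokens if len(t) >= 3 and t not in STOPWORDS]
    return [STEMMER.stem(t) for t in tokens]

print(preprocess("pushObject"))  # e.g. ['push', 'object']
```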
Finally, upon preprocessing the strings that are to be compared, we use string
similarity methods to compute a score of how similar the strings are. For this, we
used two different metrics: the Levenshtein similarity [42] and the Jaccard index
[43]. The Levenshtein similarity is used by the scorer module in order to perform a
strict matching that takes also the order of the strings into account, while the Jaccard
index is used by the Snippets Miner component of the downloader (see Sect. 6.3.3.2)
so that the matching is less strict.
As the Levenshtein distance is applied to strings, the token vectors are first merged.
The Levenshtein distance between two strings is defined as the minimum number
of character insertions, deletions, or replacements required to transform one string
into the other. For our implementation, the cost of replacing a character was set to 2,
while the costs of the insertion and deletion were set to 1. Then, for two strings, $s_1$
and $s_2$, their Levenshtein similarity score is computed as follows:
$$L(s_1, s_2) = 1 - \frac{distance(s_1, s_2)}{\max\{|s_1|, |s_2|\}} \quad (6.3)$$
As shown in Eq. (6.3), the distance between the two strings is normalized
by dividing by the length of the longer of the two strings, and the result is
subtracted from 1 to offer a metric of similarity.
The Jaccard index is used to compute the similarity between sets. Given two sets,
U and V , their Jaccard index is the size of their intersection divided by the size of
their union:
$$J(U, V) = \frac{|U \cap V|}{|U \cup V|} \quad (6.4)$$
In our case, the sets contain tokens, so the intersection of the two sets includes the
tokens that exist in both sets, while their union includes all the tokens of both sets
excluding duplicates.
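Both metrics are straightforward to implement; the sketch below follows the cost scheme and the normalization of Eqs. (6.3) and (6.4).

```python
def levenshtein_similarity(s1, s2):
    """Levenshtein similarity with insertion/deletion cost 1 and replacement
    cost 2, normalized as in Eq. (6.3)."""
    n, m = len(s1), len(s2)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if s1[i - 1] == s2[j - 1] else 2
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # replacement
    return 1 - dist[n][m] / max(n, m)

def jaccard(tokens1, tokens2):
    """Jaccard index of two token sets, as in Eq. (6.4)."""
    u, v = set(tokens1), set(tokens2)
    return len(u & v) / len(u | v) if u | v else 1.0

print(levenshtein_similarity("push", "pushobject"))  # merged token strings
print(jaccard(["push"], ["push", "object"]))         # 0.5
```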
Upon having constructed the algorithms and the string metrics required for cre-
ating the vector for each result, we now return to the VSM defined by Eqs. (6.1) and
(6.2). Note that all metrics defined in the previous paragraphs are normalized, so all
dimensions for the vector of a document will be in the range [0, 1]. The query vector
is the all-ones vector. The final step is to compute a score for each document-result
vector and rank these vectors according to their similarity to the query vector.
Since the Tanimoto coefficient [44] has been proven to be effective for comparing
bit vectors, we have used its continuous definition for the scorer. Thus, the similarity
between the query vector $\vec{Q}$ and a document vector $\vec{D}$ is computed as follows:
$$T(\vec{Q}, \vec{D}) = \frac{\vec{Q} \cdot \vec{D}}{||\vec{Q}||^2 + ||\vec{D}||^2 - \vec{Q} \cdot \vec{D}} \quad (6.5)$$
The numerator of Eq. (6.5) is the dot product of the two vectors, while the denominator
is the sum of their squared Euclidean norms minus their dot product. Finally, the scorer
outputs the result files, along with their scores that indicate their relevance to the query
of the developer.
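The continuous Tanimoto coefficient of Eq. (6.5) can be computed directly from the two score vectors, as in the sketch below; the example vectors are hypothetical and are not the ones of the worked example that follows.

```python
def tanimoto(q, d):
    """Continuous Tanimoto coefficient of Eq. (6.5) for two score vectors of
    equal length (the query vector is the all-ones vector)."""
    dot = sum(qi * di for qi, di in zip(q, d))
    return dot / (sum(qi * qi for qi in q) + sum(di * di for di in d) - dot)

# Hypothetical class vectors for a query and a retrieved result.
print(round(tanimoto([1.0, 1.0, 1.0], [1.0, 0.9, 0.85]), 3))
```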
As an example, we can use the signature of Fig. 6.5. The “Stack” component of
this signature has two methods, “push” and “pop”, so its vector would be an all-ones
vector with three elements, one for the name of the class and one for each method:
$$\vec{c} = [1, 1, 1] \quad (6.6)$$
The vectors of its methods "push" and "pop" would be
$$\vec{m}_{push} = [1, 1, 1] \quad (6.7)$$
and
$$\vec{m}_{pop} = [1, 1] \quad (6.8)$$
respectively, since “push” has a name, a return type (“void”), and a parameter type
(“Object”), while “pop” only has a name and a return type (“Object”).
For our example, consider executing the query and retrieving a result with the
signature of Fig. 6.12.
This object has two method vectors, $\vec{m}_{pushObject}$ and $\vec{m}_{popObject}$. In this case,
the matching algorithm of Fig. 6.10 would match $\vec{m}_{push}$ to $\vec{m}_{pushObject}$ and $\vec{m}_{pop}$
to $\vec{m}_{popObject}$. Given the string preprocessing steps of Fig. 6.11 (e.g., "pushObject"
would be split to "push" and "object", all names would become lowercase, etc.) and
Fig. 6.12 Example signature for a “Stack” component with two methods for adding and deleting
elements from the stack
Fig. 6.13 Example vector space representation for the query “Stack” and the retrieved result
“MyStack”
Finally, given Eq. (6.5), the vectors of Eqs. (6.6) and (6.10) would yield a score equal to
0.898 for the retrieved component. Figure 6.13 graphically illustrates how the class
vectors are represented in the three-dimensional space for this example.
6.3.4.3 Postprocessor
Upon having scored the results, the postprocessor can be used to further enhance the
produced ranking. Intuitively, when two or more files are functionally equivalent,
a developer would probably select the simplest solution. As a result, we use the
physical lines of code metric,11 counting any line that has more than three non-blank
characters, to determine the size and complexity of each file. After that, the results are
ranked according to their functional score as computed by the scorer, in descending
order, and subsequently in ascending order of their LoC value when the scores are
equal.
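A sketch of this ranking step is shown below; representing a result as a (score, source) pair is our own simplification.

```python
def physical_loc(source):
    """Count lines with more than three non-blank characters, as described above."""
    return sum(1 for line in source.splitlines()
               if len(line.replace(" ", "").replace("\t", "")) > 3)

def rank(results):
    """Sort by functional score (descending), breaking ties by LoC (ascending)."""
    return sorted(results, key=lambda r: (-r[0], physical_loc(r[1])))
```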
6.3.5 Transformer
The transformer receives as input the retrieved software components and the query of
the developer, and performs a series of transformations so that the query is matched
(i.e., so that the retrieved components become compatible with the query). The transforma-
tions are performed on the srcML XML tree using XPath expressions and algorithms,
and the resulting transformed XML produces the new source code component.
At first, class and method names are renamed to match those of the query. The
method invocations are also examined to support cases where one method of the
component calls another method that has to be renamed. Furthermore, the return
types of methods are converted to the corresponding ones of the query, using type
casts12 wherever possible, while return statements are also transformed accordingly.
Upon having transformed the names and the return types of each method, the
next step involves transforming the parameters. The parameters of the results have
already been matched to those of the query using the algorithm of Fig. 6.10. Thus,
the transformer iterates over all parameters of the query method and refactors the
corresponding parameters of the result method. The refactoring involves changing
the name, as well as the type of each parameter using type casts. Any parameters
that are not present in the result method are added, while any extraneous parameters
are removed from the parameter list and placed inside the method scope as variable
declaration statements and are initialized to the default values as defined by the Java
standards.13 Note also that the order of the parameters is changed to match the pa-
rameter order of the corresponding query method. These changes are also performed
on all statements that invoke the method.
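To give a flavor of how such transformations can be applied, the sketch below renames a method declaration and its invocations in a srcML tree using XPath (via lxml); the namespace URI is the one used by recent srcML releases and should be treated as an assumption, while return-type casts and parameter handling are omitted.

```python
from lxml import etree

# Namespace of recent srcML releases; older releases use a different URI.
NS = {"src": "http://www.srcML.org/srcML/src"}

def rename_method(srcml_tree, old_name, new_name):
    """Rename a method declaration and all invocations of it (simplified)."""
    for name in srcml_tree.xpath("//src:function/src:name", namespaces=NS):
        if name.text == old_name:          # the declaration
            name.text = new_name
    for name in srcml_tree.xpath("//src:call/src:name", namespaces=NS):
        if name.text == old_name:          # the invocations
            name.text = new_name
    return srcml_tree

# Usage, assuming `customer.xml` was produced by `srcml Customer.java -o customer.xml`:
# tree = etree.parse("customer.xml")
# rename_method(tree, "setTheAddress", "setAddress")
```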
11 https://java.net/projects/loc-counter/pages/Home.
12 http://imagejdocu.tudor.lu/doku.php?id=howto:java:how_to_convert_data_type_x_into_type_
y_in_java.
13 https://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html.
Fig. 6.14 Example signature for a class “Customer” with methods “setAddress” and “getAddress”
Finally, using srcML, the transformed XML tree is parsed in order to extract the
new version of the component. The code is further style-formatted using the Artistic
Style formatter14 to produce a clean ready-to-use component for the developer.
Figure 6.14 provides an example of a query signature for a “Customer” component
with two methods, “setAddress” and “getAddress”, that are used to set and retrieve the
address of the customer, respectively. This example is used throughout this subsection
to illustrate the functionality of the transformer module.
Figure 6.15 depicts an example retrieved result for the “Customer” component.
This result indeed contains the required methods, however with slightly different
names, different parameters, and different return types.
In this case, the transformer initially renames the methods setTheAddress and
getTheAddress to setAddress and getAddress, respectively. After that, the return
type of setAddress is changed from int to void and the corresponding return state-
ments are modified. Finally, the extraneous parameter postCode is moved inside
14 http://astyle.sourceforge.net/.
the method and set to its default value (in this case null). The final result is shown in
Fig. 6.16.
6.3.6 Tester
The tester module of Mantissa evaluates whether the results satisfy the functional
conditions set by the developer. In order for this module to operate, the developer
has to provide test cases for the software component of the query.
Initially, a project folder is created for each result file. The corresponding result
file and the test case written by the developer are copied to the folder. After that, the
tester component employs the heuristics defined by the transformer in Sect. 6.3.5 to
modify the source code file (transform classes, methods, and parameters), so that the
file is compilable along with the test file. The files are compiled (if this is possible)
and the test is executed to provide the final result. The outcome of this procedure
indicates whether the downloaded result file is compilable and whether it passes the
provided test cases.
The corresponding test file that is provided by the developer for the component
of Fig. 6.14 is shown in Fig. 6.17.
The test is formed using JUnit and the Java reflection feature. Reflection is required
to provide a test structure that can be modified in order to comply with different files.
In the code of Fig. 6.17, variable m_index is a zero-based counter, which stores the
position of the method that is going to be invoked next in the result file. Using a
Fig. 6.17 Example test case for the “Customer” component of Fig. 6.14
15 Java supports method overloading, meaning that it allows the existence of methods with the same name, as long as their parameter lists differ.
The flow extractor extracts the source code flow of the methods for each result. The
types that are used in each method and the sequence of commands can be a helpful
asset for the developer, as he/she may easily understand the inner functionality of the
method, without having to read the source code. We use the sequence extractor of
[45], and further extend it to account for conditions, loops, and try-catch statements.
Specifically, for each component, we initially extract all the declarations from the
source code (including classes, fields, methods, and variables) to create a hierarchical
lookup table according to the scopes as defined by the Java language. After that, we
iterate over the statements of each method and extract three types of statements:
assignments, function calls, and class instantiations. For instance, the statement
this.address = address of Fig. 6.16 is an assignment statement of type String.
All alternate paths are taken into account. Apart from conditions, try-catch
statements define different flows (with finally blocks executed in all cases), while
loops are either executed or not, i.e., they are regarded as conditions.
Finally, for each method, its possible flows are depicted in the form of a graph,
where each node represents a statement. The graphs are constructed in the Graphviz
dot format17 and generated using Viz.js.18 The types of the nodes are further printed
in red font when they refer to external dependencies. An example usage of the flow
extractor will be given in Sect. 6.4.2 (see Figs. 6.21 and 6.22).
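A minimal sketch of how such a dot graph could be emitted is given below; the node and edge representation is our own, and in Mantissa the resulting graphs are rendered with Viz.js.

```python
def flow_to_dot(method_name, statements, edges, external_types=()):
    """Render a method's statement flow as a Graphviz dot graph.

    `statements` maps node ids to labels (e.g. "String : this.address = address"),
    `edges` lists (from_id, to_id) pairs for the alternate paths, and labels that
    mention a type in `external_types` are highlighted in red."""
    lines = [f'digraph "{method_name}" {{']
    for node_id, label in statements.items():
        color = "red" if any(t in label for t in external_types) else "black"
        lines.append(f'  {node_id} [label="{label}", fontcolor={color}];')
    for src, dst in edges:
        lines.append(f"  {src} -> {dst};")
    lines.append("}")
    return "\n".join(lines)

print(flow_to_dot("write", {0: "BufferedWriter : new BufferedWriter(...)",
                            1: "write(data)"}, [(0, 1)]))
```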
6.4 Mantissa User Interface and Search Scenario
In this section, we present the user interface of Mantissa (Sect. 6.4.1) and provide an
example of using our tool for a realistic search scenario (Sect. 6.4.2).
17 http://www.graphviz.org/.
18 http://viz-js.com/.
19 http://flask.pocoo.org/.
the bottom of the screen, the user can select the CSE to be used for searching, either
AGORA or GitHub, using a dropdown menu, before hitting the “Search” button. In
the example of Fig. 6.18, the user has entered a query for a “Stack” component and
the corresponding test case.
Upon issuing a search, relevant results are presented to the user in the results
page of Fig. 6.19. As shown in this figure, the user can navigate the results as well as
possibly integrate any useful component in his/her code. Each result is accompanied
by information about the score given by the system, its physical lines of code, whether
it compiled and whether it successfully passed the tests. For instance, Fig. 6.19 depicts
the nineteenth result for the "Stack" query, which has a score of 0.96 and
consists of 27 lines of code; it compiled but did not pass the test. Finally, the
provided options (in the form of links) allow viewing the original file (i.e., before
any transformation by the transformer), showing the control flow for each function
of the file and showing the external dependencies in the form of imports.
In this section, we provide an example usage scenario to illustrate how Mantissa can
be a valuable asset for the developer. The scenario involves creating a duplicate file
checking application. The application will initially scan the files of a given folder and
store the MD5 representation for the contents of each file in a set. After that, any
new files will be checked using the set before being added to the folder.
The first step is to divide the application into different components. Typically, the
developer would split the application into a file reader/writer module, a serializable
set data structure that can be saved to/loaded from disk, and an implementation of
the MD5 algorithm. Using Mantissa, we are able to search for and retrieve these
components quite easily. For the file reader/writer, we construct the query shown in
Fig. 6.20.
A quick check in the results list reveals that several components provide this
functionality. In this case, the one selected would be the smallest of all compilable
components. Indicatively, the file write method of the file tools
component is shown in Fig. 6.21.
Fig. 6.20 Example signature for a component that reads and writes files
Fig. 6.21 Example write method of a component that reads and writes files
Further examining the source code flow of the methods for the selected result
indicates that methods of the Java standard library are employed, and therefore the
test, in this case, can be omitted for simplicity. For example, Fig. 6.22 depicts the
code flow for the method shown in Fig. 6.21, where it is clear that the main type
involved is the BufferedWriter, which is included in the standard library of Java.
For the MD5 algorithm, the query is shown in Fig. 6.23, while the corresponding
test case is shown in Fig. 6.24. The test involves checking the correct hashing of a
known string. In this case, the second result passes the test case and has no external
dependencies.
The next component is the serializable set data structure. The query in this case is
shown in Fig. 6.25 and the corresponding test case is shown in Fig. 6.26. Note that
the main functionality lies in the serialization part, and therefore this is covered by
the test case.
Finally, upon having downloaded the components, the developer can now easily
write the remaining code for creating the MD5 set, reading or writing it to disk, and
checking new files for duplicate information. The code for creating the database is
shown in Fig. 6.27.
Overall, the developer has to write less than 50 lines of code in order to con-
struct both the queries and the tests for this application. Upon finding executable
components that cover the desired functionality, the integration effort required is
also minimal. Finally, the main flow of the application (a part of which is shown in
Fig. 6.27) is quite simple, given that all implementation details of the components
are encapsulated in the reused code.
Fig. 6.22 Code flow for the write method of a component that handles files
6.5 Evaluation
We evaluate Mantissa against popular CSEs (the ones assessed also in the previous
chapter, i.e., GitHub Search, BlackDuck Open Hub, and AGORA), as well as against
two well-known RSSEs (FAST and Code Conjurer). For each of the three experi-
ments, we define and present the dataset used for the evaluation, while the results
are summarized in tables and illustrated in figures. Note that each experiment has a
Fig. 6.24 Example test case for the “MD5” component of Fig. 6.23
different dataset since some of the systems are no longer available, thus only their
own datasets could be used for evaluating them.
Our evaluation is based on the scenario of searching for a reusable software compo-
nent, as this was described in the evaluation of the previous chapter. In our scenario,
the developer provides a CSE or an RSSE with a query for the desired component.
The system returns relevant results and the developer examines them and deter-
mines whether they are relevant to the query. As before, we refrain from manually examining
the results ourselves, since this would pose serious threats to validity, and instead determine
whether a result is relevant by using automated testing (like with the tester module of
Mantissa). As a result, we do not assess the tester. The datasets used for evaluation
consist of queries for components and their corresponding test cases.
For each query, we use the number of compiled results and the number of results
that passed the tests as metrics. Furthermore, as in the previous chapter, we employ
Fig. 6.26 Example test case for the serializable set component of Fig. 6.25
Fig. 6.27 Example for constructing a set of MD5 strings for the files of folder folderPath and
writing it to a file with filename setFilename
the search length, which we define as the number of non-compilable results (or results
that did not pass the tests) that the developer must examine in order to find a number
of compilable results (or results that passed the tests). We again assess all the metrics
for the first 30 results for each system, since keeping more results did not have any
impact on the compilability or the test passing score, while 30 results are usually
more than a developer would examine in such a scenario.
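A minimal sketch of how the search length could be computed from a ranked result list follows; the boolean fields and the list representation are assumptions made for illustration:

```python
def search_length(results, n_relevant, key="compiled"):
    """Count the irrelevant results (e.g., non-compilable ones) that the developer
    must examine, top to bottom, before encountering n_relevant relevant ones
    (e.g., compilable or passing results)."""
    irrelevant, found = 0, 0
    for result in results:  # results are ordered as presented to the developer
        if result[key]:
            found += 1
            if found == n_relevant:
                return irrelevant
        else:
            irrelevant += 1
    return irrelevant  # fewer than n_relevant relevant results were returned
```

For instance, search_length(ranked, 1, key="passed") would yield the search length for the first result that passes the tests, as reported in Table 6.5.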
In this section, we extend our evaluation of CSEs presented in the previous chapter
in order to illustrate how using Mantissa improves on using only a CSE. As such, we
compare Mantissa against the well-known CSEs of GitHub and BlackDuck as well
as against our CSE, AGORA.
6.5.2.1 Dataset
Our dataset consists of 16 components and their corresponding test cases. These
components are of varying complexity, in order to fully assess the results of the CSEs against
Mantissa in the component reuse scenario. Our dataset is shown in Table 6.3.
The formation of the queries for the three CSEs is identical to the one described
in the previous chapter. The dataset, however, is updated to include 16 components.
The original 13 of them refer to classes, while the 3 new ones, “Prime”, “Sort”, and
“Tokenizer”, refer to functions. These components were added to further assess the
CSEs against Mantissa in scenarios where the required component is a function and
not a class. Further analyzing the dataset, we observe that there are simple queries,
some of them related to mathematics, for which we anticipate numerous results
Table 6.3 Dataset for the evaluation of Mantissa against code search engines
Class           Methods
Account         deposit, withdraw, getBalance
Article         setId, getId, setName, getName, setPrice, getPrice
Calculator      add, subtract, divide, multiply
ComplexNumber   ComplexNumber, add, getRealPart, getImaginaryPart
CardValidator   isValid
Customer        setAddress, getAddress
Gcd             gcd
Md5             md5Sum
Mortgage        setRate, setPrincipal, setYears, getMonthlyPayment
Movie           Movie, getTitle
Prime           checkPrime
Sort            sort
Spreadsheet     put, get
Stack           push, pop
Stack2          pushObject, popObject
Tokenizer       tokenize
(e.g., “Gcd” or “Prime”), and more complex queries (e.g., “Mortgage”) or queries
with rare names (e.g., “Spreadsheet”) that are more challenging for the systems.
Finally, although some queries might seem simple (e.g., “Calculator”), they may
have dependencies on third-party libraries that would lead to compilation errors.
6.5.2.2 Results
The results of the comparison of Mantissa with the CSEs are shown in Table 6.4. At
first, one may notice that both GitHub and AGORA, as well as the two implementa-
tions of Mantissa, returned at least 30 results for each query. By contrast, BlackDuck
returned fewer results in half of the queries, indicating that its retrieval mechanism
performs exact matching, and therefore is more sensitive to small changes in the
query. This is particularly obvious for the modified version of the “Stack” query,
where the service returned no results.
GitHub seems to return several results; however, few of them are compilable,
while most queries return results that do not pass the tests. This is expected, since the
GitHub search engine performs lexical matching and does not take the structure of
the queries (i.e., the expected signatures) into account.
Table 6.4 Compiled and passed results out of the total returned results for the three code search
engines and the two Mantissa implementations, for each result and on average, where the format is
passed/compiled/total results
Query            GitHub     AGORA      BlackDuck    Mantissa (AGORA)    Mantissa (GitHub)
Account 0/10/30 1/4/30 2/9/30 1/5/30 6/7/30
Article 1/6/30 0/5/30 2/3/15 5/9/30 2/8/30
Calculator 0/5/30 0/5/30 1/3/22 0/4/30 0/9/30
ComplexNumber 2/2/30 2/12/30 1/1/3 4/7/30 7/11/30
CardValidator 0/1/30 1/4/30 2/3/30 0/0/30 0/4/30
Customer 2/3/30 13/15/30 7/7/30 20/21/30 5/6/30
Gcd 1/3/30 5/9/30 12/16/30 5/9/30 20/22/30
Md5 3/7/30 2/10/30 0/0/0 4/7/30 1/3/30
Mortgage 0/6/30 0/4/30 0/0/0 0/7/30 0/1/30
Movie 1/3/30 2/2/30 3/7/30 2/2/30 1/5/30
Prime 4/13/30 1/4/30 1/2/15 1/1/30 4/14/30
Sort 0/7/30 0/1/30 0/4/30 1/4/30 4/5/30
Spreadsheet 0/1/30 1/2/30 0/0/6 1/2/30 1/3/30
Stack 0/3/30 7/8/30 8/12/30 14/15/30 10/11/30
Stack2 0/0/30 7/8/30 0/0/0 14/15/30 1/1/30
Tokenizer 0/8/30 0/4/30 2/7/30 0/3/30 1/4/30
Average results 30.0 30.0 18.8 30.0 30.0
Average compiled 4.875 6.062 4.625 6.938 7.125
Average passed 0.875 2.625 2.562 4.500 3.938
Table 6.5 Search length for the passed results for the three code search engines and the two
Mantissa implementations, for each result and on average
Query            GitHub     AGORA      BlackDuck    Mantissa (AGORA)    Mantissa (GitHub)
Account 30 2 0 0 0
Article 21 30 2 11 1
Calculator 30 30 0 30 30
ComplexNumber 0 0 0 0 0
CardValidator 30 15 2 30 30
Customer 0 0 1 0 0
Gcd 25 8 5 0 0
Md5 9 0 30 1 2
Mortgage 30 30 30 30 30
Movie 23 21 1 22 7
Prime 1 10 0 6 0
Sort 30 30 30 0 1
Spreadsheet 30 23 30 8 9
Stack 30 2 4 0 3
Stack2 30 2 30 0 0
Tokenizer 30 30 14 30 8
Average passed 21.812 14.562 11.188 10.500 7.562
In this section, we evaluate Mantissa against the FAST plugin [34]. For this comparison,
we use the dataset provided for the evaluation of FAST, since the plugin employs
the deprecated CSE Merobase and is therefore no longer functional.
6.5.3.1 Dataset
6.5.3.2 Results
Table 6.7 presents the results for Mantissa and FAST, where it is clear that Mantissa
returns more compilable results as well as more results that pass the tests than the
FAST system. Further analyzing the results, FAST is effective for only half of the
queries, while there are several queries where its results are not compilable. Both Mantissa
implementations returned numerous results that passed the tests, even in cases where FAST was not
effective (e.g., “Md5” and “SumArray” classes). Similar to our previous experiment,
mathematical queries, such as “FinancialCalculator” and “Prime”, were challeng-
ing for all systems, possibly indicating dependencies on third-party libraries. It is
worth noticing that the GitHub implementation of Mantissa seems more promising
than the AGORA one. However, this is expected since GitHub has millions of repos-
itories, while the index of AGORA contains fewer repositories, which may not include
mathematical functions such as those in the dataset of FAST.
The average number of compilable results and passed results are also shown in
Fig. 6.30, where it is clear that the Mantissa implementations outperform FAST.
In this subsection, we evaluate Mantissa against Code Conjurer [13]. Code Conjurer
has two search approaches: the “Interface-based” approach and the “Adaptation”
approach. The first conducts exact matching between the query and the retrieved
components, while the second conducts flexible matching, ignoring method names,
types, etc. Similarly to FAST, Code Conjurer is also currently not functional, and thus
we use the provided dataset to compare it to our system. Finally, since the number
of compilable results or the ranking of the results is not provided for Code Conjurer,
we compare the systems only with respect to the number of results that pass the tests.
Table 6.7 Compiled and passed results out of the total returned results for FAST and the two
Mantissa implementations, for each result and on average, where the format for each value is
passed/compiled/total results
Query            FAST          Mantissa (AGORA)    Mantissa (GitHub)
CardValidator 1/1/1 0/0/30 0/4/30
Customer 0/0/5 10/10/30 0/3/30
Fibonacci 4/11/30 2/2/30 4/12/30
FinancialCalculator 0/0/0 0/5/30 0/2/30
Gcd 12/12/30 5/9/30 20/22/30
isDigit 2/2/30 6/6/30 10/12/30
isLeapYear 4/4/30 1/2/30 2/3/30
isPalindrome 4/4/9 1/3/30 6/18/30
Md5 0/0/24 4/7/30 1/3/30
Mortgage 0/0/0 0/2/30 0/5/30
Prime 3/4/30 1/1/30 5/14/30
Sort 1/1/30 1/4/30 4/5/30
Stack 10/10/30 12/14/30 9/14/30
StackInit 3/4/30 5/12/30 5/14/30
SumArray 0/0/30 5/9/30 1/8/30
Average results 20.6 30.0 30.0
Average compiled 3.533 5.733 9.267
Average passed 2.933 3.533 4.467
Table 6.8 Dataset for the evaluation of Mantissa against Code Conjurer
Class               Methods
Calculator          add, sub, div, mult
ComplexNumber       ComplexNumber, add, getRealPart, getImaginaryPart
Matrix              Matrix, set, get, multiply
MortgageCalculator  setRate, setPrincipal, setYears, getMonthlyPayment
Spreadsheet         put, get
Stack               push, pop
6.5.4.1 Dataset
The dataset, which is shown in Table 6.8, consists of six different components and six
corresponding test cases. Compared to the dataset of [13], we omitted the “ShoppingCart”
component because it depends on the external “Product” class, and therefore
it could not be executed. The dataset is rather limited; however, it includes different
types of queries that could lead to interesting conclusions. Complex queries with mul-
tiple methods, such as “ComplexNumber” or “MortgageCalculator”, are expected to
be challenging for the systems. On the other hand, queries that include mathemati-
cal components, such as “Calculator” or “Matrix”, may involve dependencies on
third-party libraries, and hence they may also be challenging for the systems.
6.5.4.2 Results
The results of our comparison, shown in Table 6.9, are highly dependent on the given
queries for all systems. One of the main advantages of Code Conjurer, especially of
its adaptation approach, is its ability to find numerous results. The approach retrieves
the maximum number of files that comply with the given query from the Merobase
CSE and attempts to execute the test cases by performing certain transformations. In
this respect, it is not directly comparable with our system, since Mantissa imposes thresholds
on the number of files retrieved from the CSEs. This allows Mantissa to present
meaningful results for a query in a timely manner, i.e., within at most 3 min.
Using the maximum number of files has a strong impact on the response time
of Code Conjurer. The adaptation approach in some cases requires several hours
to return tested components. The interface-based approach provides results in a
more timely manner, exhibiting, however, unexpected deviations in the response time
among different queries (e.g., 26 min for the “Stack” query).
Concerning the effectiveness of the interface-based approach of Code Conjurer
in comparison with Mantissa, it is worth noticing that each system performs better
in different queries. Code Conjurer is more effective for finding tested components
Table 6.9 Results that passed the tests and response time for the two approaches of Code Conjurer
(Interface-based and Adaptation) and the two implementations of Mantissa (using AGORA and
using GitHub), for each result and on average
Query               Code Conjurer (Interface)   Code Conjurer (Adaptation)   Mantissa (AGORA)   Mantissa (GitHub)
                    Passed   Time               Passed   Time                Passed   Time      Passed   Time
Calculator          1        19 s               22       20 h 24 m           0        25 s      0        2 m 35 s
ComplexNumber       0        3 s                30       1 m 19 s            4        35 s      7        2 m 8 s
Matrix              2        23 s               26       5 m 25 s            0        31 s      0        2 m 51 s
MortgageCalculator  0        4 s                15       3 h 19 m            0        31 s      0        2 m 18 s
Spreadsheet         0        3 s                4        15 h 13 m           1        23 s      1        2 m 28 s
Stack               30       26 m               30       18 h 23 m           14       30 s      10       2 m 53 s
Average             5.500    4 m 28 s           21.167   9 h 34 m 17 s       3.167    29 s      3.000    2 m 32 s
for the “Calculator” and “Matrix” queries, while both implementations of Mantissa
retrieve useful components for the “ComplexNumber” and “Spreadsheet” queries.
Notably, all systems found several results for the “Stack” query, while “Mortgage-
Calculator” was too complex to produce tested components in any system. Note,
however, that Code Conjurer retrieved and tested 692 components for the “Stack” query
(requiring 26 min), whereas the Mantissa implementations were restricted to 200
components.
Finally, given that the dataset contains only six queries, and given that using
a different dataset was not possible since Merobase is no longer functional, it is hard
to arrive at definitive conclusions about the effectiveness of the systems. However, this
experiment indicates that Mantissa has a steady response time. As shown in Table
6.9, the response time of the AGORA implementation is approximately 35 s, while
the GitHub implementation returns tested components within roughly 2.5 min. Both
implementations are consistent, having quite small deviations from these averages.
6.6 Conclusion
In this chapter, we presented Mantissa, a recommendation system for test-driven code reuse,
which retrieves components from code search engines, transforms them to match the developer's
query, and employs the provided test cases to verify the
functionality of the retrieved source code files. Our evaluation showed that Mantissa
outperforms CSEs, while its comparison with two RSSEs indicates that it presents
useful results in a timely manner and outperforms the other systems in several sce-
narios.
Future work on Mantissa lies in several directions. Regarding the tester compo-
nent, it would be interesting to work on resolving component dependencies from
libraries in order to increase the number of results that could be compiled and pass
the tests. Additionally, conducting a developer survey could help better assess
Mantissa and understand which modules are of most value to developers.
References
1. Walker RJ (2013) Recent advances in recommendation systems for software engineering. In:
Ali M, Bosse T, Hindriks KV, Hoogendoorn M, Jonker CM, Treur J (eds) Recent trends in
applied artificial intelligence, vol 7906. Lecture notes in computer science. Springer, Berlin,
pp 372–381
2. Nurolahzade M, Walker RJ, Maurer F (2013) An assessment of test-driven reuse: promises
and pitfalls. In: Favaro J, Morisio M (eds) Safe and secure software reuse, vol 7925.
Lecture notes in computer science. Springer, Berlin Heidelberg, pp 65–80
3. McIlroy MD (1968) Mass-produced software components. In: Naur P, Randell B (eds) Software
engineering: report of a conference sponsored by the NATO Science Committee. NATO Scientific
Affairs Division, Brussels, Belgium, pp 138–155
4. Mens K, Lozano A (2014) Source code-based recommendation systems. Springer, Berlin, pp
93–130
5. Janjic W, Hummel O, Atkinson C (2014) Reuse-oriented code recommendation systems.
Springer, Berlin, pp 359–386
6. Robillard M, Walker R, Zimmermann T (2010) Recommendation systems for software engi-
neering. IEEE Softw 27(4):80–86
7. Gasparic M, Janes A (2016) What recommendation systems for software engineering recom-
mend. J Syst Softw 113(C):101–113
8. Sahavechaphan N, Claypool K (2006) XSnippet: mining for sample code. SIGPLAN Not
41(10):413–430
9. Thummalapenta S, Xie T (2007) PARSEWeb: a programmer assistant for reusing open source
code on the web. In: Proceedings of the 22nd IEEE/ACM international conference on automated
software engineering, ASE ’07, pp. 204–213, New York, NY, USA. ACM
10. Xie T, Pei J (2006) MAPO: mining API usages from open source repositories. In: Proceedings
of the 2006 international workshop on mining software repositories, MSR ’06, pp 54–57, New
York, NY, USA. ACM
11. Wei Y, Chandrasekaran N, Gulwani S, Hamadi Y (2015) Building bing developer assistant.
Technical Report MSR-TR-2015-36, Microsoft Research
12. Galenson J, Reames P, Bodik R, Hartmann B, Sen K (2014) CodeHint: dynamic and interactive
synthesis of code snippets. In: Proceedings of the 36th international conference on software
engineering, ICSE 2014, pp 653–663, New York, NY, USA. ACM
13. Hummel O, Janjic W, Atkinson C (2008) Code conjurer: pulling reusable software out of thin
air. IEEE Softw 25(5):45–52
14. Lemos OAL, Bajracharya SK, Ossher J, Morla RS, Masiero PC, Baldi P, Lopes CV (2007)
CodeGenie: using test-cases to search and reuse source code. In: Proceedings of the Twenty-
second IEEE/ACM international conference on automated software engineering, ASE ’07, pp
525–526, New York, NY, USA. ACM
33. Linstead E, Bajracharya S, Ngo T, Rigor P, Lopes C, Baldi P (2009) Sourcerer: mining and
searching internet-scale software repositories. Data Min Knowl Discov 18(2):300–336
34. Krug M (2007) FAST: an eclipse plug-in for test-driven reuse. Master’s thesis, University of
Mannheim
35. Reiss SP (2009) Semantics-based code search. In: Proceedings of the 31st international confer-
ence on software engineering, ICSE ’09, pp 243–253, Washington, DC, USA. IEEE Computer
Society
36. Reiss SP (2009) Specifying what to search for. In: Proceedings of the 2009 ICSE workshop on
search-driven development-users, infrastructure, tools and evaluation, SUITE ’09, pp 41–44,
Washington, DC, USA. IEEE Computer Society
37. Diamantopoulos T, Katirtzis N, Symeonidis A (2018) Mantissa: a recommendation system for
test-driven code reuse. Unpublished manuscript
38. Rivest R (1992) The MD5 message-digest algorithm
39. Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge
University Press, New York
40. Gale D, Shapley LS (1962) College admissions and the stability of marriage. Am Math Mon
69(1):9–15
41. Bird S, Klein E, Loper E (2009) Natural language processing with python, 1st edn. O’Reilly
Media, Inc., Sebastopol
42. Levenshtein VI (1966) Binary codes capable of correcting deletions, insertions and reversals.
Soviet Physics Doklady 10:707
43. Jaccard P (1901) Étude comparative de la distribution florale dans une portion des Alpes et du
Jura. Bulletin de la Société Vaudoise des Sciences Naturelles 37:547–579
44. Tanimoto TT (1957) IBM Internal Report
45. Diamantopoulos T, Symeonidis AL (2015) Employing source code information to improve
question-answering in stack overflow. In: Proceedings of the 12th working conference on
mining software repositories, MSR ’15, pp 454–457, Piscataway, NJ, USA. IEEE Press
Chapter 7
Mining Source Code for Snippet Reuse
7.1 Overview
In this chapter, we design and develop CodeCatch, a system that receives queries
in natural language, and employs the Google search engine to extract useful snip-
pets from multiple online sources. Our system further evaluates the readability of
the retrieved snippets, as well as their preference/acceptance by the developer com-
munity using information from online repositories. Moreover, CodeCatch performs
clustering to group snippets according to their API calls, allowing the developer to
first select the desired API implementation and subsequently choose which snippet
to use.
7.2 State of the Art on Snippet and API Mining
As already mentioned, in this chapter we focus on systems that receive queries for
solving specific programming tasks and recommend source code snippets suitable
for reuse in the developer’s source code. Some of the first systems of this kind, such
as Prospector [10] or PARSEWeb [11], focused on the problem of finding a path
between an input and an output object in source code. For Prospector [10], such paths
are called jungloids and the resulting program flow is called a jungloid graph. The tool
is quite effective for certain reuse scenarios and can also generate code. However, it
requires maintaining a local database, which may easily become outdated. A rather
broader solution was offered by PARSEWeb [11], which employed the Google
Code Search Engine (now discontinued) and thus the resulting snippets were always up-to-date. Both
systems, however, consider that the developer knows exactly which API objects to
use, and he/she is only concerned with integrating them.
Another category of systems is those that generate API usage examples in the form
of call sequences by mining client code (i.e., code using the API under analysis).
One such system is MAPO [1], which employs frequent sequence mining in order to
identify common usage patterns. As noted, however, by Wang et al. [2], MAPO does
not account for the diversity of usage patterns, and thus outputs a large number of
API examples, many of which are redundant. To improve on this aspect, the authors
propose UP-Miner [2], a system that aims to achieve high coverage and succinctness.
UP-Miner models client source code using graphs and mines frequent closed API call
paths/sequences using the BIDE algorithm [12]. A screenshot of UP-Miner is shown
in Fig. 7.1. The query in this case concerned the API call SQLConnection.Open,
and, as one may see in this figure, the results include different scenarios, e.g., for
reading (pattern 252) or for executing transactions (pattern 54), while the pattern is
also presented graphically at the right part of the screen.
PAM [3] is another similar system that employs probabilistic machine learning to
extract API call sequences, which are shown to be more representative than those of
MAPO and UP-Miner. An interesting novelty of PAM [3] is the use of an automated
evaluation framework based on handwritten usage examples by the developers of
the API under analysis. CLAMS [13] also employs the same evaluation process to
illustrate how effective clustering can lead to patterns that are precise and at the
same time diverse. The system also employs summarization techniques to produce
snippets that are more readable and, thus, more easily reusable by the developer.
Apart from the aforementioned systems, which extract API call sequences, there
are also approaches that recommend ready-to-use snippets. Indicatively, we refer
to APIMiner [4], a system that performs code slicing to isolate useful API-relevant
statements of snippets. Buse and Weimer [5] further employ path-sensitive data
flow analysis and pattern abstraction techniques to provide more abstract snippets.
Another important advantage of their implementation is that it employs clustering to
group the resulting snippets into categories. A similar system in this respect is eXoaDocs [6],
as it also clusters snippets, however, using a set of semantic features proposed
by the DECKARD code clone detection algorithm [14]. The system provides exam-
ple snippets using the methods of an API and also allows receiving feedback from
the developer. An example screenshot of eXoaDocs is shown in Fig. 7.2.
Though interesting, all of the aforementioned approaches provide usage exam-
ples for specific API methods, and do not address the challenge of choosing which
library to use. Furthermore, several of these approaches output API call sequences,
instead of ready-to-use solutions in the form of snippets. Finally, none of the afore-
mentioned systems accept queries in natural language, which are certainly preferable
when trying to formulate a programming task without knowing which APIs to use
beforehand.
To address the above challenges, several recent systems focus on generic snippets
and employ some type of processing for natural language queries. Such an exam-
ple system is SnipMatch [7], the snippet recommender of the Eclipse IDE, which
also incorporates several interesting features, including variable renaming for easier
snippet integration. However, its index has to be built by the developer who has to pro-
vide the snippets and the corresponding textual descriptions. A relevant system also
offered as an Eclipse plugin is Blueprint [8]. Blueprint employs the Google search
engine to discover and rank snippets, thus ensuring that useful results are retrieved
for almost any query. An even more advanced system is Bing Code Search [9], which
employs the Bing search engine for finding relevant snippets, and further introduces
a multi-parameter ranking system for snippets as well as a set of transformations to
adapt the snippet to the source code of the developer.
The aforementioned systems, however, do not provide a choice of implementa-
tions to the developer. Furthermore, most of them do not assess the retrieved snippets
from a reusability perspective. This is crucial, as libraries that are preferred by devel-
opers typically exhibit high quality and good documentation, while they are obviously
supported by a larger community [15, 16]. In this chapter, we present CodeCatch,
a snippet mining system designed to overcome the above limitations. CodeCatch
employs the Google search engine in order to receive queries in natural language
and at the same time extract snippets from multiple online sources. As opposed to
current systems, our tool assesses not only the quality (readability) of the snippets but
also their reusability/preference by the developers. Furthermore, CodeCatch employs
clustering techniques in order to group the snippets according to their API calls, and
thus allows the developer to easily distinguish among different implementations.
7.3 CodeCatch Snippet Recommender
The architecture of CodeCatch is shown in Fig. 7.3. The input of the system is a query
given in natural language to the Downloader module, which is then posted to the
Google search engine, to extract code snippets from the result pages. Consequently,
the Parser extracts the API calls of the snippets, while the Reusability Evaluator scores
the snippets according to whether they are widely used/preferred by developers.
Additionally, the readability of the snippets is assessed by the Readability Evaluator.
Finally, the Clusterer groups the snippets according to their API calls, while the
Presenter ranks them and presents them to the developer. These modules are analyzed
in the following subsections.
7.3.1 Downloader
The downloader receives as input the query of the developer in natural language and
posts it in order to retrieve snippets from multiple sources. An example query that
will be used throughout this section is the query “How to read a CSV file”. The
downloader receives the query and augments it before issuing it in the Google search
engine. Note that our methodology is programming language-agnostic; however,
without loss of generality we focus in this work on the Java programming language.
In order to ensure that the results returned by the search engine will be targeted
to the Java language, the query augmentation is performed using the Java-related
keywords java, class, interface, public, protected, abstract, final, static, import,
if, for, void, int, long, and double. Similar lists of keywords can be constructed for
supporting other languages.
The URLs that are returned by Google are scraped using Scrapy.2 Specifically, upon
retrieving the top 40 web pages, we extract text from HTML tags such as <pre>
and <code>. Scraping from those tags allows us to gather the majority (around 92%
as measured) of code content from web pages. Apart from the code, we also collect
information relevant to the webpage, including the URL of the result, its rank at the
Google search results list, and the relative position of each code fragment inside the
page.
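A rough sketch of the query augmentation and tag extraction steps is shown below. The authors use Scrapy for crawling; here, as an assumption, the HTML parsing is illustrated with BeautifulSoup, and fetching the Google result URLs is abstracted away:

```python
from bs4 import BeautifulSoup

# Java-related keywords used to steer the search engine toward Java results
JAVA_KEYWORDS = ["java", "class", "interface", "public", "protected", "abstract",
                 "final", "static", "import", "if", "for", "void", "int", "long", "double"]

def augment_query(query: str) -> str:
    """Append the Java-related keywords to the natural language query."""
    return query + " " + " ".join(JAVA_KEYWORDS)

def extract_snippets(html: str, url: str, rank: int):
    """Collect the text of <pre> and <code> tags together with bookkeeping
    information: page URL, rank in the result list, position inside the page."""
    soup = BeautifulSoup(html, "html.parser")
    return [{"code": tag.get_text(), "url": url, "rank": rank, "position": i}
            for i, tag in enumerate(soup.find_all(["pre", "code"]))]
```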
7.3.2 Parser
The snippets are then parsed using the sequence extractor described in [17], which
extracts the AST of each snippet and takes two passes over it, one to extract all
(non-generic) type declarations (including fields and variables), and one to extract
the method invocations (API calls). Consider, for example, the snippet of Fig. 7.4.
During the first pass, the parser extracts the declarations line: String, br: BufferedReader,
data: String[], and e: Exception. Then, upon removing the generic dec-
larations (i.e., literals, strings, and exceptions), the parser traverses the AST for a
second time and extracts the relevant method invocations, which are highlighted in
Fig. 7.4. The caller of each method invocation is replaced by its type (apart from
constructors for which types are already known), to finally produce the API calls
FileReader.__init__, BufferedReader.__init__, BufferedReader.readLine, and
BufferedReader.close, where __init__ signifies a call to a constructor.
Note that the parser is quite robust even in cases when the snippets are not com-
pilable, while it also effectively isolates API calls that are not related to a type (since
generic calls, such as close, would only add noise to the invocations). Finally, any
2 https://scrapy.org/.
snippets not referring to Java source code and/or not producing API calls are dropped
at this stage.
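The actual parser of [17] operates on the AST; a deliberately simplified, regex-based approximation of the two passes is sketched below to make the idea concrete (the patterns and the list of generic types are assumptions and would miss many Java constructs):

```python
import re

DECL_RE = re.compile(r"\b([A-Z]\w*)(?:<[^>]*>)?(?:\[\])?\s+(\w+)\s*[=;)]")  # Type name
CALL_RE = re.compile(r"\b(\w+)\s*\.\s*(\w+)\s*\(")                          # caller.method(
CTOR_RE = re.compile(r"\bnew\s+([A-Z]\w*)\s*\(")                            # new Type(
GENERIC_TYPES = {"String", "Exception", "Object", "Integer"}                # dropped as too generic

def extract_api_calls(snippet: str):
    """First pass: map declared variables to their (non-generic) types.
    Second pass: rewrite invocations as Type.method and constructors as Type.__init__."""
    var_types = {name: typ for typ, name in DECL_RE.findall(snippet)
                 if typ not in GENERIC_TYPES}
    calls = [f"{t}.__init__" for t in CTOR_RE.findall(snippet) if t not in GENERIC_TYPES]
    calls += [f"{var_types[caller]}.{method}"
              for caller, method in CALL_RE.findall(snippet) if caller in var_types]
    return calls
```

On a snippet like the one in Fig. 7.4, such an approximation would produce calls such as FileReader.__init__, BufferedReader.__init__, BufferedReader.readLine, and BufferedReader.close.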
7.3.3 Reusability Evaluator
Upon gathering the snippets and extracting their API calls, the next step is to deter-
mine whether they are expected to be of use to the developer. In this context of
reusability, we want to direct the developer toward what we call common practice,
and, to do so, we make the assumption that snippets with API calls commonly used
by other developers are more likely to be of (re)use. This is a reasonable assump-
tion since answers to common programming questions tend to appear often in
the code of different projects. As a result, we designed the Reusability Evaluator by
downloading a set of high-quality projects and determining the amount of reuse for
the API calls of each snippet.
For this task, we have used the 1000 most popular Java projects of GitHub, as
determined by the number of stars assigned. These were drawn from the index of
AGORA (see Chap. 5). The rationale behind this choice of projects is highly intuitive
and is also strongly supported by current research; popular projects have been found to
exhibit high quality [15], while they contain reusable source code [18] and sufficient
documentation [16]. As a result, we expect that these projects use the most effective
APIs in a good way.
Upon downloading the projects, we construct a local index where we store their
API calls, which are extracted using the Parser.3 After that, we score each API call by
dividing the number of projects in which it is present by the total number of projects.
For the score of each snippet, we average over the scores of its API calls. Finally, the
index also contains all qualified names so that we may easily retrieve them given a
caller object (e.g., BufferedReader: java.io.BufferedReader).
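A minimal sketch of this scoring, assuming the per-project API calls have already been extracted with the Parser, could look as follows (the data structures are illustrative):

```python
def api_call_scores(project_calls):
    """project_calls: one set of API calls per indexed project. The score of an
    API call is the fraction of projects in which it appears."""
    counts = {}
    for calls in project_calls:
        for call in calls:
            counts[call] = counts.get(call, 0) + 1
    return {call: count / len(project_calls) for call, count in counts.items()}

def snippet_score(snippet_calls, scores):
    """The reusability score of a snippet is the average score of its API calls
    (calls not found in the index contribute zero)."""
    if not snippet_calls:
        return 0.0
    return sum(scores.get(call, 0.0) for call in snippet_calls) / len(snippet_calls)
```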
7.3.4 Readability Evaluator
To construct a model for the readability of snippets, we used the publicly available
dataset from [19] that contains 12,000 human judgements by 120 annotators on 100
snippets of code. We build our model as a binary classifier that assesses whether a
code snippet is more readable or less readable. At first, for each snippet, we extract
a set of features that are related to readability, including the average line length, the
average identifier length, the average number of comments, etc. (see [19] for the full list of features).
3 As a side note, this local index is only used to assess the reusability of the components; our search,
however, is not limited within it (as is the case with other systems) as we employ a search engine
and crawl multiple pages. To further ensure that our reusability evaluator is always up-to-date, we
could rebuild its index along with the rebuild cycles of the index of AGORA.
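As an indication of how such a readability model could be assembled, the sketch below extracts three of the named surface features and feeds them to a generic binary classifier; the choice of classifier and the exact feature set are assumptions, since the full model of [19] relies on many more features:

```python
import re
from sklearn.linear_model import LogisticRegression

def readability_features(snippet: str):
    """Average line length, average identifier length, and comment density."""
    lines = [l for l in snippet.splitlines() if l.strip()]
    identifiers = re.findall(r"[A-Za-z_]\w*", snippet)
    comments = [l for l in lines if l.strip().startswith(("//", "/*", "*"))]
    return [
        sum(len(l) for l in lines) / max(len(lines), 1),
        sum(len(i) for i in identifiers) / max(len(identifiers), 1),
        len(comments) / max(len(lines), 1),
    ]

# X: feature vectors of the annotated snippets, y: more/less readable labels
# classifier = LogisticRegression().fit(X, y)
# is_readable = classifier.predict([readability_features(new_snippet)])[0]
```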
7.3.5 Clusterer
Upon scoring the snippets for both their reusability and their readability, the next
step is to cluster them according to the different implementations. In order to perform
clustering, we first have to extract the proper features from the retrieved snippets. A
simple approach would be to cluster the snippets by examining them as text docu-
ments; however, this approach would fail to distinguish among different implemen-
tations. Consider, for example, the snippet of Fig. 7.4 along with that of Fig. 7.5. If
we remove any punctuation and compare the two snippets, we may find out that more
than 60% of the tokens of the second snippet are also present in the first. The two
snippets, however, are quite different; they have different API calls and thus refer to
different implementations.
As a result, we decided to cluster code snippets based on their API calls. To do
so, we employ a Vector Space Model (VSM) to represent snippets as documents and
API calls as vectors (dimensions). Thus, at first, we construct a document for each
snippet based on its API calls. For example, the document for the snippet of Fig. 7.4
is “FileReader.__init__ BufferedReader.__init__ BufferedReader.readLine
BufferedReader.close”, while the document for that of Fig. 7.5 is “File.__init__
Scanner.__init__ Scanner.hasNext Scanner.next Scanner.close”. After that,
we use a tf-idf vectorizer to extract the vector representation for each document. The
weight (vector value) of each term t in a document d is computed by the following
equation:
equation:

$$w(t, d) = tf(t, d) \times idf(t, D) \qquad (7.1)$$
where tf(t, d) is the term frequency of term t in document d, i.e., the number of
appearances of the API call in the snippet, while idf(t, D) is the inverse document
frequency of term t in the set of all documents D, referring to how common the API
call is in all the retrieved snippets. Specifically, idf(t, D) is computed as follows:
$$idf(t, D) = 1 + \log \frac{1 + |D|}{1 + |d_t|} \qquad (7.2)$$
where |d_t| is the number of documents containing the term t, i.e., the number of
snippets containing the relevant API call. The idf ensures that very common API
calls (e.g., Exception.printStackTrace) are given low weights, so that they do not
outweigh more decisive invocations.
Before clustering, we also need to define a distance metric that shall be used to
measure the similarity between two vectors. Our measure of choice is the cosine
similarity, which is defined for two document vectors d1 and d2 using the following
equation:
$$\mathrm{cosine\_similarity}(d_1, d_2) = \frac{d_1 \cdot d_2}{|d_1|\,|d_2|} = \frac{\sum_{i=1}^{N} w_{t_i,d_1}\, w_{t_i,d_2}}{\sqrt{\sum_{i=1}^{N} w_{t_i,d_1}^2}\,\sqrt{\sum_{i=1}^{N} w_{t_i,d_2}^2}} \qquad (7.3)$$
where w_{t_i,d_1} and w_{t_i,d_2} are the tf-idf scores of term t_i in documents d_1 and d_2, respectively,
and N is the total number of terms.
We select K-Means as our clustering algorithm, as it is known to be effective
in text clustering problems similar to ours [20]. The algorithm, however, still has
an important limitation as it requires as input the number of clusters. Therefore, to
automatically determine the best value for the number of clusters, we employ the
silhouette metric. The silhouette was selected as it encompasses both the similarity
of the snippets within the cluster (cohesion) and their difference with the snippets
of other clusters (separation). As a result, we expect that using the value of the
silhouette as an optimization parameter should yield clusters that clearly correspond
to different implementations. We execute K-Means for 2–8 clusters, and each time
we compute the value of the silhouette coefficient for each document (snippet) using
the following equation:
$$\mathrm{silhouette}(d) = \frac{b(d) - a(d)}{\max(a(d), b(d))} \qquad (7.4)$$
where a(d) is the average distance of document d from all other documents in
the same cluster, while b(d) is computed by measuring the average distance of d
from the documents of each of the other clusters and keeping the lowest one of
these values (each corresponding to a cluster). For both parameters, the distances
between documents are measured using Eq. (7.3). Finally, the silhouette coefficient
for a cluster is given as the mean of the silhouette values of its snippets, while the
total silhouette for all clusters is given by averaging over the silhouette values of all
snippets.
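A sketch of this clustering step using scikit-learn is given below; it is an approximation of the described setup, since scikit-learn's K-Means operates on Euclidean distances (on L2-normalized tf-idf vectors this is monotonically related to the cosine distance), while the silhouette is computed with the cosine metric as above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_snippets(api_call_docs, k_range=range(2, 9)):
    """api_call_docs: one space-separated string of API calls per snippet, e.g.
    "FileReader.__init__ BufferedReader.__init__ BufferedReader.readLine".
    Returns the labels of the clustering with the highest mean silhouette."""
    # keep whole API calls (with dots and underscores) as single terms
    vectorizer = TfidfVectorizer(token_pattern=r"\S+", lowercase=False)
    vectors = vectorizer.fit_transform(api_call_docs)
    best_labels, best_score = None, -1.0
    for k in k_range:  # try 2-8 clusters, as described above
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vectors)
        score = silhouette_score(vectors, labels, metric="cosine")
        if score > best_score:
            best_labels, best_score = labels, score
    return best_labels, best_score
```

Note that the default smoothed idf of TfidfVectorizer coincides with Eq. (7.2).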
We present an example silhouette analysis for the query “How to read a CSV
file”. Figure 7.6 depicts the silhouette score for 2–8 clusters, where it is clear that the
optimal number of clusters is 5.
Furthermore, the individual silhouette values for the documents (snippets) of
the five clusters are shown in Fig. 7.7, and they also confirm that the clustering is
effective as most samples exhibit high silhouette and only a few have marginally
negative values.
7.3.6 Presenter
The presenter is the final module of CodeCatch and handles the ranking and the
presentation of the results. We have developed the CodeCatch prototype as a web
Fig. 7.8 Screenshot of CodeCatch for query “How to read a CSV file”, depicting the top three
clusters
application. When the developer inserts a query, he/she is first presented with the
clusters that correspond to different implementations for the query.
An indicative view of the first three clusters containing CSV file reading imple-
mentations is shown in Fig. 7.8. The proposed implementations of the clusters involve
the BufferedReader API (e.g., as in Fig. 7.4), the Scanner API (e.g., as in Fig. 7.5), and
the Java CSV reader API.4 The clusters are ordered according to their API reusabil-
ity score, which is the average of the score of each of their snippets, as defined in
Sect. 7.3.3. For each cluster, CodeCatch provides the five most frequent API imports
and the five most frequent API calls, to help the developer distinguish among the
different implementations. In cases where imports are not present in the snippets,
they are extracted using the index created in Sect. 7.3.3.
Upon selecting to explore a cluster, the developer is presented with a list of its
included snippets. The snippets within a cluster are ranked according to their API
reusability score, and in cases of equal scores according to their distance from the
cluster centroid (computed using Eq. (7.3)). This ensures that the most common
usages of a specific API implementation are higher on the list. Furthermore, for each
snippet, CodeCatch provides useful information, as shown in Fig. 7.9, including its
reusability score (API Score), its distance from the centroid, its readability (either
Low or High), the position of its URL in the results of the Google search engine and
its order inside the URL, its number of API calls, and its number of lines of code.
Finally, apart from immediately reusing the snippet, the developer has the option
to isolate only the code that involves its API calls, while he/she can also check the
webpage from which the snippet was retrieved.
4 https://gist.github.com/jaysridhar/d61ea9cbede617606256933378d71751.
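The ranking scheme described above can be summarized in a few lines; the dictionary keys used here are hypothetical:

```python
def rank_clusters(clusters):
    """Order clusters by their average API reusability score (descending)."""
    return sorted(clusters, key=lambda c: c["api_score"], reverse=True)

def rank_snippets(cluster):
    """Within a cluster, order snippets by API reusability score (descending),
    breaking ties by cosine distance from the cluster centroid (ascending)."""
    return sorted(cluster["snippets"],
                  key=lambda s: (-s["api_score"], s["centroid_distance"]))
```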
Fig. 7.9 Screenshot of CodeCatch for query “How to read a CSV file”, depicting an example
snippet
Table 7.1 API clusters of CodeCatch for query “How to upload file to FTP”
ID Library API Score (%)
1 Apache commons org.apache.commons.net.ftp.* 83.57
2 Jscape com.jscape.inet.ftp.* 11.39
3 EnterpriseDT edtFTPj com.enterprisedt.net.ftp.* 4.78
4 Spring framework org.springframework.integration.ftp.* 0.26
In this case, we assume that the developer has built the relevant application, which
scans the files of a given folder and checks their MD5 checksums before adding them
to a given path. Let us now assume that the next step involves uploading these files
to an FTP server.
FTP file handling is a challenge that is typically addressed using external
libraries/frameworks; therefore, the process of writing code to perform this task is
well suited to CodeCatch. Thus, the developer, who does not necessarily know
about any components and/or APIs, initially issues the simple textual query “How to
upload file to FTP”. In this case, CodeCatch returns four result clusters, each linked
to a different API (and, thus, a different framework). The returned API clusters are
summarized in Table 7.1 (the API column is returned by the service, while the library
column is provided here for clarity).
As one may easily observe, there are different equivalent implementations for
uploading files to an FTP server. The developer could, of course, choose any of
them, depending on whether he/she is already familiar with any of these libraries.
However, in most scenarios (and let us assume also in our scenario), the developer
would select to use the most highly preferred implementation, which in this case is
the implementation offered by Apache Commons. Finally, upon selecting to enter
the first cluster, the developer could review the snippets and select one that fits his/her
purpose, such as the one shown in Fig. 7.10.
Fig. 7.10 Example snippet for “How to upload file to FTP” using Apache Commons
7.5 Evaluation
Comparing CodeCatch with similar approaches is not straightforward,
as several of them focus on mining single APIs, while others are
not maintained and/or are not publicly available. Our focus is mainly on the merit of
reuse for results, and the system that is most similar to ours is Bing Code Search [9];
however, it targets the C# programming language. Hence, we have decided to per-
form a reusability-related evaluation against the Google search engine on a dataset
of common queries shown in Table 7.2.
The purpose of our evaluation is twofold; we wish not only to assess whether
the snippets of our system are relevant, but also to determine whether the developer
can indeed more easily find snippets for all different APIs relevant to a query. Thus,
at first, we annotate the retrieved snippets for all the queries as relevant and non-
relevant. To maintain an objective and systematic outlook, the annotation procedure
was performed without any knowledge of the ranking of the snippets. For the same
reason, the annotation was kept as simple as possible; snippets were marked as
relevant if and only if their code covers the functionality described by the query.
That is, for the query “How to read a CSV file”, any snippets used to read a CSV file
were considered relevant, regardless of their size or complexity, and of any libraries
involved, etc.
As already mentioned, the snippets are assigned to clusters, where each cluster
involves different API usages and thus corresponds to a different implementation.
As a result, we have to assess the relevance of the results per cluster, hence assuming
that the developer would first select the desired implementation and then navigate
into the cluster. To do so, we compare the snippets of each cluster (i.e., of each
implementation) to the results of the Google search engine. CodeCatch clusters
already provide lists of snippets, while for Google, we construct one by assuming
that the developer opens the first URL, subsequently examines the snippets of this
URL from top to bottom, then he/she opens the second URL, etc.
When assessing the results of each cluster, we wish to find snippets that are
relevant not only to the query but also to the corresponding API usages. As a result,
for the assessment of each cluster, we further annotate the results of both systems to
consider them relevant when they are also part of the corresponding implementation.
This, arguably, produces less effective snippet lists for the Google search engine;
note, however, that our purpose is not to challenge the results of the Google search
engine in terms of relevance to the query, but rather to illustrate how easy or hard it is
for the developer to examine the results and isolate the different ways of answering
his/her query.
For each query, upon having constructed the lists of snippets for each cluster and
for Google, we compare them using the reciprocal rank metric. This metric was
selected as it is commonly used to assess information retrieval systems in general
and also systems similar to ours [9]. Given a list of results, the reciprocal rank for a
query is computed as the inverse of the rank of the first relevant result. For example,
if the first relevant result is in the first position, then the reciprocal rank is 1/1 = 1,
if the result is in the second position, then the reciprocal rank is 1/2 = 0.5, etc.
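The metric is straightforward to compute; a short sketch (with an assumed relevance predicate) is shown below:

```python
def reciprocal_rank(results, is_relevant):
    """Return 1 / rank of the first relevant result, or 0 if none is relevant."""
    for rank, result in enumerate(results, start=1):
        if is_relevant(result):
            return 1.0 / rank
    return 0.0
```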
Figure 7.11 depicts the reciprocal rank of CodeCatch and Google for the snippets
corresponding to the three most popular implementations for each query. At first,
interpreting this graph in terms of the relevance of the results indicates that both
systems are very effective. Specifically, if we consider that the developer would require
a relevant snippet regardless of the implementation, then for most queries, both
CodeCatch and Google produce a relevant result in the first position (i.e., reciprocal
rank equal to 1).
If, however, we focus on all different implementations for each query, we can make
certain interesting observations. Consider, for example, the first query (“How to read
a CSV file”). In this case, if the developer requires the most popular BufferedReader
implementation (I1), both CodeCatch and Google output a relevant snippet in the
first position. Similarly, if one wished to use the Scanner implementation (I2) or
the Java CSV reader (I3), our system would return a ready-to-use snippet in the top
of the second cluster or in the second position of the third cluster (i.e., reciprocal
rank equal to 0.5). On the other hand, using Google would require examining more
results (3 and 50 results for I2 and I3, respectively, as the corresponding reciprocal
ranks are equal to 0.33 and 0.02, respectively). Similar conclusions can be drawn for
most queries.
Another important point of comparison of the two systems is whether they return
the most popular implementations at the top of their list. CodeCatch is clearly more
effective than Google in this aspect. Consider, for example, the sixth query; in this
case, the most popular implementation is found in the third position of Google, while
the snippet found in its first position corresponds to a less popular implementation.
This is also clear in several other queries (i.e., queries 2, 3, 5, 7). Thus, one could
argue that CodeCatch not only provides the developer with all the different API
implementations for his/her query but also helps him/her select the most
popular of these implementations, which is usually the preferable one.
Fig. 7.11 Reciprocal Rank of CodeCatch and Google for the three most popular implementations
(I1, I2, I3) of each query
7.6 Conclusion
In this chapter, we proposed a system that extracts snippets from online sources and
further assesses their readability as well as their reusability based on the preference
of developers. Moreover, our system, CodeCatch, provides a comprehensive view
of the retrieved snippets by grouping them into clusters that correspond to different
implementations. This way, the developer can select among possible solutions to
common programming queries even in cases when he/she does not know which API
to use beforehand.
Future work on CodeCatch lies in several directions. At first, we may extend our
ranking scheme to include the position of each snippet’s URL in the Google results,
etc. Furthermore, snippet summarization can be performed using information from
the clustering (e.g., removing statements that appear in few snippets within a cluster).
Finally, an interesting idea would be to conduct a developer study in order to further
assess CodeCatch for its effectiveness in retrieving useful source code snippets.
References
1. Xie T, Pei J (2006) MAPO: mining API usages from open source repositories. In: Proceedings
of the 2006 international workshop on mining software repositories, MSR ’06, pp 54–57, New
York, NY, USA. ACM
2. Wang J, Dang Y, Zhang H, Chen K, Xie T, Zhang D (2013) Mining succinct and high-coverage
API usage patterns from source code. In: Proceedings of the 10th working conference on mining
software repositories, pp 319–328, Piscataway, NJ, USA. IEEE Press
3. Fowkes J, Sutton C (2016) Parameter-free probabilistic API mining across GitHub. In: Pro-
ceedings of the 2016 24th ACM SIGSOFT international symposium on foundations of software
engineering, pp 254–265, New York, NY, USA. ACM
4. Montandon JE, Borges H, Felix D, Valente MT (2013) Documenting APIs with examples:
lessons learned with the APIMiner platform. In: 2013 20th working conference on reverse
engineering (WCRE), pp 401–408, Piscataway, NJ, USA. IEEE Computer Society
5. Buse RPL, Weimer W (2012) Synthesizing API usage examples. In: Proceedings of the 34th
international conference on software engineering, ICSE ’12, pp 782–792, Piscataway, NJ, USA.
IEEE Press
6. Kim J, Lee S, Hwang SW, Kim S (2010) Towards an intelligent code search engine. In: Proceed-
ings of the Twenty-Fourth AAAI conference on artificial intelligence, AAAI’10, pp 1358–1363,
Palo Alto, CA, USA. AAAI Press
7. Wightman D, Ye Z, Brandt J, Vertegaal R (2012) SnipMatch: using source code context to
enhance snippet retrieval and parameterization. In: Proceedings of the 25th annual ACM sym-
posium on user interface software and technology, UIST ’12, pp 219–228, New York, NY,
USA. ACM
8. Brandt J, Dontcheva M, Weskamp M, Klemmer SR (2010) Example-centric programming:
integrating web search into the development environment. In: Proceedings of the SIGCHI
conference on human factors in computing systems, CHI ’10, pp 513–522, New York, NY,
USA. ACM
9. Wei Y, Chandrasekaran N, Gulwani S, Hamadi Y (2015) Building bing developer assistant.
Technical Report MSR-TR-2015-36, Microsoft Research
10. Mandelin D, Lin X, Bodík R, Kimelman D (2005) Jungloid mining: helping to navigate the
API jungle. SIGPLAN Not 40(6):48–61
11. Thummalapenta S, Xie T (2007) PARSEWeb: a programmer assistant for reusing open source
code on the web. In: Proceedings of the 22nd IEEE/ACM international conference on automated
software engineering, ASE ’07, pp 204–213, New York, NY, USA. ACM
12. Wang J, Han J (2004) BIDE: efficient mining of frequent closed sequences. In: Proceedings of
the 20th international conference on data engineering, ICDE ’04, pp 79–90, Washington, DC,
USA. IEEE Computer Society
13. Katirtzis N, Diamantopoulos T, Sutton C (2018) Summarizing software API usage examples
using clustering techniques. In: 21th international conference on fundamental approaches to
software engineering, pp 189–206, Cham. Springer International Publishing
14. Jiang L, Misherghi G, Su Z, Glondu S (2007) DECKARD: scalable and accurate tree-based
detection of code clones. In: Proceedings of the 29th international conference on software
engineering, ICSE ’07, pp 96–105, Washington, DC, USA. IEEE Computer Society
15. Papamichail M, Diamantopoulos T, Symeonidis AL (2016) User-perceived source code quality
estimation based on static analysis metrics. In: Proceedings of the 2016 IEEE international
conference on software quality, reliability and security, QRS, pp 100–107, Vienna, Austria
16. Aggarwal K, Hindle A, Stroulia E (2014) Co-evolution of project documentation and popularity
within Github. In: Proceedings of the 11th working conference on mining software repositories,
MSR 2014, pp 360–363, New York, NY, USA. ACM
17. Diamantopoulos T, Symeonidis AL (2015) Employing source code information to improve
question-answering in stack overflow. In: Proceedings of the 12th working conference on
mining software repositories, MSR ’15, pp 454–457, Piscataway, NJ, USA. IEEE Press
8.1 Overview
In the previous chapter, we analyzed the challenges a developer faces when trying
to find solutions to programming queries online. We focused on API-agnostic usage
queries that are issued by the developer in order to find implementations that satisfy
the required functionality. In this chapter, we focus on a challenge that follows from
snippet mining, that is, of finding out whether a snippet chosen (or generally written)
by the developer is correct and also optimal for the task at hand. Specifically, we
assume that the developer has either written a source code snippet or found and
integrated it in his/her source code, and yet the snippet does not execute as expected.
In such a scenario, a common process followed involves searching online, possibly
in question-answering communities like Stack Overflow, in order to probe on how
other users have handled the problem at hand, or whether they have already found a
solution. In case an existing answer to the problem is not found, the developer
usually submits a new question post, hoping that the community will come to
his/her aid.
Submitting a new question post in Stack Overflow requires several steps; the
developer has to form a question with a clear title, further explain the problem in
the question body, isolate any relevant source code fragments, and possibly assign
tags to the question (e.g., “swing” or “android”) to draw the attention of community
members that are familiar with the specific technologies/frameworks. In most cases,
however, the first step before forming a question post is to find out whether the same
question (or a similar one) is already posted. To do so, one can initially perform a
search on question titles. Further improving the results requires including tags or
question bodies. Probably the most effective query includes the title, the tags, and
the body, i.e., the developer has to form a complete question post. This process is
obviously complex (for the newcomer) and time-consuming; for example, the source
code of the problem may not be easily isolated, or the developer may not know which
tags to use.
In this chapter, we explore the problem of finding similar question posts in Stack
Overflow, given different types of information. Apart from titles, tags, or question
bodies, we explore whether a developer could search for similar questions
using source code fragments, thus not requiring him/her to fully form a question. Our main
contribution is a question similarity scheme which can be used to recommend similar
question posts to users of the community trying to find a solution to their problem [1].
Using our scheme, developers shall be able to check whether their snippets (either
handwritten or reused from online sources) are good choices for the functionality that
has to be covered. As a side note, since this scheme could be useful for identifying
similar questions, it may also prove useful to community members or moderators so
that they identify linked or duplicate questions, or even so that they try to find similar
questions to answer and thus contribute to the community.
Our methodology is applied to the official data dump of Stack Overflow as of Septem-
ber 26, 2014, provided by [2]. We have used Elasticsearch [3] to store the data for
our analysis, since it provides indexing and supports the execution of fast queries
on large chunks of data. Storage in Elasticsearch is simple; any record of data is a
document, documents are stored inside collections, and collections are stored in an
index (see Chap. 5 for more information on Elasticsearch-based architectures). We
have defined a posts collection, which contains the Java questions of Stack Overflow
(determined by the “java” tag).1
An example Stack Overflow question post is depicted in Fig. 8.1. The main elements
of the post are: the title (shown at the top of Fig. 8.1), the body, and the tags (at the
bottom of Fig. 8.1). These three elements are stored and indexed for each question
post in our Elasticsearch server to allow for retrieving the most relevant questions.
The title and the body are stored as text, while the tags are stored as a list of keywords.
Since the body is in html, we have also employed an html parser to extract the source
code snippets for each question.
1 Although our methodology is mostly language-agnostic, we have focused on Java Stack Overflow questions.
This subsection discusses how the elements of each post are indexed in Elasticsearch.
As already noted in Chap. 5, Elasticsearch not only stores the data but also creates and
stores analyzed data for each field, so that the index can be searched in an efficient
manner.
The title of each question post is stored in the index using the standard analyzer
of Elasticsearch. This analyzer comprises a tokenizer that splits the text into tokens
(i.e., lowercase words) and a token filter that removes certain common stop words
(in our case the common English stop words). The tags are stored in an array field
without further analysis, since they are already split and in lowercase. The body of
each question post is analyzed using an html analyzer, which is similar to the standard
analyzer, however, it first removes the html tags using the html_strip character filter
of Elasticsearch. Titles, tags, and question bodies are stored under the fields “Title”,
“Tags”, and “Body”, respectively.
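As an illustration, the index described above could be defined as follows using the Python Elasticsearch client; the index and field names follow the text, while the analyzer names, the 7.x-style request body, and the server URL are assumptions of this sketch rather than part of the original implementation.

```python
from elasticsearch import Elasticsearch  # elasticsearch-py client (7.x-style API)

es = Elasticsearch("http://localhost:9200")

index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                # standard tokenizer, lowercasing, and English stop word removal
                "title_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "stop"],
                },
                # same analysis, but html tags are stripped first
                "html_analyzer": {
                    "type": "custom",
                    "char_filter": ["html_strip"],
                    "tokenizer": "standard",
                    "filter": ["lowercase", "stop"],
                },
            }
        }
    },
    "mappings": {
        "properties": {
            "Title": {"type": "text", "analyzer": "title_analyzer"},
            "Tags": {"type": "keyword"},  # stored as-is, no further analysis
            "Body": {"type": "text", "analyzer": "html_analyzer"},
        }
    },
}

es.indices.create(index="posts", body=index_body)
```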
Concerning the snippets extracted from the question bodies, we had to find a
representation that describes not only their terms (which are already included in the
“Body” field), but also their structure. Source code representation is an interesting
problem for which several solutions can be proposed. Although complex represen-
tations, such as the snippet’s Abstract Syntax Tree (AST) or its program dependency
graph, are rich in information, they cannot be used in the case of incomplete snippets
such as the one in Fig. 8.1. For example, extracting structural information from the
AST and using matching heuristics between classes and methods, as in [4], is not pos-
sible since we may not have information about classes and methods (e.g., inheritance,
declared variables, etc.).
8.3 Methodology
This section discusses our methodology, illustrating how text, tags, and snippets
can be used to find similar questions. A query on the index may contain one or
more of the “Title”, “Tags”, “Body”, and “Snippets” fields. When searching for a
question post, we calculate a score that is the average of the scores for each of the provided fields. The score for each field is normalized in the [0, 1] range, so that
the significance of all fields is equal. The following subsections illustrate how the
scores are computed for each field.
The similarity between two segments of text, either titles or bodies, is computed
using the term frequency-inverse document frequency (tf-idf). Upon splitting the
text/document into tokens/terms (which is already handled by Elasticsearch), we use
the vector space model, where each term is the value of a dimension of the model.
The frequency of each term in each document (tf) is computed as the square root of
the number of times the term appears in the document, while the inverse document
frequency (idf) is the logarithm of the total number of documents divided by the
number of documents that contain the term. The final score between two documents
is computed using the cosine similarity between them:
$$\mathit{score}(d_1, d_2) = \frac{d_1 \cdot d_2}{|d_1| \cdot |d_2|} = \frac{\sum_{i=1}^{N} w_{t_i,d_1} \cdot w_{t_i,d_2}}{\sqrt{\sum_{i=1}^{N} w_{t_i,d_1}^2} \cdot \sqrt{\sum_{i=1}^{N} w_{t_i,d_2}^2}} \quad (8.1)$$

where $d_1, d_2$ are the two documents, $N$ is the number of terms, and $w_{t_i,d_j}$ is the tf-idf score of term $t_i$ in document $d_j$.
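Elasticsearch computes such scores internally; purely for illustration, a minimal Python sketch of the tf (square root of the term count), idf (logarithm of the document ratio), and cosine similarity described above could look as follows (the corpus is assumed to be a list of token lists).

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, corpus):
    """tf = sqrt(times the term appears in the document);
    idf = log(total documents / documents containing the term)."""
    counts = Counter(doc_tokens)
    n_docs = len(corpus)
    vector = {}
    for term, count in counts.items():
        df = sum(1 for doc in corpus if term in doc) or 1
        vector[term] = math.sqrt(count) * math.log(n_docs / df)
    return vector

def cosine_score(v1, v2):
    """Cosine similarity of Eq. (8.1) over sparse tf-idf vectors."""
    dot = sum(w * v2.get(term, 0.0) for term, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0
```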
Since the values of “Tags” are unique, we can compute the similarity between “Tags”
values by handling the lists as sets. Thus, we define the similarity as the Jaccard index
between the sets. Given two sets, T1 and T2 , their Jaccard index is the size of their
intersection divided by the size of their union:
$$J(T_1, T_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|} \quad (8.2)$$
Finally, note that we exclude the tag “java” given the selected dataset, as it is not
useful for the scoring.
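A minimal sketch of the tag similarity of Eq. (8.2), with the "java" tag excluded as described above (the example tags are hypothetical):

```python
def tag_similarity(tags1, tags2):
    """Jaccard index of Eq. (8.2); the dataset-wide "java" tag is ignored."""
    t1, t2 = set(tags1) - {"java"}, set(tags2) - {"java"}
    if not t1 and not t2:
        return 0.0
    return len(t1 & t2) / len(t1 | t2)

# hypothetical example: two posts sharing one of three distinct tags
print(tag_similarity(["java", "ftp", "uploading"], ["java", "ftp", "sockets"]))  # ≈ 0.33
```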
Upon having extracted sequences from snippets (see Sect. 8.2.2), we define a sim-
ilarity metric between two sequences based on their Longest Common Subse-
quence (LCS). Given two sequences S1 and S2 , their LCS is defined as the longest
subsequence that is common to both sequences. Note that a subsequence is not
required to contain consecutive elements of any sequence. For example, given two
sequences S1 = [A, B, D, C, B, D, C] and S2 = [B, D, C, A, B, C], their LCS is
LC S(S1 , S2 ) = [B, D, C, B, C]. The LCS problem for two sequences S1 and S2 can
be solved using dynamic programming. The computational complexity of the solu-
tion is O(m × n), where m and n are the lengths of the two sequences [11], so the
algorithm is quite fast.
Upon having found the LCS between two snippet sequences S1 and S2 , the final
score for the similarity between them is computed by the following equation:
$$\mathit{score}(S_1, S_2) = 2 \cdot \frac{|LCS(S_1, S_2)|}{|S_1| + |S_2|} \quad (8.3)$$
Since the length of the LCS is at most equal to the length of the shorter sequence, the fraction of it divided by the sum of the two sequence lengths lies in the range [0, 0.5]. Hence, the score of the above equation lies in the range [0, 1].
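A minimal Python sketch of the LCS-based snippet similarity, using the example sequences given above:

```python
def lcs_length(s1, s2):
    """Classic O(m*n) dynamic programming solution for the LCS length."""
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def snippet_score(s1, s2):
    """Similarity of Eq. (8.3): 2*|LCS| / (|S1| + |S2|), in [0, 1]."""
    if not s1 or not s2:
        return 0.0
    return 2 * lcs_length(s1, s2) / (len(s1) + len(s2))

s1 = ["A", "B", "D", "C", "B", "D", "C"]
s2 = ["B", "D", "C", "A", "B", "C"]
print(lcs_length(s1, s2))    # 5, i.e. [B, D, C, B, C]
print(snippet_score(s1, s2)) # 2*5/13 ≈ 0.769
```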
For example, if we calculate the score between the sequence of Fig. 8.2 and that
of Fig. 8.3, the length of the LCS is 5, and the lengths of the two sequences are 10
and 6. Thus, the score between the two sequences, using Eq. (8.3), is 0.625.
Table 8.2 Example Stack Overflow questions that are similar to “uploading to FTP using java”
ID Title Score
1 How to resume the interrupted File upload in ftp using Java? 0.668
2 Testing file uploading and downloading speed using FTP 0.603
3 URLConnection slow to call getOutputStream using an FTP url 0.596
4 How to delete file from ftp server using Java? 0.573
5 Storing file name to database after uploading to ftp server 0.558
6 Copy all directories to server by FTP using Java 0.553
7 Getting corrupted JPG while uploading using FTP connection? 0.526
8 How to give FTP address in Java? 0.518
9 Can not connect to FTP server from heroku using apache client 0.501
10 FTP site works but I’m unable to connect from Java program 0.467
The example of this section continues the scenario of Chaps. 6 and 7. In specific, using our component reuse methodology (Chap. 6), the
developer has built a duplicate file check application that scans the files of a folder
and checks their MD5 checksums. Any unique files are then uploaded to an FTP
server. The implementation for connecting to an FTP server and uploading files was
supported using our snippet mining methodology (Chap. 7). The code snippet for the
FTP implementation is presented in Fig. 7.10.
Assuming the developer has already integrated the snippet in his/her source code, several courses of action are now possible. For instance, it may be useful to find
code for deleting some files erroneously placed on the FTP server or to find code for
resuming the upload in case of some interruption or even to find code for debugging
the application if there is some failure. What the developer could do in any of these
cases is to build a question and post it on Stack Overflow, waiting for some useful
answer.
In fact, since the snippet of Fig. 7.10 originates from a Stack Overflow post that contains the question “Uploading to FTP using Java”, with the tags “java”, “ftp”, and “uploading”, we may safely assume that the developer would provide similar input. Using our approach, the query formed by the developer could be issued as input to the
methodology defined in Sect. 8.3. In this case, the retrieved similar questions could
provide valuable input for any of the aforementioned tasks. In specific, by issuing
the query, the 10 most relevant question posts are shown in Table 8.2.
The results are quite likely to be useful. For instance, if the developer wanted
to issue the query in order to find out how to also delete files via FTP (instead of
only adding them), he/she would be presented with an answer prompting him/her to
employ the Apache FTPClient as in Fig. 8.4.
8.5 Evaluation
Evaluating the similarity of question posts in Stack Overflow is difficult, since the data are not annotated for this type of test. We use the presence of a link between
two questions (also given by [2]) as an indicator of whether the questions are similar.
Although this assumption seems rational, the opposite assumption, i.e., that any non-
linked questions are not similar, is not necessarily true. In fact, the Stack Overflow
data dump has approximately 700,000 Java question posts, out of which roughly
300,000 have snippets that result in sequences (i.e., snippets with at least one assign-
ment, function call, or class instantiation), and on average each of these posts has
0.98 links. In our scenario, however, the problem is formed as a search problem, so
the main issue is to detect whether the linked (and thus similar) questions can indeed
be retrieved.
In our evaluation, we have included only questions with at least 5 links (including
duplicates which are also links according to the Stack Overflow data schema).2 For
performance reasons, all queries are at first performed on the question title, and then
the first 1000 results are searched again using both title and any other fields, all in an
Elasticsearch bool query. We used eight settings (given the title is always provided,
the settings are all the combinations of tags, body, and snippets) to evaluate our
methodology. For each question we keep the first 20 results, assuming this is the maximum a developer would examine (using other values yielded similar results).
The diagram of Fig. 8.5 depicts the average percentage of relevant results (com-
pared to the total number of relevant links for each question) in the first 20 results of
a query, regardless of the ranking.
These results involve all questions of the dataset that have source code snippets
(approximately 300,000), even if the sequence extracted from them has length equal
2 Our methodology supports all questions, regardless of the number of links. However, in the context
of our evaluation, we may assume that questions with fewer links may be too localized and/or may
not have similar question posts.
Fig. 8.6 Percentage of relevant results in the first 20 results of each query, evaluated for questions with snippet sequence length larger than or equal to (a) 3 and (b) 5
8.6 Conclusion
In this chapter, we have explored the problem of finding relevant questions in Stack
Overflow. We have designed a similarity scheme which exploits not only the title,
tags, and body of a question, but also its source code snippets. The results of our
evaluation indicate that snippets are a valuable source of information for detecting
links or for determining whether a question is already posted. As a result, using
our system, the developer is able to easily form a question post even using only
an already existing snippet in order to validate his/her code or make sure that a
third-party solution is consistent with what is proposed by the community.
Although snippet matching is a difficult task, we believe that our methodology
is a step toward the right direction. Future research includes further adapting the
methodology to the special characteristics of different snippets, e.g., by clustering
them according to their size or their complexity. Additionally, different structural
representations for source code snippets may be tested in order to improve on the
similarity scheme.
References
9.1 Overview
The architecture of QualBoa is shown in Fig. 9.1. The architecture is language- and service-agnostic, since it can support any programming language and integrate with various code hosting services. We
employ Boa and GitHub as the code hosting services, and implement the language-
specific components for the Java programming language.
Initially, the developer provides a query in the form of a signature (step 1). As
in Mantissa (see Chap. 6), a signature is similar to a Java interface and comprises
all methods of the desired class component. An example signature for a “Stack”
component is shown in Fig. 9.2.
The signature is given as input to the Parser. The Parser, which is implemented
using Eclipse JDT,1 extracts the Abstract Syntax Tree (AST) (step 2). Upon extracting
the AST for the query, the Boa Downloader parses it and constructs a Boa query
for relevant components, including also calculations for quality metrics. The query
is submitted to the Boa engine [5] (step 3), and the response is a list of paths to Java
files and the values of quality metrics for these files. The result list is given to the
GitHub Downloader (step 4), which downloads the files from GitHub (step 5). The
files are then given to the Parser (step 6), so that their ASTs are extracted (step 7).
The ASTs of the downloaded files are further processed by the Functional Scorer,
which creates the functional score for each result (step 8). The Functional Score of a
file denotes whether it fulfills the functionality posed by the query. The Reusability
1 http://www.eclipse.org/jdt/.
Scorer receives as input the metrics for the files (step 9) and generates a reusabil-
ity score (step 10), denoting the extent to which each file is reusable. Finally, the
developer (Client) is provided with the scores for all files, and the results are ranked
according to their Functional Score.
Upon extracting the elements of the query, i.e., class name, method names, and types,
QualBoa finds useful software components and computes their metrics using Boa
[5]. A simplified version of the query is shown in Fig. 9.3.
The query follows the visitor pattern to traverse the ASTs of the Java files in
Boa. The first part of the query involves visiting the type declarations of all files and
determining whether they match the given class name (class_name). The methods
of any matched files are further visited to check whether their names and types match
those of the query (method_names and method_types). The number of matched
elements is aggregated to rank the results and a list with the highest ranked 150
results is given to the GitHub Downloader, which retrieves the Java files from the
GitHub API.
The second part of the query involves computing source code metrics for the
retrieved files. Figure 9.3 depicts the code for computing the number of public fields.
We extend the query to compute the metrics shown in Table 9.1 (for which the ratio-
nale is provided in Sect. 9.2.4).
The Functional Scorer computes the similarity between the AST of each result and the AST of the query, in order to rank the results according to the
functional desiderata of the developer. Initially, both the query and each examined
result file is represented as a list. The list of elements for a result file is defined in the
following equation:

$$\mathit{result} = [\mathit{name}, \mathit{method}_1, \mathit{method}_2, \ldots, \mathit{method}_n] \quad (9.1)$$

where $\mathit{name}$ is the name of the class and $\mathit{method}_i$ is a sublist for the $i$th method of the class, out of $n$ methods in total. The sublist of a method is defined as follows:

$$\mathit{method}_i = [\mathit{name}, \mathit{type}, \mathit{param}_1, \mathit{param}_2, \ldots, \mathit{param}_m] \quad (9.2)$$

where $\mathit{name}$ is the name of the method, $\mathit{type}$ is its return type, and $\mathit{param}_j$ is the type of its $j$th parameter, out of $m$ parameters in total. Using Eqs. (9.1) and (9.2), we
can represent each result, as well as the query, as a nested list structure.
Comparing a query to a result requires computing a similarity score between the
two corresponding lists, which in turn implies computing the score between each
pair of methods. Note that computing the maximum similarity score between two
lists of methods requires finding the score of each individual pair of methods and
selecting the highest scoring pairs. Following the approach of the stable marriage
problem (see Chap. 6), this is accomplished by ordering the pairs according to their
similarity and selecting them one-by-one off the top, checking whether either method of the pair has already been matched. The same procedure is also applied for optimally matching
the parameters between the two method lists.
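A minimal sketch of this greedy, stable-marriage-style selection; the pairwise scoring function (described in the following paragraph) is passed in as a parameter, and the data structures are assumptions of this sketch rather than the actual QualBoa implementation.

```python
def match_pairs(query_methods, result_methods, pair_score):
    """Greedily pick the highest-scoring (query, result) method pairs,
    never matching the same method twice."""
    pairs = [
        (pair_score(qm, rm), qi, ri)
        for qi, qm in enumerate(query_methods)
        for ri, rm in enumerate(result_methods)
    ]
    pairs.sort(reverse=True)  # best-scoring pairs first
    matched_q, matched_r, selected = set(), set(), []
    for score, qi, ri in pairs:
        if qi not in matched_q and ri not in matched_r:
            matched_q.add(qi)
            matched_r.add(ri)
            selected.append((qi, ri, score))
    return selected
```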
Class names, method names, and types, as well as parameter types are matched
using a token set similarity approach. Since in Java identifiers follow the camelCase
convention, we first split each string into tokens and compute the Jaccard index for
the two token sets. Given two sets, the Jaccard index is defined as the size of their
intersection divided by the size of their union. Finally, the similarity between two
lists or vectors A and B (either at method or at class/result level) is computed using the Tanimoto coefficient [6] of the vectors, $A \cdot B / (|A|^2 + |B|^2 - A \cdot B)$, where $|A|$ and $|B|$ are the magnitudes of vectors A and B, and $A \cdot B$ is their inner product.
For instance, given the query list [Stack, [push, void, Object], [pop, Object]] and the component list [IntStack, [pushObject, void, int], [popObject, int]], the similarity between the method pair for the “push” functionality is computed by applying the token set similarity to each pair of corresponding elements and then taking the Tanimoto coefficient between the resulting vector and the query vector, where the query vector always has all of its elements set to 1, since each query element constitutes a perfect match with itself. The score for the “pop” functionality is computed in the same manner.
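For illustration, a minimal Python sketch of the token set similarity and Tanimoto computation applied to the “push” method pair of this example; the helper names are hypothetical, identifiers with consecutive capitals (e.g., acronyms) are not handled, and the printed values are only what this sketch produces under these assumptions.

```python
import re

def tokens(identifier):
    """Split a camelCase identifier into a set of lowercase tokens."""
    return set(re.sub(r"(?<!^)(?=[A-Z])", " ", identifier).lower().split())

def token_similarity(a, b):
    """Jaccard index of the two token sets."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

def tanimoto(a, b):
    """Tanimoto coefficient A.B / (|A|^2 + |B|^2 - A.B)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sum(x * x for x in a) + sum(y * y for y in b) - dot)

# "push" method of the query vs "pushObject" method of the IntStack component
query_method = ["push", "void", "Object"]
result_method = ["pushObject", "void", "int"]
b = [token_similarity(q, r) for q, r in zip(query_method, result_method)]
a = [1.0] * len(b)  # the query vector is a perfect match with itself
print(b)              # [0.5, 1.0, 0.0]
print(tanimoto(a, b)) # ≈ 0.545
```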
Upon constructing a functional score for the retrieved components, QualBoa checks
whether each component is suitable for reuse using source code metrics. Although the
problem of measuring the reusability of a component using source code metrics has
been studied extensively [7–10], the choice of metrics is not trivial; research efforts
include employing the C&K metrics [8], coupling and cohesion metrics [9], and
several other metrics referring to volume and complexity [7]. In any case, however,
the main quality axes that measure whether a component is reusable are common.
In accordance with the definitions of different quality characteristics [11] and the
current state of the art [8, 10, 12], reusability spans across the modularity, usability,
maintainability, and understandability concepts.
Our reusability model, shown in Table 9.2, includes eight metrics that refer to one
or more of the aforementioned quality characteristics. These metrics cover several
aspects of a source code component, including volume, complexity, coupling, and
cohesion. Apart from covering the above criteria, the eight metrics were selected as
they can be computed by the Boa infrastructure. The reader is referred to the following
chapter for a more detailed model on reusability. This model is fairly simple, marking
the value of each metric as normal or extreme according to the thresholds shown in the
second column of Table 9.2. When a value is flagged as normal, it contributes one quality point to each of the relevant characteristics. Reusability includes all eight metrics. Thus, given, e.g., a component with extreme values for 2 out of the 8 metrics, the reusability score would be 6 out of 8.
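A minimal sketch of this scoring scheme, using the default thresholds of Table 9.2; the metric key names and the input format are assumptions of this sketch.

```python
# Default "extreme value" thresholds from Table 9.2 (configurable in QualBoa)
THRESHOLDS = {
    "avg_lines_of_code_per_method": 30,
    "avg_cyclomatic_complexity": 8,
    "coupling_between_objects": 20,
    "lack_of_cohesion_in_methods": 20,
    "avg_block_depth": 3,
    "efferent_couplings": 20,
    "public_fields": 10,
    "public_methods": 30,
}

def reusability_score(metrics, thresholds=THRESHOLDS):
    """Count how many metric values stay within their normal range;
    e.g. 2 extreme values out of 8 metrics yield a score of 6/8."""
    normal = sum(1 for name, limit in thresholds.items()
                 if metrics.get(name, 0) <= limit)
    return normal / len(thresholds)
```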
Determining appropriate thresholds for quality metrics is non-trivial, since differ-
ent types of software may have to reach specific quality objectives. Thus, QualBoa
offers the ability to configure these thresholds, while their default values are set
Table 9.2 The reusability model of QualBoa. Each metric is flagged as extreme when its value exceeds the threshold given in parentheses; every metric counts toward Reusability and also maps to one or more of the remaining quality characteristics (Modularity, Maintainability, Usability, Understandability):
- Average lines of code per method (extreme if >30)
- Average cyclomatic complexity (>8)
- Coupling Between Objects (>20)
- Lack of cohesion in methods (>20)
- Average block depth (>3)
- Efferent couplings (>20)
- Number of public fields (>10)
- Number of public methods (>30)
#Metrics per quality characteristic: Modularity 3, Maintainability 4, Usability 4, Understandability 3, Reusability 8
according to the current state of the practice, as defined by current research [13] and
widely used static code analysis tools, such as PMD2 and CodePro AnalytiX.3
9.3 Evaluation
We evaluated QualBoa on the query dataset of Table 9.3, which contains seven queries
for different types of components. This dataset is similar to the one presented in
Chap. 6 for the evaluation of Mantissa against Code Conjurer [1]. Note, however,
that this time the backend of our RSSE is Boa so the queries have to be adjusted
in order to retrieve useful results. All queries returned results from Boa, except
MortgageCalculator, for which we adjusted the class name to Mortgage, and
the query filters to account only for the setRate method. Note, however, that the
Functional Scorer of QualBoa uses all methods.
For each query, we examine the first 30 results of QualBoa, and mark each result
as useful or not useful, considering a result as useful if integrating it in the developer’s
code would require minimal or no effort. In other words, useful components are the
ones that are understandable and at the same time cover the required functionality.
Benchmarking and annotation were performed separately to ensure minimal threats
to validity.
Given these marked results, we check the number of relevant results retrieved for each query, and also compute the average precision for the result rankings. The
average precision was considered as the most appropriate metric since it covers
not only the relevance of the results but also, more importantly, their ranking. We
further assess the files using the reusability model of QualBoa and, upon normalizing
to percentages, report the average reusability score for the useful/relevant results
2 https://pmd.github.io/.
3 https://developers.google.com/java-dev-tools/codepro/.
Fig. 9.4 Evaluation results of QualBoa depicting a the average precision and b the average reusabil-
ity score for each query
of each query. The results of our evaluation are summarized in Table 9.4, while
Fig. 9.4 graphically depicts the average precision value for each query and the average
reusability score for each query.
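For reference, the average precision of a ranked result list can be computed as in the following sketch, which uses one common definition (averaging the precision at each relevant result over the relevant results retrieved); the relevance flags in the example are hypothetical.

```python
def average_precision(relevance):
    """relevance: list of booleans for the ranked results of one query,
    e.g. [True, True, False, True] -> (1/1 + 2/2 + 3/4) / 3 ≈ 0.917"""
    hits, precisions = 0, []
    for rank, relevant in enumerate(relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0
```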
Concerning the number of relevant results, QualBoa seems to be quite effective.
In specific, our RSSE successfully retrieves at least 10 useful results for 5 out of 7
queries. Additionally, almost all queries have average precision scores higher than
75%, indicating that the results are also ranked correctly. This is particularly obvious
in cases where the number of retrieved results is limited, such as the Mortgage-
Calculator or the Spreadsheet component. The perfect average precision scores
indicate that the relevant results for these queries are placed on the top of the ranking.
The results are also highly reusable, as indicated by the reusability score for each
query. In 5 out of 7 queries, the average reusability score of the relevant results is quite
near or higher than 87.5%, indicating that on average the components do not surpass
more than 1 of the thresholds defined in Table 9.2, so they have a reusability score
of at least 7 out of 8. The reusability score is also related to the overall complexity of the retrieved components.
9.4 Conclusion
References
1. Hummel O, Janjic W, Atkinson C (2008) Code conjurer: pulling reusable software out of thin
air. IEEE Softw 25(5):45–52
2. Reiss SP (2009) Semantics-based code search. In: Proceedings of the 31st International Con-
ference on Software Engineering, ICSE ’09. IEEE Computer Society, Washington, DC, USA,
pp 243–253
3. Wei Y, Chandrasekaran N, Gulwani S, Hamadi Y (2015) Building Bing Developer Assistant.
Technical Report MSR-TR-2015-36, Microsoft Research
4. Diamantopoulos T, Thomopoulos K, Symeonidis A (2016) QualBoa: reusability-aware recommendations of source code components. In: Proceedings of the IEEE/ACM 13th Working Conference on Mining Software Repositories, MSR ’16. pp 488–491
5. Dyer R, Nguyen HA, Rajan H, Nguyen TN (2013) Boa: a language and infrastructure for
analyzing ultra-large-scale software repositories. In: 35th International Conference on Software
Engineering, ICSE 2013. pp 422–431
6. Tanimoto TT (1957) IBM Internal Report
7. Caldiera G, Basili VR (1991) Identifying and qualifying reusable software components. Com-
puter 24(2):61–70
8. Moser R, Sillitti A, Abrahamsson P, Succi G (2006) Does refactoring improve reusability?
In: Proceedings of the 9th International Conference on Reuse of Off-the-Shelf Components,
ICSR’06. Berlin, Heidelberg, Springer-Verlag, pp 287–297
9. Gui G, Scott PD (2006) Coupling and cohesion measures for evaluation of component reusabil-
ity. In: Proceedings of the 2006 International Workshop on Mining Software Repositories, MSR
’06. ACM, New York, NY, USA, pp 18–21
10. Poulin JS (1994) Measuring software reusability. In: Proceedings of the Third International
Conference on Software Reuse: Advances in Software Reusability. pp 126–138
11. Spinellis D (2006) Code Quality: The Open Source Perspective. Effective software development
series. Addison-Wesley Professional
12. Dandashi F (2002) A method for assessing the reusability of object-oriented code using a
validated set of automated measurements. In: Proceedings of the 2002 ACM Symposium on
Applied Computing, SAC ’02. ACM, New York, NY, USA, pp 997–1003
13. Ferreira KAM, Bigonha MAS, Bigonha RS, Mendes LFO, Almeida HC (2012) Identifying
thresholds for object-oriented software metrics. J Syst Softw 85(2):244–257
Chapter 10
Assessing the Reusability of Source Code
Components
10.1 Overview
In this chapter, we mitigate the need for such an imposed ground truth quality
value, by employing user-perceived quality as a measure of the reusability of a
software component. From a developer perspective, we are able to associate user-
perceived quality with the extent to which a software component is adopted by
developers, thus arguing that the popularity of a component can be a measure of its
reusability. We provide a proof-of-concept for this hypothesis and further investigate
the relationship between popularity and user-perceived quality by using a dataset of
source code components along with their popularity as determined by their rating on
GitHub. Based on this argument, we design and implement a methodology that covers
the four software quality properties related to reusability by using a large set of static
analysis metrics. Given these metrics and using the popularity of GitHub projects as
ground truth, we can estimate the reusability of a source code component. In specific,
we model the behavior of the metrics to automatically translate their values into a
reusability score [17]. Metric behaviors are then used to train a set of reusability
estimation models using different state-of-the-practice machine learning algorithms.
Those models are able to estimate the reusability degree of components at both class
and package level, as perceived by software developers.
Most of the existing quality and reusability assessment approaches make use of static
analysis metrics in order to train quality estimation models [6, 8, 9, 18]. In principle,
estimating quality through static analysis metrics is a non-trivial task, as it often
requires determining quality thresholds [4], which is usually performed by experts
who manually examine the source code [19]. However, the manual examination of
source code, especially for large complex projects that change on a regular basis, is
not always feasible due to constraints in time and resources. Moreover, expert help
may be subjective and highly context-specific.
Other approaches may require multiple parameters for constructing quality evalu-
ation models [15], which are again highly dependent on the scope of the source code
and are easily affected by subjective judgment. Thus, a common practice involves
deriving metric thresholds by applying machine learning techniques on a benchmark
repository. Ferreira et al. [20] propose a methodology for estimating thresholds by
fitting the values of metrics into probability distributions, while Alves et al. [21]
follow a weight-based approach to derive thresholds by applying statistical analysis
on the metrics values. Other approaches involve deriving thresholds using bootstrap-
ping [22] and ROC curve analysis [23]. Still, these approaches are subject to the
projects selected for the benchmark repository. Finally, there are some approaches
that focus specifically on reusability [12–14, 24]. As in quality estimation, these
approaches also first involve quantifying reusability through reuse-related informa-
tion such as reuse frequency [24]. Then, machine learning techniques are employed
to train reusability evaluation models using as input the values of static analysis
metrics [12–14].
Although the aforementioned approaches can be effective for certain cases, their
applicability in real-world scenarios is limited. At first, approaches using predefined
thresholds [4, 6, 7, 19], even with the guidance of an expert, cannot extend their
scope beyond the specific class and the characteristics of the software for which they
have been designed. Automated quality and reusability evaluation systems appear
to overcome these issues [11–14, 20–23]. However, they are also still confined by
the ground truth knowledge of an expert for training the model, and thus do not rate
the reusability of source code according to the diverse needs of the developers. As
a result, several academic and commercial approaches [15, 25] resort to providing
detailed reports of quality metrics, leaving configurations to developers or experts.
To overcome these issues, we build a reusability estimation system to provide a
single score for every class and every package of a given software component. Our
system does not use predefined metric thresholds, while it also refrains from expert
representations of quality; instead, we use popularity (i.e., user-perceived quality)
as ground truth for reusability [26]. Given that popular components have a high rate
of reuse [27], by intuition the preference of developers may be a measure of high
reusability. We employ this concept to evaluate the impact of each metric into the
reusability degree of software components individually, and subsequently aggregate
the outcome of our analysis in order to construct a final reusability score. Finally,
we quantify the reusability degree of software components both at class and package
level by training reusability estimation models that effectively estimate the degree to
which a component is reusable as perceived by software developers.
In this section, we discuss the potential connection between reusability and popularity
and indicate the compliance of our methodology with the ISO quality standards.
Fig. 10.1 Diagram depicting the number of stars versus the number of forks for the 100 most
popular GitHub repositories
We created a dataset that includes the values of the static analysis metrics shown in
Table 10.2, for the 100 most starred and 100 most forked GitHub Java projects (137
in total). These projects amount to more than 12M lines of code spread in almost
15 K packages and 150 K classes. All metrics were extracted at class and package
level using SourceMeter [29].
As GitHub stars and forks refer to repository level, they are not adequate on their own
for estimating the reusability of class level and package level components. Thus, we
estimate reusability using static analysis metrics. For each metric, we first perform
distribution-based binning and then relate its values to those of the stars/forks to
incorporate reusability information. The final reusability estimation is computed by
aggregating over the estimations derived by each metric.
Distribution-Based Binning
As the values of each static analysis metric are distributed differently among the
repositories of our dataset, we first define a set of representative intervals (bins),
unified across repositories, that effectively approximate the actual distribution of the
values. For this purpose, we use the values of each metric from all the packages (or
classes) of our dataset to formulate a generic distribution, and then determine the
optimal bin size that results in the minimum information loss. In an effort to follow a
generic and accurate binning strategy while taking into consideration the fact that the
values of the metrics do not follow a normal distribution, we use the Doane formula
[30] for selecting the bin size in order to account for the skewness of the data.
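A minimal sketch of this binning step; NumPy implements Doane's rule directly, and the metric values below are placeholder data rather than values from our dataset.

```python
import numpy as np

# metric_values: the values of one static analysis metric (e.g. AD) gathered
# from all packages (or classes) of the dataset; placeholder data here
metric_values = np.random.lognormal(mean=0.0, sigma=1.0, size=10_000)

# Doane's rule accounts for the skewness of non-normal data when choosing
# the number of bins; numpy implements it directly.
edges = np.histogram_bin_edges(metric_values, bins="doane")
counts, _ = np.histogram(metric_values, bins=edges)
print(len(edges) - 1, "bins")
```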
Figure 10.2 depicts the histogram of the API Documentation (AD) metric at pack-
age level, which will be used as an example throughout this section. Following our
Table 10.2 Overview of static metrics and their applicability on different levels
Name Description Class Package
Complexity
NL {·,E} Nesting Level {Else-If} ×
WMC Weighted Methods per Class ×
Coupling
CBO {·,I} Coupling Between Object classes ×
{Inverse}
N {I,O}I Number of { Incoming, Outgoing} ×
Invocations
RFC Response set For Class ×
Cohesion
LCOM5 Lack of Cohesion in Methods 5 ×
Documentation
AD API Documentation ×
{·,T}CD {Total} Comment Density × ×
{·,T}CLOC {Total} Comment Lines of Code × ×
DLOC Documentation Lines of Code ×
P {D,U}A Public {Documented, Undocumented} × ×
API
TAD Total API Documentation ×
TP {D,U}A Total Public {Documented, ×
Undocumented} API
Inheritance
DIT Depth of Inheritance Tree ×
NO {A,C} Number of {Ancestors, Children} ×
NO {D,P} Number of {Descendants, Parents} ×
Size
{·,T} {·,L}LOC {Total} {Logical} Lines of Code × ×
N{G,S} Number of {Getters, Setters} × ×
N{A,M} Number of {Attributes, Methods} × ×
N{CL,EN,IN,P} Number of {Classes, Enums, Interfaces, ×
Packages}
NL{G,S} Number of Local {Getters, Setters} ×
NL{A,M} Number of Local {Attributes, Methods} ×
NLP{A,M} Number of Local Public {Attributes, ×
Methods}
NP{A,M} Number of Public {Attributes, Methods} × ×
NOS Number of Statements ×
TNP{CL,EN,IN} Total Number of Public {Classes, ×
Enums, Interfaces}
TN{CL,DI,EN,FI} Total Number of {Classes, Directories, ×
Enums, Files}
binning strategy, 20 bins are produced. AD values appear to have positive skewness,
while their highest frequency is in the interval [0.1, 0.5]. After having selected the
appropriate bins for each metric, we construct the histograms for each of the 137
repositories.
Figure 10.3 depicts the histograms for two different repositories, where it is clear
that each repository follows a different distribution for the values of AD. This is
expected, as every repository is an independent entity with a different scope, devel-
oped by different contributors; thus, it accumulates a set of different characteristics.
Relating Bin Values with GitHub Stars and Forks
We use the produced histograms (one for each repository using the bins calculated in
the previous step) in order to construct a set of data instances that relate each metric
bin value (here the AD) to a GitHub stars (or forks) value. So, we aggregate these
values for all the bins of each metric, i.e., for metric X and bin 1 we gather all stars
(or forks) values that correspond to packages (or classes) for which the metric value
lies in bin 1 and aggregate them using an averaging method. This process is repeated
for every metric and every bin.
Practically, for each bin value of the metric, we gather all relevant data instances
and calculate the weighted average of their stars (or forks) count, which represents
the stars-based (or forks-based) reusability value for the specific bin. The reusability
scores are defined using the following equations:
$$RS_{Metric}(i) = \sum_{repo=1}^{N} \mathit{freq}_{p.u.}(i) \cdot \log(S(repo)) \quad (10.1)$$

$$RF_{Metric}(i) = \sum_{repo=1}^{N} \mathit{freq}_{p.u.}(i) \cdot \log(F(repo)) \quad (10.2)$$
where $RS_{Metric}(i)$ refers to the stars-based reusability score of the i-th bin for the metric under evaluation and $RF_{Metric}(i)$ refers to the respective forks-based reusability score. $S(repo)$ and $F(repo)$ refer to the number of stars and the number of forks of the repository, respectively, while the use of the logarithm acts as a smoothing factor for the large differences in the number of stars and forks among the repositories. Finally, the term $\mathit{freq}_{p.u.}(i)$ is the normalized/relative frequency of the metric values in the i-th bin, defined as follows:

$$\mathit{freq}_{p.u.}(i) = \frac{F_i}{\sum_{i=1}^{N} F_i} \quad (10.3)$$

where $F_i$ is the absolute frequency (i.e., the count) of the metric values lying in
the i-th bin. For example, if a repository had 3 bins with 5 AD values in bin 1, 8
values in bin 2, and 7 values in bin 3, then the normalized frequency for bin 1 would
be 5/(5 + 8 + 7) = 0.25, for bin 2 it would be 8/(5 + 8 + 7) = 0.4, and for bin 3
it would be 7/(5 + 8 + 7) = 0.35. The use of normalized frequency was chosen to
eliminate any biases caused by the high variance in the number of packages among the
different repositories. Given that the selected repositories are the top 100 starred and
forked repositories, these differences do not necessarily reflect the actual differences
in the reusability degree of packages.
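A minimal sketch of Eqs. (10.1) and (10.3); the input format (one dictionary per repository holding its star count and the bin index of each package or class) is an assumption of this sketch, and the forks-based score of Eq. (10.2) is obtained by using the fork count instead.

```python
import math
from collections import Counter

def bin_reusability_scores(repos, n_bins):
    """Stars-based reusability score per bin of one metric (Eq. 10.1).
    Each repo dict is assumed to hold the metric's bin index for every
    package (or class) plus the repository's star count."""
    scores = [0.0] * n_bins
    for repo in repos:
        counts = Counter(repo["bin_indices"])            # F_i per bin
        total = sum(counts.values())
        for i in range(n_bins):
            freq = counts.get(i, 0) / total              # Eq. (10.3)
            scores[i] += freq * math.log(repo["stars"])  # Eq. (10.1)
    return scores

# e.g. repos = [{"stars": 40000, "bin_indices": [0, 1, 1, 2]}, ...]
```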
Figure 10.4 illustrates the results of applying Eq. (10.1), i.e., the star-based
reusability, to the AD metric values at package level. Note also that the upper and
lower 5% of the metric values are removed at this point as outliers. As shown in this
figure, the reusability score based on AD is maximum for AD values in the interval
[0.25, 0.5].
Finally, based on the fact that we compute a forks-based and a stars-based reusabil-
ity score for each metric, the final reusability score for each source code component
(class or package) is given by the following equations:
$$RS_{Final} = \frac{\sum_{j=1}^{k} RS_{metric}(j) \cdot corr(metric_j, stars)}{\sum_{j=1}^{k} corr(metric_j, stars)} \quad (10.4)$$

$$RF_{Final} = \frac{\sum_{j=1}^{k} RF_{metric}(j) \cdot corr(metric_j, forks)}{\sum_{j=1}^{k} corr(metric_j, forks)} \quad (10.5)$$

$$FinalScore = \frac{3 \cdot RF_{Final} + RS_{Final}}{4} \quad (10.6)$$

where $k$ is the number of metrics at each level. $RS_{Final}$ and $RF_{Final}$ correspond to the final stars-based and forks-based scores, respectively. $RS_{metric}(j)$ and $RF_{metric}(j)$ refer to the scores for the j-th metric as given by Eqs. (10.1) and (10.2), while $corr(metric_j, stars)$ and $corr(metric_j, forks)$ correspond to the Pearson correlation coefficient between the values of the j-th metric and the stars or forks, respectively. $FinalScore$ refers to the final reusability score and is the weighted average of the stars-based and forks-based scores. More weight (3 vs 1) is given to the forks-based score, as it is associated with more reusability-related quality attributes (see Table 10.1).
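A minimal sketch of the aggregation of Eqs. (10.4)–(10.6); the per-metric scores and the Pearson correlation weights (e.g., computed with numpy.corrcoef) are assumed to be given.

```python
def final_reusability(stars_scores, forks_scores, w_stars, w_forks):
    """stars_scores/forks_scores: per-metric scores of Eqs. (10.1)/(10.2) for one
    component; w_stars/w_forks: Pearson correlation of each metric with stars/forks."""
    rs = sum(s * w for s, w in zip(stars_scores, w_stars)) / sum(w_stars)  # Eq. (10.4)
    rf = sum(f * w for f, w in zip(forks_scores, w_forks)) / sum(w_forks)  # Eq. (10.5)
    return (3 * rf + rs) / 4                                               # Eq. (10.6)
```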
This section presents our proposed methodology for software reusability estimation
and illustrates the construction of our models.
The data flow of our system is shown in Fig. 10.5. The input is the set of static
analysis metrics, along with the popularity information (stars and forks) extracted
from GitHub repositories. As the system involves reusability estimation both at class
and package levels, we train one model for each individual metric applied at each
level using the results of the aforementioned analysis. The output of each model
provides a reusability score that originates from the value of the class (or package)
under evaluation for the respective metric. All the scores are then aggregated to
calculate a final reusability score that represents the degree to which the class (or
package) is adopted by developers.
In an effort to construct effective models, each fitting the behavior of one metric,
we perform a comparative analysis between different state-of-the-practice machine
learning algorithms. In specific, we evaluate three techniques: Support Vector Regres-
sion, Polynomial Regression, and Random Forest Regression. The Random Forest
models have been constructed using the bagging ensemble/resampling method, while
for the Support Vector Regression models we used a Radial Basis Function (RBF)
kernel. Regarding the polynomial regression models, the fitting procedure involves
choosing the optimal degree by applying the elbow method on the square sum of
Fig. 10.6 Average NRMSE for all three machine learning algorithms
residuals.1 To account for cases where the number of bins, and consequently the
number of training data points, is low, we used linear interpolation up to the point
where the dataset for each model contained 60 instances.
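A minimal sketch of this training setup using scikit-learn and NumPy; the training data below are placeholders, and details such as the number of trees are assumptions of this sketch rather than the exact settings of our experiments.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

# Placeholder training data: bin centres of one metric (x) and the reusability
# score of each bin (y); one such model is trained per metric and per level.
x = np.linspace(0.0, 1.0, 20)
y = np.exp(-((x - 0.4) ** 2) / 0.05)

def upsample(x, y, min_points=60):
    """Linear interpolation so that each training set has at least 60 instances
    (x is assumed sorted in increasing order)."""
    if len(x) >= min_points:
        return x, y
    xi = np.linspace(x.min(), x.max(), min_points)
    return xi, np.interp(xi, x, y)

x, y = upsample(x, y)
models = {
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "svr_rbf": SVR(kernel="rbf"),
}
for model in models.values():
    model.fit(x.reshape(-1, 1), y)

# Polynomial regression: choose the degree via the elbow method on the
# residual sum of squares.
rss = [np.sum((np.polyval(np.polyfit(x, y, d), x) - y) ** 2) for d in range(1, 8)]
```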
To evaluate the effectiveness of our constructed models and select the one that
best matches the identified behaviors, we employ the Normalized Root Mean Squared
Error (NRMSE) metric. Given the actual scores y1 , y2 , . . . , yn , the predicted scores
ŷ1 , ŷ2 , . . . , ŷn , and the mean actual score ȳ, the NRMSE is calculated as follows:
$$NRMSE = \sqrt{\frac{\frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2}{\frac{1}{N} \sum_{i=1}^{N} (\bar{y} - y_i)^2}} \quad (10.7)$$
where N is the number of samples in the dataset. NRMSE was selected as it not only takes into account the average difference between the actual and the predicted values, but also provides a comprehensible result on a normalized scale.
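A minimal NumPy sketch of Eq. (10.7):

```python
import numpy as np

def nrmse(y_true, y_pred):
    """Root mean squared error normalized by the spread of the actual values
    around their mean, as in Eq. (10.7)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.sum((y_pred - y_true) ** 2) / np.sum((y_true.mean() - y_true) ** 2))
```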
Figure 10.6 presents the mean NRMSE for the three algorithms regarding the
reusability scores (forks/stars based) at both class and package levels. Although all
three approaches seem quite effective as their NRMSE values are low, the Random
Forest clearly outperforms the other two algorithms in all four categories, while the
Polynomial Regression models exhibit the highest errors. Another interesting con-
clusion is that reusability estimation at package level exhibits higher errors for all
three approaches. This fact originates from our metrics behavior extraction method-
ology, which uses ground truth information at repository level. Given that the number
of classes greatly outnumbers the number of packages within a given repository, the
1 According to this method, the optimal degree is the one for which there is no significant decrease
in the square sum of residuals when increasing the order by one.
extracted behaviors based on packages are less resilient to noise and thus result in
higher errors. Finally, as the Random Forest is more effective than the other two approaches, we select it as the main algorithm for the rest of the evaluation.
Figures 10.7a and 10.7b depict the distribution of the reusability score at class
and package levels, respectively. As expected, the score in both cases follows a
distribution similar to normal and the majority of instances are accumulated evenly
around 0.5. For the reusability score at class level, we observe a left-sided skewness.
Manual inspection of the classes that received scores in the interval [0.2, 0.25] shows that they contain little valuable information (e.g., most of them have LOC < 10) and are thus given a low reusability score.
10.6 Evaluation
Figures 10.8a and 10.8b depict the distributions of the reusability score for all projects
at class level and package level, respectively. The boxplots in blue refer to the human-
generated projects, while the ones in orange refer to the auto-generated ones.
At first, it is obvious that the variance of the reusability scores is higher in the
human-generated projects than in the auto-generated ones. This is expected since
the projects that were generated automatically have proper architecture and high
abstraction levels. These two projects also have similar reusability values, which is
due to the fact that projects generated from the same tool ought to share similar
characteristics that are reflected in the values of the static analysis metrics.
2 http://s-case.github.io/.
Fig. 10.8 Boxplots depicting reusability distributions for three human-generated (blue) and two auto-generated (orange) projects, (a) at class level and (b) at package level (color online)
The high variance for the human-generated projects indicates that our reusability
assessment methodology is capable of distinguishing components with both high
and low degrees of reusability. Typically, these projects involve several components, some of which were written with software reuse in mind (e.g., APIs) and others not. Another effect reflected in these values is the number of distinct developers working on each project, a factor which does not apply to the automatically generated projects. Finally, given the reusability score distribution for all projects,
we conclude that the results are consistent regardless of the project size. Despite the
fact that the number of classes and the number of packages vary from very low values
(e.g., only 9 packages and 27 classes) to very high values (e.g., 1670 packages and
7099 classes), the score is not affected.
Table 10.4 Metrics for classes and packages with different reusability
Metrics Classes Packages
Class 1 Class 2 Package 1 Package 2
WMC 14 12 – –
CBO 12 3 – –
LCOM5 2 11 – –
CD (%) 20.31% 10.2% 41.5% 0.0%
RFC 12 30 – –
LOC 84 199 2435 38
TNCL – – 8 2
Reusability score 95.78% 10.8% 90.59% 16.75%
Further assessing the validity of the reusability scores, we manually examined the
static analysis metrics of sample classes and packages in order to check whether they
align with the reusability estimation. Table 10.4 provides an overview of a subset
of the computed metrics for representative examples of classes and packages with
different reusability scores. The table contains static analysis metrics for two classes
and two packages that received both high and low reusability scores.
Examining the values of the computed static analysis metrics, one can see that the
reusability estimation in all four cases is reasonable and complies with the literature
findings on the impact of metrics on the quality properties related to reusability (see
Table 10.1). Concerning the class that received high reusability estimation, it appears
to be well documented (the value of the Comment Density (CD) is 20.31%), which indicates that it is suitable for reuse. In addition, the value of its Lines of Code (LOC)
metric, which is highly correlated with the degree of understandability and thus the
reusability, is typical. As a result, the score for this class is quite rational. On the other
hand, the class that received low score appears to have low cohesion (the LCOM5
metric value is high) and high coupling (the RFC metric value is high). Those code
properties are crucial for the reusability-related quality attributes and thus the low
score is expected.
Finally, upon examining the static analysis metrics of the packages, the conclu-
sions are similar. The package that received high reusability score appears to have
proper documentation level (based on CD value) which combined with its typical
size makes it suitable for reuse. This is reflected in the reusability score. On the other
hand, the package that received low score appears to have no valuable information
due to the fact that its size is very small (only 38 lines of code).
10.7 Conclusion
References
1. Pfleeger SL, Kitchenham B (1996) Software quality: the elusive target. IEEE Software, pp
12–21
2. ISO/IEC 25010:2011 (2011) https://www.iso.org/standard/35733.html. Retrieved: November
2017
3. ISO/IEC 9126-1:2001 (2001) https://www.iso.org/standard/22749.html. Retrieved: October
2017
4. Diamantopoulos T, Thomopoulos K, Symeonidis A (2016) QualBoa: reusability-aware recommendations of source code components. In: Proceedings of the IEEE/ACM 13th Working Conference on Mining Software Repositories, MSR ’16, pp 488–491
5. Taibi F (2014) Empirical analysis of the reusability of object-oriented program code in open-
source software. Int J Comput Inf Syst Control Eng 8(1):114–120
6. Le Goues C, Weimer W (2012) Measuring code quality to improve specification mining. IEEE
Trans Softw Eng 38(1):175–190
7. Washizaki H, Namiki R, Fukuoka T, Harada Y, Watanabe H (2007) A framework for measuring
and evaluating program source code quality. In: Proceedings of the 8th International Conference
on Product-Focused Software Process Improvement, PROFES. Springer, pp 284–299
8. Singh AP, Tomar P (2014) Estimation of component reusability through reusability metrics.
Int J Comput Electr Autom Control Inf Eng 8(11):1965–1972
9. Sandhu PS, Singh H (2006) A reusability evaluation model for OO-based software components.
Int J Comput Sci 1(4):259–264
10. Chidamber SR, Kemerer CF (1994) A metrics suite for object oriented design. IEEE Trans
Softw Eng 20(6):476–493
11. Zhong S, Khoshgoftaar TM, Seliya N (2004) Unsupervised learning for expert-based soft-
ware quality estimation. In: Proceedings of the Eighth IEEE International Conference on High
Assurance Systems Engineering, HASE’04, pp 149–155
12. Kaur A, Monga H, Kaur M, Sandhu PS (2012) Identification and performance evaluation of
reusable software components based neural network. Int J Res Eng Technol 1(2):100–104
13. Manhas S, Vashisht R, Sandhu PS, Neeru N (2010) Reusability evaluation model for procedure-
based software systems. Int J Comput Electr Eng 2(6)
14. Kumar A (2012) Measuring software reusability using SVM based classifier approach. Int J
Inf Technol Knowl Manag 5(1):205–209
15. Cai T, Lyu MR, Wong KF, Wong M (2001) ComPARE: a generic quality assessment envi-
ronment for component-based software systems. In: Proceedings of the 2001 International
Symposium on Information Systems and Engineering, ISE’2001
16. Bakota T, Hegedűs P, Körtvélyesi P, Ferenc R, Gyimóthy T (2011) A probabilistic software
quality model. In: 27th IEEE International Conference on Software Maintenance (ICSM), pp
243–252
17. Papamichail M, Diamantopoulos T, Chrysovergis I, Samlidis P, Symeonidis A (2018) User-
perceived reusability estimation based on analysis of software repositories. In: Proceedings of
the 2018 IEEE International Workshop on Machine Learning Techniques for Software Quality
Evaluation, MaLTeSQuE. Campobasso, Italy, pp 49–54
18. Samoladas I, Gousios G, Spinellis D, Stamelos I (2008) The SQO-OSS quality model: mea-
surement based open source software evaluation. Open source development, communities and
quality, pp 237–248
19. Hegedűs P, Bakota T, Ladányi G, Faragó C, Ferenc R (2013) A drill-down approach for mea-
suring maintainability at source code element level. Electron Commun EASST 60
20. Ferreira KAM, Bigonha MAS, Bigonha RS, Mendes LFO, Almeida HC (2012) Identifying
thresholds for object-oriented software metrics. J Syst Softw 85(2):244–257
21. Alves TL, Ypma C, Visser J (2010) Deriving metric thresholds from benchmark data. In:
Proceedings of the IEEE International Conference on Software Maintenance, ICSM. IEEE, pp
1–10
22. Foucault M, Palyart M, Falleri JR, Blanc X (2014) Computing contextual metric thresholds.
In: Proceedings of the 29th Annual ACM Symposium on Applied Computing. ACM, pp 1120–
1125
23. Shatnawi R, Li W, Swain J, Newman T (2010) Finding software metrics threshold values using
ROC curves. J Softw: Evol Process 22(1):1–16
24. Bay TG, Pauls K (2004) Reuse Frequency as Metric for Component Assessment. Technical
report, ETH, Department of Computer Science, Zurich. Technical Reports D-INFK
25. SonarQube platform (2016). http://www.sonarqube.org/. Retrieved: June 2016
26. Papamichail M, Diamantopoulos T, Symeonidis A (2016) User-perceived source code quality
estimation based on static analysis metrics. In: Proceedings of the 2016 IEEE International
Conference on Software Quality, Reliability and Security, QRS. Vienna, Austria, pp 100–107
27. Borges H, Hora A, Valente MT (2016) Predicting the popularity of github repositories. In:
Proceedings of the The 12th International Conference on Predictive Models and Data Analytics
in Software Engineering, PROMISE 2016. ACM, New York, NY, USA, pp 9:1–9:10
28. ARiSA - Reusability related metrics (2008). http://www.arisa.se/compendium/node38.html.
Retrieved: September 2017
29. SourceMeter static analysis tool (2017). https://www.sourcemeter.com/. Retrieved: November
2017
30. Doane DP, Seward LE (2011) Measuring skewness: a forgotten statistic. J Stat Educ 19(2):1–18
31. Dimaridou V, Kyprianidis AC, Papamichail M, Diamantopoulos TG, Symeonidis AL (2017)
Towards modeling the user-perceived quality of source code using static analysis metrics.
In: Proceedings of the 12th International Conference on Software Technologies - Volume 1,
ICSOFT. INSTICC, SciTePress, Setubal, Portugal, pp 73–84
32. Dimaridou V, Kyprianidis AC, Papamichail M, Diamantopoulos T, Symeonidis A (2018)
Assessing the user-perceived quality of source code components using static analysis met-
rics. In: Communications in Computer and Information Science. Springer, page in press
Part V
Conclusion and Future Work
Chapter 11
Conclusion
The main scope and purpose of this book has been the application of different
methodologies in order to facilitate the phases of software development through
reuse. In this context, we have applied various techniques in order to enable reuse in
requirements and specifications, support the development process, and further focus
on quality characteristics under the prism of reuse.
In Part II, we focused on the challenges of modeling and mining software require-
ments. Our methodology in requirements modeling provides a solid basis as it pro-
vides an extensible schema that allows storing and indexing software requirements.
We have shown how it is possible to store requirements from multimodal formats,
including functional requirements and UML diagrams. Our proposal of different
methodologies for mining software requirements further indicates the potential of
our ontology schema. The applied mining approaches have been shown to be effective
for enabling reuse at requirements level. Another important contribution at this level
is the mechanism that can directly translate requirements from different formats into
a model that can subsequently serve as the basis of developing the application. This
link, which we have explored in the scenario of RESTful services, is bidirectional
and, most importantly, traceable.
In Part III, we focused on the challenges of source code mining in order to support
reuse at different levels. We have initially shown how source code information can be
retrieved, mined, stored, and indexed. After that, we confronted different challenges
in the context of source code reuse. We first presented a system for software com-
ponent reuse that encompasses important features; the system receives a query in an
interface-like format (thus making it easy to form queries), and retrieves and ranks
software components using a syntax-aware model. The results are also further ana-
lyzed to assess whether they comply with the requirements of the developer (using
tests) as well as to provide useful information, such as their control flow. Our second
contribution has been in the area of API and snippet mining. The main method-
ology, in this case, addresses the challenge of finding useful snippets for common
developer queries. To do so, we have retrieved snippets from different online sources
and have grouped them in order to allow the developer to distinguish among different
implementations. An important concept presented as part of this work has been the
ranking of the implementations and of the snippets according to developer prefer-
ence, a useful feature that further facilitates finding the most commonly preferred
snippets. Finally, we have focused on the challenge of further improving the selected
implementations (or generally one’s own source code) by finding similar solutions
in online communities. Our contribution in this context has been a methodology for
discovering solutions that answer similar problems to those presented to the devel-
oper. An important aspect of our work has been the design of a matching approach
that considers all elements of question posts, including titles, tags, text, and, most
importantly, source code, which we have shown to contain valuable information.
In Part IV, we focused on source code reusability, i.e., the extent to which software components are suitable for reuse. Initially, we constructed a system to illustrate the applications of reusability assessment. Specifically, our system focuses on the challenge of component reuse to show that developers need to assess software components not only from a functional but also from a quality perspective. To this end, we have proposed a methodology for assessing the reusability of software components. Our methodology is based on the intuition that components that are frequently preferred by developers are more likely to be of high quality. As part of our contribution in this respect, we have first shown how project popularity (preference by developers) correlates with reusability and how these attributes can be modeled using static analysis metrics. Our models have been shown to be capable of assessing software components without any expert involvement.
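As a simplified, hypothetical illustration of this idea, a model could be trained on popularity data and then used to score previously unseen components; the metric values, the popularity-based reusability proxy, and the plain linear model below are assumptions made for the sake of the example, whereas the methodology of this book relies on a considerably richer set of static analysis metrics and models.

    # Assumes scikit-learn is available; any regression model could be used instead.
    from sklearn.linear_model import LinearRegression

    # Hypothetical training data: one row of static analysis metrics per component
    # (e.g., average cyclomatic complexity, comment density, coupling), and a
    # popularity-based proxy of reusability (e.g., normalized stars/forks).
    metrics = [
        [2.1, 0.30, 4],   # low complexity, well commented, loosely coupled
        [7.8, 0.05, 15],  # complex, sparsely commented, tightly coupled
        [3.4, 0.22, 6],
        [9.2, 0.02, 20],
    ]
    reusability_proxy = [0.9, 0.2, 0.7, 0.1]

    model = LinearRegression().fit(metrics, reusability_proxy)

    # Score a previously unseen component without any expert involvement.
    new_component = [[4.0, 0.18, 7]]
    print(model.predict(new_component))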
As a final note, this book has illustrated how the adoption of reuse practices can
be facilitated in all phases of the software development life cycle. We have shown
how software engineering data from multiple sources and in different formats can be
mined in order to provide reusable solutions in the context of requirements elicitation
and specification extraction, software development, and quality assessment (both for
development and for maintenance). Our methodologies cover several different sce-
narios but, most importantly, offer an integrated solution. Our research contributions
can be seen as the means to create better software by reusing all available infor-
mation. And, as already mentioned in the introduction of this book, creating better
software with minimal cost and effort has a direct impact on the economic and social aspects of everyday life.
Chapter 12
Future Work
As already mentioned in the previous chapter, our contributions lie in different phases of the software engineering life cycle. In this chapter, we discuss future work with respect to our contributions as well as with regard to the relevant research areas. (Note that highly specific future proposals for our techniques are provided at the end of the corresponding chapters; here, we discuss the future work for each area at a rather abstract level.) We initially analyze each area on its own by connecting it to the corresponding part of this book, and then focus on the future work that can be identified for the field as a whole.
Future work on requirements modeling and mining (Part II) spans multiple axes. Concerning the modeling part, the main challenge is to create a model that would encompass the requirements elicitation process as a whole. Such a model could include requirements, stakeholder preferences, non-functional characteristics of projects, etc. Our effort in this context can be considered a solid first step, as it effectively stores and indexes some of these data, while it is also suited for employing mining techniques for the purpose of reuse. Similarly, concerning the requirements mining part, an important future direction is the support for additional formats of requirements (e.g., user stories, different diagram types, etc.) and the application of mining techniques to facilitate requirements reuse as well as specification extraction. In the context of specification extraction, another important point to be explored in future work is the connection of requirements to implementations. Our work in this respect, which is focused on RESTful services, serves as a practical example and can provide the basis for further exploring this connection.
Concerning source code mining (Part III), the future research directions are
related to the different challenges discussed. One such challenge is component
reuse, for which we have proposed a methodology that assesses whether the retrieved
components are functionally suitable for the query of the developer. Interesting directions in this context include the support for larger multi-class components as well as the automated extraction of the dependencies of each component. Ideally, we would want a system that could retrieve online source code, check whether it is suitable for the corresponding requirements, and integrate it by taking into account the source code of the developer along with any dependencies on third-party code. Similarly, for the API/snippet reuse scenario, a future direction would be to integrate the snippet into the source code of the developer, possibly after employing summarization techniques to ensure that it is optimal. Finally, an interesting extension with respect to the challenge of finding existing solutions in online communities would be the incorporation of semantics. This way, one would be able to form a bidirectional semantic connection between a query that essentially expresses a set of functional criteria and a relevant snippet. Such connections could be used to facilitate API/snippet discovery, or even to automate the documentation of libraries and the extraction of usage examples from them.
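One possible, purely illustrative realization of such a connection would annotate each snippet with the API calls it invokes and match them against the functional terms of a query; the snippet store, the concept mapping, and all identifiers below are hypothetical.

    import re

    # Hypothetical snippet store: each snippet is annotated with the API calls it uses.
    snippets = {
        "read-file": {"code": "Files.readAllLines(Paths.get(path));",
                      "apis": {"Files.readAllLines", "Paths.get"}},
        "write-file": {"code": "Files.write(Paths.get(path), lines);",
                       "apis": {"Files.write", "Paths.get"}},
    }

    # Hypothetical mapping from functional terms to API-level concepts; in a real
    # system this link could be learned or drawn from an ontology.
    CONCEPTS = {"read": {"Files.readAllLines"}, "write": {"Files.write"},
                "file": {"Paths.get"}}

    def match(query):
        """Rank snippets by how many of the query's API-level concepts they cover."""
        terms = set(re.findall(r"[a-z]+", query.lower()))
        wanted = set().union(*(CONCEPTS.get(term, set()) for term in terms))
        ranked = sorted(snippets.items(),
                        key=lambda item: len(item[1]["apis"] & wanted), reverse=True)
        return [name for name, _ in ranked]

    print(match("read a file"))  # ['read-file', 'write-file']

A symmetric mapping from API calls back to functional terms would provide the reverse direction of the connection, which could in turn support automated documentation or the extraction of usage examples.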
The research directions for quality assessment (Part IV) are also numerous. First, it would be interesting to further explore the incorporation of quality metrics into source code reuse systems. The main purpose of such an integration would be to provide the developer with the best possible solution, one that would be functionally suitable for the given query and, at the same time, would exhibit high quality and/or high preference by the developer community. In this context, it would be useful to investigate how different quality metrics and models can be used to assess source code components. Using the reusability assessment methodology provided in this book as a starting point, one could further explore the relation between developer preference and reusability in multiple ways. Developer preference measurements could be extended by measuring the actual reuse rate of each software library, a statistic that could be obtained by mining the imports of online repositories. Furthermore, this relation could be explored in different domains; for instance, one would probably expect that general-purpose libraries are reused more often than machine learning libraries, which implies that their preference has to be measured on a different basis.
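As a rough sketch of how such a reuse-rate statistic could be gathered, one could scan the source files of the mined repositories and count the libraries they import; the directory layout and the focus on Python-style import statements are assumptions made only for illustration.

    import re
    from collections import Counter
    from pathlib import Path

    # Matches the top-level module of "import foo" and "from foo import bar" statements.
    IMPORT_RE = re.compile(r"^\s*(?:import|from)\s+([A-Za-z_]\w*)", re.MULTILINE)

    def count_library_usage(repositories_root):
        """Count in how many source files each library is imported."""
        usage = Counter()
        for source_file in Path(repositories_root).rglob("*.py"):
            text = source_file.read_text(errors="ignore")
            usage.update(set(IMPORT_RE.findall(text)))  # count each library once per file
        return usage

    # Hypothetical usage over a directory of previously mined repositories:
    # usage = count_library_usage("mined_repositories/")
    # print(usage.most_common(10))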
Finally, we may say that in this book we have explored the concept of software reuse as a driving axis for software development. An interesting extension that spans the whole process would be to model and index components of different granularity, along with their requirements, source code, and all other relevant information, and, through semantics, to provide an interface for finding reusable solutions at different levels (e.g., requirements, source code, documentation, etc.). Given traceable links between the data of the components, one would be able to develop software in a reuse-oriented manner, while the integrated solutions would effectively propagate when required, in order to produce a clean, efficient, and maintainable product. Thus, as a final note, we may view this work as a solid first step in this direction.