1 Introduction
In this paper, we address the following research questions:
1. What are the effects of using TDD, from the developers’ perspective?
2. What are the difficulties of using TDD?
3. Which testing methods do developers use?
4. Which refactoring techniques do developers use?
Based on the results of the survey reported in the paper, the primary contributions of this
work are:
• The benefits of TDD reported by the survey respondents should encourage scientific
developers to consider adopting TDD to improve software quality.
• The problems reported by survey participants highlight the need for additional
empirical evaluations of adopting TDD in various contexts.
• The code improvement methods and refactoring techniques reported by the survey
respondents provide practical suggestions for other developers who wish to adopt TDD
in their projects.
• From an SE point of view, the results indicate that adopting SE practices like TDD in a non-traditional environment is not always a positive experience.
The remainder of this paper is organized as follows. Section 2 provides the background and concepts related to this work. Section 3 describes the methodology of our survey. Section 4 presents the results of the survey. Section 5 discusses the results. Section 6 describes the study’s limitations. Section 7 concludes and discusses future directions for research.
2 Related work
Because there has been little work on the use of TDD for the development of scientific
software, other than our own, this section focuses on three related background topics: TDD
in a traditional SE context, characteristics of the scientific context that seem well-suited for
TDD, and the refactoring practice within TDD.
2.1 TDD in a traditional software engineering context

Our review of the literature found both positive and negative effects of using TDD in a
traditional software development environment. On the positive side, in their systematic
meta-analysis of 27 studies that investigated the impact of TDD on software development,
Rafique and Misic reported that TDD helped the development team improve the software
quality (Rafique and Misic 2013). Moreover, another study showed that TDD helped a
team at IBM increase the quality of their software products (Sanchez et al. 2007). These
results suggest that TDD can have a positive impact on software quality.
On the negative side, two systematic literature reviews regarding the adoption of TDD
in industry point out some challenges of using TDD. First, Causevic et al. (2011) identified
seven key factors that limit the adoption of TDD in traditional, i.e., non-scientific,
industrial environments:
1. increased development time;
2. insufficient TDD experience/knowledge;
3. insufficient design;
4. insufficient developer testing skills;
5. insufficient adherence to the TDD protocol;
6. tool-specific limitations; and
7. legacy code.

2.2 Characteristics of the scientific context
There are two key requirements of scientific software development that match well with
the characteristics of TDD. First, rather than focusing significant effort on the up-front
requirements and design analysis, scientific developers repeatedly implement small
increments of functional code, taking into account any changes in the requirements or
context (Sletholt et al. 2011). TDD fits well into this type of environment because TDD
does not require developers to gather and document all requirements up-front. Second,
because simulations often need to be conducted quickly, many scientists work with limited
time and a defined schedule for each iteration (Sanders and Kelly 2008). Process-heavy
software development approaches conflict with this need for quick turn-around of results
for publications and program review deadlines (Carver et al. 2007). Lifecycle models like
the Waterfall model (Ruparelia 2010), are inappropriate for such a setting (Nan-
thaamornphong et al. 2013). Therefore, scientific software development must employ
flexible, lightweight processes like TDD.
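To make the TDD cycle concrete, the following minimal Python sketch (our illustration; the unit-conversion function and test are hypothetical, not drawn from any surveyed project) walks through one red-green-refactor iteration:

```python
import unittest

# Step 1 (red): write the test first; it fails until celsius_to_kelvin exists.
class TestUnitConversion(unittest.TestCase):
    def test_celsius_to_kelvin(self):
        self.assertAlmostEqual(celsius_to_kelvin(0.0), 273.15)
        self.assertAlmostEqual(celsius_to_kelvin(-273.15), 0.0)

# Step 2 (green): write just enough code to make the test pass.
def celsius_to_kelvin(temp_c):
    return temp_c + 273.15

# Step 3 (refactor): clean up the implementation while rerunning the test
# to confirm that the observable behavior is unchanged.

if __name__ == "__main__":
    unittest.main()
```

Each small increment of functionality repeats this cycle, which is why TDD does not depend on a complete up-front requirements document.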
2.3 Refactoring
One of the key aspects of TDD is refactoring (Opdyke 1992). The most widely cited
definition for refactoring is "a change made to the internal structure of software to make it easier to understand and cheaper to modify without changing its observable behavior" (Fowler 1999). In other words, refactoring improves the internals of the software
without changing the externals. There are a number of common reasons for performing
refactoring.
• To ease the addition of new code Developers can quickly add new code without
worrying about how well that code fits the overall structure of the system and then
clean up the code later through refactoring.
• To improve the design of existing code By continuously refactoring the design of code,
refactoring improves the quality of code. As a result, developers can easily extend the
maintained code.
• To help developers avoid defects Refactoring is a method to clean up code, which
minimizes the chances of introducing defects.
• To gain a better understanding of the code Unclear code is difficult for developers to comprehend and should therefore be clarified through refactoring.
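As a minimal illustration of a behavior-preserving refactoring (our own example, not taken from the survey responses), the inlined distance formula below is extracted into a named helper; callers observe identical results:

```python
import math

# Before: an inlined formula obscures the intent of the routine.
def nearest_neighbor_before(point, others):
    return min(others,
               key=lambda p: math.sqrt((p[0] - point[0]) ** 2 +
                                       (p[1] - point[1]) ** 2))

# After an "extract function" refactoring: same observable behavior,
# clearer intent, and a single place to change the distance metric.
def distance(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def nearest_neighbor(point, others):
    return min(others, key=lambda p: distance(point, p))
```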
Researchers have proposed a number of refactoring techniques for Fortran, a pro-
gramming language widely used in scientific software (Orchard and Rice 2013; Overbey
et al. 2005, 2009). The primary goal of these techniques is to improve system performance.
Additionally, the automated refactoring tool Photran (Eclipse 2013) helps developers
perform refactoring. Photran is an IDE based on Eclipse that includes 39 refactoring
methods, such as replacing common block and block data subprograms with module
variables, removing computed goto, and requiring explicit interface blocks.
Therefore, refactoring is an important step in the TDD process. The needs of the
scientific software development context suggest that the use of refactoring could be quite
beneficial. The fact that there are a number of existing tools is also encouraging when considering whether scientific software developers might choose to employ TDD.
3 Methodology
This section describes the methodology that we used to design, execute and analyze the
survey.
To ensure we gathered responses only from scientific developers who had experience with
TDD, the first part of the survey contained a series of screening questions to assess the
respondent’s experience with TDD. Only those respondents with TDD experience were
presented with the TDD-specific questions, which asked them to assess the effectiveness of
employing TDD, including writing tests and refactoring. Additionally, we asked specific
questions about testing and refactoring activities (e.g., testing problems, refactoring
problems, refactoring techniques).
The survey contained questions of different formats: multiple choice, Yes/No, self-
assessment items (using a five-point scale), and open-ended questions. Some questions also
contained an ‘‘Other (specify)’’ option to allow the respondents to provide additional
information. To help ensure the quality and completeness of the survey, we employed an
external scientific expert to evaluate the questions and provide feedback, which we
incorporated into the final survey.
Appendix 1 provides a complete list of the survey questions. Table 1 maps those survey
questions to the research questions described in Sect. 1. Note that survey questions 1–9
were the screening questions and therefore do not map to a specific research question.
We built a list of 300 unique email addresses for potential survey respondents that included authors who had attended related workshops,1 authors of papers about scientific software development, and members of selected email lists.

Table 1 Mapping between research questions and survey questions
RQ1. What are the effects of using TDD, from the developers’ perspective? — Survey questions 10, 11
RQ2. What are the difficulties of using TDD? — Survey questions 12, 14, 15, 19, 24, 25
RQ3. Which testing methods do developers use? — Survey questions 18, 20, 21, 22, 23
RQ4. Which refactoring techniques do developers use? — Survey questions 13, 16, 17
We conducted a pilot study to ensure that the survey questions were comprehensible and
valid with respect to the research questions. We emailed the pilot survey to a randomly
selected 5 % of the targeted participants (15 of the 300 email addresses). Note that we did
not include the pilot participants in the solicitation for the main study. We evaluated the
responses from the four pilot subjects who responded to determine whether they under-
stood the questions and provided answers that were sufficient for analysis. We used the
qualitative analysis process described in Sect. 3.5 to check the responses. In addition, we
emailed the four respondents directly with the following questions:
• Were all of the words on the survey understandable? (If not, please describe)
• Were all response choices appropriate? (If not, please describe)
• Was the format and layout easy to follow? (If not, please describe)
• Did the survey logic (i.e., question skipping) make sense? (If not, please describe)
Based on the responses, we made some minor adjustments to the wording of a few
questions to fix typos and increase clarity.
After the pilot, we emailed the survey’s web link to the remaining population (285 email
addresses). To increase the response rate, we sent reminder emails after one week and after
two weeks. After we sent the email to the target list, some recipients informed us that they
had forwarded the survey link to additional potential respondents and some posted the link
on scientific community web sites, such as Software Carpentry.2 Therefore, we are not able
to accurately determine the total number of people who were solicited to participate in the
survey beyond the 285 we emailed directly.
To analyze the survey responses, we performed a qualitative analysis process, called ‘‘open
coding’’ (Anselm and Juliet 1990). In this commonly used analysis methodology,
researchers identify and tentatively name the conceptual categories into which the phe-
nomena observed can be grouped. This methodology includes the following steps:
(a) creating categories of answers, (b) identifying and coding each answer carefully,
(c) organizing the answers into categories, and (d) comparing each new answer to the
1 http://SE4Science.org/workshops.
2 http://www.software-carpentry.org.
existing categories to determine whether it fits into one of them. The goal of the coding process is to allow patterns and explanations to emerge directly from the data. When ana-
lyzing the output of the coding process, we treated the results as nominal or ordinal and
reported counts and cross-tabulations. To ensure valid results and minimize bias, each
author coded the responses separately. In addition, each author coded the responses
multiple times, each time obtaining the same results. Then, the authors compared their
results and discussed to resolve any discrepancies.
4 Results
This section begins with an analysis of the demographics of the survey respondents. Then it
continues with the results relative to each of the four research questions described in
Sect. 1.
4.1 Demographics
We conducted the survey from December 2013 to January 2014. A total of 77 people
responded to the survey of which 64 reported experience with TDD. These 64 respondents
answered the detailed questions about the use of TDD on their projects. The diversity of
the sample can be characterized on the following attributes:
• Experience with TDD (Fig. 1);
• Geographical distribution (Fig. 2);
• Type of organization (Fig. 3);
• Type of project, where research indicates a main goal of publishing a paper and
production indicates a main goal of producing software for real users (Fig. 4).
4.2 RQ1. What are the effects of using TDD, from the developers’
perspective?
The following subsections address the effects of TDD on software quality (Sect. 4.2.1) and
the overall benefits of using TDD (Sect. 4.2.2).
4.2.1 Effects of TDD on software quality

Survey question 10 asked the respondents to rank the relative importance of the eight software quality attributes described by the ISO/IEC 25010:2011 standard (ISO/IEC 2011). To ensure that the respondents understood the meaning of each characteristic, the survey included a description of each characteristic drawn from the standard. We used these rankings to calculate the relative importance of each attribute by giving the #1 ranked attribute 8 points, the #2 ranked attribute 7 points, on down to the #8 ranked attribute 1 point. Figure 5 shows the results of this analysis in order of importance based on overall points.
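For clarity, this weighting scheme can be written out as a short script. The sketch below is ours, with made-up rankings rather than the actual survey data:

```python
from collections import Counter

NUM_ATTRIBUTES = 8  # the eight ISO/IEC 25010 quality characteristics

def score(all_rankings):
    """Each ranking maps attribute -> rank (1 = most important).
    Rank r earns (9 - r) points: #1 gets 8 points, #8 gets 1 point."""
    totals = Counter()
    for ranking in all_rankings:
        for attribute, rank in ranking.items():
            totals[attribute] += NUM_ATTRIBUTES + 1 - rank
    return totals.most_common()  # attributes sorted by overall points

# Illustrative data only: two hypothetical respondents, partial rankings.
print(score([
    {"Functional suitability": 1, "Reliability": 2, "Maintainability": 3},
    {"Reliability": 1, "Functional suitability": 2, "Maintainability": 4},
]))
```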
Survey question 11 asked the respondents to describe the effectiveness of TDD relative
to the most important software quality attribute they indicated in Question 10. Our qual-
itative analysis of the responses to this open-ended question resulted in the following
categories. For each category, we provide a list of the types of answers that constitute that
category:
• Very effective answers contain phrases like ‘‘very effective’’ or ‘‘very excellent’’,
• Effective answers contain other positive words,
4.2.2 Overall benefits of using TDD

Survey question 19 asked the respondents to describe the benefits and challenges of
employing TDD in their projects. This section only describes the benefits, whereas the
challenges are explained in Sect. 4.3.4. Using the qualitative data analysis process
described in Sect. 3.5, we grouped the responses into five primary benefits (labeled ‘B#’
below). The numbers in parentheses represent the number of responses grouped into each
category. Note that each respondent’s answer could have been coded into multiple
categories.
B#1. Ensure the quality of code (22) The respondents stated that while implementing the
system, they gain confidence by being able to test very specific code behavior. Tests are
written in conjunction with the development of functionality, providing valuable regression
notifications when combined with continuous integration testing. For example, one
respondent said: ‘‘The main benefit is that I have confidence in my software. Quite literally,
I sleep better at night’’.
TDD enforces the disciplines of writing testable code and systematically testing that
code during development. Many respondents also underscored that writing unit tests before
code is better than writing tests after coding, when it is more difficult to trust the code.
TDD allows developers to safely change the code and let it evolve without the worry of
breaking more parts of the code with each new change.
B#2. Improve software changeability (13) Use of TDD increased developers’ confi-
dence that new capabilities would not break existing capabilities, results, or interfaces. In
addition, the testing artifacts are checked over time to ensure that modifications do not
break the original function. A software package with thorough and easy-to-use tests can be
refactored with confidence over time, thereby improving the longevity of the product.
Conversely, software that lacks a solid test suite can become extremely fragile and cannot be refactored with confidence.
B#3. Improve software maintainability (11) Refactoring makes it easy to add features to
the software, aids in debugging, and facilitates the fixing of future problems. The tests also
provide documentation describing test drivers, inputs/outputs, and examples of program
execution. For example, one respondent said: ‘‘Having unit tests greatly improves main-
tainability, assisting refactoring and new development, but this is true when tests were
developed before the functional code’’.
B#4. Identify problems early (10) The respondents reported that they could find prob-
lems/bugs earlier in the development process. With TDD, the respondents could remove
defects and improve correctness at the beginning of the project, reducing the overall
number of defects. One respondent reported that s/he could also find performance bot-
tlenecks quickly.
B#5. Better understanding of software requirements (10) The process of developing test
cases first requires the developers to understand the requirements clearly before writing the
code. The developers gained a deeper understanding of the actual problem before coding.
They were also forced to think through edge cases before writing code and thus encoun-
tered fewer surprises later in the process.
4.3 RQ2. What are the difficulties of using TDD?

This section first reports the overall difficulties of using TDD in scientific projects (Sect. 4.3.1). Then, it describes the problems and solutions specifically related to testing (Sect. 4.3.2) and refactoring (Sect. 4.3.3).
4.3.1 Difficulties
Survey question 12 asked the respondents to rank the difficulty of the following TDD
activities: implementing the code, writing a test, and refactoring. The results show that
writing a test was the most difficult activity, followed by implementing the code.
Theoretically, the TDD process does not require software design before the code is
implemented because the developers have to write only the code corresponding to the test.
In practice, however, the developers may need to design the system before testing or
implementing actual code. In terms of when they performed design activities, 63 %
designed the software before implementing the code (Survey question 14) and 67 %
designed the software while coding (Survey question 15), with some overlap between the
two groups. This evidence may imply that it is difficult to adhere strictly to the TDD
protocol, which does not require a design stage before implementing the code.
4.3.2 Testing problems and solutions

Survey question 24 asked the respondents which problems they encountered when writing
tests and how they solved those problems. The respondents noted a range of testing issues
when employing TDD. We grouped the results into six testing problems (TP) and five
testing solutions (TSL). The numbers in parentheses represent the number of respondents
whose answers were grouped into each category. Note that each respondent’s answer could
have been coded into more than one category.
TP#1. Difficulty of writing a test (19) The main problem when writing a test is to write a
good test. Many respondents explained that writing a good test can be more difficult than
implementing the actual code, particularly for tests that do not have to be changed when
the code changes. For example, one respondent said: ‘‘It’s hard to write good tests,
especially tests that won’t have to be changed if you change the internals of functions’’.
Furthermore, during refactoring, developers sometimes realize that the code requires
changes, which requires modifications to the tests and to the documentation (e.g., com-
ments, or developer’s documentation). The respondents also explained that testing without
a tool or framework made it more difficult to write tests; thus, developers were less likely
to do a good job. From the perspective of senior developers, the experience of each new
developer is different, so it is difficult to expect their tests to be written well.
TP#2. Complex application (14) Based on the responses, we classified these issues into
3 groups: complicated algorithm, numerical computations, and parallel computing. In
scientific software, there are many parts in the code that consist of complex algorithms. The
complex code requires complicated tests. Many problems with complex code occur at
production runtime, such as deadlocks and repetitive connection drops; it is thus difficult to
test these problems in advance. Furthermore, the respondents explained that tests for
numerical computations are difficult because the developers do not know the right answer
with full confidence. Regarding parallel computing, the respondents explained that writing
tests to examine concurrency issues in parallel computing was difficult. In particular, the
‘right answer’ for a complete simulation is generally not known a priori.
TP#3. Code coverage (13) Theoretically, tests should cover most of the code, but
respondents explained that the code coverage does not always reach 100 %. In particular, it
is difficult to provide comprehensive coverage for scientific code with many dynamic parts.
One respondent reported that the code coverage analysis tool does not work for their C++ template code. In addition to the traditional definition of code coverage, these responses
also include issues of test validity, i.e., whether a test case tests what it is really intended to
test. In scientific software development, it is difficult to correlate verification tests and
issues with validation tests, so the developers focus on testing based on code functionality.
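For readers unfamiliar with coverage measurement, the sketch below shows one common way to gather statement coverage in a Python project with the coverage.py package; this is our illustration (the test directory name is an assumption), not a tool reported by the respondents:

```python
# Sketch: measuring statement coverage of a unittest suite with coverage.py
# (https://coverage.readthedocs.io). Assumes tests live in ./tests.
import coverage
import unittest

cov = coverage.Coverage()
cov.start()

# Discover and run the project's test suite.
suite = unittest.defaultTestLoader.discover("tests")
unittest.TextTestRunner().run(suite)

cov.stop()
cov.save()
cov.report()  # prints per-file statement coverage percentages
```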
TP#4. Lack of SE practices, tools, and standards (7) Respondents thought that writing the tests required them to understand SE practices. One problem of writing tests is
introducing unit testing when not all developers have experience. Scientific developers
from a research environment tend to be less disciplined in building and maintaining tests
than required. Additionally, existing automated testing tools, which were designed for
software engineers, are not always easy for scientists to use (e.g., wording in the menu, no
support for Fortran). Finally, a respondent explained that he/she must assume a tolerance
for the computed output, but there is no standard for creating the tolerance value.
Therefore, the tolerance may be considerably larger than would be ideal.
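In the absence of a standard, one widespread convention in Python tests is simply to make the assumed tolerance explicit. The sketch below is ours; the integration kernel and the tolerance value are illustrative, not taken from a respondent’s project:

```python
import math
import unittest

def trapezoid_integrate(f, a, b, n=1000):
    """Composite trapezoidal rule; a stand-in for a scientific kernel."""
    h = (b - a) / n
    return h * (0.5 * f(a) + sum(f(a + i * h) for i in range(1, n)) + 0.5 * f(b))

class TestIntegration(unittest.TestCase):
    def test_sine_integral(self):
        # Analytic answer: the integral of sin over [0, pi] is exactly 2.
        result = trapezoid_integrate(math.sin, 0.0, math.pi)
        # The tolerance is an assumed value chosen by the test author;
        # as TP#4 notes, there is no standard for picking it.
        self.assertTrue(math.isclose(result, 2.0, rel_tol=1e-4))

if __name__ == "__main__":
    unittest.main()
```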
TP#5. Time consuming (6) It takes longer to write the test, which affects budgets and
planning. For example, one respondent said: ‘‘it is very hard to justify spending a lot of
time on (tests don’t produce ‘‘results’’)’’. A respondent indicated that there is more test
code than actual program code. The testing code becomes a major barrier to improving the
code. For example, a change in the program that might involve 500 LOC could require
changing 100 lines of testing code. This amount of code makes the developers less inclined
to make changes in the production code that require test modification. Adequate testing is a
full-time job, especially when a new change is introduced by the customers or users on a
tight schedule. In particular, the research environment often requires developers to change
their code in ways that require test modification as well.
TP#6. Code or requirement is changed (4) The survey participants indicated that the test
must change when the requirement or API changes. For example, if the algorithm changes,
the test may no longer be sufficient because the random number generator is used in a
different manner.
TSL#1: Use a suitable tool (11) The respondents believed that a good testing framework
or automated testing tool would facilitate writing tests. The respondents also suggested that
the tool would save time compared to writing the test manually. They also recommended
choosing appropriate tools for the projects.
TSL#2: Experience (11) Many respondents indicated that writing a good test requires
experience, particularly when testing complex code or algorithms. Furthermore, experi-
enced developers could confidently provide the team with the solutions to testing problems.
Scientific developers often need to learn the basics of why software is tested and what
constitutes a good test. A collection of best practices and examples would be quite helpful
in this regard. In particular, mentoring or coaching on test writing could be of great benefit.
TSL#3: Redesign the test (5) When the code coverage percentage is lower than desired,
developers must redesign the unit tests. This process requires another developer to review
the test, in addition to the code. In some cases, using a combination of unit testing and
system testing can help solve the code coverage problem. Also, considering the number of
bugs would help developers redesign the unit test to increase the code coverage.
TSL#4: Understand the requirements (4) A clear understanding of the requirements is necessary to write the test, particularly when the developers are working on complex problems, such as numerical or multi-physics simulations. An in-depth understanding could improve the test coverage and reduce the required amount of time.
TSL#5: Gain confidence (3) When developers work on simulation software involving
floating-point calculations and parallelism, they must check simpler output to gain confi-
dence. For example, to ensure that the testing of a stochastic model is correct, scientists
need to run multiple simulations and analyze the output.
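One way to act on this suggestion is to fix random seeds and test an aggregate statistic against a deliberately loose tolerance. The sketch below is our illustration; the toy model and thresholds are hypothetical:

```python
import random
import statistics
import unittest

def simulate_mean(n_samples, seed):
    """Toy stochastic 'simulation': sample mean of Uniform(0, 1) draws."""
    rng = random.Random(seed)
    return statistics.fmean(rng.random() for _ in range(n_samples))

class TestStochasticModel(unittest.TestCase):
    def test_mean_approaches_expected_value(self):
        # Run several independent replications and check the aggregate
        # statistic rather than any single noisy output.
        means = [simulate_mean(10000, seed) for seed in range(5)]
        grand_mean = statistics.fmean(means)
        # E[Uniform(0, 1)] = 0.5; the tolerance is deliberately loose.
        self.assertAlmostEqual(grand_mean, 0.5, delta=0.01)

if __name__ == "__main__":
    unittest.main()
```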
In summary, Table 2 presents the mapping between problems and solutions. The checkmarks represent solutions suggested by survey participants. However, based on our own experience, we also suggest additional solutions to some problems (represented by exclamation marks in the table). The last column presents the number of solutions for each problem, including respondents’ solutions and our suggestions.

[Table 2 Testing problems and solutions. Rows: the six testing problems (TP#1–TP#6). Columns: the five solutions (use a suitable tool, experience, redesign the test, understand the requirements, gain confidence) and the total number of solutions per problem (survey, our experience). ✓ = solution suggested by participants; ! = solution based on our experience.]
Additionally, in the traditional software engineering literature, the following methods
have been proposed to address some of the problems identified in Table 2.
• To gain better understanding of the requirements and more confidence, the developers can employ "specification by example" instead of abstract prose (Koskela 2007).
• In Test-Driven Development: By Example (Beck 2002), Kent Beck recommends two
approaches to improve the code coverage, including: (1) to write more tests and (2) to
simplify the logic of the program.
4.3.3 Refactoring problems and solutions

Survey question 25 asked the respondents about the problems they encountered during refactoring and their solutions to those problems. Our analysis resulted in five refactoring problems (RP) and four refactoring solutions (RSL). The numbers in parentheses represent
the number of respondents whose answers were grouped into each category. Note that each respondent’s answer could have been classified into more than one category.
RP#1. Dependence on unit tests (8) Because the implementation is often driven by the
tests rather than by a requirements specification, the respondents indicated that the process
of refactoring is difficult if the unit tests are not well designed. At the same time, software is nearly impossible to refactor without thorough unit tests that are readily available. For example,
one respondent said: ‘‘Bad unit tests were a hindrance to refactoring, because they tested
the implementation details, not the requirements’’. Similarly, it is also hard to perform
refactoring if the code coverage of the test suite is low.
RP#2. Dependence on the architecture design (7) Although testing is essential to
refactoring, poor architecture design makes refactoring difficult or impossible. Further-
more, the respondents indicated that if the initial architecture or system design is poor, it is occasionally necessary to redesign the system and revisit large portions of it.
RP#3. Dependence on the development environment (7) The development environment
includes the platform, programming languages, tools, and interacting components. For
example, one respondent reported that refactoring an application implemented with C is
more difficult than refactoring an application implemented with Python, because Python is
object-oriented and C is not. The respondents also indicated that it is difficult to refactor
without tools. For example, providing code to a large number of users requires a certain
level of API compatibility with older versions. Refactoring is impossible without the
appropriate tools.
RP#4. Lack of knowledge regarding when and how to refactor (6) One common
problem is developers’ failure to understand refactoring and recognize its benefits. There is a lack of knowledge of how to refactor (at all and/or efficiently) or why refactoring is
conducted (because it does not add functionality). Refactoring also may require knowledge
of advanced programming techniques, e.g., template code and functional programming.
The respondents also found that it is difficult to decide when to start refactoring. Problems
are exacerbated when refactoring is delayed. If the refactoring affects many developers’
code, it is difficult to mitigate the problem within a short period of time.
RP#5. Legacy code (3) Refactoring is difficult when the respondents are working with a
diverse legacy code base. More specifically, for legacy code, the developers may not
actually have adequate tests to ensure that a refactoring does not cause problems.
RSL#1. Coaching developers (8) Education about refactoring is necessary. In the con-
text of a research environment, the team leader may need to convince the members that
refactoring saves time and improves software quality. Experienced developers should
provide guidance and help the other developers when they find problems. Regarding the
knowledge about the appropriate time to refactor, generally, refactoring can be performed
at any time during the development process. There is no rule regarding the time for
refactoring. The respondents’ suggestions indicate there are three options for refactoring:
(1) performing several rounds of refactoring would provide better results than would
performing only a few rounds, (2) refactoring when the code begins to become disorga-
nized, or (3) refactoring shortly after a release, before the developer becomes entangled in new features for the next release. Additionally, to reduce the impact of differing experience levels among developers, each developer should try to use a simple refactoring method initially.
RSL#2. Use refactoring tools (7) The respondents explained that automated refactoring
tools would be useful, particularly when working on large and complex applications.
Additionally, the refactoring tool should be able to integrate with the IDEs.
RSL#3. Redesign the software architecture (4) In some cases, it is very difficult to
refactor poor code. One solution is to redesign the software. The respondents indicated that
redesigning the software architecture might save time compared to refactoring code in
some cases.
RSL#4. Redesign the unit tests (2) Revisiting the unit tests would help developers reduce
the time involved in refactoring, especially redesigning some of the worst tests.
In summary, Table 3 presents the mapping between problems and solutions. The checkmarks represent solutions suggested by survey participants. We also suggest additional solutions to some problems based on our own experience (represented by exclamation marks in the table). For each problem, the last column presents the number of solutions, including respondents’ solutions and our suggestions.

[Table 3 Refactoring problems and solutions. Rows: the five refactoring problems (RP#1–RP#5). Columns: the four solutions (coaching developers, use refactoring tools, redesign the software architecture, redesign the unit tests) and the total number of solutions per problem (survey, our experience). ✓ = solution suggested by participants; ! = solution based on our experience.]
In addition, the following methods have been proposed in the traditional software
engineering literature to address some of the problems identified in Table 3.
• Abdel-Hamid (2013) proposed the following legacy code refactoring method: (1) quick-wins refactoring (e.g., remove dead code, remove duplicated code), (2) decompose the system into components, and (3) create automated tests for the components.
• Mens and Tourwé (2004) provide an overview of refactoring techniques along various dimensions, including refactoring activities, specific techniques, software artifacts being refactored, and effects of refactoring. This work would help developers gain more knowledge about refactoring.
4.3.4 Challenges
Survey question 19 asked the respondents to describe the challenges of employing TDD in
their projects. The goal of this question was to understand the challenges of TDD as a
whole rather than the difficulties of each TDD step, which was the focus of Question 12.
We grouped the responses into four primary challenges (C). The numbers in parentheses
represent the number of responses grouped into each category. Note that each respondent’s
answer could have been classified into more than one category.
C#1. Spending an excessive amount of effort (32) Several survey respondents reported spending excessive time writing tests, and thus, the developers often skipped testing in the final stages. Additionally, adding TDD to an existing project requires
enormous effort. One respondent noted that TDD can only be performed by developers who understand what the code does in detail and are willing to invest time in code quality.
TDD works well if the developers know exactly what the code should produce, which
requires the developers to spend an excessive amount of time on the requirements or
specifications of the project. In terms of project management, it might be difficult to
demonstrate overall progress to the customers or users. For example, one respondent said:
"TDD requires more time before code is written, harder to show progress of the ’whole’ in the nearer term".
Additionally, the amount of code could be twofold or threefold greater than the amount
of code that the developers would write without TDD. Furthermore, the developers have to
maintain the test code. The increased amount of code has a higher cost of maintenance.
In an academic environment, writing a test case prior to writing actual code can detract from the time and resources available for obtaining and publishing research results in the short term. Similarly, there is not a sufficient amount of research funding to write the amount of
testing code that real TDD would require. There is a trade-off between the rigorous
adoption of TDD and idea exploration for the development of functions with a significant
research component to them. In addition, many scientific software teams contain scientific
domain experts (rather than trained computer scientists). TDD does not come naturally to
many scientific developers. Sometimes TDD just does not fit the context of the develop-
ment environment. The adoption of TDD requires substantial effort to promote the vision
and maintain the necessary discipline.
C#2. Difficult to write tests (16) Three specific types of tests that are challenging to
write include:
1. Integration tests for large numerical computations or complex functions. In scientific
software, for example, obtaining output is often computationally expensive, there are
issues with numerical precision, and in some cases, it is impossible to compute the
expected output. Additionally, the context can change between the tests and code,
especially in the development of complex code. Several respondents noted that it is
extremely difficult to test concurrency issues.
2. Tests for research code because the expected results are typically unknown.
3. Tests that achieve high code coverage for complex capabilities.
C#3. Learning curve (12) TDD is a new concept in the scientific domain. The idea of
writing tests before code is a new experience for many scientists and scientific developers.
Therefore, there is a learning curve associated with TDD for many developers. It can also
be difficult to convince large organizations of the financial benefits of TDD, especially
those organizations that are not interested in using SE practices, which are required for
refactoring. For example, one respondent said: ‘‘it is difficult to learn, hard to justify to
some management types’’.
C#4. Need to set up a new environment (8) Some survey participants reported that the
organizations must create an adequate TDD environment and maintain a consistent level of
testing across all developers. TDD cannot be successful without support from management
and developers. Although some of the benefits of TDD tend to accrue to a development
team or research group as a whole over time, an individual developer might view it as
significant additional work with only a modest short-term payoff. Additionally, some
organizations employ sub-contractors to implement a system. TDD works when both
parties operate from a common set of rules. Therefore, the goal, process, and protocols
must be established for collaboration between the organization and sub-contractors.
In the scientific software development environment, the developers lack appropriate
tools to readily perform TDD. The currently available tools are typically not ideal for
working with numerical software because they do not handle tolerance issues adequately. For example, in Fortran, only a few unit testing frameworks are currently maintained, and few Fortran developers actually have and use any of these frameworks.

4.4 RQ3. Which testing methods do developers use?

The first step was to determine whether scientific developers defined testing in the same way as traditional software engineers. Survey questions 21, 22, and 23 asked respondents whether they agreed with standard definitions for unit testing, integration testing, and regression testing, respectively. Table 4 shows the definitions3 along with the level of agreement.

Table 4 Testing definitions and the percentage of respondents who agreed with each
Q21 Unit testing: Testing the smallest testable units (e.g., class, module, function) in a software system in isolation. Usually done with a specialized unit testing framework. (95.3 %)
Q22 Integration testing: Testing that occurs after unit testing and is intended to ensure that the units interact properly. (89.1 %)
Q23 Regression testing: Rerunning test cases (which were successful in the past) to ensure that changes to the code have not introduced bugs. (85.9 %)
Overall, there was a very high level of agreement. The disagreements can be summarized
as follows:
• Unit testing Only three respondents disagreed with the given definition. One disagreed
with the second sentence, stating that the framework should be changed to ‘general’
rather than ‘specialized’. Two respondents thoughts that ‘smallest’ was difficult and too
restrictive.
• Integration testing Seven respondents disagreed with the given definition. Five did not
think that integration testing was separate from unit testing or necessary to perform
after unit testing. One respondent thought that ’properly’ had no meaning, and another
one did not think that integration testing was important in scientific software
development.
• Regression testing Nine respondents disagreed with the given definition. All of those
respondents said regression testing does not ensure that ‘bugs are not introduced’. Instead, it only ensures that the tested invariants are preserved. Furthermore, regression testing does not always catch every problem.
Survey question 20 asked the respondents to explain the testing methods used in their
projects. The responses indicate that respondents interpreted this question in different
ways. Because many of the respondents did not provide a detailed explanation for their
answer and many of the terms could have multiple meanings, here we simply report the
answers provided and do not attempt to provide our own opinion on what the respondents
meant. As such, there were two types of answers:
3 Definitions adapted from https://msdn.microsoft.com/en-us/library/aa292484(v=vs.71).aspx.
1. Multiple Some respondents reported testing at multiple levels of the software lifecycle,
including the following combinations: (1) unit, regression, and integration testing, (2)
unit, regression, integration, and system testing, and (3) unit and regression testing.
2. Single Other respondents reported use of only one approach. This list of approaches is
quite heterogeneous, ranging from levels of testing to types of testing. The responses
included (1) unit testing, (2) performance testing, (3) comparing known values, (4)
verification testing, (5) validation testing, (6) evaluating the code coverage, (7)
automated testing tool, (8) white box, (9) smoke test, (10) pre-release testing, (11)
positive testing, (12) negative testing, (13) black box testing, and (14) ad-hoc testing.
Figure 7 summarizes the number of respondents that provided each answer.
Survey question 18 asked whether the respondents used automated testing tools. The
majority, 80 %, reported that they did use automated testing tools. The respondents mentioned 24 different automated testing tools in total. While a complete list of the tools appears in Appendix 2, Fig. 8 presents the seven most commonly mentioned:
1. CMake 4 A tool designed to build, test and package software. It is used to control the
software compilation process. CMake is invoked on the project’s source directory,
parses the text files describing the build process, and generates a native build chain for
the desired platform and compiler. CMake provides options to the user with which the
build process can be customized.
2. CTest 5 CTest is an automated testing tool distributed with CMake. CTest can perform
several operations, including configure, build, and execute predefined tests. It also
includes advanced features for testing, such as code coverage and memory checking.
3. JUnit 6 JUnit quickly became the de facto standard framework for writing and running automated tests in the Java programming language.
4. Google Test or GTest 7 A framework for writing C/C++ tests on a variety of platforms, including Linux, Windows, and OS X. GTest provides various options for running the tests.
5. Python Nose 8 Nose is a Python package that provides an alternate test discovery and
running process for unit tests.
6. Python unit test 9 Python unit test (the built-in unittest module) is similar to JUnit, but for the Python language.
7. CxxTest 10 CxxTest is a unit testing framework for C++ that is similar to JUnit. CxxTest supports a very flexible form of test discovery.
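To illustrate the test-discovery style that frameworks such as Nose provide, the fragment below (our example, with hypothetical names) relies on naming conventions alone; Nose-style runners collect and execute functions whose names start with test_ without any TestCase boilerplate:

```python
# test_kinetics.py -- discovered automatically by Nose-style runners
# because the module and function names start with "test".

def reaction_rate(k, concentration, order=1):
    """Toy rate law standing in for real scientific code."""
    return k * concentration ** order

def test_first_order_rate():
    assert reaction_rate(2.0, 3.0) == 6.0

def test_second_order_rate():
    assert reaction_rate(2.0, 3.0, order=2) == 18.0
```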
4.5 RQ4. Which refactoring techniques do developers use?

This section describes how respondents identified poor code in need of refactoring and how they refactored their code.
4 http://www.cmake.org.
5 http://www.cmake.org/cmake/help/v2.8.8/ctest.html.
6 http://www.junit.org.
7 https://code.google.com/p/googletest.
8 http://nose.readthedocs.org.
9 http://docs.python.org.
10 http://cxxtest.com.
Survey question 17 asked respondents to explain how they identified poor code or a poor
design. Respondents could provide more than one answer. Our analysis of the responses
resulted in the following answers. Note that each respondent’s answer could have been
grouped into more than one category.
Code or peer review (47) This code review occurs either daily or periodically. In many
cases, the respondents used IDEs with syntax highlighting to support the code review
process. These IDEs helped reviewers identify certain issues like repetitive copy-and-paste
code or routines that declare large numbers of poorly named variables. The respondents
reported the following approaches for code review:
1. Comparison with guidelines or best practices Respondents used either programming
language-specific or general programming guidelines as a standard. In addition,
respondents used commonly accepted good SE practices to identify poor code, e.g.,
duplicated code, lack of modularity, lack of separation of concern, large argument list
in procedure calls, and lack of encapsulation.
2. Finding complex code Complex code is a warning sign, particularly in parts of the system that are critical to performance. Additionally, if the code is hard to understand within a reasonable amount of time, it is considered poor code.
Poor performance (16) The respondents used profiling tools to identify low-performing portions of the code. In addition, the respondents observed the system during execution to note problems with unreasonably long execution times.
Code that is difficult to modify (9) Respondents indicated that an indication of poor code
is when a developer must modify several portions of code to make a single change. For
example, one respondent said: ‘‘When modifying code, if locating the right point is hard,
then I first refactor’’. Another symptom describing poor code is that the existing code is not
extensible (e.g., adding or extending new features to the system). This symptom will hinder
or prevent further changes.
Number of bugs or defects (8) When the program returned incorrect results or displayed
unexpected behaviors (e.g., sporadic shutdown), the developers examined the code related
to that result. Similarly, the respondents tracked which code tended to cause crashes or
errors most often. Bug reports from colleagues and users were also used to identify poor
code.
Lack of documentation (7) A system that lacks design documents tends to have poor code because the developers do not understand the existing system well, and thus, modifications will result in many problems. Rather than reviewing the code, reviewing the design helps
developers identify poor code earlier in the process. The software design is simultaneously
reviewed while inspecting the requirements. Good design should conform to the given
requirements. One respondent explained that design reviews with users could help
developers to identify poor design quickly.
Code that is difficult to test (3) The respondents stated that non-testable code was often
identified as poor code.
5 Discussion
This section revisits the four research questions described in Sect. 1 to provide answers
based on the results described in Sect. 4.
5.1 RQ1. What are the effects of using TDD, from the developers’
perspective?
The results indicate that TDD affects some particular software quality characteristics, including Functionality, Reliability, Performance, and Maintainability. TDD gives developers confidence that their software performs all of its intended functions correctly. Based on the results, TDD is more effective for general scientific projects than for parallel computing projects. Using the TDD process, developers might produce twofold
for parallel computing projects. Using the TDD process, developers might produce twofold
to threefold more test code than actual code and may generate thousands of unit tests.
Moreover, testing and refactoring are difficult for large software products, particularly
when testing concurrency issues. For example, mathematical software frameworks provide
such a wide variety of components for the user to choose from that it is nearly impossible
to consider the full coverage of test cases. Therefore, TDD might not be suitable for large
scientific projects, except in the presence of robust tools and experienced developers.
Best practices or guidelines might help developers successfully employ TDD in sci-
entific projects. In large projects, the developers should perform refactoring often. The use
of multiple testing methods would also help developers ensure that nearly all of the code
was tested.
5.2 RQ2. What are the difficulties of using TDD?

The most difficult aspect of TDD in the scientific environment is writing good tests,
especially when the software includes floating-point calculations, numerical methods, and
parallelism. Writing good tests requires a thorough understanding of the requirements.
Furthermore, time pressure from users can make it difficult for developers to spend the
time required to define and implement adequate unit tests for new features.
Ideally, when using TDD, developers should not have to change unit tests that the code
has successfully passed. Unfortunately, in many cases, it is not practical to leave these tests
unchanged, especially after the code is refactored. For example, when reducing the number of parameters in a function, developers must then change the unit tests that call the refactored function. This modification is helpful for developers, but it breaks the refactoring practice that behavior-preserving changes should not force test changes.
When working with legacy code, a common practice in scientific software, the lack of
existing unit tests makes it difficult or impossible for developers to demonstrate correct
behavior. This inability to easily test for correctness also makes incremental refactoring
difficult. Another challenge with refactoring legacy code is that the large amount of time
required to run validation tests, when they exist, reduces the efficiency of the refactoring
process. While TDD is not strictly an OO phenomenon, the ability to use classes for testing
is helpful. A large amount of legacy code uses languages that are purely procedural (e.g.,
C) or the earlier procedural version of a language that now includes OO (e.g., Fortran and
Cobol) that lack the ability to use classes. One method used to improve the legacy code is
managing the code dependencies carefully. However, it is difficult to break dependencies
in procedural code.
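As an illustration of what careful dependency management can look like, the sketch below (our example, not from the survey) breaks a hard-wired file-system dependency by passing the data in as a parameter, which makes the legacy computation testable in isolation:

```python
# Before: the routine is hard-wired to the file system, so it cannot be
# unit tested without real input files.
def total_energy_legacy(path):
    with open(path) as f:
        return sum(float(line) for line in f)

# After: the computation accepts any iterable of readings (an injected
# "seam"), and a thin wrapper keeps the original file-based entry point.
def total_energy(readings):
    return sum(readings)

def total_energy_from_file(path):
    with open(path) as f:
        return total_energy(float(line) for line in f)

# A test can now exercise the computation with in-memory data:
assert total_energy([1.0, 2.5, 3.5]) == 7.0
```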
Section 2 described a number of challenges to the adoption of TDD in industry as
identified by Causevic et al. (2011). The following challenges are also present when
attempting to adopt TDD in the scientific environment (the location in our survey appears
in parentheses): (1) increased development time (TP#5 and C#1), (2) insufficient TDD
experience/knowledge (TP#4 and RP#4), (3) insufficient design (TP#1 and RP#2), (4)
insufficient developer testing skills (C#3), (5) insufficient adherence to the TDD protocol
(Sect. 4.3), (6) tool-specific limitations (TP#4 and RP#3), and (7) legacy code (RP#5).
In addition, our survey revealed some TDD adoption problems that have not been
reported in the traditional software engineering literature and may be specific to the sci-
entific domain, including:
1. Complex application (TP#2), code coverage (TP#3), and difficulty writing tests (C#2) Scientific software is more likely than traditional software to contain complex numerical computations or advanced algorithms. Additionally, this problem occurs in
the testing process of the scientific environment, where parallel computing is often
utilized. SE researchers should carefully study this problem because existing testing
methods may not be suitable for the parallel computing environment.
2. Code or requirement is changed (TP#6) This problem might appear especially in the
scientific environment because scientists cannot know all of the system features in
advance. More specifically, the requirements are based on the results of experiments.
In contrast, in traditional software development, most often developers have a better
idea of the requirements at the beginning of the project. Although the requirements
may not be complete, in more traditional software development, developers should be
better able to predict the needs of customers than in scientific projects.
3. Dependence on unit tests (RP#1) In the scientific domain, this problem might result from the generally low code coverage obtained by the test suite (TP#3).
4. Need to set up a new environment (C#4) Scientific projects often involve scientists
with different levels of programming expertise. Effective adoption of TDD requires
most developers to have a good understanding of the TDD process and its benefits.
These problems indicate that the adoption of TDD in a scientific environment may
require more attention than in a traditional environment. To minimize problems, both
technical solutions and managerial strategies are necessary.
5.3 RQ3. Which testing methods do developers use?

Scientific developers used three primary testing approaches: unit testing, regression testing, and integration testing. Scientific developers appear to use the same automated testing tools as traditional software developers. In addition, scientific developers commonly use the CMake framework because it works in various environments. For other tools, we believe that the programming language is an important key to selecting a specific tool, such as CTest and JUnit. While some software testing frameworks have explicit support for distributed parallelism, e.g., pFUnit, few respondents reported use of these tools. Conversely, about 20 %
of the scientific developers did not use any automated testing tools. This result suggests that
the scientific developers may need some additional training in the use of testing tools.
5.4 RQ4. Which refactoring techniques do developers use?

Although many scientific developers did apply refactoring techniques, most were relatively simple. Because complex refactoring techniques involve advanced SE practices, scientific developers often do not use difficult techniques. In some cases, the developers employed design patterns, which are relatively new to scientific software development. In addition to using the GoF design patterns (Gamma et al. 1995), they have also developed design patterns for specific programming languages like Fortran.
The refactoring phase is often overlooked in an academic environment due to the tight
schedules and funding constraints. Additionally, ensuring high-quality code is not always the
most important goal for scientific developers because they view the software simply as a means to
an end, i.e., scientific research. (Note: We are not endorsing this view, just reporting the results.)
In many cases, scientific developers still need to be convinced of the benefits of refactoring. The developers benefit from coupling training with hands-on experience. The evidence indicates that many scientific developers view refactoring tools as essential during the refactoring process. Automated refactoring tools would help scientific developers save time and effort. Unfortunately, some existing tools are limited to a few programming languages (e.g., Java, C++).
6 Threats to validity
We organize this section around the three common types of validity threats.
6.1 Internal validity

This study has two primary threats to internal validity. The first threat is a potential
selection bias. Our email distribution list consisted only of authors who had attended
related workshops, published papers about scientific software development, or were
members of a selected email list. Therefore, it omitted any scientific developers who may
have used TDD but had not published any papers or participated in the workshops or
mailing lists. As mentioned in Sect. 3.4, some of the recipients of the survey invitation
posted the survey link in community forums frequented by scientific developers. This
additional availability of the survey reduces (but does not eliminate) the selection bias.
The second threat relates to the qualitative analysis process. Each author coded the survey responses separately, introducing the possibility of a confirmation bias, in which each author seeks out evidence that supports his preconceived notions. To reduce this bias, each author
performed the analysis multiple times, each time arriving at the same results. Moreover, the authors
compared their individual results and refined those results until there were no disagreements.
6.2 Construct validity

Construct validity is concerned with whether the concepts being studied are correct and whether
the survey can be understood by participants. To mitigate this threat, we asked SE and scientific
experts to evaluate the survey questions. In addition, we conducted a pilot study with a subset of
the survey population. The results of the pilot helped us refine the wording of the questions to
make them as easy to understand as possible. In addition, the high level of agreement with the
testing definitions provided in Table 4 provides additional confidence in our results.
6.3 External validity

External validity focuses on the generalizability of the study results. The main threat to
external validity is the sample size of 64 participants. The demographics in Sect. 4.1
suggest that overall the pool of respondents is diverse. The one characteristic that is a bit
skewed is the geographical distribution. While the scientific community is international,
more than 60 % of the survey responses came from North America. It is possible that the
level of knowledge and experience differs across geographical boundaries. While we have
no evidence to suggest this distribution biased the results, it is possible. Because we cannot
guarantee that the survey participants adequately represent the population, the results of
this study may not be generalizable to all scientific communities.
7 Conclusion
The results of this survey provide empirical evidence about the effectiveness of TDD for
scientific software development. In terms of software quality characteristics, the primary effect
of TDD is to improve functionality: TDD supports the addition of new functionality at little
cost. Developing software with testability in mind also yields software that is
extensible, flexible, and maintainable. Additionally, TDD helps developers identify and
remove defects early in the project, thereby reducing the overall number of defects.
Writing tests, especially tests that do not have to change when the code is refactored, is
the most difficult task in the TDD process. Writing tests that examine concurrency issues
in parallel computing is also difficult, and poorly implemented unit tests make the
refactoring process difficult. Despite the benefits of refactoring, there is little
motivation to refactor code in an academic environment, because researchers in the
scientific domain rarely revise code after the corresponding paper is published.
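As a minimal sketch of the refactoring-stable tests described above (a hypothetical Python
example, not taken from any survey response), a test that pins down only observable
behavior remains valid however the implementation is restructured:

# The test asserts the behavioral contract (inputs in, outputs out),
# not implementation details, so refactoring the body of running_mean
# does not force the test to change.
def running_mean(samples):
    total = 0.0
    means = []
    for i, x in enumerate(samples, start=1):
        total += x
        means.append(total / i)
    return means

def test_running_mean():
    assert running_mean([2.0, 4.0, 6.0]) == [2.0, 3.0, 4.0]

test_running_mean()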
We believe that the results of this work will benefit scientific developers and SE
researchers. The solutions to the testing and refactoring problems reported here provide
practical suggestions for other scientists who are using TDD in their projects. The results
should also interest researchers, because there is a need for additional empirical
evaluations of adopting TDD in various contexts.
Acknowledgments The authors gratefully thank all participants in the survey for their time and contri-
butions. Jeffrey Carver would like to acknowledge partial support from NSF Grants 1243887 and 1445344.
Appendix: Survey questions
2. What type of projects do you typically work on? (You may select more than one)
Research (main goal is to publish papers)
Production (main goal is to produce software for real users)
Other
3. Please describe any other significant work experience in fields other than your
educational background.
....................................................................................................................................................
4. Please describe your educational background (i.e., list degrees and majors, e.g., B.S.
in Chemistry; M.S. in Chemistry, etc.)
....................................................................................................................................................
5. How many years have you been developing real scientific software projects?
....................................................................................................................................................
Reading books
Training course
Co-workers
Learning on my own from online resources
Other
10. Please rank these software quality characteristics based on what is important to your
software [1 - Most important]
Compatibility
(The ability of two or more software components to exchange information and/or to perform
the required function while sharing the same hardware or software environment)
Functional suitability
(The degree to which the software product provides functions that meet stated and implied
needs when the software is used under specific conditions)
Maintainability (The degree to which the software product can be modified. Modification
may include corrections, improvements, or adaptation of the software to changes in the
environment, requirements, and functional specifications)
Operability (The degree to which the software product can be understood, learned, and used
by, and is attractive to, the user when used under specific conditions)
Performance efficiency
(The degree to which the software product provides appropriate performance, relative to the
amount of resource used, under stated conditions)
Reliability
(The degree to which the software product can maintain a specified level of performance
when used under specific conditions)
Security
(The protection of system items from accidental or malicious access, use, modification,
destruction, or disclosure)
Transferability
(The degree to which the software product can be transferred from one environment to another)
11. Based on Question 10, how effective was TDD with respect to your most important
software quality characteristic?
....................................................................................................................................................
12. Please rank these TDD activities in terms of difficulty. [1 - Most difficult] (An
illustrative sketch of one TDD cycle follows this question.)
Write a test
Write code to make the test pass
Refactor the code
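(Illustrative sketch of one red-green-refactor cycle referenced in Question 12, assuming a
Python workflow; the example and names are hypothetical.)

# Step 1 (red): write a failing test before the production code exists.
def test_celsius_to_kelvin():
    assert abs(celsius_to_kelvin(25.0) - 298.15) < 1e-9

# Step 2 (green): write the simplest code that makes the test pass.
def celsius_to_kelvin(celsius):
    return celsius + 273.15

# Step 3 (refactor): restructure names or logic while the test stays green.
test_celsius_to_kelvin()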
13. Which techniques did you use to refactor the code? (Code refactoring is a disciplined
technique for restructuring an existing body of code, altering its internal structure
without changing its external behavior. Refactoring is undertaken to improve some of the
nonfunctional attributes of the software. An illustrative sketch follows this list.)
Breaking large methods up into smaller methods
Renaming methods, variables or classes
Simplifying control structure (e.g., series of if statements, or nested loops, etc.)
Creating encapsulated field (e.g., using getter and setter methods to make public member
data private)
Splitting large classes (Move part of the code from existing class into a new class)
Adding or removing parameters from a method
Moving methods or fields of a class to a superclass
Moving methods or fields of a class to a subclass
Applying design pattern(s) (please specify which design patterns)
Others
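(Illustrative sketch of the first technique listed above, breaking a large method into
smaller methods; a hypothetical Python example whose external behavior is unchanged.)

# Before: one function filters and averages in a single block.
def summarize(raw):
    cleaned = [x for x in raw if x is not None]
    return sum(cleaned) / len(cleaned)

# After: each step is extracted into a named helper; the external
# behavior (inputs and outputs) is unchanged, which is the essence
# of refactoring.
def drop_missing(raw):
    return [x for x in raw if x is not None]

def mean(values):
    return sum(values) / len(values)

def summarize_refactored(raw):
    return mean(drop_missing(raw))

assert summarize([1.0, None, 3.0]) == summarize_refactored([1.0, None, 3.0])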
14. When using TDD, how often do you design the software before writing the
code?
(Software design is the process of defining software methods, functions, objects, and the overall
structure and interaction of your code, e.g., creating flow charts or class diagrams)
Very frequently
Frequently
Occasionally
Rarely
Very rarely
Never
15. When using TDD, how often do you perform any software design activities
during code development?
Very frequently
Frequently
Occasionally
Rarely
Very rarely
Never
16. Besides refactoring, did you use other techniques or approaches to improve
the code?
Yes
No
17. Overall, how did you identify poor code or poor design?
....................................................................................................................................................
18. Did you use any automated testing tools? (e.g., CMake, CTest, GTest)
Yes (Please specify the tools)
No
19. Based on your experience, what are the benefits and challenges of TDD?
....................................................................................................................................................
20. Please explain the testing method(s) that you used in the project.
....................................................................................................................................................
21. Do you agree with this given definition of ‘Unit Testing’? (A minimal example follows
this question.)
Definition: Testing the smallest testable units (e.g., class, module, function) in a software
system in isolation. Usually done with a specialized unit testing framework.
Yes
No
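(Minimal example matching the definition in Question 21, assuming the pytest framework;
the function and file names are hypothetical.)

# Save as test_saturation.py and run with: pytest test_saturation.py
def saturation_ratio(partial_pressure, saturation_pressure):
    # The smallest testable unit here is a single pure function.
    return partial_pressure / saturation_pressure

def test_saturation_ratio():
    # Exercises the unit in isolation, with no external dependencies.
    assert saturation_ratio(50.0, 100.0) == 0.5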
24. What did you learn about the problem of writing tests in your project? How
did you solve such problems?
....................................................................................................................................................
25. What did you learn about the problem of refactoring the code in your project?
How did you solve such problems?
....................................................................................................................................................
References
Abdel-Hamid, A. (2013). Refactoring as a lifeline: Lessons learned from refactoring. In Agile conference
(AGILE), 2013 (pp. 129–136).
Strauss, A. L., & Corbin, J. M. (1990). Basics of qualitative research: Grounded theory procedures and
techniques. Newbury Park, CA: Sage Publications.
Beck, K. (2002). Test driven development: By example. Boston, MA: Addison-Wesley Longman Publishing
Co. Inc.
Beck, K., & Andres, C. (2004). Extreme programming explained: Embrace change (2nd ed.). Boston, MA:
Addison-Wesley Professional.
Carver, J. (2011). Development of a mesh generation code with a graphical front-end: A case study. Journal
of End User Computing, 23(4), 1–16.
Carver, J. C., Kendall, R. P., Squires, S. E., & Post, D. E. (2007). Software development environments for
scientific and engineering software: A series of case studies. In The 29th international conference on
software engineering (pp. 550–559). Minneapolis, MN.
Causevic, A., Sundmark, D., & Punnekkat, S. (2011). Factors limiting industrial adoption of test driven
development: A systematic review. In The 4th international conference on software testing, verification
and validation (pp. 337–346). Berlin.
Desai, C., Janzen, D., & Savage, K. (2008). A survey of evidence for Test-Driven Development in academia.
SIGCSE Bulletin, 40(2), 97–101.
Eclipse. (2013). Photran—An integrated development environment and refactoring tool for Fortran. http://
www.eclipse.org/photran/. Accessed December 2013.
Erdogmus, H., Morisio, M., & Torchiano, M. (2005). On the effectiveness of the test-first approach to
programming. IEEE Transactions on Software Engineering, 31(3), 226–237. doi:10.1109/TSE.2005.37
Fowler, M. (1999). Refactoring: Improving the design of existing code. Boston, MA: Addison-Wesley
Longman Publishing Co. Inc.
Gamma, E., Helm, R., Johnson, R., & Vlissides, J. (1995). Design patterns: Elements of reusable object-
oriented software. Boston, MA: Addison-Wesley Longman Publishing Co. Inc.
ISO/IEC. (2011). Systems and software engineering: System and software quality requirements and eval-
uation (SQuaRE)—System and software quality models. ISO/IEC 25010:2011.
Janzen, D., & Saiedian, H. (2005). Test-Driven Development concepts, taxonomy, and future direction.
Computer, 38(9), 43–50. doi:10.1109/MC.2005.314
Kollanus, S. (2010). Test-Driven Development—Still a promising approach? In Proceedings of the 7th
international conference on the quality of information and communications technology (pp. 403–408).
Porto, Portugal.
Koskela, L. (2007). Test driven: Practical TDD and acceptance TDD for Java developers. Greenwich, CT:
Manning Publications Co.
Mens, T., & Tourwé, T. (2004). A survey of software refactoring. IEEE Transactions on Software
Engineering, 30(2), 126–139.
Nanthaamornphong, A., Morris, K., Rouson, D., & Michelsen, H. (2013). A case study: Agile development
in the community laser-induced incandescence modeling environment (CLiiME). In The 5th international
workshop on software engineering for computational science and engineering (pp. 9–18).
Nanthaamornphong, A., Carver, J., Morris, K., Michelsen, H., & Rouson, D. (2014). Building CLiiME via
Test-Driven Development: A case study. Computing in Science Engineering, 16(3), 36–46.
Opdyke, W. F. (1992). Refactoring object-oriented frameworks. PhD thesis, University of Illinois at Urbana-
Champaign, Champaign, Illinois, USA.
Orchard, D., & Rice, A. (2013). Upgrading Fortran source code using automatic refactoring. In Proceedings
of the international workshop on refactoring tools (pp. 29–32). Indianapolis, IN.
Overbey, J., Xanthos, S., Johnson, R., & Foote, B. (2005). Refactorings for Fortran and high-performance
computing. In Proceedings of the 2nd international workshop on software engineering for high
performance computing system applications (pp. 37–39). St. Louis, MO.
Overbey, J. L., Negara, S., & Johnson, R. E. (2009). Refactoring and the evolution of Fortran. In Proceedings
of the international workshop on software engineering for computational science and engineering
(pp. 28–34). Vancouver, BC.
Rafique, Y., & Misic, V. (2013). The effects of Test-Driven Development on external quality and
productivity: A meta-analysis. IEEE Transactions on Software Engineering, 39(6), 835–856.
Ruparelia, N. B. (2010). Software development lifecycle models. SIGSOFT Software Engineering Notes,
35(3), 8–13.
Sanchez, J., Williams, L., & Maximilien, E. (2007). On the sustained use of a Test-Driven Development
practice at IBM. In Agile conference (AGILE), 2007 (pp. 5–14).
Sanders, R., & Kelly, D. (2008). Dealing with risk in scientific software development. IEEE Software, 25(4),
21–28.
Sletholt, M., Hannay, J., Pfahl, D., & Langtangen, H. (2012). What do we know about scientific software
development’s agile practices? Computing in Science Engineering, 14(2), 24–37.
Sletholt, M. T., Hannay, J., Pfahl, D., Benestad, H. C., & Langtangen, H. P. (2011). A literature review of
agile practices and their effects in scientific software development. In Proceedings of the 4th
international workshop on software engineering for computational science and engineering (pp. 1–9).
Honolulu, HI.