Automated Software Testing as a Service

George Candea, Stefan Bucur, Cristian Zamfir
School of Computer and Communication Sciences
EPFL, Lausanne, Switzerland

ABSTRACT

This paper makes the case for TaaS, automated software testing as a cloud-based service. We present three kinds of TaaS: a “programmer’s sidekick” enabling developers to thoroughly and promptly test their code with minimal upfront resource investment; a “home edition” on-demand testing service for consumers to verify the software they are about to install on their PC or mobile device; and a public “certification service,” akin to Underwriters Labs, that independently assesses the reliability, safety, and security of software.

TaaS automatically tests software, without human involvement from the service user’s or provider’s side. This is unlike today’s “testing as a service” businesses, which employ humans to write tests. Our goal is to take recently proposed techniques for automated testing, even if usable only on toy programs, and make them practical by modifying them to harness the resources of compute clouds. Preliminary work suggests it is technically feasible to do so, and we find that TaaS is also compelling from a social and business point of view.

Categories and Subject Descriptors
D.2.5 [Testing and Debugging]: Testing tools

General Terms: Reliability

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SoCC’10, June 10–11, 2010, Indianapolis, Indiana, USA.
Copyright 2010 ACM 978-1-4503-0036-0/10/06 ...$10.00.

1. INTRODUCTION

Software quality assurance is in dire need of substantial progress. Software testing is resource-hungry, time-consuming, labor-intensive, and prone to human omission and error. Despite massive investments in quality assurance, serious code defects are routinely discovered after software has been released [16], and fixing them at so late a stage carries substantial cost [13]. Thorough testing of large, complex software involves great effort, and the software industry still employs relatively primitive testing techniques.

The current software business model forces software users to take on faith that the vendor has performed thorough testing before shipping. Yet, given the difficulty of thoroughly testing software, such trust is typically misplaced. There exists no objective way to assess the reliability of a software product, therefore the main competitive metric is often performance and functionality. There is no independent certification body to guarantee that every vendor employs state-of-the-art testing.

We need a “disruptive technology” to substantially improve software quality. Various studies have found the average bug density in production-ready software to have stayed relatively constant over time, while average code volume of software has increased along an exponential curve [13], with the net effect that the number of bugs per product is increasing. It is therefore necessary to quickly find a way of reducing bug density by at least an order of magnitude. A promising direction is to reduce reliance on human labor through automated testing techniques, and recent proposals [4, 5, 2, 1] have made promising progress along these lines. Alas, they are still not ready to handle real-sized software (1 million lines of code or more), mainly due to high CPU and memory requirements. We believe cloud computing can come to the rescue.

The Promise of Automated Testing as a Service (TaaS)

Software testing essentially consists of exercising as many paths through a program as possible and checking that certain properties hold along those paths (no crashes, no buffer overflows, etc.). TaaS combines two ideas: (1) offering software testing as a competitive, easily accessible Web service, and (2) doing fully automated testing in the cloud, to harness vast, elastic resources toward making automated testing practical for real software.

A software-testing Web service allows users to upload the software of interest, instruct the service what type of testing to perform, click a button, and then obtain a report with the results within minutes or hours. This report is a list of bugs found, or the level of coverage obtained by tests with successful outcomes. Such a service can have a basic interface, where an end user uploads, e.g., the latest Windows service pack and then chooses from a menu of possible test types (e.g., comprehensive testing, security testing). A service can also have an expert interface, to be used by software developers to provide sophisticated definitions of what “a bug” may be, thus teaching the testing service what kinds of correctness violations to look for. For professional uses, TaaS can integrate directly with the development process and test the code as it is written.

We wish to empower consumers of software (both programmers and end users) to be in control of the quality of the software they use. Information on the bugs present in a piece of software is often (partially) known to the vendor, but not to consumers, for both technical and business reasons. With TaaS, we aim to make this otherwise-hidden information openly available on demand to anyone who wishes to obtain it. Software testing ought to be fast, automated, and as easy and accessible as Web email.

TaaS can also serve as a publicly available certification service that enables comparing the reliability and safety of software products. In this way, TaaS can promote open competition among software vendors and compel them to produce software that is reliable.

Doing automated testing in a cloud instead of on individual developers’ machines increases the available compute power by orders of magnitude. In the past, faster CPUs enabled increased levels of interactivity in development, such as quick compile-retry cycles. Cloud-based computation, offering vast numbers of fast CPUs with plenty of memory, could engender a similar transformation, with TaaS becoming a seamless extension of a developer’s environment. If automated testing techniques can be adapted to scale up on cloud infrastructures, they can yield the order-of-magnitude lower bug density and higher programmer productivity we seek.

Our Goal in Brief

Automated software testing, available to anyone and everyone at low cost, can transform the current development paradigm into one that involves more-thorough yet less-time-consuming testing. Our end goal is for all software to be more reliable and safe. Toward this goal, we see four fundamental research challenges: testing must (a) be fully automated, i.e., humans no longer write test harnesses; (b) scale to frequently-changing code bases that exceed 1 million lines of code; (c) be feasible as a service, i.e., useful to developers/consumers and economically viable; and (d) be able to directly test binaries, since much software is still proprietary.

Individual software testing continues to be relevant even as SaaS (software as a service) gains more ground. End users, as well as organizations and corporations, rely on ever more third-party software. Software services are stitched together from large volumes of third-party code (libraries, databases, Web servers, virtual machines, etc.). Consumers install, run, and upgrade software not only on their increasing number of computers, but also on their mobile phones, audio/video players, TiVo, and cameras.

An Historical Perspective

An important turning point in improving the productivity-to-bugs ratio was brought about by the introduction of high level languages and compilers in the 1950s, gradually eliminating most direct use of assembly language. Another important development was faster hardware and compilers, which now provide programmers quick feedback on syntax errors and low-level programming errors during the build process. These two events transformed programmers’ attitude toward writing code: less concern for the minor details and more time devoted to the higher level thought process.

We expect TaaS to similarly transform the way we write code, by providing prompt feedback on higher level programming errors and enabling developers to spend more time thinking about system-level properties instead of low-level details. TaaS can provide feedback on semantic correctness, instead of mere syntax. Quick feedback on code quality during the development process enables programmers to build systems that are closer to being correct.

The compute power needed to achieve prompt feedback on deep program properties, such as whether a particular assert() could ever fail or not, far exceeds what is available in a mere workstation. Cloud infrastructures make such compute power available today, if only we had the automated test techniques to harness it.

In this paper, we make the case for TaaS, hoping to motivate other researchers as well to engage in adapting automated testing techniques to the cloud. We first describe in more detail the three variants of TaaS (§2), present our initial forays into cloud-based automated testing, along with ideas for future steps (§3), make the social and business case for TaaS (§4), describe the expected benefits and drawbacks of TaaS (§5), and finally close with a list of research challenges (§5) and conclusions (§6).

2. SOFTWARE TESTING WEB SERVICES

While the most obvious embodiment of TaaS is a service aimed at software developers, we see TaaS reaching further: end users themselves could use TaaS, with the same ease with which they use Web email. In this section, we describe a form of TaaS aimed at developers (§2.1), then TaaS for end users (§2.2), and finally TaaS as a universally accessible certification service for software (§2.3).

2.1 TaaSD for Developers (“Sidekick”)

TaaSD can become a programmer’s true sidekick, i.e., an inseparable companion assisting the developer at every step.

Continuous Testing

In the simplest form, a TaaSD provider operates “in a loop” that pulls the latest code from the developers’ repository. It then exercises the various paths through the code and checks them against a collection of so-called test predicates (described in more detail below). Continuous testing integrated into the development environment has been previously proposed [17] as a way to run developer-provided test suites in the background on the developer’s workstation. In TaaSD, however, developers provide higher level specifications of what should be tested: instead of imperative test suites, they write test predicates, which takes considerably less human time.

This has two benefits: it places less burden on the developer, and it allows checking much deeper properties faster, by using the resources of the cloud. TaaSD continuous testing can improve software reliability and shorten the development cycle by automating test generation and finding bugs as software is being developed, even before a test harness exists. This forces developers to produce high quality code early on during development, when bugs are cheapest to eradicate [13]. Only a cloud environment could allow TaaSD to provide quick feedback, i.e., reduce what would otherwise take several days down to mere minutes or seconds.

Test Predicates

Predicates over program state or control flow can succinctly characterize undesired behaviors. Test predicates can be, for example, a more sophisticated form of assert-like statements. They can use abstract, symbolic program state to specify computation properties; e.g., “if ever factorial(λ) ≠ λ ∗ factorial(λ − 1), that is a bug.” A testing service smartly exercises as many execution paths through a program as possible and checks whether there exist paths that trigger these test predicates. In our simple example, TaaSD could find concrete λ inputs for which the above predicate is true.

Test predicates fall into two categories: universal predicates and application-specific predicates. Universal predicates are broadly accepted as describing bugs, such as dereferencing a null pointer, entering a deadlock, race conditions, memory safety errors, and crashes. Such predicates can be either given as declarative expressions or encoded in write-once/use-many imperative checkers [14]. Application-specific predicates capture semantics that are particular to the tested program (e.g., numConnections > maxPoolSize + delta). For every predicate violation, TaaSD produces a set of inputs, environment conditions, and sequence of program events that developers can use to reproduce the corresponding bug [18].

Application-specific test predicates often come to one’s mind while coding. Such properties cannot always be captured in a local assert() statement in the code, but rather require a more global predicate over program behavior. Should the scope of a predicate need to be restricted to a portion of the code, it can be done by incorporating a range of line numbers in the predicate itself.

We envision allowing developers to write these test predicates and upload them into a database at the TaaSD provider. They could
be uploaded manually via a Web interface, or be provided directly from within the IDE (e.g., Eclipse or Microsoft Visual Studio).

We believe that test predicates, although not suitable for expressing absolutely all bugs, can be used for many classes of bugs. While bugs can relate to arbitrarily complex semantics, many of the bugs that plague today’s software are buffer overflows and other memory errors, crashes, integer overflows, race conditions, deadlocks, etc., all of which can be easily encoded in test predicates. For most bugs that violate higher level program semantics, assert-style predicates over the global program state are also sufficient. For the remaining types of bugs, there exist more sophisticated forms of expressing them, such as formal logics that capture temporal properties, or, as a last resort, imperative test programs.

TaaSD provides an entire spectrum of solutions to developers: they can rely solely on the fully automated discovery of bugs that can be detected by universal predicates, or they can provide their own test predicates or imperative test suites. The benefit of TaaSD is that it uses the resources of the cloud to run this predicate checking on many more execution paths than would be feasible in the developer’s own infrastructure. In this sense, TaaS complements advances in programming languages: strong type systems, for example, prevent developers from making low-level mistakes, while TaaS helps check the next-higher level of bugs.

We plan to run TaaSD as a “public service” for open-source software developers. This effort could make open-source code the most reliable software available. For such a public service, we expect the user and developer community to be willing to contribute detailed test predicates to a Wikipedia-style database. To this end, we are building a system, called Cloud9, which promises to scale symbolic execution [10], a popular test automation technique, to large clusters of machines. Preliminary results [3] show substantial speedup over a single-node state-of-the-art symbolic execution engine when testing real UNIX utilities. The key techniques underlying Cloud9 are summarized in §3.

2.2 TaaSH for End Users (“Home Edition”)

TaaSD helps develop more reliable software, which makes end users happier. But can TaaS directly benefit end users? Yes, it can.

Consider the following scenario: Mrs. X, a grandmother who lives by herself, owns a computer and a mobile phone. She relies on the mobile phone to notify her children (who live in the same city) whenever she experiences the symptoms that often precede her seizures. The software on her mobile phone recently notified her that it needs to be upgraded, to improve the speech recognition component. Mrs. X knows that such upgrades are perilous, and that a buggy upgrade may disable her phone altogether. At the same time, improved speech-to-text would help the phone better handle her aging voice. Mrs. X is the kind of user who can benefit from a “home edition” version of TaaS, which we refer to as TaaSH.

The key difference between TaaSH and TaaSD is the service interface and the presentation/interpretation of results. Developers can be expected to write test predicates, but end users cannot. Thus, whereas TaaSD checks for both universal and application-specific test predicates, TaaSH only checks for universal predicates.

We expect end users to employ testing services in a “one-off” manner, unlike developers who will likely prefer continuous testing. Thus, TaaSH has a public website where consumers can upload software in binary or bytecode form and select from a pull-down menu the type of testing they want. Using this testing service should be no more complex than downloading software from the Web. By default, TaaSH may check programs for bugs like buffer overflows, deadlocks, or race conditions, but the TaaSH provider is free to tap into additional databases of test predicates.

Within minutes, the TaaSH service produces a webpage with the results of the tests, indicating whether it found any serious bugs, such as hangs or crashes. A bug is automatically rated as serious vs. minor based on the corresponding test predicate, itself rated by the TaaSH provider or the predicate writer. Mrs. X allows the phone to update itself only if the test report says no serious bugs were found. For interested users, the TaaSH response may include a rating of the software, akin to stars for products on e-tailers’ websites.

This service would cost Mrs. X no more than a few cents, or even be freely included in her monthly phone subscription. Even though the TaaSH provider commissions a few dozen machines for several minutes to run the test, this cost can be amortized across multiple users: if this same upgrade has already been tested before, the response to Mrs. X can be immediate and cost the provider virtually nothing. As will be seen later, TaaS offers attractive opportunities for economies of scale, especially for widely used software.

Since TaaSH emphasizes simplicity, it only checks for a set of “canned” bad behaviors, such as memory safety bugs or deadlocks. Community efforts, however, are likely to produce additional test predicates, in the spirit of Wikipedia or Knol [7], perhaps based on bug reports filed in the past. Such a database of test predicates could then be tapped by a TaaSH provider for the benefit of its users.

The TaaSH scenario presented here is not far-fetched. We have built a tool, called DDT [11], for testing closed-source binary device drivers against undesired behaviors, like race conditions, memory errors, resource leaks, etc. DDT combines virtualization with a specialized form of symbolic execution to thoroughly exercise tested drivers; a set of modular dynamic checkers uses test predicates to identify bug conditions. In preliminary experiments, DDT tested six mature Windows-certified closed-source binary drivers for less than 5 minutes each and found 14 different serious bugs. DDT produces executable traces for every path that leads to a failure, thus proving the existence of the bugs and helping developers debug them. The test predicates used in DDT were extracted from the Microsoft Driver Verifier [14], shipped with Windows, plus a few new ones added by us.

While TaaSH does not offer much flexibility, it is still a compelling service for end users like Mrs. X, who otherwise would have to blindly trust software vendors.

2.3 TaaSC Certification Services

The third type of TaaS is a public certification service, which provides an objective assessment of a software product’s quality. A primitive form of certification is already gaining hold in the industry, as in the case of Microsoft’s Hardware Quality Labs testing of third-party software, or Apple’s certification process for listing applications in its App Store. TaaSC analyzes software (either in binary or source code form) and, for each defect found, provides irrefutable evidence of the defect. Based on the defect density, TaaSC can provide a rating for each product. For an industry to compete on a certain product attribute, that attribute must be easily explained and quantified for consumers. It is for this reason that software companies compete on performance (measurable through benchmarks) and on features (measurable via check lists). We believe this is also the reason why the software industry has not yet started competing seriously on reliability, safety, and security.

In proposing a software certification service, we draw inspiration from Underwriters Laboratories Inc. (UL), an independent product safety certification organization in the USA. UL has had a universally recognized positive effect on the manufacturing industry, encouraging the adoption of important safety measures.
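As a rough illustration of the defect-density rating mentioned in this section, the following Python sketch normalizes the number of defects found by code size and maps the result to a star rating. The per-KLOC normalization, the threshold values, and the names CertificationReport, defect_density, and star_rating are all assumptions made for illustration; the paper does not specify how a TaaSC provider would compute ratings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CertificationReport:
    """Outcome of one certification run (hypothetical structure)."""
    product: str
    lines_of_code: int
    defects_found: int


def defect_density(report: CertificationReport) -> float:
    """Return defects per thousand lines of code (KLOC)."""
    if report.lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    return report.defects_found / (report.lines_of_code / 1000)


# Hypothetical (max defects per KLOC, stars) pairs; not taken from the paper.
STAR_THRESHOLDS = [(0.1, 5), (0.5, 4), (1.0, 3), (5.0, 2)]


def star_rating(report: CertificationReport) -> int:
    """Map defect density to 1..5 stars; lower density earns more stars."""
    density = defect_density(report)
    for limit, stars in STAR_THRESHOLDS:
        if density <= limit:
            return stars
    return 1
```

A real service would presumably also weight defects by the severity of the violated test predicate, which this sketch ignores.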

TaaSC is meant to be the Underwriters Labs of the software industry. Similar to UL, the certification service tests and then certifies (by digitally signing) the tested software; this is the equivalent of the UL Mark. The TaaSC provider can be funded by governments, by industry consortia, or simply provide certification services for pay. It can offer different levels of certification, depending on the types of predicates that are checked. When a product is modified, it needs to be re-tested and re-certified. In an ideal future, software companies will be required to subject their software to quality validation on such a service, akin to mandatory crash testing of vehicles. In the absence of such certification, software companies could be held liable for damages resulting from bugs.

An officially sanctioned TaaSC provider maintains a Web-accessible directory of the software products it tested and certified; this list can be used by consumers to compare products. Certification, of course, does not guarantee the product will perform acceptably or that it is safe under all conditions (such as misuse), but it provides an increased level of assurance. Being certified would not carry legal weight (as does, for instance, the European CE Mark or the FCC Part 15 requirement for electronic devices), but we expect that it would become difficult in practice to sell software that does not carry such a certification. IT consulting firms may be unwilling to install uncertified products, use of uncertified software may invalidate certain insurance coverages, and governmental authorities could require contractors to use exclusively certified software.

In addition to certification, a TaaSC provider could also publish sorely needed statistics on software: Which bugs are the most prevalent? What is their frequency? What is the typical bug density for each class of applications? Such data would help everyone doing research on reliability (systems, databases, programming languages, etc.), the same way surveys and studies help the medical profession and the pharmaceutical industry. It would enable the development of more scientific and rigorous approaches to software development, grounded in concrete data.

3. INFRASTRUCTURE AND SOFTWARE

Automated testing relieves humans from the task of writing test cases and workload drivers. For such a testing technique to be feasible for TaaS, it must be able to control program execution, so that it takes the program through as many different execution paths as possible, and be able to automatically recognize undesired behavior along those paths, i.e., find bugs. Even if a technique cannot determine with 100% certainty that a system is bug-free, exploring substantially more execution paths than a human-written test could constitutes a valuable service for developers and end users.

There are several ways to construct such automated testing services. In our work, we use a technique called symbolic execution [10] combined with test predicates. We are working on parallelizing symbolic execution on large, elastic clusters of machines, in order to allow it to scale up to realistically sized programs. Other techniques, such as structured input generation [15], that can run in parallel on shared-nothing clusters, can also be deployed in TaaS.

Classic Symbolic Execution

Symbolic execution is a technique for automated testing, originally proposed in the 1970s. It has recently been shown to find (with no human assistance) bugs that were missed by manual testing and static analysis [4, 6, 12, 1]. Instead of running the program with regular inputs, a symbolic execution engine runs the program with abstract, symbolic inputs that are unconstrained, e.g., an integer input x is given as its value a symbol λ that can take on any integer value. When the program encounters a branch that depends on x, program state is forked to produce two parallel executions, one following the then-branch and another following the else-branch. The symbolic values are constrained in the two execution clones so as to make the branch condition evaluate to true (e.g., λ < 0) and false (e.g., λ ≥ 0), respectively. Execution recursively splits into sub-executions at each relevant branch, turning an otherwise linear execution into an execution tree that captures all possible executions (Figure 1).

    void write(int p) {
      if (p < MAX) {
        if (p > 0)
          ...
        else {
          ...
        }
      } else {
        if (p > 3)
          close(p);
        else {
          ...
        }
      }
    }

[Tree diagram: the root node p<MAX branches to p>3 on False and to p>0 on True; each of these nodes again has True and False branches.]

Figure 1: Symbolic execution tree for an example body of code.

Symbolic execution consists of systematically exploring this execution tree. Each inner node is a branching decision, and each leaf is a program state that contains its own address space, program counter, and set of constraints on program variables.

When an execution encounters a bug, as defined by a test predicate, the conjunction of constraints collected from the root to the goal leaf is solved to produce concrete program inputs that exercise the path to the bug. In addition to these constraints, program events (such as thread context switches) must be factored in, as illustrated in our execution synthesis system [18]. Herein lies the strength of symbolic execution: it can automatically generate test cases that evidence bugs. Symbolic execution is substantially more efficient than exhaustive input-based testing: it analyzes code behavior for entire classes of inputs/events at a time, without having to try each one out, yet is at least as complete.

Unfortunately, symbolic execution faces two serious challenges: high memory consumption and CPU-intensive constraint solving, both of which are roughly exponential in program size. Memory consumption results from the large (potentially infinite) symbolic execution tree. CPU consumption results from the fact that, at each branch instruction, the symbolic execution engine must check which of the branches are feasible, given the current constraints. Consequently, on a present-day computer it is only possible to thoroughly test programs with a few thousand lines of code; for larger programs, only the shorter paths can be explored. Thus, symbolic execution is virtually unheard of in the general-purpose software industry because real programs often have millions of lines of code, and executing them symbolically on a single node is not practical.

Symbolic Execution in the Cloud

We are building Cloud9, a parallel symbolic execution engine to run on large shared-nothing clusters of computers, thus harnessing their aggregate memory and CPU resources. In this way, we can mitigate the memory and CPU bottleneck of symbolic execution.

Parallelization is a natural way to improve the scalability of symbolic execution, but doing so in a cluster presents significant challenges. Furthermore, in a cloud setting, parallel symbolic execution requires coping with frequent fluctuation in resource quality, availability, and cost, which are not present in regular clusters.

Cloud9 consists of many worker nodes and one or more coordinator nodes. Each worker independently explores a subtree of
the program’s execution tree by running one classic symbolic execution engine and a constraint solver on each CPU core. As it explores paths, Cloud9 checks whether any bug predicates are true.

The choice of which branches to pursue first is governed by a so-called search strategy. In the cloud version of symbolic execution, instead of using a single strategy, we simultaneously employ a portfolio of multiple strategies. This allows the exploration to speculate on the promise of certain paths, taking advantage of the fact that speculation requires solely employing a few additional machines. Our preliminary results indicate that diversification of exploration strategies can help find the paths leading to a desired test goal (such as maximizing code coverage, testing the bounds of all string copy operations, etc.) sooner [3]. This is particularly relevant for symbolic execution trees of infinite size.

One key aspect of making Cloud9 scale to large clusters is performing efficient load balancing at infrequent time intervals. The coordinator reasons about which parts of the execution tree ought to be transferred between workers to distribute load evenly. Cloud9 employs a discrete job model that allows workers to self-regulate, thus requiring the intervention of the coordinator relatively rarely. To balance load, workers exchange compact encodings of tree nodes and quickly reconstruct the state of the migrated node, without having to copy potentially hundreds of MB/state across the network.

Another important aspect is the quality of the program state encoding. Since most cloud clusters employ commodity network interconnects, transferring explicit states over the network often turns into a bottleneck (which is one of the reasons why parallel model checkers have a hard time running on clusters without shared memory). Additionally, we are developing techniques for reducing redundancy, handling worker failures, and coping with heterogeneity.

Since symbolic execution is a dynamic testing technique, it has no false positives. This means that bugs found by a TaaS service are legitimate bugs accompanied by inputs and a set of system events that help reproduce them. A recent study [8] showed that lack of false positives and the ability to reproduce bugs not only aids but also compels software developers to fix bugs sooner.

When testing software that depends on hardware features, as is the case of device drivers or a mobile phone operating system,

4. THE ECONOMICS OF TaaS

We believe TaaS can be operated in a sustainable manner both as a public service and as a business. In this section we argue that there exists a market for automated test services, we suggest a possible pricing scheme, and finally describe how TaaS providers (both public and commercial) can benefit from economies of scale.

Like most Web services, TaaS will first attract the “long tail” of potential users; in this case, it would be those who cannot justify investing in testing infrastructures, or those who wish to cloud-burst during periods of intense testing. Small and medium businesses, as well as open-source software developers, will likely be the first users. TaaS puts at their disposal a testing service on par with (or better than) what the largest software companies have, thus levelling the playing field. Large companies that already own private clouds can run TaaS as an internal service.

Commercial TaaS providers will have to identify good pricing schemes. Ideally, customers pay in proportion to the value they derive from the testing task they submit to the service. This value can be expressed as a function of the number and importance of the bugs found and/or the confidence that the user gets in the code. Confidence can be expressed in terms of code coverage or path coverage, both of which can easily be measured by the testing service. Alternatively, one may wish to pay per bug found or per unit of coverage increase; in these cases, the marginal value increases over time. For example, achieving an extra 1% coverage on code that is already 95% covered is more valuable than if the starting coverage was only 60%. This telescoping marginal value matches well the increased resources (and thus higher cost for the TaaS operator) required to achieve that target.

TaaS users can provide a target level of desired coverage and/or an upper limit on the budget, and the testing service can optimize accordingly. Since the service provider can allocate resources elastically across its customers, the resource demands of each testing task can be optimized globally across all in-progress test jobs.

For price-sensitive customers, auction-style pricing schemes may be advantageous: depending on how much the user is willing to pay, more or fewer resources can be commissioned in the cloud for that user’s task. The difference in price may be reflected in the
it may appear necessary for the TaaS provider to employ hard- total time required to test a piece of software, taking longer if it
ware simulators. However, this challenge can be circumvented with uses fewer resources, perhaps even being suspended for some pe-
symbolic hardware. DDT [11] showed that this approach requires
riod of time when there is high demand for resources from other
neither real hardware nor hardware models to test device drivers—
higher-paying customers. Such a mechanism is a good match for
instead, symbolic hardware returns symbolic values to the software, auction-based clouds, like Amazon’s EC2 Spot Instances, where
thus testing it against all possible reactions of the hardware. unused cluster nodes can be employed at very low cost. Perhaps
cloud operators will be willing to donate resources to a public ver-
Infrastructure sion of TaaS, during periods of low utilization.
We expect TaaS providers to either operate their own data centers, TaaS providers benefit from economies of scale. First, users
or provide TaaS as a value-added service on top of a public cloud are likely to end up testing common bodies of code, like popu-
operated by a third party, such as Amazon EC2. lar libraries (e.g, many Java programs will use the same JDK). The
TaaS in public clouds is an ideal solution for small and medium TaaS provider can exploit this redundancy by not re-testing already-
companies, which cannot afford the upfront investment of setting tested code, and thus saving resources. The more users a service
up large clusters of machines. Moreover, even for large compa- has, the more exploitable redundancy there will be. Moreover, the
nies that do have their own clusters, TaaS can provide a way to ac- provider may choose to test popular bodies of code in advance, dur-
commodate spikes in their resource needs (“cloud bursting”), such ing periods when its resources are not in high demand. The initial
as may be required during intensive test cycles prior to a release. cost of testing can then be amortized over all the customers who
Finally, TaaS can provide an incremental path to gradually move will need those results later on.
testing from in-house under-provisioned clusters into the cloud. Finally, TaaS may introduce new business opportunities. For ex-
Besides public clouds, there are two other available variants: ample, rigorous testing can make it feasible to offer software war-
private clouds and cooperative clouds. A private cloud may be ranties, which translate into liability payments to the software user
preferred by large companies that have vast hardware resources in case bugs lead to losses. Such warranties could be backed by
they can dedicate to internal use (e.g., Google, Microsoft). A co- insurance products from financial institutions, which would offer
operative test cloud is a federation of user machines (similar to software developers an insurance policy in exchange for a premium
SETI@home) or even data centers, pooled together for shared use. and a requirement that they use TaaS on their code.
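As an illustration of the coverage-based pricing idea discussed above, the following is a minimal sketch of one possible scheme in which each extra unit of coverage costs more the higher the starting coverage already is. The pricing curve (`base_rate / (1 - coverage)`), the function names, and the `base_rate` parameter are all assumptions made for this example; the paper does not prescribe a specific formula.

```python
import math


def marginal_price(coverage: float, base_rate: float = 1.0) -> float:
    """Hypothetical price of one extra unit of coverage at a given level.

    `coverage` is a fraction in [0, 1). The 1 / (1 - coverage) shape is an
    assumption chosen only because it grows as coverage approaches 100%.
    """
    if not 0.0 <= coverage < 1.0:
        raise ValueError("coverage must be in [0, 1)")
    return base_rate / (1.0 - coverage)


def price_for_increase(start: float, target: float, base_rate: float = 1.0) -> float:
    """Total price of raising coverage from `start` to `target`.

    This is the integral of marginal_price over [start, target].
    """
    if not 0.0 <= start <= target < 1.0:
        raise ValueError("need 0 <= start <= target < 1")
    return base_rate * math.log((1.0 - start) / (1.0 - target))


def coverage_within_budget(start: float, budget: float, base_rate: float = 1.0) -> float:
    """Highest coverage purchasable from `start` with `budget`.

    This inverts price_for_increase, so a user can state a budget cap
    instead of a coverage target.
    """
    if not 0.0 <= start < 1.0 or budget < 0.0:
        raise ValueError("need 0 <= start < 1 and budget >= 0")
    return 1.0 - (1.0 - start) * math.exp(-budget / base_rate)


if __name__ == "__main__":
    # One extra point of coverage costs more at 95% than at 60%.
    print(price_for_increase(0.95, 0.96))
    print(price_for_increase(0.60, 0.61))
    # Coverage reachable from 60% with a fixed budget.
    print(coverage_within_budget(0.60, budget=1.0))
```

Under this assumed curve, a customer can specify either a target coverage (priced with `price_for_increase`) or a budget cap (converted to a reachable coverage with `coverage_within_budget`), mirroring the two options described in the text.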

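Returning to the load-balancing scheme described earlier in this section, where workers exchange compact encodings of tree nodes instead of full program states, the sketch below shows one way such an encoding could work. It assumes that a node is identified by the sequence of branch decisions leading to it from the root, and that the receiving worker rebuilds the state by replaying those decisions. The bit-packing format and the toy `step` function are hypothetical and stand in for a real symbolic execution engine.

```python
from typing import Callable, List

# A toy "program": each step maps (state, branch_taken) to a new state.
# A real symbolic execution state holds memory and path constraints;
# a plain int stands in for it here.
Step = Callable[[int, bool], int]


def encode_path(decisions: List[bool]) -> bytes:
    """Pack branch decisions as a 4-byte length prefix plus one bit per branch."""
    bits = 0
    for i, taken in enumerate(decisions):
        if taken:
            bits |= 1 << i
    n = len(decisions)
    return n.to_bytes(4, "big") + bits.to_bytes((n + 7) // 8, "big")


def decode_path(blob: bytes) -> List[bool]:
    """Recover the list of branch decisions from an encoded path."""
    n = int.from_bytes(blob[:4], "big")
    bits = int.from_bytes(blob[4:], "big")
    return [bool((bits >> i) & 1) for i in range(n)]


def replay(initial_state: int, step: Step, decisions: List[bool]) -> int:
    """Rebuild a node's state by re-executing from the root along the path."""
    state = initial_state
    for taken in decisions:
        state = step(state, taken)
    return state


def toy_step(state: int, taken: bool) -> int:
    # Hypothetical state update; any deterministic function works.
    return state * 2 + (1 if taken else 0)


if __name__ == "__main__":
    path = [True, False, True, True, False, False, True, False, True]
    sender_state = replay(1, toy_step, path)

    wire = encode_path(path)  # a few bytes instead of a full state
    receiver_state = replay(1, toy_step, decode_path(wire))

    print(len(wire), sender_state == receiver_state)
```

The trade-off in this sketch is that the receiver spends CPU time replaying the path in exchange for sending only a few bytes over the network, which is the kind of exchange the text argues for on commodity interconnects.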
5. IMPACT AND CHALLENGES
We expect TaaS to have broad impact on end users, developers, and businesses, leading to higher software reliability in general.

TaaS can both compel and help development organizations to compete based on the reliability of their software products. Certification services provide easily accessible means for consumers to compare product reliability, while testing services help developers write better software. The ease of identifying and reproducing bugs will also shorten the interval between detection and final bug fix.

At the same time, testing services empower end users to check the software they use. Well-informed and demanding users will further exert pressure on vendors to produce reliable software. Validation of software could become a built-in operating system feature that transparently checks all new software via a TaaS provider.

The fact that competitors and hackers could use TaaS to find weaknesses in a software product as soon as it is released should compel developers to use TaaS before releasing their software.

On the path to TaaS, there are both technical and non-technical challenges. The first and foremost is finding ways to scale automatic testing techniques to hundreds or thousands of machines in loosely coupled clusters. While progress has been made, for example, in parallelizing model checkers for multi-core CPUs [9], the reliance on shared memory makes the techniques difficult to carry over to clusters. Furthermore, cloud environments present substantial heterogeneity and unpredictability of performance.

Finding incremental testing techniques, which reuse existing test results and compose them with tests focused on new or modified code, can enable TaaSD to provide quicker feedback to developers. There is also opportunity for techniques that provide progressive refinement of the test results, so that a coarse-grained result can be returned immediately, followed by increasingly precise results as they become available.

For test predicates to be easy to formulate and maintain, we must design a language that provides a suitable tradeoff between expressiveness and complexity. Formal logics are powerful, but have proven to be inaccessible to most programmers. Asserts stated directly in the programming language are easy, but less powerful. A middle ground between the two will likely prove the most fruitful.

We need metrics for quantifying the level of confidence we get from a test suite. Common coverage metrics, such as line coverage, do not accurately describe how many of the possible execution paths have been tested. At the same time, an absolute count of the paths tested is not informative either, and path coverage can rarely be expressed as a percentage, since complex programs almost always have an infinite number of possible paths.

We expect much of the testing in TaaS to be done on binaries, rather than source code. Both Cloud9 and DDT can operate on binaries as well as on source. However, software vendors may be reluctant to allow end users to check the quality of proprietary software using TaaS; they might prevent this through code packing or through licensing terms. It is not clear whether TaaS providers would need to have a license for the software they are about to test. Moreover, exhaustively exercising code paths in binaries may be considered illegal reverse engineering by some vendors, although user demand for reliable software may change that perspective.

Finally, providing confidentiality of tested code may also be a challenge. While this is not necessary in the case of binaries, proprietary source code will need to be kept confidential, either by running TaaS in a private cluster or by providing strong guarantees and legal provisions if doing so on a shared cloud. There can also be concerns regarding export restrictions, if sensitive software, such as cryptographic algorithms, ends up being tested by services operating in countries to which that code cannot be legally exported.

6. CONCLUSION
In this paper we made the case for TaaS: automated software testing as a cloud-based service. We presented three classes of testing services: TaaSD for developers to more thoroughly test their code, TaaSH for end users to check the software they install, and TaaSC certification services that enable consumers to choose among software products based on the products’ measured reliability.

We argued that the combination of recent advances in test automation and the availability of compute clouds can offer unprecedented levels of testing quality. We find TaaS to be compelling from both technical and non-technical points of view. By simultaneously empowering consumers to make educated choices and enabling developers to build better products, TaaS has the ingredients to help reduce bug density by an order of magnitude.

7. REFERENCES
[1] C. Cadar, D. Dunbar, and D. R. Engler. KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs. In Symp. on Operating Systems Design and Implementation, 2008.
[2] C. Cadar, V. Ganesh, P. M. Pawlowski, D. L. Dill, and D. R. Engler. EXE: Automatically generating inputs of death. In Conf. on Computer and Communication Security, 2006.
[3] L. Ciortea, C. Zamfir, S. Bucur, V. Chipounov, and G. Candea. Cloud9: A software testing service. In Workshop on Large Scale Distributed Systems and Middleware, 2009.
[4] P. Godefroid, N. Klarlund, and K. Sen. DART: Directed automated random testing. In Conf. on Programming Language Design and Implementation, 2005.
[5] P. Godefroid, M. Y. Levin, and D. Molnar. Automated whitebox fuzz testing. Technical Report MSR-TR-2007-58, Microsoft Research, 2007.
[6] P. Godefroid, M. Y. Levin, and D. Molnar. Automated whitebox fuzz testing. In Network and Distributed System Security Symp., 2008.
[7] Google Knol. http://knol.google.com.
[8] P. J. Guo and D. Engler. Linux kernel developer responses to static analysis bug reports. In USENIX Annual Technical Conf., 2009.
[9] G. J. Holzmann and D. Bosnacki. Multi-core model checking with SPIN. In Intl. Parallel and Distributed Processing Symp., 2007.
[10] J. C. King. Symbolic execution and program testing. Communications of the ACM, 1976.
[11] V. Kuznetsov, V. Chipounov, and G. Candea. Testing closed-source binary device drivers with DDT. In USENIX Annual Technical Conf., 2010.
[12] R. Majumdar and K. Sen. Hybrid concolic testing. In Intl. Conf. on Software Engineering, 2007.
[13] S. McConnell. Code Complete. Microsoft Press, 2004.
[14] Microsoft. Driver verifier. http://www.microsoft.com/whdc/DevTools/tools, 2009.
[15] S. Misailovic, A. Milicevic, N. Petrovic, S. Khurshid, and D. Marinov. Parallel test generation and execution with Korat. In Symp. on the Foundations of Software Eng., 2007.
[16] Redhat security. http://www.redhat.com/security/updates/classification, 2005.
[17] D. Saff and M. D. Ernst. Reducing wasted development time via continuous testing. In Intl. Symp. on Software Reliability Engineering, 2003.
[18] C. Zamfir and G. Candea. Execution synthesis: A technique for automated debugging. In EUROSYS Conf., 2010.
