0% found this document useful (0 votes)
5 views

Contextual Fuzzing Automated Mobile App Testing Under Dynamic Device and Environment Conditions

The document presents Context Virtualizer (ConVirt), a cloud-based service designed to enhance mobile app testing under diverse contextual conditions. It introduces the concept of contextual fuzzing, which systematically explores various mobile contexts to identify performance issues and crashes before app release. The system utilizes a hybrid architecture combining emulators and real devices, leveraging app similarity networks to prioritize testing scenarios effectively.

Uploaded by

ifeanyichuks267
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
5 views

Contextual Fuzzing Automated Mobile App Testing Under Dynamic Device and Environment Conditions

The document presents Context Virtualizer (ConVirt), a cloud-based service designed to enhance mobile app testing under diverse contextual conditions. It introduces the concept of contextual fuzzing, which systematically explores various mobile contexts to identify performance issues and crashes before app release. The system utilizes a hybrid architecture combining emulators and real devices, leveraging app similarity networks to prioritize testing scenarios effectively.

Uploaded by

ifeanyichuks267
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 13

Contextual Fuzzing: Automated Mobile App Testing

Under Dynamic Device and Environment Conditions


MSR-TR-2013-100 – March 2013

Chieh-Jan Mike Liang‡ , Nicholas D. Lane‡ , Niels Brouwers∗ , Li Zhang? ,


Börje Karlsson‡ , Ranveer Chandra‡ , Feng Zhao‡
‡ Microsoft Research ∗ Delft University of Technology
? University of Science and Technology of China

Abstract worth, performance and crashes are not tolerated.


App experience drives healthy mobile ecosystems. Today, developers only have a limited set of tools to test
However, mobile platforms present unique challenges to their apps under different mobile context. Tools for collect-
developers seeking to provide such experiences: device ing and analyzing data logs from already deployed apps (e.g.,
heterogeneity, wireless network diversity, and unpredictable [9, 5, 2]) require the app to be first released before prob-
sensor inputs. We propose Context Virtualizer (ConVirt), a lems can be corrected. Through limited-scale field tests (e.g.,
cloud-based testing service that addresses two challenges. small beta releases and internal dogfooding) log analytics
First, it provides a large set of realistic mobile contextual can be applied prior to public release but these tests lack
parameters to developers with emulators. Second, it enables broad coverage. Testers and their personal conditions are
scalable mobile context exploration with app similarity likely not representative of a public (particularly global) app
networks. To evaluate the system design, we profile 147 release. One could use mobile platform simulators ([14, 20])
Windows Store mobile apps on our testbed. Results show to test the app under a specific GPS location and network
that we can uncover up to 11 times more crashes than type, such as Wi-Fi, but in addition to being limited in the
existing testing tools without mobile context. In addition, contexts they support, these simulators do not provide a way
our app similarity network increases the number of abnor- to explore all possible combinations of mobile contexts.
mal performances found in a given time by up to 36%, as To address this challenge of testing the app under different
compared to the current practices. contexts, we propose contextual fuzzing – a technique where
mobile apps are monitored while being exercised within a
1 Introduction host environment that can be programmatically perturbed to
Mobile devices, such as smartphones and tablets, are emulate key forms of device and environment context. By
rapidly becoming the primary computing platform of choice. systematically perturbing the host environment an unmod-
This popularity and usage is fueling a thriving global mo- ified version of the mobile app can be tested, highlighting
bile app eco-system. Hundreds of new apps are released potential problems before the app is released to the public.
daily, e.g. about 300 new apps appear on Apple’s App Store To demonstrate the power of contextual fuzzing, we design
each day [10]. In turn, 750 million Android and iOS apps and implement a first-of-its-kind cloud service Context Vir-
are downloaded each week from 190 different countries [8]. tualizer (ConVirt) – a prototype cloud service that can auto-
However, success of this global market for mobile apps is matically probe mobile apps in search of performance issues
presenting new demanding app testing scenarios that devel- and hard crash scenarios caused by certain mobile contexts.
opers are struggling to satisfy. Developers are able to test their apps by simply providing an
Each newly released app must cope with an enormous di- app binary to the Context Virtualizer service. A summary re-
versity in device- and environment-based operating contexts. port is generated by Context Virtualizer for the developer that
The app is expected to work across differently sized de- details the potential problems observed and which conditions
vices, with different form factors and screen sizes, in differ- appear to act as a trigger. Context Virtualizer utilizes primar-
ent countries, across a multitude of carriers and networking ily cloud-based emulators but also integrates actual real de-
technologies. Developers must test their application across vices to provide coverage of hardware-specific contexts that
a full range of mobile operating contexts prior to an app re- are difficult to otherwise emulate.
lease to ensure a high quality user experience (Section 2). Key challenge when implementing Context Virtualizer is
This challenge is made worse by low consumer tolerance for the scalability of such a service, across the range of scenar-
buggy apps. In a recent study, only 16% of smartphone users ios and number of apps. Individual tests of context can easily
reported they continue to use an app if it crashes twice soon run into the thousands when a comprehensive set of location
after download [28]. As a result, downloaded app attrition is inputs, hardware variations, network carriers and common
very high – one quarter of all downloaded apps are used just memory and cpu availability levels. In our own prototype
once and then eventually deleted [27]. Mobile users only a library of 1,254 contexts is available and sourced largely
provide a small opportunity for an app to demonstrate its from trace logs of conditions encountered by real mobile
users (see §5.2.) Not only must conditions be tested in isola- mobile handoffs, vertical handoffs from cellular to Wi-Fi,
tion, but how an app responds to a combination of conditions and the country of operation. For example the RTTs to the
must also be considered (e.g., a low memory scenario occur- same end-host can vary by 200% based the cellular operator
ring simultaneously during a network connection hand-off.) used [17], even given identical locations and hardware, the
We propose a new technique that avoids a brute force bandwidth speeds between countries frequently can vary
computation across all contexts. It incrementally learns between 1 Mbps and 50 Mbps [33], and the signal strength
which conditions (e.g., high-latency connections) are likely variation changes the energy usage of the mobile device [22].
to cause problematic app behavior for detected app char-
acteristics (e.g., streaming apps). Similarly, this approach Device Heterogeneity. Variation in devices require an app
identifies conditions likely to be redundant for certain apps to perform across different chipsets, memory, cpu, screen
(e.g., network related conditions for apps found to seldom size, resolution, and the availability of resources (e.g. NFC,
use network conditions). Through this intelligent prioriti- powerful GPU, etc.). This device heterogeneity is very
zation of context perturbation unexpected app problems and severe. 3,997 different models of Android devices – with
even crashes can be found much more quickly than is other- more than 250 screen resolution sizes – contributed data to
wise possible. OpenSignal database during a recent six month period [24].
In this work we make the following contributions. We note that devices in the wild can experience low memory
1. We show the need for additional testing framework for states or patterns of low CPU availability different from
mobile apps, which extends beyond the traditionally the expectation of developers, e.g. a camera temporarily
available code testing tools. (§ 2) needs more memory, and this interaction can affect user
experience on a low-end device.
2. We present a new concept, of contextual fuzzing, for
systematically exploring a range of mobile contexts. Sensor Input. Apps need to work across availability of
We also describe techniques to make this approach scal- sensors, their inputs, and variations in sensor readings
able using a learning algorithm that effectively priori- themselves. For example, a GPS or compass might not work
tizes contexts for a given run. (§ 3). at a location, such as a shielded indoor building, thereby
3. We propose a new, one of its kind cloud service, that affecting end user experience. Furthermore, depending
consists of a hybrid testbed consisting of real devices on the location or direction, the apps response might be
and emulators, along with an energy testing framework, different. Apps might sometimes cause these sensors to
to which app developers can submit apps for contextual consume more energy, for example, by polling frequently
testing. (§ 5). for a GPS lock when the reception is poor. The sensors also
We evaluate our system using a workload of a represen- sometimes have jitter in their readings, which an app needs
tative 147 mobile Windows Store apps. Our experiments (1) to handle.
validate the accuracy of all contexts simulated and measure-
ments performed within our prototype; and, (2) examine the 3 Context Virtualizer Design
scalability through test time and resource efficiency relative In the following section we describe the design consider-
to a collection of representative benchmark algorithms. We ations and the overall architecture of Context Virtualizer.
demonstrate the benefits to developers of contextual fuzzing
by automatically identifying a range of context-related hard
crashes and performance problems within our app workload. 3.1 Design Challenges
We describe both aggregate statistics of problems identified The goal of Context Virtualizer (ConVirt) is to address the
and in some cases supplement these with individual app mobile app testing needs of two specific types of end-users:
example case studies. Finally, we discuss a series of • App developers who use ConVirt to complement their
generalizable observations that detail overlooked contexts existing testing procedures by stress-testing code under
that result in these issues and how they might be addressed. hard to predict combinations of contexts.
• App distributors who accept apps from developers and
2 The Mobile Context Test Space offer them to consumers (such as, entities operating
We now present three different mobile contexts, and the marketplaces of apps) – distributors must decide if an
variations therein, which we believe captures the majority app is ready for public release.
of context-related bugs in mobile apps. To the best of our Before we can build an automated app testing service able
knowledge there doesn’t exist ways to principally test all the to examine a comprehensive range of mobile contexts we
variations of these contexts. must solve two fundamental system challenges:

Wireless Network Conditions. Variation in network Challenge 1. High-fidelity Mobile Context Emulation.
conditions leads to different latency, jitter, loss, throughput Cloud-based emulation of realistic mobile contexts enables
and energy consumption, which in turn impacts the perfor- more complete pre-release testing of apps. Developers can
mance of many network-facing apps (43% in the Google then incorporate such testing into their development cycle.
Play). These variations could be caused by the operator, Similarly, distributors can be more confident of the user
signal strength, technology in use (e.g. Wi-Fi vs. LTE), experience consumers will receive. A viable solution to the
emulation problem will have two characteristics. First, it
must be comprehensive and be capable of emulating various
key context dimensions (viz. network, sensor, hardware)
in addition to the key tests within each dimension (e.g.,
wireless handoffs under networking). Second, it must be
an accurate enough emulation of the real-world phenomena
such that app problematic behavior still manifest.

Challenge 2. Scalable Mobile Context Exploration. As


detailed in §2, the number of mobile contexts that may im-
pact a mobile app is vast – for example, consider just the
thousands of device types and hundreds of mobile carriers in
daily use today. Because of the complexity of the real-world
developers and distributors are unable to correctly identify
which conditions are especially hazardous, or even relevant,
to an app. Instead automated exploration is required. Ideally,
not only are contexts tested in isolation, but in combination,
resulting in a combinatorial explosion that makes brute-force
testing infeasible.
To solve these challenges we propose the following
solutions that are embedded within the architecture and
implementation of Context Virtualizer.
Figure 1. Context Virtualizer Architecture
Contextual Fuzzing. To address Challenge 1, we propose
a hybrid cloud service architecture that blends conventional
servers and real mobile devices. Servers running virtual
3.2 Architectural Overview
machines images of mobile devices provide our solution As shown in Figure 1, Context Virtualizer consists of
with scalability. A perturbation layer built into both server four principle components: (1) Context Library (CL); (2)
VMs and mobile devices allows a set of context types to App Similarity Prioritization (ASP); (3) App Environment
be emulated (for example, networking conditions). Under Host (AEH); and, (4) App Performance Analyzer (APA).
this design, certain tests cases that are logic-based or that We now describe in turn each component along with
rely on context that be effectively emulated are executed component-specific design decisions.
on server VMs. Similarly, test cases that require hardware
specific conditionsare performed directly on a mobile device. Context Library. Stored in CL are definitions of various
mobile context scenarios and conditions. One example sce-
nario is a network handoff occurring from Wi-Fi to 3G. An
App Similarity Network. Our solution to Challenge 2 re- example of a stored condition are the network characteristics
lies on constructing a similarity network between a popula- for a specific combination of location and cellular provider.
tion of apps. Under this network, apps are represented as Every context relates to only a single system dimension (e.g.,
network nodes that are connected by weighted edges. Edge network, cpu); this enables contexts to be composable – for
weights capture correlated behavior (e.g., common patterns instance, a networking condition can be run in combination
of resource usage under the same context). By projecting a with a test limiting memory. CL is populated using datasets
new previously unseen app into the network we can identify collected from real devices (e.g., OpenSignal [23]) in addi-
app cliques that are likely to respond to mobile contexts in tion to challenging mobile contexts defined by domain ex-
similar ways. Using the network, mobile contexts previously perts.
observed to cause problems in apps similar to the app under Through our design of CL that packages together context
test can be selected. Similarly, those contexts that turned parameters we (1) reduce the overall test space to search
out to be redundant can be avoided. One strength of this while also (2) restricting tests that are executed to particular
approach is that it allows information previously discovered combinations of parameters that actually occur in the wild
about apps to benefit subsequently tested apps. The more (e.g., a cellular provider and a specific location.) If instead
apps to which the network is exposed, the better it is able to an unstructured parameter search within each mobile context
select context test cases for new apps to be tested. dimension was performed it is unclear how valid parameter
Not only do these solutions enable Context Virtualizer to combinations could be enforced while the search scalability
meet its design goals, but we believe these are generalizable challenges would be even worse. Similarly, our decision to
techniques able to be applied to other mobile system integrate external data sources directly into CL enables it
challenges that we leave to be explored in future work (see to better reflect reality and to even adapt to changes in the
§7 for further discussion.) mobile market (e.g., new networking parameters.) Without
this integration CL would be limited to only the mobile
context that users anticipate. tors will have limited realism. Certain context dimensions
can never be performed with such as design. In contrast,
App Similarity Prioritization. Tested apps are exposed our design is extensible to include improved context support
to mobile contexts selected from CL. ASP determines which (e.g., additional sensors) within our proposed architecture.
order these contexts are to be performed and then assigns
each test to one of a pool of AEHs. Aggregate app behavior App Performance Analyzer. Collectively, all AEHs pro-
(i.e., crashes, and resource use), collected by AEHs, is used duce a large pool of app behavior (i.e., statistics and recorded
by ASP to build and maintain an app similarity network that crashes) collected under various contexts. APA is designed
determines this prioritization. Both prioritization and sim- to extract from these observations unusual app behavior that
ilarity network computation is an online process. As each warrants further investigation by developers or distributors.
new set of results is reported by an AEH more information It relies on anomaly detection that assumed norms based on
is gained about the app being tested, resulting potentially in previous behavior of (1) the app being tested and (2) a group
re-prioritization based on new behavior similarities between of apps that are similar to the target. As shown in Fig. 1, APA
apps being discovered. Through prioritization two outcomes collates its findings into a report made available to Context
occur: (1) redundant or irrelevant contexts are ignored (e.g., Virtualizer users.
an app is discovered to never use the network, so network Our design of APA demonstrates the larger potential for
contexts are not used); and, (2) those contexts that negatively our App Similarity Networks approach. In this component
impacted apps similar to the one being tested are performed we leverage the similarity network used for prioritization as
first (e.g., an app that behaves similarly to other streaming a means to identify accurate expectations of performance for
apps has a series of network contexts prioritized). a tested app.
Our design decision to automate testing with a system-
atic search of mobile context enables any inexperienced user
successfully use Context Virtualizer. An alternative design
4 App Similarity Prioritization
would require users to specify a list of mobile contexts to In the following section, we describe (1) the similarity
test their app against – however, the ability of the user to network used to guide context test case prioritization, and
determine such a list would largely determine the quality of (2) the prioritization algorithm itself.
results. This discussion makes use of two terms. First, the term
Instead of adopting general purpose state space reduction “target app” refers to a representative app for which context
algorithms we instead develop a novel domain-specific test cases are being prioritized. Second, the term “resource
approach. Mobile apps have higher levels of similarity measurement” is one of the hundreds of individual measure-
between each other (compared to desktop apps) because ments of system resources – related to networking, CPU,
of factors including (1) strong SDK-based templates, and memory, etc. (see §5.1) – made while Context Virtualizer
(2) a high degree of functional repetition (e.g., networking tests an app.
functions).
4.1 App Similarity Network
App Environment Host. An AEH provides an environ- A weighted fully connected graph is used by ASP to
ment for a tested app to run while providing the required capture inter-app similarity. We now describe the network
context requested by the PSP. App logic and functionality are and its construction.
exercised assuming a particular usage scenario and encoded
with a Weighted User Interaction Model (WUIM, see §5.1) Nodes in the network represent not only apps, but a pairing
provided by users (i.e., distributors or developers) when test- of an app and a specific WUIM (Weighted User Interaction
ing is initiated. In the absence of a user provided WUIM a Model). A WUIM describes a particular app usage scenario
generic default model is adopted. Specific contexts are pro- in terms of a series of user interaction events. An example
duced within the AEH using using perturbation modules (see scenario is where a user listens to music on a radio and
Figure 1, central component) that realistically simulate a spe- periodically changes radio station. Since app behavior (e.g.,
cific context (e.g., GPS reporting a programmatically defined resource usage) can differ significantly from scenario to
location). However, certain contexts can not be adequately scenario, apps must be paired with a specific WUIM when
simulated – such as, particular hardware – and in these cases modeled in the similarity network. Throughout this section,
the app is tested on being performed on the same, or equiva- while we refer to an “app” being represented in the network,
lent, hardware. During testing AEH uses monitor modules to in all cases this more precisely refers to a app and WUIM
closely record app behavior. Monitors primarily record app pair.
resource usage (e.g., memory) but they also record app state,
such as an app crashes. Edges in the network between two nodes are weighted
The design choice of a mixed architecture is motivated according to the similarity between two apps with respect to
by the competing goals of scalability and realism. An a specific resource measurement (e.g., memory usage). We
alternative approach using only real hardware will struggle calculate similarity by computing the correlation coefficient
to scale. Furthermore it is unnecessary as many tests that are (i.e., Pearson’s Correlation Coefficient [3]) between pairs of
related, for instance, to app logic do not strictly require real identical resource measurements observed under identical
hardware. Similarly, an approach purely using VM emula- context tests for two apps. This estimates the strength of the
performed) context test. Predictions are made using a sim-
ple linear model trained for each member of the app cluster.
These models predict the unseen resource measurement for
a particular context test based on (1) the value for this mea-
surement for the app cluster member, and (2) the relationship
between the target app’s measurements and the app cluster
members under all previously computed context tests.
(a) TCP packets received (b) Number of disk reads For those tests when a member of the app cluster crashed
a prediction can not be made. Instead their prediction is
Figure 2. Similarity graphs depict the degree of correla- replaced with a crash flag. Crash flags indicate a similar app
tion in consuming two resources when exercising 147 app to the target app crashed under the context test attempting to
under eight networking contexts. be predicted. When this context test is ranked against other
tests, the number of crash flags boost its ranking.

linear dependence between the two sets of variables. While Rank. Based on predicted resource estimates and crash
the correlation coefficient reports a signed value depending flags, a ranking is given for all context test cases yet to to
on if the relationship is negative or positive, we use the be executed for the target app. The priority of test cases is
absolute value for the edge weight. determined on this ranking, which is computed as follows.
First, the variance is calculated for each resource mea-
Bootstrap Scenario. Initially, there is no existing resource surement within each test case (excluding those with crash
measurement regarding the target app, with which similarity flags). Variance provides a notion of prediction agreement
to other apps can be computed. From our experiments, we between each estimate. The intuition behind this decision is
find the most effective (i.e., leading to higher prioritization that tightly clustered (i.e., low variance) estimates indicate
performance) initial set of three context tests are three higher uncertainty regarding the estimate. We want to rank
wireless networking profiles (c.f. §5.2): GPRS, 802.11b, test cases high when they have large uncertainty across
and cable. As more and more resource measurements are their resource measurements. Second, the variance for each
made for the target app, the similarity networks improve resource measurement is compared with all other resource
(and largely stabilize). measurements of the same type. A percentile is assigned
based on this comparison. The average percentile for each
Per-Measurement Networks. We utilize multiple simi- resource measurement within every context test case is then
larity networks, one for each type of resource measurement computed. Third, crash flags for all resource measurements
(e.g., allocated memory and CPU utilization). Figure 2 are counted within each test case. We apply a simple
presents similarity graphs for two different resource metrics, scaling function to the number of crashes that balances
TCP packets received and disk reads for a population of the importance of many crashes against a high percentile.
147 Microsoft Windows Store apps (detailed in §6). Each Finally, scaled crash flag count is added to the percentile to
edge weight shown in both figures (equal to correlation compute a final rank.
coefficient) is larger than 0.7. We note that the graph, even
with this restriction applied, is well connected and has high Update. The final phase in prioritization is to revise the app
average node degree. Furthermore, both graphs are clearly similarity network based on new resource measurement col-
different and support our decision to use per-measurement lected about the target app. These measurements may have
networks instead of a single one. altered the edge weights between the target app and the other
apps in the network.
4.2 App Similarity-based Prioritization The new rankings determined at this iteration are utilized
ASP prioritization is an iterative four-step process that as soon as an AEH completes its previously assigned context
repeatedly occurs while the target app is tested by Context test. Once the head of context test ranking has been assigned
Virtualizer. These steps are: (1) cluster, (2) predict, (3) rank, to an AEH then it is no longer considered when ranking
followed by (4) update. occurs at the next iteration.

Cluster. At the start of each iteration, an app cluster 5 Implementation


is selected for the target app within each app similarity
network (one for each resource measurement type). The app This section details our current implementation of the
cluster is based on network edge weights. Experimentally, framework components (see Figure 1).
we find an edge weight with a threshold of 0.70 is effective
during this process. All nodes with an edge weight greater 5.1 App Environment Host
or equal to this threshold are assigned to be a member of the AEH can run in virtual machines (VM) for testbed scala-
cluster. bility, or on real Intel x86 mobile devices (e.g., Samsung 7
Slate [26]) for hardware fidelity. AEH runs Windows 8 OS
Predict. A prediction for the target app is made for each with our own monitoring and perturbation modules.
resource measurement within each pending (i.e., yet to be
Monitoring Modules. AEH logs a set of system statistics Finally, we implemented a virtual GPS driver to feed apps
of some running apps, as specified by the process matching with spoofed coordinates, as Windows 8 assigns the highest
rules in a configuration file. In its simplest form, a process priority to the GPS driver. AEH instructs the virtual driver
matching rule would be a process name (i.e., ‘iexplore.exe’). through the UDP socket to avoid the overhead associated
If AEH detects a new process during the periodic polling of with polling. Upon receiving a valid UDP command, our
running process list, it monitors the process if there is a rule virtual GPS driver signs a state-updated event and a data-
match. updated event to trigger the Windows Location Service to
We note the caveat that HTML5/JS-flavored Win- refresh the geolocation data.
dows Store Apps 1 run inside a host application called These CPU and memory perturbations can be used not
WWAEHost.exe. Therefore, AEH takes the hint from the only for changing a given test context, but also as a way to
working directory argument of the command line that starts emulate more limited hardware, if necessary.
the WWAEHost.exe process.
To accommodate the variable number of logging sources Weighted User Interaction Model. User inputs (e.g., nav-
(e.g., system built-in services and customized loggers), AEH igation and data) can significantly impact the app behavior
implements a plug-in architecture where each source is and resource consumption. For example, a news app con-
wrapped by a monitor. Monitors can either be time-driven taining videos will consume different resources depending
(i.e., logging at fixed intervals), or event-driven (i.e., logging on how many (and which) videos are viewed.
as an event of interest occurs). Logging can take place on Our goal is to exercise apps under common usage sce-
both system-wide or per-app fashion. The final log file is in narios with little guidance from developers. We represent
JSON format. app usage as a tree. Tree nodes represent the pages (or app
We have implemented several monitors. Two separate states), and tree edges represent the UI element being in-
monitors track system-wide and per-app performance voked. Then, invoking UI elements (e.g., clicking buttons)
counters (e.g., network traffic, disk I/Os, CPU and memory essentially traverses the UI tree.
usage), respectively. Specifically, the per-app monitor Each usage scenario is a tree with weighted edges based
records kernel events to back-trace a resource usage to on the likelihood of a user to select a particular UI element
the process. The third monitor hooks to the Event Trac- on a page. We generate a usage tree for each scenario
ing for Windows (ETW) service to capture the output with a stand alone authoring tool, which allows the user to
of msWriteProfilerMark JavaScript method in Internet assign a weight to each UI element. Higher weights indicate
Explorer. This method allows us to write debug data a higher probability of a particular UI element being invoked.
from HTML5/JS Windows Store apps. Finally, we have a
per-app energy consumption estimation monitor based on 5.2 Context Library
JouleMeter [22]. Our complete Context Library includes 1,254 mobile con-
text tests, the bulk of these being network and location re-
Perturbation Modules. We built the network perturbation lated. CL currently uses Open Signal [23] as an external data
module ontop of the Network Emulator for Windows Toolkit source. This public dataset is comprised of user contribu-
(NEWT), a kernel-space network driver. NEWT exposes tions of wireless networking measurements collected around
four main network properties: download/upload bandwidth, the world. It provides CL with data from more than 800 cel-
latency, loss rate (and model), and jitter. We introduced lular networks and one billion WiFi points readings.
necessary enhancements such as the real-time network prop- From datasets as large as Open Signal each data point
erty update to emulate network transitions and cellular tower could be extracted into one (or more) context tests. How-
handoffs. Another crucial enhancement is the per-app filter- ever, this would be intractable during testing but more impor-
ing mechanism for perturbation without impact on on other tantly data is often redundant. Instead we extract aggregates
running processes. – bucketization – that summarize classes of data points. In
Next, we describe two separate approaches to adjust the case of Open Signal we aggregate data representing carri-
the CPU clock speed. First, modern VM managers (e.g. ers at specific cities. Aggregation is performed by computing
VMWare’s vSphere) expose settings for CPU resource al- the centroid of all data points when grouped by carrier and
location on a per-instance basis. Second, most modern Intel city. We limit cities to 400 selected based on the volume of
processors support EIST that allows setting different perfor- data collected at Open Signal. 50 carriers are also included.
mance states. By manipulating the processor settings, we Although we do not maintain the same amount of data for
can make use of three distinct CPU states: 20%, 50%, and each carrier.
100%. Additional examples of CL tests are networking, memory
To control the amount of available memory to apps, and cpu events. These are a variety of hand coded scenarios
we use a tool called Testlimit from Microsoft, which in- replicating specific events typically challenging to apps. For
cludes command-line options that allows AEH to change the example, sudden drops in memory and fluctuations in cpu
amount of system commit memory. AEH then uses this tool availability. Device profiles are primarily combinations of
to create three levels of available memory for the apps to ex- memory and CPU to represent classes of device with certain
ecute. levels of resources. Device profiles also include specific real
devices available within the ConVirt testbed.
1 formerly referred to as Metro apps
5.3 Interfacing with App Environment Hosts
AEHs are controlled via PowerShell scripts, a Windows
shell environment. Each test script specifies the system
conditions to emulate and the system metrics to log. To
scale up the testbed to hundreds of machines, we set up
a client-server infrastructure where a central server (that
implements the ASP algorithm) configures each AEH
node via Windows Management Instrumentation (WMI)
technology, an RPC-like framework. The tight integration
between WMI and PowerShell simplifies the process of
pushing test scripts from the central server. In turn, a service
running on the AEH node accepts the script and initiates
its execution. Finally, after a test script is executed, the
output log is copied to a designated network share. Then,
the central server cleans up the data and writes to a MSSQL
database. Figure 3. Radar plot of statistics for an app under differ-
ent network conditions. The filled area represents the set
5.4 App Performance Analyzer of similar apps.
APA is responsible for (1) highlighting the system met-
ric measurements that are unexpected under certain mobile Latency Bandwidth
(ms) Error (%) (MBps) Error (%)
contexts; and (2) providing actionable reports that help users ADSL 373.38 12.03 544.98 0.96
interpret the raw measurements. Cable 338.18 12.34 641.18 2.65
To judge whether a system metric measurement is within 802.11g #1 118.60 14.05 5000.37 4.84
expectation, APA applies the common anomaly detection al- 802.11g #2 119.90 1.42 4000.06 0.99
gorithm: if the data is more than some standard deviations 802.11b #1 459.40 14.89 260.04 2.56
802.11b #2 367.49 12.11 567.77 3.86
away from the group mean, then it is flagged. APA defines KPN 2G 508.02 0.27 25.48 4.87
two notions of group. First, APA compares multiple itera- KPN 3G 166.05 1.07 99.05 1.81
tions of the same app being tested under the same context. China Mobile 3G 674.87 9.25 71.93 0.72
Second, APA compares an app with a comparison group of
apps, which can include a broad app population or simply a Table 1. Emulated network performance numbers and
set of apps considered to be similar in resource consumption. error for various network types.
APA reuses the same similarity equation described in §4.1.
APA copes with false positives by a ‘novelty’ weight,
which is proportional to the number of times the same com- the current practice by a factor of 11 and 8, respectively;
bination of emulated context and statistic is confirmed as and (5) we share lessons learned to help developers improve
norm by developers of similar apps. their apps.
Finally, the reporting interface of APA can present data
analysis in several ways. Figure 3 shows a radar plot that 6.1 Micro-benchmarks
highlights the system metrics that developers should focus
This sub-section presents experiments to verify the
on. These are the system metrics that have a high probability
accuracy of our simulated network conditions and quantify
of being outliers. In addition, identified outliers can be
the system overhead when probing apps; to ensure that
ranked as to how severe they are. Ranking considers three
measurements are not adversely affected by the system
factors: (1) the frequency at which the outlier occurs; (2)
components.
the difference to the comparison ’norm’; (3) the relative
importance of each metric. A variety of report templates that
Network Emulation Validation. To assess network emula-
provide actionable items for developers can be generated on
tion fidelity, we first configure NEWT parameters to match
demand based on crash and outlier data.
bandwidth, latency, jitter and packet loss from network
traces collected under different real world settings. Then,
6 Evaluation network performance is measured under NEWT.
This section is organized by the following major results: Real world network traces were collected in Amsterdam
(1) our system can emulate device and network conditions and Beijing, as listed in Table 1. Our benchmark client is a
with high fidelity; (2) bucketization can reduce the contin- laptop with a fast Ethernet connection, 802.11b/g radio, and
uous geolocation parameter space of a city to one single a USB-based cellular dongle. The backend server is a VM on
point, at the expense of measurement error rate between 7% Amazon’s EC2. We measured the network bandwidth from
and 30%; (3) GCF can find up to 36% more outliers than the laptop to the backend with iperf and latency by sending
the current industry practices, with the same amount of time 100 ICMP pings.
and computing resources; (4) contextual fuzzing increases The emulated testbed consists of two laptops connected
the number of crashes and performance outliers found over via a 100-Mbps switch. As our LAN outperforms all the
CPU (%) CPU (%)
Connections Active ●
App FW RAM (MB) Connections Established ●
Spotify 0.37 107.14 Connection Resets ●

Spotify (+FW) 0.51 1.04 105.46 segmentsRecv ●


segmentsSent ●
Youtube 5.73 178.91
TCP Tx ●
Youtube (+FW) 5.72 1.09 176.12 TCP Recv ●
MetroTwit 0.52 150.73 Total Disk Writes ●

MetroTwit (+FW) 0.52 0.83 160.03 Total Disk Reads ●


Datagram Recv ●
Hydro Thunder Hurricane 40.71 170.20 Datagram Tx ●
HTH (+FW) 42.42 1.04 169.82 User Time ●
Processor Time ●
Thread Count ●
Table 2. App performance statistics with the framework Peak Working Set ●
Avg Working Set ●
enabled and disabled. Peak Paged Memory ●
Avg Virtual Memory ●
Peak Virtual Memory ●

emulated network types, it does not introduce any artifact 0 1 2 3 4 5


Distance−to−Mean Ratio
to the emulation results. One laptop acts as the server, and
the other repeats the same iperf and ping tests while cycling
through all nine networks. Figure 4. Per-metric variation of WCDMA (T-Mobile)
Table 1 shows the emulated network bandwidth and emulation with traces from 12 locations in Seattle.
latency as observed. NEWT can throttle the bandwidth well,
with an error rate between 0.72% and 4.87%. On the other
hand, latency has an error rate of 18.05%. Fortunately, since GPRS, GPRS (out of range), GPRS (handoff), ADSL, and
the error rate is relatively constant between iterations, we Cable internet. Dataset #2 tests how apps behave under
can compensate by adjusting the latency setting accordingly. common network transitions: <802.11b, WCDMA>,
<802.11b, GPRS>, <GPRS, GPRS (out of range)>, and
System Overhead Analysis. To ensure that our measure- <GPRS, GPRS (tower handoff)>. Dataset #3 emulates a
ment techniques do not skew the results, we measured the WCDMA network at 10 locations with large mobile device
overhead introduced by our system. We selected four apps, populations: Beijing, Berlin, Jakarta, Moscow, New Delhi,
as listed in Table 2, for benchmark. This app selection Seattle, Seoul, Teheran, Tokyo, Washington D.C. Dataset
covers static web content access, streaming media (audio #4 emulates a WCDMA network at 12 locations uniformly
and video), social networks, and CPU/GPU intensive game. spread out in Seattle.
Each app ran two one-hour sessions; only one of which had
the framework enabled. Each session of the same app fol-
lowed the same user interaction model. In addition, for ses-
sions with the framework enabled, we instructed NEWT to 6.2.2 Test Case Bucketization
limit bandwidth to 1 Gbps. In this way, we ensure that Dataset #4 emulated real life WCDMA performance of
NEWT is actively monitoring TCP connections, without dis- 12 uniformly random locations in the Seattle area. The
turbing the available bandwidth. OpenSignal data suggests that the variations in bandwidth
We used Perfmon, a standard Windows tool, to track and loss rate are relatively low, as compared to latency. The
global and per-process performance counters. Table 2 standard deviation for upstream and downstream bandwidth,
shows that the framework has a very low resource overhead, loss rate, and latency are 0.20, 0.48, 0.01, and 39.75, respec-
and the framework does not significantly impact the app tively. We next quantify the impact of these variations on
behavior. Specifically, the framework uses an average of 1% bucketization.
CPU time, and the difference in memory usage is less than Figure 4 shows the trade off of bucketization. Suppose we
2% in most cases. pick one point to represent the 12 test points in Seattle, we
effectively reduce the number of profiles by a factor of 12.
6.2 Technique Validations However, doing so introduces error into measurements. Fig-
This section evaluates the scalability of our system in ure 4 shows the distance-to-mean ratio for each metric across
terms of parameter space searching. We start by describing all apps. If we exclude the total disk reads (with the highest
the four datasets used for evaluation. variance), 85% of the metrics have an average distance-to-
mean ratio less than 30%. In other words, if we use a single
6.2.1 Evaluation Datasets point to represent the WCDMA condition in Seattle, most
We collected four datasets from testing a total of 147 Win- measurement errors are between 7% and 30%. However, the
dows 8 Store apps. Individual test cases lasted five minutes, speed up (by a factor of 12) might outweigh the error in this
and they were repeated on four machines: two desktops, one case.
laptop, and one slate. All apps followed a random user inter- Figure 4 also shows that the network and disk metrics
action model, and all datasets are categorized according to observe a relative high variance, especially for the app
their target scenarios. categories of sports and travel. This result suggests that the
Dataset #1 aims for 8 common network conditions granularity of the buckets should adapt to each app.
that mobile devices may encounter: WCDMA, 802.11b,
1500 2500 3500 4500 5500

2500
● ●


Number of Outliers

Number of Outliers
● ●

2000

● ●

● ●

1500
● ●

● ●

ConVirt Vote (Best) ConVirt Vote (Best)

1000

● ● Oracle ● Vote (Worst) ● Oracle ● Vote (Worst)
Random ● Random

500
4 5 6 7 8
Num Cases to Test 4 5 6 7 8 9
Num Cases to Test
(a) All outliers found
Figure 6. The number of true outliers found for all 147
3000



app packages in dataset #2.
Number of Outliers



2000

● ●


of Vote can match this number. Second, we estimate the
1000

ConVirt Vote (Best) false positives by assuming the complete dataset (of all eight
● Oracle ● Vote (Worst)
Random cases) as the ground truth. False positives are inevitable for
0

4 5 6 7 8
all approaches because only estimations are possible with-
Num Cases to Test out complete measurements. In addition, while GCF reports
(b) False positives slightly more false positives initially, this count drops faster
as the number of cases increases. Finally, considering both
1500 2500 3500 4500

ConVirt Vote (Best) results, Figure 5(c) suggests that GCF can report more true
Number of Outliers


● Oracle ● Vote (Worst)
Random outliers than the non-oracle baselines. Even with only eight


profiles, it can find up to 21%, 8%, and 36% more true out-

● liers than Random, Vote-Best and Vote-Worst, respectively.



The difference between GCF and Oracle is due to the non-



zero error rate of the information gain prediction, as we
500

4 5 6 7 8 quantify next.
Num Cases to Test We quantify two sources of the prediction error by calcu-
(c) True outliers lating the fit score relative to the oracle, or the percentage of
matching profiles in both sets. First, all apps start with a fixed
Figure 5. The number of outliers found for all 147 app bootstrapping set, which has a fit score of 48%. We note that
packages in dataset #1. this fit is better than the non-oracle approaches (e.g., 32%
for Vote). Second, GCF’s set selection accuracy for 4, 5,
6, 7, 8 test cases are 53.57%, 66.94%, 75.85%, 87.95% and
6.2.3 App Similarity Prioritization 100%, respectively. The accuracy increases with the number
The evaluation uses two metrics: number and relevance of measurements, only the bast case of Vote comes near this
of outliers found as (1) the time budget varies, and (2) the result.
amount of available computing resource varies. Finally, we note that the degree of gain from prioritiza-
We picked three comparison baselines to represent the tion is proportional to the degree of parameter variations
absolute upper-bound and common approaches. First, between test cases. For example, compared to dataset #1
Oracle has the complete knowledge of the measurements (network profiles of different physical mediums), dataset
for all apps (including untested ones), and it represents the #2 (WCDMA cellular profiles from different countries)
upper-bound. Vote picks the most popular test case during has a less variation. In addition, Figure 6 suggests that
each step from all previously tested apps. In addition, as a Context Virtualizer finds only up to 12%, 7%, and 11%
single ordering is not optimal for all apps, we coin the term more true outliers than Random, Vote-Best and Vote-Worst,
”vote-best” and ”vote-worst” for its upper and lower bound. respectively.
Finally, Random randomly picks an untested case to run at
each step. Resource Requirements. An important observation is that,
since app testing is highly parallelizable, multiple apps can
Assessment of Outliers Found. We start by studying the be exercised at the same time on different machines. On the
question: given sufficient time to exercise AppPkgtest under extreme of infinite amount of computing resources, all pri-
x test cases, what should those x cases be to maximize the oritization techniques would perform equally well, and the
number of reported problems. The assumption is that a sin- entire dataset can finish in one test-case time. Given this is
gle test case runs for a fixed amount of time (e.g., five min- not practical in the real world, we measure the speed up that
utes). Figure 5 and Figure 6 illustrate the results from dataset GCF offers under various amounts of available computing
#1 and #2, respectively. Next, we use the former to highlight resources.
three observations. Figure 7 illustrates combinations of computing resources
First, the total number of outliers reported by GCF is bet- and time required to find at least 10 potential problems in
ter than the non- oracle baselines, and only the best cases each of the 147 apps in dataset #2. First, the figure shows
900 1200

500 1000 1500 2000 2500


Computing Req (machines)
ConVirt UI Automation
ConVirt

● Random

Number of Outliers
600

● ● ●
300
0

● ●

2500 3500 4500 5500 6500 7500


Time Req (min)

0
HTML Managed x64 Neutral

Figure 7. Feasible combinations of time and computing


budget to find at least 10 potential problems in each of Figure 8. App performance outliers categorized by app
the 147 apps in dataset #2. source code type and targeted hardware architecture.

200 400 600 800 1000


ConVirt UI Automation

increasing the resource in one dimension can lower the

Number of Crash Instances


requirement on the other. Second, by estimating the infor-
mation gain of each pending test case, Context Virtualizer
can reach the goal faster and with fewer machines. For
example, given 4425 minutes of time budget, ConVirt needs
294 machines – 33% less. Finally, the break-even point for
Random is at the time budget of 6630 minutes, or > 90% of

0
HTML Managed x64 Neutral
the total possible time for testing all combinations of apps
and test cases. Figure 9. Crashes categorized by app source code type
and targeted hardware architecture.
6.3 Aggregate App Context Testing Analysis
In this section, we (1) investigate crashes and perfor-
mance outliers identified via contextual fuzzing, and (2) ex- izer is able to identify significantly more potential app prob-
amine the issues we have found from a set of publicly avail- lems across both categories. For example, in both categories,
able apps that presumably had already been tested. this difference in the number of outliers found is a factor of
For these experiments, we exercised the same 147 approximately 8 times.
Windows Store apps on test cases of dataset #1, as described Table 3 categorizes apps in the same way as the Windows
in §6.2. Individual apps were tested with four 5-min rounds Store. It shows that media-heavy apps (e.g., music, video,
under both ConVirt and the context- free baseline solution entertainment, etc.) tend to exhibit problems in multiple
described below. Since the setup is identical for each run contexts. This observation motivates the use of contextual
of the same app, differences in crashes and performance fuzzing in mobile app testing. Furthermore, the use of media
outliers detected are due to our inclusion of contextual often increases the memory usage, which results in crashes.
fuzzing. Figure 10 shows performance outliers broken down by
resource type. The figure suggests that most outliers are
Comparison Baseline. We use a conventional UI au- network or energy related. While tools for testing different
tomation based approach as the comparison baseline. The networking scenarios are starting to emerge [20], the same
baseline represents current common practice of testing has not yet happened for energy-related testing (which also
mobile apps. We use the default WUIM that randomly heavily depends on the type of radios and communication
explores the user interface, which is functionally equivalent protocols in use). Both disk activity (i.e., I/O) and CPU
to the Android Exerciser Monkey [13]. appear to have approximately the same number of perfor-
mance outlier cases.
Summary of Findings. Overall, Context Virtualizer is
able to discover significantly more crashes (11×) and 6.4 Experience and Case Studies
performance outliers (11×) than the baseline solution. In this section we highlight some identified problem
Furthermore, with Context Virtualizer, 75 out of the 147 scenarios that mobile app developers might be unfamiliar
apps tested observed a total of 1,170 crash incidents and with, thus illustrating how a tool like Context Virtualizer can
4,589 performance outliers. This result is surprising, as be used to prevent ever more common issues.
these production apps should have been thoroughly tested
by developers. Geolocation. As an increasing number of apps on mo-
bile platforms become location-aware, services start provid-
Findings by Categories. Figure 8 and Figure 9 show the ing location-tailored content. For example, a weather app
number of crashes and performance outliers categorized by can provide additional weather information for certain cities,
app source code type and targeted hardware architecture: (1) HTML-based vs. compiled managed code, and (2) x64 vs. architecture-neutral. The observation is that Context Virtual-

Figure 10. App performance outliers categorized by resource usage. (Bar chart; x-axis: outlier type (I/O, CPU, Memory, Network, Energy); y-axis: number of outliers; series: ConVirt and UI Automation.)

                  Outliers                    Instances of Crash
                  Context Virtualizer  Auto UI  Context Virtualizer  Auto UI
News                    1437            147            284            24
Entertainment            667             62            101             6
Photo                     90             10             17             1
Sports                   304             41            194            15
Video                    688            123            142            12
Travel                   238             12             25             1
Finance                  193             13              0             0
Weather                   31              2              1             0
Music                    737             85            289            38
Reference                 12              0              0             0
Education                125             13              0             0
E-reader                  24              5              0             0
Social                    20              1            112             5
Lifestyle                 23              4              5             0

Table 3. App crashes and performance outliers, categorized using the same app categories as the Windows Store.

Figure 11. Per-metric variation of WCDMA emulation with traces from the top 10 worldwide locations with high mobile device usage. (Dot plot; x-axis: distance-to-mean ratio, 0.0 to 2.0; y-axis: per-app metrics such as connections active/established, connection resets, TCP segments sent/received, datagrams sent/received, disk reads/writes, user and processor time, thread count, and working-set, paged-memory, and virtual-memory sizes.)

such as Seattle. Another example is content restrictions in some streaming and social apps. Unfortunately, many developers are unaware of the implications of device geolocation on app behavior and energy consumption. We use dataset #3 to illustrate these impacts.

Our first case study focuses on an app released by a US-based magazine publisher. Test results from ten worldwide locations show that the app crashed frequently outside of North America, sometimes not even proceeding beyond the loading screen. This problem was confirmed by users on the app marketplace and verified by our manual testing, and the publisher later released an update (after our first round of testing). Re-exercising the app with our tool suggests that the likelihood of crashes in China was reduced from 80% to 50% with the new version, but the problem was not completely resolved.

A second case study on location exemplifies how resource consumption variance can be significant with geolocation. Figure 11 depicts a snapshot of dataset #3, where we see that network-related metrics typically exhibit the largest variance. We note that network-related metrics can also impact system-related metrics such as CPU. The implications are twofold. First, apps can present higher energy consumption in certain locations; for example, a particular weather app uses 8% more energy in Seattle than in Beijing. Second, excessive memory usage in some locations can translate to crashes.
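To make the variation summarized in Figure 11 concrete, the TypeScript sketch below computes a per-metric distance-to-mean ratio across location traces. The exact formula behind the figure is not spelled out in the text; the ratio here is assumed to be the largest relative deviation from the per-metric mean, and the sample values are invented purely for illustration.

```typescript
// Hypothetical sketch: per-metric variation across emulated locations,
// in the spirit of Figure 11. The ratio definition is an assumption.
type MetricSamples = Record<string, number[]>; // metric name -> one value per location

function distanceToMeanRatios(samples: MetricSamples): Record<string, number> {
  const ratios: Record<string, number> = {};
  for (const [metric, values] of Object.entries(samples)) {
    const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
    // Largest relative deviation from the mean across the emulated locations.
    ratios[metric] =
      mean === 0 ? 0 : Math.max(...values.map((v) => Math.abs(v - mean) / mean));
  }
  return ratios;
}

const example: MetricSamples = {
  "TCP Recv (KB)": [1200, 450, 980, 300],   // network metrics swing widely by location
  "Avg Working Set (MB)": [52, 50, 51, 53], // memory metrics stay comparatively flat
};
console.log(distanceToMeanRatios(example));
```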

Network Transitions. In contrast to PCs, mobile devices are rich in mobility and physical radio options. In fact, network transitions can happen frequently throughout the day, e.g., handoffs between cellular towers, switches from 3G to 2G, and transitions between Wi-Fi and cellular networks. While an increasing number of developers have started to test apps under discrete network profiles [20], testing for network transitions is not yet a common practice.

An example case that demonstrates these issues is a popular Twitter client app. Our system logs and user traces suggested that the app crashed if users tried to tweet after a transition to a slower network. Any attempt by a user to post two separate messages, one over a Wi-Fi network and a second one after switching to a slower GPRS network, was enough to repeatedly cause a crash. Interestingly, if the opposite network transition occurs, it does not seem to affect the app execution. Without peeking into the source code, it is difficult to point out the exact root cause of the issue. However, all results from our logs (and posterior manual exploration) suggest that the app does not consider network dynamics and assumes the network is always available after an initial successful connection.
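A defensive version of the failing operation can be sketched in a few lines. This is an illustrative TypeScript sketch using the standard fetch API, not the Twitter client's actual code, and the URL parameter is generic; the point is that the post is bounded by a timeout and that a failure after a downgrade surfaces as a handled result rather than an unhandled exception.

```typescript
// Post a message without assuming the network that worked a moment ago is
// still fast or even available (e.g., after a Wi-Fi to GPRS transition).
async function postMessage(url: string, body: string): Promise<boolean> {
  const controller = new AbortController();
  // Bound the request so a sudden downgrade degrades into a retryable failure.
  const timer = setTimeout(() => controller.abort(), 15_000);
  try {
    const resp = await fetch(url, { method: "POST", body, signal: controller.signal });
    return resp.ok;
  } catch (err) {
    // Timeouts, connection resets, and offline periods land here; the caller
    // can queue the message and retry later instead of crashing.
    return false;
  } finally {
    clearTimeout(timer);
  }
}
```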
Exception Handlers. Results of dataset #1 also indicate that some music streaming apps tend to crash with higher frequency on slow and lossy networks. Decompiled code analysis [29] reveals that the less crash-prone apps apply a more comprehensive set of exception handlers around network-related system calls. Handling such exceptional cases is extremely important on mobile platforms, as they can experience a much wider range of network conditions than traditional PCs. As previously highlighted [7], the lack of a priori design of exceptional behavior can lead to multiple issues; yet it is not feasible to expect developers to create such code correctly without tools that support them in checking for the necessary environment conditions.
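A minimal sketch of that pattern follows, again in TypeScript with the standard fetch API rather than code from any of the tested apps; the playlist URL is a placeholder. Every network-related call goes through a guard that turns errors on a slow or lossy link into a fallback value instead of terminating the app.

```typescript
// Generic guard used around every network-related call: convert any failure
// (DNS error, connection reset, timeout on a lossy link) into a fallback value.
async function guarded<T>(op: () => Promise<T>, fallback: T, label: string): Promise<T> {
  try {
    return await op();
  } catch (err) {
    console.warn(`${label} failed; continuing with fallback.`, err);
    return fallback;
  }
}

// Usage: a streaming app fetching its playlist over a slow, lossy network.
async function loadPlaylist(): Promise<string[]> {
  return guarded(
    async () => {
      const resp = await fetch("https://example.com/playlist.json"); // placeholder URL
      if (!resp.ok) throw new Error(`HTTP ${resp.status}`);
      return (await resp.json()) as string[];
    },
    [],
    "playlist fetch",
  );
}
```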
Device Service Misuse. In this case, ConVirt highlighted a possible energy bug in a location tracking app. The app registers for location updates (by setting a minimum distance threshold for notifications) to avoid periodically polling the platform. Events are then signaled if the location service detects that the device has moved beyond the threshold. However, the app set the threshold to a value lower than the typical accuracy of the location providers (e.g., 5 m accuracy at best on a typical smartphone GPS [31]). This resulted in the app constantly receiving location updates and keeping the process up, which then consumed more energy than expected.
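The fix implied by the logs is to never request a finer movement threshold than the provider can actually deliver. The sketch below uses the standard web Geolocation API purely for illustration (the app itself targets the platform's location service), and the requested threshold value is invented; the threshold is clamped to the accuracy reported with each fix.

```typescript
// Clamp the requested movement threshold to the sensor's reported accuracy;
// otherwise GPS jitter alone triggers an update on nearly every fix and keeps
// the process (and the GPS) awake.
const REQUESTED_THRESHOLD_M = 2; // finer than a typical smartphone GPS can resolve

function distanceMeters(a: GeolocationCoordinates, b: GeolocationCoordinates): number {
  const rad = Math.PI / 180, R = 6371e3; // haversine distance on a spherical Earth
  const dLat = (b.latitude - a.latitude) * rad;
  const dLon = (b.longitude - a.longitude) * rad;
  const s =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(a.latitude * rad) * Math.cos(b.latitude * rad) * Math.sin(dLon / 2) ** 2;
  return 2 * R * Math.asin(Math.sqrt(s));
}

function startTracking(onMove: (pos: GeolocationPosition) => void): number {
  let last: GeolocationPosition | null = null;
  return navigator.geolocation.watchPosition((pos) => {
    const threshold = Math.max(REQUESTED_THRESHOLD_M, pos.coords.accuracy);
    if (last === null || distanceMeters(last.coords, pos.coords) >= threshold) {
      last = pos;
      onMove(pos); // only now do real work; small jitters are ignored
    }
  });
}
```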
7 Discussion
We now examine some of the overarching issues related to the design of Context Virtualizer.

Generality of the System. While the paper focuses on app testing, our core ideas can be generalized to other scenarios. In privacy, applying app similarity networks to packet inspection can discover apps that transmit an abnormal amount of personal data. In energy optimization, our measurements can help determine whether an app would experience significant performance degradation on a slower but more energy-efficient radio.

Real-World Mobile Context Collection. Our system relies on real-world traces to emulate mobile context. We recognize that some traces are more difficult to collect than others. For example, while data on cellular carrier performance at various locations is publicly available [23], an extensive database on how users interact with apps is not easily accessible. We leave the problem of collecting user interaction traces at large scale as future work.

Hardware Emulation Limitation. ConVirt currently only exposes coarse-grained hardware parameters: CPU clock speed and available memory. While this means that certain artifacts of the hardware architecture cannot be emulated, our system can accommodate real devices in the test client pool to achieve hardware coverage.

User Interaction Model Limitation. Our current implementation interacts with apps by invoking their user interface elements (e.g., buttons, links) through a UI automation library. One limitation is that user gestures cannot be emulated, so many games cannot be properly tested. Gesture support is left as mid-term future work.

Applicability To Other Platforms. While our current implementation works for Windows 8, the core system ideas can also work on platforms that fulfill three requirements: (1) the network stack should allow incoming and outgoing packets to be manipulated; (2) the platform should provide a way to emulate inputs to apps, such as user touch, sensors, and GPS; and (3) the OS should provide good performance logging. Additionally, the backend network should have higher performance than the emulated network profiles. These requirements are not onerous, and Android is another readily suitable platform.
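To make the three requirements concrete, the sketch below expresses them as a hypothetical platform adapter interface in TypeScript; the names and shapes are ours for illustration and are not ConVirt's actual internal API.

```typescript
// Hypothetical per-platform adapter capturing the three requirements above.
interface NetworkProfile {
  downlinkKbps: number;
  uplinkKbps: number;
  latencyMs: number;
  lossRate: number; // 0..1
}

interface PlatformAdapter {
  // (1) Hook the network stack: shape, delay, or drop incoming/outgoing packets.
  applyNetworkProfile(profile: NetworkProfile): Promise<void>;
  // (2) Emulate inputs to the app under test: UI events, sensor readings, GPS fixes.
  injectInput(event:
    | { kind: "tap"; x: number; y: number }
    | { kind: "sensor"; name: string; value: number }
    | { kind: "gps"; latitude: number; longitude: number }): Promise<void>;
  // (3) Read OS-provided performance counters (CPU, memory, network, energy).
  readCounters(appId: string): Promise<Record<string, number>>;
}
```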
8 Related Work
Clearly, a wide variety of testing methodologies already exist for discovering software, system, and protocol problems. Context Virtualizer contributes to this general area by proposing to expand the testing space for mobile devices to include a variety of real-world contextual factors. In particular, we advance prior investigations of fuzz testing [21, 11] with techniques that enable the systematic search of this new context test space; this new approach can then complement existing techniques, including static analysis and fault injection.

Mobile App Testing. In response to the strong need for improved mobile app testing tools, academics and practitioners have developed a variety of solutions. A popular approach is log analytics, with a number of companies [9, 6, 4, 30] offering such services. Although this data is often insightful, it is only collectable post-release, thus exposing users to a negative user experience.

Similarly, although AppInsight [25] enables significant new visibility into app behavior, it is also a post-deployment solution that requires app binary instrumentation. Context Virtualizer, in contrast, enables pre-release testing of unmodified app binaries.

Emulators are also commonly used and include the ability to test coarse network conditions and sensor (GPS, accelerometer) input. [14, 20], for instance, offer such controls and allow a developer to test their app by selecting different speeds for each network type. More advanced emulator usage, such as [18], enables developers to have scripts sent to either real hardware or emulators and to select from a limited set of network conditions to apply during the test. Unlike Context Virtualizer, neither emulator systems nor manual testing offer the ability to control a wide range of context dimensions, simultaneously if required; neither do they allow the high-fidelity emulation of contextual conditions, such as handoff between networks (e.g., 3G to Wi-Fi), that can be problematic for apps. Finally, the conditions tested are typically defined by the developers; ConVirt, instead, generates the parameters for test cases automatically.

Testing methods based on UI automation, when applied to mobile app testing, adopt emulators to host the applications. Android offers specialized tools [14, 13] for custom UI automation solutions. Also, significant effort has been invested into generating UI input for testing (e.g., [34, 15]) with specific goals in mind (e.g., code coverage). ConVirt allows users to define usage scenarios or falls back to random exploration; however, any automation technique can be added to our system.

Efficient State Space Exploration for Testing. A wide variety of state exploration strategies have been developed to solve key problems in domains such as distributed systems verification [35] and model checking [16]. State exploration is also a fundamental problem encountered in many testing systems. Many existing solutions assume a level of internal software access. For example, [1] explores code paths within mobile apps and reduces path explosion by merging redundant paths. Context Virtualizer tests apps as a black box, so such techniques do not apply. In [12], a prioritization scheme is proposed that exploits inter-app similarity between code statement execution patterns. However, ConVirt computes similarity completely differently (based on resource usage), allowing use with black-box apps.
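The contrast with [12] can be illustrated with a small sketch: apps are represented by vectors of externally observed resource usage (no source code or statement coverage needed), and the contexts that exposed problems in the most similar, already-tested apps are tried first on a new app. The feature set and weighting shown here are illustrative, not ConVirt's exact formulation.

```typescript
// Cosine similarity over resource-usage vectors (e.g., CPU time, memory,
// network bytes), measured from the outside with the app as a black box.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na > 0 && nb > 0 ? dot / Math.sqrt(na * nb) : 0;
}

// Rank candidate contexts for a new app: contexts that exposed problems in
// apps with similar resource profiles come first.
function prioritizeContexts(
  newAppUsage: number[],
  history: { context: string; appUsage: number[] }[],
): string[] {
  return history
    .map((h) => ({ context: h.context, score: cosineSimilarity(newAppUsage, h.appUsage) }))
    .sort((x, y) => y.score - x.score)
    .map((h) => h.context);
}
```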
Simulating Real-world Conditions. ConVirt relies on high-fidelity context emulation along with real hardware. Other domains, notably sensor networks, have also developed testing frameworks (e.g., [19]) that incorporate accurate network emulation. In particular, Avrora [32] offers cycle-accurate emulation of sensor nodes in addition to network emulation. Conceptually, ConVirt has similarities with these frameworks, but in practice the context space (low-power radio) and target devices (small-scale sensor nodes) are completely different.

9 Conclusion
This paper presents Context Virtualizer (ConVirt), an automated service for testing mobile apps using an expanded mobile context test space. By expanding the range of test conditions, we find ConVirt is able to discover more crashes and performance outliers in mobile apps than existing tools, such as emulator-based UI automation.

10 References
[1] S. Anand, M. Naik, H. Yang, and M. Harrold. Automated concolic testing of smartphone apps. In Proceedings of the ACM Conference on Foundations of Software Engineering (FSE). ACM, 2012.
[2] Apigee. http://apigee.com.
[3] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, August 2006.
[4] BugSense. BugSense — Crash Reports. http://www.bugsense.com/.
[5] Carat. http://carat.cs.berkeley.edu/.
[6] Crashlytics. Powerful and lightweight crash reporting solutions. http://www.crashlytics.com.
[7] R. Di Bernardo, R. Sales Jr., F. Castor, R. Coelho, N. Cacho, and S. Soares. Agile testing of exceptional behavior. In Proceedings of SBES 2011. SBC, 2011.
[8] Flurry. Electric Technology, Apps and The New Global Village. http://blog.flurry.com/default.aspx?Tag=market%20size.
[9] Flurry. Flurry Analytics. http://www.flurry.com/flurry-crash-analytics.html.
[10] Fortune. 40 staffers. 2 reviews. 8,500 iPhone apps per week. http://tech.fortune.cnn.com/2009/08/21/40-staffers-2-reviews-8500-iphone-apps-per-week/.
[11] P. Godefroid, M. Y. Levin, and D. Molnar. Automated whitebox fuzz testing. In Proceedings of NDSS 2008, 2008.
[12] A. Gonzalez-Sanchez, R. Abreu, H.-G. Gross, and A. J. van Gemund. Prioritizing tests for fault localization through ambiguity group reduction. In Proceedings of ASE 2011, 2011.
[13] Google. monkeyrunner API. http://developer.android.com/tools/help/monkeyrunner_concepts.html.
[14] Google. UI/Application Exerciser Monkey. http://developer.android.com/tools/help/monkey.html.
[15] F. Gross, G. Fraser, and A. Zeller. Search-based system testing: high coverage, no false alarms. In Proceedings of the International Symposium on Software Testing and Analysis (ISSTA), 2012.
[16] H. Guo, M. Wu, L. Zhou, G. Hu, J. Yang, and L. Zhang. Practical software model checking via dynamic interface reduction. In Proceedings of SOSP 2011. ACM, 2011.
[17] J. Huang, C. Chen, Y. Pei, Z. Wang, Z. Qian, F. Qian, B. Tiwana, Q. Xu, Z. Mao, M. Zhang, et al. MobiPerf: Mobile network measurement system. Technical report, University of Michigan and Microsoft Research, 2011.
[18] B. Jiang, X. Long, and X. Gao. MobileTest: A tool supporting automatic black box test for software on smart mobile devices. In H. Zhu, W. E. Wong, and A. M. Paradkar, editors, AST, pages 37–43. IEEE, 2007.
[19] P. Levis, N. Lee, M. Welsh, and D. E. Culler. TOSSIM: accurate and scalable simulation of entire TinyOS applications. In I. F. Akyildiz, D. Estrin, D. E. Culler, and M. B. Srivastava, editors, SenSys, pages 126–137. ACM, 2003.
[20] Microsoft. Simulation Dashboard for Windows Phone. http://msdn.microsoft.com/en-us/library/windowsphone/develop/jj206953(v=vs.105).aspx.
[21] B. P. Miller, L. Fredriksen, and B. So. An empirical study of the reliability of UNIX utilities. Communications of the ACM, 33(12):32, December 1990.
[22] R. Mittal, A. Kansal, and R. Chandra. Empowering developers to estimate app energy consumption. In Proceedings of the 18th Annual International Conference on Mobile Computing and Networking, MobiCom ’12, pages 317–328, New York, NY, USA, 2012. ACM.
[23] Open Signal. http://opensignal.com.
[24] Open Signal. The many faces of a little green robot. http://opensignal.com/reports/fragmentation.php.
[25] L. Ravindranath, J. Padhye, S. Agarwal, R. Mahajan, I. Obermiller, and S. Shayandeh. AppInsight: mobile app performance monitoring in the wild. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, OSDI ’12, pages 107–120, Berkeley, CA, USA, 2012. USENIX Association.
[26] Samsung. Series 7 11.6” Slate. http://www.samsung.com/us/computer/tablet-pcs/XE700T1A-A01US.
[27] TechCrunch. Mobile App Users Are Both Fickle And Loyal: Study. http://techcrunch.com/2011/03/15/mobile-app-users-are-both-fickle-and-loyal-study.
[28] TechCrunch. Users Have Low Tolerance For Buggy Apps: Only 16% Will Try A Failing App More Than Twice. http://techcrunch.com/2013/03/12/users-have-low-tolerance-for-buggy-apps-only-16-will-try-a-failing-app-more-than-twice.
[29] Telerik. JustDecompile. http://www.telerik.com/products/decompiler.aspx.
[30] TestFlight. Beta testing on the fly. https://testflightapp.com/.
[31] A. Thiagarajan, L. Ravindranath, K. LaCurts, S. Madden, H. Balakrishnan, S. Toledo, and J. Eriksson. VTrack: accurate, energy-aware road traffic delay estimation using mobile phones. In Proceedings of the 7th ACM Conference on Embedded Networked Sensor Systems, SenSys ’09, pages 85–98, New York, NY, USA, 2009. ACM.
[32] B. L. Titzer, D. K. Lee, and J. Palsberg. Avrora: scalable sensor network simulation with precise timing. In Proceedings of the 4th International Symposium on Information Processing in Sensor Networks, IPSN ’05, Piscataway, NJ, USA, 2005. IEEE Press.
[33] A. I. Wasserman. Software engineering issues for mobile application development. In Proceedings of the FSE/SDP Workshop on Future of Software Engineering Research, FoSER ’10, pages 397–400, New York, NY, USA, 2010. ACM.
[34] L. Yan and H. Yin. DroidScope: Seamlessly reconstructing the OS and Dalvik semantic views for dynamic Android malware analysis. In Proceedings of the 21st USENIX Security Symposium. USENIX, 2012.
[35] J. Yang, C. Sar, and D. R. Engler. EXPLODE: A Lightweight, General System for Finding Serious Storage System Errors. In B. N. Bershad and J. C. Mogul, editors, Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 131–146. USENIX Association, 2006.