Contextual Fuzzing: Automated Mobile App Testing Under Dynamic Device and Environment Conditions
Wireless Network Conditions. Variation in network conditions leads to different latency, jitter, loss, throughput, and energy consumption, which in turn impacts the performance of many network-facing apps (43% of apps in Google Play). These variations can be caused by the operator, signal strength, and the technology in use (e.g., Wi-Fi vs. LTE), among other factors.

Challenge 1. High-fidelity Mobile Context Emulation. Cloud-based emulation of realistic mobile contexts enables more complete pre-release testing of apps. Developers can then incorporate such testing into their development cycle. Similarly, distributors can be more confident of the user experience consumers will receive. A viable solution to the emulation problem must have two characteristics. First, it must be comprehensive, capable of emulating the key context dimensions (viz. network, sensor, hardware) as well as the key tests within each dimension (e.g., wireless handoffs under networking). Second, its emulation of real-world phenomena must be accurate enough that problematic app behavior still manifests.
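To make the comprehensiveness requirement concrete, the sketch below shows one way a context test case spanning the network, sensor, and hardware dimensions could be represented. This is an illustrative Python sketch under stated assumptions, not ConVirt's actual data model; all class names, fields, and parameter values are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class NetworkProfile:
    """Illustrative network context: parameters that shape latency, jitter, loss, and throughput."""
    name: str                          # e.g. "GPRS", "802.11b", "LTE"
    latency_ms: float
    jitter_ms: float
    loss_rate: float                   # fraction of packets dropped
    downlink_kbps: float
    uplink_kbps: float
    handoff_to: Optional[str] = None   # e.g. emulate a cellular-to-Wi-Fi handoff mid-test

@dataclass
class ContextTestCase:
    """A single context test case drawn from the network, sensor, and hardware dimensions."""
    network: Optional[NetworkProfile] = None
    sensors: Dict[str, object] = field(default_factory=dict)   # e.g. {"gps": trace}
    hardware: Dict[str, object] = field(default_factory=dict)  # e.g. {"cpu_mhz": 800}

# A hypothetical test case: a weak cellular link that hands off to Wi-Fi during the run.
example = ContextTestCase(
    network=NetworkProfile("GPRS", latency_ms=500, jitter_ms=100, loss_rate=0.02,
                           downlink_kbps=80, uplink_kbps=40, handoff_to="802.11b"),
    hardware={"cpu_mhz": 800, "memory_mb": 512},
)
print(example.network.name, "->", example.network.handoff_to)
```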
linear dependence between the two sets of variables. While the correlation coefficient is signed, depending on whether the relationship is negative or positive, we use its absolute value as the edge weight.

Bootstrap Scenario. Initially, there are no existing resource measurements for the target app from which similarity to other apps can be computed. From our experiments, we find the most effective (i.e., leading to the highest prioritization performance) initial set of three context tests to be three wireless networking profiles (cf. §5.2): GPRS, 802.11b, and cable. As more resource measurements are made for the target app, the similarity networks improve (and largely stabilize).

Per-Measurement Networks. We utilize multiple similarity networks, one for each type of resource measurement (e.g., allocated memory and CPU utilization). Figure 2 presents similarity graphs for two different resource metrics, TCP packets received and disk reads, for a population of 147 Microsoft Windows Store apps (detailed in §6). Each edge weight shown in both figures (equal to the correlation coefficient) is larger than 0.7. We note that each graph, even with this restriction applied, is well connected and has a high average node degree. Furthermore, the two graphs are clearly different, which supports our decision to use per-measurement networks instead of a single one.
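A minimal Python sketch of how such per-measurement similarity networks could be built, using the absolute value of the (Pearson) correlation coefficient as the edge weight and keeping only edges above 0.7 as in Figure 2. The data layout, threshold handling, and toy numbers are assumptions for illustration, not ConVirt's implementation.

```python
from itertools import combinations
from statistics import mean
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length measurement series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def build_similarity_network(measurements, threshold=0.7):
    """One graph per resource metric; edge weight = |correlation| across shared test cases.

    measurements: {app_name: [metric value under each context test case]} for a
    single metric, e.g. TCP packets received.
    """
    edges = {}
    for a, b in combinations(measurements, 2):
        w = abs(correlation(measurements[a], measurements[b]))
        if w > threshold:                 # keep only strongly related app pairs
            edges[(a, b)] = w
    return edges

# Toy example with three apps and four context test cases (hypothetical numbers).
tcp_recv = {
    "news_app":  [120, 340, 80, 400],
    "video_app": [130, 360, 90, 390],
    "game_app":  [10, 12, 9, 11],
}
print(build_similarity_network(tcp_recv))
```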
4.2 App Similarity-based Prioritization
ASP prioritization is an iterative four-step process that repeats while the target app is tested by Context Virtualizer. The steps are: (1) cluster, (2) predict, (3) rank, and (4) update.

Rank. Based on the predicted resource estimates and crash flags, a ranking is computed over all context test cases yet to be executed for the target app. The priority of test cases is determined by this ranking, which is computed as follows. First, the variance is calculated for the estimates of each resource measurement within each test case (excluding those with crash flags). Variance provides a notion of agreement between the estimates: the intuition is that tightly clustered (i.e., low-variance) estimates indicate higher confidence in the prediction. We want to rank test cases highly when they have large uncertainty across their resource measurements. Second, the variance for each resource measurement is compared with all other resource measurements of the same type, and a percentile is assigned based on this comparison. The average percentile over the resource measurements within each context test case is then computed. Third, crash flags for all resource measurements are counted within each test case. We apply a simple scaling function to the number of crashes that balances the importance of many crashes against a high percentile. Finally, the scaled crash-flag count is added to the average percentile to compute the final rank.

Update. The final phase in prioritization is to revise the app similarity network based on the new resource measurements collected about the target app. These measurements may have altered the edge weights between the target app and the other apps in the network. The new rankings determined at this iteration are used as soon as an AEH completes its previously assigned context test. Once the head of the context test ranking has been assigned to an AEH, it is no longer considered when ranking occurs at the next iteration.
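To make the Rank step described above concrete, here is a simplified Python sketch of the computation: per-test-case variance of each resource estimate, a percentile against variances of the same resource type, and a scaled crash count added on top. The exact percentile and crash-scaling functions are not given in the text, so the forms below (and the toy inputs) are assumptions.

```python
from statistics import pvariance, mean

def percentile_rank(value, population):
    """Fraction of the population that is <= value (simple empirical percentile)."""
    return sum(1 for v in population if v <= value) / len(population)

def scale_crashes(n_crashes, weight=0.1):
    """Illustrative scaling: diminishing returns so many crashes do not
    completely dominate a high uncertainty percentile."""
    return weight * n_crashes / (1 + n_crashes)

def rank_test_cases(estimates, crash_flags):
    """estimates: {case: {metric: [predicted values from similar apps]}}
    crash_flags: {case: number of predicted crashes}
    Returns test cases sorted by descending priority."""
    # 1. Variance of each metric's estimates within each case (uncertainty signal).
    #    (The paper also excludes crash-flagged measurements here; omitted for brevity.)
    variances = {case: {m: pvariance(vals) for m, vals in metrics.items()}
                 for case, metrics in estimates.items()}
    scores = {}
    for case, metrics in variances.items():
        # 2. Percentile of each variance against all variances of the same metric type.
        pcts = []
        for m, var in metrics.items():
            same_type = [variances[c][m] for c in variances if m in variances[c]]
            pcts.append(percentile_rank(var, same_type))
        # 3. Average percentile plus the scaled crash count gives the final score.
        scores[case] = mean(pcts) + scale_crashes(crash_flags.get(case, 0))
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: two context test cases, two resource metrics each (hypothetical values).
estimates = {
    "GPRS":    {"cpu": [10, 30, 55], "memory": [200, 210, 205]},
    "802.11b": {"cpu": [20, 21, 22], "memory": [300, 640, 150]},
}
crash_flags = {"802.11b": 2}
print(rank_test_cases(estimates, crash_flags))
```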
Figure 5. The number of outliers found for all 147 app packages in dataset #1: (a) all outliers found, (b) false positives, (c) true outliers. [Plots of the number of outliers versus the number of cases to test, comparing ConVirt, Oracle, Vote (Best), Vote (Worst), and Random.]

Figure 6. The number of true outliers found for all 147 app packages in dataset #2. [Plot of the number of outliers versus the number of cases to test for the same schemes.]

6.2.3 App Similarity Prioritization
The evaluation uses two metrics: the number and relevance of outliers found as (1) the time budget varies, and (2) the amount of available computing resources varies.

We picked three comparison baselines to represent the absolute upper bound and common approaches. First, Oracle has complete knowledge of the measurements for all apps (including untested ones), and it represents the upper bound. Vote picks the most popular test case at each step across all previously tested apps. In addition, as a single ordering is not optimal for all apps, we coin the terms "Vote-Best" and "Vote-Worst" for its upper and lower bounds. Finally, Random picks an untested case at random at each step.

Assessment of Outliers Found. We start by studying the question: given sufficient time to exercise AppPkgtest under x test cases, what should those x cases be to maximize the number of reported problems? The assumption is that a single test case runs for a fixed amount of time (e.g., five minutes). Figure 5 and Figure 6 illustrate the results from datasets #1 and #2, respectively. Next, we use the former to highlight three observations.

First, the total number of outliers reported by GCF is better than the non-oracle baselines, and only the best case of Vote can match this number. Second, we estimate the false positives by assuming the complete dataset (of all eight cases) as the ground truth. False positives are inevitable for all approaches because only estimates are possible without complete measurements. In addition, while GCF reports slightly more false positives initially, this count drops faster as the number of cases increases. Finally, considering both results, Figure 5(c) suggests that GCF can report more true outliers than the non-oracle baselines. Even with only eight profiles, it can find up to 21%, 8%, and 36% more true outliers than Random, Vote-Best, and Vote-Worst, respectively; the remaining gap to Oracle stems from prediction error, which we quantify next.

We quantify two sources of the prediction error by calculating the fit score relative to the oracle, i.e., the percentage of matching profiles in both sets. First, all apps start with a fixed bootstrapping set, which has a fit score of 48%. We note that this fit is better than the non-oracle approaches (e.g., 32% for Vote). Second, GCF's set selection accuracy for 4, 5, 6, 7, and 8 test cases is 53.57%, 66.94%, 75.85%, 87.95%, and 100%, respectively. The accuracy increases with the number of measurements; only the best case of Vote comes near this result.
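A small sketch of the fit-score metric as described above, i.e., the percentage of profiles shared between a scheme's selected set and the oracle's set. The exact formula is not given in the text, so this form and the profile names in the example are assumptions.

```python
def fit_score(selected, oracle):
    """Percentage of the oracle's profiles that also appear in the selected set."""
    if not oracle:
        return 0.0
    return 100.0 * len(set(selected) & set(oracle)) / len(set(oracle))

# Hypothetical example: four selected profiles compared against a four-profile oracle set.
oracle   = ["GPRS", "802.11b", "LTE", "3G-handoff"]
selected = ["GPRS", "802.11b", "cable", "EDGE"]      # bootstrap set plus one pick
print(fit_score(selected, oracle))   # 50.0: two of the four oracle profiles match
```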
Finally, we note that the degree of gain from prioritization is proportional to the degree of parameter variation between test cases. For example, compared to dataset #1 (network profiles of different physical mediums), dataset #2 (WCDMA cellular profiles from different countries) has less variation. Accordingly, Figure 6 suggests that Context Virtualizer finds only up to 12%, 7%, and 11% more true outliers than Random, Vote-Best, and Vote-Worst, respectively.

Resource Requirements. An important observation is that, since app testing is highly parallelizable, multiple apps can be exercised at the same time on different machines. At the extreme of infinite computing resources, all prioritization techniques would perform equally well, and the entire dataset could finish in one test-case time. Since this is not practical in the real world, we measure the speedup that GCF offers under various amounts of available computing resources.

Figure 7 illustrates combinations of computing resources and time required to find at least 10 potential problems in each of the 147 apps in dataset #2. First, the figure shows the total possible time for testing all combinations of apps and test cases.
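For a sense of scale, a back-of-the-envelope sketch of that total testing time: the 147 apps and the five-minute test-case duration come from this section, while the per-machine parallelism model, the case count, and the machine-pool size are assumptions for illustration.

```python
def total_test_hours(num_apps, num_cases, minutes_per_case=5, machines=1):
    """Wall-clock hours to run every (app, test case) combination, assuming each
    machine exercises one app under one test case at a time."""
    total_minutes = num_apps * num_cases * minutes_per_case
    return total_minutes / machines / 60.0

# 147 apps x 8 cases x 5 minutes: sequential vs. a hypothetical 20-machine pool.
print(total_test_hours(147, 8))               # 98.0 hours on a single machine
print(total_test_hours(147, 8, machines=20))  # 4.9 hours across 20 machines
```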
Figure 9. Crashes categorized by app source code type and targeted hardware architecture. [Bar charts of the number of outliers for HTML, Managed, x64, and Neutral app packages.]
6.3 Aggregate App Context Testing Analysis
In this section, we (1) investigate crashes and performance outliers identified via contextual fuzzing, and (2) examine the issues we have found in a set of publicly available apps that presumably had already been tested.

For these experiments, we exercised the same 147 Windows Store apps on the test cases of dataset #1, as described in §6.2. Individual apps were tested with four 5-minute rounds under both ConVirt and the context-free baseline solution described below. Since the setup is identical for each run of the same app, differences in the crashes and performance outliers detected are due to our inclusion of contextual fuzzing.

Comparison Baseline. We use a conventional UI-automation-based approach as the comparison baseline. The baseline represents the current common practice of testing mobile apps. We use the default WUIM that randomly explores the user interface, which is functionally equivalent to the Android Exerciser Monkey [13].

Summary of Findings. Overall, Context Virtualizer is able to discover significantly more crashes (11×) and performance outliers (11×) than the baseline solution. Furthermore, with Context Virtualizer, 75 out of the 147 apps tested exhibited a total of 1,170 crash incidents and 4,589 performance outliers. This result is surprising, as these production apps should have been thoroughly tested by developers.

Findings by Categories. Figure 8 and Figure 9 show the number of crashes and performance outliers categorized by app source code type and targeted hardware architecture: (1) HTML-based vs. compiled managed code, and (2) x64 vs. architecture-neutral. The observation is that Context Virtualizer is able to identify significantly more potential app problems across both categories. For example, in both categories, the difference in the number of outliers found is a factor of approximately eight.

Table 3 categorizes apps in the same way as the Windows Store. It shows that media-heavy apps (e.g., music, video, entertainment) tend to exhibit problems in multiple contexts. This observation motivates the use of contextual fuzzing in mobile app testing. Furthermore, the use of media often increases memory usage, which results in crashes.

Figure 10 shows performance outliers broken down by resource type. The figure suggests that most outliers are network or energy related. While tools for testing different networking scenarios are starting to emerge [20], the same has not yet happened for energy-related testing (which also depends heavily on the type of radios and communication protocols in use). Disk activity (i.e., I/O) and CPU appear to have approximately the same number of performance outlier cases.

6.4 Experience and Case Studies
In this section we highlight some identified problem scenarios that mobile app developers might be unfamiliar with, illustrating how a tool like Context Virtualizer can be used to prevent increasingly common issues.

Geolocation. As an increasing number of apps on mobile platforms become location-aware, services start providing location-tailored content. For example, a weather app can provide additional weather information for certain cities, such as Seattle. Another example is content restrictions in some streaming and social apps. Unfortunately, many developers are unaware of the implications of device geolocation.
[Figure 10: performance outliers broken down by resource type (Connections Active, Connections Established, Connection Resets, segments received, segments sent, TCP Tx, TCP Recv), comparing ConVirt and UI Automation.]
Network Transitions. In contrast to PCs, mobile devices are rich in mobility and physical radio options. In fact, network transitions can happen frequently throughout the day, e.g., as a device moves between Wi-Fi and cellular coverage.

Device Service Misuse. In this case ConVirt highlighted a possible energy bug in a location-tracking app. The app registers for location updates (by setting a minimum distance threshold for notifications) to avoid periodically polling the platform. Events are then signaled if the location service detects that the device has moved beyond the threshold. However, the app set the threshold to a value lower than the typical accuracy of the location providers (e.g., 5 m accuracy at best on a typical smartphone GPS [31]). This resulted in the app constantly receiving location updates and keeping the process up, which then consumed more energy than expected.
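The mismatch between the notification threshold and GPS accuracy can be reproduced with a small simulation: a stationary device still appears to move because of position noise, so a threshold below the typical error fires almost continuously. The noise model and the specific threshold values below are assumptions for illustration, not measurements from the paper.

```python
import math
import random

def simulated_fix(noise_m=5.0):
    """A stationary device's reported position, offset by Gaussian GPS noise (metres)."""
    return (random.gauss(0, noise_m), random.gauss(0, noise_m))

def count_updates(threshold_m, samples=1000, noise_m=5.0):
    """How often consecutive fixes differ by more than the app's movement threshold."""
    updates, last = 0, simulated_fix(noise_m)
    for _ in range(samples):
        fix = simulated_fix(noise_m)
        if math.dist(fix, last) > threshold_m:
            updates += 1      # the location service would signal the app here
            last = fix
    return updates

random.seed(0)
# A 2 m threshold is well below ~5 m GPS error: nearly every fix triggers an update.
print("2 m threshold:", count_updates(2.0), "updates out of 1000 fixes")
# A 50 m threshold filters the noise: the stationary device triggers almost none.
print("50 m threshold:", count_updates(50.0), "updates out of 1000 fixes")
```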
7 Discussion
We now examine some of the overarching issues related to the design of Context Virtualizer.

Generality of the System. While this paper focuses on app testing, our core ideas can be generalized to other scenarios. In privacy, applying app similarity networks to packet inspection can discover apps that transmit an abnormal amount of personal data. In energy optimization, measurements can help determine whether an app would experience significant performance degradation on a slower but more energy-efficient radio.

Real-World Mobile Context Collection. Our system relies on real-world traces to emulate mobile context. We recognize that some traces are more difficult to collect than others. For example, while data on cellular carrier performance at various locations is publicly available [23], an extensive database of how users interact with apps is not easily accessible. We leave the problem of collecting user interaction traces at large scale as future work.

Hardware Emulation Limitation. ConVirt currently only exposes coarse-grained hardware parameters: CPU clock speed and available memory. While this means that certain artifacts of the hardware architecture cannot be emulated, our system can accommodate real devices in the test client pool to achieve hardware coverage.

User Interaction Model Limitation. Our current implementation interacts with apps by invoking their user interface elements (e.g., buttons, links, etc.) through a UI automation library. One limitation is that user gestures cannot be emulated, so many games cannot be properly tested. Gesture support is left as mid-term future work.

Applicability To Other Platforms. While our current implementation works for Windows 8, the core system ideas can also work on platforms that fulfill three requirements: (1) the network stack should allow a method to manipulate incoming and outgoing packets; (2) the platform should provide a way to emulate inputs to apps, such as user touch, sensors, and GPS; and (3) the OS should provide good performance logging. Additionally, the backend network should have higher performance than the emulated network profiles. These requirements are not onerous, and Android is another readily suitable platform.

8 Related Work
A wide variety of testing methodologies already exist for discovering software, system, and protocol problems. Context Virtualizer contributes to this general area by proposing to expand the testing space for mobile devices to include a variety of real-world contextual factors. In particular, we advance prior investigations of fuzz testing [21, 11] with techniques that enable the systematic search of this new context test space; this new approach can then complement existing techniques, including static analysis and fault injection.

Mobile App Testing. In response to the strong need for improved mobile app testing tools, academics and practitioners have developed a variety of solutions. A popular approach is log analytics, with a number of companies [9, 6, 4, 30] offering such services. Although this data is often insightful, it is only collectable post-release, thus exposing users to a negative user experience.

Similarly, although AppInsight [25] enables significant new visibility into app behaviour, it is also a post-deployment solution that requires app binary instrumentation. Context Virtualizer, in contrast, enables pre-release testing of unmodified app binaries.

Emulators are also commonly used and include the ability to test coarse network conditions and sensor (GPS, accelerometer) input. [14, 20], for instance, offer such controls and allow a developer to test their app by selecting different speeds for each network type. More advanced emulator usage, such as [18], enables developers to have scripts sent to either real hardware or emulators and to select from a limited set of network conditions to apply during the test. Unlike Context Virtualizer, neither emulator systems nor manual testing offer the ability to control a wide range of context dimensions (simultaneously, if required); neither do they allow the high-fidelity emulation of contextual conditions, such as handoff between networks (e.g., 3G to Wi-Fi), that can be problematic for apps. Finally, the conditions tested are typically defined by the developers; instead, ConVirt generates the parameters for test cases automatically.

Testing methods based on UI automation, when applied to mobile app testing, adopt emulators to host applications. Android offers specialized tools [14, 13] for custom UI automation solutions. Also, significant effort has been invested into generating UI input for testing (e.g., [34, 15]) with specific goals in mind (e.g., code coverage). ConVirt allows users to define usage scenarios or falls back to random exploration; however, any automation technique can be added into our system.

Efficient State Space Exploration for Testing. A wide variety of state exploration strategies have been developed to solve key problems in domains such as distributed systems verification [35] and model checking [16]. State exploration is also a fundamental problem encountered in many testing systems. Many existing solutions assume a level of internal software access. For example, [1] explores code paths within mobile apps and reduces path explosion by merging redundant paths. Context Virtualizer treats apps as a black box,
and so such techniques do not apply. In [12], a prioritization scheme is proposed that exploits inter-app similarity between code statement execution patterns. However, ConVirt computes similarity completely differently (based on resource usage), allowing its use with black-box apps.

Simulating Real-world Conditions. ConVirt relies on high-fidelity context emulation along with real hardware. Other domains, notably sensor networks, have also developed testing frameworks (e.g., [19]) that incorporate accurate network emulation. In particular, Avrora [32] offers a cycle-accurate emulation of sensor nodes in addition to network emulation. Conceptually, ConVirt has similarity with these frameworks, but in practice the context space (low-power radio) and target devices (small-scale sensor nodes) are completely different.

9 Conclusion
This paper presents Context Virtualizer (ConVirt), an automated service for testing mobile apps using an expanded mobile context test space. By expanding the range of test conditions, we find that ConVirt is able to discover more crashes and performance outliers in mobile apps than existing tools, such as emulator-based UI automation.

10 References
[1] S. Anand, M. Naik, H. Yang, and M. Harrold. Automated concolic testing of smartphone apps. In Proceedings of the ACM Conference on Foundations of Software Engineering (FSE). ACM, 2012.
[2] Apigee. http://apigee.com.
[3] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, August 2006.
[4] BugSense. BugSense Crash Reports. http://www.bugsense.com/.
[5] Carat. http://carat.cs.berkeley.edu/.
[6] Crashlytics. Powerful and lightweight crash reporting solutions. http://www.crashlytics.com.
[7] R. Di Bernardo, R. Sales Jr., F. Castor, R. Coelho, N. Cacho, and S. Soares. Agile testing of exceptional behavior. In Proceedings of SBES 2011. SBC, 2011.
[8] Flurry. Electric Technology, Apps and The New Global Village. http://blog.flurry.com/default.aspx?Tag=market%20size.
[9] Flurry. Flurry Analytics. http://www.flurry.com/flurry-crash-analytics.html.
[10] Fortune. 40 staffers. 2 reviews. 8,500 iPhone apps per week. http://tech.fortune.cnn.com/2009/08/21/40-staffers-2-reviews-8500-iphone-apps-per-week/.
[11] P. Godefroid, M. Y. Levin, and D. Molnar. Automated whitebox fuzz testing. In Proceedings of NDSS 2008, 2008.
[12] A. Gonzalez-Sanchez, R. Abreu, H.-G. Gross, and A. J. van Gemund. Prioritizing tests for fault localization through ambiguity group reduction. In Proceedings of ASE 2011, 2011.
[13] Google. monkeyrunner API. http://developer.android.com/tools/help/monkeyrunner_concepts.html.
[14] Google. UI/Application Exerciser Monkey. http://developer.android.com/tools/help/monkey.html.
[15] F. Gross, G. Fraser, and A. Zeller. Search-based system testing: high coverage, no false alarms. In Proceedings of the International Symposium on Software Testing and Analysis (ISSTA), 2012.
[16] H. Guo, M. Wu, L. Zhou, G. Hu, J. Yang, and L. Zhang. Practical software model checking via dynamic interface reduction. In Proceedings of SOSP 2011. ACM, 2011.
[17] J. Huang, C. Chen, Y. Pei, Z. Wang, Z. Qian, F. Qian, B. Tiwana, Q. Xu, Z. Mao, M. Zhang, et al. MobiPerf: Mobile network measurement system. Technical report, University of Michigan and Microsoft Research, 2011.
[18] B. Jiang, X. Long, and X. Gao. MobileTest: A tool supporting automatic black box test for software on smart mobile devices. In H. Zhu, W. E. Wong, and A. M. Paradkar, editors, AST, pages 37–43. IEEE, 2007.
[19] P. Levis, N. Lee, M. Welsh, and D. E. Culler. TOSSIM: accurate and scalable simulation of entire TinyOS applications. In I. F. Akyildiz, D. Estrin, D. E. Culler, and M. B. Srivastava, editors, SenSys, pages 126–137. ACM, 2003.
[20] Microsoft. Simulation Dashboard for Windows Phone. http://msdn.microsoft.com/en-us/library/windowsphone/develop/jj206953(v=vs.105).aspx.
[21] B. P. Miller, L. Fredriksen, and B. So. An empirical study of the reliability of UNIX utilities. Communications of the ACM, 33(12):32, December 1990.
[22] R. Mittal, A. Kansal, and R. Chandra. Empowering developers to estimate app energy consumption. In Proceedings of the 18th Annual International Conference on Mobile Computing and Networking, MobiCom '12, pages 317–328, New York, NY, USA, 2012. ACM.
[23] Open Signal. http://opensignal.com.
[24] Open Signal. The many faces of a little green robot. http://opensignal.com/reports/fragmentation.php.
[25] L. Ravindranath, J. Padhye, S. Agarwal, R. Mahajan, I. Obermiller, and S. Shayandeh. AppInsight: mobile app performance monitoring in the wild. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, OSDI '12, pages 107–120, Berkeley, CA, USA, 2012. USENIX Association.
[26] Samsung. Series 7 11.6" Slate. http://www.samsung.com/us/computer/tablet-pcs/XE700T1A-A01US.
[27] Techcrunch. Mobile App Users Are Both Fickle And Loyal: Study. http://techcrunch.com/2011/03/15/mobile-app-users-are-both-fickle-and-loyal-study.
[28] Techcrunch. Users Have Low Tolerance For Buggy Apps: Only 16% Will Try A Failing App More Than Twice. http://techcrunch.com/2013/03/12/users-have-low-tolerance-for-buggy-apps-only-16-will-try-a-failing-app-more-than-twice.
[29] Telerik. JustDecompile. http://www.telerik.com/products/decompiler.aspx.
[30] TestFlight. Beta testing on the fly. https://testflightapp.com/.
[31] A. Thiagarajan, L. Ravindranath, K. LaCurts, S. Madden, H. Balakrishnan, S. Toledo, and J. Eriksson. VTrack: accurate, energy-aware road traffic delay estimation using mobile phones. In Proceedings of the 7th ACM Conference on Embedded Networked Sensor Systems, SenSys '09, pages 85–98, New York, NY, USA, 2009. ACM.
[32] B. L. Titzer, D. K. Lee, and J. Palsberg. Avrora: scalable sensor network simulation with precise timing. In Proceedings of the 4th International Symposium on Information Processing in Sensor Networks, IPSN '05, Piscataway, NJ, USA, 2005. IEEE Press.
[33] A. I. Wasserman. Software engineering issues for mobile application development. In Proceedings of the FSE/SDP Workshop on Future of Software Engineering Research, FoSER '10, pages 397–400, New York, NY, USA, 2010. ACM.
[34] L. Yan and H. Yin. DroidScope: Seamlessly reconstructing the OS and Dalvik semantic views for dynamic Android malware analysis. In Proceedings of the 21st USENIX Security Symposium. USENIX, 2012.
[35] J. Yang, C. Sar, and D. R. Engler. EXPLODE: A lightweight, general system for finding serious storage system errors. In B. N. Bershad and J. C. Mogul, editors, Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 131–146. USENIX Association, 2006.