[Figure 1 diagram; visible component labels: Applications, Test Intention, Test Case, Test Steps, Test Task Execution, (state, action, state), GUI Defect Dataset.]
Figure 1. The Workflow for Autonomous GUI Testing (GTArena). GUI Testing requires the model to perform specific tasks, all of which
are evaluated within this workflow. We provide a standardized and reproducible testing framework, enabling fair comparison of different
multimodal large language models.
various components to accomplish this task. Additionally, these frameworks lack standardized evaluation metrics, with variations in metric design and effectiveness measurement, and the test datasets are often small with limited availability for open access. This lack of standardization and transparency hinders further progress in GUI Testing research.

Moreover, due to the complexity of agent frameworks and the variability in responses of LLMs, reproducing results becomes challenging for others who want to evaluate the agent. Most current agent frameworks depend on GPT [3], which is not only costly but also lacks options for local deployment. They remain unsuitable for applications requiring data privacy, presenting an additional barrier to their widespread use in GUI Testing.

To address these gaps, we formalize the automated GUI testing problem by redefining the entire testing process with standard definitions and building a framework (GTArena) for fair GUI Testing evaluation. We introduce a novel data structure for representing GUI defects, enabling the construction of large-scale GUI defect datasets for future research. We formalize the entire GUI workflow, making agent evaluation standardized and reproducible. Building on this foundation, we have developed a unified benchmark for evaluating visual-based agents across various multimodal large language models (MLLMs). This benchmark not only assesses the end-to-end task performance but also evaluates each component of the process individually, demonstrating that current agents still fall short in completing these tasks. Through this approach, we can analyze the specific performance gaps between GPT-based agents and other multimodal large models, providing insights that may guide future improvements. Furthermore, while benchmarks for multimodal large models typically assess specific capabilities, when applying these models in specific domains, only the end metrics are considered. Therefore, we propose a new method: evaluating models fine-tuned on subsets of datasets on general capability benchmarks and comparing these results to the original models. This analysis aids in identifying the necessary enhancements for the models to excel in specific tasks, facilitating further refinement of their capabilities.

The contributions of this paper are as follows:
• We establish a formalized end-to-end framework as a fair environment for Autonomous GUI Testing (GTArena) and introduce a novel data structure for GUI defects, which together redefine the complete testing process and enable the construction of large-scale GUI defect datasets. This dual development not only decouples the agent but also facilitates rigorous and reproducible evaluations of GUI Testing methodologies.
• We develop a comprehensive and standardized benchmark for evaluating agents across multiple components within the automated GUI Testing framework, providing detailed insights into the agents' performance relative to human testers.
• We propose a methodology for assessing the specific capabilities that models need to excel in Autonomous GUI Testing, enabling targeted enhancements of these models through broader and more diverse training datasets.

2. Autonomous GUI Testing Agent Workflow
While the idea of casually handing over an app to an agent with a simple <Test this app for me> sounds appealing, the reality of fully automated GUI Testing is far more intricate. In order to approach automated GUI Testing effectively, it is crucial to deconstruct the testing workflow much like a skilled GUI Testing engineer would. The initial step involves defining testing objectives: primarily identifying GUI defects that pose the highest risk to user experience. This process begins with the predefinition of core tasks that reflect the most likely user interactions within an app. By executing these tasks on the app interface and

Algorithm 1 Pseudocode of Workflow

    class State:
        def act(self, action):
            # Returns a new state based on the action. (Pseudocode: the
            # concrete environment executes the action, e.g. a click or
            # text input, and captures the resulting screen as a State.)
            return new_state

    class TransitionTuple:
        def __init__(self, state_before, action, state_after):
            self.state_before = state_before
            self.action = action
            self.state_after = state_after
[Figure 2: composition of the benchmark; visible labels include "Applications with Artificial GUI Defects", "GUI Defect Dataset", and "(state, action, state)".]

2.2. Test Case Generation
Test Case Generation can be viewed as the process of creating a structured chain of states and actions aligned with the testing intentions. Given a mobile application, the agent needs to generate a comprehensive test case that includes both the test intention and corresponding test steps.
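As a minimal sketch of this data layout (class and field names here are illustrative, not the released implementation), a test case can be held as a test intention plus the ordered transitions that realize it:

    # Illustrative sketch of a test case: a test intention plus the ordered
    # chain of recorded transitions that realizes it.
    class TestCase:
        def __init__(self, intention):
            self.intention = intention   # e.g., "add a new expense entry"
            self.transitions = []        # ordered TransitionTuple objects

        def test_steps(self):
            # The executed action sequence doubles as explicit test steps,
            # e.g., ["click(3)", "input(5, 'lunch')", "click(7)"].
            return [t.action for t in self.transitions]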
Figure 3. Examples of Constructed Synthetic GUI Defects. We present examples of various constructed GUI defects, demonstrating the
feasibility of synthesizing defects through post-processing. This approach highlights a method for building large-scale GUI defect datasets,
including both display and interaction defects.
integration across all phases of the testing process.

Given the complexities involved in collecting data from real-world applications, where the defects are inherently unpredictable, our benchmark consists of three carefully curated data categories: real-world mobile applications, applications with injected defects, and synthetic defect datasets. Figure 2 illustrates the composition of our benchmark, highlighting the balanced representation across each category. The detailed data distribution is shown in Table 1.

3.1. Real-world Applications with GUI Defects
Real-world applications serve as a crucial component of our benchmark by providing insights into naturally occurring GUI defects. These defects are identified by mining upgrade logs and issue trackers from open-source repositories on GitHub [15]. Our approach involves systematically filtering and extracting relevant projects based on the descriptions in their change logs and issue reports, specifically focusing on entries related to GUI defects. This targeted filtering ensures that the selected applications contain genuine and relevant defects within their user interfaces.
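As an illustration, this filtering step can be as simple as a keyword screen over each project's issue reports; the keyword list and the issue-record layout below are illustrative assumptions, not the exact filter implementation.

    # Sketch of the GUI-defect filtering over change logs and issue reports.
    # Keywords and the issue dictionary fields are illustrative assumptions.
    GUI_DEFECT_KEYWORDS = ("ui", "layout", "overlap", "render", "display",
                           "icon", "button", "screen", "color", "theme")

    def mentions_gui_defect(title, body):
        text = (title + " " + body).lower()
        return any(keyword in text for keyword in GUI_DEFECT_KEYWORDS)

    def select_projects(projects):
        # Keep projects whose issue trackers report at least one GUI defect.
        return [project for project in projects
                if any(mentions_gui_defect(issue["title"], issue["body"])
                       for issue in project["issues"])]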
For each identified project, we document essential application details, such as the version history, defect type, and issue severity, to build a comprehensive profile of the detected defects. To validate these defects, we carefully reproduce the reported issues, ensuring that the GUI defects are replicable and align with the descriptions provided by the developers. By leveraging this methodology, we not only ensure the relevance and authenticity of the collected data but also capture a wide variety of defect types reflective of real-world scenarios. These defects, ranging from subtle layout misalignments to critical interaction failures, offer a diverse testing ground for evaluating the automated GUI Testing framework.

3.2. Applications with Injected Defects
Real-world applications exhibit a wide range of complex and unpredictable GUI defects, making it difficult to ensure consistency across testing scenarios. To address this, we inject defects at the source code level or use the MutAPK tool [13, 14] to introduce controlled, predefined GUI defects into mobile applications [1, 2]. The injection of specific defects allows us to maintain strict control over the testing framework's results. Introducing defects in various areas not only ensures consistency but also creates diverse fault scenarios. The controlled nature of this method enables repeatable experiments, helping researchers systematically explore the strengths and weaknesses of different testing models. Additionally, by increasing the complexity of the injected defects, we can push the boundaries of the agent, ensuring it can handle the kinds of diverse challenges found in real-world applications.

Therefore, we propose an evaluation method that fine-tunes models on datasets specific to GUI Testing tasks and then assesses their performance on broad, standardized benchmarks. This comparative analysis, contrasting a model's pre- and post-fine-tuning performance, offers insights into the capabilities that are most relevant to specific stages of the GUI Testing process. For example, improvements in benchmarks focused on perception may indicate the model's enhanced ability to identify subtle layout issues, while gains in reasoning-oriented benchmarks might reflect better handling of navigation errors or task flows. Additionally, real-world GUI Testing tasks often suffer from data scarcity. Our method provides a pathway for expanding datasets by strategically selecting general datasets aligned with the task's requirements.

This framework ties back to the core goal of our paper: building a comprehensive, end-to-end benchmark for automated GUI Testing. By bridging the gap between general benchmark performance and task-specific outcomes, we offer a practical methodology for identifying the key capabilities that matter most. This not only enhances the reliability of visual-based agents for automated GUI Testing but also lays the foundation for continuous model improvement.
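The comparison itself is straightforward; a minimal sketch follows, where `evaluate(model, benchmark)` is a hypothetical helper standing in for running one general-capability benchmark and returning a scalar score.

    # Sketch of the pre-/post-fine-tuning comparison across benchmarks.
    def capability_shift(base_model, finetuned_model, benchmarks, evaluate):
        # Positive deltas suggest capabilities the GUI Testing data reinforced;
        # negative deltas flag general capabilities that may have regressed.
        return {name: evaluate(finetuned_model, bench) - evaluate(base_model, bench)
                for name, bench in benchmarks.items()}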
                        Test Intention Gen.   Test Task Execution                          GUI Defect Detection
    Model               Coverage              TM            EM            SR               Accuracy      Recall-D      Recall-N
    LLaVA (LLaMA3-8B)   17.12                 31.70         4.40          3.28             24.90         6.00          69.0
    Qwen2-VL (Qwen2-7B) 20.37                 35.40         13.20         11.73            30.10         0.14          100.0 (+20)
    Claude (3.5-Sonnet) 36.18                 48.88 (+7.3)  21.50 (+2.1)  20.45 (+4.3)     7.3           10.29         0.33
    GPT-4o (2024-02-01) 37.01 (+0.8)          41.60         19.40         16.14            33.80 (+3.7)  14.00 (+3.7)  80.0

Table 2. Comparison of Different Multimodal Large Language Models as Agents for Autonomous GUI Testing. The numbers in parentheses (green in the original table) represent the difference between the best result and the second-best result. TM, EM, and SR denote Type Match, Exact Match, and Success Rate, respectively. Since several models tend to respond with "no defect" during GUI defect detection, we selected both defective and non-defective data to calculate the recall metrics, denoted as Recall-D and Recall-N, respectively.
    Task Execution SR under different instructions
    Instruction       GPT-4o    LLaVA    Qwen2-VL
    Test Intention    16.14     3.28     11.73
    Test Steps        16.39     6.81     19.80

Table 3. SR of VLLMs on Different Instructions for Test Task Execution. We use two types of instructions, test intention and test steps, to compare the task completion performance of the models under each instruction type.

details are accurate. For instance, in test step click(3), where 3 is the element ID within the image, the model needs to identify the correct element to click. For a test task, if the action is entirely correct for each tuple, we consider the agent to have successfully completed the task. SR represents the percentage of tasks that the model successfully completes out of all tasks.

Accuracy and Recall. For GUI defect detection, we evaluate accuracy and recall metrics. The test set includes both data with GUI defects and normal data (represented in our defined triplet form). For Accuracy, we calculate the proportion of correct judgments made by the model across all data. Recall is further divided into two metrics: Recall-D, which measures the model's correct judgment rate on data with GUI defects, and Recall-N, which assesses the model's performance on normal data. These three metrics allow us to thoroughly analyze the performance of different multimodal large language models in the task of GUI defect detection.
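Concretely, the three metrics can be computed from per-sample judgments as in the sketch below (our own illustration; `predictions` and `labels` are booleans where True means "has a GUI defect").

    # Accuracy, Recall-D, and Recall-N over a mixed defective/normal test set.
    def detection_metrics(predictions, labels):
        pairs = list(zip(predictions, labels))
        accuracy = sum(p == l for p, l in pairs) / len(pairs)
        on_defective = [p for p, l in pairs if l]       # samples with real defects
        on_normal = [p for p, l in pairs if not l]      # defect-free samples
        recall_d = sum(on_defective) / len(on_defective)            # Recall-D
        recall_n = sum(not p for p in on_normal) / len(on_normal)   # Recall-N
        return accuracy, recall_d, recall_n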
5.2. Baseline Autonomous GUI Testing
To establish a baseline, we select models with significantly different architectures and native multimodal capabilities for comparison, including GPT-4o, Claude, LLaVA, and Qwen2-VL.

The experimental results, shown in Table 2, indicate that GPT-4o and Claude perform comparably across most metrics and outperform the open-source models. Although the difference in metrics between closed-source commercial models and open-source models is relatively small, the overall poor performance still highlights a significant gap between open-source and closed-source models. In the test intention generation metric, LLaVA and Qwen2-VL lag considerably behind GPT-4o and Claude, suggesting that tasks requiring extensive knowledge and imaginative capabilities still necessitate models with sufficiently large parameters to produce richer responses. The differences in TM and EM performance further support this point: while open-source models can answer some matching tasks correctly, LLaVA struggles significantly in higher-difficulty tasks involving exact matches, whereas Qwen2-VL performs slightly better due to prior training on GUI data.

For GUI defect detection, we use a mix of data with and without defects to simulate the agent performing defect detection tasks in a real-world scenario. Since Qwen2-VL predominantly responds with "no defect", it achieves a high recall rate on normal data but performs poorly on data with actual defects. In contrast, Claude tends to assume that GUI defects exist in the data. This outcome underscores the necessity of evaluating both metrics for a comprehensive assessment.

5.3. Ablation Study on Test Task Execution
In Section 2.2, we discussed how the action sequence performed by the agent under the guidance of a test intention can be used to construct test steps, thereby generating a complete test case. This raises the question: does using test steps as instructions enable the agent to complete tasks more effectively than using the test intention? To explore this, we concatenated the action sequences within individual test tasks to form the corresponding test steps for each task. We then evaluated the performance of GPT-4o, LLaVA, and Qwen2-VL on test task execution under these two types of instructions.

The experimental results in Table 3 show that both LLaVA and Qwen2-VL exhibited performance improvements when given test steps as instructions. GPT-4o, however, showed minimal change in performance, even performing worse than Qwen2-VL when using test steps as instructions. This suggests that the limitation in GPT-4o's performance on test task execution does not stem from its ability to accurately interpret test intentions but rather from the inherent complexity of GUI interfaces. Qwen2-VL, having been trained on GUI data, benefits from the clarity in identifying the next action to execute, resulting in a more significant performance boost when provided with explicit test steps.
6. Related Work
Agent on GUI navigation task. Early GUI agents [16, 22, 23, 35] primarily relied on training models to explore GUI environments with task completion as the main objective. With advances in multimodal large models [3, 6], current approaches have shifted from traditional training to using techniques like prompt tuning [21] and in-context learning [46] to guide these models in exploration tasks. AppAgent [44], Mobile-Agent [38], and AutoDroid [40] utilize LLMs to interpret natural language descriptions and transform them into GUI actions. Additionally, some work [10, 17] has focused on fine-tuning large models on GUI-specific data to improve their performance in GUI environments. Agent workflow memory [39] represents another recent innovation, enhancing agents' ability to automatically construct workflows, thus introducing a new paradigm for GUI task automation.

There has also been progress in creating benchmarks for GUI navigation [12, 23, 35]. WebArena [47], for instance, constructs realistic web environments with callable tools to study the limitations of models like GPT-4V, revealing a significant gap in agent performance compared to humans in complex tasks. AitW [34] collected large-scale data by having annotators operate apps in a simulator to capture human instruction-following behavior, though data quality remains a concern. Building on this, AitZ [45] introduced high-quality navigation data with GPT-4-annotated Chain-of-Thought (CoT) reasoning, along with new metrics to evaluate agent performance in GUI navigation tasks.

GUI Defect Detection. Given the close connection between GUI quality and user experience, various methods have been developed to detect bugs in GUIs. GUI Testing in industry relies heavily on scripted tests to automate function validation. To address this, AppFlow [18] applies machine learning to identify screen components, allowing testers to develop modular libraries for core application functions. CoSer [9] constructs UI state transition graphs from source code and scripts to repair outdated tests. Recently, LLMs have emerged as powerful tools in GUI Testing due to their extensive training on diverse data and strong reasoning abilities. For example, QTypist [25] focuses on generating semantic text inputs for form fields to improve exploration coverage. GPTDroid [27] extracts page and widget information from the UI hierarchy, using it to create human-like interactions. AUITestAgent [19] developed an industry-applicable automatic natural-language-driven GUI Testing method. VisionDroid [28] addresses non-crash bug detection in GUIs by leveraging LLMs to detect unexpected behaviors, particularly in scenarios where testing oracles are lacking.

7. Conclusion
This paper introduces a formalized framework for Autonomous GUI Testing, aimed at addressing key limitations in visual-based agent evaluation. By structuring the testing workflow with precise mathematical definitions and decoupling GUI defect detection from task execution, we present a fair and robust environment (GTArena) for evaluating GUI Testing capabilities. Our work includes a novel data structure for capturing GUI defects, which facilitates the creation of large-scale datasets.

Furthermore, we propose a unified benchmark to assess visual-based agents equipped with multimodal large models (MLLMs), evaluating their performance across core components: test case generation, test task execution, and GUI defect detection. Through this structured benchmark, we reveal notable performance gaps between current agents and practical application for mainstream VLLMs, underscoring the need for targeted model improvements. Additionally, our methodology offers a systematic approach for fine-tuning models on task-specific datasets while evaluating their general capabilities on broader benchmarks.

In conclusion, our work provides a fair, unified, and end-to-end environment for automated GUI Testing, enabling convenient and reproducible evaluation of various multimodal large models in their role as agents. By bridging the gap between theoretical frameworks and practical evaluations, we aim to accelerate the development of more capable, reliable, and efficient agents for GUI Testing applications.

References
[1] F-droid: Free and open source android app repository, 2024. Accessed: 2024-11-13.
[2] Google play store, 2024. Accessed: 2024-11-13.
[3] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[4] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
[5] Emil Alégroth, Arvid Karlsson, and Alexander Radway. Continuous integration and visual gui testing: Benefits and drawbacks in industrial practice. In 2018 IEEE 11th International Conference on Software Testing, Verification and Validation (ICST), pages 172–181. IEEE, 2018.
[6] AI Anthropic. Introducing the next generation of claude, 2024.
[7] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 1(2):3, 2023.
[8] Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, et al. Windows agent arena: Evaluating multi-modal os agents at scale. arXiv preprint arXiv:2409.08264, 2024.
[9] Shaoheng Cao, Minxue Pan, Yu Pei, Wenhua Yang, Tian Zhang, Linzhang Wang, and Xuandong Li. Comprehensive semantic repair of obsolete gui test scripts for mobile applications. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pages 1–13, 2024.
[10] Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. Seeclick: Harnessing gui grounding for advanced visual gui agents. arXiv preprint arXiv:2401.10935, 2024.
[11] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023.
[12] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36, 2024.
[13] Camilo Escobar-Velásquez, Michael Osorio-Riaño, and Mario Linares-Vásquez. Mutapk: Source-codeless mutant generation for android apps. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 1090–1093. IEEE, 2019.
[14] Camilo Escobar-Velásquez, Diego Riveros, and Mario Linares-Vásquez. Mutapk 2.0: A tool for reducing mutation testing effort of android apps. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 1611–1615, 2020.
[15] GitHub. Github repository of applications, 2024. Accessed: 2024-11-13.
[16] Izzeddin Gur, Ulrich Rueckert, Aleksandra Faust, and Dilek Hakkani-Tur. Learning to navigate the web. arXiv preprint arXiv:1812.09195, 2018.
[17] Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. Cogagent: A visual language model for gui agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14281–14290, 2024.
[18] Gang Hu, Linjie Zhu, and Junfeng Yang. Appflow: Using machine learning to synthesize robust, reusable ui tests. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 269–282, 2018.
[19] Yongxiang Hu, Xuan Wang, Yingchuan Wang, Yu Zhang, Shiyu Guo, Chaoyi Chen, Xin Wang, and Yangfan Zhou. Auitestagent: Automatic requirements oriented gui function testing. arXiv preprint arXiv:2407.09018, 2024.
[20] Valéria Lelli, Arnaud Blouin, and Benoit Baudry. Classifying and qualifying gui defects. In 2015 IEEE 8th International Conference on Software Testing, Verification and Validation (ICST), pages 1–10. IEEE, 2015.
[21] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.
[22] Gang Li and Yang Li. Spotlight: Mobile ui understanding using vision-language models with a focus. arXiv preprint arXiv:2209.14927, 2022.
[23] Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, and Jason Baldridge. Mapping natural language instructions to mobile ui action sequences. arXiv preprint arXiv:2005.03776, 2020.
[24] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.
[25] Zhe Liu, Chunyang Chen, Junjie Wang, Xing Che, Yuekai Huang, Jun Hu, and Qing Wang. Fill in the blank: Context-aware automated text input generation for mobile gui testing. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 1355–1367. IEEE, 2023.
[26] Zhe Liu, Chunyang Chen, Junjie Wang, Mengzhuo Chen, Boyu Wu, Xing Che, Dandan Wang, and Qing Wang. Chatting with gpt-3 for zero-shot human-like mobile automated gui testing. arXiv preprint arXiv:2305.09434, 2023.
[27] Zhe Liu, Chunyang Chen, Junjie Wang, Mengzhuo Chen, Boyu Wu, Xing Che, Dandan Wang, and Qing Wang. Make llm a testing expert: Bringing human-like interaction to mobile gui testing via functionality-aware decisions. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pages 1–13, 2024.
[28] Zhe Liu, Cheng Li, Chunyang Chen, Junjie Wang, Boyu Wu, Yawen Wang, Jun Hu, and Qing Wang. Vision-driven automated mobile gui testing via multimodal large language model. arXiv preprint arXiv:2407.03037, 2024.
[29] Z. Liu, J.J. Wang, C.Y. Chen, X. Che, Y.H. Su, and Q. Wang. Empirical study on ui display issue detection in mobile applications. Journal of Software, 35(11):5040–5064, 2024. (in Chinese).
[30] Xinbei Ma, Zhuosheng Zhang, and Hai Zhao. Comprehensive cognitive llm agent for smartphone gui automation. arXiv preprint arXiv:2402.11941, 2024.
[31] MobileLLM. Droidbot-gpt: A lightweight model-driven tool for automated gui testing on android. https://github.com/MobileLLM/DroidBot-GPT, 2023. GitHub repository.
[32] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
[33] Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, and Rafael Rafailov. Agent q: Advanced reasoning and learning for autonomous ai agents. arXiv preprint arXiv:2408.07199, 2024.
[34] Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. Androidinthewild: A large-scale dataset for android device control. Advances in Neural Information Processing Systems, 36, 2024.
[35] Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, and Percy Liang. World of bits: An open-domain platform for web-based agents. In International Conference on Machine Learning, pages 3135–3144. PMLR, 2017.
[36] Ting Su, Jue Wang, and Zhendong Su. Benchmarking automated gui testing for android against real-world bugs. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 119–130, 2021.
[37] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[38] Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent: Autonomous multi-modal mobile device agent with visual perception. arXiv preprint arXiv:2401.16158, 2024.
[39] Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent workflow memory. arXiv preprint arXiv:2409.07429, 2024.
[40] Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, and Yunxin Liu. Empowering llm to use smartphone for intelligent task automation. arXiv preprint arXiv:2308.15272, 2023.
[41] Yiheng Xiong, Mengqian Xu, Ting Su, Jingling Sun, Jue Wang, He Wen, Geguang Pu, Jifeng He, and Zhendong Su. An empirical study of functional bugs in android apps. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 1319–1331, 2023.
[42] Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, and Zhe Gan. Ferret-ui: Grounded mobile ui understanding with multimodal llms. arXiv e-prints, pages arXiv–2404, 2024.
[43] Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, and Zhe Gan. Ferret-ui: Grounded mobile ui understanding with multimodal llms. In European Conference on Computer Vision, pages 240–255. Springer, 2025.
[44] Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. Appagent: Multimodal agents as smartphone users. arXiv preprint arXiv:2312.13771, 2023.
[45] Jiwen Zhang, Jihao Wu, Yihua Teng, Minghui Liao, Nuo Xu, Xiao Xiao, Zhongyu Wei, and Duyu Tang. Android in the zoo: Chain-of-action-thought for gui agents. arXiv preprint arXiv:2403.02713, 2024.
[46] Longtao Zheng, Rundong Wang, Xinrun Wang, and Bo An. Synapse: Trajectory-as-exemplar prompting with memory for computer control. In The Twelfth International Conference on Learning Representations, 2023.
[47] Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023.

A. GUI Defect Types
The specific types of GUI defects and examples are shown in Tables 4 and 5.

B. GUI Defect Dataset Examples
Some real-world defects from GitHub releases are listed in Table 6. Examples of artificially injected defects and an episode from AitW with defects are shown in Figure 4 and Figure 5.
Table 4. GUI defect categories, descriptions, and examples.

Display defects:
- Content Error. This defect involves text that appears as garbled or unintelligible characters on the screen, making information difficult to read or understand. Example: replace content in string.xml with 'null'.
- Data Type or Format Error. This defect occurs when data is displayed in inappropriate or unexpected formats, which can lead to misinterpretation or difficulty in understanding the data. Example: letters are allowed to be entered in the date input field; the page shows the date "2021-06-15" as "20210615".

Layout defects:
- UI Element Missing. This defect refers to the absence of crucial UI elements within the interface, which can hinder user interaction or functionality. Example: an image is not loaded or is displayed broken; the "New" page lacks a save button.
- UI Element Overlapping. This defect describes scenarios where UI components overlap one another, obscuring content and potentially making certain functions inaccessible. Example: the labels for "Total Expenditure" and "Remaining Budget" overlap.
- Alignment Issue. This defect is identified when UI elements are not properly aligned, leading to a visually disorganized interface that can detract from user experience. Example: in a center-aligned navigation bar, one item is right-aligned.
- Uneven Spacing. This defect is characterized by irregular spacing between UI elements, which can create a cluttered or unbalanced appearance, affecting the aesthetics and usability. Example: two elements are spaced too far apart, resulting in a large area of whitespace.

Style defects:
- Inconsistent Color. This defect arises when the color scheme of UI elements is mismatched or poorly chosen, potentially leading to a visually unappealing or confusing interface. Example: most of the icon colors in the navigation bar are the same, with a few exceptions.
- Inconsistent Element Size. This defect pertains to UI elements that vary significantly in size, which can confuse users and disrupt the visual flow of the application, affecting usability. Example: some fonts are too large while others are too small.
- Abnormal UI Element State. This defect involves UI elements that display unexpected behaviors or appearances when they are interacted with, such as being clicked or focused, which can confuse users or hinder interaction. Example: the submit button appears in an active state although it is not being clicked.
Table 6. Real-world display and interaction defects collected from GitHub releases ("/" marks no defect of that kind in the release).

Release  | Display Defects | Interaction Defects
v1.21.0  | Stickers from Gboard have black background (fixed); mxc reactions not rendered correctly (fixed) | Broken localization with empty strings in it (fixed)
v1.17.2  | / | Add cancel button to key request dialog; Encode component for links correctly; Forward arbitrary message content; Open public room bottom sheet by alias
v3.0     | Song placeholder icon in player view | /
v2.0     | Launcher icon background color | Disable favourite button for local songs
v1.0     | Color of status and navigation bar; Splash screen background color in dark mode | /
v6.0.0   | / | Top/Recent Artists/Albums not updating (wrong sort order); All Blacklist-related crashes; Restart button not working in crash activity
v5.8.4   | / | Crash when adding folders to blacklist
v5.8.3   | Incorrect song data in notification | /
v5.8.0   | / | Settings change not reflecting immediately; Crash when clicking on Playlist in the Search Tab
v5.6.0   | Incorrect colors when no cover art is available; Blank album cover bug | Lockscreen dragging glitch; Favorite not updating when song is changed; Playlist not getting created & playlist creation crash with same name; Bug in "Plain" Now Playing theme where onClick event is consumed by the views behind the bottom sheet
[Figure 4. Examples of artificially injected defects; panel annotations:
- Content Error: the screenshot contains garbled or unreadable text.
- Content Error: the unit type error in "Total Distance Recorded".
- UI Element Missing: UI element 6 is missing an icon.
- UI Element Overlapping: the "Settings" header is overlapping.
- Alignment Issue: btn 2 and btn 3 are not aligned.
- Uneven Spacing: there is an abnormal blank in the middle of the screenshot.
- Inconsistent Color: the color scheme of the icons is mismatched.
- Inconsistent Element Size: the size of UI element 5's icon is abnormal.
- Abnormal UI Element State: the state of buttons 5 and 6 is abnormal (disabled).
- Operation No Response: there is more content below, but the screen could not scroll up and down.
- Virtual Keyboard Related Issue: the virtual keyboard cannot shut down automatically.]
[Figure 5. An episode from AitW with injected defects, annotated EL_MISSING|7, CONTENT_ERROR|29, CONTENT_ERROR|46.]