
GUI Testing Arena: A Unified Benchmark for Advancing Autonomous GUI Testing Agent

Kangjia Zhao1  JiaHui Song1  Leigang Sha1  HaoZhan Shen1
Chen Zhi1*  Tiancheng Zhao2,3  Xiubo Liang1  Jianwei Yin1

1 College of Computer Science and Technology, Zhejiang University
2 Om AI Research   3 Binjiang Institute of Zhejiang University

arXiv:2412.18426v1 [cs.AI] 24 Dec 2024

{konkaz, songjah, shaleigang, hz_shen, zjuzhichen, xiubo, zjuyjw}@zju.edu.cn
[email protected]

* Corresponding author
Abstract

Nowadays, research on GUI agents is a hot topic in the AI community. However, current research focuses on GUI task automation, limiting the scope of applications in various GUI scenarios. In this paper, we propose a formalized and comprehensive environment to evaluate the entire process of automated GUI Testing (GTArena), offering a fair, standardized environment for consistent operation of diverse multimodal large language models. We divide the testing process into three key subtasks: test intention generation, test task execution, and GUI defect detection, and construct a benchmark dataset based on these to conduct a comprehensive evaluation. It evaluates the performance of different models using three data types: real mobile applications, mobile applications with artificially injected defects, and synthetic data, thoroughly assessing their capabilities in this relevant task. Additionally, we propose a method that helps researchers explore the correlation between the performance of multimodal large language models in specific scenarios and their general capabilities in standard benchmark tests. Experimental results indicate that even the most advanced models struggle to perform well across all sub-tasks of automated GUI Testing, highlighting a significant gap between the current capabilities of Autonomous GUI Testing and its practical, real-world applicability. This gap provides guidance for the future direction of GUI Agent development. Our code is available at https://github.com/ZJU-ACES-ISE/ChatUITest.

1. Introduction

"Imagine a software test engineer casually leaning back in their chair, saying, 'Alright, test this app for me,' and an agent springs into action—generating test cases, executing tasks, sniffing out bugs—and delivers a complete report, all without breaking a sweat."

This envisioned scenario, where agents augment human efforts in software testing, is becoming increasingly plausible as large language models (LLMs) [11, 32, 37] and vision LLMs (VLLMs) [3, 4, 7, 24] emerge as potent tools for automating complex processes. The integration of traditional agents with the cognitive capabilities of LLMs or VLLMs represents a cutting-edge direction in contemporary research, with numerous studies [10, 17, 30, 33] focusing on enhancing navigation frameworks for web and app interfaces. Current methodologies predominantly revolve around singular tasks like <assist me with a purchase> or <log in and post a tweet>, which are executed by these agents.

However, these applications often suffer from a narrow focus, with task complexity increased only through ambiguous instructions or added steps, approaches that do not fundamentally enhance the agent's capabilities. This limitation highlights the potential application of these technologies in automated GUI Testing, a field of considerable practical importance and complexity, which presents comprehensive challenges to the capabilities of current agents.

Automated GUI Testing using LLMs [26, 27, 31] has gained substantial traction in recent research. Recently, the focus has shifted towards leveraging MLLMs in place of previous agents, enabling these models to "see" and interact with GUI elements visually. This approach has garnered attention in both the AI research community [42, 43] and the software engineering (SE) community [19, 28], leading to various exploratory studies and small-scale validations in industry. As envisioned at the beginning of the paper, the ultimate goal of this research direction is to achieve end-to-end automation in GUI Testing.
Figure 1. The Workflow for Autonomous GUI Testing (GTArena). GUI Testing requires the model to perform specific tasks, all of which are evaluated within this workflow. We provide a standardized and reproducible testing framework, enabling fair comparison of different multimodal large language models.

However, existing agents for automated GUI Testing tend to rely on complex framework designs, introducing various components to accomplish this task. Additionally, these frameworks lack standardized evaluation metrics, with variations in metric design and effectiveness measurement, and the test datasets are often small with limited availability for open access. This lack of standardization and transparency hinders further progress in GUI Testing research.

Moreover, due to the complexity of agent frameworks and the variability in responses of LLMs, reproducing results becomes challenging for others who want to evaluate the agent. Most current agent frameworks depend on GPT [3], which is not only costly but also lacks options for local deployment. They remain unsuitable for applications requiring data privacy, presenting an additional barrier to their widespread use in GUI Testing.

To address these gaps, we formalize the GUI automated testing problem by redefining the entire testing process with standard definitions and building a framework (GTArena) for fair GUI Testing evaluation. We introduce a novel data structure for representing GUI defects, enabling the construction of large-scale GUI defect datasets for future research. We formalize the entire GUI workflow, making agent evaluation standardized and reproducible. Building on this foundation, we have developed a unified benchmark for evaluating visual-based agents across various multimodal large language models (MLLMs). This benchmark not only assesses the end-to-end task performance but also evaluates each component of the process individually, demonstrating that current agents still fall short in completing these tasks. Through this approach, we can analyze the specific performance gaps between GPT-based agents and other multimodal large models, providing insights that may guide future improvements.

Furthermore, while benchmarks for multimodal large models typically assess specific capabilities, when applying these models in specific domains, only the end metrics are considered. Therefore, we propose a new method: evaluating models fine-tuned on subsets of task-specific datasets against general capability benchmarks and comparing these results to those of the original models. This analysis aids in identifying the necessary enhancements for the models to excel in specific tasks, facilitating further refinement of their capabilities.

The contributions of this paper are as follows:
• We establish a formalized end-to-end framework as a fair environment for Autonomous GUI Testing (GTArena) and introduce a novel data structure for GUI defects, which together redefine the complete testing process and enable the construction of large-scale GUI defect datasets. This dual development not only decouples the agent but also facilitates rigorous and reproducible evaluations of GUI Testing methodologies.
• We develop a comprehensive and standardized benchmark for evaluating agents across multiple components within the automated GUI Testing framework, providing detailed insights into the agents' performance relative to human testers.
• We propose a methodology for assessing the specific capabilities that models need to excel in Autonomous GUI Testing, enabling targeted enhancements of these models through broader and more diverse training datasets.

2. Autonomous GUI Testing Agent Workflow
While the idea of casually handing over an app to an agent with a simple <Test this app for me> sounds appealing, the reality of fully automated GUI Testing is far more intricate. In order to effectively approach automated GUI Testing, it is crucial to deconstruct the testing workflow much like a skilled GUI Testing engineer would. The initial step involves defining testing objectives—primarily identifying GUI defects that pose the highest risk to user experience. This process begins with the predefinition of core tasks that reflect the most likely user interactions within an app. By executing these tasks on the app interface and closely monitoring for GUI defects, a comprehensive assessment can be achieved. When leveraging agents to simulate this structured testing process, the workflow of automated GUI Testing can be divided into three main phases: Test Case Generation, where potential user interactions and testing scenarios are designed; Test Task Execution, in which the agent performs these tasks across the GUI; and GUI Defect Detection, a critical phase to identify any interface issues that may impair usability or functionality. The specific workflow process is presented in Figure 1.

Algorithm 1 Pseudocode of Workflow

class State:
    def act(self, action):
        # Returns a new state based on the action
        new_state = State()
        return new_state

class TransitionTuple:
    def __init__(self, state_before, action, state_after):
        self.state_before = state_before
        self.action = action
        self.state_after = state_after

    def check_defect(self):
        # Checks and returns whether there is a defect; a real
        # implementation compares state_before and state_after
        defect_found = False
        return defect_found

def log_defect(transition):
    # Log the defective transition
    print("Defect detected in transition:", transition)

# Process for simulating the automated GUI Testing
def simulate_gui_testing(initial_state, actions):
    for action in actions:
        next_state = initial_state.act(action)
        transition = TransitionTuple(initial_state, action, next_state)

        if transition.check_defect():
            log_defect(transition)

        initial_state = next_state  # Update the current state

# Usage example
actions = ['click', 'scroll', 'type']
initial_state = State()
simulate_gui_testing(initial_state, actions)

2.1. Preliminary

Current research on Automated GUI Testing predominantly concentrates on resolving specific issues within the domain. However, a rigorous definition and structured framework have been largely absent. To address this gap, we have formalized the process of this task through a Visual-based Agent, offering a novel perspective that redefines the task. Algorithm 1 provides the pseudocode implementation of the architecture.

2.1.1 Partially Observable Markov Decision Process

The cornerstone of our framework is the partially observable Markov decision process (POMDP), which serves as the foundational model for describing the decision-making process of the Visual-based Agent in GUI Testing scenarios. A POMDP is defined by a tuple (S, O, A, T, R), where S denotes the state space of the application's GUI, O represents the observation space, A is the set of possible actions, T : S × A → Δ(S) is the transition function, mapping each state–action pair to a probability distribution over successor states, and R : S × A × S → ℝ is the reward function. In the context of Autonomous GUI Testing, O denotes a partial observation of the app's current state. Due to inherent limitations, a GUI agent cannot fully capture all state information, particularly for closed-source applications where key elements, such as the Accessibility Tree, are inaccessible. This partial observability impacts the likelihood of detecting issues during GUI Defect Detection, as the agent relies on limited feedback to infer potential defects. Detection likelihood, therefore, depends on whether a defect is observable following a specific action. To address this challenge, the reward function R is designed to assign positive rewards for successfully detecting GUI defects, incentivizing the agent to prioritize actions that uncover critical issues affecting user experience. This approach promotes the identification and resolution of impactful defects, despite the agent's incomplete view of the app's underlying state.
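As a concrete illustration of this reward design, the following minimal sketch (our own, not the paper's code) assigns positive reward only to defect-revealing transitions; the defect_observable oracle is a hypothetical stand-in for whatever check the agent can actually perform on its partial observations:

# Illustrative sketch of the POMDP reward R : S x A x S -> R described
# above. defect_observable is a hypothetical oracle standing in for the
# observation-based defect check available to the agent.
def reward(state_before, action, state_after, defect_observable):
    # Positive reward only when the executed action exposes an
    # observable defect; otherwise no reward signal.
    if defect_observable(state_before, action, state_after):
        return 1.0
    return 0.0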
2.1.2 GUI Defect Data Model

To systematize the identification and classification of GUI defects, we introduce a novel data structure termed the Transition Tuple (state_b, action, state_a). This tuple effectively captures the GUI state before (state_b) and after (state_a) an action is executed, with the action itself represented in the middle. A sequence of such tuples forms a complete path of state transitions within the application. We define a specialized action ∅, distinct from standard agent operations, to signify performing no operation on the app. We have designed two classes to support our data model: the State class, which encapsulates methods init for initializing a state and act for performing an action, and the TransitionTuple class, with methods init to create instances and check_defect to evaluate transitions for defects. This classification aids in formalized defect detection, allowing for delayed, yet comprehensive defect analysis without the need for real-time feedback.
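To illustrate the delayed analysis this data model enables, here is a minimal sketch of offline defect checking over a logged episode; the field names and file paths are our own illustrative choices, and the empty action string stands in for the special no-op action ∅:

# Minimal sketch of delayed defect analysis over a logged episode.
# Each record is a serialized Transition Tuple.
episode = [
    {"state_before": "step0.png", "action": "click(3)",      "state_after": "step1.png"},
    {"state_before": "step1.png", "action": "type('Hello')", "state_after": "step2.png"},
    {"state_before": "step2.png", "action": "",              "state_after": "step2.png"},  # no-op
]

def analyze_episode(episode, check_defect):
    # check_defect is any detector (e.g., an MLLM prompt) applied to a
    # single tuple; no real-time feedback is required during execution.
    return [t for t in episode if check_defect(t)]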
Figure 2. Source and Methodology for Benchmark Data Construction. The left side of the figure illustrates our primary data sources, which include intentionally injected defects within apps and synthetic defect data generated by post-processing action sequence data obtained from app executions. The right side of the figure shows supplemental data sources, specifically real-world applications with GUI defects.

2.2. Test Case Generation

Test Case Generation can be viewed as the process of creating a structured chain of states and actions aligned with the testing intentions. Given a mobile application, the agent begins by gathering a brief overview of the app and creating test intentions, defining what needs to be tested. Once the test intention is defined, the agent can execute the testing tasks outlined in Section 2.3, generating a comprehensive test case that includes both the test intention and corresponding test steps. This structured approach enables the agent to conduct targeted and informed testing on the same app in future scenarios, such as version updates or feature enhancements. By leveraging these predefined test cases, the agent can focus on high-priority areas and adapt its testing to ensure new changes align with expected functionality, making the testing process both efficient and scalable.
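As a concrete illustration of the test case structure described above, a minimal sketch follows; all field names and values are hypothetical, not drawn from the paper's dataset:

# Illustrative structure of a generated test case: one test intention
# plus the ordered test steps recorded during task execution.
test_case = {
    "app": "com.example.recipes",          # hypothetical package name
    "intention": "To test if the app can delete a recipe, I need to ...",
    "steps": ["click(3)", "type('Tomato Soup')", "enter()", "stop()"],
}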
2.3. Test Task Execution

In executing the detailed test case, the multimodal agent performs several key actions to interact with the mobile application. The primary actions include Click, Scroll, and Type, along with the additional actions Stop and Enter. The execution starts with the agent activating the initial state of the application. As the agent interacts with the interface, it records every action, capturing screenshots and logging each step. This thorough recording process ensures that the impact of each action is documented, allowing for a detailed assessment of the app's response to user interactions. As the agent progresses through the test case, it navigates various screens and functionalities of the application. The execution phase continues until one of two outcomes occurs: the task is completed, or a problem is encountered; both are signaled with the Stop action.

This structured approach to executing test cases, combined with the agent's ability to record and react to the application's state, is essential for GUI automation testing, ensuring that multimodal large models can effectively mimic human testers in testing mobile applications.

Data of Applications    GUI Display    GUI Interaction
Real-World              53
Artificial Inject       79             26
AitW with Defects       6421           1871
Close-Source            1148           399
Open-Source             590            257

Table 1. Distribution of our dataset.
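To make the action vocabulary of Section 2.3 concrete, the following sketch dispatches the textual actions to a device driver. The device interface (tap, swipe, input_text, press_enter) is an assumption made for illustration, not a specific automation API:

# Illustrative dispatcher for the action vocabulary (Click, Scroll,
# Type, Enter, Stop). The `device` object is a hypothetical driver; a
# real run would back it with, e.g., an Android automation bridge.
def execute(action, device, elements):
    if action.startswith("click("):
        element_id = int(action[len("click("):-1])
        device.tap(elements[element_id])        # tap the numbered element
    elif action.startswith("scroll("):
        device.swipe(direction=action[len("scroll("):-1])
    elif action.startswith("type("):
        device.input_text(action[len("type("):-1].strip("'\""))
    elif action == "enter()":
        device.press_enter()
    elif action == "stop()":
        return False                            # task finished or blocked
    return True                                 # continue the episode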
2.4. GUI Defect Detection

In GUI automation testing, defect detection is essential for ensuring application quality and usability. GUI defects can be broadly classified into two main categories: Display Defects (DD) and Interaction Defects (ID) [20, 29, 41]. Display Defects focus on the visual presentation of the UI and include:
• Data Display, such as Content Error and Data Type or Format Error.
• Layout, including UI Element Missing, UI Element Overlapping, Alignment Issues, and Uneven Spacing.
• Style, such as Inconsistent Color, Inconsistent Element Size, and Abnormal UI Element State.
Interaction Defects pertain to user interactions with the UI and include:
• Operation, like Operation No Response and Virtual Keyboard Related Issues.
• Task, including Navigation Logic Error and Unexpected Task Result.
More details are provided in the Appendix, and the full taxonomy is encoded in the sketch below. By referencing these defect types, the agent can effectively detect and categorize any GUI defects present in each Transition Tuple and return the defect results. Through this automated process, the agent can verify that display and interaction elements perform optimally across various scenarios and user actions.
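The taxonomy above can be encoded directly; a minimal sketch using plain dictionaries (this encoding is ours, for illustration only):

# Illustrative encoding of the two-level defect taxonomy of Section 2.4.
DISPLAY_DEFECTS = {
    "Data Display": ["Content Error", "Data Type or Format Error"],
    "Layout": ["UI Element Missing", "UI Element Overlapping",
               "Alignment Issue", "Uneven Spacing"],
    "Style": ["Inconsistent Color", "Inconsistent Element Size",
              "Abnormal UI Element State"],
}
INTERACTION_DEFECTS = {
    "Operation": ["Operation No Response", "Virtual Keyboard Related Issue"],
    "Task": ["Navigation Logic Error", "Unexpected Task Result"],
}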
3. How To Benchmark Autonomous GUI Testing Agent

Building a unified benchmark for the entire automated GUI Testing framework requires addressing challenges that arise from the segmented focus of prior research. Existing benchmarks often isolate specific components, such as task execution [8, 47] or defect detection [5, 36]. Hence, we establish an end-to-end evaluation system, ensuring seamless integration across all phases of the testing process.

Given the complexities involved in collecting data from real-world applications, where the defects are inherently unpredictable, our benchmark consists of three carefully curated data categories: real-world mobile applications, applications with injected defects, and synthetic defect datasets. Figure 2 illustrates the composition of our benchmark, highlighting the balanced representation across each category. The detailed data distribution is shown in Table 1.

Figure 3. Examples of Constructed Synthetic GUI Defects. We present examples of various constructed GUI defects, demonstrating the feasibility of synthesizing defects through post-processing. This approach highlights a method for building large-scale GUI defect datasets, including both display and interaction defects.

3.1. Real-world Applications with GUI Defects

Real-world applications serve as a crucial component of our benchmark by providing insights into naturally occurring GUI defects. These defects are identified by mining upgrade logs and issue trackers from open-source repositories on GitHub [15]. Our approach involves systematically filtering and extracting relevant projects based on the descriptions in their change logs and issue reports, specifically focusing on entries related to GUI defects. This targeted filtering ensures that the selected applications contain genuine and relevant defects within their user interfaces.

For each identified project, we document essential application details, such as the version history, defect type, and issue severity, to build a comprehensive profile of the detected defects. To validate these defects, we carefully reproduce the reported issues, ensuring that the GUI defects are replicable and align with the descriptions provided by the developers. By leveraging this methodology, we not only ensure the relevance and authenticity of the collected data but also capture a wide variety of defect types reflective of real-world scenarios. These defects, ranging from subtle layout misalignments to critical interaction failures, offer a diverse testing ground for evaluating the automated GUI Testing framework.
3.2. Applications with injected defects

Real-world applications exhibit a wide range of complex and unpredictable GUI defects, making it difficult to ensure consistency across testing scenarios. To address this, we inject defects at the source code level or use the MutAPK tool [13, 14] to introduce controlled, predefined GUI defects into mobile applications [1, 2]. The injection of specific defects allows us to maintain strict control over the testing framework's results. Introducing defects in various areas not only ensures consistency but also creates diverse fault scenarios. The controlled nature of this method enables repeatable experiments, helping researchers systematically explore the strengths and weaknesses of different testing models. Additionally, by increasing the complexity of the injected defects, we can push the boundaries of the agent, ensuring it can handle the kinds of diverse challenges found in real-world applications.

3.3. Synthetic defect datasets

For most commercial applications, source code is proprietary and public releases are generally stable, having undergone multiple testing iterations. Consequently, these apps rarely contain the early-stage GUI defects essential for benchmarking the agent. To overcome this limitation, we adopt a synthetic approach, transforming screenshots of stable applications to simulate a variety of visual and interaction defects. This technique allows us to obtain GUI defect data from any app, even complex and mature commercial applications. Specific defect construction types and examples are shown in Figure 3.
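As an illustration of this post-processing idea, one display defect, UI Element Overlapping, can be synthesized by re-pasting a cropped region of a clean screenshot at a shifted offset. A minimal sketch using Pillow; the file names, crop box, and offset are arbitrary illustrative values, and this is our illustration of the approach, not the paper's generation code:

from PIL import Image

# Sketch: synthesize a "UI Element Overlapping" display defect by
# re-pasting a region of a clean screenshot at a shifted position.
def inject_overlap(screenshot_path, box, offset, out_path):
    img = Image.open(screenshot_path).convert("RGB")
    region = img.crop(box)                         # (left, top, right, bottom)
    dx, dy = offset
    img.paste(region, (box[0] + dx, box[1] + dy))  # shifted copy overlaps the UI
    img.save(out_path)

# Hypothetical usage (paths and coordinates are placeholders):
inject_overlap("clean.png", (0, 120, 1080, 220), (0, 40), "defect.png")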
cus is on evaluating the generation of test intentions and,
4. Correlation Analysis Between General Capabilities and GUI Autonomous Test Performance

A key assumption in our approach is that a fine-tuned model performs better on specific sub-tasks because it has improved mastery of the skills needed for those tasks. In automated GUI Testing, tasks like test case generation and defect detection require both perception (e.g., recognizing visual elements) and reasoning (e.g., interpreting navigation logic and workflows). However, it is difficult to delineate exactly which capabilities contribute most to success. The ability to navigate between screens, detect overlapping elements, or respond correctly to unresponsive buttons may draw on multiple, interconnected competencies. Often, these dependencies are not straightforward.

Therefore, we propose an evaluation method that fine-tunes models on datasets specific to GUI Testing tasks and then assesses their performance on broad, standardized benchmarks. This comparative analysis, contrasting a model's pre- and post-fine-tuning performance, offers insights into the capabilities that are most relevant to specific stages of the GUI Testing process. For example, improvements in benchmarks focused on perception may indicate the model's enhanced ability to identify subtle layout issues, while gains in reasoning-oriented benchmarks might reflect better handling of navigation errors or task flows. Additionally, real-world GUI Testing tasks often suffer from data scarcity. Our method provides a pathway for expanding datasets by strategically selecting general datasets aligned with the task's requirements.

This framework ties back to the core goal of our paper: building a comprehensive, end-to-end benchmark for GUI automation testing. By bridging the gap between general benchmark performance and task-specific outcomes, we offer a practical methodology for identifying the key capabilities that matter most. This not only enhances the reliability of visual-based agents for automated GUI Testing but also lays the foundation for continuous model improvement.
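A minimal sketch of this pre-/post-fine-tuning comparison, assuming per-benchmark scores have already been collected; the benchmark names and scores below are illustrative only:

# Sketch: contrast a model's general-benchmark scores before and after
# fine-tuning on a GUI Testing subset. Large positive deltas point to
# the capabilities the task actually exercises.
before = {"perception_bench": 61.2, "reasoning_bench": 54.8}   # example scores
after  = {"perception_bench": 66.9, "reasoning_bench": 55.1}

deltas = {name: round(after[name] - before[name], 2) for name in before}
relevant = sorted(deltas, key=deltas.get, reverse=True)
print(deltas)     # e.g. {'perception_bench': 5.7, 'reasoning_bench': 0.3}
print(relevant)   # capabilities ranked by how much fine-tuning moved them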

5. Experiment

5.1. Experiment Setup and Evaluation Criteria

In alignment with the workflow detailed in Section 2, our experimental setup benchmarks the performance of various multimodal large models under a unified framework. This approach allows for a direct comparison of these models in a consistent architecture, assessing their effectiveness as agents in generating test intentions, executing test steps, and conducting GUI defect detection. Since a test case comprises a test intention and corresponding test steps, and the latter rely on the task execution results, our primary focus is on evaluating the generation of test intentions and, based on this, assessing the effectiveness of task execution.

Coverage. The model generates a variable number of test intentions based on the app and background information. To account for the semantic ambiguity in test intentions, each generated intention is matched to the human-annotated ground truth by GPT as judge, assessing whether it aligns with the true set of test intentions. The judgment prompt template we used is provided in the Appendix. The proportion of correctly aligned intentions represents the coverage rate for test intention generation.
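Given the judge's binary decisions, the coverage rate reduces to a simple proportion. A minimal sketch of our reading of this computation, where judge_match is a hypothetical wrapper around the GPT judgment prompt:

# Sketch of the coverage computation: each generated intention is
# matched against the ground-truth set by an LLM judge.
def coverage(generated, ground_truth, judge_match):
    # judge_match(intention, ground_truth) -> True if the judge aligns
    # the intention with some human-annotated ground-truth intention.
    aligned = sum(1 for g in generated if judge_match(g, ground_truth))
    return aligned / len(generated) if generated else 0.0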
TM, EM and SR. For Test Task Execution, we employ TM (Type Match), EM (Exact Match), and SR (Success Rate) as evaluation metrics. For each defined tuple (state_b, action, state_a), TM indicates whether the model correctly predicts the type of action to take in the next step. EM assesses, given a correct action type, whether the action details are accurate. For instance, in test step click(3), where 3 is the element ID within the image, the model needs to identify the correct element to click. For a test task, if the action is entirely correct for each tuple, we consider the agent to have successfully completed the task. SR represents the percentage of tasks that the model successfully completes out of all tasks.

Accuracy and Recall. For GUI defect detection, we evaluate accuracy and recall metrics. The test set includes both data with GUI defects and normal data (represented in our defined triplet form). For Accuracy, we calculate the proportion of correct judgments made by the model across all data. Recall is further divided into two metrics: Recall-D (Recall_defect), which measures the model's correct judgment rate on data with GUI defects, and Recall-N (Recall_nodefect), which assesses the model's performance on normal data. These three metrics allow us to thoroughly analyze the performance of different multimodal large language models in the task of GUI defect detection.
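These metrics reduce to simple counts over aligned predictions; a minimal sketch under an assumed data layout (step records pair a predicted action with its ground truth, and detection records pair labels with predictions). Note that EM is computed here as a full-string match for simplicity, whereas the paper conditions EM on a correct action type:

# Sketch of the evaluation metrics. Each step is a record such as
# {"pred": "click(3)", "gold": "click(3)"}; a task is a list of steps.
def action_type(a):
    return a.split("(", 1)[0]          # "click(3)" -> "click"

def tm_em_sr(tasks):
    steps = [s for task in tasks for s in task]
    tm = sum(action_type(s["pred"]) == action_type(s["gold"]) for s in steps) / len(steps)
    em = sum(s["pred"] == s["gold"] for s in steps) / len(steps)   # simplified EM
    sr = sum(all(s["pred"] == s["gold"] for s in task) for task in tasks) / len(tasks)
    return tm, em, sr

def detection_metrics(samples):
    # samples: [{"defective": bool, "predicted_defective": bool}, ...]
    correct = [s["defective"] == s["predicted_defective"] for s in samples]
    accuracy = sum(correct) / len(samples)
    defect = [c for s, c in zip(samples, correct) if s["defective"]]
    normal = [c for s, c in zip(samples, correct) if not s["defective"]]
    recall_d = sum(defect) / len(defect)     # recall on defective data
    recall_n = sum(normal) / len(normal)     # recall on normal data
    return accuracy, recall_d, recall_n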
5.2. Baseline Autonomous GUI Testing

To establish a baseline, we select models with significantly different architectures and native multimodal capabilities for comparison, including GPT-4o, Claude, LLaVA, and Qwen2-VL.

Model                   Coverage      TM            EM            SR            Accuracy      Recall-D      Recall-N
LLaVA (LlaMA3-8B)       17.12         31.70         4.40          3.28          24.90         6.00          69.0
Qwen2-VL (Qwen2-7B)     20.37         35.40         13.20         11.73         30.10         0.14          100.0 (+20)
Claude (3.5-Sonnet)     36.18         48.88 (+7.3)  21.50 (+2.1)  20.45 (+4.3)  7.3           10.29         0.33
GPT-4o (2024-02-01)     37.01 (+0.8)  41.60         19.40         16.14         33.80 (+3.7)  14.00 (+3.7)  80.0

Table 2. Comparison of Different Multimodal Large Language Models as Agents for Autonomous GUI Testing. Coverage evaluates test intention generation; TM, EM, and SR (Type Match, Exact Match, and Success Rate) evaluate test task execution; Accuracy, Recall-D, and Recall-N evaluate GUI defect detection. The numbers in parentheses give the difference between the best result and the second-best result. Since several models tend to respond with "no defect" during GUI defect detection, we selected both defective and non-defective data to calculate the recall metrics, denoted as Recall-D and Recall-N, respectively.

The experimental results, shown in Table 2, indicate that GPT-4o and Claude perform comparably across most metrics and outperform open-source models. Although the difference in some metrics between closed-source commercial models and open-source models is relatively small, the overall poor performance of the open-source models still highlights a significant gap between open-source and closed-source models. In the test intention generation metric, LLaVA and Qwen2-VL lag considerably behind GPT-4o and Claude, suggesting that tasks requiring extensive knowledge and imaginative capabilities still necessitate models with sufficiently large parameters to produce richer responses. The differences in TM and EM performance further support this point: while open-source models can answer some matching tasks correctly, LLaVA struggles significantly in higher-difficulty tasks involving exact matches, whereas Qwen2-VL performs slightly better due to prior training on GUI data.

For GUI defect detection, we use a mix of data with and without defects to simulate the agent performing defect detection tasks in a real-world scenario. Since Qwen2-VL predominantly responds with "no defect", it achieves a high recall rate on normal data but performs poorly on data with actual defects. In contrast, Claude tends to judge that GUI defects exist in the data. This outcome underscores the necessity of evaluating both metrics for a comprehensive assessment.

5.3. Ablation Study on Test Task Execution

In Section 2.2, we discussed how the action sequence performed by the agent under the guidance of a test intention can be used to construct test steps, thereby generating a complete test case. This raises the question: does using test steps as instructions enable the agent to complete tasks more effectively compared to using the test intention? To explore this, we concatenated the action sequences within individual test tasks to form the corresponding test steps for each task. We then evaluated the performance of GPT-4o, LLaVA, and Qwen2-VL on test task execution under these two types of instructions.

Instruction        GPT-4o    LLaVA    Qwen2-VL
Test Intention     16.14     3.28     11.73
Test Steps         16.39     6.81     19.80

Table 3. SR of VLLMs on Different Instructions for Test Task Execution. We use two types of instructions, test intention and test steps, to compare the task completion performance of the models under each instruction type.

The experimental results in Table 3 show that both LLaVA and Qwen2-VL exhibited performance improvements when given test steps as instructions. GPT-4o, however, showed minimal change in performance, even performing worse than Qwen2-VL when using test steps as instructions. This suggests that the limitation in GPT-4o's performance in test task execution does not stem from its ability to accurately interpret test intentions but rather from the inherent complexity of GUI interfaces. Qwen2-VL, having been trained on GUI data, benefits from the clarity in identifying the next action to execute, resulting in a more significant performance boost when provided with explicit test steps.
6. Related Work

Agent on GUI navigation task. Early GUI agents [16, 22, 23, 35] primarily relied on training models to explore GUI environments with task completion as the main objective. With advances in multimodal large models [3, 6], current approaches have shifted from traditional training to using techniques like prompt tuning [21] and in-context learning [46] to guide these models in exploration tasks. AppAgent [44], Mobile-Agent [38], and AutoDroid [40] utilize LLMs to interpret natural language descriptions and transform them into GUI actions. Additionally, some work [10, 17] has focused on fine-tuning large models on GUI-specific data to improve their performance in GUI environments. Agent workflow memory [39] represents another recent innovation, enhancing agents' ability to automatically construct workflows, thus introducing a new paradigm for GUI task automation.

There has also been progress in creating benchmarks for GUI navigation [12, 23, 35]. WebArena [47], for instance, constructs realistic web environments with callable tools to study the limitations of models like GPT-4V, revealing a significant gap in agent performance compared to humans in complex tasks. AitW [34] collected large-scale data by having annotators operate apps in a simulator to capture human instruction-following behavior, though data quality remains a concern. Building on this, AitZ [45] introduced high-quality navigation data with GPT-4-annotated Chain-of-Thought (CoT) reasoning, along with new metrics to evaluate agent performance in GUI navigation tasks.

GUI Defect Detection. Given the close connection between GUI quality and user experience, various methods have been developed to detect bugs in GUIs. GUI Testing in industry relies heavily on scripted tests to automate function validation. To address this, AppFlow [18] applies machine learning to identify screen components, allowing testers to develop modular libraries for core application functions. CoSer [9] constructs UI state transition graphs from source code and scripts to repair outdated tests. Recently, LLMs have emerged as powerful tools in GUI Testing due to their extensive training on diverse data and strong reasoning abilities. For example, QTypist [25] focuses on generating semantic text inputs for form fields to improve exploration coverage. GPTDroid [27] extracts page and widget information from the UI hierarchy, using it to create human-like interactions. AUITestAgent [19] developed an industry-applicable automatic natural language-driven GUI Testing method. VisionDroid [28] addresses non-crash bug detection in GUIs by leveraging LLMs to detect unexpected behaviors, particularly in scenarios where testing oracles are lacking.

7. Conclusion

This paper introduces a formalized framework for Autonomous GUI Testing, aimed at addressing key limitations in visual-based agent evaluation. By structuring the testing workflow with precise mathematical definitions and decoupling GUI defect detection from task execution, we present a fair and robust environment (GTArena) for evaluating GUI Testing capabilities. Our work includes a novel data structure for capturing GUI defects, which facilitates the creation of large-scale datasets.

Furthermore, we propose a unified benchmark to assess visual-based agents equipped with multimodal large models (MLLMs), evaluating their performance across core components: test case generation, test task execution, and GUI defect detection. Through this structured benchmark, we reveal notable gaps between current agents' performance and practical applicability for mainstream VLLMs, underscoring the need for targeted model improvements. Additionally, our methodology offers a systematic approach for fine-tuning models on task-specific datasets, while evaluating their general capabilities on broader benchmarks.

In conclusion, our work provides a fair, unified, and end-to-end environment for automated GUI Testing, enabling convenient and reproducible evaluation of various multimodal large models in their role as agents. By bridging the gap between theoretical frameworks and practical evaluations, we aim to accelerate the development of more capable, reliable, and efficient agents for GUI Testing applications.
References

[1] F-droid: Free and open source android app repository, 2024. Accessed: 2024-11-13.
[2] Google play store, 2024. Accessed: 2024-11-13.
[3] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[4] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
[5] Emil Alégroth, Arvid Karlsson, and Alexander Radway. Continuous integration and visual gui testing: Benefits and drawbacks in industrial practice. In 2018 IEEE 11th International Conference on Software Testing, Verification and Validation (ICST), pages 172–181. IEEE, 2018.
[6] AI Anthropic. Introducing the next generation of claude, 2024.
[7] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 1(2):3, 2023.
[8] Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, et al. Windows agent arena: Evaluating multi-modal os agents at scale. arXiv preprint arXiv:2409.08264, 2024.
[9] Shaoheng Cao, Minxue Pan, Yu Pei, Wenhua Yang, Tian Zhang, Linzhang Wang, and Xuandong Li. Comprehensive semantic repair of obsolete gui test scripts for mobile applications. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pages 1–13, 2024.
[10] Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. Seeclick: Harnessing gui grounding for advanced visual gui agents. arXiv preprint arXiv:2401.10935, 2024.
[11] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023.
[12] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36, 2024.
[13] Camilo Escobar-Velásquez, Michael Osorio-Riaño, and Mario Linares-Vásquez. Mutapk: Source-codeless mutant generation for android apps. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 1090–1093. IEEE, 2019.
[14] Camilo Escobar-Velásquez, Diego Riveros, and Mario Linares-Vásquez. Mutapk 2.0: A tool for reducing mutation testing effort of android apps. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 1611–1615, 2020.
[15] GitHub. Github repository of applications, 2024. Accessed: 2024-11-13.
[16] Izzeddin Gur, Ulrich Rueckert, Aleksandra Faust, and Dilek Hakkani-Tur. Learning to navigate the web. arXiv preprint arXiv:1812.09195, 2018.
[17] Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. Cogagent: A visual language model for gui agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14281–14290, 2024.
[18] Gang Hu, Linjie Zhu, and Junfeng Yang. Appflow: Using machine learning to synthesize robust, reusable ui tests. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 269–282, 2018.
[19] Yongxiang Hu, Xuan Wang, Yingchuan Wang, Yu Zhang, Shiyu Guo, Chaoyi Chen, Xin Wang, and Yangfan Zhou. Auitestagent: Automatic requirements oriented gui function testing. arXiv preprint arXiv:2407.09018, 2024.
[20] Valéria Lelli, Arnaud Blouin, and Benoit Baudry. Classifying and qualifying gui defects. In 2015 IEEE 8th International Conference on Software Testing, Verification and Validation (ICST), pages 1–10. IEEE, 2015.
[21] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.
[22] Gang Li and Yang Li. Spotlight: Mobile ui understanding using vision-language models with a focus. arXiv preprint arXiv:2209.14927, 2022.
[23] Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, and Jason Baldridge. Mapping natural language instructions to mobile ui action sequences. arXiv preprint arXiv:2005.03776, 2020.
[24] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.
[25] Zhe Liu, Chunyang Chen, Junjie Wang, Xing Che, Yuekai Huang, Jun Hu, and Qing Wang. Fill in the blank: Context-aware automated text input generation for mobile gui testing. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 1355–1367. IEEE, 2023.
[26] Zhe Liu, Chunyang Chen, Junjie Wang, Mengzhuo Chen, Boyu Wu, Xing Che, Dandan Wang, and Qing Wang. Chatting with gpt-3 for zero-shot human-like mobile automated gui testing. arXiv preprint arXiv:2305.09434, 2023.
[27] Zhe Liu, Chunyang Chen, Junjie Wang, Mengzhuo Chen, Boyu Wu, Xing Che, Dandan Wang, and Qing Wang. Make llm a testing expert: Bringing human-like interaction to mobile gui testing via functionality-aware decisions. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pages 1–13, 2024.
[28] Zhe Liu, Cheng Li, Chunyang Chen, Junjie Wang, Boyu Wu, Yawen Wang, Jun Hu, and Qing Wang. Vision-driven automated mobile gui testing via multimodal large language model. arXiv preprint arXiv:2407.03037, 2024.
[29] Z. Liu, J.J. Wang, C.Y. Chen, X. Che, Y.H. Su, and Q. Wang. Empirical study on ui display issue detection in mobile applications. Journal of Software, 35(11):5040–5064, 2024. (In Chinese).
[30] Xinbei Ma, Zhuosheng Zhang, and Hai Zhao. Comprehensive cognitive llm agent for smartphone gui automation. arXiv preprint arXiv:2402.11941, 2024.
[31] MobileLLM. Droidbot-gpt: A lightweight model-driven tool for automated gui testing on android. https://github.com/MobileLLM/DroidBot-GPT, 2023. GitHub repository.
[32] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
[33] Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, and Rafael Rafailov. Agent q: Advanced reasoning and learning for autonomous ai agents. arXiv preprint arXiv:2408.07199, 2024.
[34] Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. Androidinthewild: A large-scale dataset for android device control. Advances in Neural Information Processing Systems, 36, 2024.
[35] Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, and Percy Liang. World of bits: An open-domain platform for web-based agents. In International Conference on Machine Learning, pages 3135–3144. PMLR, 2017.
[36] Ting Su, Jue Wang, and Zhendong Su. Benchmarking automated gui testing for android against real-world bugs. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 119–130, 2021.
[37] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[38] Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent: Autonomous multi-modal mobile device agent with visual perception. arXiv preprint arXiv:2401.16158, 2024.
[39] Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent workflow memory. arXiv preprint arXiv:2409.07429, 2024.
[40] Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, and Yunxin Liu. Empowering llm to use smartphone for intelligent task automation. arXiv preprint arXiv:2308.15272, 2023.
[41] Yiheng Xiong, Mengqian Xu, Ting Su, Jingling Sun, Jue Wang, He Wen, Geguang Pu, Jifeng He, and Zhendong Su. An empirical study of functional bugs in android apps. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 1319–1331, 2023.
[42] Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, and Zhe Gan. Ferret-ui: Grounded mobile ui understanding with multimodal llms. arXiv e-prints, pages arXiv–2404, 2024.
[43] Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, and Zhe Gan. Ferret-ui: Grounded mobile ui understanding with multimodal llms. In European Conference on Computer Vision, pages 240–255. Springer, 2025.
[44] Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. Appagent: Multimodal agents as smartphone users. arXiv preprint arXiv:2312.13771, 2023.
[45] Jiwen Zhang, Jihao Wu, Yihua Teng, Minghui Liao, Nuo Xu, Xiao Xiao, Zhongyu Wei, and Duyu Tang. Android in the zoo: Chain-of-action-thought for gui agents. arXiv preprint arXiv:2403.02713, 2024.
[46] Longtao Zheng, Rundong Wang, Xinrun Wang, and Bo An. Synapse: Trajectory-as-exemplar prompting with memory for computer control. In The Twelfth International Conference on Learning Representations, 2023.
[47] Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023.

A. GUI Defect Types

The specific types of GUI defects and examples are shown in Tables 4 and 5.

B. GUI Defect Dataset Examples

Some real-world defects mined from GitHub releases are listed in Table 6. Examples of artificially injected defects and an episode from AitW with defects are shown in Figure 4 and Figure 5.
Data Display
• Content Error: This defect involves text that appears as garbled or unintelligible characters on the screen, making information difficult to read or understand. Example: Replace content in string.xml with 'null'.
• Data Type or Format Error: This defect occurs when data is displayed in inappropriate or unexpected formats, which can lead to misinterpretation or difficulty in understanding the data. Example: Letters are allowed to be entered in the date input field; the page shows the date "2021-06-15" as "20210615".

Layout
• UI Element Missing: This defect refers to the absence of crucial UI elements within the interface, which can hinder user interaction or functionality. Example: An image is not loaded or is displayed broken; the "New" page lacks a save button.
• UI Element Overlapping: This defect describes scenarios where UI components overlap one another, obscuring content and potentially making certain functions inaccessible. Example: The labels for "Total Expenditure" and "Remaining Budget" overlap.
• Alignment Issue: This defect is identified when UI elements are not properly aligned, leading to a visually disorganized interface that can detract from user experience. Example: In a center-aligned navigation bar, one item is right-aligned.
• Uneven Spacing: This defect is characterized by irregular spacing between UI elements, which can create a cluttered or unbalanced appearance, affecting the aesthetics and usability. Example: Two elements are spaced too far apart, resulting in a large area of whitespace.

Style
• Inconsistent Color: This defect arises when the color scheme of UI elements is mismatched or poorly chosen, potentially leading to a visually unappealing or confusing interface. Example: Most of the icon colors in the navigation bar are the same, with a few exceptions.
• Inconsistent Element Size: This defect pertains to UI elements that vary significantly in size, which can confuse users and disrupt the visual flow of the application, affecting usability. Example: Some fonts are too large while others are too small.
• Abnormal UI Element State: This defect involves UI elements that display unexpected behaviors or appearances when they are interacted with, such as being clicked or focused, which can confuse users or hinder interaction. Example: The submit button appears in an active state although it is not being clicked.

Table 4. UI Display Defects, grouped by category (Data Display, Layout, Style), each with its description and an example.

Operation
• Operation No Response: This defect occurs when there is no feedback or action following user interactions, leading to uncertainty and frustration for the user. Example: The submit button is clicked but there is no response; there is more content below, but the page cannot scroll down when the user swipes.
• Virtual Keyboard Related Issue: This defect involves problems with the virtual keyboard that affect typing or input, such as unexpected behavior or layout issues. Example: The virtual keyboard cannot wake up automatically.

Task
• Navigation Logic Error: This defect refers to flaws in the navigation logic that result in incorrect or unintended application flows, potentially leading users to incorrect destinations or functions. Example: Clicking 'Default Setting' jumps to 'UI interface'.
• Unexpected Task Result: This defect occurs when the results of tasks do not align with the anticipated results or specifications, leading to confusion and potential errors in usage. Example: A theme change does not work; a recipe could not be deleted.

Table 5. UI Interaction Defects, grouped by category (Operation, Task), each with its description and an example.
Release   Display Defect                                        Interaction Defect
v1.21.0   - Stickers from Gboard have black background (fixed)  - Broken localization with empty strings in it (fixed)
          - mxc reactions not rendered correctly (fixed)
v1.17.2   /                                                     - Add cancel button to key request dialog
                                                                - Encode component for links correctly
                                                                - Forward arbitrary message content
                                                                - Open public room bottom sheet by alias
v3.0      - Song placeholder icon in player view                /
v2.0      - Launcher icon background color                      - Disable favourite button for local songs
v1.0      - Color of status and navigation bar                  /
          - Splash screen background color in dark mode
v6.0.0    /                                                     - Top/Recent Artists/Albums not updating (wrong sort order)
                                                                - All Blacklist related crashes
                                                                - Restart button not working in crash activity
v5.8.4    /                                                     - Crash when adding folders to blacklist
v5.8.3    - Incorrect song data in notification                 /
v5.8.0    /                                                     - Settings change not reflecting immediately
                                                                - Crash when clicking on Playlist in the Search Tab
v5.6.0    - Incorrect colors when no cover art is available     - Lockscreen dragging glitch
          - Blank album cover bug                               - Favorite not updating when song is changed
                                                                - Playlist not getting created & playlist creation crash with same name
                                                                - Bug in "Plain" Now Playing theme where onClick event is consumed by the views behind the bottom sheet

Table 6. Examples of real-world defect descriptions.
Figure 4. Example Defects in Artificial Injected Data. The annotated screenshots (omitted here) illustrate, among others: Content Error (garbled or unreadable text; a unit type error in "Total Distance Recorded"), UI Element Missing (element 6 is missing an icon), UI Element Overlapping (the "Settings" header is overlapping), Alignment Issue (buttons 2 and 3 are not aligned), Uneven Spacing (an abnormal blank in the middle of the screenshot), Inconsistent Color (mismatched icon colors), Inconsistent Element Size (element 5's icon has an abnormal size), Abnormal UI Element State (buttons 5 and 6 appear disabled), Operation No Response (more content below, but the page cannot scroll), Virtual Keyboard Related Issue (the keyboard cannot shut down automatically), Navigation Logic Error (clicking 'Pipe Music' jumps to 'Setting'), and Unexpected Task Result (the recipe 'Tomato Soup' could not be deleted).
Figure 5. Example episode from the AitW with Defects dataset. Frames in the episode (omitted here) carry injected-defect annotations such as [UI Element Missing | 7], [Content Error | 29], and [Content Error | 46].
