Harnessing GUI Grounding for Advanced Visual GUI Agents

[Figure 2: Overview of SeeClick. A Vision Encoder (ViT) with an adapter feeds a Large Vision-Language Model (LVLM); the model is trained on GUI grounding data (mobile UI grounding, web UI grounding, web OCR) together with general vision-language data (VQA, visual reasoning). Example instruction: "View the new album of Jony J".]
the capacity to locate screen elements based on instructions. Although recent LVLMs have claimed grounding capability on natural images (Bai et al., 2023; Wang et al., 2023), GUI screenshots differ significantly, with dense text and numerous icons and widgets. These differences impair existing LVLMs' grounding performance in GUI contexts and limit their potential as visual GUI agents.

This paper seeks to harness LVLMs with GUI grounding skills, paving the way for a visual GUI agent that executes instructions relying only on screenshots. As presented in Figure 2, SeeClick is a foundational model for GUIs, tailored for adaptation to agent tasks. Next, we introduce the birth of SeeClick, including the formalization of the GUI grounding task, the construction of continual pre-training data, and training details.

3.1 GUI grounding for LVLMs

As GUI grounding is the core capability of SeeClick, we first elucidate how to train an LVLM for language generation to perform grounding tasks. Consider an interface screenshot s and a collection of elements {(x_i, y_i) | i} on it, where x_i denotes the textual description of the i-th element and y_i indicates the element's location (represented as a bounding box or point). As depicted in Figure 2(a), the LVLM predicts the location of an element y based on the interface screenshot s and its textual description x, i.e., it models p(y|s, x).

A potential challenge is how LVLMs predict numerical coordinates in a language generation format. Previous studies (Chen et al., 2021; Wang et al., 2023; Shaw et al., 2023) divide the image into 1,000 bins and create a new 1,000-token vocabulary {<p0>, <p1>, ..., <p999>} to represent the x and y coordinates. In this work, we adopt a more intuitive manner used in LVLMs (Chen et al., 2023b; Bai et al., 2023), treating numerical values as natural language without any additional tokenization or pre-/post-processing. For instance, in Figure 2(a), for a smartphone screenshot and the instruction "View the new album of Jony J", we craft a query prompt: "In the UI, where should I click if I want to <instruction>?". Subsequently, we compute the standard cross-entropy loss between the model output and the ground truth "click (0.49, 0.40)" to optimize the LVLM.
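To make the training format concrete, the following is a minimal sketch (not the authors' released code) of how a grounding sample can be serialized into a query/target pair with normalized two-decimal coordinates; the helper names and the example bounding box are illustrative.

def normalize_point(bbox, img_w, img_h):
    """Convert a pixel-space bbox (left, top, right, bottom) to a normalized center point."""
    left, top, right, bottom = bbox
    x = round(((left + right) / 2) / img_w, 2)
    y = round(((top + bottom) / 2) / img_h, 2)
    return x, y

def build_grounding_sample(instruction, bbox, img_w, img_h):
    """Serialize one GUI grounding example into a (query, target) text pair."""
    query = f"In the UI, where should I click if I want to {instruction}?"
    x, y = normalize_point(bbox, img_w, img_h)
    target = f"click ({x:.2f}, {y:.2f})"
    return query, target

# Example: a 1080x2400 phone screenshot whose target element sits around mid-screen.
print(build_grounding_sample("View the new album of Jony J", (480, 912, 578, 1010), 1080, 2400))
# -> ('In the UI, where should I click if I want to View the new album of Jony J?', 'click (0.49, 0.40)')

During training, the query tokens would be masked out of the loss so that only the target string (e.g., "click (0.49, 0.40)") contributes to the cross-entropy, following standard language-model fine-tuning practice.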
3.2 Data Construction

We train SeeClick using three collections of data: web UI data crawled from the internet, mobile UI data reorganized from public datasets, and general vision-language instruction-following data.

Web Data. Web UIs, featuring a variety of layouts and design styles across websites, are ideal for training LVLMs' general recognition and grounding capabilities across different GUI contexts. We collect approximately 300k web pages from the latest Common Crawl repository to serve as our training data for web UI. For each webpage s, we collect two types of elements from the HTML code, as exemplified in Figure 3: (1) elements that display visible text content; and (2) elements with a special "title" attribute that display descriptive text when hovering. This method ensures that we gather a series of interactable elements y and their corresponding instructions x, while encompassing a wide range of text and icon elements. In addition to the grounding task p(y|s, x), we also include a web OCR task p(x|s, y), predicting the text description based on coordinates.

<div class="header">
  <ul class="menu">
    <li>...</li>
  </ul>
</div>
<div class="container">
  <div class="product-thumbnails"><a href="#" title="Previous image"></a></div>
  <div class="product-detail">
    <div>...</div>
    <button>ENQUIRE NOW</button>
    <div class="product-share">...</div>
  </div>
</div>

Figure 3: Example of two types of elements automatically collected from the webpage.
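As a rough illustration of this collection step (the paper does not detail the exact crawler), the sketch below uses BeautifulSoup to pull the two element types — visible-text elements and elements carrying a "title" attribute — from a page's HTML; in the real pipeline each element would additionally be paired with its rendered bounding box.

from bs4 import BeautifulSoup

def collect_web_elements(html: str):
    """Collect (description, kind) pairs for the two element types used as web grounding data."""
    soup = BeautifulSoup(html, "html.parser")
    elements = []
    for tag in soup.find_all(True):
        title = tag.get("title")
        if title:                        # type (2): hover text from the "title" attribute
            elements.append((title.strip(), "title"))
        text = tag.get_text(strip=True)
        if text and not tag.find(True):  # type (1): leaf elements with visible text content
            elements.append((text, "text"))
    return elements

html = '<div class="product-thumbnails"><a href="#" title="Previous image"></a></div><button>ENQUIRE NOW</button>'
print(collect_web_elements(html))
# -> [('Previous image', 'title'), ('ENQUIRE NOW', 'text')]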
Mobile Data. For mobile UI, we include three types of data: widget captioning, mobile UI grounding, and mobile UI summarization. The widget captioning dataset provides language descriptions for mobile UI elements; for example, the description "play music" for the play button on a music player interface. We utilize the training split of the dataset provided by Li et al. (2020b), containing nearly 20k screenshots, 40k widgets, and 100k descriptions. We derive mobile UI grounding data by reversing the process of widget captioning, treating language descriptions as instructions and the corresponding widgets as target elements. To improve diversity, we also incorporate the automatically collected elements and instructions from RICO (Li et al., 2020a). The mobile data involves diverse elements and instructions, facilitating the generalization of SeeClick's grounding proficiency to diverse GUI contexts. We finally include mobile UI summarization data (Wang et al., 2021) to enhance overall interface comprehension.

General Data. To maintain the LVLM's general capacities on natural images, we incorporate the general vision-language instruction-following data from LLaVA (Liu et al., 2023a), covering conversation, detailed description, and complex reasoning.

Finally, we mix the data above and craft 30 task-specific prompts for each added GUI task, resulting in a 1M-sample dataset to train SeeClick.

3.3 Training Details

We build SeeClick through continual pre-training on a recent advanced LVLM, Qwen-VL (Bai et al., 2023), which possesses grounding capabilities and a higher resolution of 448*448. We train Qwen-VL on the dataset we constructed (as described in Section 3.2) for about 10k steps (around 1 epoch) to obtain our GUI base model SeeClick. During training, we employ LoRA (Hu et al., 2021) to fine-tune both the visual encoder and LLM. Further details and task examples are provided in Appendix A.
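The continual pre-training setup can be sketched with the PEFT library as follows. This is an assumption-laden illustration rather than the released training script: the LoRA rank, the target module names, and the tiny stand-in backbone are placeholders, but it shows the pattern of wrapping a base model with LoRA adapters and driving them with AdamW plus a cosine schedule, as reported in Appendix A.2.

import torch
from torch import nn
from peft import LoraConfig, get_peft_model

# Stand-in for the LVLM backbone; in practice this would be Qwen-VL's visual encoder + LLM.
class TinyBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        self.attn_proj = nn.Linear(64, 64)
        self.mlp_proj = nn.Linear(64, 64)
    def forward(self, x):
        return self.mlp_proj(torch.relu(self.attn_proj(x)))

model = TinyBackbone()
# LoRA adapters on the projection layers (module names here are placeholders).
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["attn_proj", "mlp_proj"])
model = get_peft_model(model, lora_cfg)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)  # learning rate from Appendix A.2
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000)  # ~10k steps

for step in range(3):  # placeholder loop; real training iterates over the 1M-sample mixture
    loss = model(torch.randn(4, 64)).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()

In the actual SeeClick recipe the visual encoder's gradients are also unlocked, the global batch size is 64, and training runs roughly 24 hours on 8 A100 GPUs (see Appendix A.2).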
Model        Size   GUI-Specific   Mobile Text   Mobile Icon/Widget   Desktop Text   Desktop Icon/Widget   Web Text   Web Icon/Widget   Average
MiniGPT-v2   7B     No             8.4%          6.6%                 6.2%           2.9%                  6.5%       3.4%              5.7%
Qwen-VL      9.6B   No             9.5%          4.8%                 5.7%           5.0%                  3.5%       2.4%              5.2%
GPT-4V       -      No             22.6%         24.5%                20.2%          11.8%                 9.2%       8.8%              16.2%
Fuyu         8B     Yes            41.0%         1.3%                 33.0%          3.6%                  33.9%      4.4%              19.5%
CogAgent     18B    Yes            67.0%         24.0%                74.2%          20.0%                 70.4%      28.6%             47.4%
SeeClick     9.6B   Yes            78.0%         52.0%                72.2%          30.0%                 55.7%      32.5%             53.4%

Table 1: Results of different LVLMs on ScreenSpot. The best results in each column are highlighted in bold. Benefiting from efficient GUI grounding pre-training, SeeClick significantly enhanced LVLMs' ability to locate GUI elements following instructions, and surpassed the strong baseline CogAgent with a smaller model size.
Figure 4: Statistics of our proposed GUI grounding benchmark ScreenSpot. The left illustrates the diverse GUI environments included (mobile: iOS, Android; desktop: Windows, macOS; web: development, shopping, forum, and tools). The right displays the types of elements (text vs. icon/widget) within each GUI category.

4 ScreenSpot: A Grounding Benchmark

We recognize GUI grounding proficiency as essential for constructing visual GUI agents. However, the constrained capabilities of earlier vision-language models resulted in limited attention, with scant research (Li et al., 2021; Li and Li, 2022; Zhang et al., 2023) largely confined to an Android dataset (Deka et al., 2017) collected in 2017.
To address this research gap, we introduce ScreenSpot, an up-to-date, realistic grounding evaluation benchmark encompassing various GUI platforms. It is designed to assess vision-language models' ability to locate screen elements based on instructions (Figure 2(b) provides some examples). ScreenSpot has two distinctive features: (1) Various GUI platforms. It includes over 600 interface screenshots from mobile (iOS, Android), desktop (macOS, Windows), and web platforms, along with 1200+ instructions and corresponding actionable elements. (2) Icons/Widgets. ScreenSpot includes a substantial number of icons and widgets in each GUI, which are more challenging to locate than texts (statistics are in Figure 4). See Appendix B for annotation details and examples.

To measure models' effectiveness in real-world scenarios, ScreenSpot is carefully curated to ensure the samples are novel and not included in existing training resources. We recruited experienced annotators to collect GUI interfaces and label instructions along with the bounding boxes for actionable elements. For mobile and desktop, annotators were asked to select commonly used apps and operations; for web, we chose several types of websites (development, shopping, forum, and tools) from the web environment WebArena (Zhou et al., 2023).

5 Experiments

In this section, we first evaluate the GUI grounding capabilities of representative LVLMs and our proposed SeeClick. Subsequently, we adapt SeeClick to mobile and web agent tasks, analyzing the correlation between advanced grounding capacity and downstream task performance, while exploring the potential of purely vision-based GUI agents.

5.1 GUI Grounding on ScreenSpot

As the foundation of visual GUI agents, GUI grounding has not received adequate attention in current LVLM evaluations (Liu et al., 2023b; Yu et al., 2023). Therefore, we evaluate LVLMs on our GUI-specific benchmark ScreenSpot.

Compared LVLMs & Evaluation. We primarily evaluated two types of LVLMs: (1) generalist LVLMs capable of tasks such as dialogue, recognition and grounding, including MiniGPT-v2 (Chen et al., 2023a), Qwen-VL (Bai et al., 2023) and GPT-4V; (2) recently released LVLMs specifically designed for GUI tasks, including Fuyu (Bavishi et al., 2023) and CogAgent (Hong et al., 2023). Considering that GUI agents are required to click on the correct position, we calculate click accuracy as the metric, defined as the proportion of test samples where the model's predicted location falls in the ground truth element's bounding box (Li et al., 2022; Zhang et al., 2023). More details about the evaluation on ScreenSpot are in Appendix B.
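The click-accuracy metric is straightforward to compute; a minimal sketch (with an illustrative sample format, not ScreenSpot's released evaluation script) is:

def click_accuracy(predictions, bboxes):
    """predictions: list of (x, y) points; bboxes: list of (left, top, right, bottom) ground-truth boxes.
    A prediction is correct when the point falls inside the target element's bounding box."""
    assert len(predictions) == len(bboxes)
    hits = 0
    for (x, y), (left, top, right, bottom) in zip(predictions, bboxes):
        if left <= x <= right and top <= y <= bottom:
            hits += 1
    return hits / len(bboxes)

# Example with normalized coordinates: the first prediction lands inside its box, the second does not.
print(click_accuracy([(0.49, 0.40), (0.10, 0.90)],
                     [(0.44, 0.38, 0.54, 0.42), (0.60, 0.10, 0.70, 0.20)]))  # -> 0.5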
Results. As shown in Table 1, while generalist LVLMs have excelled in natural image grounding, their GUI grounding performance on ScreenSpot is poor due to the significant differences between GUIs and natural images. Even GPT-4V struggles with accurately locating screen elements.

In comparison, GUI-specific LVLMs show significant improvements. SeeClick achieved the best average performance across GUI platforms and both types of elements, even with fewer parameters than CogAgent. This demonstrates the efficiency of our GUI grounding pre-training: with the rich UI elements and diverse instructions collected from the web and mobile, SeeClick quickly learns to understand human instructions for element localization, even in completely unseen scenarios like iOS and desktop. SeeClick exhibits slightly inferior performance in locating text within desktop and web compared to CogAgent, possibly due to its lower resolution and much smaller training data. Notably, all models struggle with locating icons/widgets, highlighting the difficulty of identifying and grounding non-text elements on GUIs, which is the unique challenge posed by ScreenSpot.
5.2 Visual GUI Agent Tasks

This section explores SeeClick's application to mobile and computer agent tasks: MiniWob, AITW, and Mind2Web. We trained SeeClick on the respective training splits and tested it on the test sets. Across these tasks, given the instruction and a memory of previous actions, SeeClick determines the next action solely by observing interface screenshots. The detailed task settings, action formats and interaction examples are in Appendix C.

5.2.1 MiniWob

MiniWob (Shi et al., 2017) comprises about 100 types of web automation tasks, where the agent is asked to interact with a simplified web environment to accomplish human instructions. Existing open-source training data often lacks corresponding interface screenshots (Furuta et al., 2023). Therefore, we roll out 50 successful episodes for each task using the LLM strategy in (Zheng et al., 2023), resulting in a 2.8K-episode dataset to train SeeClick.

Compared Methods & Evaluation. We compared SeeClick with a range of offline training methods. Among these, the state-of-the-art method WebGUM (Furuta et al., 2023) uses screenshots as auxiliary input but still selects HTML elements as actions. Pix2Act (Shaw et al., 2023) is the only prior vision-based approach, trained with extensive demonstration data to perform actions. To verify the effectiveness of GUI grounding pre-training, we also report the results of the LVLM baseline Qwen-VL when trained with the same 2.8K dataset.

Due to the variance in evaluation task sets among different methods (Liu et al., 2018; Furuta et al., 2023; Shaw et al., 2023), for fairness, we report performance in two groups based on the overlapping MiniWob tasks. We compute the success rate over 50 random seeds for each task and then compute the mean over all tasks as the final score. We provide task-wise scores in Appendix C.2.

Methods       Modality     Dataset   Score
Compared with text-based models over 45 tasks
CC-Net (SL)   DOM+Image    2.4M      35.6%
WebN-T5       HTML         12K       55.2%
MM-WebN-T5    HTML+Image   347K      63.4%
WebGUM        HTML+Image   2.8K      65.5%
WebGUM        HTML+Image   347K      86.1%
SeeClick      Image        2.8K      73.6%
Compared with vision-based models over 35 tasks
CC-Net (SL)   Image        2.4M      23.4%
Pix2Act       Image        1.3M      64.6%
Qwen-VL       Image        2.8K      48.4%
SeeClick      Image        2.8K      67.0%

Table 2: Average scores of different methods on MiniWob. The best results in each setting are bold. Methods achieving the highest performance with limited data are underlined. SeeClick outperforms a range of offline training methods as a purely vision-based model.

Results. As depicted in Table 2, the purely vision-based SeeClick surpassed strong baselines with substantially less training data. Notably, with an equivalent amount of 2.8K training data, it outperformed the offline sota WebGUM, which uses both HTML and screenshots as input. Moreover, thanks to the LVLM's powerful reasoning and planning abilities and our GUI grounding pre-training, SeeClick exceeded the sota visual method Pix2Act using less than 0.3% of its training data.

Furthermore, SeeClick significantly surpassed the LVLM baseline Qwen-VL by nearly 20 percentage points, underscoring the importance of GUI grounding in boosting LVLM performance. To analyze in detail, we provide task-level comparisons in Figure 5. SeeClick notably excelled in tasks with dynamic interface layouts and element positions, confirming our hypothesis that general LVLMs struggle with accurately clicking, and SeeClick markedly improves this aspect.
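The scoring procedure used above (per-task success rate over 50 random seeds, then a mean over tasks) reduces to a simple two-level average; a minimal sketch with toy inputs:

def miniwob_score(results):
    """results: dict mapping task name -> list of per-seed booleans (success over 50 random seeds).
    The final score is the per-task mean success rate, averaged over tasks."""
    per_task = [sum(seeds) / len(seeds) for seeds in results.values()]
    return sum(per_task) / len(per_task)

print(miniwob_score({"click-button": [True] * 48 + [False] * 2,
                     "choose-date": [True] * 10 + [False] * 40}))  # -> 0.58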
Figure 5: Comparison of SeeClick and Qwen-VL on MiniWob. Tasks marked with yellow shadows feature
dynamic webpage layouts, simulating real-world GUI agent applications (details in appendix Figure 11). SeeClick
outperformed Qwen-VL in most tasks, highlighting the effectiveness of GUI grounding pre-training.
Table 3: Average scores of different methods on AITW. ClickAcc calculates the accuracy of click operation. The
best results in each column are bold. SeeClick exhibits the best performance among competing baselines.
5.2.2 AITW

We evaluate SeeClick in smartphone environments with the Android automation dataset Android In The Wild (AITW) (Rawles et al., 2023), which encompasses 30k instructions and corresponding 715k operation trajectories. Previous approaches split train/val/test episode-wise, which poses a clear risk of overfitting because (1) instructions in the test set have appeared in training, and (2) there are on average 20 similar trajectories per instruction. In this work, we opt for an instruction-wise split, with 545/688/306/700/700 instructions from General/Install/GoogleApps/Single/WebShopping respectively, and retain one trajectory per instruction. We selected 80% for training and the remaining for testing in each subset. This split avoids overfitting and reflects the performance of agents on unseen instructions. Further details and results on the original split are in Appendix C.3.
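A minimal sketch of this instruction-wise splitting (hypothetical data structures; AITW's actual loader differs) is:

import random

def instruction_wise_split(episodes, train_ratio=0.8, seed=0):
    """episodes: list of dicts with keys 'instruction' and 'trajectory'.
    Keep one trajectory per instruction, then split by instruction so that
    no test instruction appears in training."""
    by_instruction = {}
    for ep in episodes:                       # deduplicate: one episode per instruction
        by_instruction.setdefault(ep["instruction"], ep)
    instructions = sorted(by_instruction)
    random.Random(seed).shuffle(instructions)
    cut = int(len(instructions) * train_ratio)
    train = [by_instruction[i] for i in instructions[:cut]]
    test = [by_instruction[i] for i in instructions[cut:]]
    return train, test

episodes = [{"instruction": f"task {i % 5}", "trajectory": []} for i in range(20)]
train, test = instruction_wise_split(episodes)
print(len(train), len(test))  # -> 4 1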
Compared Methods & Evaluation. We compare SeeClick with two types of baselines: (1) API-based LLMs such as ChatGPT-CoT (Zhan and Zhang, 2023), PaLM2-CoT (Rawles et al., 2023) and the latest GPT-4V (Yan et al., 2023); (2) our trained LVLM baseline Qwen-VL. We follow Rawles et al. (2023) to adopt the screen-wise action matching score as the main metric, and additionally compute the click accuracy (ClickAcc), which calculates the accuracy when both the reference and the prediction are click operations.

Results. As illustrated in Table 3, SeeClick achieved the best average performance among both API-based LLMs and trained LVLMs. Specifically, SeeClick exhibited a 9% increase in click accuracy over Qwen-VL, supporting the idea that GUI grounding enhances agent task performance through precise clicking.

5.2.3 Mind2Web

To assess SeeClick's capabilities in web navigation, we utilize the recently introduced Mind2Web dataset (Deng et al., 2023), which comprises over 2000 open-ended tasks collected from 137 real websites, each with a high-level instruction and a corresponding human action trajectory. Mind2Web was originally designed for text-based agents, which select actionable elements from simplified HTML in each step. This work explores visual web agents that predict click positions directly from screenshots. For this purpose, we parsed screenshots and target element bounding boxes from the raw dump of Mind2Web. To the best of our knowledge, this is the first attempt to build web agents that rely solely on screenshots as inputs for navigating real websites.

Compared Methods & Evaluation. We compare with HTML-based web agents such as MindAct (Deng et al., 2023).
Methods         w/o HTML   Cross-Task                Cross-Website             Cross-Domain
                           Ele.Acc  Op.F1  Step SR   Ele.Acc  Op.F1  Step SR   Ele.Acc  Op.F1  Step SR
MindAct (gen)   No         20.2     52.0   17.5      13.9     44.7   11.0      14.2     44.7   11.9
MindAct         No         55.1     75.7   52.0      42.0     65.2   38.9      42.1     66.5   39.6
GPT-3.5-Turbo   No         20.3     56.6   17.4      19.3     48.8   16.2      21.6     52.8   18.6
GPT-4           No         41.6     60.6   36.2      35.8     51.1   30.1      37.1     46.5   26.4
Qwen-VL         Yes        15.9     86.7   13.3      13.2     83.5   9.2       14.1     84.3   12.0
SeeClick        Yes        28.3     87.0   25.5      21.4     80.6   16.4      23.2     84.8   20.8
Table 4: Comparison of methods on Mind2Web. The best results in each column are bold. Improvements of SeeClick over the LVLM baseline are underlined, with GUI grounding pre-training nearly doubling the step success rate.
Limitations
SeeClick currently simplifies the GUI action space
to mainly focus on clicking and typing, excluding
complex actions like dragging and double-clicking.
Additionally, limited by the performance of open-
source LVLMs, training on agent-specific data is
necessary for SeeClick to execute multi-step tasks
on interfaces like mobile and computer.
Ethical considerations
GUI agents are developed to automate tasks and
enhance efficiency on digital devices. These tech-
nologies are especially significant for individuals
with visual impairments. Here are some ethical
considerations:
Privacy Issues. The operation of GUI agents in-
volves accessing and interacting with user inter-
faces that may contain personal or sensitive infor-
mation. Ensuring data protection and user consent
are paramount to maintaining privacy integrity.
Safety in Real-World Interactions. When GUI
agents interact with the real world, there’s a risk of
unintended harmful actions. Ensuring these agents
operate within safe parameters is crucial to prevent
negative outcomes.
Bias. The development of GUI agents must address
potential biases in their algorithms, which could
result in unequal performance across different user
groups or interface designs. Mitigating bias is es-
sential for equitable access and effectiveness.
Addressing these concerns requires ongoing re-
search and development efforts, ensuring that the
benefits of GUI agents are realized without com-
promising ethical standards.
References

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966.
Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sağnak Taşırlar. 2023. Introducing our multimodal models.
Andrea Burns, Deniz Arsan, Sanjna Agrawal, Ranjitha Kumar, Kate Saenko, and Bryan A Plummer. 2022. A dataset for interactive vision-language navigation with unknown command feasibility. In European Conference on Computer Vision, pages 312–328. Springer.
Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. 2023a. Minigpt-v2: Large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478.
Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. 2023b. Shikra: Unleashing multimodal llm's referential dialogue magic. arXiv preprint arXiv:2306.15195.
Ting Chen, Saurabh Saxena, Lala Li, David J Fleet, and Geoffrey Hinton. 2021. Pix2seq: A language modeling framework for object detection. In International Conference on Learning Representations.
Biplab Deka, Zifeng Huang, Chad Franzen, Joshua Hibschman, Daniel Afergan, Yang Li, Jeffrey Nichols, and Ranjitha Kumar. 2017. Rico: A mobile app dataset for building data-driven design applications. In Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology, pages 845–854.
Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. 2023. Mind2web: Towards a generalist agent for the web. arXiv preprint arXiv:2306.06070.
Hiroki Furuta, Ofir Nachum, Kuang-Huei Lee, Yutaka Matsuo, Shixiang Shane Gu, and Izzeddin Gur. 2023. Multimodal web navigation with instruction-finetuned foundation models. arXiv preprint arXiv:2305.11854.
Difei Gao, Lei Ji, Zechen Bai, Mingyu Ouyang, Peiran Li, Dongxing Mao, Qinchen Wu, Weichen Zhang, Peiyi Wang, Xiangwu Guo, et al. 2023. Assistgui: Task-oriented desktop graphical user interface automation. arXiv preprint arXiv:2312.13108.
Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. 2023. A real-world webagent with planning, long context understanding, and program synthesis. arXiv preprint arXiv:2307.12856.
Izzeddin Gur, Ulrich Rueckert, Aleksandra Faust, and Dilek Hakkani-Tur. 2018. Learning to navigate the web. In International Conference on Learning Representations.
Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. 2023. Cogagent: A visual language model for gui agents. arXiv preprint arXiv:2312.08914.
Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. 2021. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations.
Geunwoo Kim, Pierre Baldi, and Stephen McAleer. 2023. Language models can solve computer tasks. arXiv preprint arXiv:2303.17491.
Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. 2023. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726.
Gang Li and Yang Li. 2022. Spotlight: Mobile UI understanding using vision-language models with a focus. In The Eleventh International Conference on Learning Representations.
Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. 2022. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965–10975.
Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, and Jason Baldridge. 2020a. Mapping natural language instructions to mobile UI action sequences. arXiv preprint arXiv:2005.03776.
Yang Li, Gang Li, Luheng He, Jingjie Zheng, Hong Li, and Zhiwei Guan. 2020b. Widget captioning: Generating natural language description for mobile user interface elements. arXiv preprint arXiv:2010.04295.
Yang Li, Gang Li, Xin Zhou, Mostafa Dehghani, and Alexey Gritsenko. 2021. Vut: Versatile UI transformer for multi-modal multi-task user interface modeling. arXiv preprint arXiv:2112.05692.
Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, and Percy Liang. 2018. Reinforcement learning on web interfaces using workflow-guided exploration. In International Conference on Learning Representations.
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023a. Visual instruction tuning. In Neural Information Processing Systems.
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. 2023b. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281.
OpenAI. 2023. GPT-4 technical report.
Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. 2023. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824.
Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. 2023. Android in the wild: A large-scale dataset for android device control. arXiv preprint arXiv:2307.10088.
Peter Shaw, Mandar Joshi, James Cohan, Jonathan Berant, Panupong Pasupat, Hexiang Hu, Urvashi Khandelwal, Kenton Lee, and Kristina Toutanova. 2023. From pixels to UI actions: Learning to follow instructions via graphical user interfaces. In Advances in Neural Information Processing Systems.
Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, and Percy Liang. 2017. World of bits: An open-domain platform for web-based agents. In International Conference on Machine Learning, pages 3135–3144. PMLR.
Qiushi Sun, Zhangyue Yin, Xiang Li, Zhiyong Wu, Xipeng Qiu, and Lingpeng Kong. 2023. Corex: Pushing the boundaries of complex reasoning through multi-model collaboration. arXiv preprint arXiv:2310.00280.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
Bryan Wang, Gang Li, Xin Zhou, Zhourong Chen, Tovi Grossman, and Yang Li. 2021. Screen2words: Automatic mobile UI summarization with multimodal learning. In The 34th Annual ACM Symposium on User Interface Software and Technology, pages 498–510.
Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. 2023. Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. arXiv preprint arXiv:2305.11175.
Zhiyong Wu, Chengcheng Han, Zichen Ding, Zhenmin Weng, Zhoumianze Liu, Shunyu Yao, Tao Yu, and Lingpeng Kong. 2024. Os-copilot: Towards generalist computer agents with self-improvement.
Fangzhi Xu, Zhiyong Wu, Qiushi Sun, Siyu Ren, Fei Yuan, Shuai Yuan, Qika Lin, Yu Qiao, and Jun Liu. 2023. Symbol-llm: Towards foundational symbol-centric interface for large language models. arXiv preprint arXiv:2311.09278.
An Yan, Zhengyuan Yang, Wanrong Zhu, Kevin Lin, Linjie Li, Jianfeng Wang, Jianwei Yang, Yiwu Zhong, Julian McAuley, Jianfeng Gao, et al. 2023. Gpt-4v in wonderland: Large multimodal models for zero-shot smartphone gui navigation. arXiv preprint arXiv:2311.07562.
Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. 2023a. Appagent: Multimodal agents as smartphone users. arXiv preprint arXiv:2312.13771.
Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. 2023b. The dawn of lmms: Preliminary explorations with gpt-4v(ision). arXiv preprint arXiv:2309.17421.
Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. 2023. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178.
Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. 2023. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490.
Zhuosheng Zhan and Aston Zhang. 2023. You only look at screens: Multimodal chain-of-action agents. arXiv preprint arXiv:2309.11436.
Zhizheng Zhang, Wenxuan Xie, Xiaoyi Zhang, and Yan Lu. 2023. Reinforced UI instruction grounding: Towards a generic UI task automation API. arXiv preprint arXiv:2310.04716.
Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. 2024. Gpt-4v(ision) is a generalist web agent, if grounded. arXiv preprint arXiv:2401.01614.
Longtao Zheng, Rundong Wang, Xinrun Wang, and Bo An. 2023. Synapse: Trajectory-as-exemplar prompting with memory for computer control. In NeurIPS 2023 Foundation Models for Decision Making Workshop.
Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, et al. 2023. Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854.
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. 2023. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592.
A Details of SeeClick Pre-training
A.1 Pre-training Tasks
SeeClick employs pre-training tasks as outlined in Table 6. For the grounding task, we incorporate two forms: predicting center-point coordinates (text_2_point) and predicting bounding boxes (text_2_bbox). For the task of generating text for elements (similar to OCR), we also include two categories: predicting text based on center-point coordinates (point_2_text, i.e., widget captioning) and based on bounding boxes (bbox_2_text). Our preliminary experiments indicated that predicting points was slightly better than predicting bounding boxes, likely due to the variable sizes of UI elements. Consequently, we increased the proportion of data with point localization. Finally, about 1 million samples are used for the continual pre-training of SeeClick.

For tasks involving coordinates, positions are represented as either a point (x, y) or a bounding box (left, top, right, down), where each value is a two-decimal number in the range [0,1] indicating the ratio of the corresponding position to the width or height of the image. Figure 7 provides some examples of the pre-training data.

Domain    Task                 Sample Num
Web       text_2_point         271K
Web       text_2_bbox          54K
Web       point_2_text         54K
Web       bbox_2_text          54K
Mobile    text_2_point         274K
Mobile    text_2_bbox          56K
Mobile    UI summarization     48K
Mobile    widget captioning    42K
General   LLaVA                145K
Total                          1M

Table 6: All training data used by SeeClick.

Task: Web text_2_point
User: In the provided screenshot, I'll describe webpage elements for you to locate (with point).
User: Additional information
Assistant: (0.66,0.86)
User: USB Flash Drive
Assistant: (0.37,0.17)
User: Qty
Assistant: (0.63,0.41)
User: GAMING ACCESSORIES (AUSTRALIA)
Assistant: (0.30,0.13)
......
User: Add to Wishlist
Assistant: (0.52,0.57)

Task: Web bbox_2_text
User: In this screenshot, I'll give coordinates and ask you to describe the text of the elements there (with bbox).
User: (0.61,0.84,0.71,0.88)
Assistant: Additional information
User: (0.34,0.16,0.40,0.18)
Assistant: USB Flash Drive
User: (0.61,0.40,0.64,0.43)
Assistant: Qty
User: (0.26,0.16,0.33,0.18)
Assistant: GAMING ACCESSORIES (AUSTRALIA)
......
User: (0.48,0.56,0.55,0.58)
Assistant: Add to Wishlist

Task: Mobile text_2_point
User: In this digital interface, to initiate "click to start a search", where is my point of interest (with point)?
Assistant: (0.88,0.87)

Task: Mobile UI summarization
User: Can you provide a detailed description of the interface screenshot shown?
Assistant: An app home that displays a daily learning goal punch list

Figure 7: Examples of SeeClick pre-training tasks.
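As a concrete illustration of this coordinate convention (a sketch under the stated two-decimal format, not the authors' data-generation code), the following converts a pixel-space element box into the point and bbox strings used by the four task types; the example element and image size are made up.

def to_point_str(bbox, img_w, img_h):
    """Format the element's center as a normalized '(x,y)' string with two decimals."""
    left, top, right, bottom = bbox
    x = (left + right) / 2 / img_w
    y = (top + bottom) / 2 / img_h
    return f"({x:.2f},{y:.2f})"

def to_bbox_str(bbox, img_w, img_h):
    """Format the element's box as a normalized '(left,top,right,down)' string with two decimals."""
    left, top, right, bottom = bbox
    return f"({left / img_w:.2f},{top / img_h:.2f},{right / img_w:.2f},{bottom / img_h:.2f})"

# A 1920x1080 web screenshot with an element spanning pixels (1152, 896)-(1344, 952).
element = (1152, 896, 1344, 952)
print(to_point_str(element, 1920, 1080))  # -> (0.65,0.86)
print(to_bbox_str(element, 1920, 1080))   # -> (0.60,0.83,0.70,0.88)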
A.2 Training Configurations

We employed the aforementioned data for continual pre-training of Qwen-VL-Chat to develop SeeClick. To enhance the LVLM's understanding of GUI images, we unlocked the gradients of its visual encoder and applied LoRA for fine-tuning. We adopt AdamW as the optimizer and use a cosine annealing scheduler with an initial learning rate of 3e-5 and a global batch size of 64. All training takes around 24 hours on 8 NVIDIA A100 GPUs.

B ScreenSpot Annotation & Evaluation

B.1 Human Annotation

We convened four experienced annotators, all either Ph.D. or master students in computer science, proficient in using mobile phones and computers and familiar with GUI operations. Initially, we assigned different GUI types to the annotators, such as iOS, Windows, and Web. Then, annotators were required to capture screenshots during their routine use (e.g., various apps) and subsequently annotate the clickable regions of frequently interacted elements using bounding boxes with an annotation tool (http://makesense.bimant.com). Finally, these annotators were instructed to write
Figure 8: SeeClick on ScreenSpot. Blue dashed boxes represent the ground truth bounding boxes, while green and
red pointers indicate correct and incorrect predictions.
B.4 SeeClick Case Study & Error Analysis

Figure 8 presents some examples of SeeClick on ScreenSpot. SeeClick can comprehend human instructions and accurately locate screen elements. To conduct a detailed analysis of localization performance, we quantified the distances between predicted points and the ground truth (the center of target elements) in Figure 9. It's noteworthy that even

C.1 Formulation of SeeClick as Visual GUI Agent

Action Space. SeeClick involves common human-UI interaction operations. Following AITW, we assigned an action_type id to each action type for model prediction.
• click(x,y): 4. A click action at (x,y), where each value is a [0,1] number indicating the ratio of the corresponding position to the width or height of the image.
• type("typed_text"): 3. An action of typing a piece of text.
• select("value"): 2. An action for selecting an option from a dropdown menu on a webpage.
• swipe(direction): Swipe actions for the screen; swipe up/down/left/right are assigned the ids 1, 0, 8, and 9 respectively.
• PRESS BACK: 5. The action for returning to the previous step.
• PRESS HOME: 6. The action for returning to the homepage.
• PRESS ENTER: 7. The action of pressing the ENTER key to submit input content.
The first two actions, clicking and typing, are universally applicable across various GUIs. The third action, select, is defined according to the specifications in Mind2Web. The latter four actions, along with two additional states, TASK COMPLETE and TASK IMPOSSIBLE, are defined following the AITW framework for Android environments.

          Gen.   Inst.   GApps.   Sing.   WShop.   Ovr.
Auto-UI   68.2   76.9    71.4     84.6    70.3     74.3
CogAgent  65.4   78.9    75.0     93.5    71.1     76.9
SeeClick  67.6   79.6    75.9     84.6    73.1     76.2

Table 7: Comparison on the origin split of AITW.

Agent Formulation. SeeClick is an autonomous agent capable of executing human instructions on GUIs. It takes as input the instruction, a screenshot of the current interface and a series of (k=4 in our setting) previous actions, to predict the next action to be taken. Specifically, SeeClick uses the following prompt to execute each step of the agent:

<img>Image</img>
User: Please generate the next move according to the UI screenshot, instruction and previous actions.
Instruction: <instruction>
Previous actions:
Step1: <step1>
Step2: <step2>
Step3: <step3>
Step4: <step4>
SeeClick: <next action>

During training and testing, we organize the data by step into the format described above.

C.2 MiniWob

MiniWob is a classic simplified web agent environment, built on Chrome, allowing low-level operations such as clicking and typing. It comprises around 100 tasks, where each task can templatize random variants and corresponding instructions controlled by a random seed, creating up to billions of possible task instances. We use the 50 successful trajectories for each task provided in (Zheng et al., 2023) for training and test each task with 50 random seeds, following standard practices.

We report the average success rate across random seeds and tasks, automatically provided by the MiniWob environment. A task is considered successfully completed if executed correctly, while incorrect executions or exceeding the maximum number of actions (set as 30 here) are counted as failures. For the baselines in Table 2, we use the task-wise scores provided in their papers to calculate the average score for tasks overlapping with SeeClick. We also provide a task-wise comparison in Table 8.

C.3 AITW

AITW is a recently collected dataset for Android smartphone automation, where each sample comprises an instruction and an action trajectory with screenshots. AITW is divided into five subsets: General, Install, GoogleApps, Single, and WebShopping, in total including over 30K instructions and 700K episodes.

Despite AITW's large scale, as stated in Section 5.2.2, the current train-test split poses a significant risk of overfitting, leading to experimental results that do not accurately reflect an agent's generalization ability in the real world. We also conducted experiments on SeeClick using the original split; as shown in Table 7, SeeClick is comparable to CogAgent's performance. We believe that due to the severe overfitting, designing new agent frameworks or enlarging model size is unlikely to yield much improvement on this split.

To address the aforementioned issue, we propose to divide the train/val/test sets in an instruction-wise manner. Specifically, we selected 545/688/306/700/700 instructions from the General/Install/GoogleApps/Single/WebShopping subsets, and retained only one annotated episode per instruction. To avoid imbalance in joint training, we randomly chose 700 instructions from Single and WebShopping. Given the similarity among instructions within Single and WebShopping, these 700 instructions are representative of performance on these two subsets. Next, we allocate 80% for training and the remaining 20% for testing, and select an additional 5*100 episodes from the original data to form the validation set. The data used for training, validation, and testing will be open-sourced to serve as an effective evaluation.

The other settings are consistent with previous work, calculating a screen-wise matching score that considers both the correctness of the action type and its value (e.g., the click point or typed text). The screen-wise matching score correlates with the task completion score judged by humans (Rawles et al., 2023).
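A simplified sketch of such step-wise matching (my assumption-level reading of the metric, not AITW's official evaluator, which has additional rules for near-miss clicks and scrolls; reference steps are assumed to carry the target element's box) is:

def action_matches(pred, ref):
    """pred/ref: dicts like {"action_type": 4, "click_point": (x, y)} or {"action_type": 3, "typed_text": "36"}.
    ref may additionally carry the target element's bbox for click actions."""
    if pred["action_type"] != ref["action_type"]:
        return False
    if ref["action_type"] == 4:  # click: predicted point must fall in the reference element's bbox
        left, top, right, bottom = ref["bbox"]
        x, y = pred["click_point"]
        return left <= x <= right and top <= y <= bottom
    if ref["action_type"] == 3:  # type: typed text must match
        return pred["typed_text"].strip().lower() == ref["typed_text"].strip().lower()
    return True                  # press/swipe actions: the type id alone determines the action

def screen_matching_score(pred_steps, ref_steps):
    return sum(action_matches(p, r) for p, r in zip(pred_steps, ref_steps)) / len(ref_steps)

print(screen_matching_score(
    [{"action_type": 4, "click_point": (0.81, 0.38)}, {"action_type": 3, "typed_text": "36"}],
    [{"action_type": 4, "bbox": (0.75, 0.35, 0.85, 0.42)}, {"action_type": 3, "typed_text": "36"}]))  # -> 1.0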
C.4 Mind2Web

Mind2Web is a recently proposed dataset for developing generalist web agents for real-world websites, originally designed for text-based agents. Therefore, the original observation in each step only includes the HTML code of the current webpage. To train and evaluate vision-based agents, we extracted web screenshots and the bounding boxes of target operational elements for each step from Mind2Web's raw dump. One issue with Mind2Web's original HTML observation is that it captures the entire page, including scrolling, so its screenshots are long captures (e.g., 1920*12000). Predicting operational positions from such high-resolution long screenshots is impractical for current LVLMs and does not align with human operations. To address this, for target elements not at the top, we randomly crop around their location, maintaining a consistent screenshot resolution of 1920*1080 for all observed interfaces.

Mind2Web first calculates Element Accuracy (Ele.Acc), which compares the predicted element with the ground truth, and Operation F1 (Op.F1), which calculates the token-level F1 score for the predicted operation. Operation F1 is equivalent to the accuracy of click operations but takes into account the correctness of input values for type and select operations. For our vision-based approach, Element Accuracy is computed as the accuracy of predicted click points falling in the ground truth elements' bounding box. Then, a step is considered successful (Step SR) if both the predicted element and operation are correct.

C.5 Case Study

MiniWob. Figure 11(a) illustrates the difference between static and dynamic layout tasks. Static layout tasks have fixed element positions during training and testing, while dynamic layout tasks display varying interfaces and element positions across instructions, further challenging the agent's ability to accurately locate the target. Figure 11(b) provides examples of SeeClick's interaction with MiniWob. SeeClick relies solely on the interface screenshot for arithmetic, reasoning, etc.

AITW. Figure 12 provides SeeClick's operations on AITW. Predictions marked in red below indicate that they were counted as incorrect in AITW. Some errors occur because the current step's answer is not unique. For example, in step 5 the model's predicted input "DuckDuckGo Privacy Browser" is also a potentially correct action.

Mind2Web. Figure 13 shows several examples of SeeClick on the real-world website benchmark Mind2Web. SeeClick can comprehend instructions and click on the correct elements within complex interfaces.
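To make the Mind2Web metrics of Appendix C.4 concrete, here is a minimal, assumption-level sketch (illustrative step format, not the official evaluation script; "operation correct" is simplified to an exact token match):

from collections import Counter

def element_correct(pred_point, gt_bbox):
    """Ele.Acc for the vision-based setting: predicted click point inside the ground-truth bbox."""
    (x, y), (left, top, right, bottom) = pred_point, gt_bbox
    return left <= x <= right and top <= y <= bottom

def operation_f1(pred_op: str, gt_op: str):
    """Token-level F1 between predicted and reference operation strings, e.g. 'TYPE Smith'."""
    pred, gt = pred_op.lower().split(), gt_op.lower().split()
    common = sum((Counter(pred) & Counter(gt)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(gt)
    return 2 * precision * recall / (precision + recall)

def step_success(pred_point, pred_op, gt_bbox, gt_op):
    """Step SR: both the element and the operation must be correct (exact match assumed here)."""
    return element_correct(pred_point, gt_bbox) and operation_f1(pred_op, gt_op) == 1.0

print(step_success((0.46, 0.62), "TYPE Smith", (0.40, 0.60, 0.55, 0.65), "TYPE Smith"))  # -> True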
Task CC-Net (SL) WebN-T5 WebGUM Pix2Act Qwen-VL SeeClick
Choose-date 0.12 0.00 0.13 0.06 0.0 0.02
Click-button 0.78 1.0 1.0 0.32 0.42 0.96
Click-button-sequence 0.47 1.0 1.0 1.0 0.08 0.86
Click-checkboxes 0.32 0.96 1.0 0.99 0.44 0.78
Click-checkboxes-large 0.0 0.22 0.99 1.0 0.0 0.02
Click-checkboxes-soft 0.04 0.54 0.98 0.91 0.06 0.22
Click-checkboxes-transfer 0.36 0.63 0.99 0.76 0.60 0.70
Click-collapsible-2 0.17 0.00 0.95 0.31 0.0 0.48
Click-collapsible 0.81 0.00 0.98 0.80 1.0 1.0
Click-color 0.82 0.27 0.34 0.88 0.96 1.0
Click-dialog 0.95 1.0 1.0 0.12 0.96 1.0
Click-dialog-2 0.88 0.24 0.43 0.73 0.84 1.0
Click-link 0.59 1.0 1.0 0.86 0.0 0.90
Click-option 0.21 0.37 1.0 0.0 0.70 1.0
Click-pie 0.15 0.51 0.99 0.81 0.16 0.80
Click-shades 0.04 0.0 0.0 0.76 0.0 0.02
Click-shape 0.11 0.53 0.72 0.19 0.04 0.52
Click-tab 0.95 0.74 1.0 0.54 1.0 1.0
Click-tab-2 0.27 0.18 0.95 0.52 0.0 0.60
Click-tab-2-hard 0.19 0.12 0.95 0.0 0.16 0.42
Click-test 1.0 1.0 1.0 1.0 1.0 1.0
Click-test-2 0.95 1.0 1.0 1.0 0.72 0.94
Click-widget 0.56 1.0 1.0 0.87 0.38 0.58
Count-shape 0.21 0.41 0.68 0.0 0.20 0.28
Copy-paste 0.04 - - - 0.96 0.80
Copy-paste-2 0.01 - - - 0.96 0.80
Email-inbox 0.09 0.38 0.99 - 0.08 0.80
Email-inbox-forward-nl 0.0 0.6 1.0 - 0.24 0.74
Email-inbox-forward-nl-turk 0.0 0.33 1.0 - 0.16 0.56
Email-inbox-nl-turk 0.05 0.23 0.98 - 0.40 0.68
Enter-date 0.02 0.0 1.0 0.59 1.0 1.0
Enter-password 0.02 0.97 1.0 - 1.0 1.0
Enter-text 0.35 0.89 1.0 - 1.0 1.0
Enter-text-dynamic 0.39 0.98 1.0 - 0.96 1.0
Focus-text 0.99 1.0 1.0 - 1.0 1.0
Focus-text-2 0.96 1.0 1.0 - 0.84 0.96
Find-word 0.05 - - - 1.0 0.10
Grid-coordinate 0.66 0.49 1.0 0.97 0.96 0.52
Guess-number 0.21 0.0 0.11 - 1.0 1.0
Login-user 0.0 0.82 1.0 - 1.0 1.0
Login-user-popup 0.02 0.72 0.99 - 0.86 0.98
Multi-layouts 0.00 0.83 1.0 - 0.44 0.72
Multi-orderings 0.0 0.88 1.0 - 0.42 0.86
Identify-shape 0.68 - - 0.94 1.0 0.68
Navigate-tree 0.32 0.91 1.0 0.07 0.60 0.82
Search-engine 0.15 0.34 0.96 - 0.56 0.84
Simple-algebra 0.03 - - 0.99 0.48 0.38
Simple-arithmetic 0.38 - - 0.67 0.92 0.78
Text-transform 0.19 - - 0.91 0.36 0.46
Tic-tac-toe 0.32 0.48 0.56 0.76 0.30 0.58
Unicode-test 0.86 0.64 0.54 0.98
Use-autocomplete 0.07 0.22 0.98 0.95 0.72 0.82
Use-slider 0.18 - - 0.69 0.38 0.32
Use-spinner 0.47 0.07 0.11 - 0.24 0.16
Read-table 0.01 - - - 0.90 0.72
Average 0.336 (55) 0.552 (45) 0.861 (45) 0.646 (35) 0.564 (55) 0.712 (55)

Table 8: Task-wise scores of different methods on MiniWob. Numbers in parentheses in the last row indicate the number of tasks over which each method's average is computed.
(a) Comparison between static layout (left, click-color) and dynamic layout (right, unicode-test).
[Figure panels: interface screenshots with SeeClick's per-step predictions underneath, e.g., {"action_type": 4, "click_point": (0.50, 0.64)} for clicks and {"action_type": 3, "typed_text": "36"} for typing; shown tasks include MiniWob choose-date and a Mind2Web episode with the instruction "Download the e-receipt with the last name Smith and confirmation number X123456989."]
Figure 13: Example episodes of SeeClick on Mind2Web. The model’s prediction output is below the screenshot,
with action_type 4 indicating a click and action_type 3 indicating typing. Steps with the red prediction and green
reference bounding box indicate a failed step.