SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

Kanzhi Cheng♢♡∗ Qiushi Sun♡ Yougang Chu♢ Fangzhi Xu♡


Yantao Li♢ Jianbing Zhang♢ Zhiyong Wu♡

Department of Computer Science and Technology, Nanjing University

Shanghai AI Laboratory
{chengkz,chuyg,li_yantao}@smail.nju.edu.cn [email protected]
[email protected] [email protected] [email protected]
Abstract

Graphical User Interface (GUI) agents are designed to automate complex tasks on digital devices, such as smartphones and desktops. Most existing GUI agents interact with the environment through extracted structured data, which can be notably lengthy (e.g., HTML) and occasionally inaccessible (e.g., on desktops). To alleviate this issue, we propose a novel visual GUI agent, SeeClick, which only relies on screenshots for task automation. In our preliminary study, we have discovered a key challenge in developing visual GUI agents: GUI grounding, the capacity to accurately locate screen elements based on instructions. To tackle this challenge, we propose to enhance SeeClick with GUI grounding pre-training and devise a method to automate the curation of GUI grounding data. Along with the efforts above, we have also created ScreenSpot, the first realistic GUI grounding benchmark that encompasses mobile, desktop, and web environments. After pre-training, SeeClick demonstrates significant improvement on ScreenSpot over various baselines. Moreover, comprehensive evaluations on three widely used benchmarks consistently support our finding that advancements in GUI grounding directly correlate with enhanced performance in downstream GUI agent tasks.1

∗ Work done during internship at Shanghai AI Laboratory.
1 The model, data and code are available at https://github.com/njucckevin/SeeClick.

Figure 1: Text-based agents select target elements from structured texts, occasionally augmented with screenshots; e.g., for the instruction "Download the e-receipt with the last name Smith and confirmation number X123456989", a text-based agent selects <element_id=206> from simplified HTML and clicks it via Selenium code. SeeClick employs a vision-based methodology to predict action locations solely relying on screenshots, e.g., {"action": "click", "loc": [0.46, 0.62]}.

1 Introduction

A perennial topic in machine intelligence is the development of Graphical User Interface (GUI) agent systems, like Siri and Copilot, to automate complex tasks on computing devices, thereby reducing human workload (Shi et al., 2017; Li et al., 2020a). Recent advances in Large Language Models (LLMs) such as GPT-4 (OpenAI, 2023) have significantly propelled the evolution of GUI agents (Gur et al., 2023; Zhou et al., 2023). These agents interact with the environment by interpreting structured texts, e.g., HTML from webpages, then elicit the LLM for planning, reasoning, and execution (Kim et al., 2023; Zheng et al., 2023).

However, GUI agents that depend on structured text face three inherent limitations: (1) Structured text is not always accessible, especially for iOS or desktop applications where acquiring such information is challenging (Shaw et al., 2023); (2) The verbose nature of structured text constitutes an inefficient context for LLMs, while also omitting crucial information such as layout, images, and icons (Deng et al., 2023); (3) The variety of structured text, including HTML, DOM, and Android VH, necessitates the curation of task-specific observation and action spaces (Kim et al., 2023; Zhou et al., 2023). These entrenched deficiencies in text-based approaches call for an alternative solution.
In this paper, we introduce SeeClick, a visual GUI agent built on Large Vision-Language Models (LVLMs). Inspired by human interaction with GUIs, as illustrated in Figure 1, SeeClick is designed to perform low-level actions like clicking or typing directly by observing interface screenshots. This innovative approach bypasses the interaction with cumbersome structured text, empowering SeeClick to universally adapt to various GUI platforms. Building such visual agents presents a foundational challenge: GUI grounding, the capacity to accurately locate screen elements based on instructions, which is absent in current LVLMs. To tackle this challenge, SeeClick enhances the LVLM with a GUI grounding pre-training strategy. We devise a method to automate the curation of web grounding data and adapt public mobile UI datasets to obtain mobile grounding data. SeeClick employs the above-curated dataset for continual pre-training of the LVLM, enabling it to accurately locate elements such as text, widgets, and icons in various GUI environments.

Given that GUI grounding is a fundamental yet underexplored capacity for GUI agents, we establish ScreenSpot, the first realistic GUI grounding evaluation benchmark across various GUI platforms. ScreenSpot contains over 600 screenshots and 1200 instructions from iOS, Android, macOS, Windows, and webpages, and specifically includes both text-based elements and a variety of widgets and icons. Evaluation results confirm SeeClick's superiority over current LVLMs, validating the effectiveness of GUI grounding pre-training.

Finally, we adapt SeeClick to mobile and web agent tasks, including MiniWob (Shi et al., 2017), AITW (Rawles et al., 2023), and Mind2Web (Deng et al., 2023). As a purely vision-based agent, SeeClick achieves impressive performance. It surpasses the strong visual baseline Pix2Act while utilizing merely 0.3% of its training data. Moreover, experimental results on these three benchmarks consistently support our finding that improvement in GUI grounding directly correlates with enhanced agent task performance.

Our main contributions are as follows:
• We develop a unified visual GUI agent, SeeClick, which solely relies on interface screenshots to perform clicking and typing actions across diverse GUI platforms.
• We prospectively explore GUI grounding for visual GUI agents, and enhance SeeClick with the proposed GUI grounding pre-training strategy.
• We create a realistic GUI grounding benchmark, ScreenSpot, encompassing more than 1200 instructions from various GUI platforms.
• Experimental results on ScreenSpot and three agent tasks demonstrate that enhancing agents' grounding capacity is key to improving performance in downstream agent tasks.

2 Related work

Autonomous GUI Navigation. Early research explored task automation in simplified web (Shi et al., 2017; Liu et al., 2018; Gur et al., 2018) and mobile UI (Li et al., 2020a; Burns et al., 2022; Li and Li, 2022) environments. With LLM advancements (OpenAI, 2023; Touvron et al., 2023; Xu et al., 2023; Sun et al., 2023; Wu et al., 2024, inter alia), LLM-centric agents have become the dominant paradigm. A line of work focused on prompting ChatGPT and GPT-4 for web tasks, via in-context learning (Zheng et al., 2023) and self-refinement (Kim et al., 2023). Other research explored training LLMs as specialized agents: Deng et al. (2023) devised a two-stage method for identifying target elements within intricate HTML, and Gur et al. (2023) proposed to interact with websites via programming.

Given the constraint that LLMs only process text, recent efforts have attempted vision-based GUI navigation (Shaw et al., 2023; Zhan and Zhang, 2023; Hong et al., 2023). These methods primarily utilize GPT-4V (Yan et al., 2023; Gao et al., 2023) and also require GUI metadata as input (Yang et al., 2023a; Zheng et al., 2024). In this paper, we construct a universal visual GUI agent, SeeClick, by customizing an open-sourced LVLM, capable of operating across various GUI platforms without needing any GUI metadata.

Large Vision-Language Models. Recent research has invested tremendous effort in constructing LVLMs capable of jointly processing image and text (Liu et al., 2023a; Zhu et al., 2023; Ye et al., 2023; Li et al., 2023), integrating vision encoders with LLMs through connecting layers and inheriting LLMs' linguistic and reasoning skills to perform vision-language tasks. A series of studies focused on grounding with LVLMs (Wang et al., 2023; Bai et al., 2023; Chen et al., 2023a), such as providing bounding boxes for objects when generating responses (Chen et al., 2023b; Peng et al., 2023). Nonetheless, these efforts primarily addressed natural images and did not explore GUI contexts. This paper focuses on grounding in GUIs and explores the potential of LVLMs as visual agents.
Figure 2: Overview of our universal visual GUI agent SeeClick. (a) depicts the framework of SeeClick and GUI grounding pre-training: a vision encoder (ViT) and vision-language adapter feed the LVLM, which is pre-trained on mobile UI data (widget captioning, UI summarization, mobile UI grounding), web UI data (web OCR, web UI grounding), and general vision-language data (VQA, visual reasoning); e.g., for the instruction "View the new album of Jony J" the next action is click (0.49, 0.40). (b) provides examples of ScreenSpot across various GUIs and types of instructions. (c) displays the real-world application of SeeClick when adapted to downstream web agent tasks.

3 Approach

Our preliminary study highlights a major challenge in developing visual GUI agents: GUI grounding, the capacity to locate screen elements based on instructions. Although recent LVLMs have claimed grounding capability on natural images (Bai et al., 2023; Wang et al., 2023), GUI screenshots differ significantly, with dense text and numerous icons and widgets. These differences impair existing LVLMs' grounding performance in GUI contexts and limit their potential as visual GUI agents.

This paper seeks to harness LVLMs with GUI grounding skills, paving the way for a visual GUI agent that executes instructions relying only on screenshots. As presented in Figure 2, SeeClick is a foundational model for GUIs, tailored for adaptation to agent tasks. Next, we introduce the birth of SeeClick, including the formalization of the GUI grounding task, the construction of continual pre-training data, and training details.

3.1 GUI grounding for LVLMs

As GUI grounding is the core capability of SeeClick, we first elucidate how to train an LVLM built for language generation to perform grounding tasks. Consider an interface screenshot s and a collection of elements {(x_i, y_i) | i} on it, where x_i denotes the textual description of the i-th element and y_i indicates the element's location (represented as a bounding box or point). As depicted in Figure 2(a), the LVLM predicts the location of an element y based on the interface screenshot s and its textual description x, i.e., it models p(y|s, x).

A potential challenge is how LVLMs predict numerical coordinates in a language generation format. Previous studies (Chen et al., 2021; Wang et al., 2023; Shaw et al., 2023) divide the image into 1000 bins and create a new 1,000-token vocabulary {<p0>, <p1>, ..., <p999>} to represent x and y coordinates. In this work, we adopt a more intuitive manner used in LVLMs (Chen et al., 2023b; Bai et al., 2023), treating numerical values as natural language without any additional tokenization or pre-/post-processing. For instance, in Figure 2(a), for a smartphone screenshot and the instruction "View the new album of Jony J", we craft a query prompt: "In the UI, where should I click if I want to <instruction>?". Subsequently, we compute the standard cross-entropy loss between the model output and the ground truth "click (0.49, 0.40)" to optimize the LVLM.
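To make the coordinate-as-text formulation above concrete, the sketch below shows one way a grounding sample could be serialized into a prompt/target pair and how the cross-entropy loss would be restricted to the answer tokens. This is an illustrative sketch rather than the released training code; the tokenizer object and the -100 ignore-index convention are assumptions borrowed from common HuggingFace-style pipelines.

```python
# Illustrative sketch (not the released SeeClick code): build a grounding
# sample whose target coordinates are written as plain text, and compute
# next-token cross-entropy only over the answer span.
import torch
import torch.nn.functional as F

def build_grounding_sample(tokenizer, instruction, point):
    prompt = f"In the UI, where should I click if I want to {instruction}?"
    target = f"click ({point[0]:.2f}, {point[1]:.2f})"  # e.g. "click (0.49, 0.40)"
    prompt_ids = tokenizer(prompt, add_special_tokens=False).input_ids
    target_ids = tokenizer(target, add_special_tokens=False).input_ids
    input_ids = prompt_ids + target_ids
    # -100 masks the prompt positions so only the answer contributes to the loss.
    labels = [-100] * len(prompt_ids) + target_ids
    return torch.tensor([input_ids]), torch.tensor([labels])

def grounding_loss(logits, labels):
    # Shift so that position t predicts token t+1; masked positions are ignored.
    return F.cross_entropy(
        logits[:, :-1, :].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
```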
3.2 Data Construction

We train SeeClick using three collections of data: web UI data crawled from the internet, mobile UI data reorganized from public datasets, and general vision-language instruction-following data.

Web Data. Web UIs, featuring a variety of layouts and design styles across websites, are ideal for training LVLMs' general recognition and grounding capabilities across different GUI contexts. We collect approximately 300k web pages from the latest Common Crawl repository to serve as our training data for web UI. For each webpage s, we collect two types of elements from the HTML code, as exemplified in Figure 3: (1) elements that display visible text content; and (2) elements with a special "title" attribute that display descriptive text when hovering. This method ensures that we gather a series of interactable elements y and their corresponding instructions x, while encompassing a wide range of text and icon elements. In addition to the grounding task p(y|s, x), we also include a web OCR task p(x|s, y), predicting the text description based on coordinates.

Figure 3: Example of the two types of elements automatically collected from a webpage: a link whose "title" attribute reads "Previous image" and a button with the visible text "ENQUIRE NOW".
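As a rough illustration of the web data collection described above (this is not the authors' pipeline; the Playwright-based rendering, the selector list, the viewport size, and the filtering are our assumptions), the sketch below gathers both element types from a rendered page and records normalized click points:

```python
# Rough sketch of harvesting (instruction, location) pairs from a rendered page.
from playwright.sync_api import sync_playwright

def harvest_elements(url: str):
    samples = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 720})
        page.goto(url)
        # Type (1): elements with visible text; Type (2): elements carrying a "title" attribute.
        for handle in page.query_selector_all("a, button, input, span, div, [title]"):
            box = handle.bounding_box()
            if not box or box["width"] == 0 or box["height"] == 0:
                continue  # skip invisible elements
            text = (handle.get_attribute("title") or handle.inner_text() or "").strip()
            if not text:
                continue
            cx = (box["x"] + box["width"] / 2) / 1280   # normalized center x
            cy = (box["y"] + box["height"] / 2) / 720   # normalized center y
            samples.append({"instruction": text, "point": (round(cx, 2), round(cy, 2))})
        browser.close()
    return samples
```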
Mobile Data. For mobile UI, we include three types of data: widget captioning, mobile UI grounding, and mobile UI summarization. The widget captioning dataset provides language descriptions for mobile UI elements; for example, the description "play music" for the play button on a music player interface. We utilize the training split of the dataset provided by Li et al. (2020b), containing nearly 20k screenshots, 40k widgets, and 100k descriptions. We derive mobile UI grounding data by reversing the process of widget captioning, treating language descriptions as instructions and the corresponding widgets as target elements. To improve diversity, we also incorporate the automatically collected elements and instructions from RICO (Li et al., 2020a). The mobile data involves diverse elements and instructions, facilitating the generalization of SeeClick's grounding proficiency to diverse GUI contexts. We finally include mobile UI summarization data (Wang et al., 2021) to enhance overall interface comprehension.

General Data. To maintain the LVLM's general capacities on natural images, we incorporate general vision-language instruction-following data from LLaVA (Liu et al., 2023a), covering conversation, detailed description, and complex reasoning.

Finally, we mix the data above and craft 30 task-specific prompts for each added GUI task, resulting in a 1M-sample dataset to train SeeClick.

3.3 Training Details

We build SeeClick through continual pre-training on a recent advanced LVLM, Qwen-VL (Bai et al., 2023), which possesses grounding capabilities and a higher resolution of 448*448. We train Qwen-VL on the dataset we constructed (as described in Section 3.2) for about 10k steps (around 1 epoch) to obtain our GUI base model SeeClick. During training, we employ LoRA (Hu et al., 2021) to fine-tune both the visual encoder and the LLM. Further details and task examples are provided in Appendix A.
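A hedged sketch of a training setup consistent with the description above (LoRA adapters over a Qwen-VL-style base model, roughly 10k steps; optimizer settings follow Appendix A.2: AdamW with a cosine schedule and an initial learning rate of 3e-5). The LoRA rank/alpha, target module names, and warmup steps below are illustrative assumptions, not values reported by the authors.

```python
# Not the released SeeClick training script; a minimal configuration sketch.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, get_cosine_schedule_with_warmup

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", trust_remote_code=True, torch_dtype=torch.bfloat16
)
lora_cfg = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05,
    # Placeholder module names intended to cover both the LLM and the vision encoder.
    target_modules=["c_attn", "attn.c_proj", "w1", "w2"],
)
model = get_peft_model(model, lora_cfg)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=10_000  # ~1 epoch
)
```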
Model        Size   GUI Specific | Mobile Text | Mobile Icon/Widget | Desktop Text | Desktop Icon/Widget | Web Text | Web Icon/Widget | Average
MiniGPT-v2   7B     ✗              8.4%          6.6%                 6.2%           2.9%                  6.5%       3.4%              5.7%
Qwen-VL      9.6B   ✗              9.5%          4.8%                 5.7%           5.0%                  3.5%       2.4%              5.2%
GPT-4V       -      ✗              22.6%         24.5%                20.2%          11.8%                 9.2%       8.8%              16.2%
Fuyu         8B     ✓              41.0%         1.3%                 33.0%          3.6%                  33.9%      4.4%              19.5%
CogAgent     18B    ✓              67.0%         24.0%                74.2%          20.0%                 70.4%      28.6%             47.4%
SeeClick     9.6B   ✓              78.0%         52.0%                72.2%          30.0%                 55.7%      32.5%             53.4%

Table 1: Results of different LVLMs on ScreenSpot. The best results in each column are highlighted in bold. Benefiting from efficient GUI grounding pre-training, SeeClick significantly enhances LVLMs' ability to locate GUI elements following instructions, and surpasses the strong baseline CogAgent with a smaller model size.

Figure 4: Statistics of our proposed GUI grounding benchmark ScreenSpot. The left illustrates the diverse GUI environments included (mobile: iOS, Android; desktop: Windows, macOS; web: development, shopping, forum, and tools sites). The right displays the types of elements (text vs. icon/widget) within each GUI category.

4 ScreenSpot: A Grounding Benchmark

We recognize GUI grounding proficiency as essential for constructing visual GUI agents. However, the constrained capabilities of earlier vision-language models resulted in limited attention, with scant research (Li et al., 2021; Li and Li, 2022; Zhang et al., 2023) largely confined to an Android dataset (Deka et al., 2017) collected in 2017.

To address this research gap, we introduce ScreenSpot, an up-to-date, realistic grounding evaluation benchmark encompassing various GUI platforms. It is designed to assess vision-language models' ability to locate screen elements based on instructions (Figure 2(b) provides some examples). ScreenSpot has two distinctive features: (1) Various GUI platforms. It includes over 600 interface screenshots from mobile (iOS, Android), desktop (macOS, Windows), and web platforms, along with 1200+ instructions and corresponding actionable elements; (2) Icons/Widgets. ScreenSpot includes a substantial number of icons and widgets in each GUI, which are more challenging to locate than texts (statistics are in Figure 4). See Appendix B for annotation details and examples.

To measure models' effectiveness in real-world scenarios, ScreenSpot is carefully curated to ensure the samples are novel and not included in existing training resources. We recruited experienced annotators to collect GUI interfaces and label instructions along with the bounding boxes of actionable elements. For mobile and desktop, annotators were asked to select commonly used apps and operations; for web, we chose several types of websites (development, shopping, forum, and tools) from the web environment WebArena (Zhou et al., 2023).

5 Experiments

In this section, we first evaluate the GUI grounding capabilities of representative LVLMs and our proposed SeeClick. Subsequently, we adapt SeeClick to mobile and web agent tasks, analyzing the correlation between advanced grounding capacity and downstream task performance, while exploring the potential of purely vision-based GUI agents.

5.1 GUI Grounding on ScreenSpot

As the foundation of visual GUI agents, GUI grounding has not received adequate attention in current LVLM evaluations (Liu et al., 2023b; Yu et al., 2023). Therefore, we evaluate LVLMs on our GUI-specific benchmark ScreenSpot.

Compared LVLMs & Evaluation. We primarily evaluated two types of LVLMs: (1) generalist LVLMs capable of tasks such as dialogue, recognition, and grounding, including MiniGPT-v2 (Chen et al., 2023a), Qwen-VL (Bai et al., 2023), and GPT-4V; (2) recently released LVLMs specifically designed for GUI tasks, including Fuyu (Bavishi et al., 2023) and CogAgent (Hong et al., 2023). Considering that GUI agents are required to click on the correct position, we calculate click accuracy as the metric, defined as the proportion of test samples where the model's predicted location falls in the ground truth element's bounding box (Li et al., 2022; Zhang et al., 2023). More details about evaluation on ScreenSpot are in Appendix B.
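For clarity, here is a minimal sketch of the click-accuracy metric described above, under the assumption that both predicted points and ground-truth boxes are normalized to [0,1] in (x, y) and (left, top, right, bottom) form:

```python
# Minimal sketch: a prediction counts as correct when the predicted point
# falls inside the ground-truth element's bounding box.
def click_accuracy(predictions, gt_boxes):
    hits = 0
    for (x, y), (left, top, right, bottom) in zip(predictions, gt_boxes):
        if left <= x <= right and top <= y <= bottom:
            hits += 1
    return hits / len(predictions)

# Example: one hit out of two predictions -> 0.5
print(click_accuracy([(0.46, 0.62), (0.10, 0.10)],
                     [(0.40, 0.55, 0.52, 0.68), (0.60, 0.20, 0.80, 0.30)]))
```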
Results. As shown in Table 1, while generalist LVLMs have excelled in natural image grounding, their GUI grounding performance on ScreenSpot is poor due to the significant differences between GUIs and natural images. Even GPT-4V struggles with accurately locating screen elements.

In comparison, GUI-specific LVLMs show significant improvements. SeeClick achieved the best average performance across GUI platforms and both types of elements, even with fewer parameters than CogAgent. This demonstrates the efficiency of our GUI grounding pre-training; with the rich UI elements and diverse instructions collected from the web and mobile, SeeClick quickly learns to understand human instructions for element localization, even in completely unseen scenarios like iOS and desktop. SeeClick exhibits slightly inferior performance in locating text within desktop and web compared to CogAgent, possibly due to its lower resolution and much smaller training data. Notably, all models struggle with locating icons/widgets, highlighting the difficulty of identifying and grounding non-text elements on GUIs, which is the unique challenge posed by ScreenSpot.

5.2 Visual GUI Agent Tasks

This section explores SeeClick's application to mobile and computer agent tasks: MiniWob, AITW, and Mind2Web. We trained SeeClick on the respective training splits and tested it on the test sets. Across these tasks, with the provided instruction and a memory of previous actions, SeeClick determines the next action solely by observing interface screenshots. The detailed task settings, action formats, and interaction examples are in Appendix C.

5.2.1 MiniWob

MiniWob (Shi et al., 2017) comprises about 100 types of web automation tasks, where the agent is asked to interact with a simplified web environment to accomplish human instructions. Existing open-source training data often lacks corresponding interface screenshots (Furuta et al., 2023). Therefore, we roll out 50 successful episodes for each task using the LLM strategy of Zheng et al. (2023), resulting in a 2.8K-episode dataset to train SeeClick.

Compared Methods & Evaluation. We compared SeeClick with a range of offline training methods. Among these, the state-of-the-art method WebGUM (Furuta et al., 2023) uses screenshots as auxiliary input but still selects HTML elements as actions. Pix2Act (Shaw et al., 2023) is the only prior vision-based approach, trained with extensive demonstration data to perform actions. To verify the effectiveness of GUI grounding pre-training, we also report the results of the LVLM baseline Qwen-VL when trained with the same 2.8K dataset.

Methods       Modality     Dataset   Score
Compared with text-based models over 45 tasks
CC-Net (SL)   DOM+Image    2.4M      35.6%
WebN-T5       HTML         12K       55.2%
MM-WebN-T5    HTML+Image   347K      63.4%
WebGUM        HTML+Image   2.8K      65.5%
WebGUM        HTML+Image   347K      86.1%
SeeClick      Image        2.8K      73.6%
Compared with vision-based models over 35 tasks
CC-Net (SL)   Image        2.4M      23.4%
Pix2Act       Image        1.3M      64.6%
Qwen-VL       Image        2.8K      48.4%
SeeClick      Image        2.8K      67.0%

Table 2: Average scores of different methods on MiniWob. The best results in each setting are bold. Methods achieving the highest performance with limited data are underlined. SeeClick outperforms a range of offline training methods as a purely vision-based model.

Due to the variance in evaluation task sets among different methods (Liu et al., 2018; Furuta et al., 2023; Shaw et al., 2023), for fairness, we report performance in two groups based on the overlapping MiniWob tasks. We compute the success rate over 50 random seeds for each task and then compute the mean over all tasks as the final score. We provide task-wise scores in Appendix C.2.

Results. As depicted in Table 2, the purely vision-based SeeClick surpassed strong baselines with substantially less training data. Notably, with an equivalent amount of 2.8K training data, it outperformed the offline sota WebGUM, which uses both HTML and screenshots as input. Moreover, thanks to the LVLM's powerful reasoning and planning abilities and our GUI grounding pre-training, SeeClick exceeded the sota visual method Pix2Act using less than 0.3% of its training data.

Furthermore, SeeClick significantly surpassed the LVLM baseline Qwen-VL by nearly 20 percentage points, underscoring the importance of GUI grounding in boosting LVLM performance. To analyze in detail, we provide task-level comparisons in Figure 5. SeeClick notably excelled in tasks with dynamic interface layouts and element positions, confirming our hypothesis that general LVLMs struggle with accurate clicking, and SeeClick markedly improves this aspect.
Figure 5: Comparison of SeeClick and Qwen-VL on MiniWob. Tasks marked with yellow shadows feature dynamic webpage layouts, simulating real-world GUI agent applications (details in appendix Figure 11). SeeClick outperformed Qwen-VL in most tasks, highlighting the effectiveness of GUI grounding pre-training.

Methods Modality General Install GoogleApps Single WebShopping Overall ClickAcc


ChatGPT-CoT Text 5.9 4.4 10.5 9.4 8.4 7.7 -
PaLM2-CoT Text - - - - - 39.6 -
GPT-4V Image 41.7 42.6 49.8 72.8 45.7 50.5 -
Qwen-VL Image 49.5 59.9 46.9 64.7 50.7 54.3 57.4
SeeClick Image 54.0 66.4 54.9 63.5 57.6 59.3 66.4

Table 3: Average scores of different methods on AITW. ClickAcc calculates the accuracy of click operation. The
best results in each column are bold. SeeClick exhibits the best performance among competing baselines.

5.2.2 AITW

We evaluate SeeClick in smartphone environments with the Android automation dataset Android In The Wild (AITW) (Rawles et al., 2023), which encompasses 30k instructions and corresponding 715k operation trajectories. Previous approaches split train/val/test episode-wise, which poses a clear risk of overfitting because: (1) instructions in the test set have appeared in training, and (2) there are an average of 20 similar trajectories per instruction. In this work, we opt for an instruction-wise split, with 545/688/306/700/700 instructions from General/Install/GoogleApps/Single/WebShopping respectively, and retain one trajectory per instruction. We selected 80% for training and the remaining for testing in each subset. This split avoids overfitting and reflects the performance of agents on unseen instructions. Further details and results on the original split are in Appendix C.3.

Compared Methods & Evaluation. We compare SeeClick with two types of baselines: (1) API-based LLMs such as ChatGPT-CoT (Zhan and Zhang, 2023), PaLM2-CoT (Rawles et al., 2023) and the latest GPT-4V (Yan et al., 2023); (2) our trained LVLM baseline Qwen-VL. We follow Rawles et al. (2023) to adopt the screen-wise action matching score as the main metric and additionally compute the click accuracy (ClickAcc), which calculates the accuracy when both the reference and the prediction are click operations.

Results. As illustrated in Table 3, SeeClick achieved the best average performance among both API-based LLMs and trained LVLMs. Specifically, SeeClick exhibited a 9% increase in click accuracy over Qwen-VL, supporting the idea that GUI grounding enhances agent task performance through precise clicking.

5.2.3 Mind2Web

To assess SeeClick's capabilities in web navigation, we utilize the recently introduced Mind2Web dataset (Deng et al., 2023), which comprises over 2000 open-ended tasks collected from 137 real websites, each with a high-level instruction and a corresponding human action trajectory. Mind2Web was originally designed for text-based agents, which select actionable elements from simplified HTML in each step. This work explores visual web agents that predict click positions directly from screenshots. For this purpose, we parsed screenshots and target element bounding boxes from the raw dump of Mind2Web. To the best of our knowledge, this is the first attempt at web agents relying solely on screenshots as inputs for navigating real websites.
                          Cross-Task               Cross-Website            Cross-Domain
Methods        w/o HTML   Ele.Acc  Op.F1  Step SR  Ele.Acc  Op.F1  Step SR  Ele.Acc  Op.F1  Step SR
MindAct (gen)  ✗          20.2     52.0   17.5     13.9     44.7   11.0     14.2     44.7   11.9
MindAct        ✗          55.1     75.7   52.0     42.0     65.2   38.9     42.1     66.5   39.6
GPT-3.5-Turbo  ✗          20.3     56.6   17.4     19.3     48.8   16.2     21.6     52.8   18.6
GPT-4          ✗          41.6     60.6   36.2     35.8     51.1   30.1     37.1     46.5   26.4
Qwen-VL        ✓          15.9     86.7   13.3     13.2     83.5   9.2      14.1     84.3   12.0
SeeClick       ✓          28.3     87.0   25.5     21.4     80.6   16.4     23.2     84.8   20.8

Table 4: Comparison of methods on Mind2Web. The best results in each column are bold. Improvements of SeeClick over the LVLM baseline are underlined, with GUI grounding pre-training nearly doubling the step success rate.

Compared Methods & Evaluation. We compare with the HTML-based web agent MindAct (Deng et al., 2023) and our visual baseline Qwen-VL. MindAct employs a two-stage method, where a small LM first generates candidate elements from raw HTML, then a large LM selects the target via multi-choice QA; MindAct (gen) directly generates the target element instead. GPT-3.5 and GPT-4 adopt the same multiple-choice QA formulation and include three demonstrations for in-context learning.

We calculate element accuracy (Ele.Acc), operation F1 (Op.F1), and step success rate (Step SR). For vision-based methods, a prediction is considered correct if the predicted coordinate falls in the target element's bounding box. All other settings follow Deng et al. (2023).

Results. As displayed in Table 4, SeeClick nearly doubled the Ele.Acc and Step SR compared to Qwen-VL. This indicates that SeeClick's improvement in GUI grounding correlates with enhanced performance in web agent tasks. HTML-based methods yield lower Op.F1 as around 20% of ground-truth elements are filtered out during candidate generation. Although SeeClick can operate without extra HTML information, its performance trails sota HTML-based methods, since predicting click coordinates is much more difficult than choosing from HTML candidates. This highlights the difficulty of grounding in intricate interfaces, suggesting substantial room for improvement in visual agents for real-world applications.

5.2.4 Grounding and Agent Performance

To investigate the correlation between grounding and agent performance, we analyze the average score improvements of several SeeClick checkpoints on ScreenSpot and the three downstream tasks. As depicted in Figure 6, enhanced GUI grounding capacity consistently boosts agent task performance, highlighting its crucial role in developing advanced visual GUI agents.

Figure 6: The correlation between agent task performance improvement and enhanced grounding ability.

                     MiniWob  AITW  Mind2Web
Qwen-VL (separate)   48.4     54.3  11.5
SeeClick (separate)  67.0     59.3  20.9
SeeClick (unified)   64.1     57.1  19.5

Table 5: Separate vs. unified training performance.

5.2.5 SeeClick as Unified GUI Agent

To assess the potential of vision-based solutions in unifying GUI agent tasks, we evaluated jointly training SeeClick on the three downstream tasks. As shown in Table 5, the unified model exhibited a slight performance decline, possibly due to the significantly distinct interfaces of different GUIs.

6 Conclusion

In this paper, we introduce a visual GUI agent, SeeClick, which only relies on screenshots for GUI task automation. We found a key challenge in developing such visual GUI agents: GUI grounding, the capacity to accurately locate screen elements based on human instructions. To address this challenge, we propose to enhance SeeClick via GUI grounding pre-training, and devise methods to automate the curation of GUI grounding data from the web and mobile. For benchmarking progress in GUI grounding, we created ScreenSpot, the first realistic evaluation dataset encompassing mobile, desktop, and web platforms. Results on ScreenSpot demonstrate a significant improvement of SeeClick over LVLM baselines. Moreover, comprehensive evaluations across three GUI automation tasks consistently support our finding that advancements in GUI grounding directly correlate with improved performance in downstream agent tasks.

Limitations
SeeClick currently simplifies the GUI action space
to mainly focus on clicking and typing, excluding
complex actions like dragging and double-clicking.
Additionally, limited by the performance of open-
source LVLMs, training on agent-specific data is
necessary for SeeClick to execute multi-step tasks
on interfaces like mobile and computer.

Ethical considerations
GUI agents are developed to automate tasks and
enhance efficiency on digital devices. These tech-
nologies are especially significant for individuals
with visual impairments. Here are some ethical
considerations:
Privacy Issues. The operation of GUI agents in-
volves accessing and interacting with user inter-
faces that may contain personal or sensitive infor-
mation. Ensuring data protection and user consent
are paramount to maintaining privacy integrity.
Safety in Real-World Interactions. When GUI
agents interact with the real world, there’s a risk of
unintended harmful actions. Ensuring these agents
operate within safe parameters is crucial to prevent
negative outcomes.
Bias. The development of GUI agents must address
potential biases in their algorithms, which could
result in unequal performance across different user
groups or interface designs. Mitigating bias is es-
sential for equitable access and effectiveness.
Addressing these concerns requires ongoing re-
search and development efforts, ensuring that the
benefits of GUI agents are realized without com-
promising ethical standards.
References

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023. Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966.

Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sağnak Taşırlar. 2023. Introducing our multimodal models.

Andrea Burns, Deniz Arsan, Sanjna Agrawal, Ranjitha Kumar, Kate Saenko, and Bryan A Plummer. 2022. A dataset for interactive vision-language navigation with unknown command feasibility. In European Conference on Computer Vision, pages 312–328. Springer.

Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. 2023a. MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478.

Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. 2023b. Shikra: Unleashing multimodal LLM's referential dialogue magic. arXiv preprint arXiv:2306.15195.

Ting Chen, Saurabh Saxena, Lala Li, David J Fleet, and Geoffrey Hinton. 2021. Pix2seq: A language modeling framework for object detection. In International Conference on Learning Representations.

Biplab Deka, Zifeng Huang, Chad Franzen, Joshua Hibschman, Daniel Afergan, Yang Li, Jeffrey Nichols, and Ranjitha Kumar. 2017. Rico: A mobile app dataset for building data-driven design applications. In Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology, pages 845–854.

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. 2023. Mind2Web: Towards a generalist agent for the web. arXiv preprint arXiv:2306.06070.

Hiroki Furuta, Ofir Nachum, Kuang-Huei Lee, Yutaka Matsuo, Shixiang Shane Gu, and Izzeddin Gur. 2023. Multimodal web navigation with instruction-finetuned foundation models. arXiv preprint arXiv:2305.11854.

Difei Gao, Lei Ji, Zechen Bai, Mingyu Ouyang, Peiran Li, Dongxing Mao, Qinchen Wu, Weichen Zhang, Peiyi Wang, Xiangwu Guo, et al. 2023. AssistGUI: Task-oriented desktop graphical user interface automation. arXiv preprint arXiv:2312.13108.

Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. 2023. A real-world webagent with planning, long context understanding, and program synthesis. arXiv preprint arXiv:2307.12856.

Izzeddin Gur, Ulrich Rueckert, Aleksandra Faust, and Dilek Hakkani-Tur. 2018. Learning to navigate the web. In International Conference on Learning Representations.

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. 2023. CogAgent: A visual language model for GUI agents. arXiv preprint arXiv:2312.08914.

Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. 2021. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations.

Geunwoo Kim, Pierre Baldi, and Stephen McAleer. 2023. Language models can solve computer tasks. arXiv preprint arXiv:2303.17491.

Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. 2023. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726.

Gang Li and Yang Li. 2022. Spotlight: Mobile UI understanding using vision-language models with a focus. In The Eleventh International Conference on Learning Representations.

Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. 2022. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965–10975.

Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, and Jason Baldridge. 2020a. Mapping natural language instructions to mobile UI action sequences. arXiv preprint arXiv:2005.03776.

Yang Li, Gang Li, Luheng He, Jingjie Zheng, Hong Li, and Zhiwei Guan. 2020b. Widget captioning: Generating natural language description for mobile user interface elements. arXiv preprint arXiv:2010.04295.

Yang Li, Gang Li, Xin Zhou, Mostafa Dehghani, and Alexey Gritsenko. 2021. VUT: Versatile UI transformer for multi-modal multi-task user interface modeling. arXiv preprint arXiv:2112.05692.

Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, and Percy Liang. 2018. Reinforcement learning on web interfaces using workflow-guided exploration. In International Conference on Learning Representations.

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023a. Visual instruction tuning. In Neural Information Processing Systems.

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. 2023b. MMBench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281.

OpenAI. 2023. GPT-4 technical report.

Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. 2023. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824.

Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. 2023. Android in the wild: A large-scale dataset for android device control. arXiv preprint arXiv:2307.10088.

Peter Shaw, Mandar Joshi, James Cohan, Jonathan Berant, Panupong Pasupat, Hexiang Hu, Urvashi Khandelwal, Kenton Lee, and Kristina Toutanova. 2023. From pixels to UI actions: Learning to follow instructions via graphical user interfaces. In Advances in Neural Information Processing Systems.

Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, and Percy Liang. 2017. World of bits: An open-domain platform for web-based agents. In International Conference on Machine Learning, pages 3135–3144. PMLR.

Qiushi Sun, Zhangyue Yin, Xiang Li, Zhiyong Wu, Xipeng Qiu, and Lingpeng Kong. 2023. Corex: Pushing the boundaries of complex reasoning through multi-model collaboration. arXiv preprint arXiv:2310.00280.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

Bryan Wang, Gang Li, Xin Zhou, Zhourong Chen, Tovi Grossman, and Yang Li. 2021. Screen2Words: Automatic mobile UI summarization with multimodal learning. In The 34th Annual ACM Symposium on User Interface Software and Technology, pages 498–510.

Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. 2023. VisionLLM: Large language model is also an open-ended decoder for vision-centric tasks. arXiv preprint arXiv:2305.11175.

Zhiyong Wu, Chengcheng Han, Zichen Ding, Zhenmin Weng, Zhoumianze Liu, Shunyu Yao, Tao Yu, and Lingpeng Kong. 2024. OS-Copilot: Towards generalist computer agents with self-improvement.

Fangzhi Xu, Zhiyong Wu, Qiushi Sun, Siyu Ren, Fei Yuan, Shuai Yuan, Qika Lin, Yu Qiao, and Jun Liu. 2023. Symbol-LLM: Towards foundational symbol-centric interface for large language models. arXiv preprint arXiv:2311.09278.

An Yan, Zhengyuan Yang, Wanrong Zhu, Kevin Lin, Linjie Li, Jianfeng Wang, Jianwei Yang, Yiwu Zhong, Julian McAuley, Jianfeng Gao, et al. 2023. GPT-4V in wonderland: Large multimodal models for zero-shot smartphone GUI navigation. arXiv preprint arXiv:2311.07562.

Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. 2023a. AppAgent: Multimodal agents as smartphone users. arXiv preprint arXiv:2312.13771.

Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. 2023b. The dawn of LMMs: Preliminary explorations with GPT-4V(ision). arXiv preprint arXiv:2309.17421, 9(1):1.

Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. 2023. mPLUG-Owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178.

Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. 2023. MM-Vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490.

Zhuosheng Zhan and Aston Zhang. 2023. You only look at screens: Multimodal chain-of-action agents. arXiv preprint arXiv:2309.11436.

Zhizheng Zhang, Wenxuan Xie, Xiaoyi Zhang, and Yan Lu. 2023. Reinforced UI instruction grounding: Towards a generic UI task automation API. arXiv preprint arXiv:2310.04716.

Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. 2024. GPT-4V(ision) is a generalist web agent, if grounded. arXiv preprint arXiv:2401.01614.

Longtao Zheng, Rundong Wang, Xinrun Wang, and Bo An. 2023. Synapse: Trajectory-as-exemplar prompting with memory for computer control. In NeurIPS 2023 Foundation Models for Decision Making Workshop.

Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, et al. 2023. WebArena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854.

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. 2023. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592.
A Details of SeeClick Pre-training

A.1 Pre-training Tasks

SeeClick employs the pre-training tasks outlined in Table 6. For the grounding task, we incorporate two forms: predicting center point coordinates (text_2_point) and predicting bounding boxes (text_2_bbox). For the task of generating text for elements (similar to OCR), we also include two categories: predicting text based on center point coordinates (point_2_text, widget captioning) and based on bounding boxes (bbox_2_text). Our preliminary experiments indicated that predicting points was slightly better than bounding boxes, likely due to the variable sizes of UI elements. Consequently, we increased the proportion of data with point localization. Finally, about 1 million samples are used for the continual pre-training of SeeClick.

For tasks involving coordinates, positions are represented as either the point (x,y) or the bounding box (left, top, right, down), where each value is a two-decimal place number in the range [0,1] indicating the ratio of the corresponding position to the width or height of the image. Figure 7 provides some examples of the pre-training data.

Domain   Task                Sample Num
Web      text_2_point        271K
Web      text_2_bbox         54K
Web      point_2_text        54K
Web      bbox_2_text         54K
Mobile   text_2_point        274K
Mobile   text_2_bbox         56K
Mobile   UI summarization    48K
Mobile   widget captioning   42K
General  LLaVA               145K
Total                        1M

Table 6: All training data used by SeeClick.

Figure 7: Examples of SeeClick pre-training tasks (web text_2_point, web bbox_2_text, mobile text_2_point, and mobile UI summarization), each formatted as a User/Assistant dialogue over a screenshot.

A.2 Training Configurations

We employed the aforementioned data for continual pre-training of Qwen-VL-Chat to develop SeeClick. To enhance the LVLM's understanding of GUI images, we unlocked the gradients of its visual encoder and applied LoRA for fine-tuning. We adopt AdamW as the optimizer and use a cosine annealing scheduler with an initial learning rate of 3e-5 and a global batch size of 64. All training takes around 24 hours on 8 NVIDIA A100 GPUs.
Instruction: Instruction:
close enable
notifications

Instruction: Instruction: Instruction: Instruction:


open settings add a new slide display choose the
calendar in red pen
week view

Figure 8: SeeClick on ScreenSpot. Blue dashed boxes represent the ground truth bounding boxes, while green and
red pointers indicate correct and incorrect predictions.

corresponding English text commands for the an-


notated screen elements. All annotated interfaces
and operational elements were in English and post-
processed to remove personal information.

B.2 Sample Showcase


Figure 10 provides more examples of ScreenSpot,
which contains a variety of common GUI scenarios
for mobile, desktop, and web platforms. Figure 9: Distance distribution of prediction point to
ground truth. Most incorrect predictions are also close to
B.3 Evaluation Detail
the answer, suggesting the model recognizes the target
For comparing baselines, we tested the models’ but needs improvement in fine-grained localization.
grounding capabilities using their officially recom-
mended approach. For instance, with CogAgent, incorrect predictions mostly occur near the target
we randomly selected prompts from the official set bounding box, suggesting the model recognizes
provided, such as "What steps do I need to take the target but needs improvement in fine-grained
to <instruction>? (with grounding)", then the out- localization.
put coordinates (or the centers of bounding boxes)
C Downstream Agent Tasks
were taken as predicted points. For GPT-4V, we
follow Yang et al. (2023b) to enable it to locate In this section, we first detail the formulation of
screen elements based on instructions. SeeClick’s SeeClick as a visual GUI agent, then separately
predictions with points were marginally better than introduce the settings for three downstream tasks,
bounding boxes, thus we selected point prediction and finally show SeeClick’s interaction cases with
for final evaluation. the GUI across these tasks.

B.4 SeeClick Case Study & Error Analysis C.1 Formulation of SeeClick as Visual GUI
Agent
Figure 8 presents some examples of SeeClick on
ScreenSpot. SeeClick can comprehend human in- Action Space SeeClick involves common human-
structions and accurately locate screen elements. UI interaction operations. Following AITW, we
To conduct a detailed analysis of localization per- assigned an action_type id to each action
formance, we quantified the distances between pre- type for model prediction.
dicted points and ground truth (the center of target • click(x,y): 4. A click action at (x,y),
elements) in Figure 9. It’s noteworthy that even where each value is a [0,1] number indicating
the ratio of the corresponding position to the Gen. Inst. GApps. Sing. WShop. Ovr.
width or height of the image. Auto-UI 68.2 76.9 71.4 84.6 70.3 74.3
CogAgent 65.4 78.9 75.0 93.5 71.1 76.9
• type("typed_text"): 3. An action of SeeClick 67.6 79.6 75.9 84.6 73.1 76.2
typing a piece of text.
• select("value"): 2. An action for se- Table 7: Comparison on the origin split of AITW.
lecting an option from a dropdown menu on a
webpage. random variants and corresponding instructions
• swipe(direction): Swipe actions for controlled by a random seed, creating up to bil-
the screen, swipe up/down/left/right are as- lions of possible task instances. We use 50 success-
signed the ids 1, 0, 8, and 9 respectively. ful trajectories for each task provided in (Zheng
• PRESS BACK: 5. The action for returning to et al., 2023) for training and test each task with 50
the previous step. random seeds, following standard practices.
We report the average success rate across ran-
• PRESS HOME: 6. The action for returning to
dom seeds and tasks, automatically provided by
the homepage.
the MiniWob environment. A task is considered
• PRESS ENTER: 7. The action of pressing successfully completed if executed correctly, while
the ENTER key to submit input content. incorrect executions or exceeding the maximum
The first two actions, clicking and typing, are uni- number of actions (set as 30 here) are counted as
versally applicable across various GUIs. The third failures. For the baselines in Table 2, we use the
action, select, is defined according to the specifica- task-wise scores provided in their papers to calcu-
tions in Mind2Web. The latter four actions, along late the average score for tasks overlapping with
with two additional states, TASK COMPLETE and SeeClick. We also provided a task-wise comparison
TASK IMPOSSIBLE, are defined following the in Table 8.
AITW framework for Android environments.
C.3 AITW
Agent Formulation SeeClick is an autonomous AITW is a recently collected dataset for Android
agent capable of executing human instructions on smartphone automation, where each sample com-
GUIs. It takes as input the instruction, a screen- prises an instruction and an action trajectory with
shot of the current interface and a series of (k=4 screenshots. AITW is divided into five subsets:
in our setting) previous actions, to predict the next General, Install, GoogleApps, Single, and Web-
action to be taken. Specifically, SeeClick uses the Shopping, totally including over 30K instructions
following prompt to execute each step of the agent: and 700K episodes.
Despite AITW’s large scale, as stated in Sec-
<img>Image</img>
User: Please generate the next move according to the tion 5.2.2, the current train-test split poses a sig-
UI screenshot, instruction and previous actions. nificant risk of overfitting, leading to experimental
Instruction: results that do not accurately reflect an agent’s gen-
<instruction> eralization ability in the real world. We also con-
Previous actions:
ducted experiments on SeeClick using the origin
Step1: <step1>
Step2: <step2> split, as shown in Table 7, SeeClick is comparable
Step3: <step3> to CogAgent’s performance. We believe that due to
Step4: <step4> the severe overfitting, designing new agent frame-
SeeClick: <next action> works or enlarging model size is unlikely to yield
much improvements on this split.
During training and testing, we organize the data To address the aforementioned issue, we
by step into the format described above. propose to divide the train/val/test in an
C.2 MiniWob

MiniWob is a classic simplified web agent environment, built on Chrome, allowing low-level operations such as clicking and typing. It comprises around 100 tasks, where each task can templatize random variants and corresponding instructions controlled by a random seed, creating up to billions of possible task instances. We use the 50 successful trajectories per task provided in (Zheng et al., 2023) for training and test each task with 50 random seeds, following standard practice.

We report the average success rate across random seeds and tasks, automatically provided by the MiniWob environment. A task is considered successfully completed if executed correctly, while incorrect executions or exceeding the maximum number of actions (set to 30 here) are counted as failures. For the baselines in Table 2, we use the task-wise scores provided in their papers to calculate the average score over the tasks overlapping with SeeClick. We also provide a task-wise comparison in Table 8.
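Concretely, the reported number is a plain mean over seeds and then over tasks; the sketch below only illustrates this aggregation and assumes per-seed binary outcomes.

def average_success_rate(results: dict) -> float:
    # results maps task name -> list of 0/1 outcomes, one per random seed.
    per_task = [sum(outcomes) / len(outcomes) for outcomes in results.values()]
    return sum(per_task) / len(per_task)

# e.g. 48/50 seeds succeed on click-button and 50/50 on enter-date -> 0.98
example = {"click-button": [1] * 48 + [0] * 2, "enter-date": [1] * 50}
assert abs(average_success_rate(example) - 0.98) < 1e-9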
C.3 AITW

AITW is a recently collected dataset for Android smartphone automation, where each sample comprises an instruction and an action trajectory with screenshots. AITW is divided into five subsets: General, Install, GoogleApps, Single, and WebShopping, totaling over 30K instructions and 700K episodes.

Despite AITW's large scale, as stated in Section 5.2.2, the current train-test split poses a significant risk of overfitting, leading to experimental results that do not accurately reflect an agent's generalization ability in the real world. We also conducted experiments with SeeClick on the origin split; as shown in Table 7, SeeClick is comparable to CogAgent's performance. We believe that, due to the severe overfitting, designing new agent frameworks or enlarging model size is unlikely to yield much improvement on this split.

          General  Install  GoogleApps  Single  WebShopping  Overall
SeeClick  67.6     79.6     75.9        84.6    73.1         76.2

Table 7: Comparison on the origin split of AITW.

To address the aforementioned issue, we propose to divide the train/val/test sets in an instruction-wise manner. Specifically, we selected 545/688/306/700/700 instructions from the General/Install/GoogleApps/Single/WebShopping subsets and retained only one annotated episode for each instruction. To avoid imbalance in joint training, we randomly chose 700 instructions from Single and WebShopping. Given the similarity among instructions within Single and WebShopping, these 700 instructions are representative of performance on these two subsets. Next, we allocate 80% for training and the remaining 20% for testing, and select an additional 5*100 episodes from the origin data to form the validation set. The data used for training, validation, and testing will be open-sourced to serve as an effective evaluation.
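The following sketch illustrates this instruction-wise split; the subset sizes come from the text above, while the data layout and function name are assumptions made for illustration.

import random

INSTRUCTIONS_PER_SUBSET = {"General": 545, "Install": 688, "GoogleApps": 306,
                           "Single": 700, "WebShopping": 700}

def instruction_wise_split(episodes_by_instruction: dict, subset: str, seed: int = 0):
    # episodes_by_instruction maps an instruction to its annotated episodes.
    rng = random.Random(seed)
    instructions = sorted(episodes_by_instruction)
    rng.shuffle(instructions)
    chosen = instructions[:INSTRUCTIONS_PER_SUBSET[subset]]
    # Retain only one annotated episode per instruction.
    data = [(ins, episodes_by_instruction[ins][0]) for ins in chosen]
    n_train = int(0.8 * len(data))
    # The additional 5*100 validation episodes are drawn separately from the origin data.
    return data[:n_train], data[n_train:]  # train split, test split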
The other settings are consistent with previous work, calculating a screen-wise matching score that considers both the correctness of the action type and its value (e.g., the click point or typed text). The screen-wise matching score correlates with the task completion score judged by humans (Rawles et al., 2023).
C.4 Mind2Web

Mind2Web is a recently proposed dataset for developing generalist web agents for real-world websites, originally designed for text-based agents. Therefore, the original observation at each step only includes the HTML code of the current webpage. To train and evaluate vision-based agents, we extracted web screenshots and the bounding boxes of target operational elements for each step from Mind2Web's raw dump. One issue with Mind2Web's original HTML observation is that it captures the entire page, including the portion reachable by scrolling, so its screenshots are long captures (e.g., 1920*12000). Predicting operational positions from such high-resolution long screenshots is impractical for current LVLMs and does not align with human operations. To address this, for target elements not at the top of the page, we randomly crop around their location, maintaining a consistent screenshot resolution of 1920*1080 for all observed interfaces.
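A minimal sketch of such a crop, assuming integer pixel bounding boxes and that the target element fits inside a single 1920*1080 window; the helper name and signature are ours.

import random

def random_crop_around(bbox, page_w, page_h, crop_w=1920, crop_h=1080, seed=None):
    # bbox = (left, top, right, bottom) of the target element, in page pixels.
    rng = random.Random(seed)
    left, top, right, bottom = bbox
    # Valid top-left corners keep the element inside the crop and the crop inside the page.
    min_x, max_x = max(0, right - crop_w), min(left, page_w - crop_w)
    min_y, max_y = max(0, bottom - crop_h), min(top, page_h - crop_h)
    x0 = rng.randint(min_x, max(min_x, max_x))
    y0 = rng.randint(min_y, max(min_y, max_y))
    return (x0, y0, x0 + crop_w, y0 + crop_h)  # box to pass to PIL's Image.crop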
Mind2Web first calculates Element Accuracy (Ele.Acc), which compares the predicted element with the ground truth, and Operation F1 (Op.F1), which calculates the token-level F1 score for the predicted operation. Operation F1 is equivalent to the accuracy of click operations but also takes into account the correctness of input values for type and select operations. For our vision-based approach, Element Accuracy is computed as the accuracy of predicted click points falling within the ground-truth element's bounding box. Then, a step is considered successful (Step SR) if both the predicted element and the operation are correct.
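The sketch below illustrates these step-level checks for a vision-based prediction; the helpers are ours, and the token-level F1 is a standard implementation rather than the official Mind2Web evaluation script.

from collections import Counter

def element_correct(click_point, bbox) -> bool:
    # Ele.Acc: the predicted click point must fall inside the ground-truth
    # element's bounding box (left, top, right, bottom).
    x, y = click_point
    left, top, right, bottom = bbox
    return left <= x <= right and top <= y <= bottom

def operation_f1(pred_op: str, gold_op: str) -> float:
    # Op.F1: token-level F1 between predicted and gold operation strings.
    pred, gold = pred_op.split(), gold_op.split()
    common = Counter(pred) & Counter(gold)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def step_success(click_point, bbox, pred_op: str, gold_op: str) -> bool:
    # Step SR: both the element and the operation must be correct.
    return element_correct(click_point, bbox) and pred_op == gold_op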
C.5 Case Study

MiniWob Figure 11(a) illustrates the difference between static and dynamic layout tasks. Static layout tasks have fixed element positions during training and testing, while dynamic layout tasks display varying interfaces and element positions across instructions, further challenging the agent's ability to accurately locate the target. Figure 11(b) provides examples of SeeClick's interactions with MiniWob; SeeClick relies solely on the interface screenshot for arithmetic, reasoning, etc.

AITW Figure 12 shows SeeClick's operations on AITW. Predictions marked in red below the screenshots were scored as incorrect in AITW. Some errors occur because the current step's answer is not unique; for example, in step 5, the model's predicted input "DuckDuckGo Privacy Browser" is also a potentially correct action.

Mind2Web Figure 13 shows several examples of SeeClick on the real-world website benchmark Mind2Web. SeeClick can comprehend instructions and click on the correct elements within complex interfaces.
Task CC-Net (SL) WebN-T5 WebGUM Pix2Act Qwen-VL SeeClick
Choose-date 0.12 0.00 0.13 0.06 0.0 0.02
Click-button 0.78 1.0 1.0 0.32 0.42 0.96
Click-button-sequence 0.47 1.0 1.0 1.0 0.08 0.86
Click-checkboxes 0.32 0.96 1.0 0.99 0.44 0.78
Click-checkboxes-large 0.0 0.22 0.99 1.0 0.0 0.02
Click-checkboxes-soft 0.04 0.54 0.98 0.91 0.06 0.22
Click-checkboxes-transfer 0.36 0.63 0.99 0.76 0.60 0.70
Click-collapsible-2 0.17 0.00 0.95 0.31 0.0 0.48
Click-collapsible 0.81 0.00 0.98 0.80 1.0 1.0
Click-color 0.82 0.27 0.34 0.88 0.96 1.0
Click-dialog 0.95 1.0 1.0 0.12 0.96 1.0
Click-dialog-2 0.88 0.24 0.43 0.73 0.84 1.0
Click-link 0.59 1.0 1.0 0.86 0.0 0.90
Click-option 0.21 0.37 1.0 0.0 0.70 1.0
Click-pie 0.15 0.51 0.99 0.81 0.16 0.80
Click-shades 0.04 0.0 0.0 0.76 0.0 0.02
Click-shape 0.11 0.53 0.72 0.19 0.04 0.52
Click-tab 0.95 0.74 1.0 0.54 1.0 1.0
Click-tab-2 0.27 0.18 0.95 0.52 0.0 0.60
Click-tab-2-hard 0.19 0.12 0.95 0.0 0.16 0.42
Click-test 1.0 1.0 1.0 1.0 1.0 1.0
Click-test-2 0.95 1.0 1.0 1.0 0.72 0.94
Click-widget 0.56 1.0 1.0 0.87 0.38 0.58
Count-shape 0.21 0.41 0.68 0.0 0.20 0.28
Copy-paste 0.04 - - - 0.96 0.80
Copy-paste-2 0.01 - - - 0.96 0.80
Email-inbox 0.09 0.38 0.99 - 0.08 0.80
Email-inbox-forward-nl 0.0 0.6 1.0 - 0.24 0.74
Email-inbox-forward-nl-turk 0.0 0.33 1.0 - 0.16 0.56
Email-inbox-nl-turk 0.05 0.23 0.98 - 0.40 0.68
Enter-date 0.02 0.0 1.0 0.59 1.0 1.0
Enter-password 0.02 0.97 1.0 - 1.0 1.0
Enter-text 0.35 0.89 1.0 - 1.0 1.0
Enter-text-dynamic 0.39 0.98 1.0 - 0.96 1.0
Focus-text 0.99 1.0 1.0 - 1.0 1.0
Focus-text-2 0.96 1.0 1.0 - 0.84 0.96
Find-word 0.05 - - - 1.0 0.10
Grid-coordinate 0.66 0.49 1.0 0.97 0.96 0.52
Guess-number 0.21 0.0 0.11 - 1.0 1.0
Login-user 0.0 0.82 1.0 - 1.0 1.0
Login-user-popup 0.02 0.72 0.99 - 0.86 0.98
Multi-layouts 0.00 0.83 1.0 - 0.44 0.72
Multi-orderings 0.0 0.88 1.0 - 0.42 0.86
Identify-shape 0.68 - - 0.94 1.0 0.68
Navigate-tree 0.32 0.91 1.0 0.07 0.60 0.82
Search-engine 0.15 0.34 0.96 - 0.56 0.84
Simple-algebra 0.03 - - 0.99 0.48 0.38
Simple-arithmetic 0.38 - - 0.67 0.92 0.78
Text-transform 0.19 - - 0.91 0.36 0.46
Tic-tac-toe 0.32 0.48 0.56 0.76 0.30 0.58
Unicode-test 0.86 - - 0.64 0.54 0.98
Use-autocomplete 0.07 0.22 0.98 0.95 0.72 0.82
Use-slider 0.18 - - 0.69 0.38 0.32
Use-spinner 0.47 0.07 0.11 - 0.24 0.16
Read-table 0.01 - - - 0.90 0.72
Average 0.336 (55) 0.552 (45) 0.861 (45) 0.646 (35) 0.564 (55) 0.712 (55)

Table 8: Mean scores across 55 MiniWob tasks. Numbers in parentheses in the Average row indicate how many of the 55 tasks each method reports; "-" denotes an unreported task.


Figure 10: More examples of GUI grounding benchmark ScreenSpot. The examples span mobile (iOS and Android), desktop (Windows and macOS), and web (development, shop, forum, and tools pages) screenshots, with both text and icon/widget targets (e.g., "My account", "Scan QR code", "Enlarge font size", "Reply to the first post").


Figure 11: Example episodes of SeeClick on MiniWob. (a) Comparison between static layout (left, click-color) and dynamic layout (right, unicode-test). (b) Example episodes of SeeClick on MiniWob tasks (simple-arithmetic, click-pie, and choose-date). The model's prediction output is shown below each screenshot, with action_type 4 indicating a click and action_type 3 indicating typing.
Figure 12: Example episodes of SeeClick on AITW for the instruction "open app "DuckDuckGo Privacy Browser" (install if not already installed) and enter user name: "[email protected]" and password: "freighters"". The model's prediction output is shown below each screenshot, with action_type 4 indicating a click, action_type 3 indicating typing, and action_type 6 indicating PRESS HOME. Steps with a red prediction and green reference indicate a failed step (e.g., the predicted input "DuckDuckGo Privacy Browser" versus the reference "duckduckgo").
Figure 13: Example episodes of SeeClick on Mind2Web for instructions such as "Check my AMC gift card balance with gift card number 87654321 and pin number 9753.", "Find the list of all neighborhood maps for Brooklyn.", and "Download the e-receipt with the last name Smith and confirmation number X123456989." The model's prediction output is shown below each screenshot, with action_type 4 indicating a click and action_type 3 indicating typing. Steps with a red prediction and green reference bounding box indicate a failed step.
